Twitterdämmerung: “Twitter is already slowly breaking”

The following text was written by Hector Martin as a Twitter thread on his account @marcan42 and was originally published on 13 November 2022, 07:16–07:25 UTC.

With Mr Martin’s permission, I have reproduced it here as a single blog entry, for easier reading and more durable citation. Incidentally, he has both a GitHub Sponsors page and a Patreon.

Superscript numbers are footnotes. Subscript letters mark Martin’s Tweet boundaries; subscript a, for instance, points to the reference-list entry that would usually be cited as (Martin, 2022a).


Scoop from within Twitter: small things are breaking, not enough engineers to fix them.

Noticed that notification counts are not showing? The BE1 service powering it is down since Thursday. A bug was filed, but the team that would fix it is full on with verification work.

Gergely Orosz, 2022, citation pending

Twitter is already slowly breaking, and it’s only going to get worse.

He says that Twitter was built to be resilient, and that means the core features will continue working even with a reduced workforce.

As a former SRE,2 I disagree with that conclusion.a

All large services and platforms are “built to be resilient”. But they are also extremely complicated, with countless internal interactions between microservices, configuration systems, load balancing and directing subsystems, networking fabrics, and more.b

These systems are built to be reliable in the face of things like machine failures, or entire optional microservices going down. That’s not what will take Twitter down. Twitter will crash and burn when a complex interaction between systems goes wrong and causes a cascade failure.c

Cascade failures are caused when a small failure within a system has secondary effects on other systems, and the systems in charge of keeping everything up end up taking everything down instead.d
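As a concrete illustration (not from the thread, and with every number invented), here is a toy model of one common cascade shape, load redistribution: when a replica drops out, its traffic spills onto the survivors, and if that pushes them past capacity they drop out too.

    # Toy model of a load-redistribution cascade: when a replica fails, its
    # share of traffic spills onto the survivors, which can push them over
    # capacity and fail in turn. All numbers are illustrative.

    def surviving_replicas(replicas: int, capacity_each: float, total_load: float) -> int:
        alive = replicas
        while alive > 0:
            if total_load / alive <= capacity_each:
                return alive      # the survivors can absorb the traffic
            alive -= 1            # another replica tips over and drops out
        return 0                  # nothing left standing

    # 10 replicas, each comfortable at 120 qps, sharing 1000 qps of traffic:
    print(surviving_replicas(10, 120.0, 1000.0))  # 10 -> steady state
    print(surviving_replicas(9, 120.0, 1000.0))   # 9  -> one failure is absorbed
    print(surviving_replicas(7, 120.0, 1000.0))   # 0  -> three failures cascade to zero

The same shape shows up with retries, connection storms, and queue backlogs: the trigger is small, the blast radius is not.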

There are many ways this can happen. For example, you can have a “packet of death” — some data that causes a system to crash. If the data is being delivered repeatedly, or stored persistently, it can cause anything that tries to process it to crash or hang.e

This doesn’t have to be an exploit or externally malicious data. It can just be a circumstance that the design never accounted for. Or simply random corruption. Or any number of things.f
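The usual name for this shape of failure is a “poison message”: because the bad record is redelivered or stored persistently, every worker that restarts hits it again. Here is a minimal sketch (an invented queue and handler, not any real Twitter or Google pipeline) of the crash loop and the standard dead-letter mitigation.

    # Minimal sketch of a "poison message": a persisted record that crashes its
    # consumer will crash the replacement consumer too, because the record is
    # still there on restart. Queue contents and handler are invented.

    import json

    QUEUE = ['{"user": 1}', "\x00\x00not json\x00\x00", '{"user": 2}']  # durable backlog

    def handle(raw: str) -> None:
        record = json.loads(raw)              # raises on the corrupt entry
        print("processed user", record["user"])

    def run_worker(max_attempts: int = 3) -> None:
        attempts = 0
        while QUEUE:
            raw = QUEUE[0]                    # at-least-once delivery: peek, don't pop
            try:
                handle(raw)
            except Exception:
                attempts += 1
                if attempts < max_attempts:
                    continue                  # the crash loop: same record, same failure
                print("dead-lettering bad record after", attempts, "attempts")
            QUEUE.pop(0)                      # advance only once handled or dead-lettered
            attempts = 0

    run_worker()

Without the attempt counter and the dead-letter branch, run_worker() never gets past the second record, which is exactly the “anything that tries to process it crashes or hangs” case above.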

I once had one of those at Google. A pipeline from a downstream team kept crashing because it was trying to process bad data that had come in via a system my team was in charge of. The data had the right format but the wrong shape.g

After digging through forensic logs I figured out what had happened was that a machine had kernel panicked, which had left unwritten filesystem blocks in a log file (Google ran without local data FS journaling at the time), and those stale blocks happened to contain valid datah from another file with the same container format, but different contents. That was dutifully processed and forwarded on, and ended up crashing every worker that processed it.

One kernel panic took down an entire processing pipeline because of a freak coincidence.i

In that case the machine that was the root cause wasn’t even my responsibility, so I just reported my findings and left it at that. The team doing the processing put in some workaround to skip the bad data. All this can be done in a timely fashion if you have the people.j

“The system processing the data should’ve been built to blocklist bad data that causes crashes!”

Yeah, yeah, there is always something that “could’ve been done to avoid the failure”.

It is impossible to identify and preemptively defend against all such freak circumstances.k

This is why you have engineers on call who know their systems inside out. So that when things do go wrong — and they will — they can be fixed quickly, before things escalate into a widespread failure.

I have many such stories from my short 2.5 years at Google SRE.l

There’s also the dumb human/machine interaction errors. Google Docs smart quotes once took down a whole cluster, because fooctl ‘very-important-username’ is a very different bash command from fooctl 'very-important-username'.m
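To see why that one character matters, here is a small sketch that mimics shell word-splitting with Python’s shlex; fooctl and the username are the anecdote’s placeholders, not a real tool.

    # Curly "smart" quotes are ordinary characters to a shell, so they end up
    # inside the argument instead of delimiting it. fooctl is a placeholder.

    import shlex

    pasted_from_doc   = "fooctl \u2018very-important-username\u2019"  # smart quotes
    typed_in_terminal = "fooctl 'very-important-username'"            # plain quotes

    print(shlex.split(pasted_from_doc))
    # ['fooctl', '‘very-important-username’']  <- quotes become part of the argument
    print(shlex.split(typed_in_terminal))
    # ['fooctl', 'very-important-username']    <- quotes stripped, as intended

Whatever fooctl then does with that first argument, it is not being done to very-important-username.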

Guess what: if you have fresh people from the wrong teams copying and pasting instructions, fixing one of these is going to take much longer than it should.

Team-team interactions are also critical. Often one team can quickly help mitigate an issue with another team’s systems.n

But if you have 1/4th the workforce scrambling to learn new systems while they also focus their efforts on Musk’s stupid idea of the day first, none of that is going to go well.o

So yes, Twitter will coast along, until a freak incident causes a whole core service to fail and there’s nobody to fix it quickly. Until an engineer new to a system makes a typo and brings it all down and there is nobody left who knows how to bring it back up.p

Until a critical system becomes overloaded because nobody was on the pager or nobody could react in time, and a traffic management cascade failure causes a bunch of upstream/downstream stuff to become overloaded and fail.q

You know what’s also notoriously hard with big distributed systems like this?

Bringing stuff back up when it all goes down.

What if the system your engineers use to log in to production systems is down?r

What if DNS or a similar global service underpinning everything goes down, and you can’t even find the IPs of the machines you are trying to reach?

What if there is a circular dependency between systems, such that you can’t bring one up if the other is down, and they both are?s

What if that circular dependency involves 8 different subsystems across multiple production clusters and nobody even realized it was there until now?
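A cold-boot plan is essentially a topological sort of “X must be up before Y” edges, and a cycle means no valid order exists. A toy sketch (invented service names and edges) using Python’s standard graphlib:

    # Toy cold-boot planner: order services by "must already be up" edges.
    # Service names and edges are invented; the point is that a dependency
    # cycle leaves no valid order in which to bring things back up.

    from graphlib import TopologicalSorter, CycleError

    # service -> services that must be up before it can start
    needs = {
        "frontend": {"api"},
        "api":      {"auth", "storage"},
        "auth":     {"storage"},
        "storage":  {"config"},
        "config":   {"auth"},   # config needs auth, and auth (via storage) needs config
    }

    try:
        print("boot order:", list(TopologicalSorter(needs).static_order()))
    except CycleError as err:
        print("no valid cold-boot order; cycle:", err.args[1])

With five invented services the cycle is obvious; spread across many subsystems and multiple clusters, nobody sees it until the day everything is down at once.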

And then there’s thundering herd/stampede issues. When bringing back a whole subsystem, you need to do it gradually.t

Do all your internal systems have built-in traffic management that can do that and slowly open up the faucet to allow caches to prime themselves and things to recover gracefully? Has that all been tested? How long will it take until caches are hot and ready?u
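The “slowly open up the faucet” part can be as simple as an admission gate that lets through a growing fraction of traffic after a restart and sheds the rest while caches warm up. A minimal sketch, with an arbitrary ramp length:

    # Minimal sketch of gradually re-admitting traffic after a restart so cold
    # caches can warm up instead of being stampeded. Ramp length is arbitrary.

    import random
    import time

    class RampUpGate:
        def __init__(self, ramp_seconds: float):
            self.start = time.monotonic()
            self.ramp_seconds = ramp_seconds

        def admit(self) -> bool:
            elapsed = time.monotonic() - self.start
            allowed = min(1.0, elapsed / self.ramp_seconds)
            return random.random() < allowed        # shed the rest for now

    gate = RampUpGate(ramp_seconds=300)             # reach 100% over five minutes

    def handle(request: str) -> str:
        if not gate.admit():
            return "503: overloaded, retry later"   # load shedding while warming up
        return "200: served"

Real traffic management is far more involved (per-backend limits, priorities, retry budgets), but even this much has to exist, be wired in everywhere, and have been exercised before the day it is needed.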

In the end, it is very possible that Twitter will go down completely at some point in the next few weeks, and the remaining engineers will have no clue how to bring it back up, due to issues such as these. If the wrong combination of failures happened, it could take weeks.v

People think of servers as things you can just reboot and be fine. That’s not how this works. If you rebooted every single $FAANG3 server simultaneously right now, all of $FAANG would be down for probably months. Or worse. And that’s with functional teams. This stuff is hard.w

None of this is unique to any particular company. I’ve seen the Google war stories myself and been part of some. I’ve heard Amazon war stories. Twitter won’t be quite as bad because they aren’t as big as those two, but they’re big enough to run into the same kinds of issues.x

Here’s one more: I once removed every machine from a production cluster at Google due to a typo. That means they would be automatically wiped and formatted.

I happened to know the system in charge of doing this was deliberately throttled and there was a big red “PAUSE” button.y

I pushed the button, put the machines back into the database, unpushed it, and went on with my day.

Now imagine that scenario, except I have no idea and no training on that system and how to stop it, and everyone who did has been fired.z

On Twitter’s (and everyone else’s) cold boot problem[:]aa

  1. Cascading data center problems: In or around the spring of 2021, Twitter’s primary data center began to experience problems from a runaway engineering process, requiring the company to move operations to other systems outside of this datacenter. But, the other systems could not handle these rapid changes and also began experiencing problems. Engineers flagged the catastrophic danger that all the data centers might go offline simultaneously. A couple months earlier in February, Mudge had flagged this precise risk to the Board because Twitter data centers were fragile, and Twitter lacked plans and processes to “cold boot.” That meant that if all the centers went offline simultaneously, even briefly, Twitter was unsure if they could bring the service back up.
  2. “Black Swan” existential threat: In fact, in or about Spring of 2021, just such an event was underway, and shutdown looked imminent. Hundreds of engineers nervously watched the data centers struggle to stay running. The senior executive who supervised the Head of Engineering, aware that the incident was on the verge of taking Twitter offline for weeks, months, or permanently, insisted the Board of Directors be informed of an impending catastrophic “Black Swan” event. Board Member [redacted] responded with words to the effect of “Isn’t this exactly what Mudge warned us about?” Mudge told [redacted] that he was correct. In the end, Twitter engineers working around the clock were narrowly able to stabilize the problem before the whole platform shut down.
Mudge (2022), pp. 33–34, quoted in Green (2022) & Martin (2022aa–ab)

[…] Twitter was this close to badly crashing just last year. With all staff on board, they narrowly avoided it.

Next time this happens in Musk’s Twitter, it’s gone.ab

Footnotes

  1. Back end.
  2. Site reliability engineer.
  3. “Facebook, Amazon, Apple, Netflix, and Google,” five prominent American technology companies known for being “totally dominant in their markets” (Fernando, 2022).

References

Fernando, J. (2022, June 29). FAANG stocks: Definition and companies involved (M.J. Boyle & P. Rathburn, Eds.). Investopedia. Retrieved 14 November 2022.

Green, M.D. [@matthew_d_green] (2022, November 13). Pour one out for Twitter’s cold boot plans. [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022a, November 13). Twitter is already slowly breaking, and it’s only going to get worse. He says that Twitter was built to be … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022b, November 13). All large services and platforms are “built to be resilient”. But they are also extremely complicated, with countless internal interactions … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022c, November 13). These systems are built to be reliable in the face of things like machine failures, or entire optional microservices going … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022d, November 13). Cascade failures are caused when a small failure within a system has secondary effects on other systems, and the systems … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022e, November 13). There are many ways this can happen. For example, you can have a “packet of death” — some data that causes … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022f, November 13). This doesn’t have to be an exploit or externally malicious data. It can just be a circumstance that the design … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022g, November 13). I once had one of those at Google. A pipeline from a downstream team kept crashing because it was trying … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022h, November 13). After digging through forensic logs I figured out what had happened was that a machine had kernel panicked, which had left … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022i, November 13). … from another file with the same container format, but different contents. That was dutifully processed and forwarded on, and ended … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022j, November 13). In that case the machine that was the root cause wasn’t even my responsibility, so I just reported my findings … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022k, November 13). “The system processing the data should’ve been built to blocklist bad data that causes crashes!” Yeah, yeah, there is always … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022l, November 13). *This* is why you have engineers on call who know their systems inside out. So that when things do go … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022m, November 13). There’s also the dumb human/machine interaction errors. Google Docs smart quotes once took down a whole cluster, because `fooctl … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022n, November 13). Guess what: if you have fresh people from the wrong teams copying and pasting instructions, fixing one of those is … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022o, November 13). But if you have 1/4th the workforce scrambling to learn new systems while they also focus their efforts on Musk’s … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022p, November 13). So yes, Twitter will coast along, until a freak incident causes a whole core service to fail and there’s nobody … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022q, November 13). Until a critical system becomes overloaded because nobody was on the pager or nobody could react in time, and a … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022r, November 13). You know what’s also notoriously hard with big distributed systems like this? Bringing stuff back up when it *all* goes … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022s, November 13). What if DNS or a similar global service underpinning everything goes down, and you can’t even find the IPs of … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022t, November 13). What if that circular dependency involves 8 different subsystems across multiple production clusters and nobody even realized it was there … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022u, November 13). Do *all* your internal systems have built-in traffic management that can do that and slowly open up the faucet to … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022v, November 13). In the end, it is very possible that Twitter will go down *completely* at some point in the next few … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022w, November 13). People think of servers as things you can just reboot and be fine. That’s not how this works. If you … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022x, November 13). None of this is unique to any particular company. I’ve seen the Google war stories myself and been part of … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022y, November 13). Here’s one more: I once removed every machine from a production cluster at Google due to a typo. That means … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022z, November 13). I pushed the button, put the machines back into the database, unpushed it, and went on with my day. Now … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022aa, November 13). On Twitter’s (and everyone else’s) cold boot problem (source: techpolicy.press/wp-content/upl…) [Quote Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Martin, H. [@marcan42] (2022ab, November 13). And from the same doc: Twitter was this close to badly crashing just last year. With all staff on board … [Tweet]. Twitter (via Archive Today). Retrieved 14 November 2022.

Mudge (2022, July 6). Protected disclosures of Federal Trade Commission Act violations, material misrepresentations and omissions, and fraud by Twitter, Inc. (NASDAQ: TWTR) and CEO Parag Agrawal, SEC TCR# [disclosure to the U.S. Securities and Exchange Commission, U.S. Federal Trade Commission, and U.S. Department of Justice]. Whistleblower Aid.

