Here’s how a Twitter engineer says it will break in the coming weeks
On November 4, just hours after Elon Musk fired half of the 7,500 employees previously working at Twitter, some people began to see small signs that something was wrong with everyone’s favorite hellsite. And they saw it through retweets.
Twitter introduced retweets in 2009, turning an organic thing people were already doing, pasting someone else’s username and tweet preceded by the letters RT, into a software function. In the years since, the retweet and its distant cousin, the quote tweet (which launched in April 2015), have become two of the most common mechanics on Twitter.
But on Friday, a few users who pressed the retweet button saw the years roll back to 2009. Manual retweets, as they were called, were back.
The return of the manual retweet wasn’t Elon Musk’s latest attempt to appease users. Instead, it was the first public crack in the edifice of Twitter’s codebase—a blip on the seismometer that warns of a bigger earthquake to come.
A massive tech platform like Twitter is built on a great many interdependent parts. “The larger catastrophic failures are a little more titillating, but the biggest risk is the smaller things starting to degrade,” says Ben Krueger, a staff site reliability engineer who has more than two decades of experience in the tech industry. “These are very big, very complicated systems.” Krueger says that one 2017 presentation from Twitter staff includes a statistic suggesting more than half their backend infrastructure was dedicated to storing data.
While many of Musk’s detractors may wish the platform would go through the equivalent of thermonuclear destruction, the collapse of something like Twitter happens gradually. For those in the know, gradual breakdowns are a sign that a larger crash could be imminent. And that’s what’s happening now.
It’s the small things
Whether it’s manual RTs appearing for a moment, then slowly morphing into their standard form, ghostly follower counts that race ahead of those the app thinks are following you, or replies that simply refuse to load, small bugs are appearing at Twitter’s periphery. Even Twitter’s rules, which Musk linked to on November 7, went offline temporarily under the load of millions of eyeballs. In short, the platform is becoming unreliable.
“Sometimes you’ll get notifications that are a little off,” says one engineer currently working at Twitter, who’s concerned about the way the platform is reacting after vast swathes of his colleagues who were previously employed to keep the site running smoothly were fired. (That last sentence is why the engineer has been granted anonymity to talk for this story.) After struggling with downtime during its “Fail Whale” days, Twitter eventually became lauded for its team of site reliability engineers, or SREs. Yet this team has been decimated in the aftermath of Musk’s takeover. “It’s small things, at the moment, but they do really add up as far as the perception of stability,” says the engineer.
The small suggestions of something wrong will amplify and multiply as time goes on, he predicts—in part because the skeleton staff remaining to handle these issues will quickly burn out. “Round-the-clock is detrimental to quality, and we’re already kind of seeing this,” he says.
Twitter’s remaining engineering team has largely been tasked with keeping the site stable over the last few days, since the company’s new CEO decided to get rid of a significant chunk of the staff maintaining its codebase. As the company tries to return to some semblance of normalcy, more of the team’s time will be spent addressing Musk’s (often taxing) whims for new products and features, rather than keeping what’s already there running.
This is particularly problematic, says Krueger, for a site like Twitter, which can have unforeseen spikes in user traffic and interest at random. Krueger compares Twitter to online retail sites, where companies can prepare for big traffic events like Black Friday with some predictability. “When it comes to Twitter, they have the possibility of having a Black Friday on any given day at any time of the day,” says Krueger. “At any given day, some news event can happen that can have significant impact on the conversation.” Handling those spikes is harder when you lay off up to 80% of your SREs, a figure Krueger says has been bandied about within the industry but which MIT Technology Review has been unable to confirm. The engineer agreed that percentage sounded “plausible”.
The current Twitter engineer doesn’t see a route out of the issue other than reversing the layoffs (which the company has reportedly already attempted to roll back somewhat). “If we’re going to be pushing at a breakneck pace, then things will break,” he says. “There’s no way around that. We’re accumulating technical debt much faster than before, almost as fast as we’re accumulating financial debt.”
The list grows longer
He presents a dystopian future where issues pile up as the backlog of maintenance tasks and fixes grows longer and longer. “Things will be broken. Things will be broken more often. Things will be broken for longer periods of time. Things will be broken in more severe ways,” he says. “Everything will compound until eventually, it’s not usable.”
Twitter’s collapse into an unusable wreck is some time off, the engineer says, but the telltale signs of process rot setting in are already there. It starts with the small things: “Bugs in whatever part of whatever client they’re using; whatever service in the backend they’re trying to use,” the engineer says. “They’ll be small annoyances to start, but as the backend fixes are being delayed, things will accumulate until people will eventually just give up.”
Krueger says that Twitter won’t blink out of life, but that we’ll start to see a greater number of tweets not loading, and accounts coming into and out of existence seemingly on a whim. “I would expect anything that’s writing data on the backend to possibly have slowness, timeouts, and a lot more subtle types of failure conditions,” says Krueger. “But they’re often more insidious. And they also generally take a lot more effort to track down and resolve. If you don’t have enough engineers, that’s going to be a significant problem.”
The juddering manual retweets and faltering follower counts are indications that this is already happening. Twitter engineers have designed failsafes that the platform can fall back on so that the functionality doesn’t go totally offline, but instead provides cut-down versions—that’s what we’re seeing, says Krueger.
Alongside the minor malfunctions, the Twitter engineer also believes that there’ll be significant outages on the horizon, thanks in part to Musk’s cost-cutting drive to reduce Twitter’s cloud computing server load in an attempt to claw back up to $3 million a day in infrastructure costs. Reuters reports that the project, which came from Musk’s war room, is called the “Deep Cuts Plan”. One of Reuters’ sources called the plans “delusional”, while University of Surrey cybersecurity professor Alan Woodward says that “unless they’ve massively overengineered the current system, the risk of poorer capacity and availability seems a logical conclusion.”
Meanwhile, when things do go kaput, there’s no longer the institutional knowledge within the company to fix issues quickly as they arise. “A lot of the people I saw who were leaving after Friday have been there nine, 10, 11 years, which is just ridiculous for a tech company,” says the Twitter engineer. As those individuals walked out of Twitter offices, decades of knowledge about how its systems worked disappeared with them. (Those within Twitter, and those watching from the sidelines, have previously argued Twitter’s knowledge base is overly concentrated in the minds of a handful of programmers, some of whom have been fired.)
Unfortunately, the teams stripped back to their bare bones, according to those remaining at Twitter, include the tech writers’ team. “We had good documentation because of [that team],” says the engineer. No longer. When things go wrong, it’ll be harder to find out what has happened.
Getting answers will be harder externally as well. The communications team has been cut down from between 80 and 100 to just two people, according to one former member of the communications team MIT Technology Review spoke to. “There’s too much for them to do, and they don’t speak enough languages to deal with the press as they need to,” says the engineer.
When MIT Technology Review reached out to Twitter for this story, its email went unanswered.
There is a strong element of people in glass houses throwing stones in Musk’s recent criticism of Mastodon, the open-source alternative to Twitter that has piled on users in the days since the entrepreneur took control of the platform. The Twitter CEO tweeted, then quickly deleted, a post telling users that “If you don’t like Twitter anymore, there is awesome site [sic] called Masterbatedone [sic]”, alongside a photograph of his laptop screen open on Paul Krugman’s Mastodon profile, showing him trying multiple times to post to the platform. Despite Musk’s attempt to highlight Mastodon’s unreliability, its success has been remarkable: nearly half a million people have signed up to the federated platform since Musk took over Twitter.
That growth is happening at the same time that the first cracks in Twitter’s edifice are starting to show. Krueger expects this is just the beginning. “I would expect to start seeing significant public-facing problems with the technology within six months,” he says. “And I feel like that’s a generous estimate.”