Twitter's chaotic engineering test case

Elon Musk completed the Twitter acquisition on October 27th, 2022. The events that followed (layoffs of roughly 50% of staff, accelerated deadlines, and infrastructure changes) became a real-world case study in engineering under chaos.

When half of an engineering organisation is let go at once, three risks stand out. Knowledge walks out when the people who know how specific systems work are suddenly gone. Dependency chains break when services maintained by departed engineers stop getting updates. Morale suffers among those who remain. Twitter experienced all three.

The first week after the cuts brought a cascade of incidents caused not by code bugs but by missing human knowledge. The service that aggregates real-time tweet counts relied on a set of Kafka topics that only two engineers knew how to re-partition. When one of them left, topic lag spiked to 15 minutes, and the monitoring dashboards went silent because the Grafana alerts were tied to a runbook stored in a private Git repo that vanished with the account. The on-call rotation was halved, so PagerDuty alerts piled up and the remaining responders had to triage three times the usual volume at 3 am. Those firefights exposed how tightly operational procedures were coupled to the people who wrote them.
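One way to surface this kind of exposure before it turns into an incident is a simple ownership audit. The sketch below is illustrative only: the system names, engineer names, and the two-owner threshold are all assumptions, not details of Twitter's actual tooling.

```python
def at_risk_systems(knowledge, departed, min_owners=2):
    """Return systems left with fewer than `min_owners` knowledgeable engineers.

    knowledge: dict mapping system name -> set of engineer names.
    departed:  set of engineers who have left the organisation.
    """
    at_risk = {}
    for system, engineers in knowledge.items():
        remaining = engineers - departed
        if len(remaining) < min_owners:
            at_risk[system] = sorted(remaining)
    return at_risk


# Hypothetical inventory; every name here is made up for illustration.
knowledge = {
    "tweet-count-kafka-topics": {"alice", "bob"},
    "timeline-api": {"carol", "dave", "erin"},
    "grafana-alert-runbooks": {"bob"},
}
departed = {"bob", "dave"}

report = at_risk_systems(knowledge, departed)
for system, remaining in sorted(report.items()):
    print(f"{system}: remaining owners = {remaining or 'none'}")
```

Run against a real inventory, a report like this would flag the Kafka topics and the runbooks as single points of human failure while the people who understand them are still around to hand over.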

That the platform largely kept operating was a product of its resilience engineering, not of a stable transition. It was a testament to the foundation built by previous engineers.

Musk attempted to replace Twitter's engineering culture with a new one: move fast, ship daily, no remote work. However, this culture change was imposed on a team that had built a codebase and operational practices around the previous culture. Such rapid culture change does not produce a new culture; it produces attrition, survivor syndrome in the remaining team, and incomplete knowledge transfer.

Twitter's infrastructure, running on a mix of on-premises hardware and cloud, continued to serve hundreds of millions of users through the transition. The reliability of the underlying platform was a product of years of engineering investment in distributed systems, caching, and traffic management.

The hybrid footprint also forced hard choices about latency versus cost. Twitter kept its tweeting pipeline on a fleet of 2,000 bare-metal servers in a Virginia data centre, while the media processing layer ran on AWS EC2 c5.2xlarge instances behind an internal load balancer. At peak load the traffic manager handled roughly 30 Tbps of data and the timeline API saw 150,000 requests per second. Because the on-prem servers could not be auto-scaled, capacity engineers had to over-provision by 30 percent, which meant paying for idle CPU cycles during off-peak hours. When the cloud side was throttled by a sudden spike in video uploads, the fallback path to the data centre added 120 ms of tail latency that showed up in user-experience metrics.
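The cost of fixed over-provisioning is easy to put numbers on. The sketch below assumes that "over-provision by 30 percent" means capacity is sized at 1.3 times peak demand; the off-peak demand ratios are hypothetical and chosen only to show the shape of the curve.

```python
def provisioned_capacity(peak_demand, headroom=0.30):
    """Capacity needed when sizing for peak demand plus a fixed headroom."""
    return peak_demand * (1 + headroom)


def idle_fraction(demand, peak_demand, headroom=0.30):
    """Share of provisioned capacity that sits unused at a given demand."""
    capacity = provisioned_capacity(peak_demand, headroom)
    return 1 - demand / capacity


# Demand is expressed relative to peak; the ratios are assumptions.
peak = 1.0
for ratio in (1.0, 0.7, 0.4):
    idle = idle_fraction(ratio * peak, peak)
    print(f"demand at {ratio:.0%} of peak -> {idle:.1%} of capacity idle")
```

Under that reading, about 23 percent of the fleet is idle even at peak (0.3 / 1.3), and the idle share grows as demand falls, which is the bill that auto-scaling on the cloud side avoids.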

The lesson for infrastructure engineers is that well-built distributed systems are more resilient to organisational disruption than the chaos around them might suggest. Twitter's ability to keep operating through the turmoil is the evidence.

The 2023 rebrand from Twitter to X.com required significant infrastructure work: domain changes, link rewriting, API endpoint updates, and brand asset replacements at scale. It was executed faster than most organisations would attempt for a platform of this size.
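Link rewriting sounds trivial but is easy to get wrong if it is done by substring replacement. The sketch below is a minimal illustration using Python's standard `urllib.parse`; the host mapping is an assumption made for the example, not the list the platform actually used.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed mapping of old hosts to new ones, for illustration only.
HOST_MAP = {
    "twitter.com": "x.com",
    "www.twitter.com": "x.com",
}


def rewrite_link(url):
    """Rewrite a URL's host if it exactly matches a known old host.

    Path, query, and fragment are preserved. Unknown hosts are returned
    unchanged. Any userinfo in the original URL is dropped.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    new_host = HOST_MAP.get(host)
    if new_host is None:
        return url
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(rewrite_link("https://twitter.com/user/status/1?s=20"))
print(rewrite_link("https://nottwitter.com/page"))
```

Matching on the parsed hostname rather than on the raw string means a look-alike domain such as the second example is left untouched, and any subdomain not listed in the mapping needs an explicit decision rather than being rewritten by accident.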

The domain swap was orchestrated with a DNS TTL of 300 seconds, but many edge caches still honoured the previous 24-hour TTL settings baked into Fastly configurations. As a result, a fraction of users were redirected to the old twitter.com endpoints for days, and API clients that hard-coded the Host header received 404 responses. The deployment pipeline, built on Spinnaker, promoted the new X.com services in a single stage. When that stage failed because of a mis-configured IAM role, half of the microservices never received the updated configuration and continued to emit Twitter-branded metadata. The team had to intervene manually and roll back the changes service by service, which added weeks to the stabilisation effort.
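The alternative to a single-stage promotion is a rollout that widens in steps and halts at the first failure, so that a bad configuration reaches a handful of services rather than half the fleet. The sketch below shows the idea in plain Python. It is not Spinnaker's API; the `deploy` callback, the stage sizes, and the service names are hypothetical.

```python
def staged_rollout(services, deploy, stage_sizes=(1, 5, None)):
    """Deploy services in growing stages, halting at the first failure.

    services:    ordered list of service names.
    deploy:      callable taking a service name, returning True on success.
    stage_sizes: batch size per stage; None means "everything that is left".

    Returns (updated, failed, pending): services updated so far, the service
    that failed (or None), and services that were never touched.
    """
    updated, pending = [], list(services)
    for size in stage_sizes:
        if not pending:
            break
        batch = pending if size is None else pending[:size]
        pending = pending[len(batch):]
        for i, service in enumerate(batch):
            if deploy(service):
                updated.append(service)
            else:
                # Halt: the rest of this batch and all later stages stay untouched.
                return updated, service, batch[i + 1:] + pending
    return updated, None, pending


# Hypothetical fleet in which one service has a broken role binding.
services = [f"svc-{n}" for n in range(10)]
updated, failed, pending = staged_rollout(services, lambda s: s != "svc-3")
print("updated:", updated)
print("failed: ", failed)
print("pending:", pending)
```

In this example the failure is caught in the second stage with three services updated and six untouched, which leaves a small, known set to roll back instead of a fleet in a mixed state.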

The broken link patterns and API inconsistencies that persisted for months suggest the rebrand was executed too quickly. They also show how hard it is to make sweeping changes to a large platform at speed.