Systems that need to scale hit predictable failure modes at each scale inflection point, and the engineering patterns that address those failures are well understood for applications that have crossed the first few thresholds.

The database bottleneck

The first scaling inflection point for most applications is the relational database: a single instance handles reads and writes for all application traffic. The first scaling interventions: add a read replica for read traffic (read/write splitting), add a connection pool (PgBouncer, HikariCP) to reduce connection overhead, and add database indexes for the queries generating full table scans. These interventions can extend the single-database architecture by orders of magnitude before sharding is required.
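The read/write split can be sketched as a router that sends plain reads to the replica and everything else to the primary. This is a minimal illustration, not production code: the connection objects are string stand-ins, and `Router` is a hypothetical name. A real router would also pin a session to the primary after a write to avoid reading stale replica data.

```python
class Router:
    """Route statements to the primary or a read replica by statement type."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def route(self, sql: str):
        # Plain SELECTs can be served by the replica; writes (and anything
        # ambiguous) go to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return self.replica
        return self.primary

# Stand-ins for pooled connections (e.g. obtained via PgBouncer).
router = Router(primary="primary-conn", replica="replica-conn")
print(router.route("SELECT * FROM users"))      # served by the replica
print(router.route("UPDATE users SET ..."))     # served by the primary
```

In practice the routing decision lives in the driver, an ORM hook, or a proxy in front of the database, but the decision itself is this simple.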

Caching as the force multiplier

A cache reduces database load by serving frequently accessed data from memory. The cache hit rate determines the effectiveness: a 90% hit rate means only 10% of reads reach the database, so the same database sustains 10x the read throughput. The key caching decisions: what to cache (read-heavy, expensive-to-compute, tolerable-staleness data), the eviction policy (LRU for general caches), and cache invalidation (time-to-live for eventually consistent data, explicit invalidation for consistency-sensitive data). Redis is the de facto standard distributed cache.
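The common shape for this is the cache-aside pattern with a TTL, which can be sketched as below. A dict stands in for Redis, and `get_user` and its `db_load` callback are hypothetical names for illustration.

```python
import time

class TTLCache:
    """Cache-aside store with per-entry time-to-live (a stand-in for Redis)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.store[key]  # lazy expiry on read
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=60)

def get_user(user_id, db_load):
    # Cache-aside: try the cache first, fall back to the database on a miss,
    # then populate the cache so the next read within the TTL is a hit.
    user = cache.get(user_id)
    if user is None:
        user = db_load(user_id)
        cache.set(user_id, user)
    return user
```

The TTL is the staleness bound: within it, reads may return outdated data, which is the trade accepted for the reduced database load. Consistency-sensitive keys would instead be deleted explicitly when the underlying row changes.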

Stateless services and horizontal scaling

Stateless services (all state in the database or cache, no local session state) scale horizontally: add more instances behind a load balancer. The common statefulness traps: local in-memory session state, local filesystem caching, and background timers that depend on a specific instance being alive. Identifying and externalising state dependencies is the prerequisite for horizontal scaling.
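The session-state trap and its fix can be sketched side by side. Here `SharedSessionStore` is a hypothetical stand-in for a shared store such as Redis; in the stateful anti-pattern, each instance would instead hold its own in-memory session dict, invisible to the other instances behind the load balancer.

```python
class SharedSessionStore:
    """Stand-in for a store (e.g. Redis) reachable by every instance."""

    def __init__(self):
        self.data = {}

    def get(self, session_id):
        return self.data.get(session_id)

    def put(self, session_id, session):
        self.data[session_id] = session

class AppInstance:
    def __init__(self, store):
        self.store = store  # shared and external, not instance-local

    def handle_login(self, session_id, user):
        self.store.put(session_id, {"user": user})

    def handle_request(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None

store = SharedSessionStore()
a, b = AppInstance(store), AppInstance(store)
a.handle_login("s1", "alice")
# The load balancer routes the next request to a different instance,
# which still sees the session because the state is external.
print(b.handle_request("s1"))
```

With state externalised, any instance can serve any request, instances can be added or killed freely, and the remaining traps (local filesystem caching, instance-bound timers) yield to the same treatment: move the state or the schedule into a shared service.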

The queue-based load leveller

High-volume write operations that are not latency-sensitive (analytics events, audit logs, email sends, async notifications) should not write directly to the database from the request path. A message queue (Service Bus, SQS, Redis) decouples the accept rate (the rate at which the service accepts requests) from the processing rate (the rate at which the database can handle writes). Load spikes are absorbed by the queue rather than cascading to the database.
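The decoupling of accept rate from processing rate can be sketched with an in-process queue. The `deque` here is a stand-in for SQS, Service Bus, or a Redis list, and `handle_request` and `worker_drain` are hypothetical names; the point is that the request path only enqueues, while a separate worker drains at the database's pace.

```python
from collections import deque

queue = deque()   # stand-in for a durable message queue
db_writes = []    # stand-in for rows written to the database

def handle_request(event):
    # Accept path: O(1) enqueue, independent of database write latency.
    queue.append(event)
    return "accepted"

def worker_drain(batch_size):
    # Processing path: drain at the rate the database can sustain,
    # typically in batches to amortise per-write overhead.
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    db_writes.extend(batch)  # stands in for one batched INSERT
    return len(batch)

# A burst of 5 events is absorbed by the queue immediately...
for i in range(5):
    handle_request({"event": i})
# ...and written to the database in smaller batches behind the spike.
worker_drain(batch_size=2)
```

The queue depth becomes the observable buffer: a spike shows up as temporary depth growth rather than as database saturation, and the only requirement on the workload is that it tolerates the resulting processing delay.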