System Design Article

Horizontal vs Vertical Scaling

Difficulty: Easy

When traffic grows, you have two choices: make the box bigger (vertical) or add more boxes (horizontal). This lesson lays out the cost, complexity, and ceiling of each approach, why stateless services scale horizontally with almost no thought, why stateful services require sharding or replication, and how real teams pick a default. By the end you can answer 'how would you scale this?' with a defensible answer instead of an instinct.

The Two Axes of Scaling

When one server runs out of capacity, you have two choices.

Vertical scaling (scale up)

Replace the box with a bigger box. More CPU cores, more RAM, faster disks, faster NIC. The application and operating system do not change; the same process runs on a beefier machine.

---------- Vertical scaling ----------
  before:  [ 4 vCPU,  16 GB RAM ]   handles 1K req/sec
  after:   [ 32 vCPU, 128 GB RAM ]  handles 8K req/sec

Horizontal scaling (scale out)

Keep the boxes the same; add more of them. A load balancer distributes requests across the fleet.

---------- Horizontal scaling ----------
  before:  [ load balancer ] -> [ server ]                 1K req/sec
  after:   [ load balancer ] -> [ server ] [ server ] [ server ] [ server ]  4K req/sec
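
To make the "load balancer distributes requests" step concrete, here is a minimal round-robin picker. It is only a sketch: the server names are hypothetical placeholders, and real load balancers add health checks, weights, and connection tracking on top of this.

```javascript
// Minimal round-robin selection; server names are hypothetical placeholders.
function createRoundRobin(servers) {
    if (servers.length === 0) throw new Error('need at least one server');
    let next = 0;
    return function pick() {
        const server = servers[next];
        next = (next + 1) % servers.length; // wrap around the fleet
        return server;
    };
}

const pick = createRoundRobin(['server-1', 'server-2', 'server-3', 'server-4']);
const assigned = [];
for (let i = 0; i < 8; i++) assigned.push(pick());
console.log(assigned.join(' '));
```

Each server receives every fourth request, which is why four identical boxes can absorb roughly four times the traffic of one.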

The Comparison Table

Dimension	Vertical	Horizontal
Setup complexity	Trivial - resize the VM	Requires load balancing, service discovery, possibly sharding
Application changes	None	Must be stateless OR coordinate state across nodes
Cost curve	Roughly linear within a cloud instance family; can turn non-linear near the top of the range and on owned hardware	Roughly linear (more boxes, more cost)
Hard ceiling	Largest available instance (~448 vCPU, ~24 TB RAM today on AWS)	Effectively unbounded - thousands of nodes are routine
Failure domain	Whole service goes down if the box fails	One node failure = 1/N capacity loss
Deploy strategy	Restart the one box (downtime) or blue-green (double the cost during cutover)	Rolling deploy, no downtime
Best for	Stateful single-writer systems, leader nodes, small workloads	Stateless services, read replicas, worker pools

The table makes vertical sound bad. It is not - it is the right answer for plenty of workloads, especially early ones. The question is which one you should default to.
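
The failure-domain row can be turned into simple arithmetic. The sketch below assumes every node handles the same load (1K req/sec, as in the diagrams above) and that one spare node is enough; both function names are made up for illustration.

```javascript
// How many identical nodes cover the peak, plus spares for failures.
function nodesNeeded(peakRps, perNodeRps, spareNodes = 1) {
    return Math.ceil(peakRps / perNodeRps) + spareNodes;
}

// Capacity left after some nodes fail (never below zero).
function capacityAfterFailures(nodes, perNodeRps, failed) {
    return Math.max(nodes - failed, 0) * perNodeRps;
}

const fleet = nodesNeeded(4000, 1000);                      // 4 nodes + 1 spare
const degraded = capacityAfterFailures(fleet, 1000, 1);     // fleet minus one node
const singleBoxDown = capacityAfterFailures(1, 8000, 1);    // the one big box fails

console.log({ fleet, degraded, singleBoxDown });
```

With a spare node, the fleet still covers the 4K req/sec peak after one failure, while the single large box drops to zero capacity.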

Stateless vs Stateful (the real fork)

The scaling decision is downstream of one design choice: does your service hold state between requests?

Stateless service

Every request is self-contained. The server holds no memory of previous requests; any node in the fleet can handle any request equally.

Examples: REST API servers, image transcoders, render workers, function-as-a-service handlers.

Stateless services scale horizontally trivially. Add a node, the load balancer starts sending it traffic, no further coordination needed.

---------- Stateless: any node handles any request ----------
  request -> [ load balancer ] -> [ node 1 ] [ node 2 ] [ node 3 ]
              all nodes are interchangeable

Stateful service

The server holds data that persists between requests: a database row, a session, a partial computation, an open connection.

Examples: databases, caches, message queues, WebSocket servers holding live connections, in-memory game state.

Stateful services cannot just be cloned. The new node has no data; the existing nodes must hand off or replicate state. There are three ways out:

1. Externalize the state so the service itself becomes stateless (move sessions to Redis, cache to Memcached, files to S3).
2. Replicate the state so every node has a copy (works up to a point; replication overhead grows with N).
3. Shard the state so each node owns a slice (the only path that scales, at the cost of complexity).
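
As an illustration of the sharding option, here is a minimal sketch of hash-based shard selection. The key format and shard count are assumptions, and the hash is a plain FNV-1a over the string's UTF-16 code units, which matches the byte-wise version only for ASCII keys.

```javascript
// FNV-1a 32-bit hash over UTF-16 code units (byte-accurate for ASCII keys only).
function fnv1a(str) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < str.length; i++) {
        hash ^= str.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0; // 32-bit multiply, keep unsigned
    }
    return hash >>> 0;
}

// Every key maps to exactly one shard, so exactly one node owns its data.
function shardFor(key, shardCount) {
    return fnv1a(key) % shardCount;
}

// Hypothetical keys, spread over four shards.
const keys = ['user:1', 'user:2', 'user:3', 'user:42', 'user:1000'];
for (const key of keys) {
    console.log(key, '-> shard', shardFor(key, 4));
}
```

Plain modulo sharding remaps most keys when the shard count changes, which is a large part of the complexity cost mentioned above; consistent hashing is one common way to soften that.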

The 'make it stateless' trick

The single most common scaling unlock in modern web architecture is to push state out of the request-handling layer.

Before: web server stores user sessions in process memory. The user must come back to the same node (sticky sessions) or re-login. Adding a node helps with new traffic but cannot help that user's existing session.

After: web server reads the session from Redis on every request. Any node can handle the user's next request. Add 100 nodes and throughput scales close to 100x, until Redis or the database becomes the next bottleneck.

Pseudocode for moving session state out

JavaScript

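import express from 'express';
import cookieParser from 'cookie-parser';
import { createClient } from 'redis';

const app = express();
app.use(cookieParser()); // populates req.cookies

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// Stateless handler: pulls session from Redis on every request.
app.get('/me', async (req, res) => {
    const sessionId = req.cookies.sid;
    if (!sessionId) return res.status(401).end(); // no cookie, no session
    const sessionRaw = await redis.get(`session:${sessionId}`);
    if (!sessionRaw) return res.status(401).end(); // expired or unknown session
    const session = JSON.parse(sessionRaw);
    res.json({ userId: session.userId, role: session.role });
});

app.listen(3000);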

Notice: the handler now needs zero in-process state. Any number of API nodes can serve any user.
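
The same idea can be shown without a running Redis. In the sketch below, `ExternalSessionStore` is a hypothetical in-memory stand-in for Redis and `createNode` is a made-up model of an API node; the point is only that a session written through one node is readable through another.

```javascript
// Stand-in for Redis: a key-value store that lives outside any API node.
class ExternalSessionStore {
    constructor() {
        this.data = new Map();
    }
    set(key, value) {
        this.data.set(key, JSON.stringify(value));
    }
    get(key) {
        const raw = this.data.get(key);
        return raw === undefined ? null : JSON.parse(raw);
    }
}

// A model API node that keeps no session data of its own.
function createNode(name, store) {
    return {
        login(sessionId, userId) {
            store.set(`session:${sessionId}`, { userId });
            return { servedBy: name };
        },
        me(sessionId) {
            const session = store.get(`session:${sessionId}`);
            if (!session) return { status: 401, servedBy: name };
            return { status: 200, userId: session.userId, servedBy: name };
        },
    };
}

const store = new ExternalSessionStore();
const nodeA = createNode('node-a', store);
const nodeB = createNode('node-b', store);

nodeA.login('abc', 'user-7');        // login lands on node A
const response = nodeB.me('abc');    // next request lands on node B
console.log(response);
```

If each node kept its own `Map` instead of sharing the store, the second call would return 401, which is the sticky-session problem described above.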

Where the Bottleneck Moves

Adding capacity does not eliminate bottlenecks; it shifts them. Knowing where the new bottleneck appears is what makes a senior engineer's scaling answer feel earned.

Layer	Vertical bottleneck	Horizontal bottleneck (where it moves)
API server CPU	CPU saturates the box	Load balancer or downstream DB saturates
Database connection limit	Postgres is typically limited to a few hundred connections per box (max_connections defaults to 100)	Connection pooler (PgBouncer) saturates
Cache memory	Redis bounded by one node's RAM (e.g. ~50 GB)	Cluster gossip overhead, hot shards
Request rate	NIC pegged at 25 Gbps	Cross-AZ bandwidth or load balancer cap
Lock contention	Single-row hot updates	Same row contention regardless of N nodes

The last row is the important one: horizontal scaling does not help with lock contention. If 10K requests per second all want to UPDATE the same inventory row, adding more API servers makes the contention worse, not better. The fix is at a different layer (sharding the row, queuing the writes, or using optimistic concurrency).
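
One way to picture "sharding the row" is a counter split into several slots, so concurrent writers rarely touch the same one. This is an in-memory sketch with made-up names; in a database each slot would be its own row, and the slot count of 8 is an arbitrary assumption.

```javascript
// A counter split across several slots to spread write contention.
class ShardedCounter {
    constructor(slotCount) {
        this.slots = new Array(slotCount).fill(0);
    }
    // Each write touches one randomly chosen slot instead of one hot row.
    add(amount) {
        const i = Math.floor(Math.random() * this.slots.length);
        this.slots[i] += amount;
    }
    // Reads pay the price: they must sum every slot.
    total() {
        return this.slots.reduce((sum, value) => sum + value, 0);
    }
}

const unitsSold = new ShardedCounter(8);
for (let i = 0; i < 10000; i++) unitsSold.add(1);
console.log(unitsSold.total());
```

The trade-off is that reads get more expensive and invariants such as "never sell below zero stock" become harder to enforce, since no single slot sees the whole total.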

When Vertical Scaling Is Still the Right Answer

Despite the case for horizontal, vertical wins in three situations.

1. Small workloads

A service handling 100 req/sec on a 4 vCPU box does not need a load balancer, service discovery, or rolling deploys. You will pay 10x the operational complexity for nothing. Vertical scaling - or not scaling at all - is the right call until you have evidence you need horizontal.

2. Single-writer stateful systems

A Postgres primary, a Kafka controller, a ZooKeeper ensemble leader, a Redis Cluster master (for the hash slots it owns) - these are inherently single-writer for correctness. You scale them vertically until you can shard them. A 32-core, 256 GB Postgres box can serve a remarkable amount of traffic before sharding is worth the complexity.

3. Latency-sensitive in-process operations

When request latency is dominated by a single CPU-bound operation (ML inference, video transcoding, a complex SQL query), a faster CPU helps directly. Adding a second box does not make the operation faster; it just lets you do more of them in parallel. If you need to do the same one faster, vertical wins.
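
The latency-versus-throughput distinction is easy to check with a toy model. The sketch assumes each node runs one operation at a time with no queuing, and the 200 ms service time and 2x speedup are made-up numbers.

```javascript
// Toy model: one operation at a time per node, no queuing.
function estimate({ nodes, serviceTimeMs, speedup = 1 }) {
    const latencyMs = serviceTimeMs / speedup;          // time for one operation
    const throughputPerSec = nodes * (1000 / latencyMs); // operations per second
    return { latencyMs, throughputPerSec };
}

const base = estimate({ nodes: 1, serviceTimeMs: 200 });
const scaledOut = estimate({ nodes: 4, serviceTimeMs: 200 });            // more boxes
const scaledUp = estimate({ nodes: 1, serviceTimeMs: 200, speedup: 2 }); // faster CPU

console.log({ base, scaledOut, scaledUp });
```

Scaling out quadruples throughput but leaves each operation at 200 ms; only the faster box brings a single operation down to 100 ms.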

A Cost Model You Can Quote

On AWS (approximate on-demand prices), an m6i.large (2 vCPU, 8 GB) costs about $0.10/hour. An m6i.32xlarge (128 vCPU, 512 GB) costs about $6.10/hour. With these rounded figures the big box is 64x the cores at roughly 61x the cost, so pricing within the family is close to linear.

But the operational cost of horizontal scaling is real: load balancer ($25 to $100/month), monitoring per node, more deployment surface, more failure modes. For workloads under ~$500/month in compute, vertical is often cheaper end to end. Above that, horizontal usually wins.
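
The numbers above can be worked through in a few lines. The sketch uses the rounded prices quoted in this section; the 730 hours per month and the $50/month load balancer (inside the $25 to $100 range above) are assumptions, and monitoring or engineering time is not modelled.

```javascript
// Rounded prices from the text; treat them as approximations.
const small = { name: 'm6i.large', vcpu: 2, pricePerHour: 0.10 };
const large = { name: 'm6i.32xlarge', vcpu: 128, pricePerHour: 6.10 };

const coreRatio = large.vcpu / small.vcpu;                   // 64x the cores
const priceRatio = large.pricePerHour / small.pricePerHour;  // about 61x the price

const HOURS_PER_MONTH = 730; // assumption: 365 * 24 / 12

function monthlyCost(pricePerHour, count = 1, fixedOverhead = 0) {
    return pricePerHour * HOURS_PER_MONTH * count + fixedOverhead;
}

// Share of the monthly bill taken by a fixed overhead such as a load balancer.
function overheadShare(count, fixedOverhead) {
    return fixedOverhead / monthlyCost(small.pricePerHour, count, fixedOverhead);
}

console.log('core ratio:', coreRatio, 'price ratio:', Math.round(priceRatio));
console.log('4 small boxes + LB:', monthlyCost(small.pricePerHour, 4, 50).toFixed(2));
console.log('LB share at 4 nodes:', overheadShare(4, 50).toFixed(3));
console.log('LB share at 64 nodes:', overheadShare(64, 50).toFixed(3));
```

The fixed overhead is a noticeable slice of a small fleet's bill and close to noise for a large one, which is the intuition behind the rough $500/month threshold.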

This matters in interviews: when you say 'I would scale horizontally', be ready for 'why not just buy a bigger box?'. The answer is some version of 'because at our load, raw compute dominates the bill, the operational overhead is small by comparison, and horizontal gives me availability and zero-downtime deploys for free'.

A Mental Algorithm for the Interview

1. Is the service stateless? If yes, default to horizontal. If no, can you push the state out (Redis, S3, DB)? If yes, do that and now it is stateless.
2. What is the request profile? Read-heavy stateful: replicate (read replicas). Write-heavy stateful: shard. Single-writer with low write QPS: vertical until painful.
3. Where is the next bottleneck? Adding 10 more API servers might overwhelm the database. Plan the next layer's scaling strategy in the same answer.
4. What about availability? Horizontal gives N+1 redundancy for free. Vertical leaves you with one giant single point of failure.
5. What about cost? For tiny workloads, vertical. For anything serious, horizontal.
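
The decision parts of this checklist can be sketched as a small function. This is one possible encoding, not a definitive rule: the field names are made up, the cost question is checked first as a shortcut for tiny workloads, and the bottleneck and availability questions are left out because they are considerations rather than branches.

```javascript
// One possible encoding of the checklist; all field names are hypothetical.
function recommendScaling({ stateless, canExternalizeState, profile, tinyWorkload }) {
    if (tinyWorkload) return 'vertical (or no scaling yet)';
    if (stateless) return 'horizontal';
    if (canExternalizeState) return 'externalize state, then horizontal';
    switch (profile) {
        case 'read-heavy':
            return 'replicate (read replicas)';
        case 'write-heavy':
            return 'shard';
        default:
            return 'vertical until painful'; // single writer, low write QPS
    }
}

console.log(recommendScaling({ stateless: true }));
console.log(recommendScaling({ stateless: false, canExternalizeState: true }));
console.log(recommendScaling({ stateless: false, profile: 'read-heavy' }));
console.log(recommendScaling({ stateless: false, profile: 'write-heavy' }));
console.log(recommendScaling({ stateless: false, profile: 'single-writer' }));
```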

Walk this algorithm out loud in interviews and you will sound like you have done it before, because you have.

Real-World Examples

How real systems implement this in production

Stack Overflow on a single SQL Server

For years, Stack Overflow served the entire site from a single primary Microsoft SQL Server instance with 384 GB of RAM and a few dozen CPU cores - vertical scaling taken a long way. They could because their workload was overwhelmingly reads, the working set fit in RAM, and the rare writes were single-row.

Trade-off: Do not over-engineer horizontal scaling for a workload that one beefy box can handle. Stack Overflow eventually moved to read replicas, but only when the vertical ceiling actually arrived.

Netflix stateless microservices

Netflix's API edge (Zuul) is a fleet of thousands of stateless Java processes behind ELB. Sessions, recommendations, user state - all live in downstream services or Cassandra/EVCache. Adding capacity for a launch is literally one Spinnaker click that doubles the cluster size in 5 minutes.

Trade-off: Invest in statelessness once, scale horizontally forever.

Discord per-guild sharding

Discord's chat servers shard by guild (what users call a server): each guild is owned by a single Elixir process, and client sessions talk to whichever processes own their guilds. Horizontal scale comes from spreading those processes across more nodes; each process is stateful but small.

Trade-off: Stateful services can scale horizontally if you can find a good shard key (here, guild ID).

GitHub MySQL read replicas

GitHub runs MySQL with one primary per cluster plus many read replicas. Reads scale horizontally across the replica fleet; writes still hit the single primary. The primary is scaled vertically until it hits the box ceiling, then they shard.

Trade-off: Most workloads are read-heavy, and read replicas are the cheapest horizontal scaling you will ever buy.
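
The read-replica pattern comes down to a routing decision per query. The sketch below is deliberately naive and is not how GitHub does it: connection names are placeholders, only statements starting with SELECT count as reads, and replication lag is ignored.

```javascript
// Naive read/write splitting; connection names are hypothetical placeholders.
function createDbRouter(primary, replicas) {
    let next = 0;
    return {
        route(sql) {
            const isRead = /^\s*select\b/i.test(sql);
            // Writes, and reads when no replica exists, go to the primary.
            if (!isRead || replicas.length === 0) return primary;
            const replica = replicas[next];
            next = (next + 1) % replicas.length; // round-robin across replicas
            return replica;
        },
    };
}

const router = createDbRouter('primary', ['replica-1', 'replica-2']);
const targets = [
    router.route('SELECT * FROM repos WHERE id = 1'),
    router.route('UPDATE repos SET stars = stars + 1 WHERE id = 1'),
    router.route('select name FROM users'),
    router.route('SELECT 1'),
];
console.log(targets);
```

A production router also has to send reads that must see a just-committed write, as well as locking reads such as SELECT ... FOR UPDATE, to the primary, because replicas can lag behind it.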

Quick Interview Phrases

Key terms to use in your answer

scale up vs scale out

stateless services

load balancer

shared-nothing architecture

single point of failure

the bottleneck moves

Common Interview Questions

Questions you might be asked about this topic

Walk me through scaling a web application from 100 req/sec to 100,000 req/sec.

Stage 1 (100 -> 1K): single beefier box, externalize sessions to Redis. Stage 2 (1K -> 10K): horizontal API tier behind ALB, 5-10 nodes. Add CDN for static assets. Add Postgres read replicas. Stage 3 (10K -> 100K): cache layer (Redis), shard the database by tenant or user-id, multi-AZ deployment. Mention that each stage moves the bottleneck: API CPU -> DB connections -> DB CPU -> hot shards. Point out that horizontal scaling assumes statelessness, which is the unlock that makes everything else possible.

Why can you scale a stateless web tier to 1000 nodes but a Postgres primary only to one?

Compare scaling a cache cluster vs a database cluster.

What does it cost (in dollars and complexity) to switch from vertical to horizontal scaling?

Your service hits 100% CPU on its single box. The CTO asks: bigger box or more boxes?

Interview Tips

How to discuss this topic effectively

Always ask 'is this stateless?' before answering any scaling question. Most of the time the answer hinges on it.

Pair every horizontal scaling claim with the next bottleneck. 'I would scale the API horizontally; the next bottleneck will be the database, which I would handle with read replicas.' Showing you can see two layers ahead is what wins senior interviews.

Quote the linear-cost argument when defending horizontal: 'two half-size boxes cost about the same as one full-size box but give me availability for free'.

Never propose horizontal scaling for a Postgres primary or Kafka controller without mentioning sharding. Naive horizontal scaling of single-writer systems is the classic junior-engineer mistake.

When the interviewer asks how you would scale, walk the layers from edge to data: CDN, load balancer, API, cache, database. Each layer has its own scaling answer.

Common Mistakes

Pitfalls to avoid in interviews

Thinking horizontal scaling fixes any throughput problem

Horizontal scaling helps when the bottleneck is per-node capacity. If the bottleneck is a shared resource (a single database row, a single Redis key, a single mutex), adding nodes makes it worse. Always identify where the contention actually is.

Adding a load balancer in front of a stateful service without thinking about state

If the service holds state in process memory (sessions, caches, open connections), a load balancer alone does not scale it - users get inconsistent behavior depending on which node they hit. Push the state out first or use sticky sessions as a stopgap.

Defaulting to horizontal scaling for tiny workloads

Load balancers, service discovery, rolling deploys, and N-node monitoring cost real engineering time. For small services (a few thousand req/sec), a single box - or two for availability - is cheaper than the operational tax.

Assuming the cloud provider scales vertically infinitely

AWS's biggest instance today is around 448 vCPU and 24 TB RAM. That is huge, but it is a ceiling. Any service that grows past it must be horizontal. Plan the migration before you actually need it.

Forgetting that horizontal scaling requires deployment infrastructure

Rolling deploys, health checks, blue-green, canary - these are required for horizontal scaling to deliver its zero-downtime promise. Without them, a deploy still takes down all your nodes at once and you get the worst of both worlds: the complexity of a fleet with the downtime of a single box.
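
A rolling deploy can be sketched as batching plus a health gate. This is a simplified model, not a real deployment tool: `deployFn` and `isHealthy` are hypothetical callbacks, and draining traffic from a node before updating it is left out.

```javascript
// Split the fleet into batches so at most maxUnavailable nodes are down at once.
function rollingBatches(nodes, maxUnavailable) {
    if (maxUnavailable < 1) throw new Error('maxUnavailable must be >= 1');
    const batches = [];
    for (let i = 0; i < nodes.length; i += maxUnavailable) {
        batches.push(nodes.slice(i, i + maxUnavailable));
    }
    return batches;
}

// Deploy batch by batch; stop at the first batch that fails its health check.
function rollingDeploy(nodes, maxUnavailable, deployFn, isHealthy) {
    const updated = [];
    for (const batch of rollingBatches(nodes, maxUnavailable)) {
        for (const node of batch) deployFn(node);
        const failed = batch.filter((node) => !isHealthy(node));
        if (failed.length > 0) return { ok: false, updated, failed };
        updated.push(...batch);
    }
    return { ok: true, updated, failed: [] };
}

const touched = [];
const result = rollingDeploy(
    ['a', 'b', 'c', 'd', 'e'],
    2,
    (node) => touched.push(node), // hypothetical deploy step
    (node) => node !== 'c'        // hypothetical health check: node c comes up broken
);
console.log(result, touched);
```

Because the rollout halts at the unhealthy batch, node e keeps running the old version and the fleet never loses more than two nodes at a time.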
