System Design Article
Load Balancing Algorithms & Patterns
Difficulty: Easy
A load balancer is the traffic cop in front of every horizontally scaled service. This lesson covers the five scheduling algorithms you need to know (round-robin, weighted round-robin, least-connections, hashing, and power of two choices), the difference between Layer 4 and Layer 7 load balancing, how health checks pull dead nodes out of rotation, the role of sticky sessions and connection draining, and the tools (NGINX, HAProxy, ELB/ALB, Envoy) that implement all of this. By the end you can pick the right algorithm for a workload and explain to an interviewer exactly how a request finds its way from the load balancer to a healthy backend.
What a Load Balancer Does
In front of any horizontally scaled fleet sits a load balancer. Its job is one sentence: distribute incoming requests across the backend pool so no single node is overloaded and dead nodes do not receive traffic.
---------- Load balancer ----------
        clients (millions)
               |
               v
      [ load balancer ]   (one VIP, public DNS)
       /    |    |    \
      v     v    v     v
  [ b1 ] [ b2 ] [ b3 ] [ b4 ]   stateless backend pool

A real load balancer does five things:
- Pick a backend for each request (the algorithm).
- Run health checks to know which backends are alive.
- Terminate or pass through the connection (TLS, HTTP/2).
- Buffer or stream the request body.
- Drain connections gracefully when a backend deregisters.
Get any of these wrong and you have a production incident.
Layer 4 vs Layer 7
Load balancers operate at one of two OSI layers, and the choice constrains everything else.
Layer 4 (Transport)
The LB sees TCP/UDP packets. It picks a backend based on source/destination IP and port, then forwards packets without parsing the payload. Fast, simple, protocol-agnostic.
Examples: AWS NLB, HAProxy in TCP mode, Linux IPVS, Cloudflare Spectrum.
Use when: raw TCP services (databases, gRPC, WebSockets at high concurrency, custom protocols), maximum throughput, minimum latency, no need to inspect HTTP.
Layer 7 (Application)
The LB parses HTTP requests. It can route based on URL path, headers, cookies, or method. Can do TLS termination, response compression, request rewriting, response caching.
Examples: AWS ALB, NGINX, HAProxy in HTTP mode, Envoy, Traefik, Cloudflare.
Use when: HTTP/HTTPS services, path-based routing (/api to one fleet, /static to another), per-route timeouts, response caching, A/B testing, canary deploys.
---------- Layer 4 vs Layer 7 ----------
L4: SYN -> [ LB picks backend ] -> SYN to b3 -> connection persists
    LB just shovels packets after that.
L7: HTTP request -> [ LB parses path /api/v2/users ] -> backend pool API -> picks b3
    LB can change request, retry on idempotent methods, cache response.

Most web traffic uses Layer 7 because routing flexibility almost always beats raw throughput. Layer 4 is reserved for protocols that are not HTTP or for cases where parsing overhead matters (millions of QPS).
The Algorithms
Five algorithms cover the vast majority of load balancers in production.
1. Round-Robin
Request N goes to backend N % poolSize. Equal distribution; no awareness of backend state.
---------- Round-robin ----------
request 1 -> b1
request 2 -> b2
request 3 -> b3
request 4 -> b1
request 5 -> b2
...

Pros: simplest possible algorithm, predictable distribution. Cons: assumes all backends are equally fast and all requests cost the same. A single slow backend gets the same share as the others, so its queue grows unboundedly.
Use when: backends are identical AND requests are uniform (e.g., a stateless API serving short requests).
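As a minimal sketch of the idea (the class and method names are illustrative, not taken from any particular load balancer), round-robin is a counter and a modulo:

```python
class RoundRobin:
    """Cycle through backends in order, ignoring their current load."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.counter = 0  # number of requests routed so far

    def pick(self):
        backend = self.backends[self.counter % len(self.backends)]
        self.counter += 1
        return backend


rr = RoundRobin(["b1", "b2", "b3"])
picks = [rr.pick() for _ in range(5)]
print(picks)  # ['b1', 'b2', 'b3', 'b1', 'b2']
```

Nothing in `pick` looks at backend state, which is exactly why a slow backend keeps receiving its full share.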
2. Weighted Round-Robin
Each backend has a weight. Backend with weight 3 gets 3x the requests of weight 1.
Use when: backends have different sizes (mix of m6i.large and m6i.xlarge) or you are doing a canary deploy (5% to new version, 95% to old).
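One common way to implement this is often called smooth weighted round-robin: it spreads the heavier backend's turns out instead of sending them back to back. The sketch below is one possible implementation under that assumption; the backend names and weights are made up for illustration.

```python
class SmoothWeightedRoundRobin:
    """Each pick: add every weight to its running score, choose the highest
    score, then subtract the total weight from the winner."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}
        self.total = sum(self.weights.values())

    def pick(self):
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen


# Hypothetical pool: one large instance (weight 3) and one small (weight 1).
wrr = SmoothWeightedRoundRobin({"big": 3, "small": 1})
picks = [wrr.pick() for _ in range(8)]
print(picks.count("big"), picks.count("small"))  # 6 2
```

For a canary deploy the same structure works with weights such as 95 and 5.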
3. Least-Connections
The LB picks the backend with the fewest active connections.
---------- Least-connections ----------
b1: 50 connections
b2: 30 connections <- new request goes here
b3: 45 connections

Pros: adapts to backend speed automatically. A slow backend accumulates connections, so the LB stops sending it new ones. Cons: requires the LB to track connection state per backend. Works perfectly for long-lived connections (WebSockets) and reasonably well for HTTP.
Use when: request durations vary significantly (some backends are slower, some requests are heavy), or the connection model is long-lived.
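A rough sketch of the bookkeeping involved, assuming a single LB process that sees every connection open and close (the `acquire`/`release` names are illustrative):

```python
class LeastConnections:
    """Track active connections per backend and pick the least busy one."""

    def __init__(self, backends):
        self.active = {name: 0 for name in backends}

    def acquire(self):
        # Choose the backend with the fewest in-flight connections.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Must be called when the connection closes, or counts drift upward.
        self.active[backend] -= 1


lc = LeastConnections(["b1", "b2", "b3"])
lc.active.update({"b1": 50, "b2": 30, "b3": 45})  # state from the diagram
chosen = lc.acquire()
print(chosen)  # b2
```

The per-backend counter is the state that makes this algorithm harder to run across many independent LB instances.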
4. IP Hash / Consistent Hash
The LB hashes the client IP (or a key from the request) and uses the hash to pick a backend. The same client always lands on the same backend, until the pool size changes.
---------- IP hash ----------
hash(client IP 1.2.3.4) % poolSize -> b2 (always)
hash(client IP 5.6.7.8) % poolSize -> b1 (always)

Pros: enables sticky sessions without cookies; good for in-memory caches per node. Cons: changing pool size remaps almost every client (modulo problem). Use consistent hashing (covered in the distributed-caching lesson) when the pool changes often, so only 1/N of clients are remapped.
Use when: you need session affinity, in-memory cache locality, or routing by tenant ID.
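To make the modulo problem concrete, the sketch below compares naive modulo hashing with a simple hash ring when a fifth backend joins a pool of four. The ring design (100 virtual nodes per backend, SHA-256 as the hash) and the synthetic client IPs are assumptions chosen for illustration.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_pick(key, backends):
    return backends[stable_hash(key) % len(backends)]


class HashRing:
    def __init__(self, backends, vnodes=100):
        # Each backend owns `vnodes` points on the ring to even out its share.
        self._ring = sorted(
            (stable_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(pick_old, pick_new, keys):
    return sum(pick_old(k) != pick_new(k) for k in keys) / len(keys)


keys = [f"10.0.{i // 256}.{i % 256}" for i in range(5000)]  # synthetic client IPs
old, new = ["b1", "b2", "b3", "b4"], ["b1", "b2", "b3", "b4", "b5"]
ring_old, ring_new = HashRing(old), HashRing(new)

moved_modulo = moved_fraction(lambda k: modulo_pick(k, old),
                              lambda k: modulo_pick(k, new), keys)
moved_ring = moved_fraction(ring_old.pick, ring_new.pick, keys)
print(f"modulo: {moved_modulo:.0%} remapped, ring: {moved_ring:.0%} remapped")
```

Going from 4 to 5 backends, modulo hashing should remap roughly 80% of keys, while the ring should remap roughly a fifth (the exact figure depends on where the virtual nodes land), and every remapped key moves to the new backend.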
5. Random (with two-choices)
Pick two backends at random and send the request to the less loaded of the two. This is the power of two random choices algorithm and it is shockingly close to optimal for balancing load with minimal coordination.
Use when: the LB itself is distributed (no central state) and global least-connections is impractical. Used inside Envoy and many service-mesh sidecars.
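A small simulation can illustrate why two choices help. This is a sketch under simplifying assumptions: requests never complete (load only grows), there are 10 backends, and the seed and request count are arbitrary.

```python
import random


def pick_two_choices(load, rng):
    """Sample two distinct backends and return the less loaded one."""
    a, b = rng.sample(list(load), 2)
    return a if load[a] <= load[b] else b


def pick_random(load, rng):
    """Baseline: a single uniformly random choice."""
    return rng.choice(list(load))


def simulate(picker, requests=10_000, backends=10, seed=42):
    """Return the gap between the most and least loaded backend."""
    rng = random.Random(seed)
    load = {f"b{i}": 0 for i in range(1, backends + 1)}
    for _ in range(requests):
        load[picker(load, rng)] += 1
    return max(load.values()) - min(load.values())


spread_two = simulate(pick_two_choices)
spread_random = simulate(pick_random)
print(f"two choices spread: {spread_two}, pure random spread: {spread_random}")
```

The picker only needs the load of the two sampled backends, which is why it suits sidecars that each hold a local, approximate view of the pool.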
Decision matrix
| Algorithm | Backend awareness | Best for |
|---|---|---|
| Round-robin | None | Identical backends, uniform requests |
| Weighted RR | Static weight | Mixed backend sizes, canary deploys |
| Least-connections | Live connection count | Variable request durations, long-lived connections |
| IP/consistent hash | Hash of client key | Sticky sessions, cache locality, tenant routing |
| Power of two | Two-sample probe | Distributed LBs (sidecar mesh, no central state) |
Health Checks (the Other Half of Load Balancing)
The algorithm picks a backend; the health check decides which backends are eligible. Without health checks, a dead backend keeps receiving traffic and 1/N of users see errors.
Active health checks
The LB periodically sends a probe (HTTP GET, TCP connect, gRPC health-check RPC) to every backend. Two consecutive failures mark the backend unhealthy; two consecutive successes mark it healthy again.
---------- Active health check ----------
every 5 seconds:
  LB -> GET /health -> b1 -> 200 OK (healthy)
  LB -> GET /health -> b2 -> timeout (unhealthy after 2 misses)
  LB -> GET /health -> b3 -> 200 OK (healthy)

Design the /health endpoint to actually verify the backend can serve traffic: check the database connection, check Redis, check the disk. A health endpoint that just returns 200 lies about real failures.
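The two-failures / two-successes rule is a small state machine per backend. A minimal sketch, assuming thresholds of 2 and 2 (the `fall` and `rise` parameter names are borrowed from common LB configuration vocabulary; the class itself is illustrative):

```python
class HealthState:
    """Flip a backend's state only after `fall` consecutive failed probes
    or `rise` consecutive successful probes."""

    def __init__(self, fall=2, rise=2):
        self.fall = fall
        self.rise = rise
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state, reset streak
        else:
            self._streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


state = HealthState()
probes = [True, False, True, False, False, True, True]
history = [state.record(ok) for ok in probes]
print(history)  # [True, True, True, True, False, False, True]
```

A single missed probe does not eject the backend, which avoids flapping on one dropped packet.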
Passive health checks
The LB observes the response from real client traffic. If a backend returns 5xx errors, times out, or refuses connections too often within a window, mark it unhealthy without sending a probe.
Pros: cheap (no extra probes), catches issues active checks miss (specific endpoints failing). Cons: the failure has to surface in real traffic before the LB notices, so some users see errors first.
Production systems use both: active checks for liveness, passive checks for graceful degradation under partial failures.
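A passive check can be sketched as a sliding window over recent errors. The threshold here (5 errors within 30 seconds) is an illustrative assumption, and timestamps are passed in explicitly so the logic stays testable:

```python
from collections import deque


class PassiveMonitor:
    """Eject a backend once it produces too many 5xx responses in a window."""

    def __init__(self, max_errors=5, window_seconds=30.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._errors = deque()  # timestamps of recent 5xx responses

    def record(self, status_code, now):
        """Record one real response; return True while the backend stays eligible."""
        if status_code >= 500:
            self._errors.append(now)
        # Forget errors that have aged out of the window.
        while self._errors and now - self._errors[0] > self.window_seconds:
            self._errors.popleft()
        return len(self._errors) < self.max_errors


monitor = PassiveMonitor()
results = [monitor.record(500, now=float(t)) for t in range(5)]
print(results)                        # [True, True, True, True, False]
print(monitor.record(200, now=40.0))  # True
```

A real LB would also count timeouts and refused connections as errors; this sketch only looks at status codes.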
Configuration that actually matters
NGINX
---------- NGINX upstream config ----------
upstream api {
    least_conn;
    server b1.api.local:8080 max_fails=2 fail_timeout=30s;
    server b2.api.local:8080 max_fails=2 fail_timeout=30s;
    server b3.api.local:8080 max_fails=2 fail_timeout=30s;
}
server {
    listen 443 ssl;
    # ssl_certificate and ssl_certificate_key directives omitted for brevity
    location / {
        proxy_pass http://api;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;
        proxy_next_upstream error timeout http_502 http_503;
    }
}

HAProxy
---------- HAProxy backend config ----------
backend api
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    default-server inter 5s fall 2 rise 2
    server b1 b1.api.local:8080 check
    server b2 b2.api.local:8080 check
    server b3 b3.api.local:8080 check

Note proxy_next_upstream (NGINX) and its HAProxy equivalent (retry-on together with option redispatch): on a connection error, timeout, 502, or 503 from one backend, the LB retries the request on another backend transparently. By default NGINX does not retry non-idempotent requests (POST, LOCK, PATCH) once they have been sent upstream. This is what makes one failing backend largely invisible to users.
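The retry behavior can be sketched in a few lines. This is an illustration of the idea rather than how NGINX or HAProxy implement it; `send` is a hypothetical callable that returns a status code or raises on a connection failure, and the retryable statuses mirror the 502/503 choice in the config above.

```python
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}
RETRYABLE_STATUS = {502, 503}


def forward_with_retry(method, backends, send, max_tries=2):
    """Try backends in order; move to the next one on a retryable failure.

    Non-idempotent methods get a single attempt so a POST is never replayed.
    Returns (backend, status), or (None, last_status) if every attempt failed.
    """
    tries = max_tries if method in IDEMPOTENT_METHODS else 1
    last_status = None
    for backend in backends[:tries]:
        try:
            status = send(backend)
        except (TimeoutError, ConnectionError):
            last_status = 504  # report as a gateway timeout
            continue
        if status not in RETRYABLE_STATUS:
            return backend, status
        last_status = status
    return None, last_status


# Hypothetical backends: b1 is failing, b2 is fine.
responses = {"b1": 503, "b2": 200}
get_result = forward_with_retry("GET", ["b1", "b2"], responses.__getitem__)
post_result = forward_with_retry("POST", ["b1", "b2"], responses.__getitem__)
print(get_result)   # ('b2', 200)
print(post_result)  # (None, 503)
```

Capping `max_tries` matters: unbounded retries during a wider outage multiply load on the backends that are still up.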
Sticky Sessions (and Why to Avoid Them)
Sometimes you need a client to keep landing on the same backend - for an in-process session, an open WebSocket, or a partial computation. The LB stamps a cookie or routes by IP hash so the same client returns to the same backend.
---------- Sticky session via cookie ----------
client first request: LB picks b2, sets cookie 'sticky=b2'
client next request: cookie 'sticky=b2' -> LB sends to b2
client next request: cookie 'sticky=b2' -> LB sends to b2

Cost: a hot user pins to one backend, defeating load balancing for that user. If b2 is overloaded, the cookie still sends traffic there until b2 dies. On deploy, every sticky session is broken (the new backend has no in-memory state).
Better answer: make the service stateless and externalize the state. Sessions go to Redis. WebSocket reconnects fetch the connection's state from a shared store. Sticky sessions are a stopgap, not a design.
When stickiness is genuinely necessary: WebSocket servers that hold the live socket (you cannot move an open TCP connection across nodes), or in-memory game state where round-trip to Redis adds unacceptable latency.
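For the cases where stickiness is unavoidable, cookie-based routing reduces to a lookup with a fallback. A minimal sketch, assuming a cookie named `sticky` as in the diagram above and a hypothetical `pick_new` callable that runs the normal balancing algorithm:

```python
def route_sticky(cookies, healthy, pick_new):
    """Return (backend, cookies_to_set).

    Honor the pinned backend only while it is healthy; otherwise pick a new
    one and re-pin the client to it.
    """
    pinned = cookies.get("sticky")
    if pinned in healthy:
        return pinned, {}
    backend = pick_new()
    return backend, {"sticky": backend}


healthy = {"b1", "b3"}  # b2 has been pulled from rotation
first = route_sticky({}, healthy, pick_new=lambda: "b1")
repeat = route_sticky({"sticky": "b1"}, healthy, pick_new=lambda: "b3")
failover = route_sticky({"sticky": "b2"}, healthy, pick_new=lambda: "b3")
print(first, repeat, failover)
```

The failover branch is where in-memory state is lost: the client lands on a backend that has never seen its session.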
Connection Draining (Graceful Removal)
When you remove a backend from the pool (deploy, scale-down, instance failure), in-flight requests should complete on the old backend rather than be terminated.
The sequence:
- Mark the backend as draining in the LB.
- Stop sending it new connections.
- Wait for existing connections to finish (timeout: typically 30 to 300 seconds).
- Terminate the backend.
Without draining, a deploy returns 502 to every in-flight request. With it, users do not notice.
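The drain sequence can be sketched as a pool that tracks a draining set alongside connection counts. The class, the method names, and the 60-second default timeout are illustrative assumptions:

```python
class DrainingPool:
    """Least-connections pool that supports graceful backend removal."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}
        self.draining = set()

    def start_drain(self, backend):
        # Steps 1 and 2: mark as draining, which excludes it from new picks.
        self.draining.add(backend)

    def acquire(self):
        eligible = [b for b in self.active if b not in self.draining]
        backend = min(eligible, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1

    def can_terminate(self, backend, elapsed, timeout=60.0):
        # Steps 3 and 4: safe once idle, or forced once the timeout expires.
        idle = self.active[backend] == 0
        return backend in self.draining and (idle or elapsed >= timeout)


pool = DrainingPool(["b1", "b2", "b3"])
in_flight = pool.acquire()       # lands on b1
pool.start_drain("b1")
new_picks = [pool.acquire(), pool.acquire()]
still_busy = pool.can_terminate("b1", elapsed=10.0)
pool.release(in_flight)
now_idle = pool.can_terminate("b1", elapsed=12.0)
print(new_picks, still_busy, now_idle)  # ['b2', 'b3'] False True
```

The timeout is the escape hatch for connections that never finish, such as long-lived WebSockets.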
---------- Connection draining ----------
T+0   deploy starts: mark b1 draining
T+0   new connections go to b2, b3 only
T+1   b1's existing connections continue serving
T+60  b1 has no live connections; LB removes it
T+60  deploy stops b1, starts new version

Anycast and DNS-Based Load Balancing
The load balancers above operate at one location. To distribute traffic across regions, you have two more options.
DNS round-robin / GeoDNS
The DNS server returns different IPs to different clients - based on round-robin, geographic proximity, or health. Each IP points to a regional load balancer.
Pros: no extra infra, every client gets routed to the nearest region. Cons: DNS TTLs mean failovers take minutes, not seconds. Some clients ignore TTL.
Anycast (BGP)
Multiple regions advertise the same IP prefix over BGP. BGP routes each client to the topologically nearest region. Used by Cloudflare, Google Public DNS (8.8.8.8), AWS Global Accelerator.
Pros: fast failover (a failed region withdraws its route and traffic shifts as BGP converges, typically within seconds), single global IP, no DNS TTL to wait out. Cons: requires BGP peering and your own IP space; usually delivered via a managed service.
Real-World Examples
How real systems implement this in production
AWS offers Application Load Balancer (Layer 7, HTTP-aware) and Network Load Balancer (Layer 4, TCP/UDP). ALB is the default for HTTP services because of path-based routing, target groups per microservice, and integration with WAF. NLB is used when raw throughput matters (millions of QPS), for non-HTTP protocols, or when you need to preserve the client source IP at Layer 4.
Trade-off: Pick the layer based on what you need to inspect; do not pay for HTTP parsing if your protocol is not HTTP.
Cloudflare runs hundreds of POPs worldwide, each advertising the same anycast IPs via BGP. A user in Mumbai connects to the Mumbai POP; a user in Frankfurt to the Frankfurt POP. Within a POP, NGINX-based load balancers distribute requests across thousands of edge servers.
Trade-off: At internet scale, load balancing is multi-tier - BGP/anycast at the edge, regional LB, then per-DC LB.
Istio and Linkerd inject Envoy as a sidecar next to every service pod. The sidecar handles outbound load balancing (typically least-request or power-of-two), TLS, retries, and metrics. The mesh control plane pushes configuration; the data plane handles every request locally.
Trade-off: As architectures get more granular, load balancing moves from a centralized appliance to a distributed sidecar, and algorithms shift from least-connections (requires central state) to power-of-two (local decision).
Stack Overflow runs HAProxy as the entry point in front of their IIS web tier. Two HAProxy instances in active-passive failover handle the entire site's traffic. They use least-connections balancing and tight health-check intervals (every 2 seconds) so a sick backend is pulled in under 6 seconds.
Trade-off: At moderate scale, two well-tuned HAProxy boxes outperform a complex managed service for a fraction of the cost.
Common Interview Questions
Questions you might be asked about this topic
Round-robin: equal distribution, no awareness of backend state, fast and stateless. Best for uniform workloads. Least-connections: tracks active connections per backend, sends new requests to the least-loaded; adapts to slow backends. Best for variable request durations. Consistent hashing: hash a request key (client IP, user ID) and route to the corresponding backend; same key always to same backend. Best for sticky sessions and cache locality. Mention that adding/removing a backend remaps minimal keys with consistent hashing (1/N), but every key with naive modulo.
Passive health check: the LB observes the 500s and (after a threshold, e.g., 5 errors in 30s) pulls the backend from the rotation. Active health check: the next /health probe also fails, confirming. The backend is marked unhealthy and stops receiving new connections. Existing connections drain or are reset depending on config. The LB reports the change to monitoring. After a recovery period of two consecutive successful health checks, the backend rejoins. Mention `proxy_next_upstream` for transparently retrying the failed request on another backend.
Three tiers. Tier 1: anycast or GeoDNS routes the client to the nearest region. Tier 2: a regional Layer 7 LB (ALB, NGINX) routes based on path and runs health checks per backend. Tier 3 (optional): a service-mesh sidecar (Envoy) inside the cluster with power-of-two-choices for service-to-service calls. TLS termination at the regional LB. Connection draining for deploys. Across regions, use active-active so failure of one region routes traffic elsewhere via BGP/DNS in seconds. Mention monitoring per layer.
When the protocol is not HTTP (raw TCP, gRPC at extreme scale, custom protocols), when throughput matters more than routing flexibility (millions of QPS that L7 parsing cannot keep up with), or when you need to preserve the client's source IP at the network level. AWS NLB, Cloudflare Spectrum, HAProxy in TCP mode are common L4 choices. For typical HTTP services, L7 is the default because path routing, TLS termination, and per-route timeouts almost always outweigh the small parsing overhead.
Sticky sessions pin a client to the same backend across requests, typically via a cookie or IP hash. Necessary only when the backend genuinely cannot serve the client's request without per-client state in process: live WebSocket connections (you cannot move an open socket across nodes) or in-memory game state with strict latency. For sessions, caches, or shopping carts, the right answer is to push the state to Redis or a JWT and make the backend stateless. Sticky sessions defeat load balancing for hot users, break on deploys, and turn one node failure into a session-loss event for everyone pinned there.
Interview Tips
How to discuss this topic effectively
Always state the layer, the algorithm, and the health check in one sentence: 'Layer 7 ALB with least-connections and a /health probe every 5 seconds'. Naming all three signals operational depth.
Mention connection draining whenever you mention deploys. Zero-downtime deploys are a load-balancer feature, not a magic property of containers.
Default to least-connections for variable-latency workloads and power-of-two for distributed sidecar meshes. Round-robin is for textbook examples and uniform-cost requests only.
Treat sticky sessions as a smell, not a feature. The right answer is to make the service stateless; sticky is a stopgap with real costs.
When asked about multi-region, lead with anycast/BGP if you know the team uses Cloudflare or AWS Global Accelerator, otherwise GeoDNS. Both are correct; naming the right tool wins points.
Common Mistakes
Pitfalls to avoid in interviews
Defaulting to round-robin for any workload
Round-robin assumes all requests cost the same and all backends are equally fast. For real workloads with variable latency, least-connections distributes load far better and adapts automatically when a backend gets sick. Reserve round-robin for genuinely uniform workloads.
Writing a /health endpoint that just returns 200
A trivial health endpoint passes even when the backend cannot reach the database, cache, or downstream service - so the LB keeps sending traffic to a broken node. Health endpoints should verify actual dependencies (using a small, bounded share of the connection pool so the checks themselves stay cheap) so that an unhealthy node is detected before users see errors.
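A dependency-aware health handler can be sketched as a loop over named checks. The check functions here are hypothetical stand-ins; in a real service each would run a cheap, time-bounded operation such as a `SELECT 1` or a Redis `PING`.

```python
def check_health(checks):
    """Run every dependency check; return (http_status, failures).

    `checks` maps a dependency name to a zero-argument callable that raises
    on failure. Any failure turns the response into a 503.
    """
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:  # any failure marks the node unhealthy
            failures[name] = repr(exc)
    status = 200 if not failures else 503
    return status, failures


def db_down():
    # Hypothetical failing check standing in for a real database ping.
    raise ConnectionError("db unreachable")


status_ok, _ = check_health({"db": lambda: None, "redis": lambda: None})
status_bad, failures = check_health({"db": db_down, "redis": lambda: None})
print(status_ok, status_bad, list(failures))  # 200 503 ['db']
```

One design caution: if every backend shares the same database, a database outage fails every health check at once, so some teams only include dependencies whose failure is specific to the node.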
Using sticky sessions because it is easier than externalizing state
Sticky sessions defeat load balancing for hot users, break on every deploy, and turn one node failure into a session-loss event for everyone pinned there. Push sessions to Redis or use signed JWTs so any backend can serve any request.
Forgetting to configure connection draining
Without graceful draining, every deploy or scale-down kills in-flight requests, returning 502s to real users. Configure the LB to mark removed backends as draining and wait for in-flight requests to complete (typically 30 to 300 seconds) before terminating.
Treating the load balancer as infinitely scalable
A single LB instance has a connection limit and a CPU/NIC ceiling. AWS ALB scales up automatically; HAProxy/NGINX must be sized by you. At very high QPS, the LB itself needs horizontal scaling (multiple LB instances behind anycast or DNS round-robin).
