System Design Article

Load Balancing Algorithms & Patterns

Difficulty: Easy

A load balancer is the traffic cop in front of every horizontally scaled service. This lesson covers the five scheduling algorithms you need to know (round-robin, weighted round-robin, least-connections, hash-based, and power-of-two random choices), the difference between Layer 4 and Layer 7 load balancing, how health checks pull dead nodes out of rotation, the role of sticky sessions and connection draining, and the tools (NGINX, HAProxy, ELB/ALB, Envoy) that implement all of this. By the end you can pick the right algorithm for a workload and explain to an interviewer exactly how a request finds its way from the load balancer to a healthy backend.

What a Load Balancer Does

In front of any horizontally scaled fleet sits a load balancer. Its job is one sentence: distribute incoming requests across the backend pool so no single node is overloaded and dead nodes do not receive traffic.

---------- Load balancer ----------
      clients (millions)
          |
          v
   [ load balancer ]   (one VIP, public DNS)
     /    |    \    \
    v     v     v    v
  [ b1 ] [ b2 ] [ b3 ] [ b4 ]   stateless backend pool

A real load balancer does five things:

Pick a backend for each request (the algorithm).
Run health checks to know which backends are alive.
Terminate or pass through the connection (TLS, HTTP/2).
Buffer or stream the request body.
Drain connections gracefully when a backend deregisters.

Get any of these wrong and you have a production incident.

Layer 4 vs Layer 7

Load balancers operate at one of two OSI layers, and the choice constrains everything else.

Layer 4 (Transport)

The LB sees TCP/UDP packets. It picks a backend based on source/destination IP and port, then forwards packets without parsing the payload. Fast, simple, protocol-agnostic.

Examples: AWS NLB, HAProxy in TCP mode, Linux IPVS, Cloudflare Spectrum.

Use when: raw TCP services (databases, gRPC, WebSockets at high concurrency, custom protocols), maximum throughput, minimum latency, no need to inspect HTTP.

Layer 7 (Application)

The LB parses HTTP requests. It can route based on URL path, headers, cookies, or method. Can do TLS termination, response compression, request rewriting, response caching.

Examples: AWS ALB, NGINX, HAProxy in HTTP mode, Envoy, Traefik, Cloudflare.

Use when: HTTP/HTTPS services, path-based routing (/api to one fleet, /static to another), per-route timeouts, response caching, A/B testing, canary deploys.

---------- Layer 4 vs Layer 7 ----------
  L4: SYN -> [ LB picks backend ] -> SYN to b3 -> connection persists
      LB just shovels packets after that.

  L7: HTTP request -> [ LB parses path /api/v2/users ] -> backend pool API -> picks b3
      LB can change request, retry on idempotent methods, cache response.

Most web traffic uses Layer 7 because routing flexibility almost always beats raw throughput. Layer 4 is reserved for protocols that are not HTTP or for cases where parsing overhead matters (millions of QPS).
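As a rough illustration of path-based routing at Layer 7, the sketch below picks a backend pool by longest matching path prefix. The route table and pool names (`api-1`, `static-1`, `web-1`) are hypothetical, and real proxies use richer matching rules than a plain prefix test.

```python
# Hypothetical route table: path prefix -> backend pool.
ROUTES = {
    "/api": ["api-1", "api-2"],
    "/static": ["static-1"],
    "/": ["web-1", "web-2"],  # catch-all for everything else
}


def pool_for(path):
    """Return the pool whose prefix is the longest match for the path.

    Assumes the path starts with "/", so the catch-all always matches.
    """
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    return ROUTES[max(matches, key=len)]


print(pool_for("/api/v2/users"))  # ['api-1', 'api-2']
print(pool_for("/index.html"))    # ['web-1', 'web-2']
```

A Layer 4 balancer cannot make this decision because it never parses the request line; it only sees addresses and ports.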

The Algorithms

Five algorithms cover the vast majority of load balancers in production.

1. Round-Robin

Request N goes to backend N % poolSize. Equal distribution; no awareness of backend state.

---------- Round-robin ----------
  request 1 -> b1
  request 2 -> b2
  request 3 -> b3
  request 4 -> b1
  request 5 -> b2
  ...

Pros: simplest possible algorithm, predictable distribution. Cons: assumes all backends are equally fast and all requests cost the same. A single slow backend gets the same share as the others, so its queue grows unboundedly.

Use when: backends are identical AND requests are uniform (e.g., a stateless API serving short requests).
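A minimal sketch of round-robin selection, assuming a fixed in-memory pool; the class name `RoundRobin` and the backend labels are illustrative only.

```python
from itertools import count


class RoundRobin:
    """Cycle through backends in order, ignoring their load."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("backend pool must not be empty")
        self.backends = list(backends)
        self._counter = count()  # yields 0, 1, 2, ...

    def pick(self):
        # Request n goes to backend (n mod pool size).
        return self.backends[next(self._counter) % len(self.backends)]


lb = RoundRobin(["b1", "b2", "b3"])
picks = [lb.pick() for _ in range(5)]
print(picks)  # ['b1', 'b2', 'b3', 'b1', 'b2']
```

Nothing in `pick` looks at how busy a backend is, which is exactly why a slow node keeps receiving its full share.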

2. Weighted Round-Robin

Each backend has a weight. Backend with weight 3 gets 3x the requests of weight 1.

Use when: backends have different sizes (mix of m6i.large and m6i.xlarge) or you are doing a canary deploy (5% to new version, 95% to old).
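One way to implement weighting is the "smooth" variant sketched below, which interleaves picks instead of sending bursts to the heavy backend. This is one possible scheme under stated assumptions (integer weights, a static pool), not the only way load balancers do it; the backend names are hypothetical.

```python
class SmoothWeightedRoundRobin:
    """Weighted round-robin that spreads picks evenly over time."""

    def __init__(self, weights):
        # weights: dict mapping backend name -> positive integer weight
        self.weights = dict(weights)
        self.current = {backend: 0 for backend in self.weights}
        self.total = sum(self.weights.values())

    def pick(self):
        # Every backend earns credit equal to its weight...
        for backend, weight in self.weights.items():
            self.current[backend] += weight
        # ...the backend holding the most credit is chosen...
        chosen = max(self.current, key=self.current.get)
        # ...and pays back the total weight so others can catch up.
        self.current[chosen] -= self.total
        return chosen


lb = SmoothWeightedRoundRobin({"big": 3, "small": 1})
picks = [lb.pick() for _ in range(8)]
print(picks.count("big"), picks.count("small"))  # 6 2
```

For a canary deploy the same structure works with weights such as 95 and 5.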

3. Least-Connections

The LB picks the backend with the fewest active connections.

---------- Least-connections ----------
  b1: 50 connections
  b2: 30 connections   <- new request goes here
  b3: 45 connections

Pros: adapts to backend speed automatically. A slow backend accumulates connections, so the LB stops sending it new ones. Cons: requires the LB to track connection state per backend. Works best for long-lived connections (WebSockets) and reasonably well for HTTP.

Use when: request durations vary significantly (some backends are slower, some requests are heavy), or the connection model is long-lived.
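A minimal sketch of the bookkeeping least-connections needs, assuming a single LB process that sees every connection open and close. The `acquire`/`release` names are illustrative.

```python
class LeastConnections:
    """Track active connections and pick the least-loaded backend."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        # Choose the backend with the fewest active connections.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the connection closes so the count stays accurate.
        self.active[backend] -= 1


lb = LeastConnections(["b1", "b2", "b3"])
lb.active.update({"b1": 50, "b2": 30, "b3": 45})  # counts from the diagram
chosen = lb.acquire()
print(chosen)  # b2
```

The per-backend counter is the "connection state" cost mentioned above: every open and close must update it.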

4. IP Hash / Consistent Hash

The LB hashes the client IP (or a key from the request) and uses the hash to pick a backend. The same client always lands on the same backend, until the pool size changes.

---------- IP hash ----------
  hash(client IP 1.2.3.4) % poolSize -> b2 (always)
  hash(client IP 5.6.7.8) % poolSize -> b1 (always)

Pros: enables sticky sessions without cookies; good for in-memory caches per node. Cons: changing pool size remaps almost every client (modulo problem). Use consistent hashing (covered in the distributed-caching lesson) when the pool changes often, so only 1/N of clients are remapped.

Use when: you need session affinity, in-memory cache locality, or routing by tenant ID.
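The remapping difference can be measured with a small experiment. The sketch below compares naive modulo hashing with a simple hash ring when a fifth backend joins a pool of four. The ring uses 100 virtual nodes per backend and MD5 as a stable hash; both are assumptions made for illustration, not requirements of the technique.

```python
import bisect
import hashlib


def stable_hash(key):
    # Python's built-in hash() is randomized per process, so use MD5 here.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def modulo_pick(key, backends):
    return backends[stable_hash(key) % len(backends)]


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, backends, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(pick_before, pick_after, keys):
    return sum(pick_before(k) != pick_after(k) for k in keys) / len(keys)


keys = [f"10.0.{i // 256}.{i % 256}" for i in range(5000)]
old = ["b1", "b2", "b3", "b4"]
new = old + ["b5"]

modulo_moved = moved_fraction(
    lambda k: modulo_pick(k, old), lambda k: modulo_pick(k, new), keys
)
ring_old, ring_new = HashRing(old), HashRing(new)
ring_moved = moved_fraction(ring_old.pick, ring_new.pick, keys)
print(f"modulo: {modulo_moved:.0%} moved, ring: {ring_moved:.0%} moved")
```

With the ring, the only clients that move are the ones now owned by the new backend; with modulo, most clients change backends even though four of the five nodes are unchanged.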

5. Random (with two-choices)

Pick two backends at random and send the request to the less loaded of the two. This is the power of two random choices algorithm and it is shockingly close to optimal for balancing load with minimal coordination.

Use when: the LB itself is distributed (no central state) and global least-connections is impractical. Used inside Envoy and many service-mesh sidecars.
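A small simulation can make the benefit concrete. The sketch below assigns requests to backends using either one random choice or the better of two random choices, then reports the busiest backend's load. It assumes requests never finish (a pure balls-into-bins model) and a fixed seed, so it illustrates the effect rather than modeling a real system.

```python
import random


def busiest_backend_load(num_backends, num_requests, choices, seed=0):
    """Assign requests and return the load on the most loaded backend."""
    rng = random.Random(seed)
    load = [0] * num_backends
    for _ in range(num_requests):
        # Sample `choices` backends (with replacement) and keep the lightest.
        candidates = [rng.randrange(num_backends) for _ in range(choices)]
        target = min(candidates, key=lambda i: load[i])
        load[target] += 1
    return max(load)


one_choice = busiest_backend_load(100, 10_000, choices=1)
two_choices = busiest_backend_load(100, 10_000, choices=2)
print(f"pure random: {one_choice}, two choices: {two_choices} (ideal: 100)")
```

Each decision inspects only two load values, so a sidecar can make it locally without knowing the state of the whole pool.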

Decision matrix

Algorithm	Backend awareness	Best for
Round-robin	None	Identical backends, uniform requests
Weighted RR	Static weight	Mixed backend sizes, canary deploys
Least-connections	Live connection count	Variable request durations, long-lived connections
IP/consistent hash	Hash of client key	Sticky sessions, cache locality, tenant routing
Power of two	Two-sample probe	Distributed LBs (sidecar mesh, no central state)

Health Checks (the Other Half of Load Balancing)

The algorithm picks a backend; the health check decides which backends are eligible. Without health checks, a dead backend keeps receiving traffic and roughly 1/N of requests fail.

Active health checks

The LB periodically sends a probe (HTTP GET, TCP connect, gRPC health-check RPC) to every backend. Two consecutive failures mark the backend unhealthy; two consecutive successes mark it healthy again.

---------- Active health check ----------
  every 5 seconds:
    LB -> GET /health -> b1 -> 200 OK         (healthy)
    LB -> GET /health -> b2 -> timeout         (unhealthy after 2 misses)
    LB -> GET /health -> b3 -> 200 OK         (healthy)

Design the /health endpoint to actually verify the backend can serve traffic: check the database connection, check Redis, check the disk. A health endpoint that just returns 200 lies about real failures.
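The two-failures-down, two-successes-up rule described above can be sketched as a small state machine. The class name `HealthTracker` and the `fall`/`rise` parameter names are illustrative (they mirror common LB terminology); the probe itself is left out.

```python
class HealthTracker:
    """Flip health state only after N consecutive contradicting probes."""

    def __init__(self, fall=2, rise=2):
        self.fall = fall      # consecutive failures before marking unhealthy
        self.rise = rise      # consecutive successes before marking healthy
        self.healthy = True
        self._streak = 0      # consecutive probes that contradict the state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state
        else:
            self._streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


tracker = HealthTracker()
probes = [True, False, True, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
print(states)  # [True, True, True, True, False, False, True]
```

A single missed probe does not remove the backend, which avoids flapping on one dropped packet.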

Passive health checks

The LB observes the response from real client traffic. If a backend returns 5xx errors, times out, or refuses connections too often within a window, mark it unhealthy without sending a probe.

Pros: cheap (no extra probes), catches issues active checks miss (specific endpoints failing). Cons: the failure has to surface in real traffic before the LB notices, so some users see errors first.

Production systems use both: active checks for liveness, passive checks for graceful degradation under partial failures.
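A passive check can be sketched as a sliding window over recent outcomes of real requests. The window size, error-rate threshold, and minimum sample count below are arbitrary illustrative values, not recommendations.

```python
from collections import deque


class PassiveMonitor:
    """Mark a backend unhealthy when recent real traffic fails too often."""

    def __init__(self, window=20, max_error_rate=0.5, min_samples=10):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def observe(self, status_code=None):
        # status_code is None for a timeout or refused connection.
        failed = status_code is None or status_code >= 500
        self.outcomes.append(failed)

    def healthy(self):
        if len(self.outcomes) < self.min_samples:
            return True  # not enough evidence yet
        return sum(self.outcomes) / len(self.outcomes) <= self.max_error_rate


monitor = PassiveMonitor()
for _ in range(10):
    monitor.observe(200)
before = monitor.healthy()
for _ in range(15):
    monitor.observe(503)
after = monitor.healthy()
print(before, after)  # True False
```

No probe traffic is generated here; the cost is that the 503s in the window were served to real users before the backend was ejected.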

Configuration that actually matters

NGINX

---------- NGINX upstream config ----------
upstream api {
    least_conn;
    server b1.api.local:8080 max_fails=2 fail_timeout=30s;
    server b2.api.local:8080 max_fails=2 fail_timeout=30s;
    server b3.api.local:8080 max_fails=2 fail_timeout=30s;
}
server {
    listen 443 ssl;
    location / {
        proxy_pass http://api;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;
        proxy_next_upstream error timeout http_502 http_503;
    }
}

HAProxy

---------- HAProxy backend config ----------
backend api
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    default-server inter 5s fall 2 rise 2
    server b1 b1.api.local:8080 check
    server b2 b2.api.local:8080 check
    server b3 b3.api.local:8080 check

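Note proxy_next_upstream in the NGINX config: on a connection error, a timeout, or a 502/503 from one backend, NGINX retries the request on the next backend transparently (by default it does not retry non-idempotent requests such as POST once they have been sent to a backend). HAProxy offers comparable behavior through options such as retry-on and option redispatch. This is what makes one failing backend invisible to users.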
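The retry behavior can be sketched in a few lines. This is a simplified model, not NGINX's or HAProxy's actual logic: the `forward` function, the `send` callback, and the rule "never replay a non-idempotent request" are assumptions made for illustration.

```python
RETRYABLE_STATUS = {502, 503}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def forward(method, backends, send, max_tries=3):
    """Try backends in order; return (backend, status) of the last attempt.

    status is None when the attempt ended in a connection error or timeout.
    """
    result = None
    for backend in backends[:max_tries]:
        try:
            status = send(backend, method)
        except (ConnectionError, TimeoutError):
            status = None
        result = (backend, status)
        if status is not None and status not in RETRYABLE_STATUS:
            break  # usable response, stop here
        if method not in IDEMPOTENT_METHODS:
            break  # simplification: never replay a non-idempotent request
    return result


def fake_send(backend, method):
    # Hypothetical backends: b1 hangs, b2 is overloaded, b3 is fine.
    if backend == "b1":
        raise TimeoutError("no response")
    return {"b2": 503, "b3": 200}[backend]


get_result = forward("GET", ["b1", "b2", "b3"], fake_send)
post_result = forward("POST", ["b1", "b2", "b3"], fake_send)
print(get_result, post_result)  # ('b3', 200) ('b1', None)
```

The GET succeeds on the third backend without the client noticing; the POST surfaces the failure because replaying it could apply a side effect twice.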

Sticky Sessions (and Why to Avoid Them)

Sometimes you need a client to keep landing on the same backend - for an in-process session, an open WebSocket, or a partial computation. The LB stamps a cookie or routes by IP hash so the same client returns to the same backend.

---------- Sticky session via cookie ----------
  client first request:  LB picks b2, sets cookie 'sticky=b2'
  client next request:   cookie 'sticky=b2' -> LB sends to b2
  client next request:   cookie 'sticky=b2' -> LB sends to b2

Cost: a hot user pins to one backend, defeating load balancing for that user. If b2 is overloaded, the cookie still sends traffic there until b2 dies. On deploy, every sticky session is broken (the new backend has no in-memory state).
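A minimal sketch of cookie-based stickiness, using the `sticky` cookie name from the diagram. The `route` function and its return shape are hypothetical; the fallback to a fresh pick when the pinned backend is no longer healthy is one reasonable policy, not a universal one.

```python
import random


def route(cookies, healthy_backends, rng=random):
    """Return (backend, cookies_to_set) for one request."""
    pinned = cookies.get("sticky")
    if pinned in healthy_backends:
        return pinned, {}  # honor the existing pin
    # First visit, or the pinned backend is gone: pick again and re-pin.
    backend = rng.choice(sorted(healthy_backends))
    return backend, {"sticky": backend}


first = route({}, {"b2"})
repeat = route({"sticky": "b2"}, {"b1", "b2", "b3"})
failover = route({"sticky": "b2"}, {"b1"})
print(first, repeat, failover)
```

The failover case is where the cost shows up: the client lands on a backend that has none of its in-memory state.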

Better answer: make the service stateless and externalize the state. Sessions go to Redis. WebSocket reconnects fetch the connection's state from a shared store. Sticky sessions are a stopgap, not a design.

When stickiness is genuinely necessary: WebSocket servers that hold the live socket (you cannot move an open TCP connection across nodes), or in-memory game state where round-trip to Redis adds unacceptable latency.

Connection Draining (Graceful Removal)

When you remove a backend from the pool (deploy, scale-down, planned maintenance), in-flight requests should complete on the old backend rather than be terminated.

The sequence:

Mark the backend draining in the LB.
Stop sending it new connections.
Wait for existing connections to finish (timeout: typically 30 to 300 seconds).
Terminate the backend.

Without draining, a deploy returns 502 to every in-flight request. With it, users do not notice.

---------- Connection draining ----------
  T+0   deploy starts: mark b1 draining
  T+0   new connections go to b2, b3 only
  T+1   b1's existing connections continue serving
  T+60  b1 has no live connections; LB removes it
  T+60  deploy stops b1, starts new version
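The draining sequence can be sketched as a pool that stops selecting a draining backend and removes it once it is idle or a timeout expires. The class and method names are illustrative, and the 60-second default simply mirrors the timeline above.

```python
class DrainingPool:
    """Least-connections pool with graceful backend removal."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}
        self.draining = set()

    def acquire(self):
        # Draining backends are never eligible for new connections.
        eligible = [b for b in self.active if b not in self.draining]
        backend = min(eligible, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1

    def start_drain(self, backend):
        self.draining.add(backend)

    def try_remove(self, backend, elapsed_s, timeout_s=60):
        """Remove a draining backend once idle or once the timeout expires."""
        idle = self.active[backend] == 0
        if backend in self.draining and (idle or elapsed_s >= timeout_s):
            del self.active[backend]
            self.draining.discard(backend)
            return True
        return False


pool = DrainingPool(["b1", "b2"])
in_flight = pool.acquire()                           # lands on b1
pool.start_drain("b1")
next_backend = pool.acquire()                        # b1 is skipped
removed_early = pool.try_remove("b1", elapsed_s=10)  # still busy
pool.release(in_flight)
removed_late = pool.try_remove("b1", elapsed_s=20)   # idle now
print(next_backend, removed_early, removed_late)     # b2 False True
```

The timeout matters for long-lived connections: without it, one stuck WebSocket would block the deploy indefinitely.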

Anycast and DNS-Based Load Balancing

The load balancers above operate at one location. To distribute traffic across regions, you have two more options.

DNS round-robin / GeoDNS

The DNS server returns different IPs to different clients - based on round-robin, geographic proximity, or health. Each IP points to a regional load balancer.

Pros: no extra infra, every client gets routed to the nearest region. Cons: DNS TTLs mean failovers take minutes, not seconds. Some clients ignore TTL.

Anycast (BGP)

Multiple regions advertise the same IP prefix over BGP, and BGP routes each client to the topologically nearest region. Used by Cloudflare, Google Public DNS (8.8.8.8), and AWS Global Accelerator.

Pros: fast failover (when a region withdraws its route, traffic shifts as soon as BGP converges, without waiting on DNS TTLs), a single global IP, and routing handled by the network itself with no extra lookup step. Cons: requires BGP peering and your own IP space; usually delivered via a managed service.

Real-World Examples

How real systems implement this in production

AWS ALB vs NLB

AWS offers Application Load Balancer (Layer 7, HTTP-aware) and Network Load Balancer (Layer 4, TCP/UDP). ALB is the default for HTTP services because of path-based routing, target groups per microservice, and integration with WAF. NLB is used when raw throughput matters (millions of QPS), for non-HTTP protocols, or when you need to preserve the client source IP at Layer 4.

Trade-off: Pick the layer based on what you need to inspect; do not pay for HTTP parsing if your protocol is not HTTP.

Cloudflare anycast at scale

Cloudflare runs hundreds of POPs worldwide, each advertising the same anycast IPs via BGP. A user in Mumbai connects to the Mumbai POP; a user in Frankfurt to the Frankfurt POP. Within a POP, NGINX-based load balancers distribute requests across thousands of edge servers.

Trade-off: At internet scale, load balancing is multi-tier - BGP/anycast at the edge, regional LB, then per-DC LB.

Envoy in a service mesh

Istio injects Envoy as a sidecar next to every service pod (Linkerd follows the same pattern with its own proxy rather than Envoy). The sidecar handles outbound load balancing (typically least-request or power-of-two), TLS, retries, and metrics. The mesh control plane pushes configuration; the data plane handles every request locally.

Trade-off: As architectures get more granular, load balancing moves from a centralized appliance to a distributed sidecar, and algorithms shift from least-connections (requires central state) to power-of-two (local decision).

Stack Overflow HAProxy

Stack Overflow runs HAProxy as the entry point in front of their IIS web tier. Two HAProxy instances in active-passive failover handle the entire site's traffic. They use least-connections balancing and tight health-check intervals (every 2 seconds) so a sick backend is pulled in under 6 seconds.

Trade-off: At moderate scale, two well-tuned HAProxy boxes outperform a complex managed service for a fraction of the cost.

Quick Interview Phrases

Key terms to use in your answer

Layer 4 vs Layer 7

least-connections

active health checks

connection draining

sticky sessions

anycast routing

Common Interview Questions

Questions you might be asked about this topic

Compare round-robin, least-connections, and consistent hashing as load-balancing algorithms.

Round-robin: equal distribution, no awareness of backend state, fast and stateless. Best for uniform workloads. Least-connections: tracks active connections per backend, sends new requests to the least-loaded; adapts to slow backends. Best for variable request durations. Consistent hashing: hash a request key (client IP, user ID) and route to the corresponding backend; same key always to same backend. Best for sticky sessions and cache locality. Mention that adding/removing a backend remaps minimal keys with consistent hashing (1/N), but every key with naive modulo.

Walk me through what happens at the load-balancer level when one of 10 backends starts returning 500 errors.

Design the load-balancing strategy for a global multi-region API.

When would you use Layer 4 instead of Layer 7?

Explain sticky sessions and when they are necessary.

Interview Tips

How to discuss this topic effectively

Always state the layer, the algorithm, and the health check in one sentence: 'Layer 7 ALB with least-connections and a /health probe every 5 seconds'. Naming all three signals operational depth.

Mention connection draining whenever you mention deploys. Zero-downtime deploys are a load-balancer feature, not a magic property of containers.

Default to least-connections for variable-latency workloads and power-of-two for distributed sidecar meshes. Round-robin is for textbook examples and uniform-cost requests only.

Treat sticky sessions as a smell, not a feature. The right answer is to make the service stateless; sticky is a stopgap with real costs.

When asked about multi-region, lead with anycast/BGP if you know the team uses Cloudflare or AWS Global Accelerator, otherwise GeoDNS. Both are correct; naming the right tool wins points.

Common Mistakes

Pitfalls to avoid in interviews

Defaulting to round-robin for any workload

Round-robin assumes all requests cost the same and all backends are equally fast. For real workloads with variable latency, least-connections distributes load far better and adapts automatically when a backend gets sick. Reserve round-robin for genuinely uniform workloads.

Writing a /health endpoint that just returns 200

A trivial health endpoint passes even when the backend cannot reach the database, cache, or downstream service - so the LB keeps sending traffic to a broken node. Health endpoints should verify actual dependencies (using only a small, bounded share of the connection pool for the checks) so an unhealthy node is detected before users see errors.

Using sticky sessions because it is easier than externalizing state

Sticky sessions defeat load balancing for hot users, break on every deploy, and turn one node failure into a session-loss event for everyone pinned there. Push sessions to Redis or use signed JWTs so any backend can serve any request.

Forgetting to configure connection draining

Without graceful draining, every deploy or scale-down kills in-flight requests, returning 502s to real users. Configure the LB to mark removed backends as draining and wait for in-flight requests to complete (typically 30 to 300 seconds) before terminating.

Treating the load balancer as infinitely scalable

A single LB instance has a connection limit and a CPU/NIC ceiling. AWS ALB scales up automatically; HAProxy/NGINX must be sized by you. At very high QPS, the LB itself needs horizontal scaling (multiple LB instances behind anycast or DNS round-robin).
