System Design Article

Auto-Scaling, Elasticity & Capacity Planning

Difficulty: Medium

Auto-scaling lets your fleet grow when traffic surges and shrink when it ebbs, so you pay for the load you actually have. This lesson covers reactive metric-based scaling, predictive (schedule-based) scaling, and the gotchas that turn auto-scaling into auto-outage: warm-up time, scale-down storms, downstream throttling, and cost runaway. We also walk through capacity planning: how to estimate the fleet size you need from QPS, latency targets, and headroom, before relying on the scaler to fix mistakes at 3 a.m. By the end you can configure an auto-scaling policy with confidence and explain to an interviewer why simply 'putting it on auto-scale' is not the actual answer.

Elasticity, in One Sentence

Elasticity is the property that lets a system change its capacity in response to load - up when traffic grows, down when it shrinks - without human intervention.

The non-elastic alternative is provisioning for peak: size the fleet for the busiest moment of the day, week, or year, and pay for that capacity all the time. For a service with a 20x peak-to-trough ratio, the trough needs only 1/20 of the fleet, so 95% of the capacity you are paying for sits idle.

---------- Static vs elastic capacity ----------
  static (sized for peak):             elastic (tracks load):

   capacity ________________            capacity      ____
                                                  ___/    \___
   load          /\                     load         /\
             ___/  \________                     ___/  \______
             time                                time

Elasticity is not free. It introduces a feedback loop (scaler watches metrics, decides, acts) that itself can fail in surprising ways.

Reactive Auto-Scaling

The scaler watches metrics and adds or removes capacity when they cross a threshold. This is the standard pattern.

The control loop

---------- Reactive scaling loop ----------
  every N seconds:                                       
    1. read metric (e.g. avg CPU across fleet)            
    2. compare to target (e.g. 60%)                       
    3. if above target for M consecutive periods:         
         desired = current * (avg / target)               
         add (desired - current) instances                 
    4. if below target:                                   
         remove instances (with cooldown to avoid thrash)

Three numbers determine how this behaves:

Target metric value (e.g., 60% CPU). The scaler tries to keep the average at this number.
Cooldown / stabilization window (e.g., 5 minutes). After a scaling action, ignore further triggers until cooldown elapses to prevent thrash.
Step size or maximum delta (e.g., 2 instances per scale-out, 1 per scale-in). Caps how aggressively the fleet changes.
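
To see how the three numbers interact, here is a minimal sketch of one tick of the loop. The function name `nextReplicaCount`, the `state` and `cfg` shapes, and the breach counter are illustrative assumptions, not the API of any particular scaler.

```javascript
// Hypothetical helper: one tick of a reactive scaler with a breach counter,
// a cooldown, and separate step caps for scale-out and scale-in.
function nextReplicaCount(state, metric, nowMs, cfg) {
    const { target, breachPeriods, cooldownMs, maxStepOut, maxStepIn, min, max } = cfg;
    // Count consecutive periods above target; any period at or below resets it.
    const breaches = metric > target ? state.breaches + 1 : 0;
    const inCooldown = nowMs - state.lastActionAt < cooldownMs;
    let replicas = state.replicas;

    if (!inCooldown) {
        const desired = Math.ceil(state.replicas * (metric / target));
        if (breaches >= breachPeriods && desired > state.replicas) {
            // Scale out, but never by more than maxStepOut in one action.
            replicas = Math.min(desired, state.replicas + maxStepOut);
        } else if (desired < state.replicas) {
            // Scale in, but never by more than maxStepIn in one action.
            replicas = Math.max(desired, state.replicas - maxStepIn);
        }
        replicas = Math.min(max, Math.max(min, replicas));
    }
    const acted = replicas !== state.replicas;
    return {
        replicas,
        breaches: acted ? 0 : breaches,
        lastActionAt: acted ? nowMs : state.lastActionAt,
    };
}
```

With a target of 60, two breach periods, and a scale-out step of 2, a fleet of 10 that sees 90% CPU on two consecutive ticks grows to 12 rather than jumping straight to 15. The step cap trades reaction speed for stability.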

Common metrics

Metric	Best for	Pitfall
CPU utilization	Stateless API servers, CPU-bound workloads	Misleading for I/O-bound services that wait on the database
Memory	JVM-heavy services, in-memory caches	Memory rarely returns to OS; hard to scale down on
Request count per target	HTTP services with predictable per-request cost	Misleading when request cost varies widely: cheap and expensive endpoints count the same
Queue depth	Async workers consuming from SQS/Kafka	Depth alone ignores how fast workers drain it; backlog per worker is usually the better signal
Custom (active connections, in-flight requests)	WebSocket servers, long-running jobs	Requires custom CloudWatch/Prometheus integration

Rule of thumb: scale on the metric that actually correlates with capacity exhaustion for your workload. For a CPU-bound API, CPU. For a queue worker, queue depth. For a WebSocket server, active connections.
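
For the queue-worker case, one way to turn queue depth into a worker count is to size the fleet so the current backlog drains within a target time. This is a sketch under stated assumptions: the function name, the parameter names, and the idea of a fixed measured per-worker throughput are all hypothetical, and the rule ignores the arrival rate of new messages.

```javascript
// Hypothetical sizing rule for queue workers: size the fleet so the current
// backlog drains within a target time, given a measured per-worker throughput.
function workersForBacklog({ queueDepth, msgsPerWorkerPerSec, targetDrainSec, min, max }) {
    // Messages one worker can clear within the drain target.
    const drainCapacityPerWorker = msgsPerWorkerPerSec * targetDrainSec;
    const needed = Math.ceil(queueDepth / drainCapacityPerWorker);
    // Always respect the configured floor and ceiling.
    return Math.min(max, Math.max(min, needed));
}

console.log(workersForBacklog({
    queueDepth: 12000, msgsPerWorkerPerSec: 5, targetDrainSec: 120, min: 1, max: 50,
})); // 20
```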

Pseudocode: simple HPA-style scaler

JavaScript

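// Promise-based sleep helper.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// getMetric, currentCount, and setReplicas are callbacks supplied by the caller.
async function scaleLoop({ getMetric, currentCount, setReplicas, targetValue, minCount, maxCount, cooldownMs }) {
    let lastActionAt = 0;
    while (true) {
        const value = await getMetric();
        const current = currentCount(); // read once so the comparison below is consistent
        // Proportional rule: desired = ceil(current * metric / target)
        const desired = Math.ceil(current * (value / targetValue));
        const clamped = Math.min(maxCount, Math.max(minCount, desired));
        const now = Date.now();
        // Act only if the count would change and the cooldown has elapsed.
        if (clamped !== current && now - lastActionAt > cooldownMs) {
            await setReplicas(clamped);
            lastActionAt = now;
        }
        await sleep(15_000); // poll every 15s
    }
}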

The formula desired = current * (metric / target) is the same one used by Kubernetes HPA. If CPU is at 90% with target 60%, desired = current * 1.5 - a 50% scale-out.
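
A small sketch of the formula as a pure function, with a tolerance band so tiny deviations from the target do not trigger a resize. The 0.1 default mirrors the Kubernetes HPA default tolerance; the function name `desiredReplicas` is an assumption for illustration.

```javascript
// Proportional rule with a tolerance band. The 0.1 default mirrors the
// Kubernetes HPA default tolerance; treat it as a tunable assumption.
function desiredReplicas(current, metric, target, tolerance = 0.1) {
    const ratio = metric / target;
    if (Math.abs(ratio - 1) <= tolerance) return current; // close enough: no change
    return Math.ceil(current * ratio);
}

console.log(desiredReplicas(10, 90, 60)); // 15: the 50% scale-out from the text
console.log(desiredReplicas(10, 63, 60)); // 10: within tolerance, no action
```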

Predictive (Scheduled) Scaling

Reactive scaling responds only after the metric has already moved, so it typically lags real load by 1 to 5 minutes. For predictable patterns - the daily 9am traffic ramp, the Friday-night gaming spike, the holiday flash sale - you can pre-scale on a schedule.

---------- Scheduled scaling ----------
  weekday  06:30   scale_to(min=10)
  weekday  08:30   scale_to(min=40)    pre-warm before the 9am ramp
  weekday  18:00   scale_to(min=20)
  weekday  22:00   scale_to(min=10)
  weekend  00:00   scale_to(min=8)

Use when: load patterns are predictable. AWS Auto Scaling, Kubernetes CronJobs that patch HPA min/max, and Amazon Redshift's elastic resize all support scheduled actions.

Combine with reactive: schedule sets the floor (min=40 at 9am), reactive handles surprises on top.
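
A minimal sketch of "schedule sets the floor, reactive rides on top", using the weekday schedule above. The data layout and the function names `scheduledFloor` and `effectiveReplicas` are hypothetical. The schedule must be sorted by start time, and the weekend entry is left out for brevity.

```javascript
// Hypothetical schedule: each entry sets the minimum replica count from
// `start` (minutes after midnight) until the next entry begins.
const weekdaySchedule = [
    { start: 6 * 60 + 30, min: 10 },
    { start: 8 * 60 + 30, min: 40 }, // pre-warm before the 9am ramp
    { start: 18 * 60, min: 20 },
    { start: 22 * 60, min: 10 },
];

function scheduledFloor(schedule, minuteOfDay) {
    // Before the first entry of the day, the last entry of the previous
    // evening is assumed to still apply.
    let floor = schedule[schedule.length - 1].min;
    for (const entry of schedule) {
        if (minuteOfDay >= entry.start) floor = entry.min;
    }
    return floor;
}

// The reactive scaler's answer is honored only when it exceeds the floor.
function effectiveReplicas(schedule, minuteOfDay, reactiveDesired, max) {
    return Math.min(max, Math.max(scheduledFloor(schedule, minuteOfDay), reactiveDesired));
}

console.log(effectiveReplicas(weekdaySchedule, 9 * 60, 25, 100)); // 40: floor wins
console.log(effectiveReplicas(weekdaySchedule, 9 * 60, 55, 100)); // 55: reactive wins
```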

ML-Driven Predictive Scaling

AWS Auto Scaling's predictive scaling uses a machine-learning forecast built from historical load to project demand up to 48 hours ahead and pre-scale accordingly. Useful when patterns are complex (weekday/weekend interaction, monthly cycles, sports schedule).

Caveat: ML predictions can over- or under-scale on novel days (Black Friday, the day a tweet goes viral, an outage). Always set sensible min/max bounds so a bad prediction does not bankrupt or break you.
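
To illustrate why the bounds matter, here is a deliberately naive forecaster: it predicts the next hour's load as the mean of the same hour on previous days, then clamps the result. This is a seasonal-naive stand-in, not the model any cloud provider uses. The function and parameter names are assumptions.

```javascript
// Seasonal-naive forecast: predict the next hour's peak QPS as the average of
// the same hour in past periods, then convert to a bounded replica count.
function forecastReplicas({ sameHourHistoryQps, qpsPerReplica, min, max }) {
    if (sameHourHistoryQps.length === 0) return min; // no history: fall back to the floor
    const total = sameHourHistoryQps.reduce((a, b) => a + b, 0);
    const mean = total / sameHourHistoryQps.length;
    const wanted = Math.ceil(mean / qpsPerReplica);
    // The clamp is what keeps a bad prediction from breaking or bankrupting you.
    return Math.min(max, Math.max(min, wanted));
}

console.log(forecastReplicas({
    sameHourHistoryQps: [4000, 4400, 4200], qpsPerReplica: 200, min: 5, max: 50,
})); // 21
```

An outlier in the history (say, one viral day at 1,000,000 QPS) would push the raw forecast into the thousands; the `max` bound holds it at 50.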

The Auto-Scaling Pitfalls

This is what interviewers love to ask about. Anyone can configure HPA; only operators who have been on-call for it know these.

1. Warm-up time

New instances are not productive immediately. They must boot, install packages, fetch config, connect to the database, fill caches. Time-to-ready is typically 30 seconds to 5 minutes; sometimes longer for JVM JIT warm-up.

Mitigations:

Warm pools (AWS Warm Pools, GCP managed instance group warm-up): keep N instances pre-booted but stopped, ready to start in seconds.
Pre-baked AMIs / container images with all dependencies installed.
Pre-warming traffic: send the new instance some test traffic (or a fraction of real traffic) before adding it to the LB pool.
Health-check tuning: do not add an instance to the rotation until two consecutive health checks pass.
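
The health-check rule in the last item can be sketched as a small gate that tracks a streak of passing checks. `createReadinessGate` is a hypothetical helper, not a load balancer API. Real load balancers expose the same idea as a healthy-threshold setting.

```javascript
// Hypothetical readiness gate: an instance joins the load balancer only after
// `required` consecutive passing health checks; any failure resets the streak.
function createReadinessGate(required = 2) {
    let streak = 0;
    return {
        // Record one health-check result; returns true once the instance is ready.
        record(passed) {
            streak = passed ? streak + 1 : 0;
            return streak >= required;
        },
    };
}

const gate = createReadinessGate(2);
console.log(gate.record(true));  // false: one pass is not enough
console.log(gate.record(true));  // true: two in a row
```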

2. Scale-down storms (thrashing)

A spike triggers scale-out; the spike ends; the scaler scales down too aggressively; the next minor blip triggers another scale-out; the scaler scales down again. The fleet oscillates, and every transition costs warm-up latency and data churn.

Mitigations:

Stabilization window: do not scale down until the metric has been below target for 5 to 15 minutes.
Asymmetric thresholds: scale up at 60% CPU, but only scale down below 40%. Hysteresis prevents rapid flapping.
Step-size cap on scale-in: remove at most 1 to 2 instances per period; let the fleet shrink gracefully.
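
The three mitigations combine naturally into a single scale-in guard. The sketch below assumes a hypothetical `createScaleInGuard` helper: scale-in is allowed only after the metric has stayed below the lower threshold for the whole stabilization window, and each action removes at most a fixed number of instances.

```javascript
// Hypothetical scale-in guard combining a lower (hysteresis) threshold,
// a stabilization window, and a step cap.
function createScaleInGuard({ scaleInBelow, windowMs, maxStepIn }) {
    let belowSince = null; // when the metric first dropped below the threshold
    return function decide(current, metric, nowMs, min) {
        if (metric >= scaleInBelow) {
            belowSince = null; // any reading at or above the threshold resets the timer
            return current;
        }
        if (belowSince === null) belowSince = nowMs;
        if (nowMs - belowSince < windowMs) return current; // still stabilizing
        belowSince = nowMs; // restart the window after each scale-in step
        return Math.max(min, current - maxStepIn);
    };
}
```

With `scaleInBelow: 40`, `windowMs: 600000`, and `maxStepIn: 1`, a single reading of 45% in the middle of a quiet period restarts the ten-minute clock, which is exactly the behavior that stops flapping.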

3. Downstream constraints

Your stateless API tier scales beautifully. The Postgres connection limit does not. At 50 instances each holding 20 connections, you are at 1,000 connections - well past the practical ceiling of a few hundred on a typical Postgres instance (the default max_connections is only 100). Scaling out has now caused the database outage you were trying to avoid.

Mitigations:

Connection poolers (PgBouncer, RDS Proxy) that multiplex many client connections from the fleet onto a small, fixed number of database connections.
Downstream rate limits at the gateway so a runaway scale-out cannot overwhelm a fixed-capacity backend.
Capacity coupling: the scaler considers downstream constraints and refuses to grow if the database is at 90%.
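
Capacity coupling can start as simple arithmetic: derive the scaler's ceiling from the database's connection budget. The function names and the idea of reserving some connections for admin and migration traffic are assumptions for illustration.

```javascript
// Hypothetical connection budget: how many app instances can the database
// tolerate if each instance holds a fixed-size connection pool?
function maxInstancesForDb({ dbMaxConnections, reservedConnections, poolSizePerInstance }) {
    const usable = dbMaxConnections - reservedConnections;
    return Math.max(0, Math.floor(usable / poolSizePerInstance));
}

// Cap whatever the scaler wants at what the database can absorb.
function cappedDesired(desired, limits) {
    return Math.min(desired, maxInstancesForDb(limits));
}

const limits = { dbMaxConnections: 500, reservedConnections: 20, poolSizePerInstance: 20 };
console.log(maxInstancesForDb(limits)); // 24
console.log(cappedDesired(50, limits)); // 24: the scaler wanted 50
```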

4. Cold starts (especially serverless)

Lambda, Cloud Run, Azure Functions all spin up new containers on demand. The first request to a new container pays a cold-start penalty (100 ms to 5 seconds depending on language). Under bursty load, every request that arrives when no warm container is free pays that penalty.

Mitigations:

Provisioned concurrency (AWS Lambda) keeps N pre-warmed instances.
Min replicas > 0 for Cloud Run / Knative.
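Ahead-of-time compiled languages (Go, Rust, GraalVM native-image Java) typically cold-start far faster than a standard JVM, often by an order of magnitude; interpreted runtimes such as CPython usually fall in between, depending on how much they import at startup.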

5. Cost runaway

A stuck-open feedback loop or a misconfigured metric can scale the fleet to the moon in minutes. A single bad metric at 2 a.m. can turn into a five-figure bill before anyone wakes up.

Mitigations:

Hard maximum on the scaler (max_size = 200). The scaler refuses to grow past it; an alert fires; humans investigate.
Budget alerts (AWS Budgets, GCP Billing) catching anomalous spend within hours, not at month-end.
Quotas at the cloud-provider level preventing accidental 10000-instance launches.
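
A quick way to choose the hard maximum is to work backward from the spend you can tolerate. The helpers and the prices below are hypothetical; plug in your own instance price.

```javascript
// Worst-case hourly spend if the scaler pins the fleet at its maximum.
function worstCaseHourlyCost(maxSize, hourlyPricePerInstance) {
    return maxSize * hourlyPricePerInstance;
}

// Largest max_size that keeps the worst case within an hourly budget.
function maxSizeForBudget(hourlyBudget, hourlyPricePerInstance) {
    return Math.floor(hourlyBudget / hourlyPricePerInstance);
}

// Enforce the cap and report when the scaler wanted more, so an alert can fire.
function applyHardMax(desired, maxSize) {
    return { replicas: Math.min(desired, maxSize), alert: desired > maxSize };
}

console.log(worstCaseHourlyCost(200, 0.5)); // 100 (currency units per hour)
console.log(applyHardMax(350, 200));        // { replicas: 200, alert: true }
```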

Capacity Planning: The Math Before the Scaler

Auto-scaling reacts; it does not absolve you of the math. A senior engineer can quote the fleet size from first principles before reaching for the scaler.

The four-number model

---------- Fleet sizing model ----------
  fleet_size = (peak_QPS * avg_request_seconds) / per_node_concurrency * (1 + headroom)

Walking each number:

peak_QPS: requests per second at the highest moment you care about (typically 95th or 99th percentile minute, or a planned event).
avg_request_seconds: average time a request occupies a worker (Little's Law: concurrency = arrival rate * service time).
per_node_concurrency: concurrent requests per node (number of worker threads, async workers, or CPU cores * a requests-per-core factor).
headroom: 30% to 50% spare capacity for spikes, deploys, and node failures.

Worked example

A REST API: 10K peak QPS, 50 ms average request, 100 concurrent requests per node.

---------- Worked example ----------
  fleet_size = (10000 * 0.05) / 100 * 1.4
             = 500 / 100 * 1.4
             = 7 nodes

Seven nodes at peak, plus the auto-scaler pulling it down to two or three at trough. Easy to defend in an interview.
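
The sizing model is short enough to keep as a function you can rerun whenever an input changes. This is a sketch: the name `fleetSize` and the small epsilon that guards against floating-point noise before rounding up are choices made for this example.

```javascript
// Fleet sizing from the four-number model:
// (peak QPS * request seconds) / per-node concurrency * (1 + headroom)
function fleetSize({ peakQps, avgRequestSeconds, perNodeConcurrency, headroom }) {
    // Little's Law: requests in flight = arrival rate * service time.
    const inFlight = peakQps * avgRequestSeconds;
    const nodes = (inFlight / perNodeConcurrency) * (1 + headroom);
    // Round up to whole nodes; the epsilon absorbs floating-point noise.
    return Math.ceil(nodes - 1e-9);
}

console.log(fleetSize({
    peakQps: 10000, avgRequestSeconds: 0.05, perNodeConcurrency: 100, headroom: 0.4,
})); // 7
```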

Sanity-check the answer

Per-node QPS: 10K / 7 = 1.4K QPS per node. Is that realistic for the runtime? For a tuned Node.js or Go service, yes. For Python with 1 worker, no.
Memory per node: with the runtime and request size, does each node fit in the chosen instance? If not, scale up the instance or scale out the fleet.
Database impact: 7 nodes x average 10 connections = 70 connections to Postgres. Fine for any modern instance.

Headroom matters more than people think

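A fleet at 99% utilization cannot absorb even a 5% spike. A fleet at 60% has 40 percentage points of slack, so load can grow by roughly two-thirds (1 / 0.6 is about 1.67x) before it saturates, which buys time for the auto-scaler to react. The headroom is the spike-absorption budget. In the sizing formula, 30% to 40% headroom corresponds to running at roughly 70% to 77% utilization at peak, which is conservative but appropriate for user-facing services. For batch workers, 90% utilization is fine.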
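
Headroom and utilization are two views of the same number, and the conversion is easy to get wrong in your head. The sketch below measures the spike relative to current load; both function names are illustrative.

```javascript
// Fractional load growth a fleet can absorb before reaching 100% utilization.
function spikeAbsorption(utilization) {
    return 1 / utilization - 1;
}

// Utilization at peak implied by the (1 + headroom) factor in the sizing formula.
function utilizationAtPeak(headroom) {
    return 1 / (1 + headroom);
}

console.log(spikeAbsorption(0.6).toFixed(2));   // "0.67": load can grow about 67%
console.log(utilizationAtPeak(0.4).toFixed(2)); // "0.71": 40% headroom means ~71% busy
```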

Tool Comparison

Tool	Workload	Strength	Weakness
AWS Auto Scaling Group	EC2 fleets	Mature, supports scheduled + predictive scaling, integrates with ELB	EC2-only, slower scale-out (~1 min) than container schedulers
Kubernetes HPA (Horizontal Pod Autoscaler)	Pods inside a cluster	Fast (15s loops), metric-driven, Kubernetes-native	Scales pods only; the cluster nodes are scaled by Cluster Autoscaler
Kubernetes VPA (Vertical Pod Autoscaler)	Pods needing right-sized requests	Adjusts CPU/memory requests over time	Mostly requires pod restart; cannot use with HPA on the same metric
Cluster Autoscaler	Kubernetes node pool	Adds/removes nodes when pods cannot schedule	Slower than HPA (1-3 min); depends on cloud-provider node creation
AWS ECS Service Auto Scaling	ECS tasks	Same model as ASG, container-native	ECS-only
Cloud Run / Knative	Serverless containers	Scales to zero, pay-per-request	Cold starts; longer reaction to bursts
AWS Lambda + Provisioned Concurrency	FaaS	Truly elastic, no cluster to manage	Cold starts unless provisioned; per-request cost

Default recommendation: HPA for pod-level scaling, Cluster Autoscaler for node-level, scheduled actions on top for predictable peaks.

Real-World Examples

How real systems implement this in production

Netflix Scryer

Netflix built Scryer, a predictive auto-scaler that projects EC2 demand a few hours ahead based on historical patterns. They combine Scryer (predictive) with a reactive scaler so predicted ramps are pre-warmed and unexpected spikes are absorbed reactively.

Trade-off: At scale, reactive alone is too slow; combining predictive and reactive gives you fewer cold starts and better p99 latency during ramps.

Airbnb on AWS Lambda

Airbnb uses Lambda for image processing, where load varies wildly throughout the day. Lambda scales from 0 to thousands of concurrent executions in seconds; they pay only for execution time. Cold starts are mitigated with provisioned concurrency for the most latency-sensitive functions.

Trade-off: Serverless is the cleanest answer for highly bursty workloads where the cost of always-running infrastructure exceeds the cost of cold starts.

Pinterest HPA tuning

Pinterest runs thousands of services on Kubernetes with HPA. They documented the importance of tuning the metric, target, and stabilization window per service: image-resize workers scale on queue depth with aggressive scale-out and conservative scale-in, while user-facing API services scale on CPU with symmetric thresholds and a short cooldown.

Trade-off: There is no one-size-fits-all HPA config; tune per workload.

Reddit during mega-events

Reddit runs auto-scaling year-round but for known mega-events (election night, IPO day, Super Bowl) they over-provision in advance and disable scale-in for the duration.

Trade-off: Auto-scaling is not magic; for events you cannot afford to be slow on, the safest strategy is to manually pre-scale and freeze the fleet.

Quick Interview Phrases

Key terms to use in your answer

horizontal pod autoscaler

warm pools

scale-down stabilization

predictive scaling

headroom buffer

Little's Law for capacity

Common Interview Questions

Questions you might be asked about this topic

Walk me through how you would configure auto-scaling for a new REST API service.

Step 1: capacity plan - compute baseline fleet size from peak QPS, request duration, per-node concurrency, plus 40% headroom. Step 2: pick HPA metric - CPU for CPU-bound, in-flight requests or RPS for typical APIs. Step 3: set target (60% CPU), min replicas (3, one per AZ), max replicas (10x baseline). Step 4: asymmetric stabilization - 30s for scale-out, 10min for scale-in. Step 5: add Cluster Autoscaler so node capacity follows pod count. Step 6: alert on metric anomalies and on hitting max. Step 7: load-test in staging to verify the scaler reacts as expected. Mention warm-up time and the readiness probe.

How do you handle a 10x traffic spike that arrives in 60 seconds?

Compare reactive vs predictive auto-scaling. When do you use each?

Estimate the fleet size for a service with 50K peak QPS, 80 ms p50 latency, 200 concurrent requests per node.

Your fleet auto-scaled from 10 to 50 instances during a spike, then a runaway loop scaled to 200 and the bill spiked. What went wrong and how do you prevent it?

Interview Tips

How to discuss this topic effectively

Always compute the fleet size before reaching for the scaler. 'QPS times request time divided by per-node concurrency plus headroom' is the senior-level answer to any 'how many machines?' question.

Pair scale-out and scale-in policies asymmetrically. Aggressive scale-out (target=60%, step=2) and conservative scale-in (cooldown=10min, step=1) is the production default.

Mention warm-up time and warm pools whenever the interviewer brings up sudden spikes. Reactive auto-scaling alone is never fast enough for sub-minute bursts.

Bring up downstream constraints before they do. 'Auto-scaling the API tier without scaling the database is how you cause a database outage' shows you have lived through it.

When discussing serverless, name the cold-start mitigation: provisioned concurrency for Lambda, min-instances for Cloud Run. Cold starts are the most-asked serverless follow-up.

Common Mistakes

Pitfalls to avoid in interviews

Setting auto-scaling and forgetting capacity planning

Auto-scaling reacts to metrics but cannot create capacity that does not exist. A misconfigured min-instance count, an undersized instance type, or a downstream bottleneck still causes outages. Always do the capacity math first; the scaler handles the variance, not the baseline.

Scaling on the wrong metric

CPU is a poor metric for I/O-bound services that wait on the database; memory is a poor metric for languages that do not return memory to the OS. Match the metric to the actual bottleneck: queue depth for workers, in-flight requests for connection-bound services, custom metrics for everything else.

Aggressive symmetric scale-in

Scaling out fast and scaling in fast causes oscillation. Use asymmetric thresholds (e.g., scale up at 60%, scale down below 40%) and a long stabilization window for scale-in (10 minutes or more) so transient dips do not trigger unnecessary churn.

Forgetting downstream coupling

Scaling the stateless tier without scaling its downstream dependencies (database, cache, queue, third-party API) just moves the bottleneck. Either scale the downstream proportionally, add a connection pooler, or rate-limit at the gateway so a runaway scale-out cannot overwhelm a fixed-capacity backend.

Trusting auto-scaling for traffic spikes faster than the warm-up time

If the spike is faster than the time to add and warm an instance (typically 30 seconds to 5 minutes), the scaler arrives too late. Mitigate with sufficient headroom on the existing fleet, scheduled pre-scale for known spikes, warm pools, or a serverless layer for bursts.
