System Design Article

Fault Tolerance, Redundancy & Failover

Difficulty: Medium

Fault tolerance is the property that lets a system keep working when components fail - and at any reasonable scale, components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that make systems gracefully degrade instead of catastrophically crash (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end you can design a system that survives the failure modes interviewers love to throw at you.

What Fault Tolerance Means

A fault-tolerant system continues to deliver its intended service even when components fail. Failure is not an event you can engineer away - it is a steady-state condition of any sufficiently large system. With 1000 servers, the question is not 'will something fail today' but 'how many things will fail today'. Exact rates vary with hardware, age, and workload, but a handful of component failures per day is a reasonable planning assumption.

Fault tolerance design principle: assume every component will fail; design so that any single failure does not cause user-visible impact, and so that combinations of failures degrade gracefully rather than crash catastrophically.

---------- Failure as a constant ----------
  In a fleet of 1000 servers:
    ~ 5 disk failures per day
    ~ 1 CPU/memory failure per day
    ~ 2 network link issues per day
    ~ Occasional power, cooling, or rack-level failures

  At cloud-region scale (10K-100K servers):
    Failures every minute. The system must not notice.

Building Block: Redundancy

The simplest answer to failure: have more than one of everything. If component X dies, component X' takes over.

Active-Active

All redundant copies actively serve traffic. Loss of one copy reduces capacity but not availability.

---------- Active-active load balancing ----------
  Clients ----+--> [LB] --> [Server 1]
              |             [Server 2]
              |             [Server 3]
  on Server 2 failure: LB removes it; Servers 1 and 3 absorb the traffic.

Pros: best resource utilization, fastest failover (LB just stops routing). Cons: requires the system to be stateless or to handle distributed state (replication, session stickiness).

Use for: stateless web/API tiers, idempotent workers, read-heavy data tiers with replication.

Active-Passive (Hot Standby)

One copy actively serves traffic; one or more standbys are warm and ready to take over. Failover is triggered when the active copy fails.

---------- Active-passive ----------
  Clients --> [Active Primary]   <----replication---->   [Standby]
              (serves all traffic)                       (idle, but synced)

  on primary failure: standby is promoted to primary; clients redirected.

Pros: simpler stateful semantics; standby has full state mirror. Cons: standby capacity is idle most of the time; failover has a detection + promotion delay (seconds to minutes).

Use for: relational databases (PostgreSQL primary + read replica with failover), stateful services with strong consistency requirements.

N+1 (or N+2) Redundancy

Provision N machines for the load + 1 (or 2) extra to absorb a single (or double) failure.

Five nodes each running at 80% utilization carry four nodes' worth of load, so the fleet can absorb one failure, but only just: the remaining four nodes run at 100% with no headroom. With N+2, two simultaneous failures are absorbed.

Common practice: target ~50-70% steady-state utilization so that 1-2 failures (or a deploy taking nodes out) do not push the survivors past safe load.
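
The capacity arithmetic above can be checked with a few lines of code. This is a minimal sketch that assumes load is spread evenly across surviving nodes; the function names are illustrative, not from any library.

```python
def utilization_after_failures(nodes: int, utilization: float, failed: int) -> float:
    """Per-node utilization once `failed` nodes are gone.

    Assumes the survivors share the original load evenly.
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    total_load = nodes * utilization  # load expressed in node-equivalents
    return total_load / survivors


def max_safe_utilization(nodes: int, tolerated_failures: int, ceiling: float = 1.0) -> float:
    """Highest steady-state utilization that keeps survivors at or below `ceiling`."""
    survivors = nodes - tolerated_failures
    if survivors <= 0:
        raise ValueError("cannot tolerate that many failures")
    return ceiling * survivors / nodes
```

For the 5-node example, `utilization_after_failures(5, 0.8, 1)` comes out at 1.0 (100%), while `max_safe_utilization(5, 2)` gives 0.6, which is one way to arrive at the 50-70% guideline.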

Failure Detection: Health Checks and Heartbeats

You cannot recover from failures you do not detect. Two patterns dominate.

Health Checks

A load balancer (or service mesh) periodically probes each backend (HTTP GET /health, TCP connect) and removes failing backends from the pool.

Three levels:

Liveness: is the process running? (typically a TCP connect or trivial HTTP 200).
Readiness: is the process ready to handle traffic? (e.g., connected to its DB, caches warmed).
Deep: can the service do real work? (e.g., a synthetic transaction touching the DB and key dependencies).

Kubernetes has explicit liveness, readiness, and (newer) startup probes. Most production systems use readiness checks for routing and liveness checks for restart decisions.

Tuning trade-off:

Aggressive (fail after 1-2 missed checks): fast detection, but a flaky network can cause false-positive removals.
Conservative (fail after 3-5 missed checks): tolerates transient blips, but a dead backend keeps receiving traffic for longer.
A common production starting point: probe every 5s, mark unhealthy after 2-3 consecutive misses, mark healthy again after 1 success.
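
The threshold logic a load balancer applies to probe results can be sketched as a small state tracker. This is an illustration under assumptions (the class and function names are hypothetical, and real load balancers also account for probe timeouts), not the behavior of any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class HealthTracker:
    fail_threshold: int = 3      # consecutive misses before removal
    success_threshold: int = 1   # consecutive passes before re-adding
    healthy: bool = True
    _fails: int = field(default=0, init=False)
    _passes: int = field(default=0, init=False)

    def record(self, probe_ok: bool) -> bool:
        """Feed in one probe result; return whether the backend is in the pool."""
        if probe_ok:
            self._fails = 0
            self._passes += 1
            if not self.healthy and self._passes >= self.success_threshold:
                self.healthy = True
        else:
            self._passes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy


def worst_case_detection_seconds(interval_s: float, fail_threshold: int) -> float:
    """Rough upper bound on detection time, ignoring per-probe timeouts."""
    return interval_s * fail_threshold
```

With a 5-second interval and a threshold of 3, the detection budget is about 15 seconds, which is the time a dead backend may keep receiving traffic.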

Heartbeats

A distributed component sends periodic 'I am alive' messages to a coordinator (leader, registry, monitoring system). Missing heartbeats trigger failure handling.

Used in: cluster managers (etcd, ZooKeeper for membership), gossip protocols (Cassandra, Consul), leader election (Raft heartbeats from leader to followers).
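
A coordinator-side heartbeat check can be sketched as follows. This is a simplified, single-process illustration with hypothetical names; systems such as etcd or Cassandra use considerably more elaborate detectors.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that have gone quiet."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so tests can control time
        self._last_seen: Dict[str, float] = {}

    def beat(self, node_id: str) -> None:
        """Record an 'I am alive' message from a node."""
        self._last_seen[node_id] = self._clock()

    def suspected_dead(self) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

The timeout plays the same role as the miss threshold in health checks: a short one detects failures quickly but mistakes a brief network stall for a dead node.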

Failover: Automatic vs Manual

When a failure is detected, what happens next?

Automatic Failover

A controller (LB, orchestrator, consensus protocol) promotes a standby or reroutes traffic without human intervention.

Pros: fast (seconds), works at 3 a.m. without paging a human. Cons: must be tested often, can misbehave (false positives, cascading promotion failures, split brain).

Manual Failover

A human operator decides when to fail over, after diagnosing the situation.

Pros: safer for ambiguous failures, easier to handle complex multi-failure scenarios. Cons: requires on-call, slower, only viable when downtime is tolerable.

Industry practice: automatic failover for stateless tiers and well-tested stateful systems (HA databases with proven failover); manual failover for cross-region disaster recovery and ambiguous degraded states.

Cascading Failures and Anti-Patterns

The biggest production outages are not single failures - they are cascades. One component fails, retries pile up, the load shifts to surviving components, those overload and fail, and the failure spreads through the stack.

Retries Without Backoff (the classic mistake)

Client calls service S, S returns an error, client retries immediately. With many clients, S sees a thundering herd of retries that prevents it from recovering.

Fix: exponential backoff with jitter.

retry_delay = base * 2^(attempt - 1) + random_jitter    (base = 100ms)
  attempt 1: 100ms + random(0, 100ms)
  attempt 2: 200ms + random(0, 100ms)
  attempt 3: 400ms + random(0, 100ms)

Jitter is essential: without it, all clients retry at the same instant, creating periodic spikes.
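
A minimal retry helper that follows the 100 ms, 200 ms, 400 ms schedule above might look like this. The function names, the 10-second cap, and the choice to retry on any exception are assumptions made for the sketch; production code should retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float = 0.1,
                  max_jitter_s: float = 0.1, cap_s: float = 10.0) -> float:
    """Delay before retry number `attempt` (1-based): capped exponential plus jitter."""
    exponential = min(cap_s, base_s * 2 ** (attempt - 1))
    return exponential + random.uniform(0.0, max_jitter_s)


def call_with_retries(fn: Callable[[], T], max_retries: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying up to `max_retries` times with backoff between attempts."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:  # sketch only: narrow this to retryable errors
            attempt += 1
            if attempt > max_retries:
                raise  # retries exhausted; surface the last error
            sleep(backoff_delay(attempt))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.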

No Timeouts

Client calls a slow service and waits indefinitely. Caller threads pile up, exhausting the connection pool, and the caller dies even though it was healthy.

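Fix: every network call has a timeout. A service's timeout on its downstream calls should be shorter than the timeout its own caller applies to it; otherwise it keeps working on requests the caller has already abandoned.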
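
One way to keep downstream timeouts inside the caller's deadline is to carry a per-request time budget and derive each call's timeout from what is left. The `Deadline` class and its parameters below are hypothetical, offered as a sketch rather than an established API.

```python
import time
from typing import Callable


class Deadline:
    """Time budget for one incoming request."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left before the caller gives up on this request."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap_s: float, reserve_s: float = 0.05) -> float:
        """Timeout for the next downstream call.

        Never exceeds the per-call cap, and leaves `reserve_s` to build the response.
        """
        left = self.remaining() - reserve_s
        if left <= 0:
            raise TimeoutError("request deadline exceeded")
        return min(per_call_cap_s, left)
```

The returned value would then be passed as the timeout argument of whatever HTTP or RPC client is in use.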

Pattern: Circuit Breakers

A circuit breaker monitors call success rates and 'opens' (stops attempting calls) when failures cross a threshold. Open breakers prevent the caller from drowning a failing service in retry traffic.

---------- Circuit breaker states ----------
  CLOSED -> calls pass through; track error rate
     | error rate > 50%
     v
  OPEN -> reject calls immediately; return cached value or default
     | wait timeout (e.g., 30s)
     v
  HALF-OPEN -> let a few calls through
     | calls succeed -> CLOSED
     | calls fail -> OPEN

Libraries: Hystrix (Netflix, now in maintenance mode), Resilience4j, Polly (.NET), Istio's outlier detection, Envoy's circuit breakers.
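
The three states can be expressed in a compact class. This is a simplified sketch, not how any of the libraries above implement it: it counts all calls since the last reset instead of using a sliding window, allows a single trial call in the half-open state, and is not thread-safe. All names are illustrative.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, min_calls=10, open_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # trip above this error rate
        self.min_calls = min_calls                  # do not trip on tiny samples
        self.open_seconds = open_seconds
        self._clock = clock
        self.state = "CLOSED"
        self._successes = 0
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, fallback=None):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.open_seconds:
                if fallback is not None:
                    return fallback()  # cached value or default
                raise CircuitOpenError("circuit is open")
            self.state = "HALF_OPEN"   # wait elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self._reset()              # trial call succeeded: close the circuit
        else:
            self._successes += 1

    def _on_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()               # trial call failed: reopen
            return
        self._failures += 1
        total = self._successes + self._failures
        if total >= self.min_calls and self._failures / total > self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self._opened_at = self._clock()
        self._successes = self._failures = 0

    def _reset(self):
        self.state = "CLOSED"
        self._successes = self._failures = 0
```

The `min_calls` guard is a design choice worth mentioning in interviews: without it, one failure out of one call is a 100% error rate and would open the circuit immediately.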

Pattern: Bulkheads

The bulkhead pattern (named after ship compartments) isolates resources so a failure in one area cannot drown the rest.

Examples:

Separate connection pools per downstream service. The slow service exhausts its own pool but does not block calls to other services.
Separate thread pools per request type. CPU-intensive requests do not starve light requests.
Per-tenant quotas. One noisy customer does not consume all resources.

---------- Bulkhead: per-service connection pools ----------
  App
   |-- pool A (10 connections) --> Service A
   |-- pool B (10 connections) --> Service B   <- if B is slow,
   |-- pool C (10 connections) --> Service C       only pool B is exhausted;
                                                     pools A and C still work.
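
A bulkhead can be approximated in-process with one semaphore per dependency. The sketch below uses hypothetical names and rejects immediately when a compartment is full; a real connection pool would usually wait for a short, bounded time instead.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        """Run `fn` if a slot is free; otherwise fail fast instead of queueing."""
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: all slots in use")
        try:
            return fn()
        finally:
            self._slots.release()


# One compartment per downstream service, mirroring the diagram above.
pools = {name: Bulkhead(name, max_concurrent=10) for name in ("A", "B", "C")}
```

If calls to service B hang, only `pools["B"]` fills up; calls routed through `pools["A"]` and `pools["C"]` are unaffected.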

Pattern: Graceful Degradation

When a non-critical dependency fails, return a reduced-feature response instead of an error. The user gets a working page (with the recommendation widget missing) instead of an error page.

Examples:

News feed shows posts but the 'people you may know' widget is empty (its service is down).
E-commerce search returns results but personalized ranking falls back to generic relevance.
Map displays roads but live traffic overlay is missing.

Graceful degradation is what makes systems feel reliable to users even when parts are broken. Designing for it requires explicitly identifying critical vs non-critical paths.
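
In code, the critical vs non-critical split often shows up as which errors are allowed to propagate. The following is a minimal sketch with hypothetical names: the critical fetch may fail the request, while each optional widget falls back to a default.

```python
from typing import Any, Callable, Dict, List


def render_page(critical: Callable[[], Any],
                optional: Dict[str, Callable[[], Any]],
                defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Build a page where only the critical section is allowed to fail the request."""
    page: Dict[str, Any] = {"posts": critical()}  # critical path: errors propagate
    degraded: List[str] = []
    for name, fetch in optional.items():
        try:
            page[name] = fetch()
        except Exception:
            # Non-critical path: serve a default and record the degradation.
            page[name] = defaults.get(name)
            degraded.append(name)
    page["degraded"] = degraded
    return page
```

Recording which sections were degraded makes the fallback visible in metrics, so a silently empty widget does not go unnoticed for weeks.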

Disaster Recovery: RTO and RPO

Two numbers define a DR strategy:

RTO (Recovery Time Objective): the maximum acceptable time to restore service after a disaster.
RPO (Recovery Point Objective): the maximum acceptable data loss measured in time (e.g., RPO = 5 min means up to 5 minutes of recent data may be lost).

Strategy	RTO	RPO	Cost
Daily backup to tape	hours to days	up to 24h	lowest
Hourly DB snapshot to S3	hours	up to 1h	low
Cross-region warm standby	minutes	seconds-minutes	medium
Cross-region active-active	seconds	near-zero	highest

Translating business requirements:

'Critical financial system' -> RTO < 1 minute, RPO ~0 -> active-active with synchronous cross-region replication (Spanner-like).
'Standard web product' -> RTO 15 minutes, RPO 1 minute -> warm standby with async replication and tested failover.
'Internal analytics dashboard' -> RTO 24h, RPO 24h -> nightly backup, restore on demand.
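
The mapping from requirements to strategy can be written as a lookup over the table. The numeric ceilings below are assumptions chosen to be consistent with the table's ranges and the three examples above (for instance, 15 minutes and 1 minute for a warm standby); they are not measured values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_s: float  # assumed worst-case recovery time, in seconds
    rpo_s: float  # assumed worst-case data loss, in seconds


HOUR = 3600.0

# Ordered cheapest first; ceilings are illustrative assumptions.
STRATEGIES = [
    Strategy("daily backup", rto_s=24 * HOUR, rpo_s=24 * HOUR),
    Strategy("hourly snapshot", rto_s=4 * HOUR, rpo_s=1 * HOUR),
    Strategy("cross-region warm standby", rto_s=15 * 60, rpo_s=60),
    Strategy("cross-region active-active", rto_s=30, rpo_s=0),
]


def cheapest_strategy(rto_s: float, rpo_s: float) -> Optional[Strategy]:
    """Return the cheapest strategy meeting both targets, or None if none does."""
    for strategy in STRATEGIES:
        if strategy.rto_s <= rto_s and strategy.rpo_s <= rpo_s:
            return strategy
    return None
```

Under these assumptions, a 15-minute RTO with a 1-minute RPO selects the warm standby, and any RPO of zero forces active-active.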

Multi-AZ vs Multi-Region

Multi-AZ (within one region): protects against single-AZ failures (a fire, a network outage in one data center). Inter-AZ latency is ~1 ms; synchronous replication is feasible. This is the default for production-grade systems.

Multi-region: protects against whole-region outages (massive cloud-provider failures, natural disasters, regulatory requirements). Inter-region latency is 50-200 ms; synchronous replication is too slow for most workloads, so async with eventual consistency is the norm.

Most SaaS products run multi-AZ; only the most critical (banking, healthcare, life-safety, large enterprise) justify multi-region.

Chaos Engineering

The most reliable way to know your fault tolerance works is to break things on purpose, in production, on a controlled schedule.

Netflix's Chaos Monkey randomly kills production instances during business hours; engineers know that if their service cannot survive the loss of a single instance, that is their bug to fix. Later tools in Netflix's Simian Army scale the idea up: Chaos Gorilla simulates the loss of an entire availability zone, Chaos Kong the loss of a whole region, and Latency Monkey injects delays into inter-service calls.

Modern practice: scheduled chaos experiments with explicit hypotheses ('our system can lose 30% of pods without user impact'), run during business hours, with abort criteria. Tools: Chaos Mesh (Kubernetes), Gremlin, AWS Fault Injection Simulator.
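
The shape of such an experiment (hypothesis, fault injection, abort criterion, guaranteed rollback) can be sketched independently of any tool. The class below is hypothetical and does not reflect the API of Chaos Mesh, Gremlin, or AWS Fault Injection Simulator; the callables stand in for whatever actually injects the fault and reads the metrics.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    hypothesis: str                   # e.g. "losing 30% of pods has no user impact"
    inject: Callable[[], None]        # start the fault
    rollback: Callable[[], None]      # undo the fault
    error_rate: Callable[[], float]   # current user-facing error rate, 0..1
    abort_above: float                # abort criterion

    def run(self, checks: int) -> bool:
        """Return True if the error rate stayed within bounds for every check."""
        self.inject()
        try:
            for _ in range(checks):
                if self.error_rate() > self.abort_above:
                    return False      # abort: the hypothesis did not hold
            return True
        finally:
            self.rollback()           # always restore, even on abort or error
```

A real run would pause between checks and page a human on abort; both are omitted here to keep the sketch short.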

Benefits:

Surfaces hidden assumptions before customers do.
Forces teams to actually run their runbooks.
Builds organizational confidence in DR plans.

Operational Disciplines

Runbooks for every alert. When an alert fires at 3 a.m., the on-call engineer should not be reading code; they should be running a tested procedure.
Game days. Quarterly exercises where a team simulates a major failure and walks through recovery end-to-end.
Blameless post-mortems. After every significant incident, document what failed and what would have caught it. The output is action items, not blame.
Error budgets. SRE concept: if the reliability target is 99.9% (about 43 minutes of downtime per 30-day month), the team has a 'budget' of 43 minutes to spend on risky launches and learning experiments. Encourages calculated risk-taking.
Test failover regularly. Treat failover code that has not run in 6 months as unproven: configuration drift and dependency changes break it silently. Run it monthly, ideally automatically as a chaos experiment.
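
The error-budget figure is simple arithmetic on the availability target. A short sketch, assuming a 30-day window by default and hypothetical function names:

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: float = 30.0) -> float:
    """Minutes of budget left; a negative value means the budget is overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

For a 99.9% target, `error_budget_minutes(0.999)` is about 43.2 minutes per 30 days; after a 30-minute incident, roughly 13 minutes remain for the rest of the window.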

How to Talk About This in an Interview

Start with redundancy and the failure model. 'I would run the API tier active-active behind a load balancer with N+2 capacity; the database in a primary + 2 replicas with automatic failover.'
Mention timeouts, retries with backoff, and circuit breakers. 'Every cross-service call has a 1-second timeout, exponential backoff with jitter, and a circuit breaker that opens at 50% failure rate.'
Distinguish detection from recovery. 'Health checks every 5 seconds detect failures; LB removes the unhealthy backend within 15 seconds; the orchestrator schedules a replacement within 30 seconds.'
Acknowledge cascading failure risks. 'Without bulkheads (per-dependency connection pools), a slow downstream service exhausts the caller's pool and cascades the failure upstream.'
State RTO/RPO targets explicitly. 'For this product, RTO is 5 minutes and RPO is 30 seconds; that requires async replication to a warm standby region with tested failover.'
Mention chaos engineering. 'We validate the design by killing instances and injecting latency in production every week; if the system does not survive, we fix it before customers find it.'

Quick Review

At scale, components fail constantly. Fault tolerance is the architecture, not a feature.
Redundancy: active-active (best utilization, fastest failover), active-passive (simpler state, slower failover), N+1 (capacity buffer).
Detection: health checks (liveness, readiness, deep) and heartbeats.
Anti-cascade patterns: timeouts, exponential backoff with jitter, circuit breakers, bulkheads, graceful degradation.
DR is defined by RTO (time to recover) and RPO (acceptable data loss).
Multi-AZ is default; multi-region is for critical systems.
Chaos engineering validates fault tolerance is real.
Runbooks, game days, error budgets, blameless post-mortems are the SRE disciplines that turn architecture into reliable systems.

Real-World Examples

How real systems implement this in production

Netflix Chaos Monkey and Hystrix

Netflix popularized two patterns: Chaos Monkey randomly kills production EC2 instances during business hours, forcing engineers to design for the failure they will see; Hystrix (now in maintenance mode, with Resilience4j the commonly recommended successor) provided circuit breakers, bulkheads, and graceful degradation across hundreds of microservices. Together they shifted the industry's view of reliability from 'avoid failure' to 'embrace and design for it'.

Trade-off: Chaos engineering surfaces real reliability bugs before customers do, but requires an engineering culture that treats production breakage as a learning opportunity. The cost is significant: engineers spend time on resilience instead of features. The win is measurable: Netflix has had remarkably few major customer-facing outages relative to its scale.

AWS Multi-AZ RDS

AWS RDS Multi-AZ runs a primary database in one AZ with a synchronous standby in another. On primary failure (instance crash, AZ outage), RDS detects it within ~30 seconds and promotes the standby; DNS is updated to point at the new primary; clients reconnect within a minute or two. RPO is near-zero (synchronous replication); RTO is 1-2 minutes.

Trade-off: Multi-AZ doubles the database cost (you pay for the standby) and adds a few ms of write latency from synchronous replication. The win: surviving full AZ failures without manual intervention or data loss. For any production database, the cost is overwhelmingly justified.

Cloudflare's anycast routing

Cloudflare uses BGP anycast to advertise the same IP from data centers worldwide. When a user requests a Cloudflare-protected site, traffic flows to the nearest healthy POP. If a POP fails, BGP withdraws the route and traffic seamlessly shifts to the next nearest POP within seconds. The user notices nothing.

Trade-off: Anycast gives transparent failover and global load balancing without DNS changes, but requires sophisticated network engineering (BGP peering, route flap dampening, capacity planning per POP). The infrastructure investment is huge; the user experience is exceptional.

Stripe error budgets

Stripe advertises uptime at the five-nines level (99.999%) for its payments API. Holding that level leaves about 5.26 minutes of downtime per year, which in practice demands the disciplines covered in this lesson: payment infrastructure spread across multiple regions with automatic failover, runbooks for every alert class, regular DR drills, and error budgets that govern launch decisions (if the budget is consumed, no new launches until reliability recovers).

Trade-off: Five-nines is enormously expensive: redundant infrastructure, dedicated SRE staff, slowed feature velocity. For a payment processor where every minute of downtime is millions in lost transactions and trust, the cost is justified. For a typical SaaS product, three or four nines is the practical target.

Quick Interview Phrases

Key terms to use in your answer

active-active redundancy

circuit breaker

exponential backoff with jitter

graceful degradation

RTO and RPO

blast radius

Common Interview Questions

Questions you might be asked about this topic

How do you design a fault-tolerant API service that survives instance and AZ failures?

Multi-AZ deployment, N+2 capacity. API tier is stateless behind a load balancer with health checks (probe every 5s, fail after 2-3 misses). Auto-scaling group keeps minimum instances per AZ. Sessions stored in shared Redis (HA, multi-AZ). Database is primary + 2 replicas across AZs with automatic failover (RDS Multi-AZ or Patroni-style). Every cross-service call has a 1s timeout, exponential backoff with jitter, and a circuit breaker. Acknowledge cascading-failure risk with bulkheads (per-dependency connection pools). Test by killing AZs in chaos experiments.

What is a circuit breaker and when should you use one?

Walk through your approach to disaster recovery for a SaaS product.

How do you prevent cascading failures in a microservices system?

What is chaos engineering and how do you implement it?

Interview Tips

How to discuss this topic effectively

Always state your timeout, retry, and circuit-breaker policy together. 'Every call: 1s timeout, 3 retries with exponential backoff and jitter, circuit breaker opens at 50% errors' is a senior-level one-liner.

Bring up the failure model explicitly. 'I assume any single instance can die at any time; the system must survive without user impact' is what interviewers want to hear.

Distinguish detection latency from failover latency. They are separate budgets and both contribute to the user-visible outage time.

State RTO and RPO numbers, not just 'high availability'. Vague reliability promises are a yellow flag; precise numbers signal you have done DR planning.

Mention chaos engineering even if not asked. Saying 'we kill instances on purpose to validate the design' is the strongest signal you have actually operated reliable systems.

Common Mistakes

Pitfalls to avoid in interviews

Retrying without exponential backoff and jitter

Naive retries pile up immediately on the failing service, preventing recovery (the 'thundering herd'). Always use exponential backoff (delay doubles each attempt) with jitter (random offset) so retries spread out instead of synchronizing. Without jitter, all clients retry at the same instant, creating periodic spikes.

Calling a service without a timeout

Without a timeout, a slow downstream service backs up the caller's threads or connections, eventually exhausting them and crashing the caller even though it is healthy. Every network call needs an explicit timeout, shorter than the upstream caller's timeout to avoid wasted work.

Treating automatic failover as set-and-forget

Failover code that has not been exercised in months should be assumed broken: configuration, credentials, and dependencies drift while it sits idle. Production-grade systems exercise failover monthly (or weekly via chaos experiments). If the first run of a failover path happens during a real incident, expect it to fail.

Not distinguishing critical from non-critical paths

Without explicit identification, a failure in a 'nice-to-have' service (recommendations, analytics widgets) takes down the main page. Mark critical vs non-critical at design time; design non-critical paths to fail open with cached or default values.

Conflating RTO and RPO with 'high availability'

High availability is a vague claim; RTO and RPO are measurable commitments. 'RTO 5 min, RPO 30s' tells you exactly how the system fails: you can lose up to 30 seconds of recent data, and service can be down for up to 5 minutes. Always express DR requirements as numbers.
