Reliability

0 lessons
2 system designs
2 behavioral interviews
18 community items

System Design

2 articles
System Design

Fault Tolerance, Redundancy & Failover

Fault tolerance is the property that lets a system keep working when components fail, and at any reasonable scale, components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that make systems degrade gracefully instead of crashing catastrophically (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end, you can design a system that survives the failure modes interviewers love to throw at you.
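As a rough illustration of one pattern named above, retries with backoff, here is a minimal sketch. It is not taken from the lesson itself: the function name `retry_with_backoff`, its parameters, and the choice of "full jitter" (sleeping a random fraction of an exponentially growing cap) are assumptions made for this example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: `sleep` and `rng` are injectable so the delay
    schedule can be tested without actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential growth: base_delay * 2^attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the cap so that many
            # clients retrying at once do not hit the server in lockstep.
            sleep(cap * rng())
```

A production version would usually retry only on errors known to be transient and pair the retry loop with a timeout on each attempt; this sketch retries on any exception for brevity.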

fault-tolerance
redundancy
failover
circuit-breaker
reliability
availability
distributed-systems
system-design
intermediate
free

510

11

Medium
System Design

Monitoring, Logging, Alerting & SLAs

Observability is what lets you know whether your system is working before customers do. This lesson covers the three pillars (metrics, logs, traces), the SRE-grade definitions of SLI / SLO / SLA, and the operational practices that turn raw telemetry into actionable alerts (RED method, USE method, error budgets, alert-fatigue control). We tour the standard production stack (Prometheus, Grafana, OpenTelemetry, ELK, Datadog) and the pitfalls that cause teams to either drown in alerts or miss real incidents. By the end, you can design an observability strategy and defend it in an interview against the question 'How would you know if this system was broken?'
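To make the error-budget idea concrete, the arithmetic can be sketched as below. This is an illustrative example, not the lesson's own material: the function names and the 30-day default window are assumptions.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent.

    1.0 means nothing has been spent; a negative value means the budget
    is overspent for the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests observed, so no budget to measure against")
    return 1.0 - failed_requests / allowed_failures
```

For example, a 99.9% availability SLO over 30 days leaves roughly 43.2 minutes of budget, and 250 failures out of 1,000,000 requests against the same target leaves about three quarters of the request budget unspent.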

monitoring
alerting
logging
tracing
sla
slo
reliability
system-design
intermediate
premium

474

4

Medium

Behavioral Interviews

2 articles
Behavioral Interview

Debugging & Production Incident Stories

Production-incident questions are the operational-judgement probe. They test whether you can act calmly under live pressure, separate mitigation from root-cause work, and tell a blameless story that distinguishes systems-level lessons from individual blame. This lesson defines incident-grade storytelling (timeline craft with explicit T+0 / T+5 / T+30 markers), draws the line between fix, remediation, and prevention, walks through blameless-postmortem language you can use in the room without sounding rehearsed, and provides fully worked model STAR answers for the prompts you will hear most. Every model answer in this lesson focuses blame on systems and processes, never on people or teams. After this lesson you will be able to take any real incident from your career and shape it into an answer that scores on calm, judgement, and operational maturity simultaneously.

behavioral
behavioral-interview
debugging
reliability
monitoring
problem-solving
interview-prep
interview-strategy
story-banking
star-method

1.1k

27

Medium
Behavioral Interview

Behavioral for Backend / Infra Engineers

Backend and infrastructure engineering loops grade for a cluster of behavioral signals that frontend and product engineering loops weight less heavily: reliability and on-call judgement, capacity and scale thinking, data-integrity decisions under pressure, and the empathy-for-the-pager dimension that distinguishes engineers who can be trusted with production. The behavioral signal is most often woven into the system-design round and the on-call-and-incident round, with explicit story shapes (the 3am page, the SLO trade-off) that interviewers reach for. This lesson defines the cross-cutting backend signals interviewers grade, walks through how the loop folds the behavioral signal into the technical rounds, maps the signals to the questions interviewers ask, and shows two model answers tailored to the incident-response and capacity-planning story shapes.

behavioral
behavioral-interview
backend
interview-prep
company-specific
reliability
on-call
capacity-planning
role-specific

696

3

Medium

Community

18 items
Article

Metrics, Logs, and Traces: The Three Pillars Without the Marketing

What each pillar actually does, when reaching for it pays off, and the budget I follow so I am not paying observability vendors more than I am paying for compute.

monitoring
logging
tracing
alerting
reliability

313

6

4.4 (9)

May 17, 2026

by @zarakamau

Article

Technical Debt: When It's Debt vs When It's Just Old

Most code labeled technical debt is not debt at all. Here is the test I use to tell debt from age, and the rule I follow when paying it down.

craftsmanship
code-organization
clean-code
reliability
decision-making

926

18

4.2 (13)

May 5, 2026

by @nathanrivera

Interview Experience

Datadog Onsite: Five Hours of System Design

A Datadog senior backend onsite where four of the five rounds were system design, anchored on real telemetry-shaped problems.

system-design
interview-prep
distributed-systems
monitoring
reliability

730

9

4.3 (11)

Apr 30, 2026

by @chloesaeed

Question Bundle
Free

SRE Incident Drill Questions From Our On-Call

Four incident-shaped questions our on-call rotation uses to interview SREs. Each one starts with a symptom and asks you to write the smallest diagnostic snippet; the discussion matters more than the code.

Python
reliability
on-call
monitoring
interview-prep

827

26

4.3 (14)

Mar 30, 2026

by @alexsaeed

Article

The Refactoring Playbook: Six Moves I Use Weekly

The small, low-risk refactoring moves I reach for every week, what each one fixes, and the order to apply them so the diff stays reviewable.

craftsmanship
clean-code
code-organization
reliability

489

10

Mar 22, 2026

by @arjunpatel

Article

Database Migrations: A Zero-Downtime Playbook

Adding a column, renaming a column, dropping a column, splitting a table. The expand-contract pattern, the four-step rename, and the migration phases that have kept me from taking the site down.

database
sql
reliability
backend
data-modeling

659

16

Mar 22, 2026

by @hannahchakraborty

Article

Infrastructure as Code: Terraform vs Pulumi vs CDK

Three IaC tools I have shipped to production, the trade-offs that actually matter (state, language, drift, blast radius), and the picks I would make today by team shape.

reliability
craftsmanship
backend
code-organization

832

26

4.0 (11)

Mar 21, 2026

by @nathanmurphy

Article

Rate Limiting on the Edge with a Redis Token Bucket

Token bucket as a single Redis Lua script, evaluated atomically, deployed near the edge. The implementation, the failure modes, and what I would actually ship today.
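For readers unfamiliar with the algorithm, the token-bucket logic can be sketched in plain Python. This is a single-process illustration only, not the article's Redis Lua script: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenBucket:
    """In-memory token bucket (hypothetical sketch, not distributed or atomic)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)        # maximum burst size, in tokens
        self.refill_rate = float(refill_rate)  # tokens added per second
        self.clock = clock                     # injectable for testing
        self.tokens = float(capacity)          # start with a full bucket
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request fits the budget."""
        now = self.clock()
        elapsed = max(0.0, now - self.updated_at)
        # Lazily refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a shared deployment the read, refill, and spend steps have to happen as one atomic operation across all callers, which is the role the summary above assigns to the Lua script.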

rate-limiting
token-bucket
redis
api-design
reliability

463

14

4.4 (10)

Mar 17, 2026

by @antonmorgan

Article

Docker for Devs Who Don't Want to Be Sysadmins

The 80% of Docker that I actually use day to day, the layer-cache rules that cut my image builds from 4 minutes to 30 seconds, and the four mistakes that haunted my first year.

backend
reliability
craftsmanship
clean-code

234

5

4.2 (12)

Mar 14, 2026

by @hannahchakraborty

Article

Testing Pyramid vs Trophy: Pick the Right Shape

Most teams ship the testing pyramid by accident. The trophy is what actually matches modern frontend work. Here is how to choose.

testing
unit-testing
craftsmanship
reliability
code-organization

644

9

4.2 (14)

Mar 4, 2026

by @aishasantos

Article

The On-Call Handbook for Engineers Who Hate Being On-Call

The 12 hours before, the first hour of an incident, the playbook discipline that makes 3am pages survivable, and the post-rotation rituals that have stopped on-call from wrecking my health.

on-call
reliability
alerting
handling-failure
craftsmanship

548

15

Mar 4, 2026

by @gabrielkhalil

Article

CI/CD Pipelines: Stop Letting Them Rot

The maintenance habits that have kept my pipelines fast and trusted for years, the seven categories of rot I have actually seen, and the budget I run so the pipeline is treated as production code.

reliability
craftsmanship
performance
code-organization

1k

10

4.1 (9)

Jan 21, 2026

by @sanjayward

Article

Feature Flags: Three Patterns I Keep Reusing

The release flag, the kill switch, and the experiment flag. Different lifetimes, different rollback rules, and the cleanup discipline that has stopped my flag system from becoming a graveyard.

reliability
code-organization
craftsmanship
backend

837

8

4.3 (13)

Jan 9, 2026

by @valentinamwangi

Article

Connection Pooling, PgBouncer, and the Prisma Trap

What a connection pool actually does, why your Postgres falls over at 200 connections, where PgBouncer sits, and the prepared-statement bug that bites every Prisma team that adds it the wrong way.

database
performance
backend
scalability
reliability

322

2

4.3 (13)

Jan 8, 2026

by @ananyanakamura

Article

Idempotency Keys: The Pattern Stripe Taught Everyone

The key itself is the trivial part. The lifecycle, the storage, the body fingerprint, and the TTL are where production teams trip.

idempotency
stripe
api-design
system-design
reliability

577

4

4.1 (12)

Dec 31, 2025

by @chloekelly

Article

Webhook Design: Retries, Signatures, and Replay Protection

Sign requests. Dedupe by event id. Apply idempotently by resource id. Ack fast, process async. Tolerate out-of-order. Five concerns that turn a webhook into critical infrastructure.
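Three of these concerns (sign, dedupe by event id, ack fast) can be sketched together. This is an illustration under stated assumptions, not the article's implementation: the signature scheme (hex HMAC-SHA256 over the raw body), the `WebhookReceiver` class, and the in-memory set and list are all hypothetical stand-ins for a real provider's scheme, durable storage, and a job queue.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signature scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison to avoid leaking the expected signature."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class WebhookReceiver:
    """Hypothetical receiver: verify, dedupe, enqueue, acknowledge."""

    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen_event_ids = set()  # assumption: real systems use durable storage
        self.queue = []              # assumption: stands in for an async job queue

    def handle(self, body: bytes, signature_hex: str, event_id: str) -> int:
        if not verify_signature(self.secret, body, signature_hex):
            return 401  # reject unsigned or tampered requests
        if event_id in self.seen_event_ids:
            return 200  # duplicate delivery: acknowledge without reprocessing
        self.seen_event_ids.add(event_id)
        self.queue.append((event_id, body))  # ack fast, process later
        return 200
```

The sketch leaves out replay protection via timestamps and out-of-order handling, both of which the summary lists as separate concerns.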

webhooks
security
reliability
idempotency
api-design

1k

31

4.3 (11)

Dec 29, 2025

by @oliviadelgado

Article

AWS Lambda Cold Starts: What Actually Helps

Where the cold-start time really comes from, the four levers that have moved my p99 down by hundreds of milliseconds, and the optimizations I have tried and abandoned because they did not pay back.

serverless
performance
backend
reliability

596

18

4.3 (10)

Dec 23, 2025

by @liamsuzuki

Question Bundle
Free

Incident Debrief Questions They Asked Me

A 4-question set drawn from the debrief portion of an SRE-flavored loop. Every behavioral prompt about an on-call story got paired with a design follow-up the interviewer used to stress-test the takeaway.

Python
interview-prep
behavioral-interview
reliability
system-design-interview

798

25

Dec 18, 2025

by @lilyadeyemi