CODE SNATCH

Learn · Earn · Connect

Reliability

0 lessons

2 system designs

2 behavioral interviews

18 community items

System Design

2 articles

Fault Tolerance, Redundancy & Failover

Fault tolerance is the property that lets a system keep working when components fail, and at any reasonable scale components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that make systems degrade gracefully instead of crashing catastrophically (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end you can design a system that survives the failure modes interviewers love to throw at you.

fault-tolerance

circuit-breaker

distributed-systems

510

11
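The lesson summary above names retries with backoff as one of the graceful-degradation patterns. As a rough illustration only (not taken from the lesson itself), here is a minimal sketch of exponential backoff with full jitter. The function name `retry_with_backoff` and all of its parameters are hypothetical, and the `sleep` and `rng` hooks exist only so the behaviour can be checked without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,   # seconds; assumed starting delay
    max_delay: float = 5.0,    # seconds; assumed upper bound per wait
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call op until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Exponential growth, capped, then scaled by a random factor
            # in [0, 1) so many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
    raise ValueError("max_attempts must be at least 1")
```

In practice a retry like this is usually paired with a timeout on `op` and restricted to errors that are safe to retry; catching every `Exception` here is a simplification.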

Monitoring, Logging, Alerting & SLAs

Observability is what lets you know whether your system is working before customers do. This lesson covers the three pillars (metrics, logs, traces), the SRE-grade definitions of SLI / SLO / SLA, and the operational practices that turn raw telemetry into actionable alerts (RED method, USE method, error budgets, alert fatigue control). We tour the standard production stack (Prometheus, Grafana, OpenTelemetry, ELK, Datadog) and the pitfalls that cause teams to either drown in alerts or miss real incidents. By the end you can design an observability strategy and defend it in an interview against the question 'how would you know if this system was broken?'.

474

4
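The summary above mentions SLOs and error budgets. The arithmetic behind an error budget is short enough to sketch; the helper names below are hypothetical and the 30-day window is an assumption, not something the lesson specifies.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO tolerates in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# 500 failures out of 1,000,000 requests spends about half the budget.
print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```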

Behavioral Interviews

2 articles

Behavioral Interview

Debugging & Production Incident Stories

Production-incident questions are the operational-judgement probe. They test whether you can act calmly under live pressure, separate mitigation from root-cause work, and tell a blameless story that distinguishes systems-level lessons from individual blame. This lesson defines incident-grade storytelling (timeline craft with explicit T+0 / T+5 / T+30 markers), draws the line between fix, remediation, and prevention, walks through blameless-postmortem language you can use in the room without sounding rehearsed, and provides fully worked model STAR answers for the prompts you will hear most. Every model answer in this lesson focuses blame on systems and processes, never on people or teams. After this lesson you will be able to take any real incident from your career and shape it into an answer that scores on calm, judgement, and operational maturity simultaneously.

behavioral-interview

problem-solving

interview-strategy

1.1k

27

Behavioral Interview

Behavioral for Backend / Infra Engineers

Backend and infrastructure engineering loops grade for a cluster of behavioral signals that frontend and product engineering loops weight less heavily: reliability and oncall judgement, capacity and scale thinking, data-integrity decisions under pressure, and the empathy-for-the-pager dimension that distinguishes engineers who can be trusted with production. The behavioral signal is most often woven into the system-design round and the oncall-and-incident round, with explicit story shapes (the 3am page, the SLO trade-off) that interviewers reach for. This lesson defines the cross-cutting backend signals interviewers grade, walks through how the loop folds the behavioral signal into the technical rounds, maps the signals to the questions interviewers ask, and shows two model answers tailored to the incident-response and capacity-planning story shapes.

behavioral-interview

company-specific

capacity-planning

696

3

Community

18 items

Metrics, Logs, and Traces: The Three Pillars Without the Marketing

What each pillar actually does, when reaching for it pays off, and the budget I follow so I am not paying observability vendors more than I am paying for compute.

313

6

4.4 (9)

May 17, 2026

by @zarakamau

Technical Debt: When It's Debt vs When It's Just Old

Most code labeled technical debt is not debt at all. Here is the test I use to tell debt from age, and the rule I follow when paying it down.

code-organization

decision-making

926

18

4.2 (13)

May 5, 2026

by @nathanrivera

Interview Experience

Datadog Onsite: Five Hours of System Design

A Datadog senior backend onsite where four of the five rounds were system design, anchored on real telemetry-shaped problems.

distributed-systems

730

9

4.3 (11)

Apr 30, 2026

by @chloesaeed

Question Bundle

SRE Incident Drill Questions From Our On-Call

Four incident-shaped questions our on-call rotation uses to interview SREs. Each one starts with a symptom and asks you to write the smallest diagnostic snippet; the discussion matters more than the code.

827

26

4.3 (14)

Mar 30, 2026

by @alexsaeed

The Refactoring Playbook: Six Moves I Use Weekly

The small, low-risk refactoring moves I reach for every week, what each one fixes, and the order to apply them so the diff stays reviewable.

code-organization

489

10

Mar 22, 2026

by @arjunpatel

Database Migrations: A Zero-Downtime Playbook

Adding a column, renaming a column, dropping a column, splitting a table. The expand-contract pattern, the four-step rename, and the migration phases that have kept me from taking the site down.

659

16

Mar 22, 2026

by @hannahchakraborty

Infrastructure as Code: Terraform vs Pulumi vs CDK

Three IaC tools I have shipped to production, the trade-offs that actually matter (state, language, drift, blast radius), and the picks I would make today by team shape.

code-organization

832

26

4.0 (11)

Mar 21, 2026

by @nathanmurphy

Rate Limiting on the Edge with a Redis Token Bucket

Token bucket as a single Redis Lua script, evaluated atomically, deployed near the edge. The implementation, the failure modes, and what I would actually ship today.

463

14

4.4 (10)

Mar 17, 2026

by @antonmorgan
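The post above implements its token bucket as a Redis Lua script. The sketch below is not that script: it is a single-process, in-memory Python version meant only to show the refill-then-spend arithmetic a token bucket performs. The class name, parameter names, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket; not safe across processes or threads."""

    def __init__(
        self,
        capacity: float,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity          # start full
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = max(0.0, now - self.last_refill)
        # Tokens accrue continuously but never exceed the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A shared limiter needs the read, refill, and spend steps to happen atomically against one store, which is the role the post assigns to the Lua script.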

Docker for Devs Who Don't Want to Be Sysadmins

The 80% of Docker that I actually use day to day, the layer-cache rules that cut my image builds from 4 minutes to 30 seconds, and the four mistakes that haunted my first year.

234

5

4.2 (12)

Mar 14, 2026

by @hannahchakraborty

Testing Pyramid vs Trophy: Pick the Right Shape

Most teams ship the testing pyramid by accident. The trophy is what actually matches modern frontend work. Here is how to choose.

code-organization

644

9

4.2 (14)

Mar 4, 2026

by @aishasantos

The On-Call Handbook for Engineers Who Hate Being On-Call

The 12 hours before, the first hour of an incident, the playbook discipline that makes 3am pages survivable, and the post-rotation rituals that have stopped on-call from wrecking my health.

handling-failure

548

15

Mar 4, 2026

by @gabrielkhalil

CI/CD Pipelines: Stop Letting Them Rot

The maintenance habits that have kept my pipelines fast and trusted for years, the seven categories of rot I have actually seen, and the budget I run so the pipeline is treated as production code.

code-organization

1k

10

4.1 (9)

Jan 21, 2026

by @sanjayward

Feature Flags: Three Patterns I Keep Reusing

The release flag, the kill switch, and the experiment flag. Different lifetimes, different rollback rules, and the cleanup discipline that has stopped my flag system from becoming a graveyard.

code-organization

837

8

4.3 (13)

Jan 9, 2026

by @valentinamwangi

Connection Pooling, PgBouncer, and the Prisma Trap

What a connection pool actually does, why your Postgres falls over at 200 connections, where PgBouncer sits, and the prepared-statement bug that bites every Prisma team that adds PgBouncer the wrong way.

322

2

4.3 (13)

Jan 8, 2026

by @ananyanakamura

Idempotency Keys: The Pattern Stripe Taught Everyone

The key itself is the trivial part. The lifecycle, the storage, the body fingerprint, and the TTL are where production teams trip.

577

4

4.1 (12)

Dec 31, 2025

by @chloekelly

Webhook Design: Retries, Signatures, and Replay Protection

Sign requests. Dedupe by event id. Apply idempotently by resource id. Ack fast, process async. Tolerate out-of-order. Five concerns that turn a webhook into critical infrastructure.

1k

31

4.3 (11)

Dec 29, 2025

by @oliviadelgado
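Two of the concerns listed above, signing requests and deduplicating by event id, can be sketched with the Python standard library. This is an illustration under assumptions, not the post's implementation: the HMAC-SHA256 hex scheme, the `WebhookReceiver` class, and the in-memory set and queue are all hypothetical stand-ins for whatever a real provider and datastore would use.

```python
import hashlib
import hmac
from typing import List, Set, Tuple


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signature scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class WebhookReceiver:
    """Verifies, dedupes by event id, and enqueues for async processing."""

    def __init__(self, secret: bytes) -> None:
        self.secret = secret
        self.seen_event_ids: Set[str] = set()        # stand-in for a durable store
        self.queue: List[Tuple[str, bytes]] = []     # stand-in for a job queue

    def handle(self, event_id: str, body: bytes, signature_hex: str) -> int:
        """Return an HTTP-style status code for one delivery attempt."""
        if not verify(self.secret, body, signature_hex):
            return 401
        if event_id in self.seen_event_ids:
            return 200  # duplicate delivery: acknowledge, do not enqueue again
        self.seen_event_ids.add(event_id)
        self.queue.append((event_id, body))  # heavy work happens off the request path
        return 200
```

Replay protection usually also involves a signed timestamp with a tolerance window, which this sketch leaves out.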

AWS Lambda Cold Starts: What Actually Helps

Where the cold-start time really comes from, the four levers that have moved my p99 down by hundreds of milliseconds, and the optimizations I have tried and abandoned because they did not pay back.

596

18

4.3 (10)

Dec 23, 2025

by @liamsuzuki

Question Bundle

Incident Debrief Questions They Asked Me

A 4-question set drawn from the debrief portion of an SRE-flavored loop. Every behavioral prompt about an on-call story got paired with a design follow-up the interviewer used to stress-test the takeaway.

behavioral-interview

system-design-interview

798

25

Dec 18, 2025

by @lilyadeyemi