Reliability
System Design
Fault Tolerance, Redundancy & Failover
Fault tolerance is the property that lets a system keep working when components fail, and at any reasonable scale, components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that make systems degrade gracefully instead of crashing catastrophically (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end, you can design a system that survives the failure modes interviewers love to throw at you.
Monitoring, Logging, Alerting & SLAs
Observability is what lets you find out that your system is broken before your customers do. This lesson covers the three pillars (metrics, logs, traces), the SRE-grade definitions of SLI / SLO / SLA, and the operational practices that turn raw telemetry into actionable alerts (RED method, USE method, error budgets, alert fatigue control). We tour the standard production stack (Prometheus, Grafana, OpenTelemetry, ELK, Datadog) and the pitfalls that cause teams to either drown in alerts or miss real incidents. By the end, you can design an observability strategy and defend it in an interview against the question 'how would you know if this system was broken?'.
Behavioral Interviews
Debugging & Production Incident Stories
Production-incident questions are the operational-judgement probe. They test whether you can act calmly under live pressure, separate mitigation from root-cause work, and tell a blameless story that distinguishes systems-level lessons from individual blame. This lesson defines incident-grade storytelling (timeline craft with explicit T+0 / T+5 / T+30 markers), draws the line between fix, remediation, and prevention, walks through blameless-postmortem language you can use in the room without sounding rehearsed, and provides fully worked model STAR answers for the prompts you will hear most. Every model answer in this lesson focuses blame on systems and processes, never on people or teams. After this lesson you will be able to take any real incident from your career and shape it into an answer that scores on calm, judgement, and operational maturity simultaneously.
Behavioral for Backend / Infra Engineers
Backend and infrastructure engineering loops grade for a cluster of behavioral signals that frontend and product engineering loops weight less heavily: reliability and on-call judgement, capacity and scale thinking, data-integrity decisions under pressure, and the empathy-for-the-pager dimension that distinguishes engineers who can be trusted with production. The behavioral signal is most often woven into the system-design round and the on-call and incident round, with explicit story shapes (the 3am page, the SLO trade-off) that interviewers reach for. This lesson defines the cross-cutting backend signals interviewers grade, walks through how the loop folds the behavioral signal into the technical rounds, maps the signals to the questions interviewers ask, and shows two model answers tailored to the incident-response and capacity-planning story shapes.
Community
Metrics, Logs, and Traces: The Three Pillars Without the Marketing
What each pillar actually does, when reaching for it pays off, and the budget I follow so I am not paying observability vendors more than I am paying for compute.
Technical Debt: When It's Debt vs When It's Just Old
Most code labeled technical debt is not debt at all. Here is the test I use to tell debt from age, and the rule I follow when paying it down.
Datadog Onsite: Five Hours of System Design
A Datadog senior backend onsite where four of the five rounds were system design, anchored on real telemetry-shaped problems.
SRE Incident Drill Questions From Our On-Call
Four incident-shaped questions our on-call rotation uses to interview SREs. Each one starts with a symptom and asks you to write the smallest diagnostic snippet; the discussion matters more than the code.
The Refactoring Playbook: Six Moves I Use Weekly
The small, low-risk refactoring moves I reach for every week, what each one fixes, and the order to apply them so the diff stays reviewable.
Database Migrations: A Zero-Downtime Playbook
Adding a column, renaming a column, dropping a column, splitting a table. The expand-contract pattern, the four-step rename, and the migration phases that have kept me from taking the site down.
Infrastructure as Code: Terraform vs Pulumi vs CDK
Three IaC tools I have shipped to production, the trade-offs that actually matter (state, language, drift, blast radius), and the picks I would make today by team shape.
Rate Limiting on the Edge with a Redis Token Bucket
Token bucket as a single Redis Lua script, evaluated atomically, deployed near the edge. The implementation, the failure modes, and what I would actually ship today.
Docker for Devs Who Don't Want to Be Sysadmins
The 80% of Docker that I actually use day to day, the layer-cache rules that cut my image builds from 4 minutes to 30 seconds, and the four mistakes that haunted my first year.
Testing Pyramid vs Trophy: Pick the Right Shape
Most teams ship the testing pyramid by accident. The trophy is what actually matches modern frontend work. Here is how to choose.
The On-Call Handbook for Engineers Who Hate Being On-Call
The 12 hours before, the first hour of an incident, the playbook discipline that makes 3am pages survivable, and the post-rotation rituals that have stopped on-call from wrecking my health.
CI/CD Pipelines: Stop Letting Them Rot
The maintenance habits that have kept my pipelines fast and trusted for years, the seven categories of rot I have actually seen, and the budget I run so the pipeline is treated as production code.
Feature Flags: Three Patterns I Keep Reusing
The release flag, the kill switch, and the experiment flag. Different lifetimes, different rollback rules, and the cleanup discipline that has stopped my flag system from becoming a graveyard.
Connection Pooling, PgBouncer, and the Prisma Trap
What a connection pool actually does, why your Postgres falls over at 200 connections, where PgBouncer sits, and the prepared-statement bug that bites every Prisma team that adds PgBouncer the wrong way.
Idempotency Keys: The Pattern Stripe Taught Everyone
The key itself is the trivial part. The lifecycle, the storage, the body fingerprint, and the TTL are where production teams trip.
Webhook Design: Retries, Signatures, and Replay Protection
Sign requests. Dedupe by event id. Apply idempotently by resource id. Ack fast, process async. Tolerate out-of-order. Five concerns that turn a webhook into critical infrastructure.
AWS Lambda Cold Starts: What Actually Helps
Where the cold-start time really comes from, the four levers that have moved my p99 down by hundreds of milliseconds, and the optimizations I have tried and abandoned because they did not pay back.
Incident Debrief Questions They Asked Me
A 4-question set drawn from the debrief portion of an SRE-flavored loop. Every behavioral prompt about an on-call story got paired with a design follow-up the interviewer used to stress-test the takeaway.
