System Design Article
Monitoring, Logging, Alerting & SLAs
Difficulty: Medium
Observability is what lets you know whether your system is working before customers do. This lesson covers the three pillars (metrics, logs, traces), the SRE-grade definitions of SLI / SLO / SLA, and the operational practices that turn raw telemetry into actionable alerts (RED method, USE method, error budgets, alert fatigue control). We tour the standard production stack (Prometheus, Grafana, OpenTelemetry, ELK, Datadog) and the pitfalls that cause teams to either drown in alerts or miss real incidents. By the end you can design an observability strategy and defend it in an interview against the question 'how would you know if this system was broken?'.
What is Observability?
Observability is the property that lets you understand a system's internal state from its external outputs. The goal: when the system misbehaves, you can answer 'why' without attaching a debugger to production.
Observability is built on three pillars: metrics, logs, and traces. Each answers different questions; mature systems use all three.
---------- The three pillars ----------
metrics : what is happening, in aggregate? (numbers over time)
logs : what happened, in detail? (text events with context)
traces : how did one request flow through the system? (causal graph)
Pillar 1: Metrics
Metrics are numerical measurements over time. They are cheap, aggregate well, and form the basis of dashboards and alerts.
Common metric types:
- Counter: monotonically increasing total. Examples: http_requests_total, bytes_sent_total. Useful for rate calculations (rate(http_requests_total[5m])).
- Gauge: a value that can go up or down. Examples: cpu_utilization, memory_used_bytes, queue_depth.
- Histogram: distribution of values, bucketed. Example: http_request_duration_seconds. Lets you compute p50, p95, p99 latency.
- Summary: similar to a histogram, but quantiles are computed client-side. Cheap to query, but the quantiles cannot be meaningfully aggregated across instances, so prefer histograms when you need fleet-wide percentiles.
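To make the counter and histogram types concrete, here is a minimal sketch of the arithmetic behind a rate calculation and a bucket-based percentile estimate. The function names (perSecondRate, estimateQuantile) and the sample data are illustrative assumptions, not a Prometheus API; the interpolation only mirrors the general idea of estimating a quantile from cumulative buckets.

```javascript
// Per-second rate from two samples of a monotonically increasing counter.
// Each sample is { t: seconds, value: counter reading }.
function perSecondRate(older, newer) {
  const elapsed = newer.t - older.t;
  if (elapsed <= 0) throw new Error('samples must be in time order');
  // A drop in value means the process restarted and the counter reset to 0.
  const delta = newer.value >= older.value ? newer.value - older.value : newer.value;
  return delta / elapsed;
}

// Estimate a quantile from cumulative histogram buckets.
// buckets: [{ le: upper bound in seconds, count: observations <= le }], sorted by le.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Linear interpolation inside the bucket that contains the rank.
      const inBucket = count - lowerCount;
      if (inBucket === 0) return le;
      return lowerBound + (le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = le;
    lowerCount = count;
  }
  return buckets[buckets.length - 1].le;
}

// Hypothetical data: a counter sampled 5 minutes apart, and three latency buckets.
const rate = perSecondRate({ t: 0, value: 1000 }, { t: 300, value: 4000 });
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
];
const p95 = estimateQuantile(0.95, buckets);
console.log(rate, p95);
```

The estimate is only as precise as the bucket boundaries: here the p95 falls somewhere inside the 0.5 to 1.0 second bucket, and the interpolation assumes observations are spread evenly within it.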
Stack: Prometheus (the de facto standard), VictoriaMetrics, M3DB, Datadog, AWS CloudWatch, Google Cloud Monitoring.
The RED Method (for services)
For every service, dashboard these three metrics:
- Rate: requests per second
- Errors: error rate (5xx, exceptions)
- Duration: latency distribution (p50, p95, p99)
This is the 80/20 of service monitoring. If you only had three graphs per service, these are them.
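As a rough illustration of what those three graphs compute, the sketch below summarizes one window of request records into rate, error ratio, and latency percentiles. The record shape ({ status, durationMs }) and the helper names are assumptions for this example; a real service would usually let its metrics library do this aggregation.

```javascript
// Nearest-rank percentile over an ascending array of numbers.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  // Smallest value with at least p percent of the samples at or below it.
  const rank = Math.max(1, Math.ceil((p * sortedValues.length) / 100));
  return sortedValues[rank - 1];
}

// Summarize one window of request records into the three RED signals.
// Each record is { status: HTTP status code, durationMs: latency in milliseconds }.
function redSummary(records, windowSeconds) {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = records.filter((r) => r.status >= 500).length;
  return {
    ratePerSecond: records.length / windowSeconds,
    errorRatio: records.length === 0 ? 0 : errors / records.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Hypothetical sample: 100 requests in a 10-second window,
// latencies of 1..100 ms, and two 5xx responses.
const sample = Array.from({ length: 100 }, (_, i) => ({
  status: i < 2 ? 503 : 200,
  durationMs: i + 1,
}));
const red = redSummary(sample, 10);
console.log(red);
```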
The USE Method (for resources)
For every resource (CPU, memory, disk, network):
- Utilization: average busy time (e.g., 60% CPU)
- Saturation: queue depth or wait time (e.g., load average, IO wait)
- Errors: error counts (e.g., dropped packets, IO errors)
Useful for spotting capacity issues before they cause user-visible failures.
Pillar 2: Logs
Logs are text events emitted by services. Each log line typically has a timestamp, a severity level (INFO, WARN, ERROR), a message, and structured fields (request ID, user ID, latency).
Structured vs unstructured: structured logs (JSON or key-value) are queryable; unstructured logs are nearly useless at scale. Always log JSON in production.
{
"timestamp": "2026-04-26T10:00:00.123Z",
"level": "ERROR",
"service": "order-api",
"trace_id": "abc123",
"user_id": "u-42",
"order_id": "o-99",
"message": "Failed to charge card",
"error": "insufficient_funds",
"latency_ms": 234
}
Stack: ELK (Elasticsearch + Logstash + Kibana), Loki, Splunk, Datadog Logs, AWS CloudWatch Logs, Google Cloud Logging.
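A structured event like the JSON example above can be produced with very little code. The sketch below is one possible shape, not a specific logging library's API: formatLogLine and its parameters are hypothetical names, and the field names simply follow the example.

```javascript
// Build one structured log line as a JSON string.
// "now" is injectable so the output is reproducible in tests.
function formatLogLine(service, level, message, fields = {}, now = new Date()) {
  return JSON.stringify({
    ...fields, // caller-supplied context first, so core fields below cannot be overwritten
    timestamp: now.toISOString(),
    level,
    service,
    message,
  });
}

const line = formatLogLine(
  'order-api',
  'ERROR',
  'Failed to charge card',
  { trace_id: 'abc123', user_id: 'u-42', order_id: 'o-99', error: 'insufficient_funds', latency_ms: 234 },
  new Date('2026-04-26T10:00:00.123Z'),
);
console.log(line);

// Because every field is a key, a query is a filter on parsed fields
// rather than a substring match on the message.
const parsed = JSON.parse(line);
const matches = parsed.user_id === 'u-42' && parsed.level === 'ERROR';
```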
Logging at scale
A service handling 10K req/sec at 5 logs per request is 50K log lines per second per service. With 100 services, you are at 5M log lines per second. Storage and search become real engineering challenges.
Strategies:
- Sampling: at high volume, log only a fraction of requests in detail (1 in 100, but always log errors).
- Tiered retention: hot tier (last 7 days, fast search), warm tier (30 days, slower search), cold tier (90+ days, archive).
- Structured fields, not message strings: searching on user_id=u-42 is fast; searching on "failed for user u-42" is slow.
- Aggregate first, log second: use metrics for counts, logs for context. Do not log every successful request just to count them.
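The sampling strategy above can be sketched as a small decision function. This is an illustrative assumption about how one might implement it: shouldLog and fnv1a are hypothetical helpers, and hashing the trace ID is one way to make every service keep or drop the same request.

```javascript
// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(text) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits unsigned
  }
  return hash;
}

// Decide whether to write a detailed log for a request.
// Errors are always kept; everything else is sampled deterministically by trace ID.
function shouldLog({ traceId, level }, sampleOneIn = 100) {
  if (level === 'ERROR') return true;
  return fnv1a(traceId) % sampleOneIn === 0;
}

// Volume arithmetic from the text: requests/sec x logs/request x services.
const linesPerSecond = 10_000 * 5 * 100;
// Approximate volume at 1-in-100, before adding the always-kept error lines.
const sampledLinesPerSecond = linesPerSecond / 100;
console.log(linesPerSecond, sampledLinesPerSecond);
```

Deterministic sampling keeps whole requests together: if the trace ID is sampled in, all of its log lines across services are kept, which is more useful for debugging than an independent coin flip per line.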
Pillar 3: Distributed Tracing
A trace shows the path of one request through every service it touched. Each operation (HTTP call, DB query, cache lookup) is a span with a start time, end time, and parent span. Spans link into a tree.
---------- A trace for one checkout request ----------
[span: HTTP POST /checkout 200ms]
  [span: validate user 5ms]
  [span: reserve inventory 30ms]
    [span: DB UPDATE inventory 15ms]
  [span: charge card 120ms]
    [span: HTTP POST stripe.com 80ms]
  [span: write order record 20ms]
    [span: DB INSERT orders 8ms]
  [span: emit OrderCreated event 3ms]
A trace tells you: 'this request was slow because the Stripe call took 80ms', or 'this error originated in the inventory service, propagated through the order service'.
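Finding the span that explains a slow request is mostly arithmetic on the span tree. The sketch below uses the checkout trace above, flattened into records with parent links; the parent assignments are an assumption inferred from the span names (for example, that the Stripe call sits under 'charge card'), and selfTimes is a hypothetical helper rather than part of any tracing tool.

```javascript
// Spans from the checkout trace, flattened with parent links (parentage assumed).
const spans = [
  { id: 1, parentId: null, name: 'HTTP POST /checkout', durationMs: 200 },
  { id: 2, parentId: 1, name: 'validate user', durationMs: 5 },
  { id: 3, parentId: 1, name: 'reserve inventory', durationMs: 30 },
  { id: 4, parentId: 3, name: 'DB UPDATE inventory', durationMs: 15 },
  { id: 5, parentId: 1, name: 'charge card', durationMs: 120 },
  { id: 6, parentId: 5, name: 'HTTP POST stripe.com', durationMs: 80 },
  { id: 7, parentId: 1, name: 'write order record', durationMs: 20 },
  { id: 8, parentId: 7, name: 'DB INSERT orders', durationMs: 8 },
  { id: 9, parentId: 1, name: 'emit OrderCreated event', durationMs: 3 },
];

// Self time = a span's duration minus the time spent in its direct children.
// This assumes children run sequentially; overlapping children need interval math instead.
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const span of allSpans) {
    if (span.parentId !== null) {
      childTotal.set(span.parentId, (childTotal.get(span.parentId) ?? 0) + span.durationMs);
    }
  }
  return allSpans.map((span) => ({
    name: span.name,
    selfMs: span.durationMs - (childTotal.get(span.id) ?? 0),
  }));
}

// Rank spans by where the time was actually spent.
const ranked = selfTimes(spans).sort((a, b) => b.selfMs - a.selfMs);
console.log(ranked[0]);
```

Ranking by self time rather than total duration avoids blaming a parent span for time that was really spent in one of its children.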
Stack: OpenTelemetry (the open standard for instrumentation), Jaeger, Zipkin, Tempo, Datadog APM, New Relic, Honeycomb, Lightstep.
Trace context propagation
The critical mechanism: every cross-service call must carry a trace ID in headers (traceparent per W3C spec). Each service starts a new span as a child of the incoming span. Without this propagation, you have local logs but no end-to-end view.
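// Express middleware: extract or create trace context.
// startSpan() is a placeholder for your tracing SDK (for example, OpenTelemetry).
const crypto = require('node:crypto');
const randomHex = (bytes) => crypto.randomBytes(bytes).toString('hex');
// W3C traceparent format: version-traceId-parentSpanId-flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/;
function tracing(req, res, next) {
  const match = TRACEPARENT.exec(req.headers['traceparent'] ?? '');
  const traceId = match ? match[1] : randomHex(16); // join the caller's trace or start a new one
  const parentSpanId = match ? match[2] : null;
  const spanId = randomHex(8); // this service's own span, child of the incoming span
  req.trace = { traceId, spanId, span: startSpan('http.request', { traceId, spanId, parentSpanId }) };
  res.on('finish', () => req.trace.span.end({ status: res.statusCode }));
  next();
}
// Outgoing HTTP call: propagate the trace ID with this service's span as the parent
async function callDownstream(req, url) {
  const { traceId, spanId } = req.trace;
  return fetch(url, {
    headers: { traceparent: `00-${traceId}-${spanId}-01` },
  });
}
SLI, SLO, SLA, Error Budgets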
SRE vocabulary that every senior engineer must know precisely.
SLI (Service Level Indicator)
A quantitative measure of some aspect of the service. Examples:
- 'fraction of HTTP requests returning 2xx within 100 ms'
- 'fraction of orders processed within 5 seconds'
- 'data freshness lag, p99'
Good SLIs are user-centric (what does the customer experience?) and quantifiable.
SLO (Service Level Objective)
A target value for an SLI. Examples:
- '99.9% of HTTP requests return 2xx within 100 ms (measured over 30 days)'
- '99.5% of orders complete within 5 seconds'
- 'p99 ingest lag < 60 seconds'
SLOs are internal targets. They are aggressive enough to keep customers happy but achievable enough to be sustainable.
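The relationship between an SLI and its SLO is easy to see in code: the SLI is a ratio computed from events, and the SLO is a threshold applied to it. The sketch below assumes request records shaped as { status, durationMs }; availabilitySli and meetsSlo are hypothetical names, and the sample window is invented.

```javascript
// SLI: fraction of requests that returned 2xx within the latency threshold.
function availabilitySli(records, thresholdMs = 100) {
  if (records.length === 0) return 1; // no traffic: treat the window as good
  const good = records.filter(
    (r) => r.status >= 200 && r.status < 300 && r.durationMs <= thresholdMs,
  ).length;
  return good / records.length;
}

// SLO: a target value for that SLI over the measurement window.
function meetsSlo(sli, target = 0.999) {
  return sli >= target;
}

// Hypothetical window: 10,000 requests, of which 8 were slow and 4 failed.
const records = [
  ...Array.from({ length: 9988 }, () => ({ status: 200, durationMs: 40 })),
  ...Array.from({ length: 8 }, () => ({ status: 200, durationMs: 350 })),
  ...Array.from({ length: 4 }, () => ({ status: 500, durationMs: 20 })),
];
const sli = availabilitySli(records);
console.log(sli, meetsSlo(sli));
```

Note that a slow 200 counts against this SLI just like a 500 does, which is what makes the indicator user-centric rather than server-centric.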
SLA (Service Level Agreement)
A contractual promise to customers, typically with financial penalties for breaches. It should be looser than the SLO, so that the internal target is breached before the contract is (the buffer between SLO and SLA is your safety margin).
Example:
- Internal SLO: 99.9% availability.
- External SLA: 99.5% availability with 10% credits for breaches.
Error Budget
If the SLO is 99.9% over 30 days, the downtime budget is 0.1% of that window, about 43 minutes. The team has a 'budget' of 43 minutes of downtime per month, spent on:
- Risky launches (new features, infrastructure changes).
- Experiments (chaos engineering, canary failures).
- The unplanned (real outages).
The discipline: if the budget is consumed (more than 43 minutes of breaches this month), the team freezes risky changes until reliability recovers. This balances feature velocity with reliability without endless meetings about whether to ship.
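The budget arithmetic is simple enough to sketch directly. The helper names below (errorBudgetMinutes, remainingBudget) are hypothetical, the incident list is illustrative, and the freeze rule is the one stated above: freeze once the budget is fully consumed.

```javascript
// Error budget in minutes for an availability SLO over a window of days.
function errorBudgetMinutes(slo, windowDays = 30) {
  return (1 - slo) * windowDays * 24 * 60;
}

// Replay a list of incidents against the budget and report what is left.
function remainingBudget(slo, incidents, windowDays = 30) {
  const budget = errorBudgetMinutes(slo, windowDays);
  const spent = incidents.reduce((sum, incident) => sum + incident.minutes, 0);
  const remaining = budget - spent;
  return { budget, spent, remaining, freezeLaunches: remaining <= 0 };
}

// Illustrative month: one outage, one experiment, one bad launch.
const status = remainingBudget(0.999, [
  { day: 5, reason: 'unplanned outage', minutes: 12 },
  { day: 12, reason: 'chaos experiment', minutes: 3 },
  { day: 18, reason: 'launch brownout', minutes: 25 },
]);
console.log(status);
```

A 99.9% target over 30 days works out to roughly 43.2 minutes, so the 40 minutes spent here leave about 3 minutes before the freeze rule would apply.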
---------- Error budget mechanics ----------
SLO: 99.9% over 30 days = 43.2 min budget
Day 5: unplanned outage = 12 min consumed (31.2 left)
Day 12: chaos experiment = 3 min consumed (28.2 left)
Day 18: launch caused brownout = 25 min consumed (3.2 left)
Day 22: budget nearly exhausted -> launch freeze
Day 30: budget resets
Alerting: Done Right
The goal of alerting is to wake the right person at the right time with the right context. Done wrong, alerts are noise that nobody reads.
Alert on Symptoms, Not Causes
Wrong: 'CPU > 90% on web-3'. The user does not care about CPU. They care about whether the page loaded.
Right: 'p99 latency > 500ms for 5 minutes' or 'error rate > 1% for 2 minutes'. These are user-visible symptoms; if they fire, something is genuinely broken from the customer's view.
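The 'for 5 minutes' part of a symptom alert matters as much as the threshold: it keeps a single bad sample from paging anyone. A minimal sketch of that evaluation follows; firesFor is a hypothetical helper and the latency samples are made up.

```javascript
// Fire only when the condition has held for the whole "for" duration.
// samples: [{ t: seconds, value }] in time order, one per evaluation interval.
function firesFor(samples, threshold, forSeconds) {
  let breachStart = null;
  for (const { t, value } of samples) {
    if (value > threshold) {
      if (breachStart === null) breachStart = t;
      if (t - breachStart >= forSeconds) return true;
    } else {
      breachStart = null; // condition cleared; restart the clock
    }
  }
  return false;
}

// Hypothetical p99 latency samples in milliseconds, one per minute.
const toSamples = (values) => values.map((value, i) => ({ t: i * 60, value }));
const spike = toSamples([300, 800, 320, 310, 305, 300, 290]);
const sustained = toSamples([300, 650, 700, 720, 690, 710, 705]);

// Alert rule: p99 latency > 500 ms for 5 minutes.
console.log(firesFor(spike, 500, 300), firesFor(sustained, 500, 300));
```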
SLO-Based Alerts
The modern best practice. Alert when the error budget is being burned faster than expected.
Example:
- SLO: 99.9% over 30 days = 43.2 min budget.
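- Fast-burn alert: 'burn rate > 14.4x over the last hour, and still above it over the last 5 minutes'. A burn rate of 1x spends exactly the budget over 30 days; at 14.4x, 2% of the 30-day budget is gone in one hour -> page immediately.
- Slow-burn alert: 'burn rate > 6x over the last 6 hours, and still above it over the last 30 minutes'. At 6x, 5% of the budget is gone in six hours -> page (less urgent).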
Burn-rate alerts page on real customer impact, not on noise. Google's SRE workbook has the canonical formulas.
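Burn rate itself is a one-line ratio, which the sketch below makes explicit. The function names are hypothetical; the 14.4x and 6x thresholds are the commonly cited values from the Google SRE Workbook for a 30-day window, and the two-window check is one way to confirm that a burn is both significant and still ongoing.

```javascript
// Burn rate = observed error ratio / error ratio the SLO allows.
// A burn rate of 1 spends exactly the whole budget over the SLO window.
function burnRate(errorRatio, slo) {
  return errorRatio / (1 - slo);
}

// Fraction of the window's budget spent by holding a burn rate for some hours.
function budgetFractionSpent(rate, hours, windowDays = 30) {
  return (rate * hours) / (windowDays * 24);
}

// Multiwindow check: the long window shows the burn is significant,
// the short window shows it is still happening.
function shouldPage(longWindowErrorRatio, shortWindowErrorRatio, slo, threshold) {
  return (
    burnRate(longWindowErrorRatio, slo) > threshold &&
    burnRate(shortWindowErrorRatio, slo) > threshold
  );
}

console.log(budgetFractionSpent(14.4, 1)); // about 2% of a 30-day budget
console.log(budgetFractionSpent(6, 6)); // about 5% of a 30-day budget
console.log(shouldPage(0.02, 0.03, 0.999, 14.4));
```

With a 99.9% SLO, the allowed error ratio is 0.1%, so a sustained 2% error ratio is a burn rate of about 20x and clears the fast-burn threshold.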
Severity Levels
- P0 / Page: customers are impacted now; wake someone up.
- P1 / Ticket: customers will be impacted soon if not addressed; respond within hours.
- P2 / Notify: trend that matters; review during business hours.
- Info: log only, no alert.
Be ruthless. If it does not need a human now, it is not P0.
Alert Fatigue
The single most common monitoring failure. Teams with hundreds of alerts firing per week stop reading them; the one alert that mattered gets ignored.
Mitigations:
- Audit alerts quarterly. Delete or downgrade any that fire without action being taken.
- Aggregate similar alerts (one alert for 'database is slow' instead of 100 alerts for individual queries).
- Page only on user-impacting symptoms; route everything else to ticket queues.
- Track 'alert volume per on-call shift'; if it exceeds 5-10, the on-call is broken.
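A quarterly audit can start from data rather than opinion. The sketch below assumes you can export a list of alert firings with a flag for whether anyone acted on each; auditAlerts and the 0.5 cutoff are illustrative choices, not an established standard.

```javascript
// Flag alerts whose firings rarely lead to action: candidates to delete or downgrade.
// firings: [{ alert: name, actionTaken: boolean }] collected over the audit period.
function auditAlerts(firings, minActionRatio = 0.5) {
  const stats = new Map();
  for (const { alert, actionTaken } of firings) {
    const entry = stats.get(alert) ?? { fired: 0, actioned: 0 };
    entry.fired += 1;
    if (actionTaken) entry.actioned += 1;
    stats.set(alert, entry);
  }
  return [...stats]
    .filter(([, s]) => s.actioned / s.fired < minActionRatio)
    .map(([alert, s]) => ({ alert, fired: s.fired, actioned: s.actioned }));
}

// Hypothetical firing history for two alerts.
const noisy = auditAlerts([
  { alert: 'cpu_high_web', actionTaken: false },
  { alert: 'cpu_high_web', actionTaken: false },
  { alert: 'cpu_high_web', actionTaken: true },
  { alert: 'checkout_error_rate', actionTaken: true },
]);
console.log(noisy);
```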
Dashboards: Done Right
Three types of dashboards. Each serves a purpose; none replaces the others.
Service Dashboard (RED + USE)
For each service: rate, errors, duration of HTTP requests; CPU/memory/disk of underlying instances. Used by service owners daily.
Customer Journey Dashboard
User-visible flows: 'login success rate', 'checkout completion rate', 'p95 search latency'. Used by product teams and incident commanders.
SLO Dashboard
For each SLO: current value, 30-day budget remaining, burn rate. Used by SRE and engineering leads to govern launch decisions.
Do not build dashboards with 100 graphs that nobody reads. Three good dashboards beat thirty cluttered ones.
Tool Comparison
| Workload | Recommended stack | Notes |
|---|---|---|
| Metrics + alerting (open source) | Prometheus + Grafana + Alertmanager | The de facto standard; pull-based scraping. |
| Metrics + alerting (managed) | Datadog, Grafana Cloud, New Relic | All-in-one; expensive at scale but no infra. |
| Logs (open source) | Loki (with Grafana) or ELK | Loki for cheaper storage; ELK for richer search. |
| Logs (managed) | Datadog Logs, Splunk, Sumo Logic | Easier to operate; per-GB pricing. |
| Traces (open source) | Jaeger, Tempo, Zipkin | OpenTelemetry as the instrumentation standard. |
| Traces (managed) | Datadog APM, Honeycomb, Lightstep, New Relic | Mature query interfaces; high-cardinality friendly. |
| Unified observability platform | Datadog, New Relic, Grafana Cloud | One UI for all three pillars. |
Default modern recommendation: OpenTelemetry for instrumentation everywhere; Prometheus + Grafana for metrics if self-hosting; Datadog or Grafana Cloud if you can pay for managed.
Operational Practices
- Runbooks per alert. Every alert links to a runbook with: 'what does this mean? what should I check first? what are the standard remediations?'
- On-call rotations. Healthy on-call: predictable schedule, reasonable load (5-10 alerts per shift max), follow-the-sun for global teams.
- Post-mortems for every significant incident. Blameless format: timeline, root cause, contributing factors, action items.
- Game days. Quarterly exercises simulating major failures. Validate that runbooks work and on-call humans know how to use them.
- Audit alerts quarterly. Delete what is noisy. If an alert that matters is being ignored, fix its signal or routing rather than leaving it to be ignored.
How to Talk About This in an Interview
- Lead with the three pillars. 'I would instrument metrics, logs, and traces; each answers different questions.'
- Use the RED method for services and USE for resources. 'For every service: rate, errors, duration. For every resource: utilization, saturation, errors.'
- Define your SLOs. 'For this product I would target 99.9% availability with p95 latency under 200ms over 30 days. The error budget is ~43 minutes; we freeze risky launches if we exceed it.'
- Always alert on symptoms, not causes. 'I alert when error rate or latency exceeds the SLO burn rate, not on CPU or memory thresholds.'
- Mention OpenTelemetry. The modern open standard for trace and metric instrumentation; using it signals you are aware of current practice.
- Acknowledge alert fatigue. 'I audit alerts quarterly and delete or downgrade any that fire without driving action. Healthy on-call is 5-10 alerts per shift, max.'
Quick Review
- Three pillars: metrics (numbers), logs (events with context), traces (causal request graph).
- RED for services (rate, errors, duration); USE for resources (utilization, saturation, errors).
- SLI = a measure; SLO = a target for the SLI; SLA = a contractual promise (looser than SLO).
- Error budget: the amount of unreliability you can spend on launches and experiments.
- Alert on user-visible symptoms (latency, errors), not on internal causes (CPU).
- SLO burn-rate alerts are the modern best practice.
- OpenTelemetry is the open instrumentation standard.
- Runbooks, blameless post-mortems, game days, alert audits are the operational disciplines.
Real-World Examples
How real systems implement this in production
Google pioneered the error-budget model. Each service has an SLO (often 99.9% or 99.99%); the budget is its complement (1 - SLO). If the budget is healthy, the service team can ship aggressively. If the budget is consumed by outages or risky launches, all non-critical changes are frozen until reliability recovers. This codified rule replaces endless meetings about whether to ship with a clear, measurable gate.
Trade-off: Error budgets give engineering teams real autonomy: ship with confidence when the budget is healthy, slow down when it is not. The cost is cultural - teams must trust the SLO (which means setting it carefully) and accept the freeze rule (which means not gaming it). Done right, it dramatically improves both velocity and reliability.
Netflix built Atlas (a high-cardinality time-series database) and Mantis (a stream-processing observability platform) to handle their scale: trillions of metric events per day across thousands of services. Engineers can query metrics by any combination of tags (service, region, AZ, AB-test cohort) without precomputed aggregations.
Trade-off: High-cardinality metrics give Netflix unmatched diagnostic power but cost orders of magnitude more storage and compute than fixed-cardinality systems like Prometheus. The investment is justified at Netflix scale; for smaller systems, Prometheus + Grafana is a better cost-benefit balance.
Honeycomb pioneered the 'observability 2.0' approach: instead of pre-aggregating metrics, store every event with all its attributes (user_id, region, build version, AB-test cohort) and let engineers slice and group at query time. This enables debugging questions that traditional metrics cannot answer ('why is p99 latency high only for users on Android in the new build?').
Trade-off: High-cardinality observability is a paradigm shift that lets you answer questions you did not anticipate. Cost: storage and query infrastructure are more complex. Cultural: teams must learn to think in terms of events, not pre-defined metric names. The win is unprecedented diagnostic power for complex distributed systems.
Coinbase, a cryptocurrency exchange, uses Datadog as its unified observability platform: metrics, logs, APM traces, and security monitoring in one UI. With trading volumes spiking unpredictably during crypto market events, Coinbase needs to see the full picture in seconds: which service is slow, which trace is anomalous, which logs explain it.
Trade-off: A unified managed platform like Datadog provides immediate productivity (engineers learn one UI, correlations across pillars are built-in) at significant cost ($1M+ per month at scale). For high-revenue products where minutes of downtime are huge losses, the ROI is clear. For lower-stakes products, OSS Prometheus + Grafana + Loki is a viable alternative.
Common Interview Questions
Questions you might be asked about this topic
Three pillars: (1) Metrics - dashboards using RED (rate, errors, duration) per service and USE (utilization, saturation, errors) per resource. (2) Logs - structured JSON, indexed in Loki or Elasticsearch, with sampling at high volume. (3) Traces - OpenTelemetry instrumentation, trace context propagated via W3C traceparent header, viewable in Jaeger or Datadog APM. Define SLOs (e.g., 99.9% availability, p95 < 200ms). Alert on SLO burn rate, not on causes. On-call rotations with runbooks per alert. Blameless post-mortems after every significant incident, and quarterly game days.
SLI (Indicator): a quantitative measure of an aspect of service quality, e.g., 'fraction of HTTP requests returning 2xx within 100ms'. SLO (Objective): an internal target value for the SLI, e.g., '99.9% of requests over 30 days'. SLA (Agreement): a contractual promise to customers, typically looser than the SLO (e.g., '99.5% with credits for breach'). The buffer between SLO and SLA is the safety margin. Error budget = 1 - SLO; the amount of unreliability the team can spend on launches and experiments before freezing changes.
RED is for services: Rate (requests/second), Errors (error rate), Duration (latency p50/p95/p99). Apply to every service; if you only had three graphs per service, these are them. USE is for resources: Utilization (% busy), Saturation (queue depth or wait time), Errors (drops, IO errors). Apply to every CPU, memory, disk, network. RED tells you whether the service is meeting its contract; USE tells you whether the resource is healthy. Use both: RED for SLO alerts, USE for capacity and saturation diagnosis.
Alert only on user-visible symptoms (latency, error rate, SLO burn rate), not on internal causes (CPU, memory, disk). Use SLO burn-rate alerts: page immediately on fast burn (about 14x the sustainable rate over the last hour, confirmed over the last 5 minutes), and page less urgently on slow burn (about 6x over the last 6 hours). Severity tiers: P0 (page now), P1 (ticket, respond within hours), P2 (review during business hours). Audit alerts quarterly; delete what fires without action. Aggregate similar alerts. Track alerts-per-shift; healthy on-call is 5-10.
Trace ID is propagated through every cross-service call via the W3C traceparent header. Each service emits spans (operation start, end, attributes) tagged with the trace ID. Spans assemble into a tree showing the full request path. To debug: find the trace by user ID, request ID, or sample of slow requests; visualize the timeline; identify the slowest span. Common findings: a downstream HTTP call took 80% of the time; a DB query was missing an index; a sequential operation should have been parallelized. OpenTelemetry is the open standard; Jaeger or Datadog APM is the typical UI.
Interview Tips
How to discuss this topic effectively
Lead with the three pillars (metrics, logs, traces) and what each answers. Treating monitoring as one undifferentiated thing is a junior-level move.
Always quote SLOs as numbers. '99.9% availability with p95 < 200ms over 30 days' is much stronger than 'high availability'.
Mention error budgets and the freeze rule. They are how mature teams balance reliability and feature velocity.
Distinguish symptom alerts from cause alerts. CPU is a cause; a latency SLO breach is a symptom. Page only on symptoms.
Bring up OpenTelemetry by name. It is the current open standard and the right answer to 'how do you instrument trace context propagation?'.
Common Mistakes
Pitfalls to avoid in interviews
Alerting on causes instead of symptoms
An alert that fires when CPU > 90% wakes the on-call without telling them whether customers are affected. Alert on user-visible symptoms (latency, error rate, SLO burn) and use cause metrics for diagnosis after the symptom alert fires.
Treating SLA, SLO, and SLI as synonyms
SLI is a measure (% of requests under 100ms). SLO is the internal target (99.9%). SLA is a contractual promise to customers (99.5%, with credits for breaches). Internal SLO must be tighter than the SLA so you have a buffer. Conflating them undermines your reliability vocabulary.
Logging unstructured strings
Unstructured log lines ('Failed to charge user 42 with $99.99') are nearly unsearchable at scale. Always log structured JSON ({"error": "charge_failed", "user_id": "u-42", "amount": 99.99}). Search on user_id=u-42 is fast; substring search is slow and brittle.
Drowning in alerts
Hundreds of alerts per week train the on-call to ignore them; the one that mattered slips through. Audit alerts quarterly; delete or downgrade anything that fires without action. Healthy on-call sees 5-10 actionable alerts per shift.
Building monitoring without trace context propagation
Without trace IDs in cross-service calls, you have local logs and metrics but no way to debug 'why was this one request slow?' across services. Adopt OpenTelemetry from day one; passing trace context (W3C traceparent header) is non-negotiable for microservices.
