System Design Article

Monitoring, Logging, Alerting & SLAs

Difficulty: Medium

Observability is what lets you find out that your system is broken before your customers do. This lesson covers the three pillars (metrics, logs, traces), the SRE definitions of SLI, SLO, and SLA, and the operational practices that turn raw telemetry into actionable alerts (RED method, USE method, error budgets, alert-fatigue control). We tour the standard production stack (Prometheus, Grafana, OpenTelemetry, ELK, Datadog) and the pitfalls that cause teams to either drown in alerts or miss real incidents. By the end, you should be able to design an observability strategy and defend it in an interview against the question 'how would you know if this system was broken?'.

What is Observability?

Observability is the property that lets you understand a system's internal state from its external outputs. The goal: when the system misbehaves, you can answer 'why' without attaching a debugger to production.

Observability is built on three pillars: metrics, logs, and traces. Each answers different questions; mature systems use all three.

---------- The three pillars ----------
  metrics    : what is happening, in aggregate? (numbers over time)
  logs       : what happened, in detail? (text events with context)
  traces     : how did one request flow through the system? (causal graph)

Pillar 1: Metrics

Metrics are numerical measurements over time. They are cheap, aggregate well, and form the basis of dashboards and alerts.

Common metric types:

  • Counter: monotonically-increasing total. Examples: http_requests_total, bytes_sent_total. Useful for rate calculations (rate(http_requests_total[5m])).
  • Gauge: a value that can go up or down. Examples: cpu_utilization, memory_used_bytes, queue_depth.
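  • Histogram: distribution of values, bucketed. Examples: http_request_duration_seconds. Lets you estimate p50, p95, p99 latency, and buckets can be summed across instances.
  • Summary: similar to a histogram, but quantiles are computed client-side. Cheap to query, but the precomputed quantiles cannot be meaningfully aggregated across instances; prefer histograms when you need fleet-wide percentiles.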
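
To make the counter and histogram types concrete, here is a minimal sketch of the arithmetic a query engine performs on them. The function names and sample shapes are hypothetical, and real engines (Prometheus's rate() and histogram_quantile(), for instance) handle more edge cases than this does.

```javascript
// Per-second rate between two counter samples. A decrease is treated as a
// counter reset (the process restarted and the counter began again at zero).
function ratePerSecond(prev, curr) {
  const elapsedSec = (curr.tsMs - prev.tsMs) / 1000;
  if (elapsedSec <= 0) return 0;
  const increase = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  return increase / elapsedSec;
}

// Estimate a quantile from cumulative histogram buckets by linear
// interpolation inside the bucket that contains the target rank.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      const inBucket = count - lowerCount;
      // Cannot interpolate into the +Inf bucket or an empty bucket.
      if (le === Infinity || inBucket === 0) return lowerBound;
      return lowerBound + (le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = le;
    lowerCount = count;
  }
  return lowerBound;
}

// Request durations in seconds: 800 under 0.1 s, 950 under 0.5 s, and so on.
const buckets = [
  { le: 0.1, count: 800 },
  { le: 0.5, count: 950 },
  { le: 1.0, count: 990 },
  { le: Infinity, count: 1000 },
];
console.log('p50 estimate (s):', histogramQuantile(0.5, buckets));
console.log('p95 estimate (s):', histogramQuantile(0.95, buckets));
```

The estimate is only as precise as the bucket boundaries, which is why bucket edges are usually placed around the latency thresholds you care about.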

Stack: Prometheus (the de facto standard), VictoriaMetrics, M3DB, Datadog, AWS CloudWatch, Google Cloud Monitoring.

The RED Method (for services)

For every service, dashboard these three metrics:

  • Rate: requests per second
  • Errors: error rate (5xx, exceptions)
  • Duration: latency distribution (p50, p95, p99)

This is the 80/20 of service monitoring. If you only had three graphs per service, these are them.
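
As an illustration, the three RED numbers can be derived from raw request records like this. It is a sketch only: the record shape ({ status, durationMs }) and the function names are assumptions, and a production system would compute these from counters and histograms rather than from an in-memory array.

```javascript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const idx = Math.max(0, Math.ceil((p * sorted.length) / 100) - 1);
  return sorted[idx];
}

// requests: [{ status, durationMs }] observed during windowSeconds.
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    ratePerSec: requests.length / windowSeconds,                      // Rate
    errorRatio: requests.length ? errors / requests.length : 0,       // Errors
    p50: percentile(durations, 50),                                   // Duration
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// 100 requests in a 10 s window: durations 1..100 ms, every 20th one fails.
const sample = Array.from({ length: 100 }, (_, i) => ({
  status: (i + 1) % 20 === 0 ? 503 : 200,
  durationMs: i + 1,
}));
console.log(redSummary(sample, 10));
```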

The USE Method (for resources)

For every resource (CPU, memory, disk, network):

  • Utilization: average busy time (e.g., 60% CPU)
  • Saturation: queue depth or wait time (e.g., load average, IO wait)
  • Errors: error counts (e.g., dropped packets, IO errors)

Useful for spotting capacity issues before they cause user-visible failures.
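
A small sketch of how the three USE numbers might be derived for a CPU-like resource from two snapshots. The snapshot fields (busyMs, runQueue, errors) are hypothetical; on a real host they would come from the operating system's counters.

```javascript
// prev/curr: { tsMs, busyMs, runQueue, errors }
//   busyMs   : cumulative busy time summed over all cores
//   runQueue : number of runnable tasks at snapshot time
//   errors   : cumulative error count
function useSnapshot(prev, curr, cores) {
  const wallMs = curr.tsMs - prev.tsMs;
  return {
    // Share of total capacity that was busy during the interval (0..1).
    utilization: (curr.busyMs - prev.busyMs) / (wallMs * cores),
    // Work waiting beyond what the cores can run right now.
    saturation: Math.max(0, curr.runQueue - cores),
    // New errors during the interval.
    errors: curr.errors - prev.errors,
  };
}

const before = { tsMs: 0, busyMs: 0, runQueue: 2, errors: 3 };
const after = { tsMs: 10000, busyMs: 24000, runQueue: 6, errors: 4 };
console.log(useSnapshot(before, after, 4));
```

In this example the resource is only 60% utilized on average yet has two tasks queued, which is the kind of gap between utilization and saturation the method is meant to expose.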

Pillar 2: Logs

Logs are text events emitted by services. Each log line typically has a timestamp, a severity level (INFO, WARN, ERROR), a message, and structured fields (request ID, user ID, latency).

Structured vs unstructured: structured logs (JSON or key-value) are queryable; unstructured logs are nearly useless at scale. Always log JSON in production.

{
    "timestamp": "2026-04-26T10:00:00.123Z",
    "level": "ERROR",
    "service": "order-api",
    "trace_id": "abc123",
    "user_id": "u-42",
    "order_id": "o-99",
    "message": "Failed to charge card",
    "error": "insufficient_funds",
    "latency_ms": 234
}

Stack: ELK (Elasticsearch + Logstash + Kibana), Loki, Splunk, Datadog Logs, AWS CloudWatch Logs, Google Cloud Logging.
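
A minimal sketch of a logger that emits lines in the shape shown above. The createLogger name and its injectable write function are illustrative assumptions, not any particular library's API; in practice you would likely use an established logging library.

```javascript
// Build a logger that writes one JSON object per line.
// `write` is injectable for testing; by default lines go to stdout.
function createLogger(service, write = (line) => process.stdout.write(line + '\n')) {
  const emit = (level, message, fields = {}) => {
    // Caller-supplied fields come last, so they can override the defaults.
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
      ...fields,
    };
    write(JSON.stringify(entry));
    return entry;
  };
  return {
    info: (message, fields) => emit('INFO', message, fields),
    warn: (message, fields) => emit('WARN', message, fields),
    error: (message, fields) => emit('ERROR', message, fields),
  };
}

const log = createLogger('order-api');
log.error('Failed to charge card', {
  trace_id: 'abc123',
  user_id: 'u-42',
  order_id: 'o-99',
  error: 'insufficient_funds',
  latency_ms: 234,
});
```

Keeping identifiers such as user_id and trace_id as separate fields, rather than interpolating them into the message, is what makes the line queryable later.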

Logging at scale

A service handling 10K req/sec at 5 logs per request is 50K log lines per second per service. With 100 services, you are at 5M log lines per second. Storage and search become real engineering challenges.

Strategies:

  • Sampling: at high volume, log only a fraction of requests in detail (1 in 100, but always log errors).
  • Tiered retention: hot tier (last 7 days, fast search), warm tier (30 days, slower search), cold tier (90+ days, archive).
  • Structured fields, not message strings: searching on user_id=u-42 is fast; searching on "failed for user u-42" is slow.
  • Aggregate first, log second: use metrics for counts, logs for context. Do not log every successful request just to count them.
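
The sampling strategy can be sketched as follows. One assumption goes beyond the text: the decision is made by hashing the trace ID, so that all lines belonging to one request are kept or dropped together. The helper names are hypothetical.

```javascript
// FNV-1a 32-bit hash: cheap, deterministic, no dependencies.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Keep every WARN and ERROR line; keep roughly 1 in `sampleEvery` of the rest.
function shouldLog(entry, sampleEvery = 100) {
  if (entry.level === 'ERROR' || entry.level === 'WARN') return true;
  return fnv1a(entry.trace_id) % sampleEvery === 0;
}

let kept = 0;
for (let i = 0; i < 10000; i++) {
  if (shouldLog({ level: 'INFO', trace_id: `trace-${i}` })) kept++;
}
console.log(`kept ${kept} of 10000 INFO lines`);
console.log(shouldLog({ level: 'ERROR', trace_id: 'trace-1' })); // errors are never dropped
```

Because the decision depends only on the trace ID, every service that applies the same function keeps the same requests, so a sampled request still has complete logs end to end.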

Pillar 3: Distributed Tracing

A trace shows the path of one request through every service it touched. Each operation (HTTP call, DB query, cache lookup) is a span with a start time, end time, and parent span. Spans link into a tree.

---------- A trace for one checkout request ----------
  [span: HTTP POST /checkout                     200ms]
     [span: validate user                          5ms]
     [span: reserve inventory                     30ms]
        [span: DB UPDATE inventory                15ms]
     [span: charge card                          120ms]
        [span: HTTP POST stripe.com               80ms]
     [span: write order record                    20ms]
        [span: DB INSERT orders                    8ms]
     [span: emit OrderCreated event                3ms]

A trace tells you: 'this request was slow because the Stripe call took 80ms', or 'this error originated in the inventory service, propagated through the order service'.
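
One way to get from a span tree to a statement like that is to compute each span's self time: its duration minus the time spent in its direct children. The sketch below uses the checkout trace above; the span record shape is an assumption, and it presumes child spans run sequentially without overlap.

```javascript
// spans: [{ id, parentId, name, durationMs }]
function selfTimes(spans) {
  // Sum the durations of each span's direct children.
  const childTotal = new Map();
  for (const s of spans) {
    if (s.parentId !== null) {
      childTotal.set(s.parentId, (childTotal.get(s.parentId) ?? 0) + s.durationMs);
    }
  }
  // Rank spans by the time they spent doing their own work.
  return spans
    .map((s) => ({ name: s.name, selfMs: s.durationMs - (childTotal.get(s.id) ?? 0) }))
    .sort((a, b) => b.selfMs - a.selfMs);
}

const checkout = [
  { id: 1, parentId: null, name: 'HTTP POST /checkout', durationMs: 200 },
  { id: 2, parentId: 1, name: 'validate user', durationMs: 5 },
  { id: 3, parentId: 1, name: 'reserve inventory', durationMs: 30 },
  { id: 4, parentId: 3, name: 'DB UPDATE inventory', durationMs: 15 },
  { id: 5, parentId: 1, name: 'charge card', durationMs: 120 },
  { id: 6, parentId: 5, name: 'HTTP POST stripe.com', durationMs: 80 },
  { id: 7, parentId: 1, name: 'write order record', durationMs: 20 },
  { id: 8, parentId: 7, name: 'DB INSERT orders', durationMs: 8 },
  { id: 9, parentId: 1, name: 'emit OrderCreated event', durationMs: 3 },
];
for (const { name, selfMs } of selfTimes(checkout).slice(0, 3)) {
  console.log(`${name}: ${selfMs} ms of self time`);
}
```

The Stripe call ranks first with 80 ms of self time, followed by the 40 ms that 'charge card' spends outside the Stripe call.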

Stack: OpenTelemetry (the open standard for instrumentation), Jaeger, Zipkin, Tempo, Datadog APM, New Relic, Honeycomb, Lightstep.

Trace context propagation

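The critical mechanism: every cross-service call must carry trace context in its headers. The W3C Trace Context spec defines the traceparent header as version-traceid-parentid-flags, so it holds more than a bare trace ID. Each service keeps the incoming trace ID, starts a new span as a child of the incoming span, and sends its own span ID downstream as the next parent. Without this propagation, you have local logs but no end-to-end view.

// Sketch assuming Node.js 18+ (global fetch); spans are logged as a stand-in for a real exporter.
const crypto = require('node:crypto');
const randomHex = (bytes) => crypto.randomBytes(bytes).toString('hex');
// traceparent (version 00): 00-<traceId: 32 hex>-<parentSpanId: 16 hex>-<flags: 2 hex>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Express middleware: continue the caller's trace, or start a new one
function tracing(req, res, next) {
    const m = TRACEPARENT.exec(req.headers['traceparent'] ?? '');
    const startedAt = Date.now();
    req.trace = {
        traceId: m ? m[1] : randomHex(16),
        parentSpanId: m ? m[2] : null,   // null marks a root span
        spanId: randomHex(8),            // this service's own span
        flags: m ? m[3] : '01',          // 01 = sampled
    };
    res.on('finish', () => console.log(JSON.stringify({
        ...req.trace, name: 'http.request', status: res.statusCode, durationMs: Date.now() - startedAt,
    })));
    next();
}

// Outgoing HTTP call: propagate the trace ID with our span ID as the new parent
async function callDownstream(req, url) {
    const { traceId, spanId, flags } = req.trace;
    return await fetch(url, { headers: { traceparent: `00-${traceId}-${spanId}-${flags}` } });
}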

SLI, SLO, SLA, Error Budgets

SRE vocabulary that every senior engineer must know precisely.

SLI (Service Level Indicator)

A quantitative measure of some aspect of the service. Examples:

  • 'fraction of HTTP requests returning 2xx within 100 ms'
  • 'fraction of orders processed within 5 seconds'
  • 'data freshness lag, p99'

Good SLIs are user-centric (what does the customer experience?) and quantifiable.

SLO (Service Level Objective)

A target value for an SLI. Examples:

  • '99.9% of HTTP requests return 2xx within 100 ms (measured over 30 days)'
  • '99.5% of orders complete within 5 seconds'
  • 'p99 ingest lag < 60 seconds'

SLOs are internal targets. They are aggressive enough to keep customers happy but achievable enough to be sustainable.
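
To show how an SLI and an SLO relate in practice, here is a small sketch that computes the first example SLI from request records and compares it against a target. The record shape and function names are assumptions for illustration.

```javascript
// SLI: fraction of requests that returned 2xx within thresholdMs.
function availabilityLatencySli(requests, thresholdMs) {
  if (requests.length === 0) return 1; // no traffic: nothing failed
  const good = requests.filter(
    (r) => r.status >= 200 && r.status < 300 && r.durationMs <= thresholdMs
  ).length;
  return good / requests.length;
}

// SLO: the SLI must be at or above the target over the measurement window.
function meetsSlo(sli, target) {
  return sli >= target;
}

// 10,000 requests: 8 server errors, 5 successful but slow, the rest good.
const windowRequests = Array.from({ length: 10000 }, (_, i) => ({
  status: i < 8 ? 500 : 200,
  durationMs: i >= 8 && i < 13 ? 250 : 40,
}));
const sli = availabilityLatencySli(windowRequests, 100);
console.log('SLI:', sli, 'meets 99.9% SLO:', meetsSlo(sli, 0.999));
```

Note that the five slow responses count against the SLI even though they returned 200: the indicator measures what the user experienced, not just whether the server answered.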

SLA (Service Level Agreement)

A contractual promise to customers, typically with financial penalties for breaches. Always looser than the SLO (the buffer between SLO and SLA is your safety margin).

Example:

  • Internal SLO: 99.9% availability.
  • External SLA: 99.5% availability with 10% credits for breaches.

Error Budget

An SLO of 99.9% leaves 0.1% of the time as allowed unreliability: about 43 minutes per 30-day month. That allowance is the team's error budget, and it is spent on:

  • Risky launches (new features, infrastructure changes).
  • Experiments (chaos engineering, canary failures).
  • The unplanned (real outages).

The discipline: if the budget is consumed (more than 43 minutes of breaches this month), the team freezes risky changes until reliability recovers. This balances feature velocity with reliability without endless meetings about whether to ship.

---------- Error budget mechanics ----------
  SLO: 99.9% over 30 days = 43.2 min budget

  Day 5:  unplanned outage = 12 min consumed (31.2 left)
  Day 12: chaos experiment = 3 min consumed  (28.2 left)
  Day 18: launch caused brownout = 25 min consumed (3.2 left)
  Day 22: budget nearly exhausted -> launch freeze
  Day 30: budget resets
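
The same bookkeeping can be sketched in code. The function names are hypothetical, and the freeze flag here trips only once the budget is fully spent; a team may reasonably choose to freeze earlier, as in the timeline above.

```javascript
const MINUTES_PER_DAY = 24 * 60;

// Total allowed "bad" minutes for a time-based availability SLO.
function errorBudgetMinutes(slo, windowDays = 30) {
  return (1 - slo) * windowDays * MINUTES_PER_DAY;
}

// Walk a list of incidents and report the remaining budget after each one.
function trackBudget(slo, incidents, windowDays = 30) {
  let remaining = errorBudgetMinutes(slo, windowDays);
  return incidents.map(({ day, minutes, cause }) => {
    remaining -= minutes;
    return {
      day,
      cause,
      remaining: Math.round(remaining * 10) / 10, // one decimal place
      frozen: remaining <= 0,
    };
  });
}

const incidents = [
  { day: 5, minutes: 12, cause: 'unplanned outage' },
  { day: 12, minutes: 3, cause: 'chaos experiment' },
  { day: 18, minutes: 25, cause: 'launch caused brownout' },
];
for (const row of trackBudget(0.999, incidents)) {
  console.log(`Day ${row.day}: ${row.cause} -> ${row.remaining} min left`);
}
```

Running the same function with a 99.5% target gives a budget of about 216 minutes, which shows how much headroom a looser external commitment leaves over a 99.9% internal objective.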

Alerting: Done Right

The goal of alerting is to wake the right person at the right time with the right context. Done wrong, alerts are noise that nobody reads.

Alert on Symptoms, Not Causes

Wrong: 'CPU > 90% on web-3'. The user does not care about CPU. They care about whether the page loaded.

Right: 'p99 latency > 500ms for 5 minutes' or 'error rate > 1% for 2 minutes'. These are user-visible symptoms; if they fire, something is genuinely broken from the customer's view.
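
The 'for 5 minutes' part matters as much as the threshold: it keeps a single slow sample from paging anyone. A minimal sketch of that check follows, similar in spirit to the `for` clause in Prometheus alerting rules; the function name and sample shape are assumptions.

```javascript
// samples: [{ tsMs, value }] in ascending time order, ending at nowMs.
// Fires only if the value has been above the threshold continuously
// for at least forMs.
function firesFor(samples, threshold, forMs, nowMs) {
  let breachingSince = null;
  // Walk backwards from the newest sample while the condition holds.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) break;
    breachingSince = samples[i].tsMs;
  }
  return breachingSince !== null && nowMs - breachingSince >= forMs;
}

const MINUTE = 60 * 1000;
// p99 latency in ms, one sample per minute.
const toSamples = (values) => values.map((value, i) => ({ tsMs: i * MINUTE, value }));

const sustained = toSamples([300, 620, 640, 700, 650, 610, 630]);
const spike = toSamples([300, 310, 305, 300, 320, 900, 310]);

console.log(firesFor(sustained, 500, 5 * MINUTE, 6 * MINUTE)); // sustained breach
console.log(firesFor(spike, 500, 5 * MINUTE, 6 * MINUTE));     // one-minute spike
```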

SLO-Based Alerts

The modern best practice. Alert when the error budget is being burned faster than expected.

Example:

  • SLO: 99.9% over 30 days = 43.2 min budget.
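  • Fast-burn alert: 'budget consumed at >14.4x the sustainable rate over the last hour, confirmed by the last 5 minutes'. At that rate, 2% of the 30-day budget is gone in one hour -> page immediately.
  • Slow-burn alert: 'budget consumed at >6x the sustainable rate over the last 6 hours, confirmed by the last 30 minutes'. At that rate, 5% of the 30-day budget is gone in 6 hours -> page (less urgent).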

Burn-rate alerts page on real customer impact, not on noise. Google's SRE workbook has the canonical formulas.
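
A sketch of the arithmetic behind a burn-rate alert. The function names are hypothetical; the 14.4x and 6x thresholds used in the example are the ones commonly cited from Google's SRE workbook for a 30-day window, and the two-window check is a simplified version of the multi-window approach described there.

```javascript
// Burn rate = observed error ratio / error ratio the SLO allows.
// A burn rate of 1 spends exactly the whole budget over the SLO window.
function burnRate(errorRatio, slo) {
  return errorRatio / (1 - slo);
}

// Page only if BOTH the long and the short window exceed the threshold,
// so an incident that has already ended does not keep paging.
function shouldPage({ longWindowErrorRatio, shortWindowErrorRatio, slo, threshold }) {
  return (
    burnRate(longWindowErrorRatio, slo) > threshold &&
    burnRate(shortWindowErrorRatio, slo) > threshold
  );
}

// Share of the total budget consumed by burning at `rate` for `hours`.
function budgetConsumed(rate, hours, windowDays = 30) {
  return (rate * hours) / (windowDays * 24);
}

// 2% of requests failing against a 99.9% SLO is roughly a 20x burn rate.
console.log('burn rate:', burnRate(0.02, 0.999));
console.log('page:', shouldPage({
  longWindowErrorRatio: 0.02,   // last hour
  shortWindowErrorRatio: 0.02,  // last 5 minutes
  slo: 0.999,
  threshold: 14.4,
}));
console.log('budget used by 14.4x for 1 h:', budgetConsumed(14.4, 1));
console.log('budget used by 6x for 6 h:', budgetConsumed(6, 6));
```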

Severity Levels

  • P0 / Page: customers are impacted now; wake someone up.
  • P1 / Ticket: customers will be impacted soon if not addressed; respond within hours.
  • P2 / Notify: trend that matters; review during business hours.
  • Info: log only, no alert.

Be ruthless. If it does not need a human now, it is not P0.

Alert Fatigue

The single most common monitoring failure. Teams with hundreds of alerts firing per week stop reading them; the one alert that mattered gets ignored.

Mitigations:

  • Audit alerts quarterly. Delete or downgrade any that fire without action being taken.
  • Aggregate similar alerts (one alert for 'database is slow' instead of 100 alerts for individual queries).
  • Page only on user-impacting symptoms; route everything else to ticket queues.
  • Track 'alert volume per on-call shift'; if it exceeds 5-10, the on-call is broken.
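
The quarterly audit can start from something as simple as the sketch below, which flags alerts that mostly fire without anyone acting on them. The history format and the 50% cut-off are assumptions for illustration; real data would come from your paging or ticketing system.

```javascript
// history: [{ alert, actedOn }], one entry per firing in the audit period.
// Returns alerts whose share of firings that led to action is below
// minActionRatio, noisiest first.
function auditAlerts(history, minActionRatio = 0.5) {
  const stats = new Map();
  for (const { alert, actedOn } of history) {
    const s = stats.get(alert) ?? { fired: 0, acted: 0 };
    s.fired += 1;
    if (actedOn) s.acted += 1;
    stats.set(alert, s);
  }
  return [...stats.entries()]
    .map(([alert, s]) => ({ alert, ...s, actionRatio: s.acted / s.fired }))
    .filter((s) => s.actionRatio < minActionRatio)
    .sort((a, b) => b.fired - a.fired);
}

const history = [
  { alert: 'cpu_high_web', actedOn: false },
  { alert: 'cpu_high_web', actedOn: false },
  { alert: 'cpu_high_web', actedOn: false },
  { alert: 'cpu_high_web', actedOn: false },
  { alert: 'checkout_error_rate', actedOn: true },
  { alert: 'checkout_error_rate', actedOn: true },
  { alert: 'disk_latency', actedOn: false },
  { alert: 'disk_latency', actedOn: true },
];
console.log(auditAlerts(history)); // candidates to delete or downgrade
```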

Dashboards: Done Right

Three types of dashboards. Each serves a purpose; none replaces the others.

Service Dashboard (RED + USE)

For each service: rate, errors, duration of HTTP requests; CPU/memory/disk of underlying instances. Used by service owners daily.

Customer Journey Dashboard

User-visible flows: 'login success rate', 'checkout completion rate', 'p95 search latency'. Used by product teams and incident commanders.

SLO Dashboard

For each SLO: current value, 30-day budget remaining, burn rate. Used by SRE and engineering leads to govern launch decisions.

Do not build dashboards with 100 graphs that nobody reads. Three good dashboards beat thirty cluttered ones.

Tool Comparison

Workload | Recommended stack | Notes
Metrics + alerting (open source) | Prometheus + Grafana + Alertmanager | The de facto standard; pull-based scraping.
Metrics + alerting (managed) | Datadog, Grafana Cloud, New Relic | All-in-one; expensive at scale but no infra.
Logs (open source) | Loki (with Grafana) or ELK | Loki for cheaper storage; ELK for richer search.
Logs (managed) | Datadog Logs, Splunk, Sumo Logic | Easier to operate; per-GB pricing.
Traces (open source) | Jaeger, Tempo, Zipkin | OpenTelemetry as the instrumentation standard.
Traces (managed) | Datadog APM, Honeycomb, Lightstep, New Relic | Mature query interfaces; high-cardinality friendly.
Unified observability platform | Datadog, New Relic, Grafana Cloud | One UI for all three pillars.

Default modern recommendation: OpenTelemetry for instrumentation everywhere; Prometheus + Grafana for metrics if self-hosting; Datadog or Grafana Cloud if you can pay for managed.

Operational Practices

  1. Runbooks per alert. Every alert links to a runbook with: 'what does this mean? what should I check first? what are the standard remediations?'
  2. On-call rotations. Healthy on-call: predictable schedule, reasonable load (5-10 alerts per shift max), follow-the-sun for global teams.
  3. Post-mortems for every significant incident. Blameless format: timeline, root cause, contributing factors, action items.
  4. Game days. Quarterly exercises simulating major failures. Validate that runbooks work and on-call humans know how to use them.
  5. Audit alerts quarterly. Delete or downgrade what is noisy. Promote an alert to a higher severity only when it signals real customer impact and is being missed at its current level.

How to Talk About This in an Interview

  1. Lead with the three pillars. 'I would instrument metrics, logs, and traces; each answers different questions.'
  2. Use the RED method for services and USE for resources. 'For every service: rate, errors, duration. For every resource: utilization, saturation, errors.'
  3. Define your SLOs. 'For this product I would target 99.9% availability with p95 latency under 200ms over 30 days. The error budget is ~43 minutes; we freeze risky launches if we exceed it.'
  4. Always alert on symptoms, not causes. 'I page when errors or slow requests burn the error budget faster than the SLO allows, not on CPU or memory thresholds.'
  5. Mention OpenTelemetry. The modern open standard for trace and metric instrumentation; using it signals you are aware of current practice.
  6. Acknowledge alert fatigue. 'I audit alerts quarterly and delete or downgrade any that fire without driving action. Healthy on-call is 5-10 alerts per shift, max.'

Quick Review

  • Three pillars: metrics (numbers), logs (events with context), traces (causal request graph).
  • RED for services (rate, errors, duration); USE for resources (utilization, saturation, errors).
  • SLI = a measure; SLO = a target for the SLI; SLA = a contractual promise (looser than SLO).
  • Error budget: the amount of unreliability you can spend on launches and experiments.
  • Alert on user-visible symptoms (latency, errors), not on internal causes (CPU).
  • SLO burn-rate alerts are the modern best practice.
  • OpenTelemetry is the open instrumentation standard.
  • Runbooks, blameless post-mortems, game days, alert audits are the operational disciplines.

Real-World Examples

How real systems implement this in production

Google SRE error budgets

Google pioneered the error-budget model. Each service has an SLO (often 99.9% or 99.99%); the budget is its complement (100% minus the SLO, so 0.1% or 0.01%). If the budget is healthy, the service team can ship aggressively. If the budget is consumed by outages or risky launches, all non-critical changes are frozen until reliability recovers. This codified rule replaces endless meetings about whether to ship with a clear, measurable gate.

Trade-off: Error budgets give engineering teams real autonomy: ship with confidence when the budget is healthy, slow down when it is not. The cost is cultural - teams must trust the SLO (which means setting it carefully) and accept the freeze rule (which means not gaming it). Done right, it dramatically improves both velocity and reliability.

Netflix Atlas + Mantis

Netflix built Atlas (a high-cardinality time-series database) and Mantis (a stream-processing observability platform) to handle their scale: trillions of metric events per day across thousands of services. Engineers can query metrics by any combination of tags (service, region, AZ, AB-test cohort) without precomputed aggregations.

Trade-off: High-cardinality metrics give Netflix unmatched diagnostic power but cost far more storage and compute than systems built around bounded label cardinality, such as Prometheus. The investment is justified at Netflix scale; for smaller systems, Prometheus + Grafana is a better cost-benefit balance.

Honeycomb's high-cardinality observability

Honeycomb pioneered the 'observability 2.0' approach: instead of pre-aggregating metrics, store every event with all its attributes (user_id, region, build version, AB-test cohort) and let engineers slice and group at query time. This enables debugging questions that traditional metrics cannot answer ('why is p99 latency high only for users on Android in the new build?').

Trade-off: High-cardinality observability is a paradigm shift that lets you answer questions you did not anticipate. Cost: storage and query infrastructure are more complex. Cultural: teams must learn to think in terms of events, not pre-defined metric names. The win is unprecedented diagnostic power for complex distributed systems.

Datadog at Coinbase

Coinbase, a cryptocurrency exchange, uses Datadog as its unified observability platform: metrics, logs, APM traces, and security monitoring in one UI. With trading volumes spiking unpredictably during crypto market events, Coinbase needs to see the full picture in seconds: which service is slow, which trace is anomalous, which logs explain it.

Trade-off: A unified managed platform like Datadog provides immediate productivity (engineers learn one UI, correlations across pillars are built-in) at significant cost ($1M+ per month at scale). For high-revenue products where minutes of downtime are huge losses, the ROI is clear. For lower-stakes products, OSS Prometheus + Grafana + Loki is a viable alternative.

Quick Interview Phrases

Key terms to use in your answer

RED method
SLI, SLO, SLA
error budget
burn-rate alerting
distributed tracing
OpenTelemetry
alert on symptoms not causes

Common Interview Questions

Questions you might be asked about this topic

Three pillars: (1) Metrics - dashboards using RED (rate, errors, duration) per service and USE (utilization, saturation, errors) per resource. (2) Logs - structured JSON, indexed in Loki or Elasticsearch, with sampling at high volume. (3) Traces - OpenTelemetry instrumentation, trace context propagated via W3C traceparent header, viewable in Jaeger or Datadog APM. Define SLOs (e.g., 99.9% availability, p95 < 200ms). Alert on SLO burn rate, not on causes. On-call rotations with runbooks per alert. Blameless post-mortems after every significant incident, and quarterly game days.

Interview Tips

How to discuss this topic effectively

  1. Lead with the three pillars (metrics, logs, traces) and what each answers. Treating monitoring as one undifferentiated thing is a junior-level move.
  2. Always quote SLOs as numbers. '99.9% availability with p95 < 200ms over 30 days' is much stronger than 'high availability'.
  3. Mention error budgets and the freeze rule. They are how mature teams balance reliability and feature velocity.
  4. Distinguish symptom alerts from cause alerts. CPU is a cause; a latency SLO breach is a symptom. Page only on symptoms.
  5. Bring up OpenTelemetry by name. It is the current open standard and the right answer to 'how do you instrument trace context propagation?'.

Common Mistakes

Pitfalls to avoid in interviews

Alerting on causes instead of symptoms

An alert that fires when CPU > 90% wakes the on-call without telling them whether customers are affected. Alert on user-visible symptoms (latency, error rate, SLO burn) and use cause metrics for diagnosis after the symptom alert fires.

Treating SLA, SLO, and SLI as synonyms

SLI is a measure (% of requests under 100ms). SLO is the internal target (99.9%). SLA is a contractual promise to customers (99.5%, with credits for breaches). Internal SLO must be tighter than the SLA so you have a buffer. Conflating them undermines your reliability vocabulary.

Logging unstructured strings

Unstructured log lines ('Failed to charge user 42 with $99.99') are nearly unsearchable at scale. Always log structured JSON ({error: 'charge_failed', user_id: 'u-42', amount: 99.99}). Search on user_id=u-42 is fast; substring search is slow and brittle.

Drowning in alerts

Hundreds of alerts per week train the on-call to ignore them; the one that mattered slips through. Audit alerts quarterly; delete or downgrade anything that fires without action. Healthy on-call sees 5-10 actionable alerts per shift.

Building monitoring without trace context propagation

Without trace IDs in cross-service calls, you have local logs and metrics but no way to debug 'why was this one request slow?' across services. Adopt OpenTelemetry from day one; passing trace context (W3C traceparent header) is non-negotiable for microservices.