
Metrics, Logs, and Traces: The Three Pillars Without the Marketing

What each pillar actually does, when reaching for it pays off, and the budget I follow so I am not paying observability vendors more than I am paying for compute.



By @zarakamau

May 17, 2026 · Updated May 20, 2026


Three pillars. The phrase has been on every observability vendor's homepage for so long that it sounds suspiciously like a sales line. It is not, mostly. The three things, metrics, logs, and traces, really do answer different questions, and a service that is missing any one of them has a recurring kind of debugging pain. The marketing version of the story makes them sound interchangeable. They are not, and the cost of treating them as if they were is most of what shows up on my observability bill.

This is the breakdown I wish I had been handed in my first on-call rotation. What each pillar is actually for, the question it is good at answering, the question it is bad at answering, and the budget I now use so we are not paying more for visibility than for compute.

The three pillars, in one sentence each

Three pillars at a glance
  Metrics  numbers, sampled at intervals, aggregated cheaply across many requests.
  Logs     timestamped text events, one per interesting moment, structured as JSON.
  Traces   a single request's path through N services, with timing per hop.

Metrics tell you something is wrong. Logs tell you what happened in one specific case. Traces tell you where in a chain of services the slowness or error came from.

The reason each one exists separately is cost shape. Metrics cost almost nothing per data point; their cost scales with the number of distinct series (cardinality), which is why something like "plot p99 over 30 days" stays cheap. Logs cost roughly per byte ingested and stored, so you cannot afford to log every event at full verbosity in prod. Traces are priced much like logs (per span rather than per byte) but pay back when a request crosses three or more services and you need to see the whole path.

Metrics: the dashboard you stare at while paged

A metric is a numeric measurement, attached to a name, with a small set of labels. "http_requests_total" labelled by route and status, "http_request_duration_seconds" labelled by route, "db_pool_in_use" labelled by db. Each label combination becomes a time series, and the cost of metrics is dominated by how many series you have (cardinality), not how often you sample them.

The four metrics I make sure exist on every service from day one, often called the RED method (Rate, Errors, Duration) plus saturation:

RED + Saturation, the four-metric starter pack
  Rate         requests per second per route
  Errors       error rate per route (status >= 500)
  Duration     request duration histogram per route, p50/p95/p99
  Saturation   one or two saturation gauges (DB pool used, queue depth, CPU)

This is enough to answer "is the service healthy?" without opening a single log. The dashboard built on top of those four is the first thing I open when paged.

The trap with metrics is cardinality. Adding user_id as a label to a metric on a service with a million users creates at least a million series. Hosted backends such as Datadog typically bill per active series; self-hosted ones such as Prometheus or M3 pay for it in memory and slower queries as cardinality grows. Practical rule: labels are for things with a small, bounded set of values (route, status code, region, environment). User IDs, request IDs, trace IDs, anything per-request, do not belong on metrics. They belong in logs and traces.
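The arithmetic behind that rule is a product over label cardinalities. A minimal sketch, where the function name and the label counts are hypothetical and chosen only for illustration:

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Illustrative numbers, not measurements from a real service.
bounded = series_count({"route": 4, "status": 3})
with_user_id = series_count({"route": 4, "status": 3, "user_id": 1_000_000})

print(bounded)       # 12
print(with_user_id)  # 12000000
```

This is an upper bound: a series only exists once its label combination has actually been observed, but per-request labels tend to fill the space quickly.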

A Prometheus example, in the actual format I would expose from a Node service:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{route="/api/orders",status="200"} 14523
http_requests_total{route="/api/orders",status="500"} 12
http_requests_total{route="/api/users",status="200"} 9821

# HELP http_request_duration_seconds Request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{route="/api/orders",le="0.1"} 13900
http_request_duration_seconds_bucket{route="/api/orders",le="0.5"} 14400
http_request_duration_seconds_bucket{route="/api/orders",le="1.0"} 14520
http_request_duration_seconds_bucket{route="/api/orders",le="+Inf"} 14535
http_request_duration_seconds_count{route="/api/orders"} 14535

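Four routes, three status codes, and ten histogram buckets: the counter is 4 x 3 = 12 series, and even a duration histogram labelled by both route and status would be 4 x 3 x 10 = 120 bucket series, well within budget. Add user_id and the same metric balloons into millions.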
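The p50/p95/p99 lines on a dashboard are estimated from cumulative bucket counts like the ones in the exposition above. A rough sketch of that estimate, assuming observations are spread evenly inside each bucket (the same general idea as Prometheus's histogram_quantile, but this is my simplified version, not its implementation):

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Cannot interpolate into +Inf or an empty bucket.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Bucket counts for route="/api/orders", copied from the exposition above.
orders_buckets = [(0.1, 13900), (0.5, 14400), (1.0, 14520), (math.inf, 14535)]

p50 = estimate_quantile(0.50, orders_buckets)  # about 0.052 s
p99 = estimate_quantile(0.99, orders_buckets)  # about 0.49 s
error_rate = 12 / (14523 + 12)                 # 500s over all /api/orders requests
```

The p99 here lands somewhere inside the (0.1, 0.5] bucket, so the estimate is only as precise as the bucket layout. If p99 matters for an alert, put a bucket boundary near the threshold you care about.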

Logs: the play-by-play, structured or it is useless

A log line is a structured event: timestamp, level, message, and a JSON payload. The defining property is that each line stands alone; you can grep, filter, or query without context.

The pattern that makes logs actually useful: structure them as JSON, never as free-form prose. The difference between

2026-04-23T14:22:11Z ERROR could not charge user 1234 for $42.50: card declined

and

{
  "ts": "2026-04-23T14:22:11Z",
  "level": "error",
  "msg": "charge_failed",
  "user_id": 1234,
  "amount_cents": 4250,
  "reason": "card_declined",
  "trace_id": "abc123",
  "request_id": "req-789"
}

is the difference between an SRE running grep -i "declined" | wc -l and an SRE running level:error AND msg:charge_failed in their log query language and getting an exact count of failed charges by reason in the last hour. You want the second world.
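To make the second world concrete, here is a minimal sketch using only the Python standard library. The JsonFormatter class and the "fields" convention are mine, not a standard API, and the in-memory stream stands in for stdout or a log shipper:

```python
import io
import json
import logging
from collections import Counter
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # "fields" is a convention of this sketch, passed via extra={"fields": {...}}.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error("charge_failed", extra={"fields": {
    "user_id": 1234, "amount_cents": 4250, "reason": "card_declined",
    "trace_id": "abc123", "request_id": "req-789",
}})
logger.error("charge_failed", extra={"fields": {"user_id": 77, "reason": "insufficient_funds"}})
logger.info("charge_succeeded", extra={"fields": {"user_id": 5}})

# "level:error AND msg:charge_failed, grouped by reason" becomes an exact count.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
failures_by_reason = Counter(
    e["reason"] for e in events
    if e["level"] == "error" and e["msg"] == "charge_failed"
)
print(failures_by_reason)
```

The message is a stable event name ("charge_failed") and everything variable goes into fields, which is what lets the query match exactly instead of pattern-matching prose.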

The budget question for logs is: how much do I log at INFO vs DEBUG, and at what point does the bill scale faster than my traffic? My rule of thumb across the systems I have run: logging every request at INFO level, with a structured payload, comes out to roughly 1 to 2 KB per request. At a hundred requests per second, that is 100 to 200 KB/s, or roughly 8 to 17 GB per day for one service. At hosted-log-vendor pricing of $1 to $5 per GB ingested (the range I have actually seen invoiced; check your contract), that single service is between roughly $8 and $85 per day in logging fees. Add ten services and three environments and you can have a five-figure-per-month problem. The same numbers explain why teams stop logging every successful request in production while still keeping full detail for failures (a pattern called sampling, covered in a moment).
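The arithmetic is worth keeping in a script so you can rerun it with your own traffic and contract price. A small sketch, assuming decimal units (1 GB = 1,000,000 KB) and, for the fleet figure, that every environment sees the same traffic, which is usually pessimistic for staging:

```python
SECONDS_PER_DAY = 86_400


def daily_log_cost(requests_per_second: float, kb_per_request: float,
                   dollars_per_gb: float) -> tuple[float, float]:
    """Return (GB ingested per day, dollars per day) for one service."""
    gb_per_day = requests_per_second * kb_per_request * SECONDS_PER_DAY / 1_000_000
    return gb_per_day, gb_per_day * dollars_per_gb


low_gb, low_cost = daily_log_cost(100, 1, 1.0)    # 8.64 GB, $8.64 per day
high_gb, high_cost = daily_log_cost(100, 2, 5.0)  # 17.28 GB, $86.40 per day

# Ten services, three environments, thirty days.
fleet_monthly_low = low_cost * 10 * 3 * 30
fleet_monthly_high = high_cost * 10 * 3 * 30
print(f"${fleet_monthly_low:,.0f} to ${fleet_monthly_high:,.0f} per month")
```

The unrounded figures land in the same ballpark as the rounded ones in the paragraph above; the point is the shape, which is linear in traffic, payload size, and price all at once.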

Two concrete patterns that have controlled my logs bill:

Sampling at the source. Log every error. Log a 1% or 5% sample of successful requests. The volume and cost drop dramatically; the few times you need a specific successful request, you can usually still infer it from upstream metrics or downstream traces.

Trace ID on every line. When a single request fails, you want every log line that request produced, across every service it touched. The only way this works is if every log line carries the trace ID. The middleware that pulls the trace ID out of the request context and stamps it on every log emission is the most useful 50 lines of code in any service.
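Both patterns fit in one logging filter. This is a sketch under assumptions: the names current_trace_id, sampled_in, and TraceSamplingFilter are hypothetical, the 5% default is just one of the rates mentioned above, and real middleware would set the context variable from the incoming request headers.

```python
import contextvars
import hashlib
import logging

# Set once per request by (hypothetical) middleware; read by every log call.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


def sampled_in(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same answer,
    so a sampled request keeps all of its log lines."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


class TraceSamplingFilter(logging.Filter):
    """Stamp trace_id on every record; keep all warnings and errors, sample the rest."""

    def __init__(self, success_rate: float = 0.05):
        super().__init__()
        self.success_rate = success_rate

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        if record.levelno >= logging.WARNING:
            return True
        return sampled_in(record.trace_id, self.success_rate)


handler = logging.StreamHandler()
handler.addFilter(TraceSamplingFilter(success_rate=0.05))
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
```

Hashing the trace ID instead of calling a random number generator is a design choice: every service that sees the same trace ID makes the same keep-or-drop decision, so a sampled request is complete across services rather than patchy.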

Traces: the timeline that explains slow requests

A trace is a tree of spans. Each span is a unit of work (a function call, a DB query, an outbound HTTP request) with a start time, end time, and a parent span. The trace's root span is the incoming request; its children are everything that happened to serve it.

What traces do that metrics and logs cannot: when a request takes 3 seconds and it should take 300 milliseconds, traces show you the flame chart of where the time went.

Simplified trace for a slow request
  POST /api/checkout                                3200ms
    auth.verify_session                              12ms
    db.users.findById                                 4ms
    cart.compute_total                              140ms
      db.cart_items.findMany                        135ms
    payment.charge                                 2900ms   <-- the cost
      external_api.stripe.charges.create            45ms
      db.payments.insert                              5ms
      db.cart.markCheckedOut                       2840ms   <-- the actual cost
        db.cart_items.delete (1 query per item)    2840ms
    notify.email_receipt                            120ms

The trace makes it instantly obvious that the bottleneck is one specific DB call, and that call is firing once per cart item. This is the N+1 query I would have spent an hour finding from logs and never spotted from metrics. With traces, I see it in 30 seconds.
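What the flame chart does visually can be sketched as a walk over the span tree. The Span class and slowest_path function below are hypothetical names for illustration, not any tracing library's API, and the self-time calculation assumes children run one after another:

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        """Time spent in this span itself rather than in its children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_path(root: Span) -> list[str]:
    """Follow the slowest child at each level down to a leaf."""
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda s: s.duration_ms)
        path.append(node.name)
    return path


# The simplified checkout trace from above.
payment = Span("payment.charge", 2900, [
    Span("external_api.stripe.charges.create", 45),
    Span("db.payments.insert", 5),
    Span("db.cart.markCheckedOut", 2840, [Span("db.cart_items.delete", 2840)]),
])
checkout = Span("POST /api/checkout", 3200, [
    Span("auth.verify_session", 12),
    Span("db.users.findById", 4),
    Span("cart.compute_total", 140, [Span("db.cart_items.findMany", 135)]),
    payment,
    Span("notify.email_receipt", 120),
])

print(" -> ".join(slowest_path(checkout)))
print(payment.self_time_ms())  # 10: payment.charge itself is cheap, its children are not
```

The path ends at db.cart_items.delete, and payment.charge has almost no self time, which is the same conclusion the indentation makes obvious by eye.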

The trade-off with traces is the sampling rate. Recording every span on every request is expensive (tracing backends charge per span). Most teams sample 1% to 10% of traces, which is fine for performance debugging because you have plenty of typical-shaped traces to look at. The tail-based sampling option (record every span, decide at the end of the request whether to keep the trace based on whether it errored or was slow) is more expensive to run but catches the rare slow request without you needing to be lucky.
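The tail-based decision itself is small; the expense is in buffering every span until the request finishes. A sketch of the decision only, where FinishedTrace, keep_trace, the 1000 ms threshold, and the 5% baseline are all assumptions of mine rather than any vendor's defaults:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    had_error: bool


def keep_trace(trace: FinishedTrace, slow_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.05) -> bool:
    """Tail-based decision, made after the request has finished."""
    if trace.had_error or trace.duration_ms >= slow_threshold_ms:
        return True  # always keep the interesting tail
    # Otherwise keep a deterministic baseline sample of ordinary traces.
    digest = hashlib.sha256(trace.trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") < baseline_rate * 2**64


print(keep_trace(FinishedTrace("t-slow", 3200, False)))  # True: over the threshold
print(keep_trace(FinishedTrace("t-error", 40, True)))    # True: errored
```

The baseline sample still matters: without typical-shaped traces to compare against, it is hard to say what "slow" looks like for a given route.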

When to reach for which pillar

A quick decision table I have on a sticky note next to my monitor:

When to reach for which pillar
  Question                                          Pillar
  Is the service degraded right now?                Metrics
  What share of requests is affected?               Metrics (error rate per route)
  Did a deploy 30 minutes ago cause this?           Metrics (deploy markers + RED dashboards)
  What did this one failed request actually do?     Logs (filter by request_id or trace_id)
  Why is this one request slow?                     Traces
  Where in a chain of 5 services is the bottleneck? Traces
  Was there a spike at 02:14 last night?            Metrics
  What error message did user 1234 see at 02:14?    Logs

Most outages I have been part of follow the same shape. Metrics catch the regression first (the alert fires). I open the metrics dashboard and see that the request rate or error rate moved at deploy time. I then jump to a trace from one of the failing requests to see where the time went, and to a log query filtered by the same trace ID for the actual error message. Three pillars, three jobs, used in sequence.

A budget that has held for me

The combined cost of metrics + logs + traces should be a small fraction of the cost of running the service itself. My rough budget on a service that costs $1000/month in compute:

  • Metrics: $50 to $150/month. Cardinality kept low by review.
  • Logs: $200 to $400/month. Sampled, structured, retained 30 days.
  • Traces: $50 to $200/month. Sampled at 5%, tail-based for slow requests.

If the observability bill creeps above 30% of compute, something has gone wrong, almost always cardinality or a too-verbose log level. The fix is rarely "buy a different vendor". It is almost always "audit what we are emitting and stop emitting the expensive parts".

The pillar I would add if I had to pick a fourth

If someone forced me to add a fourth pillar, it would be events: high-cardinality, structured, business-meaningful records ("order created", "payment captured", "refund issued"), shipped to a warehouse for analytics, retained for years rather than days. They are not a debugging tool the way logs and traces are. They are the source of truth for product analytics, financial reconciliation, and regulatory questions. Treating them as a separate stream from operational logs (different ingestion path, different retention, different query tool) keeps both clean.

Most teams I have joined had this fourth stream by accident, mixed into their logs at random verbosity levels, with no schema and no SLA on retention. Pulling it out into a real events pipeline (Kafka, then a warehouse) was cheap, was almost always overdue, and made both the operational logs and the product analytics work better. The three pillars do not stop being useful when you add this fourth stream; they just stop being asked to do the events stream's job. I do not bill it as a fourth pillar in the abstract; I am answering the "what about audit and analytics?" question that comes up every time.

The sentence I leave teams with

When I onboard at a new team, the first observability question I ask is which pillar each existing alert is built on, because the answer tells me almost everything about the team's debugging culture. Teams that alert exclusively on logs ("if this regex appears more than X times") are usually drowning in noise; teams that alert exclusively on metrics without trace correlation usually spend their incident time grepping production logs by hand. The healthiest teams I have worked with treat metrics as the alarm, traces as the locator, and logs as the receipt: each pillar plays its specific role and none of them is asked to substitute for the others. Once you can name which question each pillar is good at and which it is bad at, the marketing version of the story stops mattering, and the bill stops surprising you.
