I have caused at least four production incidents that traced back to an N+1 query. The pattern is identical every time. The endpoint works fine in dev with three rows, works fine in staging with thirty, and the moment a real customer with three hundred relationships hits it the endpoint times out and the database CPU spikes. The fix is always small. The detection, in the absence of the right logging, is what costs the day.
This article is the writeup I now share with every team I join. What an N+1 actually is, why ORMs make them so easy to ship, four concrete ways to fix them, and the one logging change I add to every project on day one that has caught every N+1 I have authored since.
What an N+1 actually is
The pattern is one query that returns N rows, followed by one query per row to fetch some related data. So you fire one query for the parents and N queries for the children, for a total of N+1.
Concretely, in pseudocode that looks like a typical Active Record / Sequelize / Prisma flow:
If orders.length is 1, the endpoint runs two queries. If it is 50, fifty-one. If it is 500, five hundred and one, and most of them serial because they live inside a for loop that awaits each round trip. Each query is a network round trip, a planner pass, an index seek; the per-query cost is real even when it is small. Multiply by 500 and you have a 2-second endpoint where a single query would have run in 50ms.
Why ORMs make this so easy
ORMs translate object access into queries. The very feature that makes them ergonomic, that you can write order.items and get the items, hides the round trip. With raw SQL you would have to write the join explicitly; with the ORM you accidentally write a loop that looks like in-memory iteration but actually fires a query per iteration.
Lazy-loading defaults make this worse. The first time order.items is touched, the ORM goes to the database. Looking at the access in code, you cannot tell whether it is a memory hit or a network call. The line console.log(order.items.length) looks innocent and might be the single most expensive line in the function.
Some ORMs (TypeORM with default eager: false, Sequelize without include, ActiveRecord with has_many) have made me hit this pattern more times than I can count. Prisma is better here because it does not lazy-load (you have to explicitly include or do a separate query), but it still happens when developers translate a service layer faithfully without thinking about the query shape.
Four concrete fixes
The fixes form a hierarchy of how much you change in the calling code.
Fix 1: include the relation in the first query. The cleanest fix. One query, one network round trip, the database does the join.
Prisma compiles this into one or two SQL statements (one per table for related collections, via an IN lookup that is still O(1) round trips). The database does the join, the network round trip is one. This is what the ORM was designed for; use it.
Fix 2: batch the second query. If you cannot include for some reason (you got the parent list from somewhere else, or the relation is not a direct association), fire one batched query for the children and join in memory.
Two queries total, regardless of N. The IN clause is the workhorse; most databases handle IN (...) lists of a few thousand entries efficiently. If your N is in the millions, you should be paginating, not loading all of it.
Fix 3: use a DataLoader. When the call sites are spread out (typical in GraphQL resolvers, where the schema dictates that order.items is its own resolver fired once per parent), you cannot easily restructure the calling code into a single batch. DataLoader (from Facebook, originally) batches all calls within a single tick into one query and caches per-request.
DataLoader collects all load(id) calls fired in the same event-loop tick, dedupes them, and fires one batched query. The resolver code looks like a per-row call; the actual database behavior is one batched query per type per request. This is the pattern every GraphQL backend ends up using, and the absence of it is the reason GraphQL has a reputation for being slow.
Fix 4: denormalize the field onto the parent. When the count or summary is queried alongside the parent every single time, store it on the parent. orders.itemCount, users.followerCount, posts.commentCount. Maintained via triggers, by the application on write, or by a periodic refresh. Reads become trivial; the cost moves to writes and the (usually rare) inconsistency window if your refresh lags.
I keep this as the last fix because it is structural; you give up some normalization. But for fields that are read on every list endpoint and written rarely, it is the right call. Counting comments per post on every blog index page when you have a million posts is silly; storing the count on the post is the cheap and correct answer.
Detection: the one logging change I always add
The fix is easy. The detection is what costs you. Most teams I have joined add SQL logging to dev mode, then turn it off because the noise drowns out the signal. The change I add on day one of every project: a per-request query counter, logged at the end of the request with a warning if it exceeds a threshold.
Twenty is my default threshold. A typical CRUD endpoint runs 1-5 queries. Anything over 20 is a smell, often an N+1 caught early. The warning shows up in dev logs the first time someone writes a loop with a query inside; you fix it before it ships. I have caught many N+1s this way that I would otherwise have shipped to production and discovered three weeks later when an angry customer hit the endpoint with three hundred children.
Production-side, the same metric is invaluable. If your monitoring graphs queries-per-request per endpoint, an N+1 that ships shows up as a sudden jump in the graph for that endpoint, long before it shows up as latency that customers complain about.
False positives: when high query counts are fine
Not every endpoint with many queries has an N+1 bug. A bulk-import endpoint that processes 100 records and writes each to multiple tables will fire many queries; that is correct. The tell is whether the queries are similar (smell of a loop with one query inside) or varied (different shapes for different parts of the work).
The diagnostic question I ask: "if N doubled, would the query count double?". If yes, it is an N+1 (or close to it) and there is a batching opportunity. If no, the queries are doing different work and the count is appropriate.
A real production fix I shipped
One concrete example, from memory of an incident I led the response on. An admin dashboard endpoint was timing out for some tenants. The endpoint listed projects and showed each project's owner, member count, and tag list. The code looked like:
For a tenant with 200 projects, that was 1 + 200*3 = 601 queries. The endpoint took 8 seconds p99. The fix:
Three or four queries instead of 601. Endpoint p99 dropped from 8 seconds to 250ms. The fix was four lines; the issue had been in production for months because the query-count metric did not yet exist on that service.
What I would tell my younger self
Two things. First, every loop you write that touches a database is a candidate N+1. Train yourself to spot the pattern in code review: if there is await inside for ... of, scrutinize it. The few legitimate cases are fine; the many illegitimate ones are bugs.
Second, do not rely on humans to spot it. Add the per-request query counter on day one and let the logs tell you. A high query count on a 5-row dev dataset becomes a 500-query disaster in production; the metric catches it when N is small, before it costs you.
The pattern is solved, the bugs are about discipline
N+1 is not an open research problem. The four fixes above cover every case I have ever encountered. The reason teams keep shipping them is not lack of knowledge; it is the lack of feedback when they happen. The instant you make query-count visible in dev, the rate of N+1 bugs that escape to production drops to nearly zero. It is one of those changes that pays for itself within a week, and the thing I now reach for before I do almost anything else on a new service.
