Community Article

MongoDB Aggregation Pipelines, by Example

Match, group, lookup, project, the four stages I use 90% of the time. Real pipelines for funnel analysis, leaderboard, and a one-stage join, with the gotchas that surprise SQL refugees.

MongoDB Aggregation Pipelines, by Example

Match, group, lookup, project, the four stages I use 90% of the time. Real pipelines for funnel analysis, leaderboard, and a one-stage join, with the gotchas that surprise SQL refugees.

database
nosql
query-optimization
data-modeling
performance
fatimapark

By @fatimapark

February 1, 2026

·

Updated May 20, 2026

582 views

6

4.5 (11)

Last quarter I rewrote a 14-stage aggregation that ran in 380 ms into a 6-stage one that runs in 22 ms. The new version is shorter, easier to read, and uses two indexes the original never touched. The thing that made the rewrite possible was not a clever operator. It was finally getting comfortable with the small set of stages that do most of the work, and learning to read what the engine does with them. This article is what I wish someone had handed me when I switched from years of writing SQL to writing MongoDB.

The pipeline is just SQL turned inside out. Instead of a single declarative statement that the planner reorders for you, it is an ordered sequence of transformations you control directly. Every example I read on day one started with db.collection.aggregate([...]) and twenty stages of unfamiliar operators, and my brain would short-circuit. Once the pipeline-as-stream model clicked, aggregation became the thing I reach for first.

Four stages cover 90% of what I write ($match, $group, $lookup, $project), plus $facet when one input has to feed several parallel summaries. The rest of this piece walks those stages, three real pipelines, the shape-semantics differences between $project / $addFields / $set / $unwind (which trip people up), how I read .explain('executionStats'), and the gotchas that catch SQL-fluent engineers their first month.

The pipeline-as-stream mental model

An aggregation pipeline is an array of stages. Each stage takes a stream of documents and returns a stream of documents. The stages run in order: the first stage's input is the collection; each subsequent stage's input is the previous stage's output. There is no planner to reorder stages for you. The order you write them is the order they execute.

Pipeline mental model
  collection -> [stage 1] -> [stage 2] -> [stage 3] -> result
                  filter        group       project

This is the inversion that confuses SQL people. In SQL, WHERE and GROUP BY and SELECT are written in any order and the planner figures it out. In MongoDB, you write them in execution order. $match first (filter), $group second (aggregate), $project last (shape the output). Doing them in a different order produces a different (sometimes wrong, often slow) result.

Stage one: $match

$match is the equivalent of SQL WHERE. It filters the document stream. It uses the same query syntax you would use with find(), which is the easiest part to learn coming from Mongo basics.

db.orders.aggregate([
    {
        $match: {
            status: 'completed',
            createdAt: { $gte: new Date('2026-01-01') },
            total: { $gte: 100 },
        },
    },
]);

The single most important rule: put $match as early as possible, ideally as the first stage. Mongo can use indexes on a $match only if it is the first stage of the pipeline (or the first stage after $sort and similar). A $match after $group or $project runs as a post-filter on the in-memory results, which is much slower.

If you have multiple filters and one is index-eligible while the other is not, split them. The first $match uses the index; the later filter (after a $lookup or $project) runs on the smaller filtered set.

Stage two: $group

$group is GROUP BY plus aggregate functions, fused into one operator. The _id field specifies the grouping key (single field, multiple fields, or an expression). Other fields specify the aggregations.

db.orders.aggregate([
    { $match: { status: 'completed' } },
    {
        $group: {
            _id: '$userId',                       // group key
            orderCount: { $sum: 1 },              // count(*)
            totalSpent: { $sum: '$total' },       // sum(total)
            avgOrderValue: { $avg: '$total' },    // avg(total)
            lastOrderAt: { $max: '$createdAt' },  // max(createdAt)
        },
    },
]);

Three details that took me a minute to internalize:

  1. The grouping key is always called _id in the output. If you group by userId, the output shape is { _id: <userId>, ... }, not { userId: <userId>, ... }. To rename it back, you add a $project stage at the end.
  2. To group by multiple fields, use an object: _id: { userId: '$userId', month: '$month' }. The _id is then a sub-document.
  3. $sum: 1 is the count. $sum: '$field' is the sum of a field. The first form is so common it has its own pattern.

After $group, every document in the stream has the shape { _id, ...aggregations }. The original document fields are gone unless you explicitly carried them through (with $first, $last, $push, etc).

Stage three: $lookup

$lookup is the join operator. It pulls related documents from another collection and embeds them as an array on each input document. This is the operator that took me longest to like, because the syntax has more shapes than any other stage.

The simplest form is the equality join. "For each order, attach the related user."

db.orders.aggregate([
    { $match: { status: 'completed' } },
    {
        $lookup: {
            from: 'users',           // the collection to join with
            localField: 'userId',    // field on this document
            foreignField: '_id',     // field on the joined document
            as: 'user',              // name of the array field to write
        },
    },
    {
        $unwind: '$user',            // turn the 1-element array into a single field
    },
]);

The result of $lookup is always an array (multiple matching documents are valid in Mongo, even if your data is one-to-one). For one-to-one joins I always follow it with $unwind, which converts the single-element array to a scalar. For one-to-many joins I leave the array as is.

There is a more flexible form (let plus pipeline) that lets you express conditions on the join, similar to SQL JOIN ... ON .... I use it when the join needs more than equality:

{
    $lookup: {
        from: 'orderItems',
        let: { orderId: '$_id' },
        pipeline: [
            { $match: { $expr: { $eq: ['$orderId', '$$orderId'] }, isRefunded: false } },
            { $project: { _id: 0, name: 1, qty: 1 } },
        ],
        as: 'items',
    },
}

The let defines variables visible inside the sub-pipeline. The $$ prefix references them. This form lets you filter and project inside the join, so the joined documents arrive already shaped correctly.

Index hint: the foreign side of $lookup is hit once per input document. If orders has a million entries and orderItems.orderId is not indexed, your pipeline will be slow no matter what you do upstream. Always index the foreign field.

One nuance the modern engine adds: only an equality predicate against a $$let variable can use the foreign-side index. If your sub-pipeline filters with $gte or $in against a let variable, or wraps the field in a function before comparing, the join falls back to a per-document collection scan even when the index exists. When I see a $lookup get suspiciously slow after a refactor, the first thing I check is whether someone changed the inner $match from a clean $eq to a $gte or an $in and accidentally bypassed the index.

Stage four: $project

$project is SELECT plus column expressions. It reshapes the output: keeps fields, drops fields, renames, computes new ones.

{
    $project: {
        _id: 0,                                    // drop _id
        userId: '$_id',                            // rename _id to userId
        orderCount: 1,                             // keep as is
        totalSpent: 1,
        avgOrderValue: { $round: ['$avgOrderValue', 2] },  // computed expression
        tier: {
            $cond: {
                if: { $gte: ['$totalSpent', 1000] },
                then: 'gold',
                else: { $cond: { if: { $gte: ['$totalSpent', 100] }, then: 'silver', else: 'bronze' } },
            },
        },
    },
}

1 keeps a field, 0 drops it, an expression computes a new value. _id: 0 is the most common drop because Mongo always returns _id by default.

One rule that catches everyone the first time: you cannot mix inclusion (field: 1) and exclusion (field: 0) in the same $project stage. The engine rejects the pipeline outright. The only exception is _id, which can be excluded alongside an inclusion list. So { _id: 0, name: 1, total: 1 } is fine; { name: 1, internalNotes: 0 } is an error. If you genuinely need to keep most fields and drop one or two, use $unset (or $project with a fully written-out inclusion list). The engine is strict here on purpose; mixing the two is almost always an authoring mistake.

Stage five: $facet (parallel sub-pipelines)

$facet is the stage I reach for whenever a dashboard needs several different summaries from the same input. It runs N sub-pipelines on the same stream and returns one document with N named arrays. Faceted search and "top 10 plus total count" queries are the canonical uses.

db.orders.aggregate([
    {
        $match: {
            createdAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
        },
    },
    {
        $facet: {
            topProducts: [
                { $unwind: '$items' },
                { $group: { _id: '$items.productId', units: { $sum: '$items.qty' } } },
                { $sort: { units: -1 } },
                { $limit: 10 },
            ],
            byCountry: [
                { $group: { _id: '$shippingCountry', orders: { $sum: 1 } } },
                { $sort: { orders: -1 } },
            ],
            totals: [
                {
                    $group: {
                        _id: null,
                        orderCount: { $sum: 1 },
                        revenueCents: { $sum: '$total' },
                    },
                },
            ],
        },
    },
]);

One pass over the matched orders, three results. The two things to know: the $match before $facet is the only stage that uses an index (the sub-pipelines run in memory on the already-fetched stream), and the result document can get large fast since it carries every sub-pipeline's array. I keep $facet to a handful of branches and put a $limit inside any branch that could grow without bound. When I have caught myself reaching for five round trips from the API to build one dashboard, $facet has usually replaced four of them.

$unwind, $addFields, $set, $project: shape semantics in one table

These four stages all reshape documents and the differences are small enough that I keep mistaking them. A short reference:

StageWhat it does to the documentDrops other fields?Typical use
$projectPicks an inclusion or exclusion set; can compute new fieldsYes (anything not listed is gone)Final output shaping
$addFieldsAdds computed fields, leaves originals intactNoMid-pipeline derivations
$setIdentical to $addFields, just a friendlier nameNoSame as $addFields
$unwindReplaces an array field with one document per array elementNo (other fields stay; the array field becomes a scalar)Flattening one-to-many before grouping

$set and $addFields are aliases: same semantics, different keyword. I prefer $set because it reads more like an assignment and shorter pipelines are easier to skim. The mistake I made for months was using $project mid-pipeline to add a single computed field, then wondering where my other fields had gone. If you only want to add or overwrite, reach for $set. Save $project for the last step where you choose what the consumer sees.

Real pipeline 1: weekly funnel

"For each user who signed up in the last 30 days, did they create an order, did they complete checkout, did they make a second order?"

db.users.aggregate([
    {
        $match: {
            createdAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
        },
    },
    {
        $lookup: {
            from: 'orders',
            localField: '_id',
            foreignField: 'userId',
            as: 'orders',
        },
    },
    {
        $project: {
            _id: 1,
            createdAt: 1,
            startedCheckout: { $gt: [{ $size: '$orders' }, 0] },
            completedFirstOrder: {
                $gt: [
                    {
                        $size: {
                            $filter: {
                                input: '$orders',
                                cond: { $eq: ['$$this.status', 'completed'] },
                            },
                        },
                    },
                    0,
                ],
            },
            completedSecondOrder: {
                $gt: [
                    {
                        $size: {
                            $filter: {
                                input: '$orders',
                                cond: { $eq: ['$$this.status', 'completed'] },
                            },
                        },
                    },
                    1,
                ],
            },
        },
    },
    {
        $group: {
            _id: null,
            signups: { $sum: 1 },
            startedCheckout: { $sum: { $cond: ['$startedCheckout', 1, 0] } },
            completedFirstOrder: { $sum: { $cond: ['$completedFirstOrder', 1, 0] } },
            completedSecondOrder: { $sum: { $cond: ['$completedSecondOrder', 1, 0] } },
        },
    },
]);

This is one query that gives you a four-row funnel. The same thing in SQL would be a series of COUNT(DISTINCT user_id) FILTER (WHERE ...) clauses; in Mongo it is the $cond plus $sum pattern.

Real pipeline 2: leaderboard

"Top 10 users by total spend in the last 7 days, with the user's name attached."

db.orders.aggregate([
    {
        $match: {
            status: 'completed',
            createdAt: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) },
        },
    },
    {
        $group: {
            _id: '$userId',
            totalSpent: { $sum: '$total' },
            orderCount: { $sum: 1 },
        },
    },
    { $sort: { totalSpent: -1 } },
    { $limit: 10 },
    {
        $lookup: {
            from: 'users',
            localField: '_id',
            foreignField: '_id',
            as: 'user',
        },
    },
    { $unwind: '$user' },
    {
        $project: {
            _id: 0,
            userId: '$_id',
            name: '$user.name',
            totalSpent: 1,
            orderCount: 1,
        },
    },
]);

The order matters here: $limit comes before $lookup. We narrow to 10 users first, then fetch their names. If you reversed it (lookup all users first, then limit), you would join millions of users and discard 99.99% of the work. Stage order is performance.

Reading .explain('executionStats')

Pipelines hide cost behind a clean syntax, which is why I run .explain('executionStats') on every non-trivial pipeline before it ships. Three numbers in the output are the ones I look at first.

db.orders.aggregate(pipeline, { explain: 'executionStats' });
// or, in the modern driver:
db.orders.explain('executionStats').aggregate(pipeline);

The output is a tree of stage objects, one per pipeline stage, plus the underlying query plan for the first index-eligible stage. The pieces I read every time:

executionStats output, the fields that matter
  totalDocsExamined  vs  nReturned   -> ratio tells you how selective the index was
  totalKeysExamined  vs  nReturned   -> high keys / low returned = index does not match the filter shape
  executionTimeMillis                -> wall time for the whole pipeline
  stage / inputStage                 -> COLLSCAN means no index; IXSCAN + FETCH is the usual healthy shape
  rejectedPlans                      -> if non-empty, the planner considered alternatives; check why

The single most useful read is totalDocsExamined / nReturned. If you returned 10 documents and Mongo examined 4 million, your $match is not using an index. If you see COLLSCAN anywhere on a busy collection, that is the first thing to fix. After I changed how I read explain output, I started catching these issues before they hit production rather than after the on-call ticket.

Gotchas that surprise SQL refugees

Three that I personally hit:

$project does not silently keep fields. If you write $project: { name: 1 }, every field other than name and _id disappears. SQL SELECT name does the same thing, but in an aggregation pipeline the dropped fields are gone for the rest of the pipeline. To carry fields through, list them all, or use $addFields instead, which adds without dropping.

$lookup is not transactional and not always cheap. Each document in the input stream triggers a query against the foreign collection (in the simple form). If the input stream is millions of documents, you fire millions of foreign queries. Mongo optimizes some of this with indexes, but the cost is real. Sometimes denormalizing the joined fields onto the parent collection is the better answer.

$match after $group cannot use indexes. Once you have grouped, the index is gone (the post-group documents do not exist as collection rows). A $match on a grouped count or sum runs as an in-memory filter. This is fine for small group counts and bad for large ones; if you have a million groups and you want only those with totalSpent > 1000, that filter is running on a million in-memory documents.

What I default to today

After enough pipelines, my default shape is:

Default pipeline order
  1. $match (use indexes early)
  2. $sort (if it can use the same index as $match)
  3. $limit (push limit as early as possible)
  4. $lookup
  5. $unwind / $addFields / $set
  6. $group
  7. $sort
  8. $project

Not every pipeline needs every stage, but the order is stable. $match and $limit first to shrink the working set; $lookup after the working set is small; $group and $project at the end to shape output.

Run .explain on every pipeline you ship

Aggregation pipelines are the single biggest reason I still reach for MongoDB on the right workloads. They are more verbose than SQL for simple queries, and that verbosity is the price of admission. But for analytics-style work where the shape of the answer is non-trivial (funnels, cohorts, joins with conditions), the pipeline is more honest about what is happening than SQL is. You can see, line by line, what each stage costs, what data shape flows in and out, where the index will be used. SQL's planner is more powerful, but the pipeline is more legible. For a developer learning to think about query performance, that legibility is a real teaching tool.

If you have been avoiding aggregation pipelines because the syntax looks alien, the four stages above ($match, $group, $lookup, $project) cover most of what you will write in your first year. Learn those four well; reach for $facet when one input feeds many summaries; pull in the others when you actually need them. And run .explain('executionStats') on every pipeline you ship, the same way you would run EXPLAIN ANALYZE on every SQL query. The pipeline tells you exactly what it is doing; you just have to read it.

Back to Articles