Community Article

B-Trees and Why Databases Love Them

High fanout, leaf chaining, and the asymmetric access cost that has kept the B+ tree at the heart of PostgreSQL, MySQL InnoDB, MongoDB WiredTiger, and SQLite for fifty years.

B-Trees and Why Databases Love Them

High fanout, leaf chaining, and the asymmetric access cost that has kept the B+ tree at the heart of PostgreSQL, MySQL InnoDB, MongoDB WiredTiger, and SQLite for fifty years.

b-tree

b-plus-tree

database-indexing

database

data-structures

By @valentinamwangi

March 31, 2026

Updated May 18, 2026

724 views

4.6 (9)

If you have ever run EXPLAIN on a query and seen "Index Scan using ix_users_email", you have read the output of a B-tree (more precisely, a B+ tree) lookup. Databases love B-trees in a way that approaches monogamy. PostgreSQL's default index type is a B-tree. MySQL's InnoDB stores every table as a clustered B+ tree on the primary key. SQL Server, Oracle, DB2, and SQLite all reach for the same structure as the workhorse. Even MongoDB, which is positioned as the anti-relational database, uses a B-tree variant in its WiredTiger storage engine. The cross-vendor agreement is not a coincidence. The structure is genuinely the right answer for the problem databases are solving.

I want to walk through what the problem actually is, why a B-tree (and specifically the B+ variant most databases use) is shaped to solve it, and what changes about the design when storage moves from spinning disks to NVMe to "everything in RAM". The structure has been around since 1970 and has survived three storage generations. That kind of staying power is rare in CS, and the reasons are worth understanding even if you never write one yourself.

What a B-tree is, in one diagram

A B-tree is a self-balancing search tree where each node holds many keys (not just one), and each internal node has many children (not just two). The defining parameter is the order m: each node holds at most m - 1 keys and at most m children. For a typical database, m is in the hundreds or low thousands, sized so each node fits in one disk page (commonly 4 KB, 8 KB, or 16 KB).

A B-tree of order 5 (each node holds up to 4 keys, has up to 5 children):

                        [10, 20, 30]
                       /     |     |    \
                  [3,7]  [12,15] [22,25] [33,40,50]

Compare that to a binary search tree, which has at most 2 children per node and 1 key per node. For 1 million entries, a balanced BST has height about 20 (log2(1M) ~ 20). A B-tree with m = 100 has height about 3 (log100(1M) = 3). Same number of entries, one-sixth the height. That single fact is the whole reason databases love B-trees, and the rest of this article is about why height matters when storage is involved.

Why the BST loses to the B-tree on disk

A balanced BST is optimal in main memory, where each pointer dereference costs a few cycles. On disk, every pointer dereference is potentially a page read, and a page read costs around 10ms on a hard drive, around 100 microseconds on an NVMe SSD. That is six to nine orders of magnitude slower than RAM. The cost of an algorithm on disk is dominated by the number of pages it touches, not by the number of comparisons.

A balanced BST traversal of a 1-million-entry index touches about 20 different pages on disk. That is 20 disk reads per lookup. At 100 microseconds each, that is 2 milliseconds per lookup. At 10 ms each (HDD), it is 200 ms per lookup. Either way, far too slow.

A B-tree traversal of the same index touches about 3 pages. That is 3 disk reads per lookup. 300 microseconds on NVMe, 30 ms on HDD. The structure is asymptotically the same O(log n), but the base of the logarithm is the fanout, and a higher fanout collapses the tree.

Disk reads per lookup, 1 million entries

BST (balanced):    ~20 reads  (log_2 1M = 20)
B-tree m=100:      ~3 reads   (log_100 1M = 3)
B-tree m=1000:     ~2 reads   (log_1000 1M = 2)

This is why every database on disk uses some form of multi-way tree, and why the fanout is "as many keys as fit in a page". The structure is shaped by the storage cost model.

B-trees vs B+ trees: the difference databases actually picked

There are two flavors. A B-tree (the original) stores values at every node, including internal nodes. A B+ tree stores values only at the leaves, and leaves are linked together in a sorted singly-linked list.

Almost every database uses B+ trees, not pure B-trees. Three reasons:

Internal nodes store only keys, so they have higher fanout. If keys are 16 bytes and values are 200 bytes, a 16 KB page in a pure B-tree fits about 70 entries, but a B+ tree's internal page fits about 1000 keys. Higher fanout means shallower tree means fewer disk reads.

Range scans are linear, not tree walks. Once you have located the start of a range in the leaves, you follow the sibling pointers to walk the entire range in linear time, never visiting an internal node again. Range scans are how every WHERE created_at BETWEEN ... AND ... query works, and how every ORDER BY indexed_column is served. The B+ tree's leaf chaining is the structural reason these scans are cheap.

Leaf-only values mean leaf splits are cleaner. Without values in internal nodes, internal pages are mostly stable; only the keys propagate up on splits. The maintenance code is shorter and the I/O is more predictable.

B+ tree of order 5, with leaf chaining:

                  [10, 20, 30]                ← internal pages: keys only
                 /     |     |    \
              [3,7]→[12,15]→[22,25]→[33,40]   ← leaves: keys + values + sibling pointer
              key→val pairs in each leaf

The arrows between leaves are the linked list. To range-scan from key 5 to key 35, you walk down to the leaf containing 5, read it, follow the next-pointer to the next leaf, read it, and continue until you cross 35. No backtracking up the tree, no re-traversal of internal nodes.

How real databases use B+ trees

The implementation details vary, and they vary in ways that are worth knowing.

PostgreSQL. PostgreSQL's default index access method is the B-tree (technically a B+ tree, but called "btree" in the catalog). Index pages are 8 KB by default. The implementation is a Lehman-Yao concurrent B+ tree, which uses high keys and right-link pointers to allow concurrent access without page-level locks during traversal. PostgreSQL stores the index separately from the heap table; an index entry contains the key and a tuple ID pointing into the heap. This is a non-clustered design: the table itself is heap-organized (rows in insertion order), and indexes are auxiliary structures.

MySQL InnoDB. InnoDB takes the opposite approach. The primary key index IS the table. The clustered B+ tree's leaves contain the full row, not a pointer to the heap. Looking up a row by primary key is a single B+ tree traversal that lands on the row itself; secondary indexes contain the primary key value as their "pointer", so a secondary lookup that needs the full row does two B+ tree traversals (secondary, then primary). InnoDB pages default to 16 KB. The clustered layout means rows with adjacent primary keys are physically adjacent on disk, which is great for range scans on the primary key and bad for inserts with non-monotonic primary keys (UUIDv4, for example) because they trigger random-page splits. This is the documented reason every InnoDB-on-large-table production guide tells you to avoid random UUIDs as primary keys.

MongoDB / WiredTiger. WiredTiger, MongoDB's default storage engine since 3.2, uses a B+ tree for both data and indexes. Each collection has a single B+ tree storing documents keyed by a generated record ID; secondary indexes are separate B+ trees pointing to that record ID. Pages default to 32 KB but are variable-sized in memory. Document storage in the leaves is BSON. The on-disk format is an LSM-influenced page-write strategy with checkpoints, but the lookup path is a B+ tree traversal.

SQLite. SQLite's pager and B-tree layer are the heart of the engine. Each table is a B+ tree on the row ID (or on the primary key if WITHOUT ROWID is used, which makes the table clustered like InnoDB). Indexes are separate B+ trees. Default page size is 4 KB. SQLite's B-tree implementation is famously clean and well-documented; the source file btree.c is one of the more readable database internals in any open-source project.

The pattern across all four: same family, same idea, page-sized fanout, leaf chaining for range scans. The differences are about clustering (InnoDB and SQLite-WITHOUT-ROWID cluster the table on the primary key; PostgreSQL keeps the heap separate; MongoDB uses an internal record ID), concurrency control (Lehman-Yao right-links, latch coupling, lock coupling), and durability (write-ahead logging, checkpoints, copy-on-write).

Inserts, splits, and why fanout matters more than perfect balance

A B+ tree insert walks down to the leaf, places the key, and if the leaf overflows, splits it: half the keys go to a new sibling leaf, the median key is propagated up to the parent, and if the parent overflows, the split cascades. The tree grows in height only when the root splits, which is rare. For a tree of order 1000, the root splits roughly once per million inserts.

This is the key efficiency: the structure stays balanced without per-insert rebalancing, because each split is a local operation and the high fanout ensures the cascade rarely reaches the root.

Before insert of "33" (leaf is full):

   ... → [30, 31, 32, 33] → ...    ← leaf at capacity 4

After insert of "34", split:

                ... [30] ... [33] ...     ← propagated key
                       \         \
                  [30, 31, 32]   [33, 34]     ← two leaves
                       (linked together →)

The cost of an insert: O(log_m n) page reads to find the leaf, then O(1) for the leaf write, then O(log_m n) for the worst-case split cascade up the tree. In practice, over a million inserts, the average cost is dominated by the first walk; splits are rare.

Random inserts cause more splits than sequential inserts, because they distribute across the tree and each leaf fills up independently. Sequential inserts (auto-incrementing primary keys) are cheap because they always append to the rightmost leaf. This is the practical reason "use a sequential primary key" is universal advice for high-insert workloads on InnoDB.

What B-trees are not good at

The structure is not universal. Three workloads where databases reach for something else:

Write-heavy workloads with random keys. B+ trees pay a real cost on random inserts because of leaf splits and cache misses. LSM-trees (Cassandra, HBase, RocksDB, LevelDB) trade write efficiency for read complexity by buffering writes in memory and merging into sorted runs in the background. RocksDB writes are roughly 10x cheaper than B+ tree writes; reads are slower because they have to check multiple levels.

Full-text search. Inverted indexes (Lucene, Elasticsearch, PostgreSQL's GIN) handle "find documents containing word X" efficiently because they invert the data: the index maps from term to document list. A B+ tree on words would force a linear scan within each row to find term matches; an inverted index does it in one lookup. PostgreSQL keeps GIN as a separate access method for this reason.

Geospatial queries. "Find all points within X km of this location" cannot be answered by a B+ tree on a single dimension. R-trees, quadtrees, and geohash-on-B-tree are the family. PostGIS uses an R-tree-like GiST index. MySQL has a spatial-index B-tree variant.

Equality lookups on very high-cardinality keys. A hash index has O(1) lookup vs B-tree's O(log n). PostgreSQL has hash indexes; they were rarely used for years because of crash-safety issues and now-fixed concurrency limits. The honest answer is that for "find the row where id = 12345", the constant factor on a B-tree's three-page-walk is small enough that most database authors do not bother optimizing for hash. PostgreSQL's hash index is faster than its B-tree for some equality workloads but not by a margin that justifies the operational complexity for most teams.

How storage tiers change the calculation

The B-tree was designed in 1970 for spinning disks where seek time was 10 ms and sequential reads were free. Storage has moved up a tier roughly every decade. The structure has held up because the underlying assumption (sequential access is much cheaper than random access) is still true on every storage tier, just with different constants.

Storage tier	Random read (typical)	Sequential bandwidth	B-tree fanout sweet spot
HDD	~10 ms	~100 MB/s	Hundreds (page = 4-8 KB)
SATA SSD	~100 us	~500 MB/s	Thousands
NVMe SSD	~10-50 us	~3-7 GB/s	Thousands
Persistent memory	~300 ns	~10 GB/s	Tens to hundreds
RAM	~50 ns	~30 GB/s	Smaller (cache-line sized)

The general direction: as random reads get cheaper, fanout matters less and the optimal page size shrinks. Some research databases (SiloDB, MemSQL) use cache-line-sized B-tree nodes (around 128 entries) optimized for L2/L3 cache rather than disk pages. Modern storage engines (ScyllaDB, FoundationDB) tune their page sizes to the underlying storage's read amplification curve.

The structure is the same. The constants move. This is the reason the B-tree has survived three storage generations and will probably survive a fourth: the access-cost asymmetry that birthed it (random > sequential) is a hardware fact, not a tunable.

What this means if you maintain a database in production

Three takeaways that have actually helped me debug or design real systems.

Index a column you expect to range-scan, especially with a sort. The B+ tree's leaf chaining makes range scans almost free. A query that does WHERE created_at BETWEEN A AND B ORDER BY created_at against an indexed created_at column is one tree-walk plus a linear leaf scan, no in-memory sort. Without the index, the database has to scan the whole table and sort the result. The performance gap is several orders of magnitude on large tables.

Avoid random primary keys on InnoDB if writes matter. A UUIDv4 primary key on a billion-row InnoDB table will cause page splits scattered across the entire tree on every insert, and the working set of dirty pages will grow until your buffer pool cannot hold it. Either use a sequential ID (auto-increment, snowflake ID, ULID) or use UUIDv7, which is monotonic by time.

Understand index column order. A B+ tree index on (a, b, c) is a tree where keys are ordered first by a, then by b, then by c. A query WHERE a = ? AND b = ? uses the index efficiently. A query WHERE b = ? with no a predicate cannot use this index for filtering; the database has to scan all leaves. This is the leftmost-prefix rule, and it is structural, not a database-specific quirk.

The next decade: same structure, faster substrate

The interesting forward question is whether B+ trees survive the move to a world where storage is cheap enough to fit working sets in RAM and where the bottleneck is no longer disk I/O but memory bandwidth and cache misses. My read of the research and the industry direction is yes, but with subtler changes. The "page" is no longer a 4 KB disk block; it is a 64-byte cache line, a 4 KB virtual memory page, or a 2 MB hugepage, depending on the level of the hierarchy you are tuning for. The fanout shrinks, the rebalancing logic stays the same, and the leaf chaining keeps mattering because range scans are still common workloads. The names of the data structures change ("Bw-tree", "FOEDUS-tree", "Masstree"), but they all share the same family lineage. The structure that was designed to deal with 1970s disks turns out to deal with 2030s persistent memory just fine, because what it really optimizes is not "disks" but "asymmetric access cost", and that has not gone away. If you understand the shape of the B+ tree (high fanout, balanced, leaf-chained, page-sized), you understand what every relational database has been doing under the hood for fifty years, and what they will probably keep doing for the next fifty.

Back to Articles