Community Article

Hash Tables: The Most Useful Data Structure

Why hash tables earn their reputation, where collisions and load factor actually bite, and the language-specific quirks (JS Map vs Object, Python dict, Java HashMap) that change the game.

Hash Tables: The Most Useful Data Structure

Why hash tables earn their reputation, where collisions and load factor actually bite, and the language-specific quirks (JS Map vs Object, Python dict, Java HashMap) that change the game.

hash-table

hash-map

data-structures

fundamentals

interview-prep

By @oliviafoster

April 16, 2026

Updated May 18, 2026

607 views

4.4 (13)

Hash tables are the most useful data structure in computer science. This article is the long version of why, and the longer version of where they bite. I am going to spend most of the time on the parts that get glossed over in textbooks, because that is where the production bugs live.

If you took a typical service I have worked on and counted the data structures it uses, you would find arrays for ordered things, hash tables for everything else, and maybe a tree somewhere in a query planner you do not maintain. The dominance of the hash table is not an accident. Constant-time average-case lookup is so cheap, and so general, that it has become the default tool. The interesting question is not why we use them. The interesting question is what we forget about them when we do.

What a hash table actually is

A hash table is two things glued together: a hash function and an array of buckets. The hash function turns a key into a bucket index. The bucket holds the value (or in collision-handling cases, a small structure that can hold several key/value pairs). To look up a key, you hash it, jump to that bucket, and read out the value.

# Conceptually, a 4-bucket hash table:
# bucket 0: []
# bucket 1: [("banana", 2)]
# bucket 2: [("apple", 1), ("grape", 3)]   # collision!
# bucket 3: []

h = hash("apple") % 4   # = 2
# walk bucket 2, find "apple", return 1

That is it. The amortized cost of insert, lookup, and delete is O(1) average case, O(n) worst case (when every key collides into the same bucket). The whole rest of this article is about closing the gap between average and worst case in real systems.

Collisions: chaining vs open addressing

Two keys can hash to the same bucket. They will. With 100 keys in 100 buckets and a uniformly random hash, the birthday-paradox math says you can expect roughly 37 occupied buckets and several collisions. Hash tables handle collisions in two main ways.

Separate chaining. Each bucket holds a small list (or, in modern implementations, a small array or even a tree). To find a key, you hash to the bucket, then scan the list. Java's HashMap uses chaining and upgrades the list to a red-black tree once it exceeds 8 entries. Python's dict does not chain. JavaScript's Map is implementation-defined, but V8 uses a variant.

Open addressing. All entries live in the bucket array directly. On a collision, probe a sequence of nearby buckets (linear probing, quadratic probing, double hashing) until you find an empty slot. CPython's dict uses open addressing with perturbed probing. So does Go's map.

Chaining tolerates higher load factors (typically up to ~0.75). Open addressing is more cache-friendly for small entries and tends to use less memory per slot, but it degrades more sharply when full and is sensitive to clustering.

The tradeoff is rarely something you tune yourself. The standard library picked one for you a long time ago. Knowing which is which still matters when you are debugging a slowdown that turns out to be a clustering issue.

Load factor and resize: where amortized hides

Load factor is entries / buckets. As load factor approaches 1, collisions become common and operations slow down. To keep load factor in check, hash tables resize themselves: allocate a larger bucket array (usually doubled), rehash every existing entry into it, and continue.

That resize is where "amortized O(1)" stops being a free lunch. The resize itself is O(n). It happens rarely, but when it happens, every other operation either blocks or sees an inconsistent view.

load factor over time:

0.75 ────────●──── (insert triggers resize)
              │
              ▼
   ┌──────────────────────────────┐
   │ Allocate 2n buckets          │
   │ Walk all n existing entries  │   ← O(n) on this insert
   │ Rehash and copy each one     │
   │ Free old bucket array        │
   └──────────────────────────────┘
              │
              ▼
0.375 ────────●──── (back to amortized O(1))

If you are pre-loading a million entries into a fresh hash table, expect roughly 20 resize events (one per power-of-two crossing). If you know the size up front, pre-size the table. In Java, new HashMap<>(initialCapacity). In Python, just create the dict with a comprehension instead of growing it. In JavaScript, there is no pre-sizing API for Map, which is one reason JS code that builds large maps in a loop is sometimes slower than the equivalent Python.

Real-world quirks: JS Map vs Object, Python dict, Java HashMap

JavaScript: Object vs Map. I see Object used as a hash table all the time, and most of the time it is fine. Two cases where it bites: (1) Object keys are coerced to strings, so obj[1] and obj["1"] are the same entry. Map accepts arbitrary keys including objects and preserves identity. (2) Object has a prototype chain, which means lookups go through __proto__ and an attacker who can write a key like __proto__ can pollute it. Reach for Map when keys are dynamic or untrusted, and when iteration order or .size matters. Reach for Object when keys are statically known or when you need to JSON-serialize the result.

Python: dict. Python dict insertion-order preservation has been guaranteed since 3.7. The hash function for strings is randomized per-process by default to defend against algorithmic complexity attacks (PYTHONHASHSEED). For numerical keys, hash(int) == int, which is predictable but rarely a problem. The thing that catches people: dict lookup raises KeyError. Use .get(key, default) or defaultdict if you want a clean default-value flow.

Java: HashMap vs ConcurrentHashMap vs LinkedHashMap. HashMap is not thread-safe; concurrent writes can corrupt its internal structure (this was famously demonstrated to cause infinite loops in pre-Java-8 versions). ConcurrentHashMap uses fine-grained locking and is the right choice for shared maps. LinkedHashMap preserves insertion order or access order (great for LRU). TreeMap is a balanced BST, not a hash table, with O(log n) ops and key ordering.

Go: map. Go maps are not safe for concurrent use; the runtime will panic if it detects concurrent reads and writes. Use sync.Map for shared maps, or wrap a regular map in a mutex. Iteration order is randomized per-loop intentionally (so people stop relying on it).

Comparison: hash table vs the alternatives

Operation	Hash Table	Sorted Array	Balanced BST	Trie
Insert	O(1) avg	O(n)	O(log n)	O(k)
Lookup	O(1) avg	O(log n)	O(log n)	O(k)
Range query	O(n)	O(log n + k)	O(log n + k)	O(k + r)
Min / Max	O(n)	O(1)	O(log n)	O(k)
Memory overhead	Medium	Low	High	Very high
Worst-case bound	O(n)	O(n)	O(log n)	O(k)

The column that decides things in production is often "range query" or "worst-case bound". A hash table is wrong for time-series workloads where you scan ranges by timestamp. A balanced BST is better when you need predictable per-op latency and ordered iteration. A trie is better when keys are strings with shared prefixes (autocomplete, IP routing).

When hash tables are the wrong choice

Three honest cases I have hit.

First, when keys are short strings with shared prefixes and you need prefix queries. Use a trie. Hash tables cannot tell you which keys start with "foo" without scanning the whole table.

Second, when you need ordered iteration. JS Map and Python dict preserve insertion order, which helps, but they cannot iterate by sort order without a separate structure. If your access pattern is "give me the smallest key greater than X", you want a sorted structure.

Third, when worst-case latency matters more than average latency. The resize spike is real. For real-time systems, an incremental-rehashing hash table (Redis uses one) or a paged structure smooths the spike, but plain HashMap does not.

Try this on your codebase tomorrow

Grep your repo for for ... in ... loops that contain a .includes, .find, or .indexOf against another array. Almost every one of those is a hash table waiting to happen, and replacing the inner scan with a Set or Map lookup is usually a one-line patch that drops the function from O(n^2) to O(n).

Then grep for places where you are using a plain Object as a map with user-supplied keys. Convert those to Map. The keys-are-strings coercion bug is silent, the prototype-pollution bug is a security issue, and the migration is mechanical. I have run this exercise on three different codebases now. Every time it found at least one O(n^2) hot path I did not know about, and at least one place where untrusted keys were going into a bare Object. The hash table is the most useful data structure because it is so general, but "general" only translates to "correct" when you pick the right variant for the job. Take an hour, run those greps, and you will close more performance and correctness gaps than most refactors do.

Back to Articles