System Design Article
Database Sharding & Partitioning Strategies
Difficulty: Medium
Sharding splits a database into many smaller pieces (shards) so writes and storage can scale across servers. The hard part is not the splitting; it is choosing a shard key that avoids hot shards, supporting cross-shard queries, and rebalancing as the data grows. This lesson covers the four sharding strategies, how to pick a shard key, the operational realities of resharding, and when sharding is the wrong answer.
Database Sharding & Partitioning Strategies
Sharding splits a database into many smaller pieces (shards) so writes and storage can scale across servers. The hard part is not the splitting; it is choosing a shard key that avoids hot shards, supporting cross-shard queries, and rebalancing as the data grows. This lesson covers the four sharding strategies, how to pick a shard key, the operational realities of resharding, and when sharding is the wrong answer.
282 views
5
Vertical vs Horizontal Partitioning vs Sharding
The terms get tossed around interchangeably, but they mean different things.
- Vertical partitioning: split a wide table by columns. The
userstable becomesusers_core(id, email, created_at) andusers_profile(id, bio, avatar_url). Useful when a table has hot small columns and cold large ones. - Horizontal partitioning (single-machine): split a table by rows but keep all partitions on the same database server. Postgres
PARTITION BY RANGE (created_at)is the canonical example. Speeds up scans and lets you drop old data by dropping a partition. - Sharding: horizontal partitioning where each partition lives on its own database server. This is what scales writes and total storage past one machine.
In an interview, when someone says 'shard', they almost always mean the third one: row-level splits across multiple physical servers.
When (and When Not) to Shard
Sharding is expensive. It complicates joins, transactions, schema migrations, backups, and observability. Reach for it only when you cannot avoid it.
Good signals you need sharding:
- A single leader cannot keep up with write QPS even after vertical scaling.
- The dataset is larger than what one node can hold or back up in a reasonable window (usually multi-TB and growing fast).
- Geographic latency requires data to live near users in different regions.
Signals you do not need sharding yet:
- Read traffic dominates - add read replicas instead.
- Specific tables are huge but the rest are normal - try table-level partitioning or moving cold data to archive storage.
- Slow queries on indexed columns - the problem is query design, not horizontal scale.
A common rule of thumb: a tuned PostgreSQL or MySQL instance on modern hardware comfortably serves 10-50k write QPS and tens of TB of data. Most apps never need to shard.
The Four Sharding Strategies
1. Hash-Based Sharding
Apply a hash function to the shard key. Take hash(key) mod N to pick a shard. Distributes load uniformly because any clustering in the key (sequential IDs, common prefixes) is destroyed by the hash.
---------- Hash sharding ----------
shard_id = hash(user_id) % 4
user_id 42 -> hash 0x9F2A -> shard 2
user_id 43 -> hash 0x1B07 -> shard 3
user_id 44 -> hash 0xC4E1 -> shard 1- Pros: even distribution, simple routing.
- Cons: range scans are impossible (related keys land on different shards), and
mod Nrebalancing reshuffles almost everything when N changes. Use consistent hashing (covered in its own lesson) to avoid the second problem.
2. Range-Based Sharding
Assign each shard a contiguous range of the shard key. user_id 1-1M -> shard 1, 1M-2M -> shard 2, and so on.
- Pros: range scans are fast (
WHERE user_id BETWEEN ...hits a single shard). - Cons: monotonically increasing keys cause hot shards (every new user lands on the latest shard). Mitigation: use a non-monotonic key, prepend a hash bucket, or use ordered keys with rotation.
Used by HBase, Bigtable, MongoDB ranged sharding, and CockroachDB.
3. Geographic / Directory-Based Sharding
A lookup table maps each entity (user, tenant) to a specific shard. Common in multi-tenant SaaS:
| tenant_id | shard |
|---|---|
| acme | shard-us-1 |
| globex | shard-eu-3 |
| initech | shard-ap-2 |
- Pros: maximum flexibility (move a tenant to its own shard for noisy-neighbor reasons), easy compliance with data-residency rules.
- Cons: every read needs a directory lookup (cache it!), the directory is itself a single point of failure.
4. Functional Sharding (a.k.a. service-based)
Different tables live on different databases entirely. users on one cluster, orders on another, analytics_events on a third. Often a stepping stone before true horizontal sharding.
Choosing the Shard Key
The shard key is the single most important decision in any sharded system. Get it wrong and the database is forever painful.
A good shard key:
- Has high cardinality - many distinct values, so load can spread.
- Has even access frequency - no key dominates the read or write load.
- Matches the dominant query pattern - so most queries hit a single shard.
- Is immutable per row - changing a row's shard key means moving it, which is slow.
Classic shard keys by domain:
- Multi-tenant SaaS:
tenant_id. All of a tenant's data lives together; cross-tenant queries are rare. - Social networks:
user_id. The user's posts, friends, settings all live together. - Time-series and IoT:
(device_id, day)or justdevice_idwith TTL. Range scans within a device are fast. - E-commerce:
customer_idfor orders, butproduct_idfor inventory - often a system uses both depending on the table.
Bad shard key examples
country- low cardinality and skewed distribution (the US shard explodes).status- tiny number of values, plus values change so rows would have to migrate.created_at- monotonic, so the newest shard is always the hot one.
When the dominant queries do not have a single key
Reality check: most apps have multiple query patterns. You can:
- Pick the most frequent pattern's key as the shard key.
- Maintain a secondary index in a separate sharded table indexed by the other key.
- Use a search engine (Elasticsearch) for ad-hoc queries that span shards.
- Periodically materialize aggregates into a small global table.
This is the kind of trade-off senior engineers make explicitly.
Cross-Shard Operations: The Painful Part
Cross-Shard Joins
A SQL JOIN between rows on different shards is not natively supported. Options:
- Co-locate by shard key - if both tables are sharded by
user_id, the join is local to one shard. - Replicate small reference tables - copy
currenciesorcountriesto every shard so joins always work. - Application-side join - fetch from each shard and merge in code. Pay the latency.
- Scatter-gather - send the query to all shards in parallel, merge results. Bounded by the slowest shard.
Cross-Shard Transactions
Single-shard ACID is easy; cross-shard ACID needs two-phase commit (2PC) or the Saga pattern (covered in the Distributed Transactions lesson). Either adds latency and failure modes. The pragmatic answer is to design the schema so that the things that must be transactional live on the same shard.
Resharding
When one shard fills up or runs hot, you must split it. This is the operationally hardest part of sharding:
- Add new shard(s) and start a logical-replication stream from the affected source shard.
- Wait for the new shard to catch up.
- Switch the routing layer to send the moved keys to the new shard.
- Verify and clean up data on the old shard.
With hash sharding and a fixed N, resharding requires moving most of the data. With consistent hashing or with virtual shards (logically thousands of shards mapped to a smaller set of physical nodes), only a small fraction moves. Always design for resharding from day one.
Query Routing
When the application sends a query, something must route it to the right shard(s). Three architectures:
---------- Routing options ----------
1. Smart client - app library knows the shard map, routes directly
2. Routing tier - a stateless proxy (Vitess vtgate, Citus coordinator) accepts SQL,
parses it, routes to the right shard, merges results
3. Coordinator node - a cluster member that knows the shard map and forwardsMost teams choose option 2 because it keeps app code simple, lets you change the shard map without redeploying every service, and provides a natural place for connection pooling, query rewriting, and metrics.
Pseudocode for a smart client routing layer
const SHARDS = [pgShard0, pgShard1, pgShard2, pgShard3];
function shardFor(userId) {
// FNV-1a 32-bit hash, then mod by shard count
const h = fnv1a(String(userId));
return SHARDS[h % SHARDS.length];
}
async function getUser(userId) {
const shard = shardFor(userId);
return shard.query('SELECT * FROM users WHERE id = $1', [userId]);
}
async function getUsersByEmail(email) {
// Cross-shard scatter-gather (we have no email -> shard map)
const results = await Promise.all(
SHARDS.map((s) => s.query('SELECT * FROM users WHERE email = $1', [email]))
);
return results.flatMap((r) => r.rows);
}Note how the second function pays N times the cost of the first. Cross-shard queries are why the shard key matters so much.
Real-World Examples
How real systems implement this in production
YouTube built Vitess to shard MySQL transparently. The vtgate router speaks the MySQL protocol, parses queries, and routes them to the right shards. Slack, GitHub, and HubSpot run Vitess at very large scale today.
Trade-off: Vitess hides most sharding complexity from the application but constrains schema design - some SQL features (foreign keys across shards, certain joins) are not supported. Teams adapt the schema to fit the routing layer.
Instagram sharded Postgres by user_id from early on, encoding the shard ID inside the primary key. A user's posts are co-located on one shard, so profile rendering is a single-shard read. Cross-user feeds are computed by separate denormalized pipelines.
Trade-off: Per-user queries are blazing fast and trivially cacheable, but cross-shard analytics requires a separate data warehouse path. The split between OLTP and OLAP became a permanent architectural decision.
Discord uses (channel_id, time_bucket) as the partition key for messages. Within a channel, recent messages cluster together for fast reads, but a viral channel only heats up the current bucket instead of the entire channel partition.
Trade-off: Bucketing complicates queries that span buckets (the client must iterate over recent buckets) but keeps tail latency predictable even when individual channels go viral.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Shard by tenant_id. All of a tenant's rows live on one shard, so per-tenant queries are local and tenant-level isolation is easier. Use a directory service (small lookup table) to map tenant_id -> shard, and cache it. For very large tenants, consider giving them a dedicated shard. For very small tenants, pack many of them onto one shard.
Hash sharding distributes load uniformly and prevents hot shards from sequential IDs, but loses locality - range scans become scatter-gather. Range sharding gives fast range scans and natural archiving (drop the oldest range) but creates hot spots when the key is monotonic. For user-keyed OLTP workloads, hash by user_id. For time-series and analytical scans, range by time. Many teams use both: hash on a primary key, range partition each shard by date.
With mod-N, adding or removing a node changes the result of nearly every key's hash mod N, forcing most rows to move. With consistent hashing, each node owns a slice of a hash ring, so adding a node only steals from its neighbors - on average 1/N of the keys move. This is the difference between a quiet capacity addition and a multi-day data shuffle.
First, try not to. Re-shape the schema so transactional data lives on the same shard (co-locate orders with users by user_id). When you cannot avoid it, use the Saga pattern - split the transaction into compensatable steps - or use a distributed transaction protocol like 2PC if the cluster supports it (CockroachDB, Spanner, Vitess with VStream). 2PC adds latency and a coordinator failure mode, so reach for it sparingly.
Hot shard: a small number of keys carrying most of the traffic. Diagnose by sampling the slow query log and grouping by shard key value. Fixes (in order of complexity): cache the hot keys in Redis to absorb reads; add a secondary key dimension to spread the hot key across multiple buckets (key sharding within a shard); split the shard if the heat is geographic; promote the hottest tenants to dedicated shards. Long-term: re-evaluate the shard key choice.
Interview Tips
How to discuss this topic effectively
State the shard key in the first sentence and tie it to the dominant query pattern. 'I'd shard by user_id because the profile read - 80% of traffic - is then a single-shard lookup.'
Always raise the cross-shard query problem unprompted. 'Joins across shards are expensive, so I'd co-locate friends and posts on the same shard or denormalize.' Interviewers expect to hear this.
Acknowledge the operational tax of sharding: harder migrations, harder backups, harder transactions, harder analytics. Then justify why this workload still needs it.
Bring up resharding strategy proactively. 'I'd use virtual shards / consistent hashing so adding a node moves only 1/N of the data, not most of it.'
Know when not to shard. 'Postgres on a 96-core box handles 50k QPS; we should saturate one node before adding shard complexity.' Restraint is a senior signal.
Common Mistakes
Pitfalls to avoid in interviews
Sharding by a low-cardinality column like country or status
Low cardinality means too few distinct values to spread load. The US shard or the 'pending' shard becomes a hot spot. Pick a high-cardinality column, ideally one whose values are accessed roughly uniformly.
Sharding by a monotonically increasing key (created_at, auto-increment id)
All new writes land on the same shard. Either hash the key first, prepend a small bucket, or use an explicit per-shard sequence so growth is spread out.
Choosing the shard key based on what is convenient now, not the dominant query
The shard key has to align with the queries that run the most. Decide your top 3 query patterns first, then pick a key that makes the top 1 or 2 single-shard lookups.
Sharding before you have to
Sharding multiplies the operational load: harder schema migrations, no easy joins, complicated transactions. A well-tuned single Postgres instance handles most workloads up to 50k QPS. Saturate one node first.
Forgetting to plan for resharding from day one
Workloads grow unevenly and shards need to split. If you used naive hash-mod-N, splitting requires moving most rows. Use consistent hashing or virtual shards so adding capacity moves only a small fraction.
