System Design Article
Design a Unique ID Generator
Difficulty: Medium
Design a service that generates globally unique, roughly time-sortable 64-bit IDs at 1M IDs per second across hundreds of application servers, without coordination on the hot path. The interview centerpiece is the trade-off between uniqueness, ordering, size, and coordination cost. We compare UUIDv4 (random, no coordination, 128 bits, no ordering), database AUTOINCREMENT (single point of contention), Twitter Snowflake (64 bits, time-ordered, requires worker_id assignment and clock discipline), Instagram's per-shard hybrid, and ULID/KSUID. We deep-dive into Snowflake: bit layout, clock skew handling, leader election for worker IDs, and the dreaded clock-rollback bug.
Design a Unique ID Generator
Design a service that generates globally unique, roughly time-sortable 64-bit IDs at 1M IDs per second across hundreds of application servers, without coordination on the hot path. The interview centerpiece is the trade-off between uniqueness, ordering, size, and coordination cost. We compare UUIDv4 (random, no coordination, 128 bits, no ordering), database AUTOINCREMENT (single point of contention), Twitter Snowflake (64 bits, time-ordered, requires worker_id assignment and clock discipline), Instagram's per-shard hybrid, and ULID/KSUID. We deep-dive into Snowflake: bit layout, clock skew handling, leader election for worker IDs, and the dreaded clock-rollback bug.
184 views
5
Requirements
Functional Requirements
- Generate unique IDs at high throughput (millions/sec across the fleet).
- 64-bit IDs preferred (compact, fits a database BIGINT, half the size of a UUID).
- Roughly time-sortable: sorting by ID approximates sorting by creation time.
- Decentralized: each application instance can generate IDs without a coordinator on the hot path.
- Configurable epoch: choose a service-specific zero point so timestamp bits last long enough.
Out of Scope (state explicitly)
- ID-based sharding strategy (covered in design-key-value-store).
- ID-based access control / capability tokens.
- Public-facing hashes / short URLs (covered in design-url-shortener).
Non-Functional Requirements
- Throughput: 1M IDs/sec aggregate across the fleet; 4096 IDs/ms per worker (Snowflake-style).
- Uniqueness: never duplicate, even across worker restarts and clock skew within tolerance.
- Latency: ID generation must be local (no network call); <1 microsecond per ID.
- Service lifetime: ID space lasts >50 years from epoch.
- Operability: detect clock rollback, refuse to issue colliding IDs.
Back-of-the-Envelope Estimation
Throughput per Worker (Snowflake)
---------- Snowflake bit budget ----------
Layout (64 bits, 1 reserved):
41 bits = timestamp (ms since custom epoch)
10 bits = worker_id (1024 workers max)
12 bits = sequence (per ms per worker)
1 bit = sign (always 0 to keep IDs positive)
Max IDs per worker per ms: 2^12 = 4096
Max IDs per worker per sec: 4,096,000
Max IDs across 1024 workers per sec: ~4.2 billion4096 IDs per worker per millisecond is plenty for almost any application; if a worker exceeds it, the algorithm waits for the next ms.
Service Lifetime
---------- Lifetime calculation ----------
41 bits of milliseconds = 2^41 ms = ~69.7 years
With a custom epoch (e.g., 2020-01-01), service runs until ~2089.
UUIDv4 has no time bits; UUIDv7 uses 48 bits of ms = ~8900 years.Storage Comparison
---------- ID size ----------
UUIDv4: 128 bits (16 bytes)
Snowflake: 64 bits (8 bytes)
ULID: 128 bits (16 bytes; sortable)
KSUID: 160 bits (20 bytes; sortable)
Mongo ObjectID:96 bits (12 bytes)For a high-volume table (10B rows), 64-bit IDs save ~80 GB vs UUIDs in the primary key column alone, and far more in indexes.
High-Level Design
There are two patterns to consider, and the right answer depends on your operational model.
Pattern A: Embedded Library (no service)
Each application process imports a Snowflake library; on startup, it gets a worker_id from a config store (Zookeeper, etcd, AWS Parameter Store) and generates IDs locally. There is no ID service in the request path.
---------- Embedded library architecture ----------
[ App Server 1 ] [ App Server 2 ] [ App Server 3 ]
| | |
has worker_id=42 has worker_id=51 has worker_id=104
| | |
v v v
generate IDs locally; no network call
| | |
+--------+---------+---------+
|
v (only once at startup, to claim worker_id)
+--------------------+
| Zookeeper / etcd |
| (worker_id assign) |
+--------------------+Pattern B: Centralized ID Service
A dedicated ID-Gen Service exposes GET /id. App servers call it to obtain an ID. Simpler operationally (one fleet to monitor) but adds a network round trip per ID, which is a hard sell at high throughput.
Production systems almost always pick Pattern A. We design it.
API Design (Embedded Library)
// Library on each app process
class SnowflakeGenerator {
constructor(workerId, epoch = 1577836800000) {
this.workerId = workerId;
this.epoch = epoch;
this.sequence = 0n;
this.lastTs = -1n;
}
nextId() {
let ts = BigInt(Date.now()) - BigInt(this.epoch);
if (ts < this.lastTs) {
throw new Error('Clock moved backwards; refusing to issue ID.');
}
if (ts === this.lastTs) {
this.sequence = (this.sequence + 1n) & 0xFFFn;
if (this.sequence === 0n) {
ts = this.waitNextMs(this.lastTs);
}
} else {
this.sequence = 0n;
}
this.lastTs = ts;
return (ts << 22n) | (BigInt(this.workerId) << 12n) | this.sequence;
}
waitNextMs(prev) {
let ts;
do {
ts = BigInt(Date.now()) - BigInt(this.epoch);
} while (ts <= prev);
return ts;
}
}Detailed Design
The two interesting topics are the Snowflake bit layout choices and the clock skew / worker_id assignment problem.
Why 41 + 10 + 12?
The bit budget is a three-way trade among:
- Service lifetime (timestamp bits): more bits = longer life, but constrains the others.
- Worker count (worker_id bits): more bits = more workers, but constrains the others.
- Per-worker throughput (sequence bits): more bits = higher per-worker IDs/ms.
---------- Bit budget choices ----------
Twitter (default): 41 ts + 10 worker + 12 seq = 4096 IDs/ms/worker, 1024 workers, 69 yr
Discord: 42 ts + 5 dc + 5 worker + 12 seq = same per-worker, 32 datacenters * 32 workers
Mongo ObjectID: 32 s ts + 40 random + 24 counter (96 bits, post-3.4 spec)
UUIDv7: 48 ts (ms) + 12 ver/var + 62 random (128 bits)10 worker bits gives you 1024 worker IDs - enough for almost any deployment. 12 sequence bits gives 4096 IDs/ms/worker, which is 4M/s/worker - more than any single process needs. The remaining 41 bits give a 69-year lifetime, comfortably past any service's planning horizon.
Clock Skew: The Production Killer
Snowflake assumes time moves forward. NTP can step the clock backwards (e.g., a leap second adjustment, a misconfigured clock fixing itself). If now() < last_ts, the next ID could collide with one already issued.
---------- Three failure scenarios ----------
1. Small backward jump (a few ms):
- Detect: ts < last_ts.
- Action: pause until clock catches up (sleep then retry).
2. Large backward jump (>1 s):
- Detect: ts < last_ts by more than tolerance.
- Action: refuse to issue IDs; raise alert; engineer intervention.
3. Forward leap (clock jumps minutes ahead):
- Detect: not directly harmful (IDs continue to be unique).
- Action: reduces effective lifetime; usually self-corrects.The canonical fix is to disable NTP step adjustments and only allow slew (gradual). On Linux: ntpd -x slews instead of stepping; with chrony the equivalent is configuring makestep (e.g., makestep 0 -1 to disable stepping entirely, or set a strict slew bound in chrony.conf). The Snowflake process refuses to issue IDs if the clock moves backwards regardless.
Worker ID Assignment
With 1024 worker IDs, you cannot manually configure each app server. Options:
- Zookeeper / etcd ephemeral nodes: the process registers, gets the next available worker_id, holds it for the process lifetime. On crash, the node disappears and the ID is reusable.
- Config service (e.g., AWS Parameter Store): centrally allocate worker_id ranges per service.
- Hash from instance ID: derive worker_id from
hash(hostname) mod 1024. Risk of collision; fine for small fleets.
Production systems use Zookeeper for ephemeral assignment. The complication is reuse: if worker_id 42 was assigned to a process that crashed, can we reuse it immediately? Only if we are sure the dead process is dead (and not generating IDs we missed). Use a 'cool-down' (e.g., 30 s) before reassignment.
Why Not UUIDv4?
UUIDv4 is 128 bits of randomness with no coordination needed. It has two big problems for high-throughput databases:
- Index fragmentation: random keys insert into random pages of a B-tree, causing page splits and poor cache locality. UUID-keyed inserts can be 5-10x slower than sequential keys at scale.
- Size: 128 bits doubles index storage and eats RAM for cache.
UUIDv7 (drafted 2022) fixes the first problem by putting 48 bits of millisecond timestamp in the high bits, making them roughly sortable. It is the modern alternative to Snowflake when you want decentralized + roughly ordered + standardized format. Cost: still 128 bits.
Comparison Matrix
| Strategy | Size (bits) | Coordination | Ordered | Throughput | Use case |
|---|---|---|---|---|---|
| DB AUTOINCREMENT | 64 | Strong (DB) | Yes | DB write QPS | Single-DB apps |
| UUIDv4 | 128 | None | No | Unlimited | Distributed, no ordering needed |
| UUIDv7 | 128 | None | Roughly | Unlimited | Modern decentralized + ordered |
| Snowflake | 64 | Worker assignment | Roughly | 4M/s/worker | Distributed, ordered, compact |
| ULID | 128 | None | Roughly | Unlimited | UUID drop-in with ordering |
| KSUID | 160 | None | Roughly | Unlimited | Larger time range than ULID |
| Mongo ObjectID | 96 | Process ID | Roughly | High | MongoDB primary keys |
| Range allocator | 64 | Periodic | Yes | Very high | Pre-allocate ID blocks per worker |
Range Allocator (Instagram Pattern)
Instagram needed time-ordered IDs but did not want a single Snowflake fleet. Their solution: each shard has its own auto-increment sequence; the high bits of the ID encode the shard. A central service hands out 'ID blocks' (e.g., 1000 IDs at a time) to each app server, which then consumes them locally without coordination.
---------- Range allocator ----------
App server requests block: GET /allocate?count=1000
Response: start=1000000, end=1000999
App server consumes locally; uses the block for the next ~5 s
When block depletes: request next block
Advantage: no per-ID coordination; resilient to brief allocator outages
Disadvantage: IDs may be skipped if the worker dies mid-blockThis is the Snowflake idea inverted: amortize the coordination cost over many IDs. Used by Postgres cache_size on sequences and many internal systems.
Data Model
Worker Registry (Zookeeper / etcd)
---------- ZK node layout ----------
/idgen/workers/0 ephemeral, owner=app-server-23
/idgen/workers/1 ephemeral, owner=app-server-77
...
/idgen/workers/1023 ephemeralOn process startup, attempt to create the lowest-numbered absent worker node. The first to succeed gets that worker_id. On process crash, the ephemeral node disappears (after session timeout, ~10 s) and the ID is available for reassignment after a cool-down.
Optional Audit Trail
CREATE TABLE id_workers (
worker_id INT PRIMARY KEY,
last_owner VARCHAR(64),
last_assigned TIMESTAMPTZ,
last_released TIMESTAMPTZ,
cool_down_until TIMESTAMPTZ
);Tracks history for ops; not on the hot path.
Persisted Last Timestamp (Defensive)
For extra safety against clock rollback, the worker can persist last_ts periodically (every 100 ms) to local disk. On restart, refuse to issue IDs until now() > last_ts_persisted + safety_margin.
-- Local SQLite per worker (optional)
CREATE TABLE snowflake_state (
worker_id INT PRIMARY KEY,
last_ts BIGINT NOT NULL
);This catches the case where a VM reboots after a clock change; the process refuses to issue IDs until time has moved forward.
Scaling and Bottlenecks
Single Worker Throughput
4096 IDs/ms = 4M/s. A single worker exceeding this is rare; if it does, the algorithm waits for the next ms (effectively 4M/s ceiling). Most app processes generate 100-10K IDs/s, far below the ceiling.
Fleet Throughput
1024 workers * 4M/s = ~4 billion IDs/s. More than any system needs.
Worker ID Exhaustion
If you have >1024 worker processes, you outgrow standard Snowflake. Options:
- Repurpose datacenter bits (e.g., 5 dc + 5 worker = 32 dc * 32 = 1024) but you can have many fewer total IDs/ms.
- Reduce sequence bits and add worker bits.
- Use UUIDv7 instead.
Discord hit this at 5+ datacenters and split worker_id into datacenter + worker_in_dc.
Clock Skew Recovery
If a worker detects clock rollback >1 s, it refuses to issue IDs and alerts. Operations:
- Page on-call.
- Drain the affected worker (move traffic to others).
- Investigate NTP / hardware clock.
- Restart the worker once time is monotonic.
Most incidents resolve in <30 minutes; capacity is sized so 1-2 workers offline is invisible.
Adoption Path
Most systems start with database autoincrement. When you outgrow a single DB or need decentralized writes, switch to Snowflake. For greenfield modern systems, UUIDv7 is the simplest path: standard format, decentralized, roughly ordered; you give up the 64-bit compactness.
Trade-offs and Alternatives
Snowflake vs UUIDv4
UUIDv4 wins on simplicity (no infra, no coordination, just call uuid4()). Snowflake wins on size (8 vs 16 bytes), ordering (better index locality, easier 'recent' queries), and the ability to extract the timestamp from an ID for debugging. For high-throughput systems, the ordering and size benefits compound.
Snowflake vs UUIDv7
UUIDv7 is the modern alternative: 128 bits, standardized, roughly ordered, no coordination needed. Snowflake is more compact (64 bits) but requires worker_id assignment infrastructure. For new projects, UUIDv7 is simpler; for systems where every byte matters (high-volume primary keys), Snowflake is still worth it.
Snowflake vs database AUTOINCREMENT
AUTOINCREMENT is fine when you have one logical database. It becomes a bottleneck when you scale writes to multiple primaries (sharding) because each shard has its own counter and IDs collide. Snowflake works across sharded writes by construction.
Strict ordering vs roughly ordered
Snowflake IDs from the same worker are strictly ordered. Across workers, they are only roughly ordered (within the same ms, worker_id determines order, not arrival order). For most use cases (timeline of recent posts) roughly ordered is fine. For strict ordering you need a single counter (or Lamport timestamps), which is hard at scale.
Time-ordering vs randomness for security
If IDs are guessable (Snowflake's timestamp is visible in the high bits), an attacker can enumerate recent IDs to scrape user data or guess pending records. For public-facing IDs (e.g., URLs), use a separate slug or hash; reserve Snowflake for internal IDs only.
Centralized ID service vs embedded library
Centralized: one fleet to operate, but adds a network call per ID (1-5 ms latency, single point of failure). Embedded library: zero call overhead but every app must include the library and handle worker_id assignment. Embedded wins for any high-throughput system.
Range allocator vs Snowflake
Range allocator works when you have a centralized authority that hands out blocks (e.g., one service per shard). Snowflake works when you do not (any worker can generate without external state once it has its worker_id). For Postgres-style sequences, range allocator is the standard implementation; for distributed app fleets, Snowflake is the standard.
Real-World Examples
How real systems implement this in production
Twitter open-sourced Snowflake in 2010 to generate tweet IDs across thousands of servers without coordination on the hot path. The bit layout (41 ts + 10 worker + 12 seq) became the industry default and is the inspiration for Discord, Sony, and Mastodon ID schemes.
Trade-off: Snowflake gives compact, time-ordered IDs at the cost of running infrastructure for worker_id assignment; Twitter eventually retired the standalone Snowflake service in favor of an embedded-library variant for lower latency.
Instagram uses a hybrid: each Postgres shard has an auto-increment sequence; the high bits of the ID encode the shard, allowing globally unique IDs without a central service. They allocate 'ID blocks' per shard via PL/pgSQL functions.
Trade-off: Per-shard sequences are simpler to reason about than Snowflake but tie ID generation to database availability; if the shard is down, you cannot issue IDs for that shard's writes.
Discord uses Snowflake-style IDs but split the 10 worker bits into 5 datacenter + 5 worker to support multi-DC deployment (32 datacenters, 32 workers each). Their snowflake epoch is 2015-01-01 (the year Discord launched).
Trade-off: Splitting worker bits into datacenter + worker lets Discord scale horizontally across DCs but caps each DC at 32 ID-generating workers; they offset by running multiple ID-generating processes per host.
MongoDB ObjectIDs are 96 bits: 32-bit timestamp (seconds) + 24-bit machine ID + 16-bit process ID + 24-bit counter. Generated client-side by the driver, requiring no server round trip.
Trade-off: 96 bits gives more headroom than Snowflake (8 thousand years from 1970) but is 50% larger than 64-bit IDs in storage and indexes; the second-resolution timestamp also gives weaker ordering than Snowflake's millisecond resolution.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
App process starts; library connects to Zookeeper, creates the lowest-numbered ephemeral node under /idgen/workers/, gets worker_id (e.g., 42). Library initializes with epoch (e.g., 2020-01-01), worker_id=42, sequence=0, last_ts=-1. On nextId(): compute ts = now() - epoch in ms. If ts < last_ts: throw (clock moved backwards). If ts == last_ts: increment sequence; if sequence overflows 12 bits, busy-wait until next ms. If ts > last_ts: reset sequence to 0. Compose: id = (ts << 22) | (worker_id << 12) | sequence. Update last_ts = ts. Return 64-bit ID. Total wall time: <1 microsecond when not waiting on clock.
Detection: nextId() sees ts < last_ts. Action: refuse to issue IDs; throw an exception. Operationally: alerting fires (high error rate from this process); load balancer drains the worker; engineer investigates why NTP stepped backwards (misconfiguration, leap second). Long-term fix: configure chronyd or ntpd in slew-only mode (-x) so the clock only catches up gradually, never steps backward. After investigation, restart the process once time is monotonic. The defensive layer: persist last_ts to disk every 100 ms; on restart, refuse to issue IDs until now() > last_ts_persisted + safety_margin.
UUIDv4 (random, 128 bits) is great when: you need zero coordination, ordering does not matter, and you do not care about index locality. It is bad as a primary key for high-throughput tables because random keys cause B-tree fragmentation. UUIDv7 (timestamp-prefixed, 128 bits) is the modern alternative: decentralized, roughly ordered, standardized format, drop-in for UUIDv4 systems. Pick UUIDv7 over Snowflake when standardization matters more than the 8 vs 16 byte difference. Pick Snowflake when you need maximum index efficiency, compact storage, and you can run the worker_id assignment infrastructure.
Use Zookeeper ephemeral nodes. Each process on startup attempts to create the lowest-numbered absent node under /idgen/workers/N (where N is 0..1023). The first to succeed gets that worker_id. The node is ephemeral, so on process death (or session timeout, ~10s), the node disappears and the ID becomes reclaimable. We add a cool-down (e.g., 30s) before reassigning the same ID to ensure the dead process is truly gone. Alternatives: etcd, Consul, AWS Parameter Store with locking, or hash-based assignment for very small fleets (worker_id = hash(hostname) mod 1024, with collision detection).
(a) Snowflake. 64-bit, time-ordered improves index locality, decentralized works across shards. Saves ~80 GB on the PK column alone vs UUID. (b) UUIDv4 or random 64-bit. Trace IDs need only uniqueness, no ordering, no PK locality concerns; speed and decentralization beat everything. The W3C TraceContext standard uses 128-bit random trace_id and 64-bit span_id. (c) Cryptographically random hash. URL slugs must be unguessable to prevent enumeration; Snowflake exposes the timestamp in the high bits which is a security smell. Use a hash of the internal ID + secret, base62-encoded for short URLs (see design-url-shortener).
Interview Tips
How to discuss this topic effectively
Open with the trade-off triangle: uniqueness, ordering, coordination cost. Saying 'we want all three; the cheapest path is local generation with a small per-process worker_id' frames the design.
Explain the Snowflake bit layout from first principles. Walk through 41 ts + 10 worker + 12 seq = 64 bits total, what each gives you, and what happens if you reallocate. This shows you can reason about bit budgets, not just memorize the layout.
Bring up clock skew explicitly. Saying 'NTP can step the clock backwards; we detect and refuse to issue IDs, configure NTP slew-only mode' is the operational signal that distinguishes a senior answer.
Mention UUIDv7 as the modern alternative. Many candidates only know UUIDv4 vs Snowflake; knowing UUIDv7 (timestamp-prefixed UUID, standardized 2022) shows you keep current.
Quantify storage. 'For 10 billion rows, 64-bit IDs save ~80 GB in just the primary key column vs UUID' is the kind of concrete number that lands.
Common Mistakes
Pitfalls to avoid in interviews
Suggesting UUIDv4 as the primary key for a high-throughput sharded database
UUIDv4 is random with no time component, so inserts hit random pages of a B-tree, causing page splits and poor cache locality. Postgres benchmarks show 5-10x slower insert rates with UUIDv4 keys vs sequential. Use Snowflake or UUIDv7 (time-prefixed) for high-throughput tables; use UUIDv4 only when ordering is irrelevant or for security-sensitive public IDs.
Ignoring clock skew in the Snowflake design
NTP can step the clock backwards (leap seconds, misconfigured clock fixing itself, VM time drift). If now() < last_ts, the next ID could collide with one already issued. The fix is to detect rollback and refuse to issue IDs (raise an alert), and configure the OS to use NTP slew-only mode (no step adjustments). Many production Snowflake bugs are clock-skew related.
Trying to make Snowflake IDs strictly globally ordered
Snowflake guarantees strict ordering only WITHIN a single worker. Across workers, two IDs in the same ms compare by worker_id, not by physical wall-clock arrival. For most use cases this is fine; if you need strict global ordering you need a single counter or hybrid logical clocks, which has fundamental scalability limits.
Using a centralized ID service in the request hot path
A centralized service adds a network call per ID. At 1M IDs/sec across 100 servers, that is 1M extra RPCs/sec to one service. Use an embedded library that generates locally with a worker_id obtained at startup; the centralized component (Zookeeper) is consulted only when a worker starts, not per ID.
Hardcoding worker IDs in config
Hardcoding worker_id per app server forces you to manage 1024 unique IDs by hand and breaks blue-green deploys (two processes claim the same ID transiently). Use Zookeeper or etcd ephemeral nodes: the process claims the lowest unused ID at startup, releases on shutdown, with a cool-down before reassignment.
