Community Article

Union-Find Explained with Three Real Problems

Disjoint-set union seen through Kruskal MST, dynamic connectivity in a message bus, and grid percolation, plus when the path-compression-plus-rank optimization stops being optional.

Union-Find Explained with Three Real Problems

Disjoint-set union seen through Kruskal MST, dynamic connectivity in a message bus, and grid percolation, plus when the path-compression-plus-rank optimization stops being optional.

union-find
kruskal
minimum-spanning-tree
graph-algorithms
data-structures
rileynair

By @rileynair

March 14, 2026

·

Updated May 18, 2026

1,033 views

32

Rate

The first time I wrote a union-find by accident was in a multiplayer game prototype. Players could form parties. Parties could merge when their leaders shook hands in-game. The "is player A in the same party as player B" query was hit on every chat message, and the naive implementation (walk a parent pointer chain on each query) was the slowest part of the chat tick. A friend who was watching me debug looked over my shoulder and said "you have written a union-find with path compression backwards, and badly". He was right. He fixed it in fifteen minutes.

This is the article I wish that friend had handed me earlier. Union-find (sometimes called disjoint-set union, or DSU) is one of those small data structures that hides in plain sight inside larger algorithms. You learn it once, and then you spend years recognizing problems that have a union-find shape and would be a nightmare without one. I want to walk through three real problems where it has been the right tool, with the structure introduced as it becomes necessary.

What union-find actually is

Union-find is a data structure that maintains a collection of disjoint sets and supports two operations.

  • find(x): return a canonical representative of the set containing x.
  • union(x, y): merge the sets containing x and y.

The classic implementation uses a parent array. Each element points to its parent in a tree; the root of the tree is the canonical representative of the set. find(x) walks up to the root. union(x, y) finds both roots and makes one the parent of the other.

The naive version, with no optimizations, is O(n) per operation in the worst case (the trees can become long chains). With two optimizations (path compression during find, and union by rank or size during union), the amortized cost per operation drops to roughly O(α(n)), where α is the inverse Ackermann function. For all practical input sizes, that is effectively constant. It grows so slowly that for n smaller than the number of atoms in the observable universe, α(n) < 5. People shorthand this as O(1).

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path compression: make every node on the path point directly to the root
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False  # already in the same set
        # Union by rank: hang the shorter tree under the taller one
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

    def connected(self, x, y):
        return self.find(x) == self.find(y)

About 20 lines. That is the entire data structure. The two optimizations together (path compression plus union by rank) are what unlock the near-constant amortized cost. Skip either one and you can construct adversarial inputs that drag the operations back to O(log n) or worse.

Real problem 1: Kruskal's MST in a network simulator

The first time I shipped a union-find on purpose was in a network simulator that needed a minimum spanning tree of a randomly generated topology. The simulator generated about 50,000 nodes and 250,000 weighted edges, and we wanted the cheapest set of edges that kept everything connected. The textbook answer is Kruskal's algorithm: sort the edges by weight, walk them in order, and accept each edge if and only if it joins two currently-disconnected components.

The pseudocode reads like English the moment you have union-find:

def kruskal_mst(n, edges):
    # edges: list of (weight, u, v)
    edges.sort()
    uf = UnionFind(n)
    mst = []
    total_weight = 0
    for w, u, v in edges:
        if uf.union(u, v):
            mst.append((u, v, w))
            total_weight += w
            if len(mst) == n - 1:
                break
    return mst, total_weight

The if uf.union(u, v) line is doing the whole "is this edge a cycle" check. If the two endpoints are already in the same component, the edge would form a cycle and we skip it; otherwise we accept it and merge the components. That single line, fast and amortized-constant, is what makes Kruskal practical at scale.

The naive alternative (run BFS/DFS on the partially built graph after every edge to check connectivity) would have been O(E * (V + E)), which on our 250K-edge input was minutes. Kruskal with union-find ran in about 200ms. Same algorithm, different connectivity check.

Prim's algorithm is the other classical MST algorithm and uses a heap instead of union-find. The two are roughly comparable in big-O for sparse graphs; I reach for Kruskal when the input is already a list of edges (no implicit graph structure) and Prim when I have an adjacency list and want to grow the tree from a fixed start node. For the simulator, the input was edges, so Kruskal won.

Real problem 2: Dynamic connectivity in a distributed system

A few years later I was on a team that ran a peer-to-peer message bus. Nodes joined and left dynamically. The application layer asked, on every publish, "are publisher P and subscriber S in the same connected component of the bus right now?" A naive answer was to run BFS on every publish, which on a 10,000-node bus was about 5ms per query. Multiplied by the publish rate, that was a CPU disaster.

Union-find with only the union operation (no edge deletion) gave us O(α(n)) per query, which dropped the per-publish cost to negligible. Membership changes (a node coming online and connecting to two existing nodes) became one union call. Connectivity queries became one find comparison.

The asterisk in this problem is that union-find does NOT natively support disconnect. If a node leaves the bus, you cannot in general "split" the union-find tree it was in. The standard fix is to maintain the structure offline (in event-time order) or to use a more advanced "link-cut tree" data structure, which I have not had to write personally. For our system, departures were rare and we did a full rebuild from the membership log every minute; the per-minute cost of rebuilding the union-find from scratch (a few thousand union calls) was much smaller than per-query BFS would have been.

When you read about "fully dynamic connectivity" in academic papers, the structures get exotic (Holm-Lichtenberg-Thorup, Euler-tour trees) and you almost certainly want a library. For "online connectivity with no deletions", plain union-find is enough and the code is twenty lines.

Real problem 3: Percolation and grid cluster detection

The third problem I want to walk through is the one that finally made union-find click for me as a versatile tool: percolation on a grid. The setup is a 2D grid where each cell is either "open" (a 1) or "blocked" (a 0). You want to know whether the top row is connected to the bottom row through a chain of orthogonally-adjacent open cells. This shows up in flood-fill image processing, in physics simulations of fluid through porous material, and (more relevantly to me) in a procedurally-generated dungeon I was building where I needed to verify the player could reach the exit from the entrance after the procedural pass.

The clever framing is to add two virtual nodes: TOP, connected to every open cell in the top row, and BOTTOM, connected to every open cell in the bottom row. Then walk the grid; for every open cell, union it with each of its open orthogonal neighbours. At the end, the grid percolates if and only if connected(TOP, BOTTOM).

def percolates(grid):
    rows, cols = len(grid), len(grid[0])
    TOP, BOTTOM = rows * cols, rows * cols + 1
    uf = UnionFind(rows * cols + 2)

    def idx(r, c):
        return r * cols + c

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            if r == 0:
                uf.union(idx(r, c), TOP)
            if r == rows - 1:
                uf.union(idx(r, c), BOTTOM)
            if r > 0 and grid[r - 1][c] == 1:
                uf.union(idx(r, c), idx(r - 1, c))
            if c > 0 and grid[r][c - 1] == 1:
                uf.union(idx(r, c), idx(r, c - 1))

    return uf.connected(TOP, BOTTOM)

The grid has R * C cells. The loop does at most 4 * R * C union calls. With path compression and union by rank, the total cost is O(R * C * α(R*C)), which for a million-cell grid is practically O(R * C). A flood-fill BFS variant would also be linear in the cell count, but the union-find version generalizes for free: I can ask "does the player's start position percolate to the exit?" by adding more virtual nodes, and I can answer "how many distinct connected regions are there?" by counting roots after the build pass.

The same pattern appears in Hoshen-Kopelman cluster labelling for image segmentation and in Conway's-Game-of-Life-style component analysis. Once you see the "label connected components" problem shape, union-find is almost always cleaner than recursive flood fill, and it is iterative (no recursion-depth issues on huge grids).

Two optimizations and when each one earns its keep

Path compression and union by rank are individually useful and together transformative. The numbers below are the benchmarks I ran during the multiplayer-game project. Take them as illustrative, not authoritative; the constant factors depend on language and hardware.

VariantAvg find cost (n=1M)
Naive (no optimizations)O(n) worst case
Path compression only~O(log n) amortized
Union by rank onlyO(log n)
Path compression + union by rankO(α(n)) amortized

For most application code, you should always implement both. The line count increase is two or three lines, and there is no realistic input on which the optimized version is slower. There is one situation where I have skipped the rank optimization: when I needed the find to return a deterministic representative based on insertion order (e.g. "always the first element ever added"), and the rank-based union would break that property. In that case, I used path compression with "union by always picking smaller index as the parent", which gives O(log n) find and is fine for inputs in the millions.

Where union-find is the wrong choice

A short list of problems that look like union-find at first glance and are actually something else.

Connectivity with deletions. As I mentioned in Problem 2, union-find does not support disconnect. If your problem has both joins and splits, you need something else (link-cut trees, dynamic-connectivity structures, or rebuild-on-update).

Connectivity queries with weights or distances. "Are A and B connected, and what is the cheapest path between them?" Union-find tells you connected/not-connected, not distance. You want BFS, Dijkstra, or Floyd-Warshall depending on the structure of the graph.

Range or interval problems disguised as connectivity. "Merge overlapping intervals" can be done with union-find but is usually faster with a sort-and-sweep. Reach for the simpler tool first.

When the elements are not naturally enumerable. Union-find works on integer indices. If your elements are strings or arbitrary objects, you need a hash map from element to index first, which is fine but adds code; if you only have a few thousand elements it is rarely worth it.

Five problems, one structure, twenty lines of code

The thing that sold me on union-find is the gap between the surface diversity of the problems it solves and the unchanging tininess of the code. Kruskal's MST, dynamic connectivity in a graph, percolation on a grid, equivalence-class merging in a type checker, friend-of-a-friend computation in a social graph; all of these are different applications of the same twenty lines. Once you can write the structure from memory, the recognition rule is: if your problem is about partitioning things into groups where merges happen and groupings are queried often, union-find is the data structure. The two optimizations (path compression and union by rank) are not optional for production code; they are what makes the amortized cost effectively constant. Write them once, keep the snippet in a utility file in every language you use, and stop dreading the next graph problem that lands on your desk.

Back to Articles