Data Structures

Suffix Array / Suffix Tree

Difficulty: Advanced

ADVANCED

Complexity varies

Suffix Array / Suffix Tree

Genome alignment tools like BWA and Bowtie find a 100-character read inside a 3-billion-base human reference in milliseconds, not by scanning the genome but by querying an index built once over all of its suffixes. Suffix Array / Suffix Tree is the family of data structures behind that workflow, and the same machinery powers full-text search, plagiarism detection, and the BWT-based compression in bzip2.

This lesson covers the suffix array (a sorted array of starting positions of every suffix), the binary-search pattern match in O(m log n) for a pattern of length m, the LCP (Longest Common Prefix) array that captures shared prefixes between adjacent sorted suffixes, and the suite of substring problems the pair solves: longest repeated substring, number of distinct substrings, and longest common substring. You will also meet the suffix tree, a compressed trie of all suffixes that pushes pattern matching down to O(m) independent of text length, and weigh the suffix array's smaller memory footprint against the suffix tree's faster query.

In Arrays & Strings, sorting and binary search worked over numbers; here the same primitives extend to strings, with a comparator that compares characters position by position. Trie (Prefix Tree) introduced the prefix-matching tree shape, and a suffix tree is a compressed trie built from every suffix of the input.

Next, the curriculum explores data structures organized around a different axis entirely: preserving every historical version of themselves through time.

Advanced

75 min

6 Objectives

5 Sections

Prerequisites:

Arrays & Strings Trie (Prefix Tree)

Topics:

Suffix Array

Suffix Tree

Strings

Sorting

Data Structures

Advanced

Premium

String Matching

What You'll Learn

By the end of this lesson, you will be able to:

Define what a suffix array is and construct one by sorting all suffixes of a string.

Use binary search on a suffix array to find whether a pattern exists in a text in O(m log n) time.

Build the LCP array from a suffix array and explain how it captures shared prefix information between adjacent sorted suffixes.

Solve the longest repeated substring problem using the suffix array and LCP array.

Explain the suffix tree as a compressed trie of all suffixes and describe its key properties.

Compare suffix arrays and suffix trees in terms of construction time, space, query time, and practical trade-offs.

Why This Matters

Suffix arrays enable binary search over all suffixes of a string, reducing pattern matching from O(n*m) brute force to O(m log n). Combined with the LCP array, they solve an entire class of substring problems (longest repeated substring, number of distinct substrings, longest common substring) efficiently.

Understanding suffix arrays deepens your grasp of how sorting and binary search extend beyond numbers to strings, and how auxiliary arrays like the LCP array unlock additional query types through clever reductions.

Suffix trees provide O(m) pattern matching independent of text length, making them the theoretical gold standard for text indexing. While suffix arrays are preferred in practice due to lower memory, suffix trees remain important for understanding the compressed trie concept and Ukkonen's online construction.

These structures are used in production systems including bioinformatics (genome sequence alignment with tools like BWA and Bowtie), full-text search engines, data compression (BWT in bzip2), and plagiarism detection. They appear in advanced competitive programming problems involving string manipulation.

Key Terms

7 terms you'll encounter in this lesson

Suffix

A substring that extends from some starting index i to the end of the string. For string 'banana', the suffix starting at index 2 is 'nana'. A string of length n has exactly n suffixes (or n+1 if the empty suffix is included).

Suffix Array

An array of integers representing the starting indices of all suffixes of a string, sorted in lexicographic order. For 'banana$', the suffix array is [6, 5, 3, 1, 0, 4, 2], meaning suffix '$' (index 6) comes first and 'nana$' (index 2) comes last.

LCP Array (Longest Common Prefix)

An array where LCP[i] stores the length of the longest common prefix between the i-th and (i-1)-th suffixes in the sorted suffix array. It captures how similar adjacent sorted suffixes are and enables efficient substring queries.

Sentinel Character ($)

A character appended to the string that is lexicographically smaller than any character in the alphabet. It ensures no suffix is a prefix of another suffix, which simplifies sorting and avoids ambiguity in comparisons.

Suffix Tree

A compressed trie (also called a Patricia trie) of all suffixes of a string. Each edge is labeled with a substring, and each leaf represents one suffix. It uses O(n) nodes and enables O(m) pattern matching where m is the pattern length.

Longest Repeated Substring

The longest substring that appears at least twice in a string. It equals the maximum value in the LCP array, since repeated substrings correspond to shared prefixes between sorted suffixes.

Kasai's Algorithm

An O(n) algorithm for computing the LCP array from a suffix array. It exploits the observation that if the LCP between suffixes starting at positions i and j is h, then the LCP between suffixes starting at i+1 and j' is at least h-1.

Heads Up

Common misconceptions to watch out for

"A suffix array stores the actual suffix strings"

A suffix array stores only the starting indices (integers) of each suffix, sorted by the lexicographic order of the suffixes they represent. The actual strings are never copied -- they are referenced by index into the original string. This is what makes the space O(n) integers rather than O(n^2) characters.

"Suffix trees and suffix arrays solve different problems"

They solve the same class of problems. Any query answerable by a suffix tree can also be answered (often with the help of the LCP array) by a suffix array, and vice versa. The practical difference is in constant factors: suffix trees use more memory but offer O(m) pattern search, while suffix arrays use less memory but require O(m log n) for pattern search.

"Building a suffix array always takes O(n^2 log n) time"

The naive approach (sort all suffixes using comparison sort) takes O(n^2 log n) because each comparison takes O(n) and sorting takes O(n log n) comparisons. However, efficient algorithms like the prefix-doubling method achieve O(n log^2 n) or O(n log n) with radix sort, and SA-IS achieves O(n) linear time.

"The LCP array is just a nice-to-have optimization"

The LCP array is essential for most suffix array applications. Without it, the suffix array alone only supports pattern search via binary search. With the LCP array, you can solve longest repeated substring (max of LCP), count distinct substrings (n*(n+1)/2 - sum of LCP), and many other problems. The LCP array transforms the suffix array from a search index into a full substring analysis tool.