Community Article

KMP and Rabin-Karp Without the Pain

The failure function explained as a memo, rolling hash explained as a sliding window, and the test-case patterns where each algorithm earns its keep.

KMP and Rabin-Karp Without the Pain

The failure function explained as a memo, rolling hash explained as a sliding window, and the test-case patterns where each algorithm earns its keep.

kmp

rabin-karp

string-matching

rolling-hash

string-manipulation

By @ryancastillo

January 25, 2026

Updated May 18, 2026

329 views

4.3 (11)

Imagine you're scanning a 50 GB log file for the string "OutOfMemoryError" to figure out which deployment first crashed. The naive approach, text.indexOf(pattern), runs in O(n * m) worst case (n is text length, m is pattern length), and on adversarial inputs that becomes minutes. JavaScript's String.indexOf is not naive; engines like V8 use Boyer-Moore-Horspool for long enough patterns. But the moment you write your own pattern matcher (in a database engine, a search index, an XSS sanitiser), you face the same set of choices. Two algorithms come up over and over in the literature and in interview questions: KMP and Rabin-Karp. Both run in O(n + m) average time. Neither is intuitive on first read. Both have a reputation for being painful to implement.

The goal of this article is to make each of them clickable in 15 minutes, by introducing the central idea before the math. KMP's failure function is a memo, not a derivation. Rabin-Karp's rolling hash is a sliding window, not a number-theoretic incantation. Once you have the right mental picture, the code follows.

The naive baseline (so you know what we are escaping)

function naiveSearch(text, pattern) {
    for (let i = 0; i <= text.length - pattern.length; i++) {
        let j = 0;
        while (j < pattern.length && text[i + j] === pattern[j]) j++;
        if (j === pattern.length) return i;
    }
    return -1;
}

For most inputs this is fast enough. The pathological case is a pattern like "aaab" searched in a text like "aaaaaaaab", where each starting position rescans most of the pattern. Worst-case runtime is O(n * m). KMP and Rabin-Karp both attack this same worst case.

KMP: the failure function is a memo of "what would have matched"

The insight that took me years to internalise: KMP does not look ahead. It looks at the pattern, in advance, and figures out what to do when a partial match fails. The failure function fail[k] answers one question: "if I had matched pattern[0..k-1] against the text and the next character doesn't match, what is the largest prefix of pattern that is also a suffix of pattern[0..k-1]?" That largest matching prefix-suffix is where to resume; everything to its left has already been verified.

Let's make this concrete. Pattern: "abacab". Failure function:

KMP failure function for pattern "abacab"

index k:    0   1   2   3   4   5
pattern:    a   b   a   c   a   b
fail[k]:   -1   0   0   1   0   1   2
           ^ for empty prefix, fail = -1 (sentinel)

(Some textbooks use fail[k] to mean the value before position k, others mean after. I am using the after-position convention here, so fail[6] is the value used when the full pattern matched, indicating where to continue searching for the next match.)

Reading the table: after matching "a" (length 1), if the next text char does not match, the longest proper prefix of "a" that is a suffix is empty, so fail = 0. After matching "ab" (length 2), if the next char fails, the longest matching prefix-suffix is empty, so we restart. After "aba", the prefix "a" is also a suffix, so fail = 1. After "abac", no match. After "abaca", the prefix "a" is again the longest suffix, so fail = 1. After "abacab" (the full pattern), the prefix "ab" is also a suffix, so fail = 2.

With this table, the search loop never re-examines characters in the text. When a mismatch occurs at text position i after matching k chars of the pattern, the next attempt resumes at text position i with pattern position fail[k]. The pattern "slides forward" by k - fail[k] positions, never backward, and the text pointer never moves backward. Total work is O(n + m).

function buildFailure(pattern) {
    const fail = [0];
    let k = 0;
    for (let i = 1; i < pattern.length; i++) {
        while (k > 0 && pattern[i] !== pattern[k]) k = fail[k - 1];
        if (pattern[i] === pattern[k]) k++;
        fail.push(k);
    }
    return fail;
}

function kmpSearch(text, pattern) {
    if (pattern.length === 0) return 0;
    const fail = buildFailure(pattern);
    let k = 0;
    for (let i = 0; i < text.length; i++) {
        while (k > 0 && text[i] !== pattern[k]) k = fail[k - 1];
        if (text[i] === pattern[k]) k++;
        if (k === pattern.length) return i - k + 1;
    }
    return -1;
}

The buildFailure and kmpSearch functions are nearly identical: both walk a string and adjust k (the length of the current match) using the failure table. That parallel is not coincidental; the failure function is itself computed by running KMP on the pattern against itself, a one-line bootstrap.

The runtime is honestly O(n + m). Each of n text chars triggers at most one k++ and at most O(amortised) decrement steps via fail[k-1], because every decrement is paid for by a previous increment. The amortised argument is the same one used for dynamic-array resize.

Rabin-Karp: rolling hash is a sliding window over numbers

Rabin-Karp's idea is different. Instead of preprocessing the pattern, hash it. Then walk a window of size m over the text, hash each window, and compare hashes. If two hashes match, do a final character-by-character verification to rule out collisions. The naive cost would be O(n * m) because each window-hash computation reads m characters. The trick is the rolling hash: when the window slides by one position, you can update the hash in O(1) by removing the contribution of the leaving character and adding the contribution of the entering one.

The most common rolling hash is a polynomial hash modulo a large prime:

hash(s[0..m-1]) = (s[0] * b^(m-1) + s[1] * b^(m-2) + ... + s[m-1] * b^0) mod p

When the window slides from [i, i+m-1] to [i+1, i+m], the hash updates as:

newHash = (oldHash - s[i] * b^(m-1)) * b + s[i+m]    // mod p

The s[i] * b^(m-1) term is the contribution of the character that just left, scaled to its leading-coefficient position. We subtract it, then multiply the rest by b to shift everything up one position, then add the new character. Subtract-then-add. That is the rolling hash, and once you see the polynomial structure, it is a sliding window over numbers, exactly analogous to a sliding-window-sum over an array.

function rabinKarp(text, pattern) {
    const m = pattern.length;
    const n = text.length;
    if (m === 0) return 0;
    if (m > n) return -1;

    const base = 256;
    const mod = 1_000_000_007n;
    const baseBig = BigInt(base);
    let highPow = 1n;
    for (let i = 0; i < m - 1; i++) highPow = (highPow * baseBig) % mod;

    let patternHash = 0n;
    let windowHash = 0n;
    for (let i = 0; i < m; i++) {
        patternHash = (patternHash * baseBig + BigInt(pattern.charCodeAt(i))) % mod;
        windowHash = (windowHash * baseBig + BigInt(text.charCodeAt(i))) % mod;
    }

    for (let i = 0; i <= n - m; i++) {
        if (patternHash === windowHash) {
            // verify: hashes can collide
            let j = 0;
            while (j < m && text[i + j] === pattern[j]) j++;
            if (j === m) return i;
        }
        if (i < n - m) {
            const leaving = BigInt(text.charCodeAt(i));
            const entering = BigInt(text.charCodeAt(i + m));
            windowHash = (windowHash + mod - (leaving * highPow) % mod) % mod;
            windowHash = (windowHash * baseBig + entering) % mod;
        }
    }
    return -1;
}

The BigInt use is JavaScript-specific because regular number lose precision at 2^53, and a polynomial hash with base = 256 and m even moderately large would overflow. In Java, C++, or Python this is cleaner because of native long/int64/arbitrary-precision. The mod prime 1_000_000_007 is a common choice; pairs of mod primes (double hashing) reduce collision probability further if you are paranoid.

Average runtime is O(n + m). Worst-case is O(n * m), achieved when the hash collides on every window (extremely unlikely with a good hash and a large prime, but possible for adversarial inputs targeting your specific hash function). The verification step keeps the answer correct even when collisions happen.

When to reach for which one

KMP wins when:

You're searching once and need worst-case O(n + m) guaranteed (no collision step).
The pattern is moderately long (the failure-function preprocessing is O(m), paid once).
You don't want to deal with hash arithmetic, modular multiplication, or BigInt overflow.

Rabin-Karp wins when:

You're doing multiple-pattern search. Hash all patterns, hash each window, compare against the set. With KMP you'd need a separate failure function per pattern, and the multi-pattern KMP variant (Aho-Corasick) is much heavier to implement.
You're searching for a pattern in many texts and the pattern hash is precomputed once.
You're matching with hash-as-fingerprint semantics already (plagiarism detection, near-duplicate documents, content addressing).

Neither is the right choice when:

The pattern is short (under 5-10 chars). Naive indexOf with engine optimisations (SIMD, Boyer-Moore-Horspool) usually beats both.
The pattern matches many times and you need all match positions efficiently. Aho-Corasick or suffix automata become competitive.
The pattern has wildcards or regex semantics. Use a regex engine; KMP and Rabin-Karp are exact matchers.

Test-case shapes I use to validate either implementation

When I write either of these from scratch, I have a small corpus of test inputs that exercise the painful cases:

Empty pattern -> return 0.
Pattern equals text -> return 0.
Pattern longer than text -> return -1.
No match -> return -1.
"aaab" searched in "aaaaaaaab" -> the classic naive killer. Both KMP and Rabin-Karp should run in linear time on this.
"abcabcabd" searched in a long string of "abc" -> tests overlapping prefix-suffix logic in KMP.
A pattern that hash-collides intentionally -> verifies that Rabin-Karp's verification step kicks in. (You can construct collisions for known mod primes; this is the test that makes you double-hash if you need adversarial-input resistance.)
Multi-byte / Unicode text -> in JS, work with code units (charCodeAt) and document the limitation that surrogate pairs are matched as halves. For full Unicode-aware matching, normalise both inputs first.

The absolute first test I run is pattern.length === 0. Both implementations have to handle it explicitly because the loop bodies assume m >= 1. I have shipped a Rabin-Karp once that had an off-by-one on empty pattern, and the bug went undetected for two weeks because the test suite did not include it.

A short story on when the naive solution wins

In 2023 I was reviewing a search-feature PR for a side project. The author had implemented Rabin-Karp from scratch for searching user-provided patterns in titles. The patterns were on average 4 characters long, the titles were 50-200 characters. I asked them to also benchmark JavaScript's built-in String.indexOf. The built-in beat their hand-rolled Rabin-Karp by 8x on the median input and 30x on the short-pattern tail. The reason: V8's indexOf for short needles uses SIMD-accelerated character scans, and the constant factor is so small that no algorithmic cleverness in user code can keep up. We removed the Rabin-Karp implementation, dropped 200 lines of code, and shipped the built-in. The PR merged faster, the search felt snappier, and the lesson was that KMP and Rabin-Karp are tools for the cases where you cannot use the built-in, not optimisations to apply by default. If your pattern is short and you're searching in JavaScript or Python or Java strings, trust the runtime first, profile second, hand-roll only if profiling gives you a reason. The pain of these algorithms is real, and you should pay it only when the gain is also real.

Back to Articles