The Algorithms - Python Python TheAlgorithms/Python

How Rabin-Karp String Matching Works

Hash the pattern, slide a rolling hash across the text, confirm on match. The 1987 algorithm that makes multi-pattern search practical.

5 stops ~14 min Verified 2026-05-05
What you will learn
  • Why a rolling hash reduces per-shift work from O(m) to O(1) by subtracting the outgoing character
  • How alphabet_size and modulus together produce a polynomial hash with collision resistance
  • Why modulus_power represents alphabet_size raised to the pattern-length-minus-one power
  • How the hash equality test serves as a cheap filter before the expensive substring equality check
  • Why false-positive hash collisions are handled by the exact string comparison on the same line
  • How test_rabin_karp validates both a hit and a miss with the same pattern in two texts
Prerequisites
  • Comfort reading Python for-loops with modular arithmetic
  • Familiarity with the concept of a hash function producing integers
  • No prior knowledge of rolling hashes or string-matching algorithms required
The story behind Rabin-Karp

Michael Rabin and Richard Karp published the rolling-hash string matching algorithm in 1987. The insight was deceptively simple: if you can compute the hash of a length-m window in O(1) as it slides across the text, the total cost of checking all positions drops to O(n) expected, compared to O(nm) for naive per-position string comparison.

The rolling-hash trick works because a polynomial hash over a sliding window satisfies the recurrence: subtract the contribution of the outgoing character, multiply the remaining hash by the base, and add the incoming character. That recurrence costs three arithmetic operations per step, not m character comparisons. A hash match does not guarantee a string match, so the algorithm performs an exact comparison only when the hashes agree, keeping the expected-case cost at O(n+m) while handling hash collisions gracefully.

The algorithm naturally extends to multi-pattern search: compute k pattern hashes upfront and check each window hash against a hash set in O(1). That extension is used by code-deduplication tools, rsync for block-level file comparison, and Git fast-import. The same rolling-hash technique underpins polynomial hashing in competitive programming and the Rabin fingerprinting scheme used in content-defined chunking for backup systems.

1 / 5

Module Constants: Alphabet Size and Modulus

strings/rabin_karp.py:1

alphabet_size is the polynomial hash base (256 for ASCII); modulus is a large prime used to keep hash values from overflowing into slow big-integer arithmetic.

Two module-level constants define the hash arithmetic. alphabet_size = 256 treats each character as a base-256 digit, so the hash of a length-m string is a polynomial evaluated at 256. Using 256 means any byte value maps cleanly to a unique digit, which is important for the UTF-8 test case in test_rabin_karp. modulus = 1000003 is a prime chosen to be large enough that random same-length strings have a collision probability of about 1 in a million, while small enough that all arithmetic stays within Python's fast integer range. The rolling-hash step relies on modular arithmetic to keep intermediate values bounded, so every addition and multiplication wraps at modulus. A larger prime reduces the false-positive rate; a smaller prime speeds up modular reduction. Both choices are explicit in the source so readers can substitute different values.

Key takeaway

alphabet_size is the polynomial base; modulus is the prime that keeps hash values bounded. Both choices trade collision rate against arithmetic speed.

# Numbers of alphabet which we call base
alphabet_size = 256
# Modulus to hash a string
modulus = 1000003
2 / 5

Initial Hash Computation: Pattern and First Window Together

strings/rabin_karp.py:29

The first loop hashes the pattern and the first m characters of the text simultaneously, and accumulates modulus_power as alphabet_size raised to m-1.

The initial hash loop computes three values in one pass. p_hash and text_hash are both built by the same Horner's method recurrence: multiply the running hash by alphabet_size and add the next character's ordinal, all modulo modulus. After the loop, both are the polynomial hash of their respective length-m strings. The third accumulator, modulus_power, tracks the value of alphabet_size raised to the power of m - 1, which is needed in the rolling-hash step to subtract the outgoing character's contribution. The continue on line 37 skips the modulus-power update on the last iteration, because the rolling hash never needs alphabet_size^m; it only needs up to alphabet_size^(m-1). This is the only loop that reads the pattern; the pattern pointer never moves after this.

Key takeaway

One pass computes three values: p_hash, text_hash for position 0, and modulus_power (base raised to m-1) needed for the rolling step.

    p_hash = 0
    text_hash = 0
    modulus_power = 1

    # Calculating the hash of pattern and substring of text
    for i in range(p_len):
        p_hash = (ord(pattern[i]) + p_hash * alphabet_size) % modulus
        text_hash = (ord(text[i]) + text_hash * alphabet_size) % modulus
        if i == p_len - 1:
            continue
        modulus_power = (modulus_power * alphabet_size) % modulus
3 / 5

The Rolling Hash: O(1) per Shift Instead of O(m)

strings/rabin_karp.py:41

Each shift subtracts the outgoing character's contribution, multiplies by alphabet_size, and adds the incoming character. All three operations are O(1).

The sliding-window loop is where the rolling hash pays off. Each iteration checks whether the current window's hash matches the pattern hash. If they match, an exact string comparison on line 42 confirms the hit and returns immediately, handling the case where two different strings happen to produce the same hash value modulo modulus. If the window does not match or produces a hash collision, the hash is updated by the rolling recurrence on lines 47 to 50: subtract ord(text[i]) * modulus_power (the outgoing character's contribution scaled to its position), multiply the result by alphabet_size (shift all remaining characters one position left in the polynomial), and add the new character entering the window from the right. The continue on line 44 skips the update at the last position because there is no next window to hash.

Key takeaway

Subtract outgoing, multiply, add incoming: three operations move the hash forward one position in O(1) instead of rehashing m characters.

    for i in range(t_len - p_len + 1):
        if text_hash == p_hash and text[i : i + p_len] == pattern:
            return True
        if i == t_len - p_len:
            continue
        # Calculate the https://en.wikipedia.org/wiki/Rolling_hash
        text_hash = (
            (text_hash - ord(text[i]) * modulus_power) * alphabet_size
            + ord(text[i + p_len])
        ) % modulus
4 / 5

Test Function and the Doctest Contract

strings/rabin_karp.py:54

test_rabin_karp is the only doctest in the file. It runs five pattern-text pairs with explicit asserts and prints Success on completion.

The test function wraps all assertions in a single callable so CI can verify the entire test suite through a single doctest: test_rabin_karp() prints Success. only after all asserts pass. This is a deliberate design choice, because individual assert statements do not produce doctest-compatible output. Test 1 uses the same repeated-prefix pattern as the KMP test file, which is a good stress test for rolling hashes: the pattern appears in text1 but not in text2, and a naive implementation might report a false match if the rolling hash coincidentally collides at the position where the partial prefix appears in text2. Test 5 uses a two-byte UTF-8 character "Lü" to verify that the 256-base hash correctly handles multi-byte code points by treating each byte as a separate element.

Key takeaway

All tests wrap in one function so a single doctest covers them. Test 5 with a UTF-8 two-byte character verifies the 256-base hash handles multibyte input.

def test_rabin_karp() -> None:
    """
    >>> test_rabin_karp()
    Success.
    """
    # Test 1)
    pattern = "abc1abc12"
    text1 = "alskfjaldsabc1abc1abc12k23adsfabcabc"
    text2 = "alskfjaldsk23adsfabcabc"
    assert rabin_karp(pattern, text1)
    assert not rabin_karp(pattern, text2)
5 / 5

Why False-Positive Hash Collisions Do Not Produce Wrong Answers

strings/rabin_karp.py:41

Hash equality is a cheap filter only. The exact substring check text[i:i+p_len] == pattern on the same line rejects hash collisions before returning True.

The correctness guarantee rests on line 42. A hash match is a necessary but not sufficient condition for a true match: two different strings of the same length can produce the same value modulo modulus, which would give a false positive if the function returned on hash equality alone. The and text[i : i + p_len] == pattern clause on the same line performs a full character-by-character comparison only when the hashes agree, so false positives are caught before the function returns True. The expected cost of this confirmation step is O(m) per hash collision, but with a prime modulus of around one million the expected number of collisions over a length-n text is approximately n divided by one million, making the total cost of all confirmations negligible in practice. The two-level filter design is the canonical Rabin-Karp pattern.

Key takeaway

Hash equality triggers an exact comparison, not a return. That single and clause is what prevents false positives from reaching the caller.

    for i in range(t_len - p_len + 1):
        if text_hash == p_hash and text[i : i + p_len] == pattern:
            return True
        if i == t_len - p_len:
            continue
        # Calculate the https://en.wikipedia.org/wiki/Rolling_hash
        text_hash = (
            (text_hash - ord(text[i]) * modulus_power) * alphabet_size
            + ord(text[i + p_len])
        ) % modulus
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free