The Algorithms - Python Python TheAlgorithms/Python

How the Knuth-Morris-Pratt Algorithm Works

Preprocess the pattern into a failure array, then scan the text without backtracking. The 1977 algorithm that makes grep and DNA search O(n+m).

6 stops ~16 min Verified 2026-05-05
What you will learn
  • Why naive string search backtracks and KMP avoids it by reusing match information
  • What the failure array encodes: the length of the longest proper prefix that is also a suffix
  • How get_failure_array builds the LPS array in O(m) by extending known matches
  • How the main loop advances i through the text while j tracks position in the pattern
  • Why j = failure[j-1] skips ahead instead of resetting j to zero on mismatch
  • How the doctest verifies all five substrings against Python's built-in str.find
  • Why KMP runs in O(n+m) total and where it is used in practice
Prerequisites
  • Comfort reading Python while loops with two index variables
  • Familiarity with string indexing and slicing
  • No prior knowledge of string-matching algorithms required
The story behind KMP

Donald Knuth, James Morris, and Vaughan Pratt independently discovered the failure-function approach to string matching and published it jointly in 1977 in the SIAM Journal on Computing. Knuth had been studying a text-searching problem in a compiler while Morris was building an early text editor, and Pratt connected their approaches through a formal automaton model. The key insight was that when a mismatch occurs at position j in the pattern, the algorithm already knows which characters matched before the failure, and that knowledge tells it how far back to reset without re-examining any text characters.

Before KMP, naive search restarted from position j = 0 on every mismatch, giving O(n times m) worst-case behavior on adversarial inputs. KMP eliminates that backtracking by preprocessing the pattern into a failure function in O(m) time, then scanning the text in O(n) time without ever moving the text pointer backward.

KMP is the theoretical foundation of grep, IDE find-in-files, and DNA pattern search. The failure function encodes the self-similarity structure of the pattern and is directly reused in Aho-Corasick for multi-pattern search.

1 / 6

Function Signature and the Two-Step Algorithm in the Docstring

strings/knuth_morris_pratt.py:4

The docstring names both steps: preprocess the pattern into a failure array, then step through the text comparing one character at a time.

The docstring organizes the algorithm around the problem it solves, not the code that solves it. The complexity claim on line 7, O(n + m) where n is the text length and m is the pattern length, is the reason KMP exists: naive search is O(n times m) on adversarial inputs. The two numbered steps describe a separation of concerns that makes the complexity achievable. The doctest avoids inventing examples and instead checks all five substrings against Python's built-in str.find, which returns the starting index of the match or -1 for no match. That comparison is the most precise contract the function could state: for any input, produce the same result as the standard library. The search string used as both text and pattern in the doctest also exercises the prefix-equals-suffix case the failure function is built to handle.

Key takeaway

The doctest verifies against str.find for five inputs including a miss. That contract is more precise than any ad-hoc example the author could invent.

def knuth_morris_pratt(text: str, pattern: str) -> int:
    """
    The Knuth-Morris-Pratt Algorithm for finding a pattern within a piece of text
    with complexity O(n + m)

    1) Preprocess pattern to identify any suffixes that are identical to prefixes

        This tells us where to continue from if we get a mismatch between a character
        in our pattern and the text.

    2) Step through the text one character at a time and compare it to a character in
        the pattern updating our location within the pattern if necessary

    >>> kmp = "knuth_morris_pratt"
    >>> all(
    ...    knuth_morris_pratt(kmp, s) == kmp.find(s)
    ...    for s in ("kn", "h_m", "rr", "tt", "not there")
    ... )
    True
    """
2 / 6

The Main Loop: Two Pointers, One Advance, No Backtracking

strings/knuth_morris_pratt.py:25

i advances through the text; j tracks position in the pattern. On a full match, return i-j. On mismatch, either use the failure array or advance i.

The text pointer i never moves backward. On each iteration, if the current text character matches the current pattern character, the pattern pointer j advances. When j reaches the last pattern position (line 32), the function returns the start index i - j. On a mismatch with j > 0, the failure array tells us where to reset j without moving i: j = failure[j - 1] jumps to the longest prefix of the pattern that also ends where the current match left off. The continue on line 40 re-evaluates the match at the new j position before advancing i. Only when j == 0 at a mismatch, meaning no prefix information is available, does i advance and the scan continues from scratch. If the loop exhausts the text, the pattern was not found and the function returns -1.

Key takeaway

i never moves backward. On mismatch, failure[j-1] resets j to reuse the longest matching prefix. Only when j is already 0 does i advance.

    # 1) Construct the failure array
    failure = get_failure_array(pattern)

    # 2) Step through text searching for pattern
    i, j = 0, 0  # index into text, pattern
    while i < len(text):
        if pattern[j] == text[i]:
            if j == (len(pattern) - 1):
                return i - j
            j += 1

        # if this is a prefix in our pattern
        # just go back far enough to continue
        elif j > 0:
            j = failure[j - 1]
            continue
        i += 1
    return -1
3 / 6

get_failure_array: Building the LPS Table in O(m)

strings/knuth_morris_pratt.py:45

get_failure_array computes the Longest Proper Prefix that is also a Suffix for each position in the pattern. It returns a list of integers of length m.

The failure array has one entry per pattern character. failure[k] holds the length of the longest proper prefix of pattern[0..k] that is also a suffix of pattern[0..k]. For position 0 that value is always 0, which is why the list is initialized with a single zero. Two pointers i and j scan the pattern: j advances every iteration, while i tracks how long the current matching prefix is. When pattern[i] == pattern[j], the match extends and i increments. When they differ and i > 0, the code applies the same failure-array logic recursively to reset i before retrying. The test file verifies the pattern "aabaabaaa" produces [0, 1, 0, 1, 2, 3, 4, 5, 2], which readers can trace manually to confirm the algorithm before reading the main loop.

Key takeaway

failure[k] is the LPS length at position k. The same skip-and-retry logic used in the main search loop also builds the failure array itself.

def get_failure_array(pattern: str) -> list[int]:
    """
    Calculates the new index we should go to if we fail a comparison
    :param pattern:
    :return:
    """
    failure = [0]
    i = 0
    j = 1
    while j < len(pattern):
        if pattern[i] == pattern[j]:
            i += 1
        elif i > 0:
            i = failure[i - 1]
            continue
        j += 1
        failure.append(i)
    return failure
4 / 6

Tracing the Algorithm: Pattern abc1abc12 in Text

strings/knuth_morris_pratt.py:65

The __main__ block runs four manual tests. Test 1 shows the same pattern in two texts, one containing it and one not, to confirm both the hit and miss paths.

The __main__ block shows tests that would fail in CI if run as doctests because they use assert rather than printing expected output. Test 1 uses a pattern with a repeated prefix: "abc1abc12" starts with "abc1abc1", which means a partial match of the first eight characters followed by a mismatch should not restart from zero. The failure array for this pattern allows the algorithm to skip back to position 4 (after "abc1") rather than resetting to 0, which is the payoff of the preprocessing step. text1 contains the full pattern embedded mid-string; text2 does not, exercising the -1 return path. The pattern structure deliberately contains internal repetition to make the failure array non-trivial, which is why this test case is kept alongside the doctests.

Key takeaway

The repeated-prefix pattern abc1abc12 makes the failure array non-trivial. Test 1 verifies both the hit and the miss return paths with a single pattern.

if __name__ == "__main__":
    import doctest

    doctest.testmod()

    # Test 1)
    pattern = "abc1abc12"
    text1 = "alskfjaldsabc1abc1abc12k23adsfabcabc"
    text2 = "alskfjaldsk23adsfabcabc"
    assert knuth_morris_pratt(text1, pattern)
    assert knuth_morris_pratt(text2, pattern)
5 / 6

Test 6: Verifying the failure Array Values Directly

strings/knuth_morris_pratt.py:99

The final test asserts the exact failure array for aabaabaaa, giving readers a concrete trace they can verify by hand before trusting the main loop.

Asserting the failure array output directly is the most educational test in the file because it gives a concrete example readers can trace by hand. For the pattern "aabaabaaa": position 0 is always 0. Position 1 is 1 because "a" is both a prefix and suffix of "aa". Position 2 is 0 because "b" breaks the prefix-suffix match. Position 3 is 1 because "a" appears at both ends of "aaba". The array continues to build up until position 7 where the value is 5, meaning "aabaa" is both a prefix and a suffix of "aabaabaa". Working through these nine values confirms that the failure array encodes exactly the skip distances the main loop needs, and that no character in the text needs to be re-examined after a mismatch.

Key takeaway

Tracing the 9 values of the failure array for aabaabaaa by hand is the fastest way to understand what information the main loop uses on a mismatch.

    # Test 6)
    pattern = "aabaabaaa"
    assert get_failure_array(pattern) == [0, 1, 0, 1, 2, 3, 4, 5, 2]
6 / 6

Complexity and the Amortized Argument

strings/knuth_morris_pratt.py:4

O(n+m) holds because i never decrements. The total number of j decrements across all mismatch steps is bounded by the total number of j increments.

The O(n + m) claim follows from a counting argument. The text pointer i increments exactly n times and never decrements, so the text is scanned in O(n). The pattern pointer j increments at most once per iteration where i also increments, so across the entire search j increments at most n times. Each failure-array jump j = failure[j - 1] strictly decreases j, so the total number of decrements is bounded by the total number of increments, which is n. The preprocessing step builds the failure array in O(m) by the same argument applied to the pattern alone. The combined cost is therefore O(n + m), achieved without any auxiliary buffer and with a text pointer that never moves backward. This linearity makes KMP the theoretical baseline that faster practical algorithms like Boyer-Moore-Horspool and Aho-Corasick are compared against.

Key takeaway

O(n+m) holds because i never retreats. Total j decrements are bounded by total j increments, capping the mismatch work at O(n).

def knuth_morris_pratt(text: str, pattern: str) -> int:
    """
    The Knuth-Morris-Pratt Algorithm for finding a pattern within a piece of text
    with complexity O(n + m)

    1) Preprocess pattern to identify any suffixes that are identical to prefixes

        This tells us where to continue from if we get a mismatch between a character
        in our pattern and the text.

    2) Step through the text one character at a time and compare it to a character in
        the pattern updating our location within the pattern if necessary

    >>> kmp = "knuth_morris_pratt"
    >>> all(
    ...    knuth_morris_pratt(kmp, s) == kmp.find(s)
    ...    for s in ("kn", "h_m", "rr", "tt", "not there")
    ... )
    True
    """
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free