The Algorithms - Python Python TheAlgorithms/Python

String Algorithms in TheAlgorithms/Python

Seven string search and comparison algorithms from KMP failure arrays to Aho-Corasick automata.

7 stops ~14 min Verified 2026-05-05
What you will learn
  • How KMP's failure array lets the search pointer jump backward without revisiting text characters
  • How Rabin-Karp's rolling hash updates in constant time by subtracting the outgoing character and adding the incoming one
  • How Boyer-Moore's bad-character heuristic shifts the pattern past a mismatching text character
  • How Manacher's algorithm reuses previously computed palindrome lengths to avoid redundant expansions
  • How the Z-function interval tracks the rightmost matching prefix to give O(n) pattern search
  • How Levenshtein distance fills a rolling 1D array row by row instead of a full 2D table
  • How Aho-Corasick fail transitions let a multi-pattern automaton fall back without restarting from the root
Prerequisites
  • Familiarity with Python strings and list indexing
  • Basic understanding of hashing and modular arithmetic
  • Comfort with nested loops and index arithmetic
1 / 7

KMP: a failure array lets the pattern pointer jump back safely

strings/knuth_morris_pratt.py:25

KMP avoids re-scanning text by using a precomputed failure array that maps each pattern position to the length of its longest proper prefix that is also a suffix.

Naive string search compares the pattern to every position in the text, restarting from the beginning of the pattern on any mismatch. That makes it O(nm) in the worst case. KMP avoids restarting by recognizing that if a partial match has occurred, the failure array encodes exactly how much of the pattern can be reused. When pattern[j] != text[i] and j > 0, the pattern index jumps to failure[j-1] rather than to zero. This retains all the matched prefix that is also a suffix of the partial match. The text index i never moves backward, giving an overall O(n + m) search. When j reaches the last index of the pattern, the match start position i - j is returned. For the full module including the failure array construction, see the deeper tour.

Key takeaway

KMP's key guarantee is that i never moves backward: the failure array lets the pattern pointer retreat while the text pointer always advances.

    # 1) Construct the failure array
    failure = get_failure_array(pattern)

    # 2) Step through text searching for pattern
    i, j = 0, 0  # index into text, pattern
    while i < len(text):
        if pattern[j] == text[i]:
            if j == (len(pattern) - 1):
                return i - j
            j += 1

        # if this is a prefix in our pattern
        # just go back far enough to continue
        elif j > 0:
            j = failure[j - 1]
            continue
        i += 1
    return -1
2 / 7

Rabin-Karp: a rolling hash avoids recomputing the window hash from scratch

strings/rabin_karp.py:24

Rabin-Karp computes a polynomial hash for the pattern and rolls it across the text window in constant time per step using modular arithmetic.

Rabin-Karp's insight is that comparing two hashes takes O(1), so if the pattern and window hashes match the algorithm only then performs the O(m) character comparison to confirm. The initial hash treats each character's ASCII value as a digit in a base-256 number, taken modulo a large prime to prevent overflow. When the window slides right by one character, the rolling hash formula removes the contribution of the leftmost character, shifts all remaining digits left by multiplying by the base, and adds the new rightmost character. The precomputed modulus_power holds the positional weight of the leftmost character so it can be subtracted efficiently. Hash collisions are possible, which is why the code confirms with a full string comparison when hashes match. For a multi-pattern extension, see the deeper tour.

Key takeaway

The rolling hash removes the outgoing character's weight and adds the incoming character in O(1), avoiding a full rehash of the window on every slide.

    p_len = len(pattern)
    t_len = len(text)
    if p_len > t_len:
        return False

    p_hash = 0
    text_hash = 0
    modulus_power = 1

    # Calculating the hash of pattern and substring of text
    for i in range(p_len):
        p_hash = (ord(pattern[i]) + p_hash * alphabet_size) % modulus
        text_hash = (ord(text[i]) + text_hash * alphabet_size) % modulus
        if i == p_len - 1:
            continue
        modulus_power = (modulus_power * alphabet_size) % modulus

    for i in range(t_len - p_len + 1):
        if text_hash == p_hash and text[i : i + p_len] == pattern:
            return True
        if i == t_len - p_len:
            continue
        # Calculate the https://en.wikipedia.org/wiki/Rolling_hash
        text_hash = (
            (text_hash - ord(text[i]) * modulus_power) * alphabet_size
            + ord(text[i + p_len])
        ) % modulus
    return False
3 / 7

Boyer-Moore: scan the pattern right-to-left, shift on the bad character

strings/boyer_moore_search.py:57

Boyer-Moore's bad-character heuristic compares pattern and text from the right end, then shifts the pattern so the mismatched text character aligns with its rightmost occurrence in the pattern.

Naive and KMP search both scan the pattern left-to-right. Boyer-Moore inverts this: it compares pattern characters from right to left. Scanning right-to-left matters because a mismatch at the last character of the pattern reveals a text character that must either be absent from the pattern entirely, allowing a full-pattern-length shift, or whose rightmost occurrence in the pattern determines a smaller shift. The mismatch_in_text method implements this right-to-left scan: it starts at the last pattern index and walks backward, returning the text index of the first mismatch. A return value of -1 means the full pattern aligned without mismatches. The bad_character_heuristic method in the same class uses this to compute how far to shift the window at each position. Boyer-Moore achieves O(n/m) best-case performance on texts with large alphabets because a single comparison can skip the entire pattern length.

Key takeaway

Scanning the pattern from right to left means a single mismatch can justify shifting the window by as much as the full pattern length.

    def mismatch_in_text(self, current_pos: int) -> int:
        """
        Find the index of mis-matched character in text when compared with pattern
        from last.

        Parameters :
            current_pos (int): current index position of text

        Returns :
            i (int): index of mismatched char from last in text
            -1 (int): if there is no mismatch between pattern and text block

        >>> bms = BoyerMooreSearch(text="ABAABA", pattern="AB")
        >>> bms.mismatch_in_text(2)
        3
        """

        for i in range(self.patLen - 1, -1, -1):
            if self.pattern[i] != self.text[current_pos + i]:
                return current_pos + i
        return -1
4 / 7

Manacher: reuse prior palindrome lengths to avoid redundant expansion

strings/manacher.py:18

Manacher transforms the input by inserting separator characters, then for each center reuses the mirror position's palindrome length to skip known expansions.

Naive longest palindromic substring expands from every center, costing O(n squared). Manacher achieves O(n) by reusing previously computed palindrome lengths. The first transformation inserts | separators between every character, converting "aba" to "a|b|a". This means every palindrome in the transformed string has an odd length and a well-defined center, eliminating the separate even/odd center cases. The key reuse logic is on line 40: for center j inside the current rightmost palindrome interval, the palindrome length at j is initialized from its mirror position left + right - j rather than from 1. The while loop then only expands beyond what is already known. When a palindrome extends past right, the left and right boundaries are updated. This ensures each character is compared at most twice across the entire algorithm.

Key takeaway

Manacher's separator trick unifies even and odd palindromes; the mirror initialization skips comparisons for positions already inside a known palindromic interval.

    max_length = 0

    # if input_string is "aba" than new_input_string become "a|b|a"
    new_input_string = ""
    output_string = ""

    # append each character + "|" in new_string for range(0, length-1)
    for i in input_string[: len(input_string) - 1]:
        new_input_string += i + "|"
    # append last character
    new_input_string += input_string[-1]

    # we will store the starting and ending of previous furthest ending palindromic
    # substring
    left, right = 0, 0

    # length[i] shows the length of palindromic substring with center i
    length = [1 for i in range(len(new_input_string))]

    # for each character in new_string find corresponding palindromic string
    start = 0
    for j in range(len(new_input_string)):
        k = 1 if j > right else min(length[left + right - j] // 2, right - j + 1)
        while (
            j - k >= 0
            and j + k < len(new_input_string)
            and new_input_string[k + j] == new_input_string[j - k]
        ):
            k += 1

        length[j] = 2 * k - 1
5 / 7

Z-function: an interval tracks the rightmost matching prefix

strings/z_function.py:13

The Z-function computes for each index the length of the longest prefix-matching substring starting there, using an interval to avoid redundant comparisons.

The Z-function is the foundation for pattern search when concatenated with a separator: computing Z on pattern + "$" + text finds all positions where the pattern appears as a prefix-matching substring. The algorithm maintains an interval [left_pointer, right_pointer] that spans the rightmost Z-match found so far. For a new index i inside this interval, the Z-value at the mirror position i - left_pointer gives a lower bound without re-comparing characters already inside the interval. The while loop in go_next then extends beyond the interval boundary if possible. When the extension pushes past right_pointer, the interval advances. The doctest shows that in "abracadabra", index 7 has Z-value 4, reflecting the four-character prefix "abra" that starts at position 7. The Z-function is closely related to Manacher: both maintain a rightmost interval to reuse previously computed information.

Key takeaway

The Z-interval [left, right] lets each new position initialize from its mirror's Z-value, reducing total comparisons to O(n) across the full array.

def z_function(input_str: str) -> list[int]:
    """
    For the given string this function computes value for each index,
    which represents the maximal length substring starting from the index
    and is the same as the prefix of the same size

    e.x.  for string 'abab' for second index value would be 2

    For the value of the first element the algorithm always returns 0

    >>> z_function("abracadabra")
    [0, 0, 0, 1, 0, 1, 0, 4, 0, 0, 1]
    >>> z_function("aaaa")
    [0, 3, 2, 1]
    >>> z_function("zxxzxxz")
    [0, 0, 0, 4, 0, 0, 1]
    """
    z_result = [0 for i in range(len(input_str))]

    # initialize interval's left pointer and right pointer
    left_pointer, right_pointer = 0, 0

    for i in range(1, len(input_str)):
        # case when current index is inside the interval
        if i <= right_pointer:
            min_edge = min(right_pointer - i + 1, z_result[i - left_pointer])
            z_result[i] = min_edge

        while go_next(i, z_result, input_str):
            z_result[i] += 1

        # if new index's result gives us more right interval,
        # we've to update left_pointer and right_pointer
        if i + z_result[i] - 1 > right_pointer:
            left_pointer, right_pointer = i, i + z_result[i] - 1

    return z_result
6 / 7

Levenshtein distance: a rolling row replaces the full 2D table

strings/levenshtein_distance.py:4

Levenshtein distance fills a DP table row by row using only the previous row and one accumulating current row, so memory is O(n) instead of O(mn).

Edit distance and Levenshtein distance describe the same problem: the minimum number of single-character insertions, deletions, or substitutions to turn one string into another. The classic DP formulation stores a full m-by-n matrix, but each row depends only on the row above, so the matrix can be replaced with two 1D arrays. previous_row is initialized to [0, 1, 2, ..., n], representing insertion costs from an empty prefix. For each character c1 in the first word, current_row starts with i + 1 (the deletion cost) and fills rightward. Substitution cost is 0 when characters match, 1 when they differ. After all rows, the last element of previous_row is the full edit distance. The file also contains a variant that modifies rows in place rather than allocating a new list each iteration.

Key takeaway

Replacing the full 2D matrix with two rolling 1D arrays cuts memory from O(mn) to O(n) while leaving the time complexity unchanged.

def levenshtein_distance(first_word: str, second_word: str) -> int:
    """
    Implementation of the Levenshtein distance in Python.
    :param first_word: the first word to measure the difference.
    :param second_word: the second word to measure the difference.
    :return: the levenshtein distance between the two words.
    Examples:
    >>> levenshtein_distance("planet", "planetary")
    3
    >>> levenshtein_distance("", "test")
    4
    >>> levenshtein_distance("book", "back")
    2
    >>> levenshtein_distance("book", "book")
    0
    >>> levenshtein_distance("test", "")
    4
    >>> levenshtein_distance("", "")
    0
    >>> levenshtein_distance("orchestration", "container")
    10
    """
    # The longer word should come first
    if len(first_word) < len(second_word):
        return levenshtein_distance(second_word, first_word)

    if len(second_word) == 0:
        return len(first_word)

    previous_row = list(range(len(second_word) + 1))

    for i, c1 in enumerate(first_word):
        current_row = [i + 1]

        for j, c2 in enumerate(second_word):
            # Calculate insertions, deletions, and substitutions
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)

            # Get the minimum to append to the current row
            current_row.append(min(insertions, deletions, substitutions))

        # Store the previous row
        previous_row = current_row

    # Returns the last element (distance)
    return previous_row[-1]
7 / 7

Aho-Corasick: fail transitions let the automaton fall back without restarting

strings/aho_corasick.py:67

Aho-Corasick augments a keyword trie with fail transitions so that a mismatch falls back to the longest proper suffix of the current state that is also a prefix of some keyword.

KMP and Rabin-Karp search for one pattern at a time. Aho-Corasick searches for multiple patterns simultaneously in a single O(n + total_output) pass over the text. The automaton is a trie of all keywords, augmented with fail transitions built by the set_fail_transitions method using a BFS over the trie. During search, the automaton advances through text characters. When find_next_state returns None, the algorithm follows fail_state links until either a valid transition is found or the root (state 0) is reached. Each output list attached to a state holds all keywords that end at that state, including those inherited through fail links. The doctest shows four keywords found across one pass of 27 characters, with "er" appearing at four positions. This makes Aho-Corasick the standard choice for virus scanning, log analysis, and any problem requiring simultaneous multi-pattern search.

Key takeaway

Aho-Corasick's fail transitions turn a trie into a finite automaton: each text character advances the state, and fail links handle mismatches without restarting from the root.

    def search_in(self, string: str) -> dict[str, list[int]]:
        """
        >>> A = Automaton(["what", "hat", "ver", "er"])
        >>> A.search_in("whatever, err ... , wherever")
        {'what': [0], 'hat': [1], 'ver': [5, 25], 'er': [6, 10, 22, 26]}
        """
        result: dict = {}  # returns a dict with keywords and list of its occurrences
        current_state = 0
        for i in range(len(string)):
            while (
                self.find_next_state(current_state, string[i]) is None
                and current_state != 0
            ):
                current_state = self.adlist[current_state]["fail_state"]
            next_state = self.find_next_state(current_state, string[i])
            if next_state is None:
                current_state = 0
            else:
                current_state = next_state
                for key in self.adlist[current_state]["output"]:
                    if key not in result:
                        result[key] = []
                    result[key].append(i - len(key) + 1)
        return result
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free