The Algorithms - Python Python TheAlgorithms/Python

How Edit Distance Works

Count the minimum insertions, deletions, and substitutions to transform one string into another. The 1965 algorithm behind spell-check, fuzzy search, and DNA alignment.

6 stops ~18 min Verified 2026-05-05
What you will learn
  • Why edit distance is the minimum number of single-character operations, not a measure of similarity
  • How the EditDistance class stores word1, word2, and the dp memoization table on the instance
  • How the private top-down method uses recursive memoization with a -1-sentinel table
  • How the bottom-up method fills an (m+1) by (n+1) table iteratively with the same recurrence
  • Why the base cases handle empty strings: converting n characters from nothing costs n insertions
  • How three doctests pin both implementations to the same expected outputs
  • Where Hamming distance and Damerau-Levenshtein differ from standard Levenshtein edit distance
Prerequisites
  • Comfort reading Python classes and private methods
  • Familiarity with recursion and memoization
  • No prior knowledge of dynamic programming required
The story behind edit distance

Vladimir Levenshtein, a Soviet mathematician at the Keldysh Institute of Applied Mathematics in Moscow, published the edit-distance problem in 1965 in a paper translated into English as "Binary codes capable of correcting deletions, insertions, and reversals." His formulation counted insertions, deletions, and substitutions as the three allowed operations, and the minimum number of them became the Levenshtein distance.

The algorithm reached Western computer science as spell-checking became a practical problem in the 1970s. Fred Damerau extended it in 1964 to include transposition of adjacent characters, producing what is now called the Damerau-Levenshtein distance, which handles common typing errors like teh for the. Hamming distance, defined by Richard Hamming in 1950, is the restricted special case where only substitutions are allowed and both strings must be the same length; it is the edit distance metric used in error-correcting codes.

Modern applications span every text-processing domain. Search engines use edit distance to generate query suggestions. Bioinformatics uses it to align DNA sequences where insertions and deletions are biological events. Git's merge driver uses a variant to merge diverged text files. Autocorrect on phones applies a weighted Levenshtein distance where transpositions and adjacent-key substitutions are cheaper than arbitrary ones.

1 / 6

Module Docstring: The Problem Definition

dynamic_programming/edit_distance.py:1

The module docstring names the author, the date, and the three permitted operations: removal, insertion, and substitution.

The module docstring does two things the contribution checklist requires: it credits the author with a name and date, and it states the problem precisely before any code. The problem statement on lines 8 through 10 names all three permitted operations. That precision matters because different flavors of edit distance allow different operation sets. Hamming distance allows only substitutions. Damerau-Levenshtein adds transposition of adjacent characters. This implementation follows the Levenshtein definition strictly: insertion, deletion, substitution, each at unit cost. A reader who starts with the module docstring can verify later that the recurrence in the private method on line 26 implements exactly that three-operation cost model.

Key takeaway

The problem statement names all three permitted operations. That list is the spec. Every line of the DP recurrence below must account for exactly those three and nothing else.

"""
Author  : Turfa Auliarachman
Date    : October 12, 2016

This is a pure Python implementation of Dynamic Programming solution to the edit
distance problem.

The problem is :
Given two strings A and B. Find the minimum number of operations to string B such that
A = B. The permitted operations are removal,  insertion, and substitution.
"""
2 / 6

The EditDistance Class and Instance State

dynamic_programming/edit_distance.py:14

The class stores word1, word2, and dp on the instance so both the top-down and bottom-up methods share the same state without parameter passing.

The class docstring's usage example is terse and prescriptive: instantiate once, call solve with two strings. That design choice is instructive. The class stores word1, word2, and dp as instance attributes rather than function locals so that the private recursive method __min_dist_top_down_dp can read and write the memoization table without receiving it as a parameter. The alternative would be a standalone function with a default-argument dict, which Python's functional DP files often use. Choosing a class here keeps the recursive helper's signature simple: two integer indices m and n. The cost is that calling min_dist_top_down after min_dist_bottom_up on the same instance would overwrite the dp table, so the caller must be aware.

Key takeaway

Instance state for word1, word2, and dp lets the private recursive helper take only two integer indices. The trade-off is that each public method reinitializes dp on entry.

class EditDistance:
    """
    Use :
    solver              = EditDistance()
    editDistanceResult  = solver.solve(firstString, secondString)
    """

    def __init__(self):
        self.word1 = ""
        self.word2 = ""
        self.dp = []
3 / 6

The Private Top-Down Recursive Method

dynamic_programming/edit_distance.py:26

The private method checks two base cases (either index is -1), reads the memo table, and otherwise computes insert, delete, and replace costs recursively.

The indices m and n are character positions in word1 and word2 respectively, starting from the last character and counting down. Base cases on lines 27 through 30 handle the empty-prefix situation: if m is -1, the first string is exhausted and we need n + 1 insertions to produce the remainder of the second. The memo check on line 31 returns a previously computed result immediately. When characters match on line 34, no operation is needed and the cost is the cost of the remaining prefixes. When they differ, three recursive calls compute the cost of each allowed operation: advance only n for insert, advance only m for delete, advance both for replace. The minimum plus one is stored in the table on line 40.

Key takeaway

Base cases use -1 sentinel indices. A memo check at line 31 prevents exponential recomputation. Matching characters skip an operation; mismatches take one plus the min of three recursive costs.

    def __min_dist_top_down_dp(self, m: int, n: int) -> int:
        if m == -1:
            return n + 1
        elif n == -1:
            return m + 1
        elif self.dp[m][n] > -1:
            return self.dp[m][n]
        else:
            if self.word1[m] == self.word2[n]:
                self.dp[m][n] = self.__min_dist_top_down_dp(m - 1, n - 1)
            else:
                insert = self.__min_dist_top_down_dp(m, n - 1)
                delete = self.__min_dist_top_down_dp(m - 1, n)
                replace = self.__min_dist_top_down_dp(m - 1, n - 1)
                self.dp[m][n] = 1 + min(insert, delete, replace)

            return self.dp[m][n]
4 / 6

The Public Top-Down Method Initializes the Memo Table

dynamic_programming/edit_distance.py:44

min_dist_top_down sets word1, word2, and dp to a -1-filled grid, then calls the private recursive helper with the last indices.

Three doctests pin the top-down implementation. ("intention", "execution") has edit distance 5, a standard textbook example: substitute i to e, substitute n to x, insert c, substitute t to u, substitute n to n (no op), substitute i to i (no op) and delete one more n. The empty second string has distance 9 because converting a 9-character string to the empty string costs 9 deletions. Two empty strings cost 0. Line 55 builds the memo table as a two-dimensional list filled with -1 sentinels, one for each pair of character positions. Line 57 starts the recursion at the last indices of both strings, which is the natural top-down entry point for a suffix-based recursion.

Key takeaway

The three doctests are the standard Levenshtein examples. A -1-filled grid is the memo table; calling the private method with (len-1, len-1) starts the recursion at the final characters.

    def min_dist_top_down(self, word1: str, word2: str) -> int:
        """
        >>> EditDistance().min_dist_top_down("intention", "execution")
        5
        >>> EditDistance().min_dist_top_down("intention", "")
        9
        >>> EditDistance().min_dist_top_down("", "")
        0
        """
        self.word1 = word1
        self.word2 = word2
        self.dp = [[-1 for _ in range(len(word2))] for _ in range(len(word1))]

        return self.__min_dist_top_down_dp(len(word1) - 1, len(word2) - 1)
5 / 6

The Bottom-Up Method: Same Recurrence, No Recursion

dynamic_programming/edit_distance.py:59

min_dist_bottom_up builds the table iteratively, row by row. Row 0 and column 0 encode the base cases as integers rather than sentinel recursion.

The bottom-up method computes the same three doctests as the top-down version because both implement the same Levenshtein recurrence. The structural difference is initialization. The top-down table uses -1 as a memo sentinel and base cases live in the recursive function. The bottom-up table on line 72 is (m+1) by (n+1) with an extra row and column for empty-prefix base cases, filled explicitly in the loop. Row 0 (i == 0) gets value j: converting an empty string to a j-character string requires j insertions. Column 0 (j == 0) gets value i: deleting all i characters from the first string to reach empty. The three-way min at lines 83 through 86 is the recurrence, identical in logic to the recursive version. Reading self.dp[m][n] at the end gives the answer without a traceback because only the length is returned, not the sequence of operations.

Key takeaway

Bottom-up initialization fills row 0 with j and column 0 with i to encode the empty-prefix costs. The inner recurrence matches the top-down version; only the evaluation order differs.

    def min_dist_bottom_up(self, word1: str, word2: str) -> int:
        """
        >>> EditDistance().min_dist_bottom_up("intention", "execution")
        5
        >>> EditDistance().min_dist_bottom_up("intention", "")
        9
        >>> EditDistance().min_dist_bottom_up("", "")
        0
        """
        self.word1 = word1
        self.word2 = word2
        m = len(word1)
        n = len(word2)
        self.dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]

        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0:  # first string is empty
                    self.dp[i][j] = j
                elif j == 0:  # second string is empty
                    self.dp[i][j] = i
                elif word1[i - 1] == word2[j - 1]:  # last characters are equal
                    self.dp[i][j] = self.dp[i - 1][j - 1]
                else:
                    insert = self.dp[i][j - 1]
                    delete = self.dp[i - 1][j]
                    replace = self.dp[i - 1][j - 1]
                    self.dp[i][j] = 1 + min(insert, delete, replace)
        return self.dp[m][n]
6 / 6

The __main__ Block: Interactive Driver with Both Solvers

dynamic_programming/edit_distance.py:90

The __main__ block collects two strings from stdin and prints the edit distance from both the top-down and bottom-up methods, proving they agree.

The interactive driver is also a consistency check. A single EditDistance instance is created on line 91, both input strings are collected with .strip() on lines 96 and 97, and then both methods are called on lines 100 and 101. When both methods are working correctly they must produce identical output for any pair of strings, because they compute the same mathematical quantity by different means. A user who types the same two strings repeatedly and sees both lines print the same number has informal confirmation that neither implementation is broken. The separation of top-down and bottom-up in a single class is the pedagogical payoff of the class design: a learner can compare the two approaches side by side, run both, and verify the answers match.

Key takeaway

Printing both methods for the same input is the clearest possible test that top-down and bottom-up agree. Any divergence is a bug in one of the two implementations.

if __name__ == "__main__":
    solver = EditDistance()

    print("****************** Testing Edit Distance DP Algorithm ******************")
    print()

    S1 = input("Enter the first string: ").strip()
    S2 = input("Enter the second string: ").strip()

    print()
    print(f"The minimum edit distance is: {solver.min_dist_top_down(S1, S2)}")
    print(f"The minimum edit distance is: {solver.min_dist_bottom_up(S1, S2)}")
    print()
    print("*************** End of Testing Edit Distance DP Algorithm ***************")
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free