The Algorithms - Python Python TheAlgorithms/Python

How Longest Common Subsequence Works

A 2-D DP table, one recurrence, and a traceback walk. The algorithm at the core of diff, git-merge, and DNA sequence alignment.

6 stops ~18 min Verified 2026-05-05
What you will learn
  • Why LCS differs from longest-common-substring: subsequence elements need not be contiguous
  • How the 2-D dp table encodes subproblem answers so each cell costs O(1) to fill
  • What the three-way max recurrence means: skip from x, skip from y, or extend a match
  • How the traceback walk reconstructs the actual sequence from the filled table
  • Why the table is monotonically non-decreasing along both axes
  • How eleven doctests cover empty strings, identical strings, interleaved matches, and no-match cases
  • Where the O(mn) time and O(mn) space come from and how space can be reduced to O(min(m,n))
Prerequisites
  • Comfort reading nested Python loops
  • Familiarity with two-dimensional lists
  • No prior knowledge of dynamic programming required
The story behind LCS

The Hunt-McIlroy paper of 1976, published in ACM Transactions on Programming Languages and Systems, introduced the LCS algorithm as the engine for the Unix diff utility. James Hunt and Douglas McIlroy needed a way to compare two files line by line and report the minimum set of changes. The longest common subsequence of the two line sequences is exactly the unchanged core, and the lines not in the LCS are the insertions and deletions diff prints.

The biological case came earlier. Saul Needleman and Christian Wunsch published their sequence-alignment algorithm in 1970 in the Journal of Molecular Biology, and the Needleman-Wunsch recurrence is LCS with an affine gap penalty. Modern bioinformatics tools for aligning DNA and protein sequences are direct descendants. Git's three-way merge also uses LCS: when two branches both modify the same file, the merge driver finds the longest common subsequence of the shared ancestor's lines and the two branch heads, then stitches them together.

LCS is not the same as longest-common-substring. Substrings must be contiguous; subsequences need only preserve order. The string ace is a subsequence of abcde but not a substring, and the two problems require different recurrences.

1 / 6

Module Docstring: The Problem Statement

dynamic_programming/longest_common_subsequence.py:1

The module docstring defines what a subsequence is and gives a concrete example before any code appears.

Before the function signature, the file tells you exactly what problem it solves and where the definition of subsequence is non-obvious. A subsequence preserves order but does not require contiguity, which is what separates it from a substring. The docstring's example pins that distinction: "abc" and "abg" both appear in "abcdefgh" in order, but neither is a contiguous block of characters. That distinction matters for the recurrence: a substring algorithm resets when characters diverge, while an LCS algorithm skips mismatches and keeps searching. The module docstring is the contribution checklist's required source annotation; here it serves double duty as the problem definition a learner reads before running any code.

Key takeaway

Subsequences preserve order but not contiguity. That single distinction separates LCS from longest-common-substring and drives the entire recurrence design.

"""
LCS Problem Statement: Given two sequences, find the length of longest subsequence
present in both of them.  A subsequence is a sequence that appears in the same relative
order, but not necessarily continuous.
Example:"abc", "abg" are subsequences of "abcdefgh".
"""
2 / 6

Function Signature and Eleven Doctests

dynamic_programming/longest_common_subsequence.py:9

The function returns a tuple of (length, subsequence string). Eleven doctests cover empty inputs, identical strings, interleaved matches, and the no-match case.

The return type is a tuple of (int, str): the length of the longest subsequence and the subsequence itself. Returning the actual string, not only its length, is what requires the traceback walk later. Eleven doctests pin the contract across every boundary condition a reviewer might question. Empty strings on either or both sides return (0, ''). Identical strings return the string itself with length equal to the input length. The interleaved case ("abcdef", "ace") returns (3, 'ace'), proving the algorithm skips non-matching characters without resetting. The case ("ABCD", "ACBD") returning (3, 'ABD') shows that when two valid LCS sequences have the same length, the traceback picks one deterministically. Each doctest runs on every push via pytest --doctest-modules.

Key takeaway

Returning (length, subsequence) rather than length alone forces the implementation to include a traceback. Eleven doctests cover every boundary: empty, identical, no-match, interleaved.

def longest_common_subsequence(x: str, y: str):
    """
    Finds the longest common subsequence between two strings. Also returns the
    The subsequence found

    Parameters
    ----------

    x: str, one of the strings
    y: str, the other string

    Returns
    -------
    L[m][n]: int, the length of the longest subsequence. Also equal to len(seq)
    Seq: str, the subsequence found

    >>> longest_common_subsequence("programming", "gaming")
    (6, 'gaming')
    >>> longest_common_subsequence("physics", "smartphone")
    (2, 'ph')
    >>> longest_common_subsequence("computer", "food")
    (1, 'o')
    >>> longest_common_subsequence("", "abc")  # One string is empty
    (0, '')
    >>> longest_common_subsequence("abc", "")  # Other string is empty
    (0, '')
    >>> longest_common_subsequence("", "")  # Both strings are empty
    (0, '')
    >>> longest_common_subsequence("abc", "def")  # No common subsequence
    (0, '')
    >>> longest_common_subsequence("abc", "abc")  # Identical strings
    (3, 'abc')
    >>> longest_common_subsequence("a", "a")  # Single character match
    (1, 'a')
    >>> longest_common_subsequence("a", "b")  # Single character no match
    (0, '')
    >>> longest_common_subsequence("abcdef", "ace")  # Interleaved subsequence
    (3, 'ace')
    >>> longest_common_subsequence("ABCD", "ACBD")  # No repeated characters
    (3, 'ABD')
    """
3 / 6

Table Initialization and the Recurrence

dynamic_programming/longest_common_subsequence.py:50

A (m+1) by (n+1) table of zeros serves as the DP table. The double loop fills each cell using three possible predecessors.

The DP table on line 59 is (m+1) by (n+1) and filled with zeros. The extra row and column act as sentinel base cases: any subsequence against an empty string has length zero, which the zero-initialized borders already encode. The double loop then fills the table one cell at a time. For cell (i, j), the algorithm computes whether the current characters match (match = 1 or 0) and then takes the maximum of three predecessors: skip a character from x (dp[i-1][j]), skip from y (dp[i][j-1]), or extend the diagonal if there is a match (dp[i-1][j-1] + match). Because each cell depends only on cells already computed, the row-major traversal is correct. The table is monotonically non-decreasing along both axes, which is what guarantees the traceback will always find a valid path.

Key takeaway

The three-way max is the recurrence: skip from x, skip from y, or extend a diagonal match. The zero-bordered table encodes all base cases without explicit boundary checks.

    # find the length of strings

    assert x is not None
    assert y is not None

    m = len(x)
    n = len(y)

    # declaring the array for storing the dp values
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0

            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + match)
4 / 6

The Traceback: Walking the Table in Reverse

dynamic_programming/longest_common_subsequence.py:67

Starting from the bottom-right corner, the traceback recovers the actual subsequence by following the path that built each cell value.

Filling the table gives the length; recovering the sequence requires a second pass. The traceback starts at (m, n), the bottom-right corner, and walks backward toward (0, 0). At each cell, it checks which predecessor produced the current value. If the diagonal predecessor plus the match flag equals the current cell and the characters matched, this position is part of the subsequence, so x[i-1] is prepended to seq and both pointers step back. If the upper cell matches, only i steps back. Otherwise only j steps back. Prepending rather than appending builds the sequence in the correct left-to-right order as the traceback moves right-to-left. When either pointer reaches zero, the traceback is done and dp[m][n] is the length.

Key takeaway

Traceback starts at (m, n) and works backward. Prepending matched characters builds the sequence in correct order as the walk moves from bottom-right to top-left.

    seq = ""
    i, j = m, n
    while i > 0 and j > 0:
        match = 1 if x[i - 1] == y[j - 1] else 0

        if dp[i][j] == dp[i - 1][j - 1] + match:
            if match == 1:
                seq = x[i - 1] + seq
            i -= 1
            j -= 1
        elif dp[i][j] == dp[i - 1][j]:
            i -= 1
        else:
            j -= 1

    return dp[m][n], seq
5 / 6

The __main__ Block: A Named Example with Expected Output

dynamic_programming/longest_common_subsequence.py:85

The __main__ block uses AGGTAB and GXTXAYB, a textbook LCS example, and asserts the expected length and subsequence before printing.

The __main__ block uses the pair ("AGGTAB", "GXTXAYB"), which is the worked example from Cormen, Leiserson, Rivest, and Stein's Introduction to Algorithms. The expected LCS is "GTAB" with length 4, both stored in named variables so a reader comparing against the textbook sees the values explicitly rather than buried in a function call. The block also calls doctest.testmod() after printing, which runs all eleven doctests from the function docstring on every direct invocation. This means running python3 dynamic_programming/longest_common_subsequence.py both prints the CLRS example and validates the contract. The category tour for dynamic programming at /explore/TheAlgorithms/Python/tours/category-dp covers the other files in this directory.

Key takeaway

The AGGTAB-GXTXAYB pair is the CLRS worked example. Running the file directly both demonstrates the algorithm and validates all eleven doctests.

if __name__ == "__main__":
    a = "AGGTAB"
    b = "GXTXAYB"
    expected_ln = 4
    expected_subseq = "GTAB"

    ln, subseq = longest_common_subsequence(a, b)
    print("len =", ln, ", sub-sequence =", subseq)
    import doctest

    doctest.testmod()
6 / 6

Complexity and What the Table Looks Like

dynamic_programming/longest_common_subsequence.py:9

O(mn) time to fill the table, O(mn) space to store it. Reducing space to O(min(m,n)) requires keeping only two rows at once.

The function fills every cell of a (m+1) by (n+1) table once, so time complexity is O(mn). The table itself occupies O(mn) space. For strings of length 1,000 each, that is a million cells. In bioinformatics, where sequences can be tens of thousands of characters long, that space cost is prohibitive. The standard optimization is to observe that the fill loop only reads the previous row, so you can keep only two rows in memory at O(min(m, n)) space. The traceback then requires either storing the full table or re-running the fill. This implementation retains the full table to make the traceback direct and readable, which matches the educational intent. The docstring return annotation uses the name L[m][n], the standard textbook notation for the bottom-right cell of the LCS table.

Key takeaway

Time is O(mn) and space is O(mn). The space drops to O(min(m,n)) if you keep only two rows, but then traceback requires a second pass or a separate pointer structure.

def longest_common_subsequence(x: str, y: str):
    """
    Finds the longest common subsequence between two strings. Also returns the
    The subsequence found

    Parameters
    ----------

    x: str, one of the strings
    y: str, the other string

    Returns
    -------
    L[m][n]: int, the length of the longest subsequence. Also equal to len(seq)
    Seq: str, the subsequence found

    >>> longest_common_subsequence("programming", "gaming")
    (6, 'gaming')
    >>> longest_common_subsequence("physics", "smartphone")
    (2, 'ph')
    >>> longest_common_subsequence("computer", "food")
    (1, 'o')
    >>> longest_common_subsequence("", "abc")  # One string is empty
    (0, '')
    >>> longest_common_subsequence("abc", "")  # Other string is empty
    (0, '')
    >>> longest_common_subsequence("", "")  # Both strings are empty
    (0, '')
    >>> longest_common_subsequence("abc", "def")  # No common subsequence
    (0, '')
    >>> longest_common_subsequence("abc", "abc")  # Identical strings
    (3, 'abc')
    >>> longest_common_subsequence("a", "a")  # Single character match
    (1, 'a')
    >>> longest_common_subsequence("a", "b")  # Single character no match
    (0, '')
    >>> longest_common_subsequence("abcdef", "ace")  # Interleaved subsequence
    (3, 'ace')
    >>> longest_common_subsequence("ABCD", "ACBD")  # No repeated characters
    (3, 'ABD')
    """
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free