How MD5 Works
Four 32-bit state words, 64 rounds, a sin-derived constant table. The 1991 algorithm that still shows up in checksums everywhere.
What you will learn
- Why MD5 processes messages in 512-bit blocks and what padding achieves
- How little-endian byte ordering affects both input conversion and digest output
- Why 64 constants are derived from the sine function and what they provide
- How four rounds of four different bitwise functions mix the 128-bit state
- Why left rotation after each round makes the digest sensitive to bit position
- How reformat_hex converts each 32-bit word to an 8-character little-endian hex string
- Why MD5 is collision-broken and what that means for non-security uses
Prerequisites
- Familiarity with bitwise AND, OR, XOR, and NOT operations
- Basic understanding of hexadecimal and byte representations
- No prior knowledge of cryptographic hash functions required
The story behind MD5
Ronald Rivest designed MD5 in 1991 as a strengthened successor to MD4, publishing it as RFC 1321. The name stands for Message Digest algorithm 5. It produces a 128-bit digest from an input of any length by breaking the message into 512-bit blocks and running each block through four rounds of bitwise mixing on a 128-bit state.
MD5 was widely adopted through the 1990s for password storage, digital signatures, and software distribution checksums. The security assumptions held until 2004, when Xiaoyun Wang and Hongbo Yu demonstrated a collision attack that could find two distinct inputs producing the same digest in under an hour on a desktop machine. By 2007, chosen-prefix collisions were feasible, meaning an attacker could craft two documents with the same MD5 hash. Certificate authorities stopped accepting MD5-signed certificates, and password databases migrated to bcrypt and Argon2.
MD5 persists today for non-security purposes. Package managers use it to verify download integrity against accidental corruption, not active attackers. This implementation is educational: it follows RFC 1321 exactly, uses no external crypto libraries, and passes its own cross-check against Python's hashlib.md5.
Module Docstring: What MD5 Is and What It Cannot Do
hashes/md5.py:1The module docstring names the algorithm, describes the 512-bit block structure, and flags that MD5 is cracked and unsuitable for security.
The module docstring front-loads the two facts a reader needs before touching any code. First, the structure: a message is padded to a multiple of 512 bits, split into blocks, and each block is used to update a 128-bit state through 64 operations. Second, the limitation: MD5 was broken and must not be used for passwords, signatures, or any other security purpose. The note about little-endian ordering on line 6 flags a consequence that appears throughout the implementation: every integer the algorithm produces is stored and output with the least-significant byte first, which is why the to_little_endian and reformat_hex helpers exist. Reading this docstring before diving into the 64-round loop saves significant confusion later.
The docstring declares both the algorithm structure (512-bit blocks, 64 operations, 128-bit state) and the security constraint (cracked, non-security only).
"""
The MD5 algorithm is a hash function that's commonly used as a checksum to
detect data corruption. The algorithm works by processing a given message in
blocks of 512 bits, padding the message as needed. It uses the blocks to operate
a 128-bit state and performs a total of 64 such operations. Note that all values
are little-endian, so inputs are converted as needed.
Although MD5 was used as a cryptographic hash function in the past, it's since
been cracked, so it shouldn't be used for security purposes.
For more info, see https://en.wikipedia.org/wiki/MD5
"""Little-Endian Conversion: The Byte-Order Foundation
hashes/md5.py:18MD5 stores all values little-endian. to_little_endian reverses a 32-byte string in 8-byte groups to match RFC 1321 byte ordering.
Byte ordering matters because MD5 was designed to run on little-endian processors that store multi-byte integers with the least-significant byte at the lowest address. Every 32-bit word the algorithm produces must be serialized that way before it becomes part of the digest. to_little_endian handles this by treating its 32-byte input as four 8-character chunks and reversing their order: the iteration list [3, 2, 1, 0] on line 41 walks the chunks from last to first. The doctest confirms this concretely: the final 8 bytes of the input (pqrstuvw) appear first in the output. This function is called during padding and again when converting the four state words into the final hex digest. Without it, the output would be a valid hash of the bits but in big-endian order, and it would not match any other MD5 implementation.
MD5 is little-endian throughout. This helper reverses 8-character groups so every 32-bit word leaves the algorithm with its least-significant byte first.
def to_little_endian(string_32: bytes) -> bytes:
"""
Converts the given string to little-endian in groups of 8 chars.
Arguments:
string_32 {[string]} -- [32-char string]
Raises:
ValueError -- [input is not 32 char]
Returns:
32-char little-endian string
>>> to_little_endian(b'1234567890abcdfghijklmnopqrstuvw')
b'pqrstuvwhijklmno90abcdfg12345678'
>>> to_little_endian(b'1234567890')
Traceback (most recent call last):
...
ValueError: Input must be of length 32
"""
if len(string_32) != 32:
raise ValueError("Input must be of length 32")
little_endian = b""
for i in [3, 2, 1, 0]:
little_endian += string_32[8 * i : 8 * i + 8]
return little_endianPreprocessing: Padding a Message to a Multiple of 512 Bits
hashes/md5.py:90Padding appends a 1 bit, then zeros until the bit string is 448 mod 512, then appends the original length as a 64-bit little-endian integer.
MD5 cannot process arbitrary-length messages directly because its block size is fixed at 512 bits. Padding solves this by stretching the message to the next 512-bit boundary, but the padding is not arbitrary zeros: RFC 1321 requires appending a single 1 bit, then zeros until the length is 448 mod 512, then the original message length as a 64-bit little-endian integer in the final 64 bits of the last block. The doctest for the single character "a" makes this concrete: the character occupies 8 bits, the 1 bit and 439 zero bits fill to 448, and the 64-bit length field completes the 512-bit block. The length encoding uses to_little_endian twice, once on the high 32 bits and once on the low 32 bits. The empty-string case shows the padding alone fills a full 512-bit block.
The padding scheme is deterministic and length-encoded. Appending the original length prevents length-extension attacks against messages with shared prefixes.
def preprocess(message: bytes) -> bytes:
"""
Preprocesses the message string:
- Convert message to bit string
- Pad bit string to a multiple of 512 chars:
- Append a 1
- Append 0's until length = 448 (mod 512)
- Append length of original message (64 chars)
Example: Suppose the input is the following:
message = "a"
The message bit string is "01100001", which is 8 bits long. Thus, the
bit string needs 439 bits of padding so that
(bit_string + "1" + padding) = 448 (mod 512).
The message length is "000010000...0" in 64-bit little-endian binary.
The combined bit string is then 512 bits long.
Arguments:
message {[string]} -- [message string]
Returns:
processed bit string padded to a multiple of 512 chars
>>> preprocess(b"a") == (b"01100001" + b"1" +
... (b"0" * 439) + b"00001000" + (b"0" * 56))
True
>>> preprocess(b"") == b"1" + (b"0" * 447) + (b"0" * 64)
True
"""
bit_string = b""
for char in message:
bit_string += format(char, "08b").encode("utf-8")
start_len = format(len(bit_string), "064b").encode("utf-8")
# Pad bit_string to a multiple of 512 chars
bit_string += b"1"
while len(bit_string) % 512 != 448:
bit_string += b"0"
bit_string += to_little_endian(start_len[32:]) + to_little_endian(start_len[:32])
return bit_stringState Initialization and the Sine-Derived Constant Table
hashes/md5.py:324Four 32-bit state words start at RFC 1321 magic values. Sixty-four constants derived from the sine function provide non-linearity with no hidden structure.
Before the first block is processed, the implementation establishes two tables. The four state words a0, b0, c0, d0 are the magic constants from RFC 1321, chosen so that the initial state has no obvious algebraic structure that could bias the output. The 64 added_consts are computed as the floor of 2^32 times the absolute sine of i+1 for i from 0 to 63. The sine function was chosen because its values are transcendental and verifiable by anyone with a calculator, making it a nothing-up-my-sleeve number: there is no hidden trapdoor in the constant derivation. Each round uses a different constant so that identical input words in different positions produce different contributions to the state after mixing. A single list comprehension on line 327 computes all 64 at startup.
State words start at RFC 1321 magic constants. The 64 sine-derived values are verifiable by anyone and contain no hidden structure or trapdoor.
# Convert to bit string, add padding and append message length
bit_string = preprocess(message)
added_consts = [int(2**32 * abs(sin(i + 1))) for i in range(64)]
# Starting states
a0 = 0x67452301
b0 = 0xEFCDAB89
c0 = 0x98BADCFE
d0 = 0x10325476The Four-Round Mixing Loop
hashes/md5.py:410Sixty-four rounds organized into four groups of sixteen apply a different nonlinear bitwise function and message-word index formula to mix the 128-bit state.
The 64-round loop is the core of MD5. Each round applies one of four bitwise functions to the current state words, then folds in a constant and a message word before rotating the result into the state. Rounds 0 to 15 use F = d ^ (b & (c ^ d)), an algebraic simplification of (b AND c) OR (NOT b AND d). Rounds 16 to 31 use a different combination, rounds 32 to 47 use XOR of all three words, and rounds 48 to 63 use a fourth function. Each group of 16 also uses a different formula for g, which selects which of the 16 message words to incorporate, so the four rounds consume each message word exactly four times in different orders. The final four lines rotate the state words: b receives the mixed value and the others shift down one position.
Four groups of 16 rounds apply a different nonlinear function and message-word permutation. Every message word contributes to the state exactly four times across the full 64 rounds.
for i in range(64):
if i <= 15:
# f = (b & c) | (not_32(b) & d) # Alternate definition for f
f = d ^ (b & (c ^ d))
g = i
elif i <= 31:
# f = (d & b) | (not_32(d) & c) # Alternate definition for f
f = c ^ (d & (b ^ c))
g = (5 * i + 1) % 16
elif i <= 47:
f = b ^ c ^ d
g = (3 * i + 5) % 16
else:
f = c ^ (b | not_32(d))
g = (7 * i) % 16
f = (f + a + added_consts[i] + block_words[g]) % 2**32
a = d
d = c
c = b
b = sum_32(b, left_rotate_32(f, shift_amounts[i]))Digest Assembly: reformat_hex Serializes the Final State
hashes/md5.py:297md5_me returns a 32-character bytes object. reformat_hex converts each 32-bit state word to an 8-character little-endian hex string, truncating to 8 digits.
The public entry point md5_me names its three well-known test vectors directly: the empty string, the standard quick-brown-fox phrase, and the same phrase with a trailing period. Those two phrases differ by one character yet produce entirely different digests, which demonstrates the avalanche effect in action. The fourth doctest cross-checks this implementation against Python's standard library hashlib.md5 across four inputs including a UTF-8 multibyte string, so any deviation from RFC 1321 fails CI immediately. The digest is assembled after the block loop by concatenating four calls to reformat_hex, one per state word, each of which formats the integer as an 8-character little-endian hex string by reversing byte pairs, producing the 32-character output that tools like md5sum return.
Three named test vectors pin the output contract. A fourth cross-checks every output against hashlib.md5, so any RFC deviation fails CI automatically.
def md5_me(message: bytes) -> bytes:
"""
Returns the 32-char MD5 hash of a given message.
Reference: https://en.wikipedia.org/wiki/MD5#Algorithm
Arguments:
message {[string]} -- [message]
Returns:
32-char MD5 hash string
>>> md5_me(b"")
b'd41d8cd98f00b204e9800998ecf8427e'
>>> md5_me(b"The quick brown fox jumps over the lazy dog")
b'9e107d9d372bb6826bd81d3542a419d6'
>>> md5_me(b"The quick brown fox jumps over the lazy dog.")
b'e4d909c290d0fb1ca068ffaddf22cbd0'
>>> import hashlib
>>> from string import ascii_letters
>>> msgs = [b"", ascii_letters.encode("utf-8"), "Üñîçø∂é".encode("utf-8"),
... b"The quick brown fox jumps over the lazy dog."]
>>> all(md5_me(msg) == hashlib.md5(msg).hexdigest().encode("utf-8") for msg in msgs)
True
"""You've walked through 6 key areas of the The Algorithms - Python codebase.
Continue: How the Knuth-Morris-Pratt Algorithm Works → Browse all projectsCreate code tours for your project
Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.
Install Intraview Free