How a Bloom Filter Works
A bitarray, two hash functions, and a membership check that never gives false-negatives. Burton Bloom's 1970 data structure still in production databases.
What you will learn
- Why a Bloom filter trades false-positives for extreme space efficiency over a Python set
- How two hash functions map a string to bit positions in the bitarray
- Why add uses bitwise OR and exists uses bitwise AND to enforce one-way set semantics
- How the doctest traces Titanic, Avatar, and Ratatouille through the 8-bit array
- Why false-positive probability grows with the fill rate of the bitarray
- How estimated_error_rate computes (set_bits / size) squared for two hash functions
- Why standard Bloom filters cannot delete elements and when counting filters relax this
Prerequisites
- Familiarity with Python bitwise operators and integer bit representations
- Basic understanding of hash functions as deterministic black boxes
- No prior knowledge of probabilistic data structures required
The story behind the Bloom filter
Burton Bloom introduced the Bloom filter in a 1970 paper in Communications of the ACM titled "Space/time trade-offs in hash coding with allowable errors." The problem he was solving was simple: given a large set of items stored on disk, how do you avoid a slow disk lookup for items that definitely are not in the set? A perfect in-memory set would answer instantly, but the sets Bloom was targeting were far too large to hold in RAM.
His insight was that you do not need certainty in both directions. If you accept a small probability of false-positives (saying an item might be present when it is not), you can represent membership in a bit array orders of magnitude smaller than the original set. False-negatives are never allowed.
The structure remains in active production use. BigTable, Cassandra, and Redis (via the RedisBloom module) all use Bloom filters to skip disk reads for absent keys. Space complexity is O(m) for the bit array; insert and query each cost O(k) where k is the number of hash functions. Deletion requires a counting variant that replaces each bit with a counter.
Module Docstring: A Live Walkthrough of the Filter in Action
data_structures/hashing/bloom_filter.py:1The module docstring is the tutorial: it builds an 8-bit filter, adds films, demonstrates a false-positive, and shows the error rate growing with occupancy.
The module docstring is the data structure's manual, written as a runnable doctest. Starting from an all-zero 8-bit array, it adds two films and then asks whether a third film appears in the filter. "Titanic" sets bits at positions 1 and 2 (reading right-to-left in the bitstring), so the string becomes '01100000'. "Avatar" happens to map both its hash functions to the same position, so only one additional bit changes. Then comes the key demonstration: "Ratatouille" was never added, yet "Ratatouille" in bloom returns True because its two hash positions are both already set by previous insertions. That is a false-positive. The final lines show estimated_error_rate rising from 0.14 to 0.25 as a fourth element fills more bits, making collisions more likely.
The doctest is the best explanation of a false-positive: Ratatouille was never added, yet its two hash positions were already occupied by earlier insertions.
"""
See https://en.wikipedia.org/wiki/Bloom_filter
The use of this data structure is to test membership in a set.
Compared to Python's built-in set() it is more space-efficient.
In the following example, only 8 bits of memory will be used:
>>> bloom = Bloom(size=8)
Initially, the filter contains all zeros:
>>> bloom.bitstring
'00000000'
When an element is added, two bits are set to 1
since there are 2 hash functions in this implementation:
>>> "Titanic" in bloom
False
>>> bloom.add("Titanic")
>>> bloom.bitstring
'01100000'
>>> "Titanic" in bloom
True
However, sometimes only one bit is added
because both hash functions return the same value
>>> bloom.add("Avatar")
>>> "Avatar" in bloom
True
>>> bloom.format_hash("Avatar")
'00000100'
>>> bloom.bitstring
'01100100'
Not added elements should return False ...
>>> not_present_films = ("The Godfather", "Interstellar", "Parasite", "Pulp Fiction")
>>> {
... film: bloom.format_hash(film) for film in not_present_films
... } # doctest: +NORMALIZE_WHITESPACE
{'The Godfather': '00000101',
'Interstellar': '00000011',
'Parasite': '00010010',
'Pulp Fiction': '10000100'}
>>> any(film in bloom for film in not_present_films)
False
but sometimes there are false positives:
>>> "Ratatouille" in bloom
True
>>> bloom.format_hash("Ratatouille")
'01100000'
The probability increases with the number of elements added.
The probability decreases with the number of bits in the bitarray.
>>> bloom.estimated_error_rate
0.140625
>>> bloom.add("The Godfather")
>>> bloom.estimated_error_rate
0.25
>>> bloom.bitstring
'01100101'
"""Class Structure: Bitarray as a Single Integer
data_structures/hashing/bloom_filter.py:62The filter stores its state as a single Python integer. Bitwise OR sets bits on insert; bitwise AND checks bits on query. No list, no array.
The entire filter state fits in a single Python integer initialized to zero. Inserting an element calls hash_ to produce a bitmask, then sets the corresponding bits with self.bitarray |= h. Querying an element checks whether all the bits in the element's bitmask are already set: (h & self.bitarray) == h is true only if every bit in h appears in self.bitarray. Because bits are never cleared, a positive result may come from earlier insertions of different elements, which is exactly the false-positive mechanism. Using a Python integer rather than a list of bits means the entire state is one memory object; Python's arbitrary-precision integers grow automatically as size increases without any allocation arithmetic from the caller.
The filter is a single integer. OR sets bits on insert; AND-equality checks all bits on query. Bits never clear, which is why false-positives are possible.
from hashlib import md5, sha256
HASH_FUNCTIONS = (sha256, md5)
class Bloom:
def __init__(self, size: int = 8) -> None:
self.bitarray = 0b0
self.size = size
def add(self, value: str) -> None:
h = self.hash_(value)
self.bitarray |= h
def exists(self, value: str) -> bool:
h = self.hash_(value)
return (h & self.bitarray) == h
def __contains__(self, other: str) -> bool:
return self.exists(other)The Hash Function: Two Digests, Two Bit Positions
data_structures/hashing/bloom_filter.py:91hash_ runs sha256 and md5 over the value, reduces each digest modulo size, and sets the corresponding bit in the result integer via 2**position.
The quality of a Bloom filter depends on its hash functions producing independent, uniformly distributed bit positions. This implementation uses SHA-256 and MD5, which are cryptographic digests and therefore highly independent. For each function, the 32-byte digest is interpreted as a little-endian integer and reduced modulo size to produce a position between 0 and size minus 1. The bit at that position is then set in the result integer using 2**position. Because SHA-256 and MD5 are independent functions, they map the same input to different positions most of the time. When they happen to collide, only one bit changes instead of two, which the Avatar doctest demonstrates explicitly: format_hash("Avatar") returns a mask with only one bit set even though the loop runs twice.
Two independent hash functions produce two bit positions. When they collide (as with Avatar), only one bit sets. Position is digest modulo size.
def hash_(self, value: str) -> int:
res = 0b0
for func in HASH_FUNCTIONS:
position = (
int.from_bytes(func(value.encode()).digest(), "little") % self.size
)
res |= 2**position
return res
def format_hash(self, value: str) -> str:
return self.format_bin(self.hash_(value))Bitstring Formatting: Integer to Zero-Padded Binary String
data_structures/hashing/bloom_filter.py:83format_bin converts the integer bitarray to a binary string padded to size characters. zfill ensures leading zeros appear when the high bits are unset.
Python's bin() function returns a string like '0b110' for the integer 6. Slicing off the first two characters gives the raw binary digits, but that string only has as many characters as the highest set bit requires. For an 8-bit filter that is only half full, bin(6)[2:] returns '110', not '00000110'. The zfill(self.size) call pads the left side with zeros to the declared filter size, so the bitstring always shows all positions. This matters for the doctest, which asserts specific fixed-width strings like '01100000' that show exactly which of the eight positions are occupied. The bitstring property wraps this into a readable attribute so doctests and callers never work with raw integers.
zfill pads the binary string to the declared filter size. Without it, leading zeros disappear and the doctest strings do not match.
def format_bin(self, bitarray: int) -> str:
res = bin(bitarray)[2:]
return res.zfill(self.size)
@property
def bitstring(self) -> str:
return self.format_bin(self.bitarray)Error Rate Estimation: Fill Ratio Raised to the Power of k
data_structures/hashing/bloom_filter.py:103estimated_error_rate computes (set_bits / size) squared. It rises as more elements fill the bitarray and falls as the bitarray grows larger.
The false-positive probability for a Bloom filter with k hash functions is approximately the fill fraction raised to the power of k. The property computes this directly: count the 1-bits in the current bitarray with Python's string count("1") method, divide by size to get the fill fraction, and raise it to len(HASH_FUNCTIONS). In the doctest, after adding Titanic and Avatar, 3 of 8 bits are set, giving (3/8) squared = 0.140625. After adding The Godfather, 4 bits are set, giving (4/8) squared = 0.25. The formula reveals the two design levers: increasing size lowers the fill fraction, and using more hash functions raises the exponent. Real-world implementations use the optimal sizing formula to choose the bit array length for a target false-positive rate before any elements are added.
Error rate is (set_bits / total_bits) raised to the number of hash functions. More bits lower it; more elements raise it.
@property
def estimated_error_rate(self) -> float:
n_ones = bin(self.bitarray).count("1")
return (n_ones / self.size) ** len(HASH_FUNCTIONS)Why Deletion Is Impossible Without a Counting Variant
data_structures/hashing/bloom_filter.py:72add uses bitwise OR, which is irreversible: clearing a bit to delete an element would also delete other elements whose hash hit the same position.
The OR operation in add is the source of the filter's space efficiency and its deletion limitation in equal measure. When "Titanic" is added, bits 1 and 2 are set. When "Ratatouille" happens to hash to the same positions, those bits are already set. If you could remove "Titanic" by clearing its bits, you would simultaneously invalidate "Ratatouille" if it had been added. Since multiple elements share bit positions, a standard Bloom filter cannot tell which elements contributed to each bit: OR combines contributions from all of them without attribution. Counting Bloom filters solve this by replacing each bit with a small counter, incrementing on add and decrementing on delete, at the cost of roughly four times the space. The Redis BLOOM module supports both variants. This implementation is the space-minimal standard form.
Bits cannot be cleared because multiple elements share positions. Deletion requires a counting variant that replaces each bit with a counter.
def add(self, value: str) -> None:
h = self.hash_(value)
self.bitarray |= h
def exists(self, value: str) -> bool:
h = self.hash_(value)
return (h & self.bitarray) == hYou've walked through 6 key areas of the The Algorithms - Python codebase.
Continue: How an LRU Cache Works → Browse all projectsCreate code tours for your project
Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.
Install Intraview Free