Redis Data Structures & Encoding: How Values Live in Memory
From raw bytes to skip lists: the memory layouts that make Redis fast and compact
What you will learn
- Why Redis defines five different SDS header structs and how the `sds = char*` trick lets them share a single typedef
- How the two-table dict design with `rehashidx` spreads the cost of resizing across subsequent reads and writes
- How `intset` stores a sorted integer-only set in the minimum number of bytes and upgrades its element width on demand
- How listpack packs heterogeneous small values into a single contiguous byte array with no per-element allocation
- How quicklist decides when a listpack node has grown large enough to split, using both byte-size and entry-count limits
- What the `OBJ_ENCODING_*` constants are and which config thresholds trigger an encoding promotion to a full hash table or skip list
Prerequisites
- Comfortable reading C structs and pointer arithmetic (you don't need to write it, just follow the layout)
- Basic understanding of Redis key-value types: what a list, set, hash, and sorted set are
- Familiarity with the idea of memory alignment and struct packing is helpful but not required
SDS String Layout
src/sds.h:24Five header variants, one shared typedef, header at a negative offset from the sds pointer
sds is typedef char *sds, so the value points directly into character data, not into the header. SDS_HDR(T,s) recovers the header by subtracting sizeof(struct sdshdrT) from the sds pointer. The flags byte is always at s[-1], so a type check costs exactly one memory access regardless of which variant is in use.
The five structs differ only in the integer width of len and alloc:
sdshdr5: length packed into the high five bits offlags; noallocfield; never actually allocated, present only to document the layoutsdshdr8:uint8_t len/alloc; covers strings up to 255 bytessdshdr16:uint16_t len/alloc; covers strings up to 64 KBsdshdr32:uint32_t len/alloc; covers strings up to 4 GBsdshdr64:uint64_t len/alloc; covers the rest
__attribute__((__packed__)) strips struct padding on all five, making the layout byte-for-byte predictable. The allocation path in _sdsnewlen() (src/sds.c line 98) calls sdsReqType(initlen) to pick the minimal variant, writes the header, and returns a pointer to buf.
An sds pointer is a valid C string. The header sits at a negative offset and is invisible to callers unless they go through an SDS library function.
---
typedef char *sds;
/* Note: sdshdr5 is never used, we just access the flags byte directly.
* However is here to document the layout of type 5 SDS strings. */
struct __attribute__ ((__packed__)) sdshdr5 {
unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
uint8_t len; /* used */
uint8_t alloc; /* excluding the header and null terminator */
unsigned char flags; /* 3 lsb of type, 5 unused bits */
char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
uint16_t len; /* used */
uint16_t alloc; /* excluding the header and null terminator */
unsigned char flags; /* 3 lsb of type, 5 unused bits */
char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
uint32_t len; /* used */
uint32_t alloc; /* excluding the header and null terminator */
unsigned char flags; /* 3 lsb of type, 5 unused bits */
char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
uint64_t len; /* used */
uint64_t alloc; /* excluding the header and null terminator */
unsigned char flags; /* 3 lsb of type, 5 unused bits */
char buf[];
};
#define SDS_TYPE_5 0dict + Incremental Rehash
src/dict.h:143The two-table design and how rehashing is spread across normal operations
int dictRehash(dict *d, int n) {
int empty_visits = n*10; /* Max number of empty buckets to visit. */
unsigned long s0 = DICTHT_SIZE(d->ht_size_exp[0]);
unsigned long s1 = DICTHT_SIZE(d->ht_size_exp[1]);
if (dict_can_resize == DICT_RESIZE_FORBID || !dictIsRehashing(d)) return 0;
if (dict_can_resize == DICT_RESIZE_AVOID &&
((s1 > s0 && s1 < dict_force_resize_ratio * s0) ||
(s1 < s0 && s0 < HASHTABLE_MIN_FILL * dict_force_resize_ratio * s1)))
{
return 0;
}
while(n-- && d->ht_used[0] != 0) {
assert(DICTHT_SIZE(d->ht_size_exp[0]) > (unsigned long)d->rehashidx);
while(d->ht_table[0][d->rehashidx] == NULL) {
d->rehashidx++;
if (--empty_visits == 0) return 1;
}
/* Move all the keys in this bucket from the old to the new hash HT */
rehashEntriesInBucketAtIndex(d, d->rehashidx);
d->rehashidx++;
}
return !dictCheckRehashingCompleted(d);
}
static void _dictRehashStep(dict *d) {
if (d->pauserehash == 0) dictRehash(d,1);
}ht_table[0] is the live table; ht_table[1] is the resize target. Under normal operation ht_table[1] is NULL and rehashidx is -1. When a load-factor threshold is crossed, dictExpand() allocates ht_table[1] and sets rehashidx = 0. From that point, every lookup, insert, and delete calls _dictRehashStep() as a side effect, moving exactly one bucket. The empty_visits cap (ten times the requested steps) prevents unbounded time on a sparse table.
DICT_RESIZE_AVOID is how Redis cooperates with fork-based persistence. During BGSAVE or BGREWRITEAOF, copy-on-write means any page the parent writes must be duplicated for the child. Redis skips rehash steps unless load exceeds dict_force_resize_ratio, which keeps the child's memory footprint from ballooning. The ht_size_exp fields store exponents, so the actual bucket count is always 1 << ht_size_exp[i], the power-of-two required for bitwise modulo in hash lookup.
Both tables stay consistent throughout a resize because work is distributed: rehashidx advances by one bucket per normal operation, so there is no stop-the-world resize step.
---
struct dict {
dictType *type;
dictEntry **ht_table[2];
unsigned long ht_used[2];
long rehashidx; /* rehashing not in progress if rehashidx == -1 */
/* Note: pauserehash is a full unsigned so iterator increments
* don't perform RMW on the same storage unit as other bitfields. */
unsigned pauserehash; /* If >0 rehashing is paused */
/* Keep small vars at end for optimal (minimal) struct padding */
signed char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */
int16_t pauseAutoResize; /* If >0 automatic resizing is disallowed (<0 indicates coding error) */
void *metadata[];
int dictRehash(dict *d, int n) {
int empty_visits = n*10; /* Max number of empty buckets to visit. */
unsigned long s0 = DICTHT_SIZE(d->ht_size_exp[0]);
unsigned long s1 = DICTHT_SIZE(d->ht_size_exp[1]);
if (dict_can_resize == DICT_RESIZE_FORBID || !dictIsRehashing(d)) return 0;
/* If dict_can_resize is DICT_RESIZE_AVOID, we want to avoid rehashing.
* - If expanding, the threshold is dict_force_resize_ratio which is 4.
* - If shrinking, the threshold is 1 / (HASHTABLE_MIN_FILL * dict_force_resize_ratio) which is 1/32. */
if (dict_can_resize == DICT_RESIZE_AVOID &&
((s1 > s0 && s1 < dict_force_resize_ratio * s0) ||
(s1 < s0 && s0 < HASHTABLE_MIN_FILL * dict_force_resize_ratio * s1)))
{
return 0;
}
while(n-- && d->ht_used[0] != 0) {
/* Note that rehashidx can't overflow as we are sure there are more
* elements because ht[0].used != 0 */
assert(DICTHT_SIZE(d->ht_size_exp[0]) > (unsigned long)d->rehashidx);
while(d->ht_table[0][d->rehashidx] == NULL) {
d->rehashidx++;
if (--empty_visits == 0) return 1;
}
/* Move all the keys in this bucket from the old to the new hash HT */
rehashEntriesInBucketAtIndex(d, d->rehashidx);
d->rehashidx++;
}
return !dictCheckRehashingCompleted(d);
}
long long timeInMilliseconds(void) {
struct timeval tv;
gettimeofday(&tv,NULL);
return (((long long)tv.tv_sec)*1000)+(tv.tv_usec/1000);
}
/* Rehash in us+"delta" microseconds. The value of "delta" is larger
* than 0, and is smaller than 1000 in most cases. The exact upper bound
* depends on the running time of dictRehash(d,100).*/
int dictRehashMicroseconds(dict *d, uint64_t us) {
if (d->pauserehash > 0) return 0;
monotime timer;
elapsedStart(&timer);
int rehashes = 0;
while(dictRehash(d,100)) {
rehashes += 100;
if (elapsedUs(timer) >= us) break;
}
return rehashes;
}
/* This function performs just a step of rehashing, and only if hashing has
* not been paused for our hash table. When we have iterators in the
* middle of a rehashing we can't mess with the two hash tables otherwise
* some elements can be missed or duplicated.
*
* This function is called by common lookup or update operations in the
* dictionary so that the hash table automatically migrates from H1 to H2
* while it is actively used. */
static void _dictRehashStep(dict *d) {
if (d->pauserehash == 0) dictRehash(d,1);
}
/* Performs rehashing on a single bucket. */
int _dictBucketRehash(dict *d, uint64_t idx) {intset
src/intset.c:165Packed sorted-integer set and the encoding-upgrade path
The intset struct (defined in src/intset.h) holds a uint32_t encoding, a uint32_t length, and a flexible int8_t contents[] array. The encoding is the byte-width of one element: INTSET_ENC_INT16 (2 bytes), INTSET_ENC_INT32 (4 bytes), or INTSET_ENC_INT64 (8 bytes). That makes contents a tightly packed sorted array with no gaps or per-element headers. intsetSearch runs binary search directly on this array, so SISMEMBER is O(log N) with no heap allocation per element.
An intset can only grow its element width, never shrink. When intsetAdd receives a value outside the current encoding range, it calls intsetUpgradeAndAdd, which:
- Sets the new encoding and resizes the allocation.
- Re-reads every element using the old encoding width (
_intsetGetEncodedwithcurenc). - Writes each element back at the new width, traversing back-to-front to avoid overwriting source data.
- Handles a negative new value with the
prependflag, placing it at index 0.
A single out-of-range integer triggers a full re-encoding of the entire set. The tradeoff is acceptable: upgrades are rare, and even after a re-encoding, intset is far more compact than a hash table.
intset pays one re-encoding cost when a value exceeds the current element width. After that, every element is the same fixed width, the array stays sorted, and SISMEMBER stays at O(log N) with no per-element allocation.
---
static intset *intsetUpgradeAndAdd(intset *is, int64_t value) {
uint8_t curenc = intrev32ifbe(is->encoding);
uint8_t newenc = _intsetValueEncoding(value);
int length = intrev32ifbe(is->length);
int prepend = value < 0 ? 1 : 0;
/* First set new encoding and resize */
is->encoding = intrev32ifbe(newenc);
is = intsetResize(is,intrev32ifbe(is->length)+1);
/* Upgrade back-to-front so we don't overwrite values.
* Note that the "prepend" variable is used to make sure we have an empty
* space at either the beginning or the end of the intset. */
while(length--)
_intsetSet(is,length+prepend,_intsetGetEncoded(is,length,curenc));
/* Set the value at the beginning or the end. */
if (prepend)
_intsetSet(is,0,value);
else
_intsetSet(is,intrev32ifbe(is->length),value);
is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
return is;
}
static void intsetMoveTail(intset *is, uint32_t from, uint32_t to) {
void *src, *dst;
uint32_t bytes = intrev32ifbe(is->length)-from;
uint32_t encoding = intrev32ifbe(is->encoding);
if (encoding == INTSET_ENC_INT64) {
src = (int64_t*)is->contents+from;
dst = (int64_t*)is->contents+to;
bytes *= sizeof(int64_t);
} else if (encoding == INTSET_ENC_INT32) {
src = (int32_t*)is->contents+from;
dst = (int32_t*)is->contents+to;
bytes *= sizeof(int32_t);
} else {
src = (int16_t*)is->contents+from;
dst = (int16_t*)is->contents+to;
bytes *= sizeof(int16_t);
}
memmove(dst,src,bytes);
}
/* Insert an integer in the intset */
intset *intsetAdd(intset *is, int64_t value, uint8_t *success) {
uint8_t valenc = _intsetValueEncoding(value);
uint32_t pos;
if (success) *success = 1;
/* Upgrade encoding if necessary. If we need to upgrade, we know that
* this value should be either appended (if > 0) or prepended (if < 0),
* because it lies outside the range of existing values. */
if (valenc > intrev32ifbe(is->encoding)) {
/* This always succeeds, so we don't need to curry *success. */
return intsetUpgradeAndAdd(is,value);
} else {
/* Abort if the value is already present in the set.
* This call will populate "pos" with the right position to insert
* the value when it cannot be found. */
if (intsetSearch(is,value,&pos)) {
if (success) *success = 0;
return is;
}
is = intsetResize(is,intrev32ifbe(is->length)+1);
if (pos < intrev32ifbe(is->length)) intsetMoveTail(is,pos,pos+1);
}
_intsetSet(is,pos,value);
is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
return is;
}
/* Delete integer from intset */
intset *intsetRemove(intset *is, int64_t value, int *success) {
uint8_t valenc = _intsetValueEncoding(value);listpack
src/listpack.c:27Single-allocation byte array for small collections, replacing ziplist's cascade-update-prone prevlen with a per-element backlen
unsigned char *lpNew(size_t capacity) {
unsigned char *lp = lp_malloc(capacity > LP_HDR_SIZE+1 ? capacity : LP_HDR_SIZE+1);
if (lp == NULL) return NULL;
lpSetTotalBytes(lp,LP_HDR_SIZE+1);
lpSetNumElements(lp,0);
lp[LP_HDR_SIZE] = LP_EOF;
return lp;
}
static inline void lpEncodeIntegerGetType(int64_t v, unsigned char *intenc, uint64_t *enclen) {
if (v >= 0 && v <= 127) {
/* Single byte 0-127 integer. */
if (intenc != NULL) intenc[0] = v;
if (enclen != NULL) *enclen = 1;
} else if (v >= -4096 && v <= 4095) {
/* 13 bit integer. */
if (v < 0) v = ((int64_t)1<<13)+v;
if (intenc != NULL) {
intenc[0] = (v>>8)|LP_ENCODING_13BIT_INT;
intenc[1] = v&0xff;
}
if (enclen != NULL) *enclen = 2;
} else if (v >= -32768 && v <= 32767) {
/* 16 bit integer. */
if (v < 0) v = ((int64_t)1<<16)+v;
if (intenc != NULL) {
intenc[0] = LP_ENCODING_16BIT_INT;
intenc[1] = v&0xff;
intenc[2] = v>>8;
}
if (enclen != NULL) *enclen = 3;
}
/* ... 24-bit, 32-bit, 64-bit cases follow the same pattern */
}A listpack is one contiguous allocation. The 6-byte header holds a 32-bit total byte length and a 16-bit element count. When the count reaches LP_HDR_NUMELE_UNKNOWN (65535) the count field is abandoned and a full scan is required. The final byte is always LP_EOF (0xFF).
Each element encodes three things in sequence: an encoding/data prefix, optional raw data bytes, and a backlen field that records the current element's own byte length for reverse traversal. The encoding prefix doubles as data for small integers: a value 0-127 fits in one byte with the high bit clear, so the full element is just two bytes (data + one-byte backlen). Larger values use 13-bit, 16-bit, 24-bit, 32-bit, or 64-bit integer forms, or string forms with a 6-bit to 32-bit length prefix.
lpNew() shows how cheap an empty listpack is: allocate LP_HDR_SIZE + 1 bytes (7 minimum), write the header, write the EOF byte. No hash table, no linked-list nodes, no per-element malloc.
Listpack has no random access; scanning always starts from the front. The payoff is zero per-element allocation and no pointer indirection anywhere in the structure. ---
#define LP_HDR_SIZE 6 /* 32 bit total len + 16 bit number of elements. */
#define LP_HDR_NUMELE_UNKNOWN UINT16_MAX
#define LP_MAX_INT_ENCODING_LEN 9
#define LP_MAX_BACKLEN_SIZE 5
#define LP_ENCODING_INT 0
unsigned char *lpNew(size_t capacity) {
unsigned char *lp = lp_malloc(capacity > LP_HDR_SIZE+1 ? capacity : LP_HDR_SIZE+1);
if (lp == NULL) return NULL;
lpSetTotalBytes(lp,LP_HDR_SIZE+1);
lpSetNumElements(lp,0);
lp[LP_HDR_SIZE] = LP_EOF;
return lp;
}
/* Free the specified listpack. */
void lpFree(unsigned char *lp) {
lp_free(lp);
}
/* Shrink the memory to fit. */
unsigned char* lpShrinkToFit(unsigned char *lp) {
size_t size = lpGetTotalBytes(lp);
if (size < lp_malloc_size(lp)) {
return lp_realloc(lp, size);
} else {
return lp;
}
}
/* Stores the integer encoded representation of 'v' in the 'intenc' buffer. */
static inline void lpEncodeIntegerGetType(int64_t v, unsigned char *intenc, uint64_t *enclen) {
if (v >= 0 && v <= 127) {
/* Single byte 0-127 integer. */
if (intenc != NULL) intenc[0] = v;
if (enclen != NULL) *enclen = 1;
} else if (v >= -4096 && v <= 4095) {
/* 13 bit integer. */
if (v < 0) v = ((int64_t)1<<13)+v;
if (intenc != NULL) {
intenc[0] = (v>>8)|LP_ENCODING_13BIT_INT;
intenc[1] = v&0xff;
}
if (enclen != NULL) *enclen = 2;
} else if (v >= -32768 && v <= 32767) {
/* 16 bit integer. */
if (v < 0) v = ((int64_t)1<<16)+v;
if (intenc != NULL) {
intenc[0] = LP_ENCODING_16BIT_INT;
intenc[1] = v&0xff;
intenc[2] = v>>8;
}
if (enclen != NULL) *enclen = 3;
} else if (v >= -8388608 && v <= 8388607) {
/* 24 bit integer. */
if (v < 0) v = ((int64_t)1<<24)+v;
if (intenc != NULL) {
intenc[0] = LP_ENCODING_24BIT_INT;
intenc[1] = v&0xff;
intenc[2] = (v>>8)&0xff;
intenc[3] = v>>16;
}
if (enclen != NULL) *enclen = 4;
} else if (v >= -2147483648 && v <= 2147483647) {
/* 32 bit integer. */
if (v < 0) v = ((int64_t)1<<32)+v;
if (intenc != NULL) {
intenc[0] = LP_ENCODING_32BIT_INT;
intenc[1] = v&0xff;
intenc[2] = (v>>8)&0xff;
intenc[3] = (v>>16)&0xff;
intenc[4] = v>>24;
}
if (enclen != NULL) *enclen = 5;
} else {
/* 64 bit integer. */
uint64_t uv = v;
if (intenc != NULL) {quicklist
src/quicklist.c:49Linked list of listpack chunks and the size threshold that decides when to split a node
/* Calculate the size limit of the quicklist node based on negative 'fill'. */
static size_t quicklistNodeNegFillLimit(int fill) {
assert(fill < 0);
size_t offset = (-fill) - 1;
size_t max_level = sizeof(optimization_level) / sizeof(*optimization_level);
if (offset >= max_level) offset = max_level - 1;
return optimization_level[offset];
}
void quicklistNodeLimit(int fill, size_t *size, unsigned int *count) {
*size = SIZE_MAX;
*count = UINT_MAX;
if (fill >= 0) {
*count = (fill == 0) ? 1 : fill;
} else {
*size = quicklistNodeNegFillLimit(fill);
}
}
REDIS_STATIC int _quicklistNodeAllowInsert(const quicklistNode *node,
const int fill, const size_t sz) {
if (unlikely(!node))
return 0;
if (unlikely(QL_NODE_IS_PLAIN(node) || isLargeElement(sz, fill)))
return 0;
size_t new_sz = node->sz + sz + SIZE_ESTIMATE_OVERHEAD;
if (unlikely(quicklistNodeExceedsLimit(fill, new_sz, node->count + 1)))
return 0;
return 1;
}A quicklist is a doubly-linked list where each node holds a listpack. The fill value, set by list-max-listpack-size in redis.conf, flows from the quicklist struct into each node and controls when nodes split.
Negative fill values index into optimization_level as byte-size limits:
-1: 4 KB per node-2: 8 KB per node (default)-3: 16 KB per node-4: 32 KB per node-5: 64 KB per node
Positive fill values cap element count instead of bytes: fill = 128 means at most 128 entries per node. The -2 default of 8 KB is documented in the source as a practical bound that limits fragmentation without producing excessively large nodes.
_quicklistNodeAllowInsert() checks before every push. It estimates post-insert size as node->sz + sz + SIZE_ESTIMATE_OVERHEAD (8 bytes, deliberately conservative) and calls quicklistNodeExceedsLimit(). An element large enough to exceed the limit on its own goes into a dedicated PLAIN node, so it never forces surrounding nodes to split.
A quicklist node splits when its estimated byte size exceeds the fill threshold. Negative fill values select a byte-size limit from a fixed five-entry table; positive values cap entry count instead. ---
static const size_t optimization_level[] = {4096, 8192, 16384, 32768, 65536};
/* This is for test suite development purposes only, 0 means disabled. */
static size_t packed_threshold = 0;
/* set threshold for PLAIN nodes for test suit, the real limit is based on `fill` */
int quicklistSetPackedThreshold(size_t sz) {
static size_t quicklistNodeNegFillLimit(int fill) {
assert(fill < 0);
size_t offset = (-fill) - 1;
size_t max_level = sizeof(optimization_level) / sizeof(*optimization_level);
if (offset >= max_level) offset = max_level - 1;
return optimization_level[offset];
}
/* Calculate the size limit or length limit of the quicklist node
* based on 'fill', and is also used to limit list listpack. */
void quicklistNodeLimit(int fill, size_t *size, unsigned int *count) {
*size = SIZE_MAX;
*count = UINT_MAX;
if (fill >= 0) {
/* Ensure that one node have at least one entry */
*count = (fill == 0) ? 1 : fill;
} else {
*size = quicklistNodeNegFillLimit(fill);
}
}
/* Check if the limit of the quicklist node has been reached to determine if
* insertions, merges or other operations that would increase the size of
* the node can be performed.
* Return 1 if exceeds the limit, otherwise 0. */
int quicklistNodeExceedsLimit(int fill, size_t new_sz, unsigned int new_count) {
size_t sz_limit;
unsigned int count_limit;
quicklistNodeLimit(fill, &sz_limit, &count_limit);
if (likely(sz_limit != SIZE_MAX)) {
return new_sz > sz_limit;
} else if (count_limit != UINT_MAX) {
/* when we reach here we know that the limit is a size limit (which is
* safe, see comments next to optimization_level and SIZE_SAFETY_LIMIT) */
if (!sizeMeetsSafetyLimit(new_sz)) return 1;
return new_count > count_limit;
}
redis_unreachable();
}
/* Determines whether a given size qualifies as a large element based on a threshold
* determined by the 'fill'. If the size is considered large, it will be stored in
* a plain node. */
static int isLargeElement(size_t sz, int fill) {
if (unlikely(packed_threshold != 0)) return sz >= packed_threshold;
if (fill >= 0)
return !sizeMeetsSafetyLimit(sz);
else
return sz > quicklistNodeNegFillLimit(fill);
}
REDIS_STATIC int _quicklistNodeAllowInsert(const quicklistNode *node,
const int fill, const size_t sz) {
if (unlikely(!node))
return 0;
if (unlikely(QL_NODE_IS_PLAIN(node) || isLargeElement(sz, fill)))
return 0;
/* Estimate how many bytes will be added to the listpack by this one entry.
* We prefer an overestimation, which would at worse lead to a few bytes
* below the lowest limit of 4k (see optimization_level).
* Note: No need to check for overflow below since both `node->sz` and
* `sz` are to be less than 1GB after the plain/large element check above. */
size_t new_sz = node->sz + sz + SIZE_ESTIMATE_OVERHEAD;
if (unlikely(quicklistNodeExceedsLimit(fill, new_sz, node->count + 1)))
return 0;
return 1;
}
REDIS_STATIC int _quicklistNodeAllowMerge(const quicklistNode *a,
const quicklistNode *b,
const int fill) {
if (!a || !b)
return 0;ziplist to listpack migration
src/object.h:75The encoding constant history and the current encoding map per type
OBJ_ENCODING_ZIPLIST (5) and OBJ_ENCODING_ZIPMAP (3) are marked "No longer used" but their constants remain. RDB files on disk may still reference the old ziplist format, so rdb.c must still decode them. Removing the constants would break any operator reloading a pre-7.0 snapshot.
Listpack replaced ziplist in Redis 7.0. Both pack small collections into a contiguous byte array, but ziplist had a cascade-update problem: its prevlen field stored the previous entry's byte length in a variable-width encoding, so inserting near the start of the list could require rewriting every subsequent prevlen field. Listpack's backlen encodes only the current element's own length, which eliminates that cascade.
The current encoding map for small collections:
- Small hashes:
OBJ_ENCODING_LISTPACK(11), orOBJ_ENCODING_LISTPACK_EX(12) when per-field TTLs are active - Small sorted sets:
OBJ_ENCODING_LISTPACK(11) - Small lists:
OBJ_ENCODING_QUICKLIST(9) with listpack nodes - Small sets:
OBJ_ENCODING_INTSET(6) if all members are integers, otherwiseOBJ_ENCODING_LISTPACK(11)
OBJ_ENCODING_ZIPLIST (5) stays in the codebase because RDB compatibility requires it. OBJ_ENCODING_LISTPACK (11) replaced it for all active use by fixing the prevlen cascade-update problem with a simpler per-element backlen.
---
#define OBJ_ENCODING_RAW 0 /* Raw representation */
#define OBJ_ENCODING_INT 1 /* Encoded as integer */
#define OBJ_ENCODING_HT 2 /* Encoded as hash table */
#define OBJ_ENCODING_ZIPMAP 3 /* No longer used: old hash encoding. */
#define OBJ_ENCODING_LINKEDLIST 4 /* No longer used: old list encoding. */
#define OBJ_ENCODING_ZIPLIST 5 /* No longer used: old list/hash/zset encoding. */
#define OBJ_ENCODING_INTSET 6 /* Encoded as intset */
#define OBJ_ENCODING_SKIPLIST 7 /* Encoded as skiplist */
#define OBJ_ENCODING_EMBSTR 8 /* Embedded sds string encoding */
#define OBJ_ENCODING_QUICKLIST 9 /* Encoded as linked list of listpacks */
#define OBJ_ENCODING_STREAM 10 /* Encoded as a radix tree of listpacks */
#define OBJ_ENCODING_LISTPACK 11 /* Encoded as a listpack */
#define OBJ_ENCODING_LISTPACK_EX 12 /* Encoded as listpack, extended with metadata */Skiplist for ZSET
src/t_zset.c:254Level randomization and span counters that make ZRANK an O(log N) operation
Each zskiplistNode (defined in src/server.h lines 1742-1752) carries a score, a backward pointer, and a flexible level[] array. At levels above 0, each entry holds a forward pointer and a span (the count of level-0 nodes that a single hop at this level skips over). Level 0 is special: its span slot is repurposed to store a zskiplistNodeInfo struct that packs the node's level count and an SDS offset.
zslRandomLevel() uses a geometric distribution with ZSKIPLIST_P = 0.25, capped at ZSKIPLIST_MAXLEVEL = 32 (src/server.h line 634). Each node has a 25% chance of a second level, 6.25% of a third, which keeps expected pointer comparisons at O(log N).
During zslInsertNode, a top-down traversal fills update[i] (last node at level i with a score strictly below the new one) and rank[i] (accumulated level-0 hop count to that position). After splicing the node in, span arithmetic runs in the same pass: the new node's span at level i is update[i]'s old span minus (rank[0] - rank[i]), and update[i]'s span becomes (rank[0] - rank[i]) + 1. Every span is correct at insert time, which is what lets ZRANK sum spans top-down instead of counting nodes.
ZRANK is O(log N) because insertion maintains accurate span counts at every level. Each query sums spans during traversal rather than walking individual level-0 nodes.
---
static int zslRandomLevel(void) {
static const int threshold = ZSKIPLIST_P*RAND_MAX;
int level = 1;
while (random() < threshold)
level += 1;
return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}
/* Insert an already-created node, with its score, element set into the skiplist
* at the correct position. Updates all forward/backward pointers and spans.
* The node's level must already be set via zslSetNodeInfo(). */
static void zslInsertNode(zskiplist *zsl, zskiplistNode *node) {
zskiplistNode *update[ZSKIPLIST_MAXLEVEL]; /* Nodes that will point to the new node at each level */
unsigned long rank[ZSKIPLIST_MAXLEVEL]; /* Rank (0-based) at each level during traversal */
zskiplistNode *x;
int i, level;
double score = node->score;
sds ele = zslGetNodeElement(node);
level = zslGetNodeInfo(node)->levels;
serverAssert(!isnan(score));
/* Find the position where this node should be inserted */
x = zsl->header;
for (i = zsl->level-1; i >= 0; i--) {
/* store rank that is crossed to reach the insert position */
rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];
while (zslCompareWithNode(score, ele, x->level[i].forward) > 0) {
rank[i] += zslGetNodeSpanAtLevel(x, i);
x = x->level[i].forward;
}
update[i] = x;
}
/* Update skiplist level if needed */
if (level > zsl->level) {
for (i = zsl->level; i < level; i++) {
rank[i] = 0;
update[i] = zsl->header;
zslSetNodeSpanAtLevel(update[i], i, zsl->length);
}
zsl->level = level;
zslGetNodeInfo(zsl->header)->levels = level;
}
/* Insert the node at the found position */
for (i = 0; i < level; i++) {
node->level[i].forward = update[i]->level[i].forward;
update[i]->level[i].forward = node;
/* update span covered by update[i] as node is inserted here */
zslSetNodeSpanAtLevel(node, i, zslGetNodeSpanAtLevel(update[i], i) - (rank[0] - rank[i]));
zslSetNodeSpanAtLevel(update[i], i, (rank[0] - rank[i]) + 1);
}
/* increment span for untouched levels */
for (i = level; i < zsl->level; i++) {
zslIncrNodeSpanAtLevel(update[i], i, 1);
}
/* Update backward pointers */
node->backward = (update[0] == zsl->header) ? NULL : update[0];
if (node->level[0].forward)
node->level[0].forward->backward = node;
else
zsl->tail = node;
zsl->length++;
}
/* Insert a new node in the skiplist. Assumes the element does not alreadyEncoding Transition Rules
src/object.h:75The OBJ_ENCODING_* constants and the config thresholds that trigger promotion
/* redisServer config fields -- server.h lines 2426-2440 */
size_t hash_max_listpack_entries;
size_t hash_max_listpack_value;
size_t set_max_intset_entries;
size_t set_max_listpack_entries;
size_t set_max_listpack_value;
size_t zset_max_listpack_entries;
size_t zset_max_listpack_value;
size_t hll_sparse_max_bytes;
size_t stream_node_max_bytes;
long long stream_node_max_entries;
/* ... stream IDMP parameters ... */
/* List parameters */
int list_max_listpack_size;
int list_compress_depth;/* Promotion check in t_hash.c -- lines 619-631 */
if (new_fields > server.hash_max_listpack_entries) {
hashTypeConvert(db, o, OBJ_ENCODING_HT);
dictExpand(o->ptr, new_fields);
return;
}
if (len > server.hash_max_listpack_value) {
hashTypeConvert(db, o, OBJ_ENCODING_HT);
return;
}Every type with a compact encoding checks two independent thresholds before each mutation: entry count and value byte size. Promotion is one-way and immediate. Once hashTypeConvert() rebuilds a listpack as an OBJ_ENCODING_HT dict, there is no path back, even if entries are later removed.
The promotion rules per type:
- Hash:
OBJ_ENCODING_LISTPACKpromotes toOBJ_ENCODING_HTwhenhash_max_listpack_entries(default 128) orhash_max_listpack_value(default 64 bytes) is exceeded - Sorted set:
OBJ_ENCODING_LISTPACKpromotes toOBJ_ENCODING_SKIPLIST(azsetstruct containing both a dict and a skip list) at the same default thresholds - Set: starts as
OBJ_ENCODING_INTSETif all members are integers, promotes toOBJ_ENCODING_LISTPACKorOBJ_ENCODING_HTdepending on what is inserted - List: always
OBJ_ENCODING_QUICKLIST; individual nodes split based onlist_max_listpack_size
Raising hash-max-listpack-entries to 512 keeps more hashes as listpacks, saving per-dictEntry allocation overhead at the cost of O(N) field lookup. Lowering it to 0 gives every hash O(1) lookup from the first field, at the cost of more memory per hash.
Each compact encoding is held until one of two thresholds is crossed: entry count or value byte size. Once either threshold is exceeded, hashTypeConvert() (or its equivalent per type) rebuilds the object in the larger format and never converts back.
---
#define OBJ_ENCODING_RAW 0 /* Raw representation */
#define OBJ_ENCODING_INT 1 /* Encoded as integer */
#define OBJ_ENCODING_HT 2 /* Encoded as hash table */
#define OBJ_ENCODING_ZIPMAP 3 /* No longer used: old hash encoding. */
#define OBJ_ENCODING_LINKEDLIST 4 /* No longer used: old list encoding. */
#define OBJ_ENCODING_ZIPLIST 5 /* No longer used: old list/hash/zset encoding. */
#define OBJ_ENCODING_INTSET 6 /* Encoded as intset */
#define OBJ_ENCODING_SKIPLIST 7 /* Encoded as skiplist */
#define OBJ_ENCODING_EMBSTR 8 /* Embedded sds string encoding */
#define OBJ_ENCODING_QUICKLIST 9 /* Encoded as linked list of listpacks */
#define OBJ_ENCODING_STREAM 10 /* Encoded as a radix tree of listpacks */
#define OBJ_ENCODING_LISTPACK 11 /* Encoded as a listpack */
#define OBJ_ENCODING_LISTPACK_EX 12 /* Encoded as listpack, extended with metadata */
size_t hash_max_listpack_entries;
size_t hash_max_listpack_value;
size_t set_max_intset_entries;
size_t set_max_listpack_entries;
size_t set_max_listpack_value;
size_t zset_max_listpack_entries;
size_t zset_max_listpack_value;
size_t hll_sparse_max_bytes;
size_t stream_node_max_bytes;
long long stream_node_max_entries;
/* Stream IDMP parameters */
long long stream_idmp_duration; /* Default IDMP duration in seconds. */
long long stream_idmp_maxsize; /* Default IDMP max entries. */
/* List parameters */
int list_max_listpack_size;You've walked through 8 key areas of the Redis codebase.
Continue: RESP Protocol Deep Dive: How Redis Talks Over the Wire → Browse all projectsCreate code tours for your project
Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.
Install Intraview Free