Redis Architecture Grand Tour
How Redis processes commands, stores data, and persists to disk
What you will learn
- How Redis bootstraps from a single main() call into a fully wired server with signal handlers, databases, and a network listener
- Why Redis can handle tens of thousands of clients with a single thread: the ae event loop and its epoll/kqueue abstraction
- How the RESP protocol is parsed incrementally from a raw socket read into a command ready for execution
- How every keyspace write is dispatched through a central call() function that handles propagation to AOF and replicas in one place
- How Redis avoids O(n) rehashing pauses by spreading hash table migration across regular read and write operations
Prerequisites
- Comfortable reading C (you don't need to write it, just follow the logic)
- Basic understanding of key-value stores: what GET/SET do
- Familiarity with the concept of an event loop is helpful but not required
Server Lifecycle
src/server.c:8088
How Redis bootstraps from process entry to event loop
main() in server.c is 355 lines of orchestration. By the time you reach line 8103, the config has been parsed, the locale is set, and the entropy pool is seeded. The real wiring happens in initServer(): signal handlers, database allocation, the event loop, and the listening socket all come up inside that one call. After that, main() loads persisted data from disk (RDB or AOF), then hands control to aeMain() -- a while loop that never returns until the process is told to stop. The return 0 at line 8176 is practically unreachable in production; it only runs on a clean shutdown. Notice how loadDataFromDisk() happens after the listener is configured but before the event loop starts -- clients cannot connect and run commands against a half-loaded dataset.
Redis startup is a strict ordered sequence: configure, wire, load data, then enter the event loop -- and that order is not accidental.
serverLog(LL_NOTICE, "oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo");
serverLog(LL_NOTICE,
"Redis version=%s, bits=%d, commit=%s, modified=%d, pid=%d, just started",
REDIS_VERSION,
(sizeof(long) == 8) ? 64 : 32,
redisGitSHA1(),
strtol(redisGitDirty(),NULL,10) > 0,
(int)getpid());
if (argc == 1) {
serverLog(LL_WARNING, "Warning: no config file specified, using the default config.");
}
initServer();
if (background || server.pidfile) createPidFile();
if (server.set_proc_title) redisSetProcTitle(NULL);
redisAsciiArt();
checkTcpBacklogSettings();
if (server.cluster_enabled) {
clusterCommonInit();
clusterInit();
}
if (!server.sentinel_mode) {
moduleInitModulesSystemLast();
moduleLoadInternalModules();
moduleLoadFromQueue();
}
ACLLoadUsersAtStartup();
initListeners();
/* ... */
serverLog(LL_NOTICE,"Server initialized");
loadDataFromDisk();
aeMain(server.el);
aeDeleteEventLoop(server.el);
return 0;

The Event Loop
src/ae.c:360
How Redis handles thousands of clients without threads
The entire Redis concurrency model lives in these 140 lines. aeMain() is a while loop that calls aeProcessEvents() on every iteration. Inside aeProcessEvents(), the key move is aeApiPoll() -- a thin abstraction over epoll on Linux or kqueue on macOS. The poll call blocks for at most as long as the next scheduled timer event (usUntilEarliestTimer). When it returns, Redis has a list of file descriptors that are ready for I/O. Each one gets dispatched to its registered read or write handler. No threads, no callbacks in the Node.js sense -- just one tight loop. The beforesleep and aftersleep hooks let other subsystems (replication, AOF flushing) piggyback on the loop without breaking the single-threaded model.
Redis's legendary throughput comes from a seven-line while loop backed by the OS's own I/O readiness notification -- the event loop is not Redis magic, it is Unix done right.
int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
int processed = 0, numevents;
/* Nothing to do? return ASAP */
if (!(flags & AE_TIME_EVENTS) && !(flags & AE_FILE_EVENTS)) return 0;
if (eventLoop->maxfd != -1 ||
((flags & AE_TIME_EVENTS) && !(flags & AE_DONT_WAIT))) {
int j;
struct timeval tv, *tvp = NULL; /* NULL means infinite wait. */
int64_t usUntilTimer;
if (eventLoop->beforesleep != NULL && (flags & AE_CALL_BEFORE_SLEEP))
eventLoop->beforesleep(eventLoop);
if ((flags & AE_DONT_WAIT) || (eventLoop->flags & AE_DONT_WAIT)) {
tv.tv_sec = tv.tv_usec = 0;
tvp = &tv;
} else if (flags & AE_TIME_EVENTS) {
usUntilTimer = usUntilEarliestTimer(eventLoop);
if (usUntilTimer >= 0) {
tv.tv_sec = usUntilTimer / 1000000;
tv.tv_usec = usUntilTimer % 1000000;
tvp = &tv;
}
}
/* Call the multiplexing API, will return only on timeout or when
* some event fires. */
numevents = aeApiPoll(eventLoop, tvp);
/* After sleep callback. */
if (eventLoop->aftersleep != NULL && flags & AE_CALL_AFTER_SLEEP)
eventLoop->aftersleep(eventLoop);
for (j = 0; j < numevents; j++) {
int fd = eventLoop->fired[j].fd;
aeFileEvent *fe = &eventLoop->events[fd];
int mask = eventLoop->fired[j].mask;
/* dispatch to read or write handler */
}
}
/* ... time event processing ... */
}
void aeMain(aeEventLoop *eventLoop) {
eventLoop->stop = 0;
while (!eventLoop->stop) {
aeProcessEvents(eventLoop, AE_ALL_EVENTS|
AE_CALL_BEFORE_SLEEP|
AE_CALL_AFTER_SLEEP);
}
}

RESP Protocol Parsing
src/networking.c:3718
How raw bytes from a socket become a command
Every client connection in Redis has a client struct that carries its query buffer, parse state, and I/O flags. When the event loop fires a read event on a socket, readQueryFromClient() is the registered handler. The function reads raw bytes into c->querybuf using connRead() -- a thin wrapper that abstracts plain TCP from TLS. The read length is usually PROTO_IOBUF_LEN (16KB), but for large RESP bulk arguments the function calculates the exact remaining bytes to avoid unnecessary copying. After the read, processInputBuffer() walks the buffer looking for complete RESP frames. If it finds one, it parses the command name and arguments and queues execution -- all without blocking the event loop on a slow client.
RESP parsing in Redis is incremental and stateful: each call to readQueryFromClient() may process a partial command, and the client struct retains position across reads.
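Before reading the real handler, it helps to see the incremental idea stripped down. The sketch below is a hypothetical, much-simplified parser (invented names, arrays of bulk strings only, no null bulks): it returns the number of bytes consumed when a complete frame is present, or 0 when the frame is still incomplete, so the caller retains the buffer and waits for the next read event.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Parse "<digits>\r\n" starting at p; return chars consumed, or 0 if no
 * complete CRLF-terminated line fits within len bytes. */
static size_t parse_number_line(const char *p, size_t len, long *out) {
    const char *crlf = memchr(p, '\r', len);
    if (!crlf || (size_t)(crlf - p) + 1 >= len || crlf[1] != '\n') return 0;
    *out = strtol(p, NULL, 10);
    return (crlf - p) + 2;
}

/* Try to consume one RESP array of bulk strings, e.g.
 * "*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n". Returns bytes consumed for a
 * complete frame, or 0 if more data is needed (incremental parsing). */
static size_t parse_resp_frame(const char *buf, size_t len, long *nargs) {
    size_t pos, n;
    long count, blen;
    if (len < 1 || buf[0] != '*') return 0;
    if ((n = parse_number_line(buf + 1, len - 1, &count)) == 0) return 0;
    pos = 1 + n;
    for (long i = 0; i < count; i++) {
        if (pos >= len || buf[pos] != '$') return 0;
        if ((n = parse_number_line(buf + pos + 1, len - pos - 1, &blen)) == 0)
            return 0;
        pos += 1 + n;
        if (len - pos < (size_t)blen + 2) return 0; /* payload + CRLF */
        pos += blen + 2;
    }
    *nargs = count;
    return pos;
}
```

The 0-means-incomplete convention is the whole trick: the real parser does the same, except its position and multibulk state live in the client struct so they survive across reads.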
void readQueryFromClient(connection *conn) {
client *c = connGetPrivateData(conn);
int nread, big_arg = 0;
size_t qblen, readlen;
if (!(c->io_flags & CLIENT_IO_READ_ENABLED)) {
atomicSetWithSync(c->pending_read, 1);
return;
}
readlen = PROTO_IOBUF_LEN;
/* If this is a multi bulk request and we are processing a large argument,
* try to size the read exactly to the argument boundary to avoid copying. */
if (c->reqtype == PROTO_REQ_MULTIBULK && c->multibulklen && c->bulklen != -1
&& c->bulklen >= PROTO_MBULK_BIG_ARG)
{
ssize_t remaining = (size_t)(c->bulklen+2)-(sdslen(c->querybuf)-c->qb_pos);
big_arg = 1;
if (remaining > 0) readlen = remaining;
} else if (c->querybuf == NULL) {
/* Use thread-local reusable query buffer to avoid allocation. */
if (!thread_reusable_qb) {
thread_reusable_qb = sdsnewlen(NULL, PROTO_IOBUF_LEN);
sdsclear(thread_reusable_qb);
}
c->querybuf = thread_reusable_qb;
c->io_flags |= CLIENT_IO_REUSABLE_QUERYBUFFER;
thread_reusable_qb_used = 1;
}
qblen = sdslen(c->querybuf);
nread = connRead(c->conn, c->querybuf+qblen, readlen);
if (nread == -1) {
if (connGetState(conn) == CONN_STATE_CONNECTED) {
goto done; /* EAGAIN -- no data yet, try again next loop */
} else {
c->read_error = CLIENT_READ_CONN_DISCONNECTED;
freeClientAsync(c);
goto done;
}
} else if (nread == 0) {
c->read_error = CLIENT_READ_CONN_CLOSED;
freeClientAsync(c);
goto done;
}
sdsIncrLen(c->querybuf, nread);
/* Parse the buffer and execute if a complete command is present. */
if (processInputBuffer(c) == C_ERR)
c = NULL;
}

Command Dispatch
src/server.c:3878
How a parsed command reaches its implementation and propagates to AOF and replicas
call() is the single chokepoint through which every Redis command passes. The actual execution is one line: c->cmd->proc(c) -- a function pointer in the command table calling, for example, setCommand or getCommand. Everything else around that line is instrumentation and propagation setup. Before the call, Redis snapshots server.dirty (a counter of keyspace changes). After the call, it computes the delta: if dirty increased, this command mutated data and needs to be written to the AOF buffer and propagated to replicas. The CLIENT_FORCE_AOF and CLIENT_FORCE_REPL flags let individual commands override the default propagation rules. The slowlog and latency monitoring hooks also fire here, making call() the natural place to measure command latency.
One function pointer dispatch followed by dirty-counter arithmetic is how Redis decides whether to write to the AOF and propagate to replicas -- simplicity that scales.
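The dirty-counter pattern is easy to reproduce in miniature. Here is a hypothetical sketch (invented names, not the Redis command table): a function-pointer dispatch that snapshots a global dirty counter, runs the command, and diffs the counter to decide whether propagation is needed.

```c
#include <string.h>

/* A toy command table and dirty-counter dispatch (invented, not the
 * Redis implementation). Write commands bump the counter; the
 * dispatcher diffs it to decide on propagation. */
static long long g_dirty = 0;
static char g_value[64];

static void set_cmd(const char *arg) {
    strncpy(g_value, arg, sizeof(g_value) - 1);
    g_dirty++;                       /* keyspace mutated */
}
static void get_cmd(const char *arg) { (void)arg; /* read-only */ }

struct command { const char *name; void (*proc)(const char *); };
static struct command table[] = {
    { "set", set_cmd },
    { "get", get_cmd },
};

/* Mirrors the shape of call(): snapshot dirty, dispatch through the
 * function pointer, diff. Returns 1 if the command mutated data and
 * would need AOF/replica propagation. */
static int dispatch(const char *name, const char *arg) {
    long long dirty = g_dirty;       /* snapshot before execution */
    for (size_t i = 0; i < sizeof(table)/sizeof(table[0]); i++) {
        if (strcmp(table[i].name, name) == 0) {
            table[i].proc(arg);      /* THE call */
            break;
        }
    }
    return (g_dirty - dirty) > 0;    /* mutated -> propagate */
}
```

The elegance is that command implementations never mention replication: they only report "I changed something" via the counter, and one central site decides what that implies.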
void call(client *c, int flags) {
long long dirty;
uint64_t client_old_flags = c->flags;
struct redisCommand *real_cmd = c->realcmd;
client *prev_client = server.executing_client;
server.executing_client = c;
int update_command_stats = !isAOFLoadingContext();
/* Clear propagation flags before execution. */
c->flags &= ~(CLIENT_FORCE_AOF|CLIENT_FORCE_REPL|CLIENT_PREVENT_PROP);
/* Snapshot dirty counter to detect mutations. */
dirty = server.dirty;
long long old_master_repl_offset = server.master_repl_offset;
incrCommandStatsOnError(NULL, 0);
const int use_hw_clock = monotonicGetType() == MONOTONIC_CLOCK_HW;
monotime monotonic_start = 0;
if (use_hw_clock) {
monotonic_start = getMonotonicUs();
/* Sync cached time periodically to avoid repeated syscalls. */
if (server.execution_nesting == 0) {
server.accum_call_count_since_ustime++;
if (monotonic_start - server.monotonic_us_when_ustime > 10 ||
server.accum_call_count_since_ustime > 25)
{
updateCachedTime(0);
monotonic_start = getMonotonicUs();
server.monotonic_us_when_ustime = monotonic_start;
server.accum_call_count_since_ustime = 0;
}
}
}
const long long call_timer = use_hw_clock ? server.ustime : ustime();
enterExecutionUnit(1, call_timer);
c->flags |= CLIENT_EXECUTING_COMMAND;
/* THE call: one function pointer dispatch */
c->cmd->proc(c);
exitExecutionUnit();
if (!(c->flags & CLIENT_BLOCKED)) c->flags &= ~(CLIENT_EXECUTING_COMMAND);
ustime_t duration;
if (use_hw_clock)
duration = getMonotonicUs() - monotonic_start;
else
duration = ustime() - call_timer;
c->duration += duration;
dirty = server.dirty - dirty;
if (dirty < 0) dirty = 0;
/* ... AOF feed and replica propagation follow ... */

Incremental Rehashing
src/dict.c:405
How Redis grows its hash table without blocking
Redis's dict is a chained hash table with two internal tables: ht_table[0] holds the current data and ht_table[1] is the resized target. When the load factor crosses a threshold, Redis does not stop and rehash everything -- it sets rehashidx to 0 and starts migrating one bucket per normal operation. Every call to dictAdd(), dictFind(), or dictDelete() routes through _dictRehashStep(), which calls dictRehash(d, 1) to move exactly one bucket. The empty_visits cap (ten times the requested steps) prevents the function from stalling on a sparse table with many empty slots. The DICT_RESIZE_AVOID flag lets Redis skip rehashing during BGSAVE or BGREWRITEAOF -- fork-based persistence works best when pages are not mutated, so rehashing is deferred.
Redis rehashing is amortized across every subsequent operation: there is no rehash pause, only a steady trickle of bucket migrations piggybacked on normal reads and writes.
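The mechanism is easier to see stripped of dict.c's generality. The toy below is an invented, fixed-size, no-chaining simplification: two tables, at most one occupied bucket migrated per lookup, and a cap on empty buckets visited per step, mirroring the empty_visits logic.

```c
#include <string.h>

/* Toy two-table dict illustrating incremental rehashing (invented and
 * far simpler than dict.c: fixed sizes, no chaining, int keys, -1 =
 * empty slot). */
#define OLD_SIZE 8
#define NEW_SIZE 16
#define EMPTY   -1

struct toydict {
    int ht0[OLD_SIZE];
    int ht1[NEW_SIZE];
    int rehashidx;        /* next ht0 bucket to migrate; -1 = done */
};

static void toydict_init(struct toydict *d) {
    memset(d->ht0, EMPTY, sizeof(d->ht0));  /* -1 is all-ones bytes */
    memset(d->ht1, EMPTY, sizeof(d->ht1));
    d->rehashidx = 0;
}

/* Move at most one occupied bucket from ht0 to ht1, skipping up to
 * empty_visits empty buckets -- the shape of dictRehash(d, 1). */
static void rehash_step(struct toydict *d) {
    int empty_visits = 10;
    while (d->rehashidx < OLD_SIZE && d->ht0[d->rehashidx] == EMPTY) {
        d->rehashidx++;
        if (--empty_visits == 0) return;
    }
    if (d->rehashidx >= OLD_SIZE) { d->rehashidx = -1; return; }
    int key = d->ht0[d->rehashidx];
    d->ht0[d->rehashidx] = EMPTY;
    d->ht1[key % NEW_SIZE] = key;           /* re-bucket in the new table */
    d->rehashidx++;
}

/* Every lookup pays one migration step; reads must check both tables
 * because a key may not have migrated yet. */
static int toydict_find(struct toydict *d, int key) {
    if (d->rehashidx != -1) rehash_step(d);
    return d->ht0[key % OLD_SIZE] == key || d->ht1[key % NEW_SIZE] == key;
}
```

The two-table lookup is the cost of the scheme: until rehashing completes, every read probes both tables, which is why Redis finishes migrations promptly rather than leaving them half-done.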
int dictRehash(dict *d, int n) {
int empty_visits = n*10; /* Max number of empty buckets to visit. */
unsigned long s0 = DICTHT_SIZE(d->ht_size_exp[0]);
unsigned long s1 = DICTHT_SIZE(d->ht_size_exp[1]);
if (dict_can_resize == DICT_RESIZE_FORBID || !dictIsRehashing(d)) return 0;
if (dict_can_resize == DICT_RESIZE_AVOID &&
((s1 > s0 && s1 < dict_force_resize_ratio * s0) ||
(s1 < s0 && s0 < HASHTABLE_MIN_FILL * dict_force_resize_ratio * s1)))
{
return 0;
}
while(n-- && d->ht_used[0] != 0) {
assert(DICTHT_SIZE(d->ht_size_exp[0]) > (unsigned long)d->rehashidx);
while(d->ht_table[0][d->rehashidx] == NULL) {
d->rehashidx++;
if (--empty_visits == 0) return 1;
}
/* Move all the keys in this bucket from the old to the new hash table. */
rehashEntriesInBucketAtIndex(d, d->rehashidx);
d->rehashidx++;
}
return !dictCheckRehashingCompleted(d);
}
/* This function is called by common lookup or update operations in the
* dictionary so that the hash table automatically migrates from H1 to H2
* while it is actively used. */
static void _dictRehashStep(dict *d) {
if (d->pauserehash == 0) dictRehash(d,1);
}
/* Add an element to the target hash table */
int dictAdd(dict *d, void *key __stored_key, void *val)
{
dictEntry *entry = dictAddRaw(d,key,NULL);
if (!entry) return DICT_ERR;
if (!d->type->no_value) dictSetVal(d, entry, val);
return DICT_OK;
}

Sorted Set Internals
src/t_zset.c:254
How a skip list gives Redis O(log N) rank queries
Redis sorted sets (ZSET) use a dual index: a skip list for ordered range queries and a hash table for O(1) score lookups by member. The skip list is the interesting half. zslRandomLevel() decides the height of each new node by flipping a biased coin (ZSKIPLIST_P is 0.25, meaning each level has a 25% chance of growing to the next). In zslInsertNode(), a top-down traversal finds the insert position at each level, storing the predecessor nodes in update[]. The rank[] array tracks cumulative span counts so that after insertion, each node's span -- the number of elements it skips over at a given level -- is updated in O(log N) time. This span count is what makes ZRANK an O(log N) operation rather than O(N).
The span counters in each skip list node are what turn ZRANK and ZRANGE from O(N) scans into O(log N) traversals -- insertion is slightly more expensive to keep rank queries cheap.
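The level generator is small enough to reproduce on its own. The sketch below follows the same biased-coin idea with p = 1/4, but substitutes a deterministic xorshift PRNG (an assumption made here for reproducibility; Redis calls random() against a precomputed threshold).

```c
#include <stdint.h>

#define MAXLEVEL 32      /* like ZSKIPLIST_MAXLEVEL */
#define P_NUM 1          /* promotion probability = 1/4, like ZSKIPLIST_P */
#define P_DEN 4

/* Deterministic xorshift64 PRNG so the sketch is reproducible. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;
static uint64_t xorshift64(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Flip a biased coin: with probability 1/4, grow one more level. The
 * expected node height is 1/(1-p) = 4/3, so the list stays shallow and
 * memory per element stays near two pointers on average. */
static int random_level(void) {
    int level = 1;
    while (level < MAXLEVEL && ((xorshift64() >> 32) % P_DEN) < P_NUM)
        level++;
    return level;
}
```

About 75% of nodes end up at level 1, ~19% at level 2, and so on geometrically: that thinning is exactly what makes the top-down search in zslInsertNode() O(log N) in expectation.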
/* Returns a level between 1 and ZSKIPLIST_MAXLEVEL with a powerlaw
* distribution where higher levels are less likely. */
static int zslRandomLevel(void) {
static const int threshold = ZSKIPLIST_P*RAND_MAX;
int level = 1;
while (random() < threshold)
level += 1;
return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}
/* Insert an already-created node into the skiplist at the correct position. */
static void zslInsertNode(zskiplist *zsl, zskiplistNode *node) {
zskiplistNode *update[ZSKIPLIST_MAXLEVEL];
unsigned long rank[ZSKIPLIST_MAXLEVEL];
zskiplistNode *x;
int i, level = zslGetNodeInfo(node)->levels;
double score = node->score;
sds ele = zslGetNodeElement(node);
/* Walk down from the top level, tracking the last node at each level
* that is still less than the insertion point. */
x = zsl->header;
for (i = zsl->level-1; i >= 0; i--) {
rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];
while (zslCompareWithNode(score, ele, x->level[i].forward) > 0) {
rank[i] += zslGetNodeSpanAtLevel(x, i);
x = x->level[i].forward;
}
update[i] = x;
}
/* Splice the new node into all levels up to its randomly chosen height. */
for (i = 0; i < level; i++) {
node->level[i].forward = update[i]->level[i].forward;
update[i]->level[i].forward = node;
/* Adjust span counts so rank queries remain accurate. */
zslSetNodeSpanAtLevel(node, i,
zslGetNodeSpanAtLevel(update[i], i) - (rank[0] - rank[i]));
zslSetNodeSpanAtLevel(update[i], i, (rank[0] - rank[i]) + 1);
}
/* Update backward pointer for reverse iteration. */
node->backward = (update[0] == zsl->header) ? NULL : update[0];
if (node->level[0].forward)
node->level[0].forward->backward = node;
else
zsl->tail = node;
zsl->length++;
}

Fork-Based Snapshotting
src/rdb.c:2404
How Redis saves a point-in-time snapshot without blocking clients
BGSAVE is three ideas working together: fork(), copy-on-write (COW), and atomic rename. rdbSaveBackground() calls redisFork(), which on success returns 0 in the child and the child PID in the parent. The parent returns C_OK immediately and keeps serving clients -- it never blocks. The child process has an exact copy of the parent's virtual address space courtesy of the OS's COW semantics: pages are shared until one side modifies them. The child then serializes the entire dataset to a temp file (temp-<pid>.rdb) using rdbSaveInternal(). When done, it calls rename() -- an atomic operation on POSIX systems -- to replace the live RDB file. Any reader that already has the old RDB file open still sees a consistent snapshot; the new file appears atomically under the final name.
Redis achieves non-blocking snapshots by delegating the entire serialization to a forked child; the OS's copy-on-write mechanism means the snapshot is consistent without any locking on the parent.
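The fork-then-rename pattern can be demonstrated outside Redis. Below is a minimal sketch with invented names; note one deliberate simplification: the parent here waits for the child so the result can be observed immediately, whereas Redis returns to the event loop and reaps the child later via a SIGCHLD-driven check.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch of the BGSAVE pattern: the child writes a temp
 * file named after its own pid and atomically rename()s it over the
 * target, so no reader ever observes a partially written snapshot. */
static int save_background(const char *filename, const char *payload) {
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        /* Child: "serialize" (here, just the payload), then rename. */
        char tmpfile[256];
        snprintf(tmpfile, sizeof(tmpfile), "temp-%d.rdb", (int)getpid());
        FILE *fp = fopen(tmpfile, "w");
        if (!fp) _exit(1);
        fputs(payload, fp);
        fclose(fp);
        if (rename(tmpfile, filename) == -1) { unlink(tmpfile); _exit(1); }
        _exit(0);
    }
    /* Parent: Redis would return to the event loop here; this sketch
     * waits so the finished snapshot can be checked right away. */
    int status;
    waitpid(pid, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The child exits with _exit() rather than exit() so the parent's stdio buffers and atexit handlers, inherited by the fork, are never flushed twice.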
int rdbSave(int req, char *filename, rdbSaveInfo *rsi, int rdbflags) {
char tmpfile[256];
char cwd[MAXPATHLEN];
startSaving(rdbflags);
snprintf(tmpfile, 256, "temp-%d.rdb", (int) getpid());
if (rdbSaveInternal(req, tmpfile, rsi, rdbflags) != C_OK) {
stopSaving(0);
return C_ERR;
}
/* Atomic rename -- clients never see a partial RDB file. */
if (rename(tmpfile, filename) == -1) {
serverLog(LL_WARNING, "Error moving temp DB file %s on the final destination", tmpfile);
unlink(tmpfile);
stopSaving(0);
return C_ERR;
}
/* ... */
}
int rdbSaveBackground(int req, char *filename, rdbSaveInfo *rsi, int rdbflags) {
pid_t childpid;
if (hasActiveChildProcess()) return C_ERR;
server.stat_rdb_saves++;
server.dirty_before_bgsave = server.dirty;
server.lastbgsave_try = time(NULL);
if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
/* Child process: serialize the dataset and exit. */
redisSetProcTitle("redis-rdb-bgsave");
redisSetCpuAffinity(server.bgsave_cpulist);
int retval = rdbSave(req, filename, rsi, rdbflags);
if (retval == C_OK) {
sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
}
exitFromChild((retval == C_OK) ? 0 : 1, 0);
} else {
/* Parent process: record child PID and return immediately. */
if (childpid == -1) {
server.lastbgsave_status = C_ERR;
serverLog(LL_WARNING, "Can't save in background: fork: %s", strerror(errno));
return C_ERR;
}
serverLog(LL_NOTICE, "Background saving started by pid %ld", (long) childpid);
server.rdb_save_time_start = time(NULL);
server.rdb_child_type = RDB_CHILD_TYPE_DISK;
return C_OK;
}
return C_OK; /* unreached */
}

AOF Persistence
src/aof.c:2748
How every write command is recorded for durability
Every write command that passes through call() is also handed to feedAppendOnlyFile(). The function does not write to disk -- it appends to server.aof_buf, an in-memory SDS string. The actual write() syscall happens in flushAppendOnlyFile(), which runs in the beforesleep hook of the event loop, just before Redis blocks on aeApiPoll(). This means the disk write always happens after the client reply is queued but before Redis sleeps -- so durability lags the client response by at most one event loop iteration. The SELECT injection is a subtle correctness detail: the AOF must be self-contained, so that replaying it from an empty dataset (as AOF loading and redis-check-aof do) lands every write in the correct database. The AOF_WAIT_REWRITE state handles the case where an AOF rewrite is in progress and new commands must be buffered for both the live AOF and the rewrite child.
AOF durability is a two-stage pipeline: commands land in an in-memory buffer inside call(), then flush to disk in the event loop's pre-sleep hook -- decoupling command latency from disk I/O.
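The serialization that feedAppendOnlyFile() relies on is straightforward to sketch. The hypothetical helper below writes an argv array as a RESP array of bulk strings into a fixed buffer; the real code appends to a growable sds string via catAppendOnlyGenericCommand(), but the wire format is the same one replicas receive.

```c
#include <stdio.h>
#include <string.h>

/* Invented sketch of AOF command serialization: encode argc/argv as a
 * RESP array of bulk strings ("*<argc>\r\n" then "$<len>\r\n<arg>\r\n"
 * per argument). Returns the number of bytes written. */
static size_t resp_encode(char *out, size_t cap, int argc, char **argv) {
    size_t pos = 0;
    pos += snprintf(out + pos, cap - pos, "*%d\r\n", argc);
    for (int i = 0; i < argc; i++) {
        pos += snprintf(out + pos, cap - pos, "$%zu\r\n%s\r\n",
                        strlen(argv[i]), argv[i]);
    }
    return pos;
}
```

Because the AOF stores commands in the same RESP encoding the parser already understands, loading the file at startup is literally replaying client traffic against an empty server.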
void feedAppendOnlyFile(int dictid, robj **argv, int argc) {
sds buf = sdsempty();
serverAssert(dictid == -1 || (dictid >= 0 && dictid < server.dbnum));
/* Prepend a timestamp annotation if aof-timestamp-enabled is set. */
if (server.aof_timestamp_enabled) {
sds ts = genAofTimestampAnnotationIfNeeded(0);
if (ts != NULL) {
buf = sdscatsds(buf, ts);
sdsfree(ts);
}
}
/* If the target DB changed since the last appended command,
* emit a SELECT so the AOF is self-contained. */
if (dictid != -1 && dictid != server.aof_selected_db) {
char seldb[64];
snprintf(seldb, sizeof(seldb), "%d", dictid);
buf = sdscatprintf(buf, "*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
(unsigned long)strlen(seldb), seldb);
server.aof_selected_db = dictid;
}
/* Serialize the command in RESP format and append to the buffer. */
buf = catAppendOnlyGenericCommand(buf, argc, argv);
/* Append to the in-memory AOF buffer. It will be flushed to disk just
* before the next event loop iteration, after the client gets its reply. */
if (server.aof_state == AOF_ON ||
(server.aof_state == AOF_WAIT_REWRITE &&
server.child_type == CHILD_TYPE_AOF))
{
server.aof_buf = sdscatlen(server.aof_buf, buf, sdslen(buf));
}
sdsfree(buf);
}