Redis C redis/redis

Redis Persistence Deep Dive: RDB, AOF, and the Hybrid

How Redis survives a restart: binary snapshots, append-only logs, and the hybrid mode that combines both

8 stops ~30 min Verified 2026-04-30
What you will learn
  • How the RDB binary format is structured: magic header, version number, length-encoded types, DB selector opcodes, and the CRC64 footer that guards the whole file
  • Why `rdbSaveBackground()` can snapshot gigabytes of data without pausing a single client command -- and what copy-on-write actually buys you here
  • How every write command is serialised into `server.aof_buf` inside `feedAppendOnlyFile()` and then flushed to disk in the event loop's pre-sleep hook, not inline with the command
  • Why AOF rewrite is a separate child process writing a fresh minimal log while the parent buffers new commands into a diff stream that gets merged at the end
  • What `aof-use-rdb-preamble` really does: the base file gets the compact RDB layout, the incremental tail stays as RESP, and `loadSingleAppendOnlyFile()` auto-detects which format it is reading by checking the first five bytes
  • How the three `appendfsync` policies map to three completely different latency and durability points, and which one Redis's own documentation has called the right default for most workloads
  • Why replication and persistence are tightly coupled: a full sync to a new replica triggers `rdbSaveBackground()` on the same code path as `BGSAVE`, and the replication backlog must stay consistent with the AOF buffer during that window
Prerequisites
  • Comfortable reading C (you don't need to write it, just follow the logic)
  • Familiarity with what `fork()` does in Unix processes
  • Basic understanding of Redis data types: strings, hashes, lists, sets, sorted sets
1 / 8

RDB Binary Format

src/rdb.h:21

Magic header, version, length-encoding, type constants, opcode markers, and CRC64 footer

The RDB file opens with the nine-byte string "REDIS%04d" (for example "REDIS0014") written by rdbSaveRio() in src/rdb.c (line 1762). That magic word plus the zero-padded version number is exactly what loadSingleAppendOnlyFile() inspects when it reads the first five bytes to decide whether a file is RDB or RESP.

Each database section is introduced by RDB_OPCODE_SELECTDB (254) followed by the database number, then RDB_OPCODE_RESIZEDB with a count hint for pre-allocation. Every key-value pair is prefixed by a one-byte type from the frozen RDB_TYPE_* constants; the in-memory OBJ_* constants can evolve, but the on-disk wire types cannot. The two most significant bits of each length byte determine encoding width: 6-bit, 14-bit, 32-bit, 64-bit, or a specially encoded integer or LZF-compressed string (RDB_ENC_LZF).

The file ends with RDB_OPCODE_EOF (255) and eight bytes of CRC64 checksum, accumulated incrementally by rioGenericUpdateChecksum() so no second pass is needed. During diskless replication, a zero checksum signals the loader to skip verification, trading safety for throughput.

Key takeaway

The RDB format is a flat binary stream. Every byte position follows from the preceding length fields and type opcodes. That makes sequential writes fast, but appending after the fact requires rewriting the CRC footer. ---

#define RDB_VERSION 14

/* Defines related to the dump file format. To store 32 bits lengths for short
 * keys requires a lot of space, so we check the most significant 2 bits of
 * the first byte to interpreter the length:
 *
 * 00|XXXXXX => if the two MSB are 00 the len is the 6 bits of this byte
 * 01|XXXXXX XXXXXXXX =>  01, the len is 14 bits, 6 bits + 8 bits of next byte
 * 10|000000 [32 bit integer] => A full 32 bit len in net byte order will follow
 * 10|000001 [64 bit integer] => A full 64 bit len in net byte order will follow
 * 11|OBKIND this means: specially encoded object will follow. The six bits
 *           number specify the kind of object that follows.
 *           See the RDB_ENC_* defines.
 *
 * Lengths up to 63 are stored using a single byte, most DB keys, and may
 * values, will fit inside. */
#define RDB_6BITLEN 0
#define RDB_14BITLEN 1
#define RDB_32BITLEN 0x80
#define RDB_64BITLEN 0x81
#define RDB_ENCVAL 3
#define RDB_LENERR UINT64_MAX

/* When a length of a string object stored on disk has the first two bits
 * set, the remaining six bits specify a special encoding for the object
 * accordingly to the following defines: */
#define RDB_ENC_INT8 0        /* 8 bit signed integer */
#define RDB_ENC_INT16 1       /* 16 bit signed integer */
#define RDB_ENC_INT32 2       /* 32 bit signed integer */
#define RDB_ENC_LZF 3         /* string compressed with FASTLZ */

/* Map object types to RDB object types. Macros starting with OBJ_ are for
 * memory storage and may change. Instead RDB types must be fixed because
 * we store them on disk. */
#define RDB_TYPE_STRING 0
#define RDB_TYPE_LIST   1
#define RDB_TYPE_SET    2
#define RDB_TYPE_ZSET   3
#define RDB_TYPE_HASH   4
#define RDB_TYPE_ZSET_2 5 /* ZSET version 2 with doubles stored in binary. */
#define RDB_TYPE_MODULE_PRE_GA 6 /* Used in 4.0 release candidates */
#define RDB_TYPE_MODULE_2 7 /* Module value with annotations for parsing without
                               the generating module being loaded. */
#define RDB_TYPE_HASH_ZIPMAP    9
#define RDB_TYPE_LIST_ZIPLIST  10
#define RDB_TYPE_SET_INTSET    11
#define RDB_TYPE_ZSET_ZIPLIST  12
#define RDB_TYPE_HASH_ZIPLIST  13
#define RDB_TYPE_LIST_QUICKLIST 14
#define RDB_TYPE_STREAM_LISTPACKS 15
#define RDB_TYPE_HASH_LISTPACK 16
#define RDB_TYPE_ZSET_LISTPACK 17
#define RDB_TYPE_LIST_QUICKLIST_2   18
#define RDB_TYPE_STREAM_LISTPACKS_2 19
#define RDB_TYPE_SET_LISTPACK  20
#define RDB_TYPE_STREAM_LISTPACKS_3 21
#define RDB_TYPE_HASH_METADATA_PRE_GA 22      /* Hash with HFEs. Doesn't attach min TTL at start (7.4 RC) */
#define RDB_TYPE_HASH_LISTPACK_EX_PRE_GA 23   /* Hash LP with HFEs. Doesn't attach min TTL at start (7.4 RC) */
#define RDB_TYPE_HASH_METADATA 24             /* Hash with HFEs. Attach min TTL at start */
#define RDB_TYPE_HASH_LISTPACK_EX 25          /* Hash LP with HFEs. Attach min TTL at start */
#define RDB_TYPE_STREAM_LISTPACKS_4 26        /* Stream with IDMP support */
#define RDB_TYPE_STREAM_LISTPACKS_5 27        /* Stream with XNACK support (NACKed entries) */
#define RDB_TYPE_GCRA 28                      /* GCRA object */
/* NOTE: WHEN ADDING NEW RDB TYPE, UPDATE rdbIsObjectType(), and rdb_type_string[] */

/* Test if a type is an object type. */
#define rdbIsObjectType(t) (((t) >= 0 && (t) <= 7) || ((t) >= 9 && (t) <= 28))

/* Special RDB opcodes (saved/loaded with rdbSaveType/rdbLoadType). */
#define RDB_OPCODE_KEY_META   243   /* Key metadata (module metadata classes). */
#define RDB_OPCODE_SLOT_INFO  244   /* Individual slot info, such as slot id and size (cluster mode only). */
#define RDB_OPCODE_FUNCTION2  245   /* function library data */
#define RDB_OPCODE_FUNCTION_PRE_GA   246   /* old function library data for 7.0 rc1 and rc2 */
#define RDB_OPCODE_MODULE_AUX 247   /* Module auxiliary data. */
#define RDB_OPCODE_IDLE       248   /* LRU idle time. */
#define RDB_OPCODE_FREQ       249   /* LFU frequency. */
#define RDB_OPCODE_AUX        250   /* RDB aux field. */
#define RDB_OPCODE_RESIZEDB   251   /* Hash table resize hint. */
#define RDB_OPCODE_EXPIRETIME_MS 252    /* Expire time in milliseconds. */
#define RDB_OPCODE_EXPIRETIME 253       /* Old expire time in seconds. */
#define RDB_OPCODE_SELECTDB   254   /* DB number of the following keys. */
#define RDB_OPCODE_EOF        255   /* End of the RDB file. */
2 / 8

BGSAVE Fork Strategy

src/rdb.c:1899

rdbSaveBackground() forks a child, the child serialises via rdbSave() and atomic rename

rdbSaveBackground() is the entry point for both the BGSAVE command and the save config directive fired by serverCron. It immediately checks hasActiveChildProcess(); Redis enforces one background child at a time to prevent two concurrent forks from doubling copy-on-write memory cost. It then calls redisFork(CHILD_TYPE_RDB), a thin wrapper around fork() that also sets up the child info pipe for COW reporting.

The return value splits the two code paths: zero is the child, non-zero is the parent with the child's PID. The parent records the start time and returns C_OK immediately, with no lock and no blocking. The child writes to a temp file named temp-<pid>.rdb, then rename() atomically replaces the destination. fsyncFileDir() follows to flush the directory entry, since a crash just after rename() could otherwise leave the directory pointing to the old inode.

Before exiting, sendChildCowInfo() reports the COW page count back through the pipe. That number surfaces in INFO persistence as rdb_last_cow_size, the best signal for whether write traffic during a save is causing memory amplification.

Key takeaway

fork() gives the child a zero-cost snapshot of the address space. Copy-on-write then charges the parent only for pages it modifies after the fork. No application-level locking is involved. ---

int rdbSave(int req, char *filename, rdbSaveInfo *rsi, int rdbflags) {
    char tmpfile[256];
    char cwd[MAXPATHLEN]; /* Current working dir path for error messages. */

    startSaving(rdbflags);
    snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());

    if (rdbSaveInternal(req,tmpfile,rsi,rdbflags) != C_OK) {
        stopSaving(0);
        return C_ERR;
    }
    
    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generate DB file is ok. */
    if (rename(tmpfile,filename) == -1) {
        char *str_err = strerror(errno);
        char *cwdp = getcwd(cwd,MAXPATHLEN);
        serverLog(LL_WARNING,
            "Error moving temp DB file %s on the final "
            "destination %s (in server root dir %s): %s",
            tmpfile,
            filename,
            cwdp ? cwdp : "unknown",
            str_err);
        unlink(tmpfile);
        stopSaving(0);
        return C_ERR;
    }
    if (fsyncFileDir(filename) != 0) {
        serverLog(LL_WARNING,
            "Failed to fsync directory while saving DB: %s", strerror(errno));
        stopSaving(0);
        return C_ERR;
    }

    serverLog(LL_NOTICE,"DB saved on disk");
    server.dirty = 0;
    server.lastsave = time(NULL);
    server.lastbgsave_status = C_OK;
    stopSaving(1);
    return C_OK;
}

int rdbSaveBackground(int req, char *filename, rdbSaveInfo *rsi, int rdbflags) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR;
    server.stat_rdb_saves++;

    server.dirty_before_bgsave = server.dirty;
    server.lastbgsave_try = time(NULL);

    if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
        int retval;

        /* Child */
        redisSetProcTitle("redis-rdb-bgsave");
        redisSetCpuAffinity(server.bgsave_cpulist);
        retval = rdbSave(req, filename,rsi,rdbflags);
        if (retval == C_OK) {
            sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
        }
        exitFromChild((retval == C_OK) ? 0 : 1, 0);
    } else {
        /* Parent */
        if (childpid == -1) {
            server.lastbgsave_status = C_ERR;
            serverLog(LL_WARNING,"Can't save in background: fork: %s",
                strerror(errno));
            return C_ERR;
        }
        serverLog(LL_NOTICE,"Background saving started by pid %ld",(long) childpid);
        server.rdb_save_time_start = time(NULL);
        server.rdb_child_type = RDB_CHILD_TYPE_DISK;
        return C_OK;
    }
    return C_OK; /* unreached */
}

/* Note that we may call this function in signal handle 'sigShutdownHandler',
 * so we need guarantee all functions we call are async-signal-safe.
 * If we call this function from signal handle, we won't call bg_unlink that
 * is not async-signal-safe. */
void rdbRemoveTempFile(pid_t childpid, int from_signal) {
    char tmpfile[256];
    char pid[32];
3 / 8

AOF Append and Buffering

src/aof.c:1409

feedAppendOnlyFile() writes to server.aof_buf; flushAppendOnlyFile() drains it from the beforesleep hook

feedAppendOnlyFile() is called by the propagation layer inside call() after every write command. It performs no I/O: it serialises the command into RESP via catAppendOnlyGenericCommand() and appends the result to server.aof_buf, a plain SDS string that lives for the lifetime of the server process.

The SELECT injection near the top is a correctness requirement. The AOF must be self-contained for redis-check-aof and rewrite to replay against an empty dataset. If the current command targeted a different database than the last appended command, a SELECT frame is prepended and aof_selected_db is updated to suppress future duplicates.

The actual write() syscall happens in flushAppendOnlyFile() (src/aof.c line 1147), called from the event loop's beforesleep hook just before aeApiPoll() blocks on I/O. The client reply is already queued; disk I/O drains separately. The AOF_WAIT_REWRITE branch keeps new commands flowing into aof_buf while a rewrite child is running so incremental files stay current.

Key takeaway

feedAppendOnlyFile() writes nothing to disk. The command is serialised into server.aof_buf inside call(), and the actual write() syscall runs in the pre-sleep hook, decoupled from command execution. ---

void feedAppendOnlyFile(int dictid, robj **argv, int argc) {
    sds buf = sdsempty();

    serverAssert(dictid == -1 || (dictid >= 0 && dictid < server.dbnum));

    /* Feed timestamp if needed */
    if (server.aof_timestamp_enabled) {
        sds ts = genAofTimestampAnnotationIfNeeded(0);
        if (ts != NULL) {
            buf = sdscatsds(buf, ts);
            sdsfree(ts);
        }
    }

    /* The DB this command was targeting is not the same as the last command
     * we appended. To issue a SELECT command is needed. */
    if (dictid != -1 && dictid != server.aof_selected_db) {
        char seldb[64];

        snprintf(seldb,sizeof(seldb),"%d",dictid);
        buf = sdscatprintf(buf,"*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
            (unsigned long)strlen(seldb),seldb);
        server.aof_selected_db = dictid;
    }

    /* All commands should be propagated the same way in AOF as in replication.
     * No need for AOF-specific translation. */
    buf = catAppendOnlyGenericCommand(buf,argc,argv);

    /* Append to the AOF buffer. This will be flushed on disk just before
     * of re-entering the event loop, so before the client will get a
     * positive reply about the operation performed. */
    if (server.aof_state == AOF_ON ||
        (server.aof_state == AOF_WAIT_REWRITE && server.child_type == CHILD_TYPE_AOF))
    {
        server.aof_buf = sdscatlen(server.aof_buf, buf, sdslen(buf));
    }

    sdsfree(buf);
}

/* ----------------------------------------------------------------------------
4 / 8

AOF Rewrite

src/aof.c:2664

rewriteAppendOnlyFileBackground() forks a child to write a minimal AOF; the parent accumulates new commands in a diff buffer during the rewrite

AOF files accumulate mutations indefinitely. A file recording every INCR over a week is far larger than a single SET with the final value. BGREWRITEAOF solves this by forking a child that walks the in-memory dataset and writes the minimal command set needed to reconstruct it from scratch.

Before forking, the function flushes the current aof_buf to disk with flushAppendOnlyFile(1), then opens a new incremental file via openNewIncrAofForAppend(). This incremental file is the diff buffer: every write that arrives while the child is running appends there. The directory layout keeps base files (rewrite output) separate from incremental files (ongoing diff), so there is no coverage gap.

The child calls rewriteAppendOnlyFile(), which writes an RDB-format base snapshot if aof_use_rdb_preamble is on, or iterates the dataset command-by-command with rewriteAppendOnlyFileRio() if it is off. When the child exits, backgroundRewriteDoneHandler() promotes the new base file. The aof_selected_db = -1 reset at the top of the function forces a fresh SELECT at the start of the new incremental file, establishing unambiguous database context.

Key takeaway

The child compacts the current dataset while the parent buffers new writes into a separate incremental file. The server is never blocked during rewrite. ---

int rewriteAppendOnlyFile(char *filename) {
    rio aof;
    FILE *fp = NULL;
    char tmpfile[256];

    /* Note that we have to use a different temp name here compared to the
     * one used by rewriteAppendOnlyFileBackground() function. */
    snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
    fp = fopen(tmpfile,"w");
    if (!fp) {
        serverLog(LL_WARNING, "Opening the temp file for AOF rewrite in rewriteAppendOnlyFile(): %s", strerror(errno));
        return C_ERR;
    }

    rioInitWithFile(&aof,fp);

    if (server.aof_rewrite_incremental_fsync) {
        rioSetAutoSync(&aof,REDIS_AUTOSYNC_BYTES);
        rioSetReclaimCache(&aof,1);
    }

    startSaving(RDBFLAGS_AOF_PREAMBLE);

    if (server.aof_use_rdb_preamble) {
        int error;
        if (rdbSaveRio(SLAVE_REQ_NONE,&aof,&error,RDBFLAGS_AOF_PREAMBLE,NULL) == C_ERR) {
            errno = error;
            goto werr;
        }
    } else {
        if (rewriteAppendOnlyFileRio(&aof) == C_ERR) goto werr;
    }

    /* Make sure data will not remain on the OS's output buffers */
    if (fflush(fp)) goto werr;
    if (fsync(fileno(fp))) goto werr;
    if (reclaimFilePageCache(fileno(fp), 0, 0) == -1) {
        /* A minor error. Just log to know what happens */
        serverLog(LL_NOTICE,"Unable to reclaim page cache: %s", strerror(errno));
    }
    if (fclose(fp)) { fp = NULL; goto werr; }
    fp = NULL;

    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generate DB file is ok. */
    if (rename(tmpfile,filename) == -1) {
        serverLog(LL_WARNING,"Error moving temp append only file on the final destination: %s", strerror(errno));
        unlink(tmpfile);
        stopSaving(0);
        return C_ERR;
    }
    stopSaving(1);

    return C_OK;

werr:
    serverLog(LL_WARNING,"Write error writing append only file on disk: %s", strerror(errno));
    if (fp) fclose(fp);
    unlink(tmpfile);
    stopSaving(0);
    return C_ERR;
}
/* ----------------------------------------------------------------------------
 * AOF background rewrite
 * ------------------------------------------------------------------------- */

/* This is how rewriting of the append only file in background works:
 *
 * 1) The user calls BGREWRITEAOF
 * 2) Redis calls this function, that forks():
 *    2a) the child rewrite the append only file in a temp file.
 *    2b) the parent open a new INCR AOF file to continue writing.
 * 3) When the child finished '2a' exists.
 * 4) The parent will trap the exit code, if it's OK, it will:
 *    4a) get a new BASE file name and mark the previous (if we have) as the HISTORY type
 *    4b) rename(2) the temp file in new BASE file name
 *    4c) mark the rewritten INCR AOFs as history type
 *    4d) persist AOF manifest file
 *    4e) Delete the history files use bio
 */
int rewriteAppendOnlyFileBackground(void) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR;

    if (dirCreateIfMissing(server.aof_dirname) == -1) {
        serverLog(LL_WARNING, "Can't open or create append-only dir %s: %s",
            server.aof_dirname, strerror(errno));
        server.aof_lastbgrewrite_status = C_ERR;
        return C_ERR;
    }

    /* We set aof_selected_db to -1 in order to force the next call to the
     * feedAppendOnlyFile() to issue a SELECT command. */
    server.aof_selected_db = -1;
    flushAppendOnlyFile(1);
    if (openNewIncrAofForAppend() != C_OK) {
        server.aof_lastbgrewrite_status = C_ERR;
        return C_ERR;
    }

    if (server.aof_state == AOF_WAIT_REWRITE) {
        /* Wait for all bio jobs related to AOF to drain. This prevents a race
         * between updates to `fsynced_reploff_pending` of the worker thread, belonging
         * to the previous AOF, and the new one. This concern is specific for a full
         * sync scenario where we don't wanna risk the ACKed replication offset
         * jumping backwards or forward when switching to a different master. */
        bioDrainWorker(BIO_AOF_FSYNC);

        /* Set the initial repl_offset, which will be applied to fsynced_reploff
         * when AOFRW finishes (after possibly being updated by a bio thread) */
        atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
        server.fsynced_reploff = 0;
    }

    server.stat_aof_rewrites++;

    if ((childpid = redisFork(CHILD_TYPE_AOF)) == 0) {
        char tmpfile[256];

        /* Child */
        redisSetProcTitle("redis-aof-rewrite");
        redisSetCpuAffinity(server.aof_rewrite_cpulist);
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
        if (rewriteAppendOnlyFile(tmpfile) == C_OK) {
            serverLog(LL_NOTICE,
                "Successfully created the temporary AOF base file %s", tmpfile);
            sendChildCowInfo(CHILD_INFO_TYPE_AOF_COW_SIZE, "AOF rewrite");
            exitFromChild(0, 0);
        } else {
            exitFromChild(1, 0);
        }
    } else {
        /* Parent */
        if (childpid == -1) {
            server.aof_lastbgrewrite_status = C_ERR;
            serverLog(LL_WARNING,
                "Can't rewrite append only file in background: fork: %s",
                strerror(errno));
            return C_ERR;
        }
        serverLog(LL_NOTICE,
            "Background append only file rewriting started by pid %ld",(long) childpid);
        server.aof_rewrite_scheduled = 0;
        server.aof_rewrite_time_start = time(NULL);
        return C_OK;
    }
    return C_OK; /* unreached */
}

void bgrewriteaofCommand(client *c) {
    if (server.child_type == CHILD_TYPE_AOF) {
        addReplyError(c,"Background append only file rewriting already in progress");
    } else if (hasActiveChildProcess() || server.in_exec) {
        server.aof_rewrite_scheduled = 1;
        /* When manually triggering AOFRW we reset the count 
         * so that it can be executed immediately. */
5 / 8

Hybrid Persistence

src/aof.c:437

aof-use-rdb-preamble yes writes an RDB snapshot as the base file; loadSingleAppendOnlyFile() detects the format from the first five bytes

    /* Check if the AOF file is in RDB format (it may be RDB encoded base AOF
     * or old style RDB-preamble AOF). In that case we need to load the RDB file 
     * and later continue loading the AOF tail if it is an old style RDB-preamble AOF. */
    char sig[5]; /* "REDIS" */
    if (fread(sig,1,5,fp) != 5 || memcmp(sig,"REDIS",5) != 0) {
        /* Not in RDB format, seek back at 0 offset. */
        if (fseek(fp,0,SEEK_SET) == -1) goto readerr;
    } else {
        /* RDB format. Pass loading the RDB functions. */
        rio rdb;
        int old_style = !strcmp(filename, server.aof_filename);
        if (old_style)
            serverLog(LL_NOTICE, "Reading RDB preamble from AOF file...");
        else 
            serverLog(LL_NOTICE, "Reading RDB base file on AOF loading..."); 

        if (fseek(fp,0,SEEK_SET) == -1) goto readerr;
        rioInitWithFile(&rdb,fp);
        if (rdbLoadRio(&rdb,RDBFLAGS_AOF_PREAMBLE,NULL) != C_OK) {
    if (server.aof_use_rdb_preamble) {
        int error;
        if (rdbSaveRio(SLAVE_REQ_NONE,&aof,&error,RDBFLAGS_AOF_PREAMBLE,NULL) == C_ERR) {
            errno = error;
            goto werr;
        }
    } else {
        if (rewriteAppendOnlyFileRio(&aof) == C_ERR) goto werr;
    }

Hybrid persistence is the default in Redis 7+. When aof-use-rdb-preamble yes is set, the base file from a rewrite is written in RDB binary format rather than RESP, and the extension changes from .base.aof to .base.rdb. Incremental diff files on top of it remain plain RESP.

The payoff is startup speed. Loading a large dataset from RESP requires parsing and re-executing every command; loading from RDB is a direct memory copy with no parsing overhead. After rdbLoadRio() finishes the base file, loadSingleAppendOnlyFile() continues reading the incremental .aof files that accumulated since the last rewrite, replaying only the small delta.

Detection is format-driven: the loader reads five bytes, checks for "REDIS", and routes to rdbLoadRio() on a match or falls back to the RESP loop on a miss. The RDBFLAGS_AOF_PREAMBLE flag tells rdbLoadRio() to leave the file pointer open at EOF so the caller can continue reading the RESP tail in the older single-file hybrid layout.

Key takeaway

Hybrid mode stores a compact RDB snapshot as the rewrite base and layers RESP incremental files on top. The loader auto-detects the format per file from the first five bytes, with no load-time configuration required. ---

 *  appendonly.aof.1.base.aof  (server.aof_use_rdb_preamble is no)
 *  appendonly.aof.1.base.rdb  (server.aof_use_rdb_preamble is yes)
 */
sds getNewBaseFileNameAndMarkPreAsHistory(aofManifest *am) {
    serverAssert(am != NULL);
    if (am->base_aof_info) {
        serverAssert(am->base_aof_info->file_type == AOF_FILE_TYPE_BASE);
        am->base_aof_info->file_type = AOF_FILE_TYPE_HIST;
        listAddNodeHead(am->history_aof_list, am->base_aof_info);
    }

    char *format_suffix = server.aof_use_rdb_preamble ?
        RDB_FORMAT_SUFFIX:AOF_FORMAT_SUFFIX;


     * or old style RDB-preamble AOF). In that case we need to load the RDB file 
     * and later continue loading the AOF tail if it is an old style RDB-preamble AOF. */
    char sig[5]; /* "REDIS" */
    if (fread(sig,1,5,fp) != 5 || memcmp(sig,"REDIS",5) != 0) {
        /* Not in RDB format, seek back at 0 offset. */
        if (fseek(fp,0,SEEK_SET) == -1) goto readerr;
    } else {
        /* RDB format. Pass loading the RDB functions. */
        rio rdb;
        int old_style = !strcmp(filename, server.aof_filename);
        if (old_style)
            serverLog(LL_NOTICE, "Reading RDB preamble from AOF file...");
        else 
            serverLog(LL_NOTICE, "Reading RDB base file on AOF loading..."); 

        if (fseek(fp,0,SEEK_SET) == -1) goto readerr;
        rioInitWithFile(&rdb,fp);
        if (rdbLoadRio(&rdb,RDBFLAGS_AOF_PREAMBLE,NULL) != C_OK) {
            if (old_style)
                serverLog(LL_WARNING, "Error reading the RDB preamble of the AOF file %s, AOF loading aborted", filename);
            else
                serverLog(LL_WARNING, "Error reading the RDB base file %s, AOF loading aborted", filename);


    startSaving(RDBFLAGS_AOF_PREAMBLE);

    if (server.aof_use_rdb_preamble) {
        int error;
        if (rdbSaveRio(SLAVE_REQ_NONE,&aof,&error,RDBFLAGS_AOF_PREAMBLE,NULL) == C_ERR) {
            errno = error;
            goto werr;
        }
    } else {
        if (rewriteAppendOnlyFileRio(&aof) == C_ERR) goto werr;
    }
6 / 8

fsync Policies

src/server.h:639

appendfsync always | everysec | no, the background sync thread, latency versus durability

void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;
    mstime_t latency;

    if (sdslen(server.aof_buf) == 0) {
        /* ... handle edge case where buffer is empty but fsync still needed ... */
        return;
    }

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = aofFsyncInProgress();

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                server.aof_flush_postponed_start = server.mstime;
                return;
            } else if (server.mstime - server.aof_flush_postponed_start < 2000) {
                return;
            }
            server.aof_delayed_fsync++;
            serverLog(LL_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
        }
    }

    nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));

    /* ... error handling ... */

    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* redis_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        if (redis_fsync(server.aof_fd) == -1) {
            serverLog(LL_WARNING,"Can't persist AOF for fsync error when the "
              "AOF fsync policy is 'always': %s. Exiting...", strerror(errno));
            exit(1);
        }
        latencyAddSampleIfNeeded("aof-fsync-always",latency);
        server.aof_last_fsync = server.mstime;
        atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
    } else if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
               server.mstime - server.aof_last_fsync >= 1000) {
        if (!sync_in_progress) {
            aof_background_fsync(server.aof_fd);
            server.aof_last_incr_fsync_offset = server.aof_last_incr_size;
        }
        server.aof_last_fsync = server.mstime;
    }
}

flushAppendOnlyFile() handles all three fsync policies in a single code path. The difference between them is where the blocking syscall lands.

  • AOF_FSYNC_NO: Redis calls write() but never fsync(). The OS decides when to flush its page cache. An unclean shutdown loses everything not yet flushed.
  • AOF_FSYNC_ALWAYS: calls redis_fsync() (aliased to fdatasync() on Linux, skipping inode metadata) synchronously in the main thread before flushAppendOnlyFile() returns. Every acknowledged write is on disk, at the cost of one blocking syscall per event loop iteration.
  • AOF_FSYNC_EVERYSEC: write() runs in the main thread, but fsync() is handed to a BIO (Background I/O) worker via aof_background_fsync(). If a background fsync is still in progress when the next flush arrives, the write is delayed up to two seconds before being forced through, with a slow-disk warning logged.

The aof_delayed_fsync counter in INFO persistence increments each time that two-second grace period expires. A non-zero value means disk latency is affecting AOF throughput before it shows up in command latency.

Key takeaway

everysec is the only mode that keeps fsync latency off the main thread without abandoning durability. The other two modes make an explicit trade: no gives up durability entirely, always puts the blocking fdatasync() directly on the hot path. ---

#define AOF_FSYNC_NO 0
#define AOF_FSYNC_ALWAYS 1
#define AOF_FSYNC_EVERYSEC 2

void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;
    mstime_t latency;

    if (sdslen(server.aof_buf) == 0) {
        if (server.aof_last_incr_fsync_offset == server.aof_last_incr_size) {
            /* All data is fsync'd already: Update fsynced_reploff_pending just in case.
             * This is needed to avoid a WAITAOF hang in case a module used RM_Call
             * with the NO_AOF flag, in which case master_repl_offset will increase but
             * fsynced_reploff_pending won't be updated (because there's no reason, from
             * the AOF POV, to call fsync) and then WAITAOF may wait on the higher offset
             * (which contains data that was only propagated to replicas, and not to AOF) */
            if (!aofFsyncInProgress())
                atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
        } else {
            /* Check if we need to do fsync even the aof buffer is empty,
             * because previously in AOF_FSYNC_EVERYSEC mode, fsync is
             * called only when aof buffer is not empty, so if users
             * stop write commands before fsync called in one second,
             * the data in page cache cannot be flushed in time. */
            if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.mstime - server.aof_last_fsync >= 1000 &&
                !(sync_in_progress = aofFsyncInProgress()))
                goto try_fsync;

            /* Check if we need to do fsync even the aof buffer is empty,
             * the reason is described in the previous AOF_FSYNC_EVERYSEC block,
             * and AOF_FSYNC_ALWAYS is also checked here to handle a case where
             * aof_fsync is changed from everysec to always. */
            if (server.aof_fsync == AOF_FSYNC_ALWAYS)
                goto try_fsync;
        }
        return;
    }

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = aofFsyncInProgress();

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                /* No previous write postponing, remember that we are
                 * postponing the flush and return. */
                server.aof_flush_postponed_start = server.mstime;
                return;
            } else if (server.mstime - server.aof_flush_postponed_start < 2000) {
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            /* Otherwise fall through, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            serverLog(LL_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
        }
    }
    /* We want to perform a single write. This should be guaranteed atomic
     * at least if the filesystem we are writing is a real physical one.
     * While this will save us against the server being killed I don't think
     * there is much to do about the whole server stopping for power problems
     * or alike */

    if (server.aof_flush_sleep && sdslen(server.aof_buf)) {
        usleep(server.aof_flush_sleep);
    }

    latencyStartMonitor(latency);
    nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    latencyEndMonitor(latency);
    /* We want to capture different events for delayed writes:
     * when the delay happens with a pending fsync, or with a saving child
     * active, and when the above two conditions are missing.
     * We also use an additional event name to save all samples which is
     * useful for graphing / monitoring purposes. */
    if (sync_in_progress) {
        latencyAddSampleIfNeeded("aof-write-pending-fsync",latency);
    } else if (hasActiveChildProcess()) {
        latencyAddSampleIfNeeded("aof-write-active-child",latency);
    } else {
        latencyAddSampleIfNeeded("aof-write-alone",latency);
    }
    latencyAddSampleIfNeeded("aof-write",latency);

    /* We performed the write so reset the postponed flush sentinel to zero. */
    server.aof_flush_postponed_start = 0;

    if (nwritten != (ssize_t)sdslen(server.aof_buf)) {
        static time_t last_write_error_log = 0;
        int can_log = 0;

        /* Limit logging rate to 1 line per AOF_WRITE_LOG_ERROR_RATE seconds. */
        if ((server.unixtime - last_write_error_log) > AOF_WRITE_LOG_ERROR_RATE) {
            can_log = 1;
            last_write_error_log = server.unixtime;
        }

        /* Log the AOF write error and record the error code. */
        if (nwritten == -1) {
            if (can_log) {
                serverLog(LL_WARNING,"Error writing to the AOF file: %s",
                    strerror(errno));
            }
            server.aof_last_write_errno = errno;
        } else {
            if (can_log) {
                serverLog(LL_WARNING,"Short write while writing to "
                                       "the AOF file: (nwritten=%lld, "
                                       "expected=%lld)",
                                       (long long)nwritten,
                                       (long long)sdslen(server.aof_buf));
            }

            if (ftruncate(server.aof_fd, server.aof_last_incr_size) == -1) {
                if (can_log) {
                    serverLog(LL_WARNING, "Could not remove short write "
                             "from the append-only file.  Redis may refuse "
                             "to load the AOF the next time it starts.  "
                             "ftruncate: %s", strerror(errno));
                }
            } else {
                /* If the ftruncate() succeeded we can set nwritten to
                 * -1 since there is no longer partial data into the AOF. */
                nwritten = -1;
            }
            server.aof_last_write_errno = ENOSPC;
        }

        /* Handle the AOF write error. */
        if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
            /* We can't recover when the fsync policy is ALWAYS since the reply
             * for the client is already in the output buffers (both writes and
             * reads), and the changes to the db can't be rolled back. Since we
             * have a contract with the user that on acknowledged or observed
             * writes are is synced on disk, we must exit. */
            serverLog(LL_WARNING,"Can't recover from AOF write error when the AOF fsync policy is 'always'. Exiting...");
            exit(1);
        } else {
            /* Recover from failed write leaving data into the buffer. However
             * set an error to stop accepting writes as long as the error
             * condition is not cleared. */
            server.aof_last_write_status = C_ERR;

            /* Trim the sds buffer if there was a partial write, and there
             * was no way to undo it with ftruncate(2). */
            if (nwritten > 0) {
                server.aof_current_size += nwritten;
                server.aof_last_incr_size += nwritten;
                sdsrange(server.aof_buf,nwritten,-1);
            }
            return; /* We'll try again on the next call... */
        }
    } else {
        /* Successful write(2). If AOF was in error state, restore the
         * OK state and log the event. */
        if (server.aof_last_write_status == C_ERR) {
            serverLog(LL_NOTICE,
                "AOF write error looks solved, Redis can write again.");
            server.aof_last_write_status = C_OK;
        }
    }
    server.aof_current_size += nwritten;
    server.aof_last_incr_size += nwritten;

    /* Re-use AOF buffer when it is small enough. The maximum comes from the
     * arena size of 4k minus some overhead (but is otherwise arbitrary). */
    if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
        sdsclear(server.aof_buf);
    } else {
        sdsfree(server.aof_buf);
        server.aof_buf = sdsempty();
    }

try_fsync:
    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite && hasActiveChildProcess())
        return;

    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* redis_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        latencyStartMonitor(latency);
        /* Let's try to get this data on the disk. To guarantee data safe when
         * the AOF fsync policy is 'always', we should exit if failed to fsync
         * AOF (see comment next to the exit(1) after write error above). */
        if (redis_fsync(server.aof_fd) == -1) {
            serverLog(LL_WARNING,"Can't persist AOF for fsync error when the "
              "AOF fsync policy is 'always': %s. Exiting...", strerror(errno));
            exit(1);
        }
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-fsync-always",latency);
        server.aof_last_incr_fsync_offset = server.aof_last_incr_size;
        server.aof_last_fsync = server.mstime;
        atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
    } else if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
               server.mstime - server.aof_last_fsync >= 1000) {
        if (!sync_in_progress) {
            aof_background_fsync(server.aof_fd);
            server.aof_last_incr_fsync_offset = server.aof_last_incr_size;
        }
        server.aof_last_fsync = server.mstime;
    }
}
7 / 8

Durability Tradeoffs

src/aof.c:1147

Failure modes for each fsync policy; why everysec is the common production default

Each policy's worst-case failure window on an unclean shutdown (power loss, kernel panic, OOM kill):

  • AOF_FSYNC_NO: loses everything in the OS page cache at crash time, potentially many seconds of acknowledged writes.
  • AOF_FSYNC_ALWAYS: loses at most the commands buffered in the current event loop iteration, sub-millisecond in practice. The cost is one fdatasync() per iteration in the main thread: roughly 100-200 µs on NVMe, over 1 ms on a networked EBS volume. Single-client throughput is capped at roughly the inverse of your fsync latency.
  • AOF_FSYNC_EVERYSEC: loses at most one second in a clean failure. If the background fsync was already delayed when the crash hit and the two-second grace period was active, the window stretches to approximately two seconds. Redis documentation calls this "probably the best tradeoff", and the two-second postponement window in the code is exactly that mechanism.

The atomicSet(server.fsynced_reploff_pending, server.master_repl_offset) line after each always fsync tracks the replication offset guaranteed durable on disk. WAITAOF uses this value to let clients block until a specific offset is synced.

Key takeaway

everysec trades at most one second of data loss for keeping fsync off the main thread. always is only worth the throughput cost if your application requires zero data loss and your disk delivers sub-millisecond fdatasync(). ---

void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;
    mstime_t latency;

    if (sdslen(server.aof_buf) == 0) {
        if (server.aof_last_incr_fsync_offset == server.aof_last_incr_size) {
            /* All data is fsync'd already: Update fsynced_reploff_pending just in case.
             * This is needed to avoid a WAITAOF hang in case a module used RM_Call
             * with the NO_AOF flag, in which case master_repl_offset will increase but
             * fsynced_reploff_pending won't be updated (because there's no reason, from
             * the AOF POV, to call fsync) and then WAITAOF may wait on the higher offset
             * (which contains data that was only propagated to replicas, and not to AOF) */
            if (!aofFsyncInProgress())
                atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
        } else {
            /* Check if we need to do fsync even the aof buffer is empty,
             * because previously in AOF_FSYNC_EVERYSEC mode, fsync is
             * called only when aof buffer is not empty, so if users
             * stop write commands before fsync called in one second,
             * the data in page cache cannot be flushed in time. */
            if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.mstime - server.aof_last_fsync >= 1000 &&
                !(sync_in_progress = aofFsyncInProgress()))
                goto try_fsync;

            /* Check if we need to do fsync even the aof buffer is empty,
             * the reason is described in the previous AOF_FSYNC_EVERYSEC block,
             * and AOF_FSYNC_ALWAYS is also checked here to handle a case where
             * aof_fsync is changed from everysec to always. */
            if (server.aof_fsync == AOF_FSYNC_ALWAYS)
                goto try_fsync;
        }
        return;
    }

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = aofFsyncInProgress();

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                /* No previous write postponing, remember that we are
                 * postponing the flush and return. */
                server.aof_flush_postponed_start = server.mstime;
                return;
            } else if (server.mstime - server.aof_flush_postponed_start < 2000) {
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            /* Otherwise fall through, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            serverLog(LL_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
        }
    }
    /* We want to perform a single write. This should be guaranteed atomic
     * at least if the filesystem we are writing is a real physical one.
     * While this will save us against the server being killed I don't think
     * there is much to do about the whole server stopping for power problems
     * or alike */

    if (server.aof_flush_sleep && sdslen(server.aof_buf)) {
        usleep(server.aof_flush_sleep);
    }

    latencyStartMonitor(latency);
    nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    latencyEndMonitor(latency);
    /* We want to capture different events for delayed writes:
     * when the delay happens with a pending fsync, or with a saving child
     * active, and when the above two conditions are missing.
     * We also use an additional event name to save all samples which is
     * useful for graphing / monitoring purposes. */
    if (sync_in_progress) {
        latencyAddSampleIfNeeded("aof-write-pending-fsync",latency);
    } else if (hasActiveChildProcess()) {
        latencyAddSampleIfNeeded("aof-write-active-child",latency);
    } else {
        latencyAddSampleIfNeeded("aof-write-alone",latency);
    }
    latencyAddSampleIfNeeded("aof-write",latency);

    /* We performed the write so reset the postponed flush sentinel to zero. */
    server.aof_flush_postponed_start = 0;

    if (nwritten != (ssize_t)sdslen(server.aof_buf)) {
        static time_t last_write_error_log = 0;
        int can_log = 0;

        /* Limit logging rate to 1 line per AOF_WRITE_LOG_ERROR_RATE seconds. */
        if ((server.unixtime - last_write_error_log) > AOF_WRITE_LOG_ERROR_RATE) {
            can_log = 1;
            last_write_error_log = server.unixtime;
        }

        /* Log the AOF write error and record the error code. */
        if (nwritten == -1) {
            if (can_log) {
                serverLog(LL_WARNING,"Error writing to the AOF file: %s",
                    strerror(errno));
            }
            server.aof_last_write_errno = errno;
        } else {
            if (can_log) {
                serverLog(LL_WARNING,"Short write while writing to "
                                       "the AOF file: (nwritten=%lld, "
                                       "expected=%lld)",
                                       (long long)nwritten,
                                       (long long)sdslen(server.aof_buf));
            }

            if (ftruncate(server.aof_fd, server.aof_last_incr_size) == -1) {
                if (can_log) {
                    serverLog(LL_WARNING, "Could not remove short write "
                             "from the append-only file.  Redis may refuse "
                             "to load the AOF the next time it starts.  "
                             "ftruncate: %s", strerror(errno));
                }
            } else {
                /* If the ftruncate() succeeded we can set nwritten to
                 * -1 since there is no longer partial data into the AOF. */
                nwritten = -1;
            }
            server.aof_last_write_errno = ENOSPC;
        }

        /* Handle the AOF write error. */
        if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
            /* We can't recover when the fsync policy is ALWAYS since the reply
             * for the client is already in the output buffers (both writes and
             * reads), and the changes to the db can't be rolled back. Since we
             * have a contract with the user that on acknowledged or observed
             * writes are is synced on disk, we must exit. */
            serverLog(LL_WARNING,"Can't recover from AOF write error when the AOF fsync policy is 'always'. Exiting...");
            exit(1);
        } else {
            /* Recover from failed write leaving data into the buffer. However
             * set an error to stop accepting writes as long as the error
             * condition is not cleared. */
            server.aof_last_write_status = C_ERR;

            /* Trim the sds buffer if there was a partial write, and there
             * was no way to undo it with ftruncate(2). */
            if (nwritten > 0) {
                server.aof_current_size += nwritten;
                server.aof_last_incr_size += nwritten;
                sdsrange(server.aof_buf,nwritten,-1);
            }
            return; /* We'll try again on the next call... */
        }
    } else {
        /* Successful write(2). If AOF was in error state, restore the
         * OK state and log the event. */
        if (server.aof_last_write_status == C_ERR) {
            serverLog(LL_NOTICE,
                "AOF write error looks solved, Redis can write again.");
            server.aof_last_write_status = C_OK;
        }
    }
    server.aof_current_size += nwritten;
    server.aof_last_incr_size += nwritten;

    /* Re-use AOF buffer when it is small enough. The maximum comes from the
     * arena size of 4k minus some overhead (but is otherwise arbitrary). */
    if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
        sdsclear(server.aof_buf);
    } else {
        sdsfree(server.aof_buf);
        server.aof_buf = sdsempty();
    }

try_fsync:
    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite && hasActiveChildProcess())
        return;

    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* redis_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        latencyStartMonitor(latency);
        /* Let's try to get this data on the disk. To guarantee data safe when
         * the AOF fsync policy is 'always', we should exit if failed to fsync
         * AOF (see comment next to the exit(1) after write error above). */
        if (redis_fsync(server.aof_fd) == -1) {
            serverLog(LL_WARNING,"Can't persist AOF for fsync error when the "
              "AOF fsync policy is 'always': %s. Exiting...", strerror(errno));
            exit(1);
        }
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-fsync-always",latency);
        server.aof_last_incr_fsync_offset = server.aof_last_incr_size;
        server.aof_last_fsync = server.mstime;
        atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
    } else if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
               server.mstime - server.aof_last_fsync >= 1000) {
        if (!sync_in_progress) {
            aof_background_fsync(server.aof_fd);
            server.aof_last_incr_fsync_offset = server.aof_last_incr_size;
        }
        server.aof_last_fsync = server.mstime;
    }
}

sds catAppendOnlyGenericCommand(sds dst, int argc, robj **argv) {
    char buf[32];
    int len, j;
    robj *o;

#define AOF_FSYNC_NO 0
#define AOF_FSYNC_ALWAYS 1
#define AOF_FSYNC_EVERYSEC 2
8 / 8

Replication Interaction

src/replication.c:618

RDB used for full sync transfer; the replication buffer must stay consistent with the AOF feed during BGSAVE

int replicationSetupSlaveForFullResync(client *slave, long long offset) {
    slave->psync_initial_offset = offset;
    slave->replstate = SLAVE_STATE_WAIT_BGSAVE_END;
    /* We are going to accumulate the incremental changes for this
     * slave as well. Set slaveseldb to -1 in order to force to re-emit
     * a SELECT statement in the replication stream. */
    server.slaveseldb = -1;
    /* ... */
}
void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    /* ... */
    /* Don't feed replicas that are still waiting for BGSAVE to start. */
    if (replica->replstate == SLAVE_STATE_WAIT_BGSAVE_START ||
        replica->replstate == SLAVE_STATE_WAIT_BGSAVE_END) {
        /* ... buffer commands for later ... */
    }
    /* ... */
    feedReplicationBuffer(buf,buflen);
}

When a new replica connects and requests a full sync, Redis uses no separate snapshot mechanism. startBgsaveForReplication() calls the same rdbSaveBackground() (or rdbSaveToSlavesSockets() for diskless mode) used by BGSAVE. The RDB file is the initial state shipped to the replica.

The gap problem is the hard part: the parent keeps serving writes after the fork, and those writes must reach the replica for it to converge. replicationSetupSlaveForFullResync() records the replication offset at fork time as psync_initial_offset and sets the replica to SLAVE_STATE_WAIT_BGSAVE_END. While in that state, replicationFeedSlaves() accumulates write commands in the shared replication ring buffer (server.repl_buffer_blocks). When RDB transfer completes, Redis streams everything since psync_initial_offset, then switches to live streaming.

The AOF buffer stays consistent with the replication buffer because both feedAppendOnlyFile() and replicationFeedSlaves() are called from the same propagation point inside call(). The server.slaveseldb = -1 reset in replicationSetupSlaveForFullResync() matches the server.aof_selected_db = -1 reset in rewriteAppendOnlyFileBackground(): both force a fresh SELECT at stream start to establish unambiguous database context.

Key takeaway

Full replication sync goes through the same rdbSaveBackground() call as BGSAVE. The replication ring buffer accumulates writes during RDB transfer, and the SELECT reset at stream start uses the same pattern as AOF rewrite for the same reason. ---

static void feedReplicationBuffer(const char *buf, size_t len) {
    replBufWriter wr;
    replBufWriterBegin(&wr);
    replBufWriterAppend(&wr, buf, len);
    replBufWriterEnd(&wr);
}

/* Propagate write commands to replication stream.
 *
 * This function is used if the instance is a master: we use the commands
 * received by our clients in order to create the replication stream.
 * Instead if the instance is a replica and has sub-replicas attached, we use
 * replicationFeedStreamFromMasterStream() */
void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    int j, len;
    char llstr[LONG_STR_SIZE];

    /* In case we propagate a command that doesn't touch keys (PING, REPLCONF) we
     * pass dbid=-1 that indicate there is no need to replicate `select` command. */
    serverAssert(dictid == -1 || (dictid >= 0 && dictid < server.dbnum));

    /* If the instance is not a top level master, return ASAP: we'll just proxy
     * the stream of data we receive from our master instead, in order to
     * propagate *identical* replication stream. In this way this slave can
     * advertise the same replication ID as the master (since it shares the
     * master replication history and has the same backlog and offsets). */
    if (server.masterhost != NULL) return;

    /* If current client is marked as master, we will proxy the command stream
     * to our slaves instead of replicating them, that also happens when being
     * in atomic slot migration. */
    if (server.current_client && server.current_client->flags & CLIENT_MASTER) return;

    /* If there aren't slaves, and there is no backlog buffer to populate,
     * we can return ASAP. */
    if (server.repl_backlog == NULL && listLength(slaves) == 0) {
        /* We increment the repl_offset anyway, since we use that for tracking AOF fsyncs
         * even when there's no replication active. This code will not be reached if AOF
         * is also disabled. */
        server.master_repl_offset += 1;
        return;
    }

    /* We can't have slaves attached and no backlog. */
    serverAssert(!(listLength(slaves) != 0 && server.repl_backlog == NULL));

    /* Update the time of sending replication stream to replicas. */
    server.repl_stream_lastio = server.unixtime;

    /* Must install write handler for all replicas first before feeding
     * replication stream. */
    prepareReplicasToWrite();

    /* Send SELECT command to every slave if needed. */
    if (dictid != -1 && server.slaveseldb != dictid) {
        robj *selectcmd;

        /* For a few DBs we have pre-computed SELECT command. */
        if (dictid >= 0 && dictid < PROTO_SHARED_SELECT_CMDS) {
            selectcmd = shared.select[dictid];
        } else {
            int dictid_len;

            dictid_len = ll2string(llstr,sizeof(llstr),dictid);
            selectcmd = createObject(OBJ_STRING,
                sdscatprintf(sdsempty(),
                "*2\r\n$6\r\nSELECT\r\n$%d\r\n%s\r\n",
                dictid_len, llstr));
        }

        feedReplicationBuffer(selectcmd->ptr, sdslen(selectcmd->ptr));

        /* Although the SELECT command is not associated with any slot,
         * its per-slot network-bytes-out accumulation is made by the above function call.
         * To cancel-out this accumulation, below adjustment is made. */
        clusterSlotStatsDecrNetworkBytesOutForReplication(sdslen(selectcmd->ptr));

        if (dictid < 0 || dictid >= PROTO_SHARED_SELECT_CMDS)
            decrRefCount(selectcmd);

        server.slaveseldb = dictid;
    }

    /* Write the command to the replication buffer if any. */
    char aux[LONG_STR_SIZE+3];
    replBufWriter wr;
    replBufWriterBegin(&wr);

    /* Write the multi bulk count */
    replBufWriterAppendBulkLen(&wr, '*', argc);

    for (j = 0; j < argc; j++) {
        /* Write the bulk count */
        long objlen = stringObjectLen(argv[j]);
        replBufWriterAppendBulkLen(&wr, '$', objlen);

        /* Write the bulk data */
        if (argv[j]->encoding == OBJ_ENCODING_INT) {
            len = ll2string(aux, sizeof(aux), (long)argv[j]->ptr);
            replBufWriterAppend(&wr, aux, len);
        } else {
            replBufWriterAppend(&wr, argv[j]->ptr, objlen);
        }
        replBufWriterAppend(&wr, "\r\n", 2);
    }

    replBufWriterEnd(&wr);
}

/* This is a debugging function that gets called when we detect something
 * wrong with the replication protocol: the goal is to peek into the
 * replication backlog and show a few final bytes to make simpler to
 * guess what kind of bug it could be. */
void showLatestBacklog(void) {
    if (server.repl_backlog == NULL) return;
    if (listLength(server.repl_buffer_blocks) == 0) return;
    if (server.hide_user_data_from_log) {
        serverLog(LL_NOTICE,"hide-user-data-from-log is on, skip logging backlog content to avoid spilling PII.");
        return;
    }

    size_t dumplen = 256;
    if (server.repl_backlog->histlen < (long long)dumplen)
        dumplen = server.repl_backlog->histlen;

    sds dump = sdsempty();
    listNode *node = listLast(server.repl_buffer_blocks);
    while(dumplen) {
        if (node == NULL) break;
        replBufBlock *o = listNodeValue(node);
        size_t thislen = o->used >= dumplen ? dumplen : o->used;
        sds head = sdscatrepr(sdsempty(), o->buf+o->used-thislen, thislen);
        sds tmp = sdscatsds(head, dump);
        sdsfree(dump);
        dump = tmp;
        dumplen -= thislen;
        node = listPrevNode(node);
    }

    /* Finally log such bytes: this is vital debugging info to
     * understand what happened. */
    serverLog(LL_NOTICE,"Latest backlog is: '%s'", dump);
    sdsfree(dump);
}

/* This function is used in order to proxy what we receive from our master
 * to our sub-slaves. Besides, we also proxy the replication stream from
 * the source node when being in atomic slot migration. */
void replicationFeedStreamFromMasterStream(char *buf, size_t buflen) {
    /* There must be replication backlog if having attached slaves. */
    if (listLength(server.slaves)) serverAssert(server.repl_backlog != NULL);
    if (server.repl_backlog) {
        /* Must install write handler for all replicas first before feeding
         * replication stream. */
        prepareReplicasToWrite();
        feedReplicationBuffer(buf,buflen);

int replicationSetupSlaveForFullResync(client *slave, long long offset) {
    char buf[128];
    int buflen;

    slave->psync_initial_offset = offset;
    slave->replstate = SLAVE_STATE_WAIT_BGSAVE_END;
    /* We are going to accumulate the incremental changes for this
     * slave as well. Set slaveseldb to -1 in order to force to re-emit
     * a SELECT statement in the replication stream. */
    server.slaveseldb = -1;

    /* Slots snapshot. */
    if (slave->flags & CLIENT_REPL_RDB_CHANNEL &&
        slave->slave_req & SLAVE_REQ_SLOTS_SNAPSHOT)
    {
        /* Start to deliver the commands stream on migrating slots. */
        asmSlotSnapshotAndStreamStart(slave->task);

int startBgsaveForReplication(int mincapa, int req) {
    int retval;
    int socket_target = 0;
    listIter li;
    listNode *ln;

    /* We use a socket target if slave can handle the EOF marker and we're configured to do diskless syncs.
     * Note that in case we're creating a "filtered" RDB (functions-only, for example) we also force socket replication
     * to avoid overwriting the snapshot RDB file with filtered data. */
    socket_target = (server.repl_diskless_sync || req & SLAVE_REQ_RDB_MASK) && (mincapa & SLAVE_CAPA_EOF);
    /* `SYNC` should have failed with error if we don't support socket and require a filter, assert this here */
    serverAssert(socket_target || !(req & SLAVE_REQ_RDB_MASK));

    int slots_req = req & SLAVE_REQ_SLOTS_SNAPSHOT;
    serverLog(LL_NOTICE,"Starting BGSAVE for SYNC with target: %s%s",
        socket_target ? (slots_req ? "slot migration destination socket" : "replicas sockets") : "disk",
        (req & SLAVE_REQ_RDB_CHANNEL) ? " (rdb-channel)" : "");

    rdbSaveInfo rsi, *rsiptr;
    rsiptr = rdbPopulateSaveInfo(&rsi);
    /* Only do rdbSave* when rsiptr is not NULL,
     * otherwise slave will miss repl-stream-db. */
    if (rsiptr) {
        if (socket_target)
            retval = rdbSaveToSlavesSockets(req,rsiptr);
        else {
            /* Keep the page cache since it'll get used soon */
            retval = rdbSaveBackground(req, server.rdb_filename, rsiptr, RDBFLAGS_REPLICATION | RDBFLAGS_KEEP_CACHE);
        }
        if (server.repl_debug_pause & REPL_DEBUG_AFTER_FORK)
            debugPauseProcess();
    } else {
        serverLog(LL_WARNING,"BGSAVE for replication: replication information not available, can't generate the RDB file right now. Try later.");
        retval = C_ERR;
    }

    /* If we succeeded to start a BGSAVE with disk target, let's remember
     * this fact, so that we can later delete the file if needed. Note
     * that we don't set the flag to 1 if the feature is disabled, otherwise
     * it would never be cleared: the file is not deleted. This way if
     * the user enables it later with CONFIG SET, we are fine. */
    if (retval == C_OK && !socket_target && server.rdb_del_sync_files)
        RDBGeneratedByReplication = 1;

    /* If we failed to BGSAVE, remove the slaves waiting for a full
     * resynchronization from the list of slaves, inform them with
     * an error about what happened, close the connection ASAP. */
    if (retval == C_ERR) {
        serverLog(LL_WARNING,"BGSAVE for replication failed");
        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            client *slave = ln->value;

            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                slave->replstate = REPL_STATE_NONE;
                slave->flags &= ~CLIENT_SLAVE;
                listDelNode(server.slaves,ln);
                addReplyError(slave,
                    "BGSAVE failed, replication can't continue");
                slave->flags |= CLIENT_CLOSE_AFTER_REPLY;
            }
        }
        return retval;
    }

    /* If the target is socket, rdbSaveToSlavesSockets() already setup
     * the slaves for a full resync. Otherwise for disk target do it now.*/
    if (!socket_target) {
        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            client *slave = ln->value;

            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                /* Check slave has the exact requirements */
                if (slave->slave_req != req)
                    continue;
                replicationSetupSlaveForFullResync(slave, getPsyncInitialOffset());
            }
        }
    }

    return retval;
}
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free