Redis C redis/redis

Redis Cluster: Sharding, Gossip, and Failover

How Redis distributes data across nodes, detects failures through gossip, and promotes replicas without human intervention

7 stops ~30 min Verified 2026-04-30
What you will learn
  • How a node initializes its cluster state from scratch or from a persisted `nodes.conf` file, and what fields are zero at startup
  • How `keyHashSlot()` maps every key to one of 16,384 slots using CRC16 with hash tag support
  • How the gossip sub-protocol inside PING/PONG packets propagates topology state across the cluster without a central coordinator
  • How a node transitions from PFAIL (suspected failure) to FAIL (confirmed failure) through quorum agreement, and what triggers that transition
  • How clients are redirected with MOVED or ASK depending on whether a slot is fully owned or mid-migration
  • How the `CLUSTER SETSLOT` handshake stages a live slot migration using MIGRATING and IMPORTING flags on source and destination nodes
  • How `CLUSTER FAILOVER` coordinates a zero-data-loss promotion by pausing the primary until the replica has consumed every byte of the replication stream
Prerequisites
  • Comfortable reading C (you don't need to write it, just follow the logic)
  • Basic understanding of key-value stores and what a Redis shard means
  • Familiarity with CRC and hash functions is helpful but not required
1 / 7

Cluster Bootstrap

src/cluster_legacy.c:959

How a node initializes its cluster state

clusterInit() runs after initServer() in main() when cluster mode is on. The node starts in CLUSTER_FAIL because it has no slot ownership information yet — serving clients before the topology loads would be unsafe. All auth counters, vote-epoch fields, and the slot table (server.cluster->slots, a 16,384-element array of clusterNode * pointers) are zeroed before any peer contact happens.

The branch on clusterLoadConfig() is the identity decision:

  • If nodes.conf exists from a prior run, the node reads its identity, peers, slot assignments, and config epochs from disk.
  • If no config exists, createClusterNode(NULL, CLUSTER_NODE_MYSELF|CLUSTER_NODE_MASTER) mints a fresh 40-hex-character node ID and sets saveconf = 1 so the new identity is written before the function returns.

After clusterInit() completes, server.cluster holds a known identity, an initialized peer dictionary, and a zeroed slot table. From that point clusterCron() — firing every 100 ms via the event loop — takes over, dialing peers and exchanging PING/PONG packets.

Key takeaway

A node begins in CLUSTER_FAIL with no peers and no slots. clusterInit() either loads a prior identity from nodes.conf or mints a fresh one, then hands off to clusterCron() to discover the rest of the topology. ---

void clusterInit(void) {
    int saveconf = 0;

    server.cluster = zmalloc(sizeof(struct clusterState));
    server.cluster->myself = NULL;
    server.cluster->currentEpoch = 0;
    server.cluster->state = CLUSTER_FAIL;
    server.cluster->size = 0;
    server.cluster->todo_before_sleep = 0;
    server.cluster->nodes = dictCreate(&clusterNodesDictType);
    server.cluster->shards = dictCreate(&clusterSdsToListType);
    server.cluster->nodes_black_list =
        dictCreate(&clusterNodesBlackListDictType);
    server.cluster->failover_auth_time = 0;
    server.cluster->failover_auth_count = 0;
    server.cluster->failover_auth_rank = 0;
    server.cluster->failover_auth_epoch = 0;
    server.cluster->cant_failover_reason = CLUSTER_CANT_FAILOVER_NONE;
    server.cluster->lastVoteEpoch = 0;

    /* Initialize stats */
    for (int i = 0; i < CLUSTERMSG_TYPE_COUNT; i++) {
        server.cluster->stats_bus_messages_sent[i] = 0;
        server.cluster->stats_bus_messages_received[i] = 0;
    }
    server.cluster->stats_pfail_nodes = 0;
    server.cluster->stat_cluster_links_buffer_limit_exceeded = 0;

    memset(server.cluster->slots,0, sizeof(server.cluster->slots));
    clusterCloseAllSlots();

    memset(server.cluster->owner_not_claiming_slot, 0, sizeof(server.cluster->owner_not_claiming_slot));

    /* Lock the cluster config file to make sure every node uses
     * its own nodes.conf. */
    server.cluster_config_file_lock_fd = -1;
    if (clusterLockConfig(server.cluster_configfile) == C_ERR)
        exit(1);

    /* Load or create a new nodes configuration. */
    if (clusterLoadConfig(server.cluster_configfile) == C_ERR) {
        /* No configuration found. We will just use the random name provided
         * by the createClusterNode() function. */
        myself = server.cluster->myself =
            createClusterNode(NULL,CLUSTER_NODE_MYSELF|CLUSTER_NODE_MASTER);
        serverLog(LL_NOTICE,"No cluster configuration found, I'm %.40s",
            myself->name);
        clusterAddNode(myself);
        clusterAddNodeToShard(myself->shard_id, myself);
        saveconf = 1;
    }
    if (saveconf) clusterSaveConfigOrDie(1);

    /* Port sanity check II
     * The other handshake port check is triggered too late to stop
     * us from trying to use a too-high cluster port number. */
    int port = defaultClientPort();
    if (!server.cluster_port && port > (65535-CLUSTER_PORT_INCR)) {
        serverLog(LL_WARNING, "Redis port number too high. "
                   "Cluster communication port is 10,000 port "
2 / 7

Hash Slots and CRC16

src/cluster.h:22

The 16,384-slot mapping that determines which node owns a key

All sharding logic in Redis Cluster reduces to two constants and one inline function. CLUSTER_SLOT_MASK_BITS is 14, so CLUSTER_SLOTS is 16,384 — large enough to spread evenly across hundreds of nodes, small enough that the slots pointer array fits in roughly 128 KB. keyHashSlot() is marked static inline so it compiles away at every call site with no function-call overhead.

The function handles three degenerate cases before reaching the hash-tag path, all of which fall back to hashing the whole key:

  1. No opening brace found.
  2. Opening brace with no closing brace.
  3. Empty braces {} where e == s+1.

Only when both braces are present and non-empty does the function hash the content between them. This lets developers colocate related keys — {user:1}:profile and {user:1}:sessions both hash to the same slot because both extract user:1 — enabling multi-key commands without cross-slot errors.

The CLUSTER_REDIR_* constants in this file are the numeric codes getNodeByQuery() uses to tell the command dispatcher which error to return when a key lands on the wrong node.

Key takeaway

All sharding logic reduces to 14 bits of CRC16: slot = crc16(hashtag) & 0x3FFF, where the hashtag is the full key unless it contains {...}, in which case only the content between the braces is hashed. ---

#define CLUSTER_SLOT_MASK_BITS 14 /* Number of bits used for slot id. */
#define CLUSTER_SLOTS (1<<CLUSTER_SLOT_MASK_BITS) /* Total number of slots in cluster mode, which is 16384. */
#define CLUSTER_SLOT_MASK ((unsigned long long)(CLUSTER_SLOTS - 1)) /* Bit mask for slot id stored in LSB. */
#define INVALID_CLUSTER_SLOT (-1) /* Invalid slot number. */
#define CLUSTER_CROSSSLOT  (-2)
#define CLUSTER_OK 0            /* Everything looks ok */
#define CLUSTER_FAIL 1          /* The cluster can't work */
#define CLUSTER_NAMELEN 40      /* sha1 hex length */

/* Redirection errors returned by getNodeByQuery(). */
#define CLUSTER_REDIR_NONE 0          /* Node can serve the request. */
#define CLUSTER_REDIR_CROSS_SLOT 1    /* -CROSSSLOT request. */
#define CLUSTER_REDIR_UNSTABLE 2      /* -TRYAGAIN redirection required */
#define CLUSTER_REDIR_ASK 3           /* -ASK redirection required. */
#define CLUSTER_REDIR_MOVED 4         /* -MOVED redirection required. */
#define CLUSTER_REDIR_DOWN_STATE 5    /* -CLUSTERDOWN, global state. */
#define CLUSTER_REDIR_DOWN_UNBOUND 6  /* -CLUSTERDOWN, unbound slot. */
#define CLUSTER_REDIR_DOWN_RO_STATE 7 /* -CLUSTERDOWN, allow reads. */
#define CLUSTER_REDIR_TRIMMING 8      /* -TRYAGAIN, slot is being trimmed. */

typedef struct _clusterNode clusterNode;
struct clusterState;

/* Flags that a module can set in order to prevent certain Redis Cluster
 * features to be enabled. Useful when implementing a different distributed
 * system on top of Redis Cluster message bus, using modules. */
#define CLUSTER_MODULE_FLAG_NONE 0
#define CLUSTER_MODULE_FLAG_NO_FAILOVER (1<<1)
#define CLUSTER_MODULE_FLAG_NO_REDIRECTION (1<<2)

/* ---------------------- API exported outside cluster.c -------------------- */

/* We have 16384 hash slots. The hash slot of a given key is obtained
 * as the least significant 14 bits of the crc16 of the key.
 *
 * However, if the key contains the {...} pattern, only the part between
 * { and } is hashed. This may be useful in the future to force certain
 * keys to be in the same node (assuming no resharding is in progress). */
static inline unsigned int keyHashSlot(const char *key, int keylen) {
    int s, e; /* start-end indexes of { and } */

    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;

    /* No '{' ? Hash the whole key. This is the base case. */
    if (likely(s == keylen)) return crc16(key,keylen) & 0x3FFF;

    /* '{' found? Check if we have the corresponding '}'. */
    for (e = s+1; e < keylen; e++)
        if (key[e] == '}') break;

    /* No '}' or nothing between {} ? Hash the whole key. */
    if (e == keylen || e == s+1) return crc16(key,keylen) & 0x3FFF;

    /* If we are here there is both a { and a } on its right. Hash
     * what is in the middle between { and }. */
    return crc16(key+s+1,e-s-1) & 0x3FFF;
}
3 / 7

Gossip -- PING/PONG Cluster Bus

src/cluster_legacy.c:3680

How nodes propagate topology information without a central coordinator

clusterSendPing() handles PING, PONG, and MEET — all three share the same packet format and function, distinguished by type. The fixed header describes the sender's own state; the variable tail carries clusterMsgDataGossip entries, each describing a third-party node the sender knows about.

The 1/10 rule is probability math, not a guess. With N nodes, each gossip entry has a 1/N chance of featuring any given node. With N/10 entries per packet and at least 4 exchanges per node_timeout window, the expected number of failure reports reaching any node exceeds N/2 — the quorum threshold — before the timeout expires.

PFAIL nodes bypass the random sample entirely: pfail_wanted equals stats_pfail_nodes, so every suspected node rides in every packet unconditionally. The receiving node processes these entries in clusterProcessGossipSection(), updating its local reachability view for each mentioned peer.

Key takeaway

Gossip packets carry two layers: a random N/10 sample for background topology convergence, and every PFAIL node listed unconditionally to fast-track failure detection. No central authority — just probability math encoded in packet construction. ---

void clusterSendPing(clusterLink *link, int type) {
    static unsigned long long cluster_pings_sent = 0;
    cluster_pings_sent++;
    int gossipcount = 0; /* Number of gossip sections added so far. */
    int wanted; /* Number of gossip sections we want to append if possible. */
    int estlen; /* Upper bound on estimated packet length */
    /* freshnodes is the max number of nodes we can hope to append at all:
     * nodes available minus two (ourself and the node we are sending the
     * message to). However practically there may be less valid nodes since
     * nodes in handshake state, disconnected, are not considered. */
    int freshnodes = dictSize(server.cluster->nodes)-2;

    /* How many gossip sections we want to add? 1/10 of the number of nodes
     * and anyway at least 3. Why 1/10?
     *
     * If we have N masters, with N/10 entries, and we consider that in
     * node_timeout we exchange with each other node at least 4 packets
     * (we ping in the worst case in node_timeout/2 time, and we also
     * receive two pings from the host), we have a total of 8 packets
     * in the node_timeout*2 failure reports validity time. So we have
     * that, for a single PFAIL node, we can expect to receive the following
     * number of failure reports (in the specified window of time):
     *
     * PROB * GOSSIP_ENTRIES_PER_PACKET * TOTAL_PACKETS:
     *
     * PROB = probability of being featured in a single gossip entry,
     *        which is 1 / NUM_OF_NODES.
     * ENTRIES = 10.
     * TOTAL_PACKETS = 2 * 4 * NUM_OF_MASTERS.
     *
     * If we assume we have just masters (so num of nodes and num of masters
     * is the same), with 1/10 we always get over the majority, and specifically
     * 80% of the number of nodes, to account for many masters failing at the
     * same time.
     *
     * Since we have non-voting slaves that lower the probability of an entry
     * to feature our node, we set the number of entries per packet as
     * 10% of the total nodes we have. */
    wanted = floor(dictSize(server.cluster->nodes)/10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;

    /* Include all the nodes in PFAIL state, so that failure reports are
     * faster to propagate to go from PFAIL to FAIL state. */
    int pfail_wanted = server.cluster->stats_pfail_nodes;

    /* Compute the maximum estlen to allocate our buffer. We'll fix the estlen
     * later according to the number of gossip sections we really were able
     * to put inside the packet. */
    estlen = sizeof(clusterMsg) - sizeof(union clusterMsgData);
    estlen += (sizeof(clusterMsgDataGossip)*(wanted + pfail_wanted));
    if (link->node && nodeSupportsExtensions(link->node)) {
        estlen += writePingExt(NULL, 0);
    }
    /* Note: clusterBuildMessageHdr() expects the buffer to be always at least
     * sizeof(clusterMsg) or more. */
    if (estlen < (int)sizeof(clusterMsg)) estlen = sizeof(clusterMsg);
    clusterMsgSendBlock *msgblock = createClusterMsgSendBlock(type, estlen);
    clusterMsg *hdr = getMessageFromSendBlock(msgblock);
4 / 7

Failure Detection -- PFAIL to FAIL

src/cluster_legacy.c:1907

The two-stage detection that separates a suspected failure from a confirmed one

Failure detection is two-stage by design. Stage one is local: in clusterCron(), when a node stops receiving PONGs from a peer within cluster-node-timeout milliseconds, it sets CLUSTER_NODE_PFAIL on that peer. The flag costs nothing by itself — the node keeps pinging and starts piggybacking the suspect in every gossip packet.

Stage two requires a quorum. When gossip arrives mentioning a node this node already suspects, it records a timestamped failure report from the sender. markNodeAsFailingIfNeeded() counts those reports against needed_quorum = (server.cluster->size / 2) + 1. If the local node is a master it counts itself. Once the threshold is met, CLUSTER_NODE_PFAIL is cleared and CLUSTER_NODE_FAIL is set atomically, then clusterSendFail() broadcasts the confirmed failure directly to all connected nodes — bypassing gossip cadence to spread the news immediately.

Recovery runs through clearNodeFailureIfNeeded(). For replicas, FAIL clears immediately on reconnect. For masters, it clears only after the FAIL has been in place long enough that an in-progress replica promotion is unlikely to still be running.

Key takeaway

PFAIL is a local suspicion that costs nothing but a flag. FAIL requires a quorum of master votes collected through gossip and triggers an immediate cluster-wide broadcast via clusterSendFail() — the two stages separate network hiccups from genuine partitions. ---

void markNodeAsFailingIfNeeded(clusterNode *node) {
    int failures;
    int needed_quorum = (server.cluster->size / 2) + 1;

    if (!nodeTimedOut(node)) return; /* We can reach it. */
    if (nodeFailed(node)) return; /* Already FAILing. */

    failures = clusterNodeFailureReportsCount(node);
    /* Also count myself as a voter if I'm a master. */
    if (clusterNodeIsMaster(myself)) failures++;
    if (failures < needed_quorum) return; /* No weak agreement from masters. */

    serverLog(LL_NOTICE,
        "Marking node %.40s (%s) as failing (quorum reached).", node->name, node->human_nodename);

    /* Mark the node as failing. */
    node->flags &= ~CLUSTER_NODE_PFAIL;
    node->flags |= CLUSTER_NODE_FAIL;
    node->fail_time = mstime();

    /* Broadcast the failing node name to everybody, forcing all the other
     * reachable nodes to flag the node as FAIL.
     * We do that even if this node is a replica and not a master: anyway
     * the failing state is triggered collecting failure reports from masters,
     * so here the replica is only helping propagating this status. */
    clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
}

/* This function is called only if a node is marked as FAIL, but we are able
 * to reach it again. It checks if there are the conditions to undo the FAIL
 * state. */
void clearNodeFailureIfNeeded(clusterNode *node) {
    mstime_t now = mstime();

    serverAssert(nodeFailed(node));

    /* For slaves we always clear the FAIL flag if we can contact the
     * node again. */
    if (nodeIsSlave(node) || node->numslots == 0) {
        serverLog(LL_NOTICE,
            "Clear FAIL state for node %.40s (%s):%s is reachable again.",
                node->name,node->human_nodename,
                nodeIsSlave(node) ? "replica" : "master without slots");
        node->flags &= ~CLUSTER_NODE_FAIL;
        clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
    }

    /* If it is a master and...
     * 1) The FAIL state is old enough.
5 / 7

MOVED and ASK Redirection

src/cluster.c:1390

How a client is steered to the correct node for a key it sent to the wrong one

getNodeByQuery() is called for every command in cluster mode. The base case is the last two lines: if the slot's owning node n is not this node, set CLUSTER_REDIR_MOVED and return n. clusterRedirectClient() formats this as -MOVED <slot> <ip>:<port>, which well-behaved clients use to update their slot map and retry against the correct node permanently.

The migration case is more nuanced. migrating_slot is true when this node has staged a CLUSTER SETSLOT <slot> MIGRATING <dest>. If some keys are already on the destination but others remain here, the response is -TRYAGAIN. If none of the requested keys are present on this node at all, it returns -ASK <slot> <ip>:<port>. The client must then send ASKING to the destination before retrying — which sets CLIENT_ASKING on that connection for exactly one command. The destination accepts the request even though it does not yet officially own the slot, because importing_slot && c->flags & CLIENT_ASKING is true. This lets individual keys be served from the destination one at a time as they migrate, without the destination advertising full ownership prematurely.

MOVED tells the client to update its routing table permanently. ASK says try there once, without touching the table.

Key takeaway

MOVED and ASK are different answers to the same question. MOVED means "update your slot map and always go there." ASK means "try there once for this key — migration is still in progress, keep your slot map unchanged." ---


    /* If we don't have all the keys and we are migrating the slot, send
     * an ASK redirection or TRYAGAIN. */
    if (migrating_slot && missing_keys) {
        /* If we have keys but we don't have all keys, we return TRYAGAIN */
        if (existing_keys) {
            if (error_code) *error_code = CLUSTER_REDIR_UNSTABLE;
            return NULL;
        } else {
            if (error_code) *error_code = CLUSTER_REDIR_ASK;
            return getMigratingSlotDest(slot);
        }
    }

    /* If we are receiving the slot, and the client correctly flagged the
     * request as "ASKING", we can serve the request. However if the request
     * involves multiple keys and we don't have them all, the only option is
     * to send a TRYAGAIN error. */
    if (importing_slot &&
        (c->flags & CLIENT_ASKING || cmd_flags & CMD_ASKING))
    {
        if (multiple_keys && missing_keys) {
            if (error_code) *error_code = CLUSTER_REDIR_UNSTABLE;
            return NULL;
        } else {
            return myself;
        }
    }

    /* Handle the read-only client case reading from a slave: if this
     * node is a slave and the request is about a hash slot our master
     * is serving, we can reply without redirection. */
    int is_write_command = (cmd_flags & CMD_WRITE) ||
                           (c->cmd->proc == execCommand && (c->mstate.cmd_flags & CMD_WRITE));
    if (((c->flags & CLIENT_READONLY) || pubsubshard_included) &&
        !is_write_command &&
        clusterNodeIsSlave(myself) &&
        clusterNodeGetSlaveof(myself) == n)
    {
        return myself;
    }

    /* If this node is responsible for the slot and is currently trimming it,
     * SFLUSH may have triggered active trimming and it could still be in progress.
     * Here we reject any write commands as no writes should be accepted for
     * trimming slots while active trimming is in progress. */
    if (n == myself && is_write_command && isSlotInTrimJob(slot)) {
        if (error_code) *error_code = CLUSTER_REDIR_TRIMMING;
        return NULL;
    }

    /* Base case: just return the right node. However, if this node is not
     * myself, set error_code to MOVED since we need to issue a redirection. */
    if (n != myself && error_code) *error_code = CLUSTER_REDIR_MOVED;
    return n;
}
6 / 7

Slot Migration

src/cluster_legacy.c:6101

How a live slot migration is staged using MIGRATING and IMPORTING flags

This is the CLUSTER SETSLOT handler inside clusterCommand(). A live migration is a four-step protocol; this code implements the first two — the flag assignments.

The operator (or a resharding tool like redis-cli --cluster reshard) opens the migration by issuing two commands:

  1. CLUSTER SETSLOT <slot> MIGRATING <dest-node-id> to the source node.
  2. CLUSTER SETSLOT <slot> IMPORTING <src-node-id> to the destination node.

The handler enforces ownership symmetry: only the current owner may set MIGRATING on a slot; only a non-owner may set IMPORTING. The result is a single pointer write — migrating_slots_to[slot] on the source and importing_slots_from[slot] on the destination — that puts both nodes into a transitional mode. The resharding tool then iterates every key in the slot, calling MIGRATE to atomically transfer each one.

The remaining two sub-commands close the loop. CLUSTER SETSLOT <slot> NODE <dest> reassigns ownership in the cluster topology (and refuses if countKeysInSlot(slot) != 0 on the source). STABLE is the rollback: it nulls both pointers, canceling an incomplete migration.

Key takeaway

Slot migration is a two-pointer handshake: migrating_slots_to on the source and importing_slots_from on the destination put both nodes into a transitional mode where keys migrate one at a time and clients receive ASK redirects rather than stale MOVED errors. ---


        if (!strcasecmp(c->argv[3]->ptr,"migrating") && c->argc == 5) {
            if (server.cluster->slots[slot] != myself) {
                addReplyErrorFormat(c,"I'm not the owner of hash slot %u",slot);
                return 1;
            }
            n = clusterLookupNode(c->argv[4]->ptr, sdslen(c->argv[4]->ptr));
            if (n == NULL) {
                addReplyErrorFormat(c,"I don't know about node %s",
                    (char*)c->argv[4]->ptr);
                return 1;
            }
            if (nodeIsSlave(n)) {
                addReplyError(c,"Target node is not a master");
                return 1;
            }
            server.cluster->migrating_slots_to[slot] = n;
        } else if (!strcasecmp(c->argv[3]->ptr,"importing") && c->argc == 5) {
            if (server.cluster->slots[slot] == myself) {
                addReplyErrorFormat(c,
                    "I'm already the owner of hash slot %u",slot);
                return 1;
            }
            n = clusterLookupNode(c->argv[4]->ptr, sdslen(c->argv[4]->ptr));
            if (n == NULL) {
                addReplyErrorFormat(c,"I don't know about node %s",
                    (char*)c->argv[4]->ptr);
                return 1;
            }
            if (nodeIsSlave(n)) {
                addReplyError(c,"Target node is not a master");
                return 1;
            }
            server.cluster->importing_slots_from[slot] = n;
        } else if (!strcasecmp(c->argv[3]->ptr,"stable") && c->argc == 4) {
            /* CLUSTER SETSLOT <SLOT> STABLE */
            server.cluster->importing_slots_from[slot] = NULL;
            server.cluster->migrating_slots_to[slot] = NULL;
        } else if (!strcasecmp(c->argv[3]->ptr,"node") && c->argc == 5) {
            /* CLUSTER SETSLOT <SLOT> NODE <NODE ID> */
            n = clusterLookupNode(c->argv[4]->ptr, sdslen(c->argv[4]->ptr));
            if (!n) {
                addReplyErrorFormat(c,"Unknown node %s",
                    (char*)c->argv[4]->ptr);
                return 1;
            }
            if (nodeIsSlave(n)) {
                addReplyError(c,"Target node is not a master");
                return 1;
            }
            /* If this hash slot was served by 'myself' before to switch
             * make sure there are no longer local keys for this hash slot. */
            if (server.cluster->slots[slot] == myself && n != myself) {
                if (countKeysInSlot(slot) != 0) {
                    addReplyErrorFormat(c,
                        "Can't assign hashslot %d to a different node "
                        "while I still hold keys for this hash slot.", slot);
                    return 1;
                }
            }
7 / 7

Manual Failover

src/cluster_legacy.c:6307

How CLUSTER FAILOVER promotes a replica with primary cooperation and no data loss

CLUSTER FAILOVER is issued to a replica, not a primary — the first precondition check rejects the command immediately if clusterNodeIsMaster(myself) is true. The three modes trade safety for speed:

  • Default (no option): calls clusterSendMFStart(myself->slaveof), asking the primary to pause client writes and start flagging its PING packets with CLUSTERMSG_FLAG0_PAUSED. The replica watches for that flag, reads the primary's current replication offset into mf_master_offset, then spins in clusterHandleManualFailover() until replicationGetSlaveOffset() == mf_master_offset. Only then does it set mf_can_start = 1 and request votes. Because the primary was paused and the replica caught up fully, no writes are lost.
  • FORCE: sets mf_can_start = 1 immediately, skipping offset coordination. Appropriate when the primary is alive but unreachable from clients.
  • TAKEOVER: calls clusterBumpConfigEpochWithoutConsensus() and claims the primary's slots directly, bypassing the vote entirely. This is the emergency option and accepts the risk of a split-brain.
Key takeaway

The default CLUSTER FAILOVER path achieves zero data loss by pausing the primary and waiting for the replica to drain the replication backlog before promoting — the coordination between two live nodes eliminates the replication lag that makes asynchronous failover lossy. ---

    } else if (!strcasecmp(c->argv[1]->ptr,"failover") &&
               (c->argc == 2 || c->argc == 3))
    {
        /* CLUSTER FAILOVER [FORCE|TAKEOVER] */
        int force = 0, takeover = 0;

        if (c->argc == 3) {
            if (!strcasecmp(c->argv[2]->ptr,"force")) {
                force = 1;
            } else if (!strcasecmp(c->argv[2]->ptr,"takeover")) {
                takeover = 1;
                force = 1; /* Takeover also implies force. */
            } else {
                addReplyErrorObject(c,shared.syntaxerr);
                return 1;
            }
        }

        /* Check preconditions. */
        if (clusterNodeIsMaster(myself)) {
            addReplyError(c,"You should send CLUSTER FAILOVER to a replica");
            return 1;
        } else if (myself->slaveof == NULL) {
            addReplyError(c,"I'm a replica but my master is unknown to me");
            return 1;
        } else if (!force &&
                   (nodeFailed(myself->slaveof) ||
                    myself->slaveof->link == NULL))
        {
            addReplyError(c,"Master is down or failed, "
                            "please use CLUSTER FAILOVER FORCE");
            return 1;
        }
        resetManualFailover();
        server.cluster->mf_end = mstime() + CLUSTER_MF_TIMEOUT;

        if (takeover) {
            /* A takeover does not perform any initial check. It just
             * generates a new configuration epoch for this node without
             * consensus, claims the master's slots, and broadcast the new
             * configuration. */
            serverLog(LL_NOTICE,"Taking over the master (user request).");
            clusterBumpConfigEpochWithoutConsensus();
            clusterFailoverReplaceYourMaster();
        } else if (force) {
            /* If this is a forced failover, we don't need to talk with our
             * master to agree about the offset. We just failover taking over
             * it without coordination. */
            serverLog(LL_NOTICE,"Forced failover user request accepted.");
            server.cluster->mf_can_start = 1;
        } else {
            serverLog(LL_NOTICE,"Manual failover user request accepted.");
            clusterSendMFStart(myself->slaveof);
        }
        addReply(c,shared.ok);
    } else if (!strcasecmp(c->argv[1]->ptr,"set-config-epoch") && c->argc == 3)
    {
        /* CLUSTER SET-CONFIG-EPOCH <epoch>
         *
         * The user is allowed to set the config epoch only when a node is
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free