In Search of an Understandable Consensus Algorithm (Raft)
What is Raft and Why Does It Matter?
Imagine you have multiple servers that need to stay in sync - they all need to agree on the same data, even when some servers fail or network issues occur. This is the fundamental problem of consensus in distributed systems. Raft solves this problem in a way that's actually understandable.
Before Raft, the dominant algorithm was Paxos, which is notoriously difficult to understand (even experts struggle with it). Raft was designed from the ground up to be easier to grasp, making it practical for developers to actually implement correctly.
Abstract (From the Paper)
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.
Introduction: The Problem with Paxos
Consensus algorithms are crucial for building reliable distributed systems. They ensure that multiple servers can agree on shared data even when failures occur. However, the traditional solution - Paxos - has a major problem: it's extremely difficult to understand.
Why is understandability so important? When developers don't fully understand an algorithm, they're likely to make mistakes when implementing or extending it. This can lead to subtle bugs that compromise the entire system's reliability.
The Raft authors took an unusual approach: they made understandability their primary design goal. Here's how they achieved it:
Key Design Principles:
- Decomposition: Raft breaks consensus into three separate, manageable pieces - leader election, log replication, and safety. This is much easier to reason about than trying to understand everything at once.
- State space reduction: Raft reduces the number of possible states servers can be in, making the system more predictable and easier to analyze. For example, Raft doesn't allow gaps in logs, which eliminates whole categories of edge cases you'd have to consider otherwise.
The result? In user studies, students found Raft significantly easier to learn than Paxos.
Background: Replicated State Machines
Before diving into Raft, let's understand what problem we're solving. The core concept is called a replicated state machine - a pattern used throughout distributed systems.
What is a Replicated State Machine?
Think of it like this: you have multiple servers, and each one is running the same program (the "state machine"). The trick is making sure they all execute the same commands in the same order, so they all end up with identical state. It's like having multiple calculators that need to stay perfectly in sync.
How it works:
- Each server maintains a log - an ordered list of commands
- The consensus algorithm (like Raft) ensures all servers have identical logs
- Each server executes the commands from its log in order
- Since they all execute the same commands in the same order, they all arrive at the same state
Real-world examples: Systems like Google File System (GFS), HDFS, and RAMCloud use replicated state machines for critical tasks like leader election and storing configuration data. ZooKeeper and Chubby are entire systems built around this concept.
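To make the pattern concrete, here's a minimal Go sketch of a state machine - a toy key-value store - applying commands from its log in order. The types and names are illustrative, not from the paper; the point is that any replica with the same log and the same apply loop ends up with the same data.

```go
package main

import "fmt"

// Command is one entry in the replicated log (illustrative).
type Command struct {
	Key, Value string
}

// StateMachine is a toy key-value store driven by the log.
type StateMachine struct {
	data        map[string]string
	lastApplied int // index of the last log entry applied (0 = none yet)
}

// Apply executes every newly committed entry, strictly in log order.
func (sm *StateMachine) Apply(log []Command, commitIndex int) {
	for sm.lastApplied < commitIndex {
		cmd := log[sm.lastApplied]
		sm.data[cmd.Key] = cmd.Value
		sm.lastApplied++
	}
}

func main() {
	sm := &StateMachine{data: map[string]string{}}
	log := []Command{{"x", "1"}, {"y", "2"}}
	// Every replica that applies the same log up to the same index
	// reaches the same state.
	sm.Apply(log, 2)
	fmt.Println(sm.data) // map[x:1 y:2]
}
```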
What Makes a Good Consensus Algorithm?
Practical consensus algorithms need three key properties:
- Safety: They never return incorrect results, even with network problems (delays, lost packets, duplicated messages, etc.). Note: this assumes non-Byzantine conditions, meaning servers aren't actively trying to deceive each other.
- Availability: The system stays functional as long as a majority of servers are working. For example, a 5-server cluster can survive 2 failures.
- Timing independence: The algorithm doesn't rely on synchronized clocks or assume messages arrive within specific timeframes. Clock issues might slow things down but won't cause incorrect results.
What's Wrong with Paxos?
If Paxos solves the same problem, why create Raft? The answer lies in Paxos's fundamental design flaws:
Problem 1: It's Really Hard to Understand
At a 2012 conference (NSDI), the authors surveyed attendees and found that even experienced researchers struggled to understand Paxos. This isn't just an academic concern - if experts can't grasp it, how can practitioners implement it correctly?
The core issue is Paxos's foundation: it starts with "single-decree Paxos" (making one decision) and then builds up to "multi-Paxos" (making many decisions, like maintaining a log). This bottom-up approach is conceptually awkward. It's like learning to build a house by first studying how to lay a single brick, when really you should start with the overall architecture.
Problem 2: Poor Foundation for Real Systems
Paxos isn't just hard to understand - it's also difficult to implement in practice:
- Peer-to-peer architecture: Paxos uses a symmetric approach where all servers are equal. While elegant in theory, practical systems almost always need a leader for efficiency. Paxos only mentions leadership as an optional "optimization," leaving implementers to figure out how to add it properly.
- Gap between theory and practice: The Paxos paper describes the algorithm but doesn't provide clear guidance on building production systems. This means every team has to solve the same practical problems from scratch, often in incompatible ways.
The Raft authors believed there was a better way: design the algorithm from the start with understandability and practical implementation in mind.
Designing for Understandability
The Raft authors didn't just claim to make things simpler - they used concrete techniques:
Technique 1: Problem Decomposition
Instead of describing consensus as one monolithic algorithm, Raft breaks it into independent subproblems:
- Leader election: How to choose which server is in charge
- Log replication: How the leader distributes commands to followers
- Safety: How to ensure nothing goes wrong even with failures
- Membership changes: How to add or remove servers from the cluster
You can understand each piece separately, then see how they fit together. This is much easier than trying to grasp everything at once.
Technique 2: Reduce the State Space
A key insight: simpler systems have fewer possible states, making them easier to reason about.
For example:
- No log holes: Raft never allows gaps in the log. If entries 1, 2, and 3 exist, you know entry 2 is present. This eliminates entire categories of edge cases.
- Limited inconsistency: Raft carefully constrains how logs can differ between servers, making it easier to reason about recovery scenarios.
Technique 3: Strategic Use of Randomization
Randomization might seem like it adds unpredictability (and it does), but it can actually simplify things by making all choices equivalent. Raft uses randomized timeouts for leader election - instead of complex coordination, servers just wait a random amount of time, and whoever times out first starts an election. Simple and effective.
The goal isn't just to create an algorithm that works - it's to create one that developers can understand well enough to implement correctly and extend for their specific needs.
How Raft Works: The Core Algorithm
Now for the main event - how Raft actually works. Let's start with the basics.
Server Roles
At any moment, each server in a Raft cluster is in exactly one of three states:
- Leader: The boss. It accepts client requests, adds them to the log, and tells other servers to replicate these entries. There's only one leader at a time.
- Follower: Passive workers. They don't initiate anything - they just respond to requests from the leader or candidates. Most servers are followers most of the time.
- Candidate: A temporary state when a server is trying to become the new leader. This happens when the current leader fails.
Terms: Logical Time
Raft divides time into terms of arbitrary length. Think of terms like presidential terms - each one has a number (term 1, term 2, etc.), and at most one leader is elected per term. Terms serve as a logical clock: if a server sees a request from an old term, it knows that information is stale and can safely reject it.
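As a rough sketch, the state each server carries might look like the following in Go. The field names follow Figure 2 of the paper (currentTerm, votedFor, commitIndex, and so on); the Role type and everything else is just illustration.

```go
package raft

// Role is the server's current state in the protocol.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Term    int    // term in which the entry was received by the leader
	Command []byte // opaque command for the state machine
}

// Server holds the per-server state described in Figure 2 of the paper.
type Server struct {
	role Role

	// Persistent state (saved to stable storage before answering RPCs).
	currentTerm int        // latest term this server has seen
	votedFor    int        // candidate voted for in currentTerm (-1 if none)
	log         []LogEntry // the replicated log, 1-indexed in the paper

	// Volatile state.
	commitIndex int // highest log index known to be committed
	lastApplied int // highest log index applied to the state machine
}
```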
Communication: Just Two RPC Types
The entire algorithm uses only two remote procedure calls (RPCs):
- RequestVote RPC: Used during elections. Candidates ask other servers: "Will you vote for me?"
- AppendEntries RPC: Used by leaders for two purposes:
  - Replicate log entries to followers
  - Send heartbeats (empty AppendEntries) to let followers know the leader is still alive
That's it! Just two types of messages. Servers retry RPCs if they don't get responses, and they send RPCs in parallel for better performance.
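Sketched as Go structs, the two RPCs carry the fields listed in Figure 2 of the paper. The transport (net/rpc, gRPC, whatever you like) is left out, and LogEntry is the same illustrative type as in the earlier sketch.

```go
package raft

// LogEntry is repeated here so the sketch stands on its own.
type LogEntry struct {
	Term    int
	Command []byte
}

// RequestVoteArgs: sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // receiver's current term, so the candidate can update itself
	VoteGranted bool // true means the candidate received this vote
}

// AppendEntriesArgs: sent by the leader to replicate entries; with an empty
// Entries slice it doubles as a heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that preceding entry
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // receiver's current term, so the leader can update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}
```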
Leader Election: Choosing a New Boss
When does an election happen? When a follower stops hearing from the leader. Each follower has an election timeout - if this much time passes without hearing from the leader (via heartbeat), the follower assumes the leader has crashed and starts an election.
The Election Process:
- The follower increments its current term, transitions to the candidate state, and votes for itself
- It sends RequestVote RPCs to all other servers
- Then one of three things happens:
Outcome A: The candidate wins
- It receives votes from a majority of servers
- It becomes the new leader and starts sending heartbeats
Outcome B: Another server becomes leader
- While waiting for votes, the candidate receives an AppendEntries RPC from another server with a term number equal to or greater than its own
- The candidate recognizes this other server as the legitimate leader and returns to follower state
Outcome C: Nobody wins (split vote)
- No candidate gets a majority (maybe votes were split among multiple candidates)
- Each candidate times out and starts a new election with an incremented term number
Voting Rules:
- Each server votes for at most one candidate per term
- Votes are given on a first-come, first-served basis
- A candidate needs votes from a majority of servers to win
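A sketch of how the receiving side of RequestVote might enforce these rules is below - one vote per term, first come, first served. The up-to-date log check (covered in the safety section later) is folded into a single boolean here; the struct and function names are my own, not the paper's.

```go
package raft

// voterState is the slice of server state that voting touches.
type voterState struct {
	currentTerm int
	votedFor    int // -1 means no vote cast yet in currentTerm
}

// grantVote decides whether to vote for candidateID in candidateTerm.
func (v *voterState) grantVote(candidateID, candidateTerm int, candidateLogUpToDate bool) bool {
	if candidateTerm < v.currentTerm {
		return false // stale term: reject outright
	}
	if candidateTerm > v.currentTerm {
		// Newer term: adopt it and forget any vote cast in the old term.
		v.currentTerm = candidateTerm
		v.votedFor = -1
	}
	if (v.votedFor == -1 || v.votedFor == candidateID) && candidateLogUpToDate {
		v.votedFor = candidateID // at most one vote per term, first come first served
		return true
	}
	return false
}
```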
Preventing Endless Split Votes:
Without special handling, split votes could repeat forever. Raft's solution? Randomized election timeouts. Each server chooses its timeout randomly from a range (e.g., 150-300ms). This means servers are unlikely to become candidates at the same time, so split votes are rare and resolve quickly when they do occur.
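Here's a small Go sketch of that timeout loop, assuming the 150-300ms range the paper uses as an example. The heartbeat channel and the printed "start election" step stand in for the real AppendEntries handling and candidate logic.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh value each time, so two followers
// rarely time out (and split the vote) at the same moment.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

func main() {
	heartbeats := make(chan struct{}) // signaled whenever AppendEntries arrives
	timer := time.NewTimer(randomElectionTimeout())

	for i := 0; i < 3; i++ {
		select {
		case <-heartbeats:
			// Leader is alive: stay a follower and rearm the timer.
			timer.Reset(randomElectionTimeout())
		case <-timer.C:
			fmt.Println("election timeout elapsed: become candidate and start an election")
			timer.Reset(randomElectionTimeout())
		}
	}
}
```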
Log Replication: Getting Everyone in Sync
Once we have a leader, it needs to replicate its log to all followers. This is the heart of Raft.
The Normal Case (No Failures):
- Client sends a command to the leader
- Leader appends the command to its own log
- Leader sends AppendEntries RPCs to all followers with the new entry
- Once a majority of servers have stored the entry, the leader considers it committed
- The leader applies the committed entry to its state machine and responds to the client
- The leader includes the commit index in future AppendEntries (including heartbeats), so followers learn which entries are committed and can apply them to their state machines
Important Concept: Committed vs. Stored
A log entry is stored when it's written to a server's log. But it's only committed once the leader has replicated it to a majority of servers. Committed entries are guaranteed to be durable and will eventually execute on all available servers.
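One way a leader might compute its commit index is sketched below: find the highest log index stored on a majority of the cluster, counting itself. The matchIndex bookkeeping comes from the paper's Figure 2; the function itself is just an illustration.

```go
package main

import "fmt"

// majorityMatchIndex returns the highest log index stored on a majority of
// the cluster - the candidate value for commitIndex.
func majorityMatchIndex(leaderLastIndex int, matchIndex []int) int {
	clusterSize := len(matchIndex) + 1 // followers plus the leader itself
	for n := leaderLastIndex; n >= 1; n-- {
		count := 1 // the leader always has its own entries
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			// Stored on a majority. (The paper additionally requires the entry
			// to be from the leader's current term before committing it.)
			return n
		}
	}
	return 0
}

func main() {
	// Five servers: the leader's last index is 5; the four followers have
	// replicated up to indexes 5, 3, 2, and 1.
	fmt.Println(majorityMatchIndex(5, []int{5, 3, 2, 1})) // 3
}
```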
Ensuring Consistency
How does Raft ensure all logs stay consistent? When sending AppendEntries, the leader includes the index and term of the entry immediately preceding the new entries. This is like saying "here's a new entry that should go after entry X."
The follower checks: "Do I have entry X with that exact term?" If yes, it accepts the new entries. If no, it rejects them. This consistency check ensures logs don't diverge.
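In code, that follower-side check might look like this sketch (1-based indexes, as in the paper; the slice is 0-based underneath):

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// consistencyCheck reports whether the follower's log contains the entry at
// prevLogIndex with exactly prevLogTerm.
func consistencyCheck(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending at the very start of the log always matches
	}
	if prevLogIndex > len(log) {
		return false // the follower is missing entries
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	log := []LogEntry{{Term: 1, Command: "x=1"}, {Term: 2, Command: "y=2"}}
	fmt.Println(consistencyCheck(log, 2, 2)) // true: entry 2 exists with term 2
	fmt.Println(consistencyCheck(log, 2, 1)) // false: terms conflict, reject
}
```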
Handling Failures and Inconsistencies
What if leaders crash? The old leader might not have fully replicated all its entries, leaving logs inconsistent across servers. Over multiple crashes, these inconsistencies can compound.
Raft's solution is elegant: the leader's log is always right. When a new leader takes over:
- It maintains a nextIndex for each follower (the next log entry to send to that follower)
- Initially, nextIndex is set optimistically (one past the leader's last entry)
- If a follower rejects an AppendEntries (because logs don't match), the leader decrements nextIndex and retries
- Eventually, the leader finds the point where its log matches the follower's log
- The leader then sends all entries after that point, overwriting any conflicting entries in the follower's log
This process continues until the follower's log matches the leader's. The leader never modifies or deletes its own log entries - only followers' logs get overwritten.
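A sketch of that retry loop for a single follower is below. sendAppendEntries stands in for the real RPC (it returns whether the follower's consistency check passed); indexes are 1-based as in the paper.

```go
package raft

type LogEntry struct {
	Term    int
	Command []byte
}

// syncFollower keeps retrying one follower until it accepts the leader's
// entries, backing nextIndex up after every rejection. It returns the new
// nextIndex once the follower's log matches the leader's.
func syncFollower(log []LogEntry, nextIndex int,
	sendAppendEntries func(prevIndex, prevTerm int, entries []LogEntry) bool) int {

	for nextIndex >= 1 {
		prevIndex := nextIndex - 1
		prevTerm := 0
		if prevIndex > 0 {
			prevTerm = log[prevIndex-1].Term
		}
		// Ship every entry from nextIndex onward; an accepting follower
		// overwrites any conflicting suffix of its own log.
		if sendAppendEntries(prevIndex, prevTerm, log[nextIndex-1:]) {
			return len(log) + 1 // logs now match; the next new entry goes here
		}
		nextIndex-- // consistency check failed: step back and retry
	}
	return 1 // not reachable in practice: prevIndex 0 always matches
}
```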
Reliability
The leader retries AppendEntries indefinitely until all followers eventually store all entries. Even if followers crash, run slowly, or network packets are lost, the leader keeps trying. This ensures the system eventually reaches consistency.
Safety: Election Restrictions
Here's a critical question: What if a server with an incomplete log becomes leader? It might be missing committed entries! This could be disastrous.
Raft's solution: prevent servers with incomplete logs from becoming leader in the first place.
How? Through the voting process. When a candidate sends a RequestVote RPC, it includes information about its log (the index and term of its last entry). Voters use this information to decide whether to grant their vote:
- If the voter's log is more up-to-date than the candidate's, the voter denies its vote
- "More up-to-date" means: higher term number in the last entry, or same term but longer log
This ensures that any candidate receiving majority votes must have all committed entries (since committed entries are on a majority of servers, and the candidate needs votes from a majority).
The result: log entries only flow one direction - from leaders to followers. Leaders never need to fetch entries from followers, simplifying the design significantly.
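The "more up-to-date" comparison is small enough to show directly. A sketch, with the candidate's values taken from the RequestVote arguments:

```go
package main

import "fmt"

// candidateIsUpToDate returns true if the candidate's log is at least as
// up-to-date as the voter's: a higher last-entry term wins outright; equal
// terms fall back to comparing log length.
func candidateIsUpToDate(voterLastTerm, voterLastIndex, candLastTerm, candLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}

func main() {
	fmt.Println(candidateIsUpToDate(2, 5, 3, 1)) // true: higher term wins despite a shorter log
	fmt.Println(candidateIsUpToDate(2, 5, 2, 4)) // false: same term, but the voter's log is longer
}
```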
Handling Other Failures
Follower and Candidate Crashes:
These are actually simple to handle. If a follower or candidate crashes, the leader's retries will eventually succeed once they restart. Because Raft's RPCs are idempotent (applying them multiple times has the same effect as applying once), duplicate messages cause no problems.
Timing and Availability
Raft's correctness doesn't depend on timing, but its availability does. The system needs to satisfy this timing requirement:
broadcastTime ≪ electionTimeout ≪ MTBF
Where:
- broadcastTime: Average time to send an RPC to all servers and receive responses
- electionTimeout: The timeout before starting an election
- MTBF: Mean Time Between Failures (average time until a server fails)
What this means in plain English:
- Broadcast time should be much less than the election timeout (roughly an order of magnitude less). This lets leaders send heartbeats reliably, preventing unnecessary elections.
- Election timeout should be much less than MTBF (a few orders of magnitude less). This ensures the system can elect a stable leader between failures.
Typical Values:
- Broadcast time: 0.5-20ms (depends on storage speed, since RPCs persist data)
- Election timeout: 10-500ms (you choose this)
- MTBF: Months (typical server reliability)
With these values, Raft can maintain steady leadership and make continuous progress.
Cluster Membership Changes
Real systems need to change their configuration: adding servers to increase capacity, removing failed servers, or replacing machines. You could take the cluster offline, update configuration files, and restart - but this sacrifices availability.
Can we change membership while the cluster is running? Yes, but it's tricky.
The Problem with Naive Approaches
Here's why you can't just switch all servers at once: servers will transition at different times, creating a dangerous window where the cluster might split into two independent majorities (one using the old configuration, one using the new). This could lead to two leaders being elected for the same term - a safety violation!
Raft's Solution: Joint Consensus
Raft uses a two-phase approach:
- Phase 1: Transition to a "joint consensus" configuration that combines both old and new configurations
  - Decisions require majorities from both the old and new configurations
  - This prevents split-brain scenarios
- Phase 2: Once joint consensus is committed, transition to the new configuration
  - Now only the new configuration matters
Configurations are stored as special entries in the replicated log. Once a server adds a new configuration entry to its log, it immediately starts using that configuration for all decisions (even before it's committed).
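The counting rule during joint consensus can be sketched as follows: an entry (or a vote) only counts as agreed once it has separate majorities in the old and new server sets. The server IDs and the acked map are illustrative.

```go
package main

import "fmt"

// majorityOf reports whether the acknowledging servers form a majority of
// the given configuration.
func majorityOf(config []int, acked map[int]bool) bool {
	count := 0
	for _, id := range config {
		if acked[id] {
			count++
		}
	}
	return count > len(config)/2
}

// jointAgreement requires separate majorities from both configurations,
// which is what prevents two independent majorities from forming.
func jointAgreement(oldConfig, newConfig []int, acked map[int]bool) bool {
	return majorityOf(oldConfig, acked) && majorityOf(newConfig, acked)
}

func main() {
	oldConfig := []int{1, 2, 3}
	newConfig := []int{3, 4, 5}
	acked := map[int]bool{1: true, 2: true, 3: true}
	// A majority of the old configuration alone is not enough.
	fmt.Println(jointAgreement(oldConfig, newConfig, acked)) // false
}
```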
Three Additional Challenges
Challenge 1: New Servers Starting from Scratch
New servers joining the cluster have empty logs. If they immediately participate in voting, they could cause availability problems.
Solution: New servers first join as non-voting members. The leader replicates log entries to them, but they don't count toward majorities. Once they've caught up, the actual reconfiguration begins.
Challenge 2: The Leader Might Be Removed
What if the new configuration doesn't include the current leader? Raft handles this gracefully: the leader manages the transition to the new configuration, then steps down once the new configuration is committed. During the transition, it replicates entries but doesn't count itself in majorities.
Challenge 3: Removed Servers Can Cause Disruptions
Servers removed from the cluster won't receive heartbeats anymore. They'll timeout and start elections with new term numbers, potentially disrupting the current leader!
Solution: Servers ignore RequestVote RPCs if they've recently heard from a leader (within the minimum election timeout). This doesn't affect normal elections (where servers wait at least this long before starting an election), but it prevents removed servers from causing trouble.
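As a sketch, the extra check might sit right at the top of the RequestVote handler - the constant and field names here are assumptions, not from the paper:

```go
package main

import (
	"fmt"
	"time"
)

// minElectionTimeout is the lower bound of the election timeout range
// (150ms here, matching the example range used earlier).
const minElectionTimeout = 150 * time.Millisecond

type server struct {
	lastHeardFromLeader time.Time // updated on every valid AppendEntries
}

// shouldIgnoreRequestVote is evaluated before the normal vote-granting logic:
// if a leader was heard from recently, the request is ignored outright.
func (s *server) shouldIgnoreRequestVote(now time.Time) bool {
	return now.Sub(s.lastHeardFromLeader) < minElectionTimeout
}

func main() {
	s := &server{lastHeardFromLeader: time.Now()}
	fmt.Println(s.shouldIgnoreRequestVote(time.Now())) // true: a leader is alive, don't grant votes
}
```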
Implementation and Evaluation
The paper authors didn't just describe Raft theoretically - they implemented and tested it.
Implementation Size
The reference implementation is remarkably compact: about 2000 lines of C++ code (excluding tests, comments, and blank lines). This small size is a testament to Raft's simplicity.
Formal Verification
The authors created a formal specification in TLA+ (a mathematical specification language) spanning about 400 lines. They used the TLA proof system to mechanically verify key properties, like the Log Completeness Property. This specification is valuable for anyone implementing Raft, providing a precise reference.
Performance
Raft is designed for understandability first, but it's also efficient:
- Easily supports batching (grouping multiple requests together)
- Supports pipelining (sending requests without waiting for responses)
- These optimizations provide higher throughput and lower latency
Availability During Leader Changes
When a leader fails, the minimum possible downtime is about half the election timeout. The authors recommend using conservative election timeouts like 150-300ms. This is:
- Short enough for good availability (most clients won't even notice)
- Long enough to avoid unnecessary leader changes due to temporary network hiccups
Related Work: Viewstamped Replication
Before Raft, there was another leader-based consensus algorithm called Viewstamped Replication (VR), developed by Oki and Liskov around the same time as Paxos. VR has many similarities to Raft, but Raft improves on it with clearer presentation and additional mechanisms (like handling membership changes elegantly).
Conclusion: Why Raft Matters
Key Differences from Paxos
The biggest difference between Raft and Paxos is Raft's strong leadership:
- Paxos: Uses a two-phase protocol for consensus, with leader election as a separate, optional mechanism
- Raft: Integrates leader election directly into the consensus algorithm as the first phase
This architectural difference makes Raft more coherent and easier to understand.
Simplicity Compared to Alternatives
Raft is simpler than other practical systems like VR or ZooKeeper:
- Fewer message types: Raft uses just 2 RPC types (RequestVote and AppendEntries), fewer than any other consensus algorithm the authors are aware of
- Less mechanism: Raft minimizes functionality in non-leaders, keeping the design clean
- Better leader changes: While VR and ZooKeeper describe transmitting entire logs during leader changes (requiring optimization to be practical), Raft's approach is efficient from the start
Membership Changes
The joint consensus approach for membership changes is elegant because it leverages the existing consensus protocol. Very little additional mechanism is needed, keeping the algorithm simple.
The Importance of Understandability
This is the paper's central thesis: unless developers deeply understand an algorithm and can build intuitions about it, they'll struggle to implement it correctly and maintain its desirable properties. By prioritizing understandability, Raft makes reliable distributed systems more accessible.
Design Lessons
Throughout the design process, the authors found themselves repeatedly using the same techniques:
- Decomposition: Breaking complex problems into independent subproblems
- State space reduction: Limiting possibilities to make the system more predictable
These principles apply beyond Raft - they're valuable for any complex system design.
Real-World Impact
Since its publication, Raft has been widely adopted. It's used in production systems including:
- MongoDB (replication)
- RabbitMQ (queue replication)
- etcd (distributed key-value store used by Kubernetes)
- Consul (service discovery)
- Many distributed databases and systems
Raft proved that understandability and practical effectiveness can go hand-in-hand.
Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science, and publishing my notes here. This is #34 in this series.