Paxos Made Simple
Abstract
The Paxos algorithm, when presented in plain English, is very simple.
Introduction
Imagine you have multiple computers that need to agree on a single decision—like which value to store in a database—even when some computers might crash or messages might get lost. This is the consensus problem, and it's fundamental to building reliable distributed systems.
The Paxos algorithm, created by Leslie Lamport, solves this problem elegantly. It's a consensus algorithm that enables a group of computers (or "nodes") to agree on a single value even in the presence of failures. The algorithm is designed for fault-tolerant distributed systems, meaning the system continues to work correctly even when individual components fail.
What makes Paxos special is that its design follows naturally from the safety requirements we need in distributed systems. Once you understand what properties we need, the algorithm becomes almost inevitable.
The Problem
Let's say we have a collection of processes (think of them as individual computers or programs) that can each propose different values. A consensus algorithm needs to ensure that exactly one of these proposed values gets chosen by the group.
The rules are straightforward:
- If no one proposes anything, nothing should be chosen
- Once a value is chosen, any process should be able to learn what was chosen
- The system must guarantee these safety requirements:
  - Only a value that has been proposed may be chosen (no making up values)
  - Only a single value is chosen (no disagreement)
  - A process never learns that a value has been chosen unless it actually has been (no false information)
The Three Roles
Paxos organizes participants into three roles:
- Proposers: These initiate the process by suggesting values that should be chosen
- Acceptors: These vote on proposals and help reach consensus (think of them as the voting body)
- Learners: These find out which value was ultimately chosen
A single node in the system can play multiple roles simultaneously.
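To make the roles concrete, here is a minimal sketch (mine, not the paper's) of the messages the roles exchange. The names Prepare, Promise, Accept, and Accepted are my own labels for the protocol's four message types; the paper itself only describes them in prose.

```python
from dataclasses import dataclass
from typing import Optional

# A proposal pairs a number with a value.
@dataclass(frozen=True)
class Proposal:
    number: int
    value: str

# Phase 1a: a proposer asks acceptors for a promise.
@dataclass(frozen=True)
class Prepare:
    number: int

# Phase 1b: an acceptor's promise, reporting any proposal it already accepted.
@dataclass(frozen=True)
class Promise:
    number: int
    accepted: Optional[Proposal]  # None if this acceptor has accepted nothing yet

# Phase 2a: a proposer asks acceptors to accept a concrete proposal.
@dataclass(frozen=True)
class Accept:
    proposal: Proposal

# Phase 2b: an acceptor announces to learners that it accepted a proposal.
@dataclass(frozen=True)
class Accepted:
    proposal: Proposal
```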
The System Model
Paxos operates under realistic failure conditions (what computer scientists call the "asynchronous, non-Byzantine model"):
- Processes can fail and restart: They might crash at any time but aren't malicious. For the system to work, failed processes need to remember some information when they restart (typically stored on disk).
- Network is unreliable: Messages can be delayed indefinitely, lost, or even arrive multiple times. However, messages aren't corrupted—if you receive a message, you can trust its contents.
The Solution
Now let's walk through how Paxos actually works. The key insight is using proposal numbers to establish ordering and prevent conflicts.
How Proposers Work
When a proposer wants to get a value accepted, it follows these steps:
Step 1: Send a Prepare Request
The proposer picks a unique proposal number n (higher than any it's used before) and sends a prepare request to a majority of acceptors. This request asks each acceptor:
- (a) "Promise me you won't accept any proposals numbered less than
n" - (b) "Tell me about the highest-numbered proposal (if any) that you've already accepted"
Step 2: Send an Accept Request
If the proposer gets responses from a majority of acceptors, it can now send an accept request. This request says "please accept proposal number n with value v".
But which value v should it use? Here's the clever part:
- If any acceptor reported a previously accepted proposal, use the value from the highest-numbered one
- Otherwise, the proposer can choose any value it wants
This ensures that once a value starts gaining acceptance, Paxos converges on that value rather than switching to a different one.
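The value-selection rule is small enough to write down directly. A sketch, assuming each phase 1 reply's report is either None or a (number, value) pair (my representation, not the paper's):

```python
from typing import List, Optional, Tuple

def choose_value(promises: List[Optional[Tuple[int, str]]], my_value: str) -> str:
    """Pick the value for the accept request from a majority of promise replies."""
    # Keep only replies that actually report a previously accepted proposal.
    reported = [p for p in promises if p is not None]
    if reported:
        # Converge: reuse the value of the highest-numbered accepted proposal.
        _, value = max(reported, key=lambda p: p[0])
        return value
    # Nothing accepted anywhere in this majority: free to propose our own value.
    return my_value
```

For example, choose_value([None, (3, "x"), (5, "y")], "z") returns "y", because proposal 5 is the highest-numbered proposal reported.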
How Acceptors Work
Acceptors have a simpler job—they receive two types of requests and decide whether to respond:
For Prepare Requests:
- An acceptor can always respond to a prepare request
- If it responds, it promises not to accept any proposals with numbers lower than the one in the request
- It also reports any proposal it has previously accepted
For Accept Requests:
- An acceptor accepts the proposal unless it has already promised (via a prepare response) not to accept it
- Specifically, it can accept proposal n only if it hasn't responded to a prepare request with a number greater than n
Optimization
There's an important optimization: if an acceptor receives a prepare request numbered n, but it has already responded to a prepare request numbered greater than n, it can simply ignore the new request. It has already promised not to accept that proposal, so there's no point responding.
With this optimization, each acceptor only needs to remember two things:
- The highest-numbered proposal it has ever accepted
- The highest-numbered prepare request it has responded to
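Here is a sketch of an acceptor that keeps exactly those two pieces of state (the class and method names are mine). In a real system both fields would be written to stable storage before replying, since acceptors must remember them across restarts.

```python
from typing import Optional, Tuple

class Acceptor:
    def __init__(self):
        self.promised: int = -1                           # highest prepare responded to
        self.accepted: Optional[Tuple[int, str]] = None   # highest-numbered accepted (n, value)

    def prepare(self, n: int):
        # Optimization: ignore prepares we have already promised to refuse.
        if n <= self.promised:
            return None
        self.promised = n           # persist before replying in a real system
        return (n, self.accepted)   # the promise, plus any previously accepted proposal

    def accept(self, n: int, value: str) -> bool:
        # Accept unless a higher-numbered prepare has already been answered.
        if n < self.promised:
            return False
        self.accepted = (n, value)  # persist before replying in a real system
        return True
```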
The Algorithm
Let's put it all together into a concise, two-phase protocol:
Phase 1: Prepare
(a) Proposer: Selects a unique proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) Acceptor: If it receives a prepare request with number n greater than any prepare request it has already responded to, it responds with:
- A promise not to accept any proposals numbered less than n
- The highest-numbered proposal (if any) that it has already accepted
Phase 2: Accept
(a) Proposer: If it receives responses from a majority of acceptors, it sends an accept request to each of those acceptors for proposal number n with value v, where:
- v is the value of the highest-numbered proposal among the responses, OR
- v is any value the proposer chooses (if no proposals were reported)
(b) Acceptor: If it receives an accept request for proposal number n, it accepts the proposal unless it has already responded to a prepare request with a number greater than n.
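Putting both phases together, here is a toy single-round run against three in-memory acceptors. This is only a sketch under my own simplifications (no networking, persistence, or retries; run_round and Acceptor are my names), but it shows the two phases and the convergence on an already-chosen value.

```python
from typing import Optional, Tuple

class Acceptor:
    def __init__(self):
        self.promised = -1                               # highest prepare answered
        self.accepted: Optional[Tuple[int, str]] = None  # highest accepted (n, value)

    def prepare(self, n):
        if n <= self.promised:
            return None                 # already promised a higher-numbered proposer
        self.promised = n
        return (n, self.accepted)

    def accept(self, n, value):
        if n < self.promised:
            return False
        self.accepted = (n, value)
        return True

def run_round(acceptors, n, my_value):
    """One proposer executing Phase 1 and Phase 2 against all acceptors."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: prepare (sending to every acceptor; a majority would suffice).
    replies = [a.prepare(n) for a in acceptors]
    promises = [r for r in replies if r is not None]
    if len(promises) < majority:
        return None  # could not assemble a majority; retry with a higher n

    # Value rule: reuse the highest-numbered accepted value if one was reported.
    reported = [acc for _, acc in promises if acc is not None]
    value = max(reported)[1] if reported else my_value

    # Phase 2: accept request with (n, value).
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(run_round(acceptors, n=1, my_value="apple"))    # -> 'apple'
print(run_round(acceptors, n=2, my_value="banana"))   # -> 'apple' (already chosen)
```

The second round proposes "banana" but still ends with "apple": the acceptors report the earlier proposal in their promises, so the new proposer is forced to adopt its value.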
Why This Works
The algorithm guarantees safety through a clever trick: the majority requirement. Since any two majorities must overlap by at least one acceptor, if a value has been chosen (accepted by a majority), any new proposer will learn about it from at least one acceptor in the overlapping set. This prevents the system from choosing a different value.
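The overlap argument is easy to check mechanically for a small cluster. A quick sanity check (mine, not the paper's) that every pair of majorities among five acceptors shares at least one member:

```python
from itertools import combinations

acceptors = {"a1", "a2", "a3", "a4", "a5"}
majority_size = len(acceptors) // 2 + 1   # 3 out of 5

# Enumerate every possible majority and confirm each pair intersects.
majorities = [set(c) for c in combinations(acceptors, majority_size)]
assert all(m1 & m2 for m1 in majorities for m2 in majorities)
print("every two majorities overlap")
```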
Guaranteeing Progress
The algorithm described so far guarantees safety (we'll never choose two different values), but what about liveness (will we eventually choose a value)?
There's a potential problem: if multiple proposers keep issuing prepare requests with increasing proposal numbers, they can block each other indefinitely. For example:
- Proposer A sends prepare(n=1)
- Proposer B sends prepare(n=2) before A finishes, so acceptors will now reject A's accept request for proposal 1
- Proposer A retries with prepare(n=3), which in turn blocks B's accept request for proposal 2
- And so on...
The solution is to elect a distinguished proposer (also called a "leader") that becomes the only one allowed to issue proposals. As long as this leader can communicate with a majority of acceptors and uses a proposal number greater than any used before, it will successfully get a value accepted.
In practice, systems implement leader election mechanisms (which themselves might use Paxos!) to choose and maintain this distinguished proposer.
Implementing a State Machine
Now let's see how Paxos is used in practice to build fault-tolerant systems.
The Single Server Approach
The simplest design is to have a single central server that acts as a deterministic state machine. Clients send commands to the server, which executes them in order and produces outputs. Because it's deterministic (the same inputs always produce the same outputs), the behavior is predictable and consistent.
The problem? If that server fails, your entire system goes down.
The Multi-Server Solution
Instead of one server, we use multiple servers, each maintaining its own copy of the state machine. Since the state machine is deterministic, as long as all servers execute the same sequence of commands in the same order, they'll all end up in the same state and produce the same outputs.
Now clients can get responses from any server, providing both fault tolerance and load distribution.
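A sketch of why determinism matters, using a toy key-value state machine (my example, not the paper's): two replicas that apply the same commands in the same order always end up in the same state.

```python
class KVStateMachine:
    """A deterministic state machine: the state depends only on the command sequence."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

commands = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

replica_a, replica_b = KVStateMachine(), KVStateMachine()
for cmd in commands:         # same commands, same order
    replica_a.apply(cmd)
    replica_b.apply(cmd)

assert replica_a.state == replica_b.state == {"y": 2}
```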
Using Paxos for Command Ordering
The key challenge is: how do we ensure all servers execute commands in the same order?
The answer is brilliant: we run a separate instance of Paxos for each position in the command sequence:
- Paxos instance #1 chooses what the 1st command should be
- Paxos instance #2 chooses what the 2nd command should be
- Paxos instance #3 chooses what the 3rd command should be
- And so on...
Each server participates in all three roles (proposer, acceptor, and learner) for each Paxos instance. This way, all servers agree on the complete sequence of commands and execute them in the same order, keeping their state machines synchronized.
This approach, often called "Multi-Paxos," underpins real-world systems like Google's Chubby lock service, and closely related protocols (such as Zab) power Apache ZooKeeper.
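A sketch of the replicated-log view, assuming a helper run_paxos_instance(index, proposed) that runs one independent Paxos instance for a slot and returns the chosen command. The helper here is a hypothetical stand-in; a real one would execute a full single-decree round like the one sketched above, and the slot-assignment logic would be far more careful.

```python
def run_paxos_instance(index: int, proposed: str) -> str:
    # Hypothetical stand-in: a real implementation runs a full prepare/accept
    # round among all servers for this slot and may return someone else's command.
    return proposed

class ReplicatedLog:
    def __init__(self):
        self.chosen = {}        # slot index -> chosen command
        self.next_to_apply = 0  # next slot to feed to the state machine

    def propose(self, command: str) -> None:
        slot = len(self.chosen)  # simplistic slot choice, good enough for a sketch
        self.chosen[slot] = run_paxos_instance(slot, command)

    def apply_ready(self, state_machine) -> None:
        # Apply commands strictly in slot order, never skipping a gap.
        while self.next_to_apply in self.chosen:
            state_machine(self.chosen[self.next_to_apply])
            self.next_to_apply += 1

log = ReplicatedLog()
log.propose("set x=1")
log.propose("set y=2")
log.apply_ready(print)   # prints the commands in slot order
```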
Key Takeaways
Paxos elegantly solves one of computer science's fundamental problems: achieving consensus among distributed processes that can fail. Its key insights are:
- Use proposal numbers to establish ordering and prevent conflicts
- Require majorities for decisions, ensuring any two decision groups overlap
- Preserve previously accepted values to ensure convergence on a single choice
- Elect a leader to guarantee progress in practice
While Paxos has a reputation for being difficult to understand, at its core it's a straightforward algorithm that follows naturally from the properties we need. Understanding Paxos provides insight into how modern distributed databases, coordination services, and replicated systems work.
Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science, and publishing my notes here. This is #8 in this series.