CRDTs: The Hard Parts
What Are CRDTs and Why Should You Care?
Imagine you and a colleague are working on the same Google Doc while on a plane with no internet. You're both typing away, making changes independently. When you land and reconnect, your changes magically merge together without conflicts. That's the kind of problem CRDTs solve.
Conflict-free Replicated Data Types (CRDTs) are clever algorithms that let multiple people edit the same data simultaneously, even when they're offline. When everyone comes back online, CRDTs automatically merge all the changes into a consistent state - no manual conflict resolution needed. They power collaborative tools like Google Docs and Figma, as well as distributed databases and other systems that need to work across multiple locations.
The basic idea sounds simple, but as Martin Kleppmann reveals in this talk, the devil is in the details. While CRDTs are conceptually elegant, implementing them correctly is surprisingly tricky. Many published algorithms have subtle bugs that cause weird behavior in edge cases. Even worse, naive implementations often perform terribly - we're talking about 100x more memory usage than the actual data!
In this talk, Martin goes beyond introductory CRDT concepts and shares hard-won lessons from years of research on making CRDTs work in real-world applications.
The Two Approaches to Real-Time Collaboration
Before diving into the challenges, it's helpful to understand the landscape of collaborative editing technologies:
-
Operational Transforms (OT): The older approach (dating back to 1989), used in systems like Google Docs. Requires a central server to coordinate all changes and decide the order of operations. When you type in Google Docs, your keystrokes are sent to a server that transforms them to work with everyone else's edits.
-
CRDTs: The newer approach (gaining popularity since 2006) that doesn't need a central coordinator. Each device can work independently and merge changes later. This makes CRDTs better for peer-to-peer apps and offline-first systems. However, they're deceptively easy to implement incorrectly.
The Four Hard Problems
Martin covers four major challenges that make CRDTs difficult to implement correctly:
1. Interleaving Anomalies: When Text Gets Scrambled
The Problem: Imagine you and I are both editing the word "HELLO". You type "AB" after the "H", and I type "XY" after the "H" at the same time. What should the final text be? "HABXY"? "HXYAB"? A scrambled mess like "HAXBY"?
How CRDTs Handle This: Text CRDTs assign a unique ID to every character you type. When merging concurrent edits, these IDs determine where characters should appear. However, this can lead to interleaving - where concurrent edits at the same position get mixed together in unexpected ways.
The RGA Algorithm: One popular CRDT algorithm called RGA (Replicated Growable Array) handles this reasonably well but isn't perfect. It can still produce non-deterministic (unpredictable) output, especially when you move your cursor to different positions and insert text at multiple locations. Different replicas might merge the same edits in slightly different orders, leading to inconsistent results.
2. Reordering List Items: It's Not Just Delete and Insert
The Problem: You might think that moving an item in a list is just deleting it from one position and inserting it at another. But in CRDTs, this doesn't work well. If two people move the same item to different positions simultaneously, the result can be ambiguous or unintuitive.
Why It's Hard: Sometimes there isn't a single "correct" answer. If I move item A before item B, and you move item B before item A at the same time, what's the right final order?
The Solution: Building complex CRDTs by combining simpler ones. Martin shows an example where you can combine:
- A List CRDT (for ordering)
- An AWSet (Add-Wins Set, which handles concurrent additions gracefully)
- An LWWRegister (Last-Writer-Wins Register, which picks the most recent value)
This composability lets you create more sophisticated data structures with well-defined merge semantics.
3. Moving Subtrees in Trees: Preventing Folder Chaos
The Problem: File systems and organizational hierarchies are trees. What happens when two people move folders around simultaneously? Even worse, what if these moves could create a cycle - like folder A becoming a child of folder B, while folder B becomes a child of folder A?
Preventing Cycles: One common approach is Last-Writer-Wins (LWW): whoever made the most recent change wins. This prevents cycles but can be surprising - your carefully organized folder structure might suddenly change because someone else moved something a second later.
Real-World Complications: Martin discusses how this affects real systems:
- Dropbox and Google Drive both had to solve this problem, each with different trade-offs
- Different vendors have different ideas about what the "correct" result should be when conflicts occur
- There might be multiple valid results, and picking one requires making judgment calls
The Time-Travel Problem: Some systems let you undo operations by going back to a previous timestamp. But in a distributed system without a central clock, "previous timestamp" is surprisingly ambiguous. Which device's clock do you trust?
4. Reducing Metadata Overhead: The 100x Problem
The Shocking Reality: Remember how CRDTs need to track a unique ID for every character? This metadata takes up space. Early CRDT implementations had terrible overhead - we're talking about 100 bytes of metadata for every 1 byte of actual data. Store a 1KB document? It actually takes 100KB in memory. Yikes.
Why This Happens: To merge changes correctly, CRDTs need to remember:
- Who made each change
- When it was made
- What came before and after it
- Various other bookkeeping information
All of this adds up fast, especially for text documents where each keystroke is a separate operation.
The Solution: Columnar Encoding: Martin discusses how the Automerge project solved this problem using columnar encoding - the same technique databases use to compress tables efficiently. Instead of storing each operation's metadata separately, you store all the metadata in columns and compress the patterns.
The Results: Using these compression techniques, Automerge reduced overhead from 100:1 to close to 1:1. Modern versions use only about 30% more space than the raw data - less than one extra byte per character. This makes CRDTs practical for real-world applications.
Key Takeaways
CRDTs promise automatic conflict resolution for distributed systems, but implementing them correctly requires careful attention to:
- Correctness: Ensuring that concurrent edits merge in predictable, intuitive ways
- Performance: Optimizing metadata storage so you don't use 100x more memory than necessary
- Semantics: Making difficult design choices about what "correct" means when conflicts occur
- Real-world constraints: Handling offline editing, network partitions, and time synchronization
Martin's talk shows that while CRDTs are powerful, they're not magic. Building production-ready CRDT systems requires deep understanding of the trade-offs and careful engineering. If you're building collaborative software or distributed systems, this talk provides essential insights into the challenges you'll face and proven solutions to overcome them.
Links
Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science, and publishing my notes here. This is #29 in this series.