Anant Jain

Optimizing Space Amplification in RocksDB

Paper Review

Abstract

RocksDB is an embedded, high-performance, persistent key-value storage engine developed at Facebook. Much of our current focus in developing and configuring RocksDB is to give priority to resource efficiency instead of giving priority to the more standard performance metrics, such as response time latency and throughput, as long as the latter remain acceptable. In particular, we optimize space efficiency while ensuring read and write latencies meet service-level requirements for the intended workloads. This choice is motivated by the fact that storage space is most often the primary bottleneck when using Flash SSDs under typical production workloads at Facebook. RocksDB uses log-structured merge trees to obtain significant space efficiency and better write throughput while achieving acceptable read performance.

This paper describes methods we used to reduce storage usage in RocksDB. We discuss how we are able to trade off storage efficiency and CPU overhead, as well as read and write amplification. Based on experimental evaluations of MySQL with RocksDB as the embedded storage engine (using TPC-C and LinkBench benchmarks) and based on measurements taken from production databases, we show that RocksDB uses less than half the storage that InnoDB uses, yet performs well and in many cases even better than the B-tree-based InnoDB storage engine. To the best of our knowledge, this is the first time a log-structured merge tree-based storage engine has shown competitive performance when running OLTP workloads at large scale.

What You Need to Know First

Before diving into the paper, here are some key concepts that will help you understand RocksDB better:

What is RocksDB? RocksDB is a database storage engine - essentially the part of a database that actually stores data on disk and retrieves it. Think of it as the "hard drive manager" for databases. It was created by Facebook (now Meta) to handle their massive data storage needs more efficiently.

Log-Structured Merge Trees (LSM-trees): This is the fundamental data structure behind RocksDB. Unlike traditional databases that use B-trees (which update data in place), LSM-trees optimize for writes by always appending new data sequentially. Here's how it works (a small code sketch follows this list):

  • New data is first written to an in-memory buffer (called a memtable)
  • When the buffer fills up, it's written to disk as an immutable file (called an SSTable - Sorted String Table)
  • Periodically, these files are merged together in the background to keep things organized
  • This approach makes writes very fast (sequential writes are much faster than random writes) but can make reads slower since data might be scattered across multiple files
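
To make this write path concrete, here is a minimal, purely illustrative Python sketch of the memtable → SSTable → compaction cycle. The class name, method names, and thresholds are invented for illustration; real SSTables are on-disk files searched via index blocks and Bloom filters, not in-memory lists.

```python
# Toy illustration of the LSM write path: writes go to an in-memory memtable,
# full memtables are flushed to immutable sorted runs ("SSTables"), and runs
# are periodically merged (compacted). Not RocksDB's actual implementation.

class MiniLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}               # mutable in-memory buffer
        self.ssts = []                   # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value       # every write lands in memory first
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as an immutable, sorted run ("SSTable").
        run = sorted(self.memtable.items())
        self.ssts.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:         # newest data first
            return self.memtable[key]
        for run in self.ssts:            # then newer runs before older ones
            for k, v in run:             # (real SSTs use binary search + Bloom filters)
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value for each key.
        merged = {}
        for run in reversed(self.ssts):  # oldest first, so newer values overwrite
            merged.update(run)
        self.ssts = [sorted(merged.items())]
```

Note how files are never modified in place: they are only created by flushes and replaced wholesale by compactions, which is exactly where the sequential-write advantage comes from.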

The Three Amplification Factors: These are the key metrics for evaluating storage engines, and understanding them is crucial to appreciating what this paper achieves:

  1. Space Amplification: How much disk space you use compared to your actual data. If you store 10 MB of data but it takes 50 MB on disk, your space amplification is 5x. Lower is better.

  2. Write Amplification: How much data is actually written to disk for each database write. If writing 10 MB to the database causes 30 MB to be written to disk (due to internal reorganization), your write amplification is 3x. This matters because SSDs can only handle a finite number of writes before wearing out.

  3. Read Amplification: How many disk reads are needed to answer a query. If you need to check 5 different files to find your data, your read amplification is 5. Lower is better for performance.

The Fundamental Tradeoff: Here's the catch - you can optimize for at most two of these three factors. It's impossible to minimize space, read, and write amplification simultaneously. This paper is about how RocksDB prioritizes space efficiency while keeping read and write performance acceptable.
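
As a quick sanity check on these definitions, here is the arithmetic from the hypothetical numbers used above expressed in a few lines of Python (all figures are made up for illustration):

```python
# Worked example of the three amplification factors defined above.
user_data_mb     = 10   # logical data the application stored
on_disk_mb       = 50   # what that data actually occupies on disk
flash_written_mb = 30   # bytes physically written to flash for those writes
blocks_read      = 5    # disk reads needed to answer one point query

space_amplification = on_disk_mb / user_data_mb        # 5.0x
write_amplification = flash_written_mb / user_data_mb  # 3.0x
read_amplification  = blocks_read                      # 5 reads per query

print(space_amplification, write_amplification, read_amplification)
```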

Why This Matters: At Facebook's scale (tens of petabytes), even small improvements in space efficiency translate to massive cost savings in hardware and power. If you can store the same data in half the space, you need half as many servers.

Key Highlights

Context and Background:

  • Facebook has one of the largest MySQL installations in the world, storing tens of petabytes of online data. To put that in perspective, one petabyte is 1,000 terabytes - enough to store roughly 500 billion pages of text.

  • RocksDB began as a fork of Google's LevelDB (another LSM-tree-based storage engine), which Facebook then heavily modified and optimized for its specific needs.

  • MyRocks is RocksDB integrated as a MySQL storage engine, allowing traditional MySQL databases to use RocksDB under the hood instead of the default InnoDB engine. This gives you the benefit of LSM-tree storage while keeping the familiar MySQL interface.

  • Within Facebook, RocksDB powers multiple critical systems: Laser (high query throughput key-value storage), ZippyDB (distributed key-value store with Paxos replication for consistency), Dragon (Social Graph indexing), and Stylus (stream processing), to name a few.

The Storage Challenge:

  • At Facebook's scale, the per-node query rate is actually quite low. This might sound counterintuitive, but here's why: the data volume is so massive that it must be distributed (sharded) across thousands of servers. The more servers you spread the data across, the fewer queries each individual server handles. This means optimizing for storage efficiency becomes more important than optimizing for extreme read performance on each node.

  • Minimizing space amplification makes SSDs an increasingly attractive alternative to traditional spinning hard drives, even for "colder" data that isn't accessed frequently. As SSD prices continue to decline, space efficiency becomes the deciding factor in whether SSDs make economic sense.

How RocksDB Searches for Data:

To understand the optimizations, it helps to know how RocksDB finds data. In an LSM-tree, data is organized in levels (think of them as floors in a building), with each level potentially containing multiple SST files. When searching for a key:

  1. First binary search: Find which SST file might contain your key by checking the Manifest File (a catalog of all SSTs and their key ranges)
  2. Second binary search: Within that SST file, find which data block contains your key by checking the SST's index block
  3. Third binary search: Search within that data block to find your actual key-value pair

This could mean checking multiple levels (3 searches per level), which is why read amplification can be a concern. However, RocksDB uses Bloom filters - probabilistic data structures that can quickly tell you "this key definitely isn't in this file" - to skip unnecessary file searches. In the common case, this means only one data block needs to be read from disk.
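
Here is a simplified Python sketch of that lookup path for a single level, assuming the level's files have non-overlapping key ranges (true for levels below level 0 in an LSM-tree). The data layout - plain lists plus a set standing in for a Bloom filter - is invented for illustration and is not RocksDB's actual format:

```python
import bisect

def get_from_level(key, files):
    """files: sorted by key range; each file is a dict with 'min_key',
    'bloom' (a set used as a stand-in Bloom filter), and sorted 'keys'
    with aligned 'values'."""
    # Search 1: pick the file whose key range could contain `key`.
    i = bisect.bisect_right([f["min_key"] for f in files], key) - 1
    if i < 0:
        return None
    sst = files[i]
    # Bloom filter: skip the file if the key is definitely not in it.
    if key not in sst["bloom"]:
        return None
    # Searches 2-3: binary search inside the file (index block and data
    # block collapsed into one sorted list here).
    j = bisect.bisect_left(sst["keys"], key)
    if j < len(sst["keys"]) and sst["keys"][j] == key:
        return sst["values"][j]
    return None

def lsm_get(key, levels):
    for files in levels:                 # check newer levels before older ones
        value = get_from_level(key, files)
        if value is not None:
            return value
    return None
```

The Bloom filter check is what keeps read amplification manageable: a file that definitely does not contain the key is skipped without any disk read.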

Space Amplification Math:

Here's a concrete example of LSM-tree space efficiency. Imagine each level is 10x larger than the previous level:

  • Level 0: 1 GB
  • Level 1: 10 GB
  • Level 2: 100 GB
  • Level 3: 1,000 GB (the last level)

If the last level is full, all the upper levels combined only add 111 GB (1 + 10 + 100) to the 1,000 GB in the last level. That's just 11.1% overhead, giving you a space amplification of approximately 1.111x - very efficient! This is why the paper focuses so much on the level size multiplier.
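
The same arithmetic in a few lines of Python:

```python
# Level-size arithmetic from the example above, with a 10x size multiplier.
multiplier = 10
last_level_gb = 1000
upper_levels_gb = sum(last_level_gb / multiplier**i for i in range(1, 4))

print(upper_levels_gb)                                    # 111.0 (levels 0-2: 100 + 10 + 1)
print((last_level_gb + upper_levels_gb) / last_level_gb)  # ~1.111 space amplification
```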

Novel Techniques Introduced:

This paper describes several techniques that were being published for the first time:

  1. Dynamic Level Size Adaptation: Rather than fixing each level's target size up front, RocksDB derives the upper levels' targets from the actual size of the last level, preserving the level size multiplier (how much bigger each level is than the previous one, typically 10). This keeps roughly 90% of the data in the last level and thus bounds space amplification even as the database grows. Larger multipliers reduce space and read amplification but increase write amplification. (A small sketch of this idea follows the list below.)

  2. Tiered Compression: Not all levels need the same compression. Since about 90% of data ends up in the last level, RocksDB uses aggressive compression (like zlib or Zstandard) only there, where it has the most impact. Upper levels use lighter compression to save CPU resources. This is smart because the last level is read and written less frequently, so the CPU cost of decompression is amortized.

  3. Dictionary-Based Compression: Instead of compressing each block independently, RocksDB can build a shared dictionary of common patterns across blocks, achieving better compression ratios.

  4. Prefix Bloom Filters: Traditional Bloom filters work on complete keys. Prefix Bloom filters work on key prefixes, which is useful when your queries often search for keys with common prefixes (like all keys starting with "user:").

  5. Variable Size Multipliers: Different levels can have different size multipliers (not just a uniform 10x throughout), allowing fine-grained control over the space/read/write amplification tradeoffs at each level.
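
As a concrete illustration of technique 1 above, here is a small Python sketch of how upper-level target sizes can be derived from the actual size of the last level. The function name and parameters are made up; this is a sketch of the idea, not RocksDB's implementation:

```python
def dynamic_level_targets(last_level_bytes, num_levels=4, multiplier=10):
    """Work backwards from the actual size of the last level so that each
    level's target is `multiplier` times smaller than the level below it."""
    targets = [last_level_bytes]
    for _ in range(num_levels - 1):
        targets.append(targets[-1] / multiplier)
    return list(reversed(targets))        # upper (smaller) levels first

# If the last level currently holds 1,000 GB of data:
print(dynamic_level_targets(1000))        # [1.0, 10.0, 100.0, 1000.0] (GB)
```

Because the upper levels always total about 11% of the last level under a 10x multiplier, space amplification stays close to 1.11x no matter how large the database grows.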

Results and Impact

The paper presents compelling evidence that RocksDB achieves its goals:

Benchmark Performance:

The researchers evaluated RocksDB against InnoDB (MySQL's default storage engine, which uses B-trees) using two industry-standard benchmarks:

  • TPC-C: A traditional OLTP (Online Transaction Processing) benchmark that simulates complex business transactions like order processing, payment handling, and inventory management. OLTP workloads are characterized by many small, concurrent transactions.

  • LinkBench: A benchmark specifically designed to simulate social networking workloads, mimicking the access patterns of large-scale web services like Facebook itself.

Key Findings:

  • RocksDB uses less than half the storage that InnoDB requires for the same data
  • Despite using significantly less space, RocksDB performs competitively with InnoDB and in many cases even outperforms it
  • This is the first time an LSM-tree-based storage engine has demonstrated that it can handle traditional OLTP workloads at large scale while being more space-efficient than B-tree engines

Why This is a Big Deal:

Historically, B-tree-based engines like InnoDB have dominated OLTP workloads because they offer predictable read performance. LSM-trees were thought to be too slow for reads. This paper demonstrates that with the right optimizations, LSM-trees can be competitive for OLTP while using dramatically less storage space.

For Facebook's production deployment, this translated to:

  • 50% reduction in storage infrastructure costs
  • Extended SSD lifespan due to reduced write amplification
  • Ability to store "colder" data on SSDs instead of slower spinning disks, improving overall system performance

Summary of Techniques

Here's a quick reference of the novel optimization techniques introduced in this paper:

  1. Dynamic LSM-tree level size adjustment: Automatically tunes level sizes based on current database size
  2. Tiered compression: Uses aggressive compression only where it matters most (the last level)
  3. Shared compression dictionary: Improves compression ratios by finding common patterns across data blocks
  4. Prefix Bloom filters: Optimizes queries that search by key prefixes
  5. Variable size multipliers: Fine-tunes the tradeoff between space, read, and write amplification at each level

These techniques work together to minimize space amplification while keeping read and write performance within acceptable bounds. The key insight is that at Facebook's scale, where storage is the primary bottleneck, trading some CPU overhead for better space efficiency is worth it.


I love reading foundational papers in Computer Science and publish my notes here on this blog. This was post #36 in this series.