Anant Jain

Still All on One Server: Perforce at Scale

Paper Review

What Is This Paper About?

This 2011 paper by Dan Bloch describes how Google managed one of the world's largest version control systems using Perforce (a centralized version control system that serves the same purpose as Git but keeps all history on a central server and is designed for handling massive codebases). At the time, Google ran the busiest single Perforce server on the planet, and the paper shares hard-won lessons about keeping such a massive system running smoothly.

Think of version control as a time machine for code - it tracks every change made to files, who made it, and why. Perforce is particularly good at handling large projects with many developers and large binary files (like images or videos), which made it popular at companies like Google before they built their own custom solutions.

The Scale of the Challenge

To understand the challenges Google faced, here are some mind-boggling numbers:

  • 12,000+ developers using the system daily
  • 11-12 million commands executed every day (an average of roughly 130 commands per second)
  • Over 1 terabyte of metadata (information about files, not the files themselves)
  • 20 million changelists created over 11 years (a "changelist" is like a Git commit - a group of file changes submitted together)
  • The largest single workspace contained 6 million files
  • The server ran on a 16-core machine with 256 GB of memory - powerful for 2011, but modest given the workload

Key Lessons Learned

1. Database Locking is the Biggest Performance Killer

What's database locking? When multiple people try to modify the same database at once, the system "locks" it so only one person can make changes at a time. Others have to wait their turn.

For large Perforce installations, this is the single most important performance factor. Google reduced their worst-case blocking from 10 minutes (in 2005) down to 30-60 seconds (by 2010) through careful optimization. The key insight: often just a few users or commands cause most of the problems. Fix those, and everyone benefits.
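
The paper's prescription is to find and fix that handful of worst offenders. As a rough illustration of the approach (not Google's actual tooling), here is a minimal Python sketch that aggregates command durations from a simplified, made-up log format and reports which user/command pairs consume the most server time; real Perforce server logs are richer and would need a real parser.

```python
import re
from collections import Counter

# Made-up log format for illustration only; real Perforce server logs
# are richer and would need a proper parser.
#   2011-03-01 12:00:01 user=alice cmd=sync lapse=45.2
LINE = re.compile(r"user=(\S+) cmd=(\S+) lapse=([\d.]+)")

def top_offenders(log_lines, n=10):
    """Sum wall-clock time per (user, command) pair and return the top n."""
    totals = Counter()
    for line in log_lines:
        match = LINE.search(line)
        if match:
            user, cmd, lapse = match.groups()
            totals[(user, cmd)] += float(lapse)
    return totals.most_common(n)

if __name__ == "__main__":
    sample = [
        "2011-03-01 12:00:01 user=alice cmd=sync lapse=45.2",
        "2011-03-01 12:00:05 user=buildbot cmd=files lapse=310.0",
        "2011-03-01 12:01:12 user=buildbot cmd=files lapse=295.7",
    ]
    for (user, cmd), seconds in top_offenders(sample):
        print(f"{user:12s} {cmd:10s} {seconds:8.1f}s")
```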

2. Hardware Resources: What Actually Matters

  • Memory: Important up to a point, but there are diminishing returns. Once you have enough RAM, adding more doesn't help much (though it doesn't hurt either - it gets used for disk caching).

  • Disk I/O: This is the critical bottleneck for large installations. Your disk setup matters enormously:

    • Use local disks, not network storage, for the database
    • Use RAID 10 to stripe data across multiple drives while keeping a mirrored copy for redundancy
    • Use solid-state drives (SSDs) for metadata if possible

  • CPU: Less critical than you might think for version control workloads

3. Clean Up Your Metadata Regularly

Metadata is information about your code (file paths, history, labels) rather than the code itself. Google found several effective cleanup strategies:

  • Delete unused workspaces and labels: Old developer workspaces that nobody uses anymore still consume server resources (a cleanup sketch follows this list)
  • Use sparse clients: Instead of downloading every file in a huge repository, developers only sync the files they actually need
  • Obliterate old branches: "Obliterate" means permanently deleting files and their entire history from the system. Google obliterated all branches older than two years (after giving users a chance to request exceptions), which:
    • Removed 11% of file paths from the system
    • Saved about $100,000 in infrastructure costs
    • Delayed a major hardware upgrade by three months
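
As a concrete illustration of the workspace-cleanup point above, here is a minimal Python sketch (not the tooling described in the paper) that flags client workspaces untouched for a year and, once you disable the dry-run guard, deletes them. The `p4 clients` and `p4 client -d` commands are real; the assumptions that `-ztag` output separates records with blank lines and reports `Access` as a Unix timestamp should be verified against your server version, and the one-year threshold is arbitrary.

```python
import subprocess
import time

STALE_DAYS = 365   # arbitrary threshold for this sketch
DRY_RUN = True     # flip to False only after reviewing the output

def tagged_records(args):
    """Run a p4 command with -ztag and yield one dict per record.

    Assumes tagged output of the form '... key value' with a blank
    line between records; verify against your server version.
    """
    out = subprocess.run(["p4", "-ztag"] + args,
                         capture_output=True, text=True, check=True).stdout
    record = {}
    for line in out.splitlines():
        if line.startswith("... "):
            key, _, value = line[4:].partition(" ")
            record[key] = value
        elif record:
            yield record
            record = {}
    if record:
        yield record

def stale_clients(days=STALE_DAYS):
    """Yield names of clients whose Access time is older than `days`."""
    cutoff = time.time() - days * 86400
    for rec in tagged_records(["clients"]):
        # Skip clients with no parsable Access field rather than delete them.
        if float(rec.get("Access", cutoff)) < cutoff:
            yield rec["client"]

if __name__ == "__main__":
    for name in stale_clients():
        if DRY_RUN:
            print(f"would delete client {name}")
        else:
            # -f (force) lets an admin delete other users' clients.
            subprocess.run(["p4", "client", "-d", "-f", name], check=True)
```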

4. Proactive Monitoring and Command Management

Google's philosophy: "You should never be notified about a problem by one of your users." They achieved this through:

  • Automated monitoring that sends alerts when issues arise
  • Automatic killing of runaway commands that consume excessive resources (see the sketch after this list)
  • The system acts before users notice problems
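
To make the runaway-command idea concrete, here is a minimal sketch built on the real `p4 monitor show` and `p4 monitor terminate` commands. The assumed output layout and the 30-minute threshold are mine, and the paper's actual watchdogs were more sophisticated, but the shape is the same: list running commands, find those that have run too long, and terminate them.

```python
import subprocess

MAX_SECONDS = 1800   # arbitrary 30-minute threshold for this sketch
DRY_RUN = True

def runaway_commands():
    """Yield (pid, user, seconds, command) for long-running commands.

    Assumes the usual `p4 monitor show` layout:
        <pid> <status> <user> <hh:mm:ss> <command> [args]
    Verify the format on your server before acting on it.
    """
    out = subprocess.run(["p4", "monitor", "show"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        pid, _status, user, elapsed, command = fields[:5]
        try:
            h, m, s = (int(x) for x in elapsed.split(":"))
        except ValueError:
            continue
        seconds = h * 3600 + m * 60 + s
        if seconds > MAX_SECONDS:
            yield pid, user, seconds, command

if __name__ == "__main__":
    for pid, user, seconds, command in runaway_commands():
        print(f"{user}'s `{command}` (pid {pid}) has been running for {seconds}s")
        if not DRY_RUN:
            # Marks the command for termination; requires admin access.
            subprocess.run(["p4", "monitor", "terminate", pid], check=True)
```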

One interesting tool: Google built "Mondrian," an internal code review dashboard that let developers browse and review code changes. It was extremely popular but also expensive in terms of server load - a good example of balancing user experience with system performance.

5. Distribute the Load Strategically

Google used several approaches to reduce strain on the main server:

  • Replica servers: Read-only copies of the main server that handle queries without impacting the primary system (a routing sketch follows this list)
  • Specialized databases: Instead of querying Perforce directly, they built separate systems containing much of the same metadata that could be queried more efficiently
  • File system integrations: Serve files through other means so developers don't always need to run sync commands
  • Multiple servers for separate projects: Completely unrelated projects can live on different servers, but be careful - you can't split servers arbitrarily, and every additional server adds administrative overhead
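
To show the shape of the replica idea, here is a hedged sketch of a tiny command router: read-only reporting commands go to a replica via the standard `p4 -p host:port` option, and everything else goes to the master. The server addresses and the set of replica-safe commands are assumptions (what a replica can answer depends on the replica type and server version), and in production this routing is normally handled by Perforce's own proxy, broker, and replication machinery rather than a wrapper script.

```python
import subprocess
import sys

# Hypothetical server addresses for illustration.
MASTER = "perforce-master:1666"
REPLICA = "perforce-replica:1666"

# Reporting commands that only read metadata. Treat this set as an
# assumption: which commands a replica can actually serve depends on
# the replica type and server version.
READ_ONLY = {"files", "filelog", "fstat", "changes", "describe",
             "print", "annotate", "dirs", "sizes"}

def run_p4(args):
    """Send read-only reporting commands to the replica and everything
    else to the master, using the standard `p4 -p host:port` option."""
    port = REPLICA if args and args[0] in READ_ONLY else MASTER
    return subprocess.run(["p4", "-p", port] + args)

if __name__ == "__main__":
    # Example: python p4route.py changes -m 5 //depot/main/...
    sys.exit(run_p4(sys.argv[1:]).returncode)
```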

6. Plan for Downtime (Because It Will Happen)

Google averaged about one hour of downtime per month, including planned maintenance. Their advice:

  • Maintain a test server that exactly mirrors production
  • Keep a hot standby server ready for failover (a basic health-check sketch follows this list)
  • Have a failover plan and actually test it regularly
  • Conduct postmortems after every outage with concrete action items to prevent recurrence
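
As a small illustration of the monitoring and failover advice, here is a minimal health-check sketch that runs `p4 info` against the primary and the hot standby and flags any server that fails to answer within a timeout. The addresses and timeout are placeholders; the monitoring the paper describes was far more extensive.

```python
import subprocess
import time

# Hypothetical server addresses for illustration.
SERVERS = {"primary": "perforce-master:1666",
           "standby": "perforce-standby:1666"}
TIMEOUT_SECONDS = 30

def check(port):
    """Return the `p4 info` response time in seconds, or None on failure."""
    start = time.monotonic()
    try:
        subprocess.run(["p4", "-p", port, "info"],
                       capture_output=True, timeout=TIMEOUT_SECONDS,
                       check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return time.monotonic() - start

if __name__ == "__main__":
    for name, port in SERVERS.items():
        elapsed = check(port)
        if elapsed is None:
            print(f"ALERT: {name} ({port}) is not answering `p4 info`")
        else:
            print(f"{name}: ok ({elapsed:.2f}s)")
```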

7. Reduce Administrative Burden Through Consistency

When managing multiple servers:

  • Make them all look identical from a user perspective
  • Use the same accounts and passwords across all servers
  • Invest heavily in automation
  • Document everything and share knowledge across the team
  • Consider longer licensing terms to reduce renewal overhead

Why This Paper Still Matters

This paper was written in 2011, and Google has since moved to their own custom system called Piper. So why read this paper today?

Because the fundamental challenges of scaling version control systems haven't changed. Whether you're using Perforce, Git, or any other version control system at scale, you'll face similar issues:

  • Database contention when many developers work simultaneously
  • Storage and performance optimization challenges
  • The need for proactive monitoring
  • Balancing user experience with system resources
  • Planning for failures and minimizing downtime

The paper's key insight is that there's often a pioneer's penalty - Google had to build custom solutions to problems that Perforce later solved natively. But the lessons learned about performance optimization, system monitoring, and operational best practices apply broadly to any large-scale infrastructure.

The Bottom Line: Running infrastructure at Google scale requires constant attention to performance bottlenecks, aggressive cleanup of technical debt, and building tools that catch problems before users do. These principles transcend any specific technology.


Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science, and publishing my notes here. This is #32 in this series.