Borg, Omega, and Kubernetes
Introduction
This paper documents Google's decade-long journey building three successive container-management systems. If you've heard of Kubernetes, this is the story of how it came to be, along with its two predecessors.
What are containers?
Before diving in, let's clarify what containers are. Think of a container as a lightweight package that includes an application and everything it needs to run (libraries, dependencies, configurations). Unlike virtual machines that virtualize hardware, containers share the host operating system's kernel but isolate the application from other processes. This makes them fast to start and efficient with resources.
The Three Systems
Borg (Google's first generation) was built to manage both long-running services (like web servers that run continuously) and batch jobs (like data processing tasks that run once and complete). Previously, these were handled by two separate systems called Babysitter and the Global Work Queue. Borg unified them into a single system that could manage thousands of machines efficiently.
Omega (the second generation) was an experimental redesign of Borg, driven by a desire to improve its software architecture. The key innovation was storing the cluster state in a centralized database-like store that used:
- Paxos: A consensus algorithm that ensures multiple computers agree on data even if some fail or messages get lost (think of it as a reliable voting system for computers)
- Optimistic concurrency control: A technique where different components assume they won't conflict with each other and only check for conflicts when saving changes (like editing a Google Doc where you assume no one else is editing simultaneously)
This architecture allowed different parts of the system (like schedulers) to work independently as peers, rather than everything going through one central controller.
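To make optimistic concurrency control concrete, here's a minimal sketch in Go of a versioned store with compare-and-swap writes. All names here are invented for illustration; this is not Omega's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Record is a hypothetical versioned entry in the shared store.
type Record struct {
	Value   string
	Version int
}

var ErrConflict = errors.New("version conflict: re-read and retry")

// Store is a stand-in for the Paxos-backed store described in the paper.
type Store struct{ rec Record }

func (s *Store) Read() Record { return s.rec }

// Write commits only if nothing else has committed since readVersion was read.
func (s *Store) Write(value string, readVersion int) error {
	if s.rec.Version != readVersion {
		return ErrConflict // another component won the race; the caller retries
	}
	s.rec = Record{Value: value, Version: readVersion + 1}
	return nil
}

func main() {
	s := &Store{rec: Record{Value: "initial", Version: 1}}

	// Two schedulers read the same state, then both try to write.
	a := s.Read()
	b := s.Read()

	fmt.Println(s.Write("scheduler A's placement", a.Version)) // <nil>: commit succeeds
	fmt.Println(s.Write("scheduler B's placement", b.Version)) // conflict: B must re-read and retry
}
```

The optimistic part is that neither scheduler takes a lock; conflicts are detected only at commit time, which is cheap when conflicts are rare.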
Kubernetes (the third generation) was designed for a different world: one where external developers needed access to container management, and Google was building a cloud business. Kubernetes is open source, unlike its predecessors which were Google-internal systems. It takes a middle-ground approach: like Omega, it has a shared store that components watch for changes, but unlike Omega, it doesn't expose the raw database. Instead, all access goes through a REST API (a standardized web interface) that enforces rules, validates data, and handles versioning. This makes it safer for diverse users and tools to interact with the system.
Why Containers Matter
Efficiency through resource sharing: The resource isolation provided by containers has enabled Google to drive utilization significantly higher than industry norms. Here's how it works:
Production services (like Gmail or Search) typically reserve more resources than they usually need. This extra capacity helps them handle traffic spikes and server failures. But most of the time, this reserved capacity sits idle. Containers make it possible to reclaim these idle resources to run batch jobs (like data processing or machine learning training), dramatically improving overall cluster efficiency.
Limitations to understand: Container isolation isn't perfect. Containers can't prevent interference in hardware resources that the operating system kernel doesn't directly control, such as:
- Level 3 (L3) processor caches - the fast memory built into CPUs
- Memory bandwidth - how fast data can move to and from RAM
For this reason, in multi-tenant cloud environments where you can't trust all users, containers need an additional security layer (such as virtual machines) to protect against malicious actors.
More than just isolation: A modern container includes two key components:
- Isolation: Separating the application from other processes
- Image: A package containing all the files needed to run the application (code, runtime, libraries, dependencies)
Application-Oriented Infrastructure
Containerization represents a fundamental shift: instead of organizing data centers around machines, we organize them around applications. This mindset shift has profound implications.
Consistent Environments Across Development and Production
The problem containers solve: Traditionally, developers faced the "it works on my machine" problem. Code that ran fine on a developer's laptop would break in production because of subtle differences in operating system versions, installed libraries, or system configurations.
How containers help: By separating the container image from the host operating system, containers provide the same deployment environment in both development and production. This dramatically improves deployment reliability and speeds up development by eliminating inconsistencies.
The dependency relationship: When done correctly, the container only depends on the Linux kernel's system-call interface (the standardized way programs talk to the operating system). This limited interface makes containers highly portable across different Linux distributions and versions.
Remaining challenges: Even this approach isn't perfect. Applications can still be affected by changes in:
- Socket options (network communication settings)
- The /proc filesystem (where Linux exposes system information)
- ioctl calls (input/output control operations)
These interfaces have a large surface area and occasionally change between kernel versions.
Containers as the Unit of Management
Identity shift: Building management APIs around containers instead of machines fundamentally changes how we think about the data center. In database terms, the "primary key" shifts from machine to application.
Metadata and communication: Kubernetes attaches key-value annotations to each container's metadata to communicate application structure. For example, you might annotate a container with its version number, team ownership, or environment (staging vs. production). These annotations can be set by the container itself or by other parts of the management system, like a deployment tool rolling out updates.
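A tiny sketch of the idea (the keys below are invented for illustration, not standard Kubernetes annotations):

```go
package main

import "fmt"

func main() {
	// Annotations are free-form key-value strings attached to an
	// object's metadata, set by the app itself or by tooling around it.
	annotations := map[string]string{
		"build/version": "1.4.2",
		"owner/team":    "search-infra",
		"environment":   "staging",
	}
	fmt.Println(annotations)
}
```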
Ripple effects throughout the stack: This application-centric approach changes everything:
- Load balancing: Instead of distributing traffic across machines, load balancers distribute traffic across application instances. If you have 10 instances of a web service, traffic goes to those 10 instances regardless of which machines they're running on.
- Logging: Logs are organized by application, not by machine. This makes it easy to collect and view all logs for a service without pollution from other applications or system operations running on the same machine.
- Monitoring and debugging: You can detect application failures and identify root causes without having to filter out irrelevant machine-level signals.
Pods: Grouping containers together: While we often think of one container per application, real systems frequently need to group multiple containers that should run together on the same machine:
- The outer container provides a pool of shared resources (CPU, memory, network)
- The inner containers provide deployment isolation for different components
In Borg, this outer container is called an "alloc" (short for allocation). In Kubernetes, it's called a pod. Kubernetes standardized this concept: every application container always runs inside a pod, even if that pod contains just a single container. This consistency simplifies the system's design.
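As a rough sketch of the pod idea, using hypothetical Go types rather than the real Kubernetes API: the outer object carries the shared resource budget, while the inner containers remain separately deployable.

```go
package main

import "fmt"

// Illustrative types only; the real Kubernetes API differs.
type Container struct {
	Name  string
	Image string
}

type Pod struct {
	Name       string
	CPUMilli   int // shared CPU budget for everything in the pod
	MemoryMB   int // shared memory budget
	Containers []Container
}

func main() {
	pod := Pod{
		Name:     "web-frontend",
		CPUMilli: 1000,
		MemoryMB: 512,
		Containers: []Container{
			{Name: "server", Image: "example/web:1.0"},       // the application itself
			{Name: "log-saver", Image: "example/logger:2.3"}, // a helper, updated independently
		},
	}
	fmt.Printf("%+v\n", pod)
}
```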
Orchestration is the Beginning, Not the End
API consistency reduces complexity: Kubernetes could have become overwhelmingly complex, but it avoids this by using a consistent structure for all objects. Every Kubernetes object has three standard fields:
- ObjectMetadata: Same for all objects, containing:
  - Name and unique identifier (UID)
  - Version number (for optimistic concurrency control)
  - Labels (key-value pairs for organizing objects)
- Spec: Describes the desired state - "what you want"
- Status: Reports the current state - "what actually exists" (read-only)
This three-part structure applies whether you're describing a pod, a service, or any other Kubernetes resource. The contents of Spec and Status vary by object type, but the concept remains constant.
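A simplified sketch of that shape, with illustrative names rather than the actual Kubernetes Go types:

```go
package main

import "fmt"

// ObjectMetadata is the part shared by every object.
type ObjectMetadata struct {
	Name            string
	UID             string
	ResourceVersion int               // drives optimistic concurrency control
	Labels          map[string]string // key-value pairs for organizing objects
}

type ReplicaSpec struct {
	Replicas int // desired state: "what you want"
}

type ReplicaStatus struct {
	ReadyReplicas int // observed state: "what actually exists"
}

// ReplicatedPodSet shows the three-part structure for one made-up resource.
type ReplicatedPodSet struct {
	Metadata ObjectMetadata
	Spec     ReplicaSpec
	Status   ReplicaStatus
}

func main() {
	obj := ReplicatedPodSet{
		Metadata: ObjectMetadata{Name: "frontend", UID: "abc-123",
			ResourceVersion: 7, Labels: map[string]string{"role": "frontend"}},
		Spec:   ReplicaSpec{Replicas: 5},
		Status: ReplicaStatus{ReadyReplicas: 3},
	}
	fmt.Printf("%+v\n", obj)
}
```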
Different types of workloads: Kubernetes provides three main controllers for running replicated pods:
- ReplicationController: For long-running services like web servers that should always be running
- DaemonSet: Ensures exactly one instance runs on each machine (useful for logging agents or monitoring tools)
- Job: For batch workloads that run to completion, like data processing tasks (can be parallelized)
Reconciliation loops: The secret to resilience: All three systems (Borg, Omega, and Kubernetes) use a powerful pattern called the reconciliation loop:
- Observe the current state (e.g., "there are 3 running pods")
- Compare it to the desired state (e.g., "there should be 5 pods")
- Take action to close the gap (e.g., "start 2 more pods")
- Repeat continuously
This approach is remarkably robust. If a controller crashes and restarts, it simply observes the current state and continues working toward the desired state. There's no complex state machine to get out of sync.
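Here is a minimal sketch of that loop in Go, using an invented in-memory "cluster" in place of the real API server (actual Kubernetes controllers use watches and work queues rather than polling):

```go
package main

import (
	"fmt"
	"time"
)

// Cluster is a hypothetical stand-in for the real API server.
type Cluster struct{ running int }

func (c *Cluster) CountRunningPods() int { return c.running }
func (c *Cluster) StartPods(n int)       { c.running += n; fmt.Printf("started %d pods\n", n) }
func (c *Cluster) StopPods(n int)        { c.running -= n; fmt.Printf("stopped %d pods\n", n) }

func main() {
	cluster := &Cluster{running: 3}
	desired := 5 // the desired state, e.g. a ReplicationController's Spec

	for i := 0; i < 3; i++ { // bounded here so the example terminates
		current := cluster.CountRunningPods() // 1. observe the current state
		switch {                              // 2. compare with the desired state
		case current < desired:
			cluster.StartPods(desired - current) // 3. act to close the gap
		case current > desired:
			cluster.StopPods(current - desired)
		default:
			fmt.Println("in sync")
		}
		time.Sleep(100 * time.Millisecond) // 4. repeat
	}
}
```

Note that the loop holds no history: a restarted controller just observes and reconciles again, which is exactly why crashes are cheap.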
Choreography vs. Orchestration: Kubernetes uses a "choreography" model rather than centralized "orchestration":
- Orchestration (centralized): One conductor tells everyone exactly what to do, step by step
- Choreography (decentralized): Multiple autonomous components each handle their own concerns, and the desired behavior emerges from their collaboration
While centralized orchestration seems simpler initially, it becomes brittle over time. When unexpected errors occur, a choreographed system adapts better because each component continues working toward its local goals independently.
Design Lessons: Things to Avoid
Through years of experience, Google learned several important lessons about what NOT to do when building container management systems.
1. Don't Make the Container System Manage Port Numbers
The problem with Borg: All containers on a Borg machine share the host's IP address, so Borg had to assign each container unique port numbers as part of scheduling. Applications then had to be told which ports they were given, and anything that assumed well-known ports broke - a persistent operational headache.
Kubernetes's solution: Give each pod its own IP address, aligning network identity with application identity. This seemingly simple change has major benefits:
- Applications can use standard, well-known ports (like port 80 for HTTP) without conflicts
- Off-the-shelf software works without modification
- Existing network tools work as expected for segmentation, bandwidth control, and monitoring
How it works: On bare metal servers, you can use Software Defined Networking (SDN) overlays or configure Layer 3 (IP-level) routing to give each pod a unique IP, even when multiple pods share the same physical machine.
2. Don't Just Number Containers: Give Them Labels
Why labels matter: In large systems with thousands of containers, naming them "container-1", "container-2", etc. is useless. You need rich metadata to organize and select groups of containers.
How labels work: Kubernetes uses key-value pairs as labels. For example:
- role=frontend and stage=production identifies production frontend servers
- role=database and team=search identifies database containers owned by the search team
Flexibility: Labels can be dynamically added, removed, or modified by automated tools or users. Different teams can manage their own labels independently without coordination.
Label selectors: You can query objects using expressions like stage==production && role==frontend to find all production frontend pods.
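A minimal sketch of equality-based selector matching (the helper below is made up; the real selector grammar also supports set-based expressions):

```go
package main

import "fmt"

// matches reports whether an object's labels satisfy every
// key=value requirement in the selector (equality matching only).
func matches(labels, selector map[string]string) bool {
	for key, want := range selector {
		if labels[key] != want {
			return false
		}
	}
	return true
}

func main() {
	podLabels := map[string]string{"role": "frontend", "stage": "production", "team": "search"}
	selector := map[string]string{"stage": "production", "role": "frontend"}
	fmt.Println(matches(podLabels, selector)) // true: this pod would be selected
}
```

Because selection is nothing more than label matching, changing a pod's labels is also what makes the orphan/adopt pattern described below possible.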
3. Be Careful with Ownership
The potential problem: In Kubernetes, controllers (like ReplicationControllers) determine which pods they manage using label selectors. If you're not careful, multiple controllers might claim the same pod, creating conflicts.
Why you need to be careful: Imagine if two ReplicationControllers both selected the same pod. One might try to delete it while the other tries to keep it running, creating confusion.
The upside: While this requires careful configuration, the flexibility enables powerful patterns. You can "orphan" a pod (remove it from one controller's management) and "adopt" it under another controller, useful for gradual migrations and testing.
4. Don't Expose Raw State
The architectural difference: This is a key difference between the three systems:
- Borg: Monolithic system where everything goes through a central master
- Omega: Fully decentralized - components directly access the shared database
- Kubernetes: Middle ground - components are decentralized but all access goes through an API server
Why the middle ground works: Kubernetes gets the benefits of Omega's decentralized, scalable architecture while still enforcing system-wide rules. The API server:
- Validates all changes (preventing invalid configurations)
- Applies default values (reducing boilerplate)
- Handles versioning (supporting rolling upgrades of the system itself)
- Enforces policies (security, quotas, etc.)
- Hides database implementation details (allowing the storage layer to change)
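As a sketch of the first two of these responsibilities, with invented types rather than Kubernetes's real admission machinery:

```go
package main

import (
	"errors"
	"fmt"
)

// PodSetRequest is an illustrative object, not a real Kubernetes type.
type PodSetRequest struct {
	Name     string
	Replicas int
}

// admit runs before anything reaches the store: it rejects invalid
// objects and fills in defaults, so clients never touch raw state.
func admit(req *PodSetRequest) error {
	if req.Name == "" {
		return errors.New("name is required") // validation
	}
	if req.Replicas == 0 {
		req.Replicas = 1 // defaulting: less boilerplate for callers
	}
	return nil
}

func main() {
	good := &PodSetRequest{Name: "frontend"}
	fmt.Println(admit(good), good.Replicas) // <nil> 1

	bad := &PodSetRequest{}
	fmt.Println(admit(bad)) // name is required
}
```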
Open Problems (Still Unsolved)
Even after a decade of development, some problems remain stubbornly difficult. Here are two major challenges that the paper identifies:
1. Configuration Complexity
The problem: Application configuration becomes a dumping ground for everything the container system doesn't handle yet. As requirements grow, configuration systems evolve their own programming languages to handle logic.
How it happens: It starts innocently. Someone needs to adjust memory allocation based on the number of database shards. So the configuration language adds basic math. Then someone needs conditional logic. Then loops. Eventually, the configuration language becomes "Turing complete" (capable of any computation a programming language can do).
Why this is bad:
- You've recreated the problem you tried to solve (hard-coded parameters in code) but in a worse form
- Instead of using a real programming language with debuggers, IDE support, and testing frameworks, you're using a custom configuration language with poor tooling
- Configurations become "configuration as code" - just as hard to understand and debug as regular code
- Operational complexity doesn't decrease; it just moves to a different layer
Still unsolved: The paper doesn't offer a solution, acknowledging this remains an open problem in the field.
2. Dependency Management
The scope of the problem: When you deploy a service, you rarely deploy just that service. You need:
- Monitoring and alerting systems
- Logging infrastructure
- Databases or storage systems
- CI/CD pipelines
- Service discovery
- And more...
Why it's hard: Almost no container system captures these dependencies in a usable way. Two approaches have been tried, both with problems:
- Manual declaration: Ask developers to document their dependencies.
  - Problem: Documentation gets out of date quickly as services evolve.
- Automatic detection: Trace which other services an application actually communicates with.
  - Problem: You see the "what" but not the "why" - missing semantic information about whether a dependency is critical or optional.
A potential path forward: Require applications to explicitly declare their dependencies upfront, and have the infrastructure enforce this by blocking access to undeclared services. This is strict but prevents hidden dependencies from accumulating.
Status: Like configuration, this remains largely unsolved at the infrastructure level.
Key Takeaways
If you're new to container orchestration, here's what you should remember from this paper:
- Containers enable efficiency: By isolating applications while sharing hardware, containers let Google run both latency-sensitive services and batch jobs on the same machines, dramatically improving utilization.
- Evolution through learning: Each generation (Borg → Omega → Kubernetes) incorporated lessons from the previous one. Kubernetes represents a decade of accumulated wisdom.
- Application-centric thinking: Moving from machine-oriented to application-oriented infrastructure fundamentally changes how you build and operate systems.
- Reconciliation loops are powerful: Continuously comparing desired state with actual state and correcting differences creates remarkably resilient systems.
- Choreography scales better than orchestration: Decentralized, autonomous components collaborating are more robust than centralized control.
- Some problems persist: Configuration management and dependency tracking remain challenging even for the most advanced systems.
Further Reading
Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science and publishing my notes here. This is #1 in the series.