Harvest, Yield, and Scalable Tolerant Systems
Abstract
The cost of reconciling consistency and state management with high availability is highly magnified by the unprecedented scale and robustness requirements of today's Internet applications. We propose two strategies for improving overall availability using simple mechanisms that scale over large applications whose output behavior tolerates graceful degradation. We characterize this degradation in terms of harvest and yield, and map it directly onto engineering mechanisms that enhance availability by improving fault isolation, and in some cases also simplify programming. By collecting examples of related techniques in the literature and illustrating the surprising range of applications that can benefit from these approaches, we hope to motivate a broader research program in this area.
What's This Paper About?
This 1999 paper by Armando Fox and Eric Brewer tackles a fundamental challenge in distributed systems: how do you keep a service running when parts of it fail? Instead of trying to make systems perfectly reliable (which is expensive and often impossible), the authors propose letting systems degrade gracefully - continuing to work with reduced functionality rather than failing completely.
The key insight is that for many applications, it's better to return partial results or serve some requests rather than shutting down entirely when things go wrong. Think of a search engine: if one server fails, it's better to return search results from the remaining servers than to show an error page to all users.
Core Concepts
The CAP Principle: Understanding the Trade-offs
The paper presents the CAP principle, later formalized as the CAP theorem (also known as Brewer's theorem, after co-author Eric Brewer). It states that a distributed system can guarantee at most two of these three properties simultaneously:
- Consistency: Every read gets the most recent write - all nodes see the same data at the same time
- Availability: Every request gets a response (without guarantee that it's the most recent data)
- Partition-tolerance: The system continues working even when network problems prevent some nodes from communicating
Strong CAP Principle: You can pick at most 2 out of 3. Since network failures are inevitable in distributed systems, you usually have to choose between consistency and availability.
Weak CAP Principle: In practice, many applications work best with reduced consistency or availability rather than absolute guarantees. For example:
- Bayou (a distributed database) trades some consistency for availability
- Coda (a distributed file system) explicitly prioritizes availability over strict consistency
- Leases (time-based consistency mechanisms) provide fault-tolerant consistency management
The key insight: the stronger your guarantees for any two properties, the weaker your guarantees for the third.
Harvest and Yield: Two Ways to Measure Success
Instead of thinking about systems as simply "working" or "broken," this paper introduces two metrics:
Yield: The fraction of requests that are successfully completed. If you answer 95 out of 100 requests, your yield is 95%. Numerically this tracks traditional "uptime," but it maps better to user experience: a second of downtime at peak load costs far more yield than a second at 3 a.m.
Harvest: The completeness of the data reflected in each response. If one of your ten database nodes is down, you might still answer queries, but each result reflects only about 90% of the total data - a 90% harvest.
When something fails, you face a choice:
- Return no answer at all (reduced yield)
- Return a partial answer (reduced harvest)
Example: Imagine a price comparison website that searches 10 retailers. If 2 retailers' servers are down:
- Low yield approach: Show an error message to users
- Low harvest approach: Show results from the 8 working retailers with a note that 2 are unavailable
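As a rough illustration of the low-harvest approach, here's a minimal Python sketch (the `fetchers` callables, one per retailer, are hypothetical): fan the query out, wait up to a deadline, and answer with whatever came back, reporting the harvest alongside the results.

```python
import concurrent.futures as cf

def search_retailers(fetchers, query, timeout_s=1.0):
    """Fan the query out to every retailer and answer with whatever arrives
    in time. Yield stays high (we always respond); harvest is the fraction
    of retailers that actually contributed."""
    pool = cf.ThreadPoolExecutor(max_workers=len(fetchers))
    futures = [pool.submit(fetch, query) for fetch in fetchers]
    results, answered = [], 0
    try:
        for fut in cf.as_completed(futures, timeout=timeout_s):
            try:
                results.extend(fut.result())
                answered += 1
            except Exception:
                pass           # a failed retailer reduces harvest, not yield
    except cf.TimeoutError:
        pass                   # slow retailers also reduce harvest, not yield
    pool.shutdown(wait=False)  # don't block the response on stragglers
    return results, answered / len(fetchers)
```

The caller can surface the harvest directly, e.g. "showing results from 8 of 10 retailers."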
Why "Harvest" and "Yield"?
The terms come from semiconductor manufacturing: each chip on a silicon wafer either works completely or fails (no partial functionality), and "yield" is the fraction of working chips. In distributed systems, the paper argues we can do better by accepting partial results (harvest) rather than all-or-nothing behavior.
Does This Work for Updates?
At first glance, partial results seem to only apply to read queries. But the model also works for "single-location updates" - changes that only affect one node or partition, such as:
- Personalization databases (updating one user's preferences)
- Collaborative filtering (updating one item's ratings)
The approach doesn't work well for updates that need to change data across many nodes simultaneously.
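A minimal sketch of why such updates fit the model (hypothetical in-memory partitions): the write touches exactly one partition, so failures elsewhere in the cluster never block it.

```python
def owning_partition(user_id, partitions):
    # Each user's data lives on exactly one partition.
    return partitions[hash(user_id) % len(partitions)]

def update_preference(user_id, key, value, partitions):
    """Single-location update: only the owning partition is written, so the
    update succeeds even while other partitions are down; only queries that
    need the dead partitions lose harvest."""
    shard = owning_partition(user_id, partitions)
    shard.setdefault(user_id, {})[key] = value
```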
Strategy 1: Trading Harvest for Yield (Probabilistic Availability)
The first strategy is simple: accept partial results to keep the system responding to requests.
Why not just replicate everything? You could copy all your data to multiple servers, but this is expensive and still can't guarantee 100% harvest or yield. Why? Because the Internet itself uses "best-effort" protocols - packets can get lost, connections can fail, and there are no absolute guarantees.
Real-world example: Transformation proxies for mobile devices trade harvest for yield. When a slow mobile connection requests a web page, the proxy might:
- Compress images more aggressively (reduced harvest - lower quality images)
- Strip out some content (reduced harvest - missing elements)
- But ensure the page actually loads (maintained yield)
Without this trade-off, the mobile device might timeout and get nothing at all (zero yield).
The key principle: it's often better to serve 100% of your users with 90% of the data than to serve only 90% of users with 100% of the data.
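A minimal sketch of the transformation-proxy trade, with hypothetical `fetch_full` and `fetch_distilled` callables: when the full page won't fit the client's budget, serve the distilled version rather than letting the request time out.

```python
def serve_page(fetch_full, fetch_distilled, budget_bytes):
    """Trade harvest for yield: a degraded page that arrives beats a complete
    page that a slow mobile client never finishes downloading."""
    page = fetch_full()
    if len(page) <= budget_bytes:
        return page              # full harvest
    return fetch_distilled()     # reduced harvest (recompressed images,
                                 # stripped content), preserved yield
```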
Strategy 2: Application Decomposition and Orthogonal Mechanisms
The second strategy is more sophisticated: break your application into independent subsystems that can fail separately without bringing down the whole system.
Breaking Systems into Pieces
Imagine an e-commerce site with these subsystems:
- Product catalog (show items for sale)
- Recommendations (suggest related items)
- Reviews (show customer ratings)
- Shopping cart (track items to purchase)
Each subsystem might fail individually (reducing yield for that component), but the overall application continues working with reduced functionality. For example, if recommendations fail, users can still browse and buy products - they just don't see personalized suggestions.
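A minimal sketch of this decomposition (the subsystem callables are hypothetical): each section of the page is fetched independently, so a failing subsystem degrades only its own section.

```python
def render_product_page(subsystems):
    """subsystems maps a section name ('catalog', 'recommendations', ...) to a
    zero-argument callable that fetches that section's content."""
    page = {}
    for name, fetch in subsystems.items():
        try:
            page[name] = fetch()
        except Exception:
            page[name] = None  # e.g. recommendations down: render without them
    return page
```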
This decomposition provides two key benefits:
Benefit 1 - Tailored State Management: You can manage each subsystem's data differently, providing strong consistency or persistent storage only where needed. For example:
- Shopping cart: needs strong consistency (can't lose items)
- Recommendations: can use cached data (temporary inconsistency is fine)
- Reviews: needs persistent storage (can't lose customer feedback)
This saves significant complexity and cost, especially when only a few small subsystems need the heavy machinery of strong consistency.
Benefit 2 - Orthogonal Mechanisms: This is where it gets interesting. Instead of building layered systems where each layer depends on the layer below it, you can build orthogonal (independent) mechanisms.
What Are Orthogonal Mechanisms?
An orthogonal mechanism is independent of other mechanisms and has essentially no runtime interface to them (perhaps only a configuration interface). "Orthogonal" is meant in the geometric sense of independent axes: the mechanisms operate along separate dimensions and don't interfere with each other during operation.
Why does this matter?
Research shows that:
- Software complexity grows roughly with the square of the number of engineers, an observation from Brooks's The Mythical Man-Month (the number of communication paths grows quadratically)
- Most failures in complex systems come from unexpected interactions between components, not bugs within components (Leveson's research)
Conclusion: fewer interactions between components means dramatically lower complexity and fewer failure modes. Less machinery is (quadratically) better.
Traditional approach: Narrow interface layers between subsystems (e.g., API boundaries)
Better approach: Orthogonal mechanisms that don't need to talk to each other at runtime
Programming With Orthogonal Mechanisms
The paper references SNS (the authors' cluster-based Scalable Network Services platform) and SRM (Scalable Reliable Multicast) as examples. The key point: state maintenance was orthogonal to SNS - no interfaces or behaviors in SNS had to be modified to support SRM-based applications. The two mechanisms worked independently, side by side.
Real-World Examples of Orthogonal Mechanisms
The beauty of orthogonal mechanisms is that they work independently to provide functionality without requiring changes to your application code:
Security: SSL/TLS
SSL (Secure Sockets Layer, the predecessor of TLS) is a good example of orthogonal security. An initial handshake establishes a secure channel over an ordinary network connection; after that, the application protocol runs on top of it unchanged. Your application code doesn't need to handle encryption - SSL does it independently, and you write essentially the same code whether you're speaking HTTP or HTTPS.
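A minimal sketch of that orthogonality using Python's standard library: the application-level code is identical; TLS is wrapped around the socket (or not) before the application protocol starts.

```python
import socket
import ssl

def fetch_home_page(host, port, secure):
    sock = socket.create_connection((host, port))
    if secure:
        # The TLS handshake happens here, outside the application protocol.
        sock = ssl.create_default_context().wrap_socket(sock, server_hostname=host)
    # Everything below is the same whether or not the channel is encrypted.
    sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)
```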
Safety: Sandboxing
Various sandboxing techniques provide orthogonal safety:
- Stack overflow guards: Detect when a program overruns its stack rather than letting it silently corrupt memory
- System call monitoring: Watch for suspicious system operations
- Software fault isolation: Contain errors within specific parts of your program
These mechanisms run independently of your application logic, providing protection without requiring changes to your code.
Scaling: Load Balancing
Application developers don't need to write code for:
- Replication (copying data to multiple servers)
- Load management (distributing requests across servers)
- High availability (keeping the service running)
Simple mechanisms handle these functions for all applications automatically.
Why Orthogonal Mechanisms Win
Benefit 1 - Compile-time checking: Since orthogonal subsystems don't interact at runtime, you can check for harmful interactions when you build the system rather than discovering them during operation.
Benefit 2 - Better fault containment: When orthogonal guard mechanisms are in place, the runtime interactions that do occur are more robust because failures are contained.
The paper's contribution is identifying this pattern across many different systems and showing how orthogonal composition improves robustness. The common theme: simple, independent mechanisms with small state spaces are easier to reason about and more reliable.
Key Takeaways and Practical Techniques
The authors identify several concrete mechanisms that simplify engineering while improving robustness and scalability:
1. Simple Mechanisms with Small State Spaces
These mechanisms are easy to understand and reason about:
- Timeout-based failure handling: If a server doesn't respond within X seconds, assume it failed and move on
- Guard timers: Set maximum time limits for operations to prevent indefinite waiting
- Orthogonal security: SSL/TLS working independently of application code
These are inspired by safety-critical systems where simplicity and predictability are paramount.
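To make the first two items concrete, here's a minimal sketch of a timeout-guarded remote call (hypothetical helper): if the backend doesn't answer within the guard interval, the caller gives up and degrades instead of waiting indefinitely.

```python
import socket

def guarded_call(host, port, payload, timeout_s=2.0):
    """Treat a slow server as a failed server: give up after timeout_s and let
    the caller degrade (reduced harvest) rather than hang (reduced yield)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)   # guard timer on the reply as well
            sock.sendall(payload)
            return sock.recv(4096)
    except OSError:                      # covers connection and timeout errors
        return None                      # caller falls back to a partial or cached answer
```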
2. Separating Concerns Through Orthogonalization
Keep application logic separate from availability mechanisms. Your application code focuses on business logic, while independent systems handle:
- High availability
- Load balancing
- Fault tolerance
The SNS and SRM example shows this separation in action - they work together without being intertwined.
3. Soft State Instead of Hard State
This is a powerful technique that deserves explanation:
Hard state: Persistent data that must be explicitly saved, managed, and deleted. If it's lost, the system breaks.
Soft state: Temporary data that automatically expires unless refreshed. If it's lost, the system rebuilds it.
Example: A load balancer tracking which servers are healthy
- Hard state approach: Maintain a database of server status, update it carefully, handle inconsistencies
- Soft state approach: Servers send "I'm alive" heartbeats every 5 seconds; if you don't hear from a server for 15 seconds, assume it's dead
Why soft state is better:
- Recovery code is the same as normal code (just rebuild the state)
- Simpler to implement and debug
- Automatically handles many failure modes
The SNS load balancer uses soft state inspired by IP multicast routing - servers continuously announce their presence rather than being tracked in a persistent database.
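A minimal sketch of the soft-state approach (an in-memory stand-in, not the actual SNS implementation): servers refresh their entry with periodic heartbeats, and an entry that isn't refreshed simply ages out.

```python
import time

class SoftStateRegistry:
    """Server membership as soft state: nothing is persisted and nothing is
    explicitly deleted; an entry disappears once it stops being refreshed."""

    def __init__(self, ttl_s=15.0):
        self.ttl_s = ttl_s
        self.last_seen = {}            # server_id -> time of last heartbeat

    def heartbeat(self, server_id):
        # Servers announce "I'm alive" every few seconds; joining is just refreshing.
        self.last_seen[server_id] = time.monotonic()

    def live_servers(self):
        # A crashed server ages out automatically; after a registry restart,
        # the next round of heartbeats rebuilds the state from scratch.
        now = time.monotonic()
        return [s for s, t in self.last_seen.items() if now - t < self.ttl_s]
```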
4. Making Large-Scale Engineering Tractable
Cluster-based Internet services can rival or exceed even expensive specialized systems (like Teradata's 768-node data mining cluster) in size and capacity. The techniques in this paper make it possible to manage such massive scale without drowning in complexity.
The Philosophy
The authors consistently chose:
- Simple techniques to make programming manageable
- Good fault isolation to preserve the natural isolation benefits of using separate machines (clusters)
The result: systems that are easier to build, understand, and maintain, while being more robust at scale.
Summary: Why This Paper Matters
Published in 1999, this paper was prescient about the challenges of building large-scale Internet services. Today, these ideas underpin much of modern cloud architecture:
Core Ideas:
- Embrace graceful degradation - It's better to serve partial results than to fail completely
- Measure what matters - Track both yield (request completion) and harvest (result completeness)
- Design for failure - Use simple, orthogonal mechanisms that fail independently
- Choose soft state - Temporary, self-refreshing data is easier to manage than persistent state
Modern Applications:
- Content Delivery Networks (CDNs) trade harvest for yield by serving cached content when origin servers are slow
- Microservices architectures use decomposition and orthogonal mechanisms to isolate failures
- Load balancers use soft state (health checks) to track server availability
- Cloud platforms provide orthogonal services (logging, monitoring, scaling) that work independently of your application
The Big Picture: Rather than trying to build perfectly reliable systems (which is impossibly expensive), we can build systems that degrade gracefully and recover automatically. By using simple mechanisms that operate independently, we can manage enormous scale with reasonable complexity.
This paper's insights remain relevant 25+ years later because they acknowledge fundamental trade-offs in distributed systems and provide practical strategies for dealing with them.
Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science, and publishing my notes here. This is #24 in this series.