Fault Tolerance and High Availability: Building Stable Distributed Systems


1. Everyday Analogy: Airplane “Failures” and Passenger Safety

Imagine a flight during which the airplane might encounter engine failure or severe turbulence. To keep passengers safe, engineers build in multiple backup systems and emergency procedures. Distributed systems face similar “failures,” and keeping them running despite those failures is a core design challenge.


2. Fault Models and Fault Handling

1. Common Fault Types

| Fault Type | Description | Analogy Example |
| --- | --- | --- |
| Node Failure | Server crash or shutdown | Airplane engine failure |
| Network Failure | Network partition, message loss or delay | Airplane communication cut-off |
| Software Bug | Program bug causing abnormal behavior | Flight system software flaw |
| Hardware Fault | Disk failure, memory error | Airplane instrument failure |

2. Fault Tolerance Goals

  • Detect faults: Quickly identify anomalies
  • Recover service: Replace or repair failed nodes
  • Maintain consistency: Ensure data correctness

3. Fault Tolerance Techniques

1. Retry Mechanism

Automatically retry failed requests; this works well for transient faults such as brief network glitches or dropped messages.

// Simple retry example (self-contained).
package main

import (
    "errors"
    "time"
)

// Retry calls op up to attempts times, pausing briefly between failed
// attempts; it returns nil as soon as one call succeeds.
func Retry(op func() error, attempts int) error {
    for i := 0; i < attempts; i++ {
        if err := op(); err == nil {
            return nil
        }
        time.Sleep(time.Millisecond * 100)
    }
    return errors.New("all retries failed")
}
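
To illustrate, here is a hypothetical usage of the helper above; doRequest is a stand-in operation (not a real API) that fails twice before succeeding:

package main

import (
    "errors"
    "log"
)

func main() {
    calls := 0
    // doRequest is a stand-in for a real network call; it fails twice
    // before succeeding, just to exercise the retry loop.
    doRequest := func() error {
        calls++
        if calls < 3 {
            return errors.New("temporary failure")
        }
        return nil
    }
    if err := Retry(doRequest, 5); err != nil {
        log.Fatal(err)
    }
    log.Printf("request succeeded after %d attempts", calls)
}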

2. Checkpoint

Periodically save the system state so that recovery can resume from the most recent snapshot instead of replaying everything from the beginning.

Checkpoint Illustration:

Running State ----> [Save Snapshot] ----> New State
    ↑                             |
    |-----------------------------|
      Recovery starts from snapshot
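
A rough sketch of how this can look in code, assuming the state fits in memory and serializes with encoding/json; the KVState struct and file layout are illustrative only:

package main

import (
    "encoding/json"
    "os"
)

// KVState stands in for whatever in-memory state the service must persist.
type KVState struct {
    Data map[string]string `json:"data"`
}

// SaveCheckpoint writes a snapshot to disk. Writing to a temporary file and
// renaming it keeps the snapshot intact even if the process crashes mid-write.
func SaveCheckpoint(path string, s *KVState) error {
    b, err := json.Marshal(s)
    if err != nil {
        return err
    }
    tmp := path + ".tmp"
    if err := os.WriteFile(tmp, b, 0o644); err != nil {
        return err
    }
    return os.Rename(tmp, path)
}

// LoadCheckpoint restores the latest snapshot; recovery then only needs to
// replay work done after the snapshot was taken.
func LoadCheckpoint(path string) (*KVState, error) {
    b, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var s KVState
    if err := json.Unmarshal(b, &s); err != nil {
        return nil, err
    }
    return &s, nil
}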

3. Failover

Automatically switch to standby nodes to ensure continuous service.

Failover Process:

Primary Node Failure
       ↓
Monitoring Detects
       ↓
Standby Node Takes Over
       ↓
Service Restored
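
A minimal sketch of the detection-and-takeover step, assuming a periodic heartbeat check; ping and promote are placeholders for real health-check RPCs and orchestration logic:

package main

import (
    "fmt"
    "time"
)

// Monitor pings the primary on a fixed interval and promotes the standby
// after maxMisses consecutive failed health checks.
func Monitor(ping func() error, promote func(), interval time.Duration, maxMisses int) {
    misses := 0
    for range time.Tick(interval) {
        if err := ping(); err != nil {
            misses++
            if misses >= maxMisses {
                fmt.Println("primary unreachable, promoting standby")
                promote()
                return
            }
            continue
        }
        misses = 0 // a healthy response resets the counter
    }
}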

4. High Availability and Service Level Agreement (SLA)

1. Availability Metric

  • Availability = (Uptime) / (Total Time)
  • A common target: 99.9% (“three nines”) availability, which allows roughly 8.7 hours of downtime per year (see the quick check below)
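
A quick back-of-the-envelope check of that figure, assuming a 365-day year:

package main

import "fmt"

func main() {
    hoursPerYear := 365.0 * 24.0
    // Downtime budget implied by each availability target.
    for _, a := range []float64{0.99, 0.999, 0.9999} {
        fmt.Printf("%.2f%% availability -> about %.2f hours of downtime per year\n",
            a*100, (1-a)*hoursPerYear)
    }
}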

2. SLA Definition

An SLA specifies the quality and availability commitments a service makes to its users, typically including response-time and recovery-time targets.

| SLA Metric | Description | Example |
| --- | --- | --- |
| Availability | Percentage of time the service is up | 99.9% |
| Response Time | Maximum time to serve a request | Within 100 ms |
| Recovery Time | Time to recover from a failure | Within 5 minutes |

5. Practical Observations and Debugging Tips

  • Monitoring Systems: Real-time health detection and alerting
  • Log Analysis: Trace fault causes and bottlenecks
  • Fault Injection: Simulate failures to verify resilience (see the sketch after this list)
  • Recovery Drills: Regular failover process testing
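
One simple way to inject faults (a sketch, not from the original post) is to wrap an operation so it fails at a configurable rate, then feed it to the Retry helper from section 3:

package main

import (
    "errors"
    "math/rand"
)

// Flaky wraps op so that it fails with probability failRate, a simple form
// of fault injection for exercising retry and failover paths in tests.
func Flaky(op func() error, failRate float64) func() error {
    return func() error {
        if rand.Float64() < failRate {
            return errors.New("injected fault")
        }
        return op()
    }
}

// Example, reusing the Retry helper from section 3:
//   err := Retry(Flaky(doRequest, 0.3), 5)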

6. Terminology Mapping Table

| Everyday Term | Technical Term | Explanation |
| --- | --- | --- |
| Backup Engine | Standby Node | Server that takes over when the primary fails |
| Repair Plane | Fault Recovery | Restoring the system to normal operation |
| Retry Attempt | Retry Mechanism | Automatic request resending on failure |
| Safety Net | Checkpoint | Periodic system snapshot |

7. Thought Exercises and Practice

  • How would you design retry strategies that avoid cascading failures (retry storms)?
  • How do checkpoints and logs work together during recovery?
  • Implement a simple failover detection and switchover module.

8. Conclusion: The Engineering Wisdom of Fault Tolerance and High Availability

Fault tolerance techniques and high-availability design form the foundation of business continuity in distributed systems. Understanding fault models, applying retries and checkpoints judiciously, and designing sound failover mechanisms and SLAs are essential skills for every distributed systems engineer.