Chapter 5: Fault Tolerance
What Happens When Things Go Wrong (And They Will)
🎭 The Improv Theater Analogy
In improv comedy, when an actor forgets their line or trips, the show must go on. The other actors adapt, cover, and continue seamlessly.
PCCL is like a well-trained improv troupe - when a peer "forgets their line" (crashes), the others adapt and continue training!
Why Fault Tolerance is Hard
💡 The Core Challenge
Fault tolerance isn't a feature you bolt on. It's a property of the ENTIRE system. If error paths aren't unwound correctly:
- System could stall (deadlock)
- System could enter inconsistent state
- Recovery could be slower than restart
Failure Scenario Matrix
| When Failure Occurs | What Happens | Recovery Action |
|---|---|---|
| Before collective starts | Peer detected as dead during vote | Remove peer, continue with smaller world_size |
| During reduce-scatter | Send/recv times out or errors | Abort signal → restore backup → retry without failed peer |
| During all-gather | Send/recv times out or errors | Abort signal → restore backup → retry without failed peer |
| During vote-complete | Peer doesn't acknowledge | Timeout → remove peer → operation still succeeds for others |
| During shared state sync | P2P transfer fails | Abort → retry sync from different peer |
| Master dies | All coordination stops | Restart master → peers reconnect → resume from last state |
The Abort Signal Mechanism
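When any peer reports a failure, the master broadcasts an abort to every remaining peer, and each peer then restores its backup (see the chapter summary). A toy model of the broadcast, using channels in place of PCCL's per-peer control connections (all names here are illustrative, not PCCL's API):

```rust
use std::sync::mpsc;

/// Toy model of abort propagation: the "master" sends an abort
/// message down every peer's control channel. In the real system
/// this travels over a dedicated connection per peer; std channels
/// stand in for it here. Returns how many peers received the abort.
fn broadcast_abort(peers: &[mpsc::Sender<&'static str>]) -> usize {
    let mut delivered = 0;
    for tx in peers {
        // A send fails only if the receiving peer is already gone;
        // a dead peer no longer needs the abort anyway.
        if tx.send("ABORT").is_ok() {
            delivered += 1;
        }
    }
    delivered
}
```

Note the design choice this models: a peer that died mid-broadcast is simply skipped, because the abort exists precisely to tell the *survivors* to restore their backups.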
Buffer Backup & Restore
⚠️ Critical for Correctness
Without buffer backup, a failed collective leaves you with CORRUPTED data - partially reduced, unusable. You'd have to reload from checkpoint!
```rust
fn all_reduce(buffer: &mut [f32]) -> Result<()> {
    // CRITICAL: Backup before ANY modification
    let backup = buffer.to_vec();

    // Attempt the collective
    match perform_ring_reduce(buffer) {
        Ok(()) => {
            // Success! Backup no longer needed
            Ok(())
        }
        Err(Aborted) => {
            // Failure! Restore original data
            buffer.copy_from_slice(&backup);
            Err(Aborted)
        }
    }
}
```
💡 Error Path ≈ Success Path
In PCCL, recovering from an error is almost as fast as succeeding. No expensive rollback mechanisms, no checkpoint reloads. Just:
- Restore buffer from backup (memcpy)
- Update topology (remove failed peer)
- Retry operation
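The three steps above can be sketched as a single retry loop. This is an illustration, not PCCL's actual API; the `attempt` closure stands in for one pass of the ring collective:

```rust
enum CollectiveError {
    /// The collective was aborted because a peer failed.
    Aborted { failed_peer: usize },
}

/// Retry an all-reduce until it succeeds, shrinking the topology
/// each time a peer fails. Returns Err(()) only if no peers remain.
fn all_reduce_with_retry(
    buffer: &mut Vec<f32>,
    peers: &mut Vec<usize>,
    mut attempt: impl FnMut(&mut [f32], &[usize]) -> Result<(), CollectiveError>,
) -> Result<(), ()> {
    loop {
        let backup = buffer.clone(); // 1. cheap backup (a memcpy)
        match attempt(buffer.as_mut_slice(), peers.as_slice()) {
            Ok(()) => return Ok(()),
            Err(CollectiveError::Aborted { failed_peer }) => {
                buffer.copy_from_slice(&backup);     // restore original data
                peers.retain(|&p| p != failed_peer); // 2. update topology
                if peers.is_empty() {
                    return Err(()); // nobody left to retry with
                }
                // 3. loop back and retry with the smaller world
            }
        }
    }
}
```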
The No-I/O Abort Check
Checking for abort signals must NOT add I/O overhead to the hot path: the abort arrives on a separate TCP stream handled by a background thread, so the collective itself only performs a memory read, never a syscall.
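The chapter summary attributes this to a lock-free queue fed from a separate TCP stream. One common way to realize that split is a background thread that owns the control stream and sets an atomic flag, which the hot path polls between ring steps. A sketch with hypothetical names (not PCCL's actual types):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

/// Set by a background thread that owns the control TCP stream;
/// read by the collective between send/recv steps. Checking it is
/// a single atomic load -- no syscalls, no I/O on the hot path.
#[derive(Clone, Default)]
struct AbortCheck(Arc<AtomicBool>);

impl AbortCheck {
    /// Background thread calls this when an abort packet arrives.
    fn signal(&self) {
        self.0.store(true, Ordering::Release);
    }

    /// Hot path calls this between ring steps; never blocks.
    fn aborted(&self) -> bool {
        self.0.load(Ordering::Acquire)
    }
}
```

The Release/Acquire pair ensures that whatever the background thread wrote before signaling (e.g. which peer failed) is visible to the collective once it observes the flag.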
Stress Testing
PCCL passes 8-hour stress tests on Linux, macOS, and Windows with:
- Peers spawned and killed every 500-1000 milliseconds
- High-frequency training loop (~100ms per iteration)
- Multiple concurrent collective operations
- Random peer churn throughout
As long as shared state advances correctly despite chaos, the test passes!
World Size Threshold
Q: What if ALL peers die?
A: PCCL supports a minimum world_size threshold. If peers drop below this, training pauses and waits for new arrivals.
Q: What about the model weights?
A: Weights are on the PEERS, not the master. Even if all peers die, when new peers join, they must provide the correct shared state hash (from checkpoint) to resume.
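Admission by hash can be sketched as a simple comparison on the master. The hashing scheme below is purely illustrative; this chapter does not specify PCCL's actual hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the shared state (here: raw weight bits plus a revision
/// counter) so a joining peer can prove it holds the same state.
/// ILLUSTRATIVE ONLY -- the real scheme may differ.
fn shared_state_hash(weights: &[f32], revision: u64) -> u64 {
    let mut h = DefaultHasher::new();
    revision.hash(&mut h);
    for w in weights {
        // Hash the bit pattern, not the float, so NaN/-0.0 are stable.
        w.to_bits().hash(&mut h);
    }
    h.finish()
}

/// Master-side check: a joining peer is admitted only if its hash
/// matches the hash the run has already agreed on.
fn admit_peer(expected: u64, candidate: u64) -> bool {
    expected == candidate
}
```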
What PCCL Does NOT Handle
⚠️ Limitations
- Byzantine failures: PCCL assumes peers are honest (crash-stop model)
- Network partitions: If master is unreachable, peers cannot coordinate
- Data corruption: PCCL detects via hash, but doesn't correct (triggers resync)
- Master HA: Single master - if it dies, coordination stops until restart
Chapter Summary
- Restrict to succeed: Limited operations make fault tolerance tractable
- Failure matrix: Every scenario has a defined recovery path
- Abort propagation: Master broadcasts abort, peers restore backups
- Buffer backup: Clone before modify, restore on failure
- No-I/O abort check: Lock-free queue, separate TCP stream
- Error ≈ success: Recovery is nearly as fast as normal operation
- Stress tested: 8 hours of chaos, 500ms peer churn