Chapter 5: Fault Tolerance
What Happens When Things Go Wrong (And They Will)
🎭 The Improv Theater Analogy
In improv comedy, when an actor forgets their line or trips, the show must go on. The other actors adapt, cover, and continue seamlessly.
PCCL is like a well-trained improv troupe - when a peer "forgets their line" (crashes), the others adapt and continue training!
Why Fault Tolerance is Hard
💡 The Core Challenge
Fault tolerance isn't a feature you bolt on. It's a property of the ENTIRE system. If error paths aren't unwound correctly:
- System could stall (deadlock)
- System could enter inconsistent state
- Recovery could be slower than restart
Failure Scenario Matrix
| When Failure Occurs | What Happens | Recovery Action |
|---|---|---|
| Before collective starts | Peer detected as dead during vote | Remove peer, continue with smaller world_size |
| During reduce-scatter | Send/recv times out or errors | Abort signal → restore backup → retry without failed peer |
| During all-gather | Send/recv times out or errors | Abort signal → restore backup → retry without failed peer |
| During vote-complete | Peer doesn't acknowledge | Timeout → remove peer → operation still succeeds for others |
| During shared state sync | P2P transfer fails | Abort → retry sync from different peer |
| Master dies | All coordination stops | Restart master → peers reconnect → resume from last state |
The Abort Signal Mechanism
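When any peer reports a failure, the master broadcasts an abort to every remaining peer, and each peer then restores its backup (see the chapter summary). A toy model of the broadcast, using channels in place of PCCL's per-peer control connections (all names here are illustrative, not PCCL's API):

```rust
use std::sync::mpsc;

/// Toy model of abort propagation: the "master" sends an abort
/// message down every peer's control channel. In the real system
/// this travels over a dedicated connection per peer; std channels
/// stand in for it here. Returns how many peers received the abort.
fn broadcast_abort(peers: &[mpsc::Sender<&'static str>]) -> usize {
    let mut delivered = 0;
    for tx in peers {
        // A send fails only if the receiving peer is already gone;
        // a dead peer no longer needs the abort anyway.
        if tx.send("ABORT").is_ok() {
            delivered += 1;
        }
    }
    delivered
}
```

Note the design choice this models: a peer that died mid-broadcast is simply skipped, because the abort exists precisely to tell the *survivors* to restore their backups.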
Buffer Backup & Restore
⚠️ Critical for Correctness
Without buffer backup, a failed collective leaves you with CORRUPTED data - partially reduced, unusable. You'd have to reload from checkpoint!
```rust
fn all_reduce(buffer: &mut [f32]) -> Result<()> {
    // CRITICAL: Backup before ANY modification
    let backup = buffer.to_vec();

    // Attempt the collective
    match perform_ring_reduce(buffer) {
        Ok(()) => {
            // Success! Backup no longer needed
            Ok(())
        }
        Err(Aborted) => {
            // Failure! Restore original data
            buffer.copy_from_slice(&backup);
            Err(Aborted)
        }
    }
}
```
💡 Error Path ≈ Success Path
In PCCL, recovering from an error is almost as fast as succeeding. No expensive rollback mechanisms, no checkpoint reloads. Just:
- Restore buffer from backup (memcpy)
- Update topology (remove failed peer)
- Retry operation
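The three steps above can be sketched as a single retry loop. This is an illustration, not PCCL's actual API; the `attempt` closure stands in for one pass of the ring collective:

```rust
enum CollectiveError {
    /// The collective was aborted because a peer failed.
    Aborted { failed_peer: usize },
}

/// Retry an all-reduce until it succeeds, shrinking the topology
/// each time a peer fails. Returns Err(()) only if no peers remain.
fn all_reduce_with_retry(
    buffer: &mut Vec<f32>,
    peers: &mut Vec<usize>,
    mut attempt: impl FnMut(&mut [f32], &[usize]) -> Result<(), CollectiveError>,
) -> Result<(), ()> {
    loop {
        let backup = buffer.clone(); // 1. cheap backup (a memcpy)
        match attempt(buffer.as_mut_slice(), peers.as_slice()) {
            Ok(()) => return Ok(()),
            Err(CollectiveError::Aborted { failed_peer }) => {
                buffer.copy_from_slice(&backup);     // restore original data
                peers.retain(|&p| p != failed_peer); // 2. update topology
                if peers.is_empty() {
                    return Err(()); // nobody left to retry with
                }
                // 3. loop back and retry with the smaller world
            }
        }
    }
}
```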
The No-I/O Abort Check
Checking for abort signals must NOT add I/O overhead to the hot path: the abort arrives on a separate TCP stream handled by a background thread, so the collective itself only performs a memory read, never a syscall.
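The chapter summary attributes this to a lock-free queue fed from a separate TCP stream. One common way to realize that split is a background thread that owns the control stream and sets an atomic flag, which the hot path polls between ring steps. A sketch with hypothetical names (not PCCL's actual types):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

/// Set by a background thread that owns the control TCP stream;
/// read by the collective between send/recv steps. Checking it is
/// a single atomic load -- no syscalls, no I/O on the hot path.
#[derive(Clone, Default)]
struct AbortCheck(Arc<AtomicBool>);

impl AbortCheck {
    /// Background thread calls this when an abort packet arrives.
    fn signal(&self) {
        self.0.store(true, Ordering::Release);
    }

    /// Hot path calls this between ring steps; never blocks.
    fn aborted(&self) -> bool {
        self.0.load(Ordering::Acquire)
    }
}
```

The Release/Acquire pair ensures that whatever the background thread wrote before signaling (e.g. which peer failed) is visible to the collective once it observes the flag.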
Stress Testing
PCCL passes 8-hour stress tests on Linux, macOS, and Windows with:
- Peers spawned and killed every 500-1000 milliseconds
- High-frequency training loop (~100ms per iteration)
- Multiple concurrent collective operations
- Random peer churn throughout
As long as shared state advances correctly despite chaos, the test passes!
World Size Threshold
Q: What if ALL peers die?
A: PCCL supports a minimum world_size threshold. If peers drop below this, training pauses and waits for new arrivals.
Q: What about the model weights?
A: Weights are on the PEERS, not the master. Even if all peers die, when new peers join, they must provide the correct shared state hash (from checkpoint) to resume.
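Admission by hash can be sketched as a simple comparison on the master. The hashing scheme below is purely illustrative; this chapter does not specify PCCL's actual hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the shared state (here: raw weight bits plus a revision
/// counter) so a joining peer can prove it holds the same state.
/// ILLUSTRATIVE ONLY -- the real scheme may differ.
fn shared_state_hash(weights: &[f32], revision: u64) -> u64 {
    let mut h = DefaultHasher::new();
    revision.hash(&mut h);
    for w in weights {
        // Hash the bit pattern, not the float, so NaN/-0.0 are stable.
        w.to_bits().hash(&mut h);
    }
    h.finish()
}

/// Master-side check: a joining peer is admitted only if its hash
/// matches the hash the run has already agreed on.
fn admit_peer(expected: u64, candidate: u64) -> bool {
    expected == candidate
}
```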
What PCCL Does NOT Handle
⚠️ Limitations
- Byzantine failures: PCCL assumes peers are honest (crash-stop model)
- Network partitions: If master is unreachable, peers cannot coordinate
- Data corruption: PCCL detects via hash, but doesn't correct (triggers resync)
- Master HA: Single master - if it dies, coordination stops until restart
Chapter Summary
- Restrict to succeed: Limited operations make fault tolerance tractable
- Failure matrix: Every scenario has a defined recovery path
- Abort propagation: Master broadcasts abort, peers restore backups
- Buffer backup: Clone before modify, restore on failure
- No-I/O abort check: Lock-free queue, separate TCP stream
- Error ≈ success: Recovery is nearly as fast as normal operation
- Stress tested: 8 hours of chaos, 500ms peer churn