Chapter 5: Fault Tolerance

What Happens When Things Go Wrong (And They Will)

🎭 The Improv Theater Analogy

In improv comedy, when an actor forgets their line or trips, the show must go on. The other actors adapt, cover, and continue seamlessly.

PCCL is like a well-trained improv troupe - when a peer "forgets their line" (crashes), the others adapt and continue training!

Why Fault Tolerance is Hard

πŸ’‘ The Core Challenge

Fault tolerance isn't a feature you bolt on. It's a property of the ENTIRE system. If error paths aren't unwound correctly:

Previous PCCL Attempt: "Let peers join and leave anytime! Maximum flexibility!"
Reality: "That creates infinite possible states. You can't handle all error paths."
Previous PCCL: "But users want flexibilityβ€”"
Reality: "Users want it to WORK. Restrict operations. Make states enumerable."
Current PCCL: "One operation at a time. Explicit state machine. Now I can handle every failure."

Failure Scenario Matrix

When Failure Occurs What Happens Recovery Action
Before collective starts Peer detected as dead during vote Remove peer, continue with smaller world_size
During reduce-scatter Send/recv times out or errors Abort signal β†’ restore backup β†’ retry without failed peer
During all-gather Send/recv times out or errors Abort signal β†’ restore backup β†’ retry without failed peer
During vote-complete Peer doesn't acknowledge Timeout β†’ remove peer β†’ operation still succeeds for others
During shared state sync P2P transfer fails Abort β†’ retry sync from different peer
Master dies All coordination stops Restart master β†’ peers reconnect β†’ resume from last state

The Abort Signal Mechanism

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ ABORT SIGNAL PROPAGATION β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Scenario: Peer B dies during all-reduce Peer A Master Peer C β”‚ β”‚ β”‚ β”‚ ══════════ ALL-REDUCE IN PROGRESS ════│ β”‚ β”‚ β”‚ │──── chunk ────►[Peer B DEAD] β”‚ β”‚ β”‚ β”‚ β”‚ (timeout) β”‚ β”‚ β”‚ β”‚ β”‚ │── report_failure ─►│ β”‚ β”‚ β”‚ β”‚ β”‚ │◄── report_failure─│ (Peer C also noticed) β”‚ β”‚ β”‚ │◄───── ABORT ───────│────── ABORT ─────►│ β”‚ β”‚ β”‚ β”‚ restore_backup β”‚ restore_backup β”‚ β”‚ β”‚ β”‚ β”‚ │── remove Peer B ──│ β”‚ β”‚ β”‚ │◄── topology_update─│── topology_updateβ–Ίβ”‚ β”‚ β”‚ β”‚ β”‚ ══════════ RETRY (world_size=2) ══════│ β”‚ β”‚ β”‚ │◄──────────────────►│◄─────────────────►│ β”‚ β”‚ β”‚ β”‚ SUCCESS! β”‚ β–Ό β–Ό β–Ό

Buffer Backup & Restore

⚠️ Critical for Correctness

Without buffer backup, a failed collective leaves you with CORRUPTED data - partially reduced, unusable. You'd have to reload from checkpoint!

fn all_reduce(buffer: &mut [f32]) -> Result<()> {
    // CRITICAL: Backup before ANY modification
    let backup = buffer.to_vec();
    
    // Attempt the collective
    match perform_ring_reduce(buffer) {
        Ok(()) => {
            // Success! Backup no longer needed
            Ok(())
        }
        Err(Aborted) => {
            // Failure! Restore original data
            buffer.copy_from_slice(&backup);
            Err(Aborted)
        }
    }
}

πŸ’‘ Error Path β‰ˆ Success Path

In PCCL, recovering from an error is almost as fast as succeeding. No expensive rollback mechanisms, no checkpoint reloads. Just:

  1. Restore buffer from backup (memcpy)
  2. Update topology (remove failed peer)
  3. Retry operation

The No-I/O Abort Check

Checking for abort signals must NOT add I/O overhead to the hot path:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ ABORT CHECK ARCHITECTURE β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Master Socket β”‚ ◄── Separate TCP connection β”‚ (background) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ Abort message arrives β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Receive Thread β”‚ ◄── Dedicated thread, always listening β”‚ (async) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ Push to queue (lock-free) β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Lock-Free Queue β”‚ ◄── No locks, no syscalls to check β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ Collective loop checks: queue.pop() β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Collective Loop β”‚ β”‚ β”‚ β”‚ while not done: β”‚ β”‚ send_async() β”‚ β”‚ recv_async() β”‚ β”‚ if queue.pop(): β”‚ ◄── Nearly FREE! Just memory read β”‚ return ABORT β”‚ β”‚ accumulate() β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Stress Testing

PCCL passes 8-hour stress tests on Linux, macOS, and Windows with:

As long as shared state advances correctly despite chaos, the test passes!

World Size Threshold

Q: What if ALL peers die?

A: PCCL supports a minimum world_size threshold. If peers drop below this, training pauses and waits for new arrivals.

Q: What about the model weights?

A: Weights are on the PEERS, not the master. Even if all peers die, when new peers join, they must provide the correct shared state hash (from checkpoint) to resume.

World Size Behavior: ───────────────────────────────────────────────────────────────────────── world_size = 10 Normal operation β”‚ β”‚ (3 peers die) β–Ό world_size = 7 Still above threshold, continue β”‚ β”‚ (4 more peers die) β–Ό world_size = 3 Below threshold (e.g., min=5) β”‚ β”‚ PAUSE - wait for new peers β”‚ β”‚ (3 new peers join) β–Ό world_size = 6 Above threshold, resume!

What PCCL Does NOT Handle

⚠️ Limitations

Chapter Summary