Chapter 3: State Machines
The Finite States That Make Fault Tolerance Possible
🚦 The Traffic Light Analogy
A traffic light has exactly 3 states: RED, YELLOW, GREEN. It can only transition in specific ways (GREEN → YELLOW → RED → GREEN).
This predictability is what makes intersections safe! You KNOW what comes next.
PCCL's state machine applies the same idea: by limiting the possible states and transitions, it becomes feasible to enumerate and handle EVERY failure scenario.
Why State Machines Matter
💡 The Key Insight
Previous PCCL attempts failed because they allowed too many states:
- "Join anytime" = infinite possible states
- Infinite states = can't enumerate error paths
- Can't enumerate = can't handle all failures
Solution: Restrict PCCL to a finite state machine, so that every error path is testable!
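To see why a finite state set makes error paths enumerable, consider the minimal Python sketch below. The state names are borrowed from later in this chapter, and the failure events are hypothetical examples invented for illustration; neither list is taken from PCCL's actual API.

```python
from itertools import product

# State names as used later in this chapter (illustrative only).
STATES = [
    "IDLE",
    "VOTE_ACCEPT_NEW_PEERS",
    "SYNCING_SHARED_STATE",
    "COLLECTIVE_COMMUNICATIONS_RUNNING",
]

# Hypothetical failure events, invented for this example.
FAILURES = ["peer_disconnects", "master_unreachable", "vote_times_out"]

# Finite states x finite failures = a finite list of cases,
# so a test suite can cover each (state, failure) pair explicitly.
error_paths = list(product(STATES, FAILURES))

print(len(error_paths))  # 4 states * 3 failures = 12 cases
```

With a "join anytime" design there is no such fixed list to iterate over, which is the problem the key insight describes.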
The Three Levels of State
PCCL tracks state at three levels:
┌─────────────────────────────────────────────────────────────────────────┐
│                          THREE LEVELS OF STATE                          │
└─────────────────────────────────────────────────────────────────────────┘

Level 1: CONNECTION PHASE (per peer)
────────────────────────────────────
"Am I part of the group yet?"

  ┌────────────┐                 ┌────────────┐
  │ REGISTERED │ ──── vote ────► │  ACCEPTED  │
  │            │     passes      │            │
  └────────────┘                 └────────────┘

  • REGISTERED: Connected, but can't participate in collectives
  • ACCEPTED: Full member, can participate in everything

Level 2: CONNECTION STATE (per peer)
────────────────────────────────────
"What am I doing right now?"

  ┌──────┐     ┌───────────────────┐     ┌────────────────────┐
  │ IDLE │────►│ VOTE_ACCEPT_PEERS │────►│ COLLECTIVE_RUNNING │
  └──────┘     └───────────────────┘     └────────────────────┘
     ▲                                           ▲
     │         ┌─────────────────┐               │
     └────────►│  SYNCING_STATE  │───────────────┘
               └─────────────────┘

Level 3: COLLECTIVE STATE (per operation tag)
─────────────────────────────────────────────
"Where is this specific operation?"

  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
  │ VOTE_INITIATE │────►│  PERFORMING   │────►│ VOTE_COMPLETE │
  └───────────────┘     └───────────────┘     └───────────────┘
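One way to picture the three levels is as three small enumerations attached to each peer. The sketch below is illustrative only: the class names, the `PeerRecord` type, and the use of integer tags are assumptions made for this example, and the member names follow the tables in this chapter rather than PCCL's real API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ConnectionPhase(Enum):
    """Level 1: is this peer part of the group yet?"""
    REGISTERED = auto()
    ACCEPTED = auto()


class ConnectionState(Enum):
    """Level 2: what is this peer doing right now?"""
    IDLE = auto()
    VOTE_ACCEPT_NEW_PEERS = auto()
    SYNCING_SHARED_STATE = auto()
    COLLECTIVE_COMMUNICATIONS_RUNNING = auto()


class CollectiveState(Enum):
    """Level 3: where is one tagged operation?"""
    VOTE_INITIATE = auto()
    PERFORMING = auto()
    VOTE_COMPLETE = auto()


@dataclass
class PeerRecord:
    """Hypothetical per-peer record combining all three levels."""
    phase: ConnectionPhase = ConnectionPhase.REGISTERED
    state: ConnectionState = ConnectionState.IDLE
    # One CollectiveState per operation tag (integer tags are an assumption).
    collectives: dict = field(default_factory=dict)


peer = PeerRecord()
print(peer.phase.name, peer.state.name, len(peer.collectives))
```

A freshly connected peer starts in the REGISTERED phase with no collective operations in flight.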
Level 1: Connection Phase
| Phase | Description | Can Do |
|-------|-------------|--------|
| REGISTERED | Connected to master, waiting for acceptance | Wait, receive topology updates |
| ACCEPTED | Voted in by existing peers | Everything: collectives, sync, P2P |
Connection Phase Transition:
─────────────────────────────────────────────────────────────────────────

  New Peer                     Master                   Existing Peers
     │                            │                            │
     │──── connect() ────────────►│                            │
     │                            │                            │
     │◄─── phase=REGISTERED ──────│                            │
     │                            │                            │
     │                            │◄─── "accept new peer?" ────│
     │                            │                            │
     │                            │──── vote request ─────────►│
     │                            │                            │
     │                            │◄─── all vote YES ──────────│
     │                            │                            │
     │◄─── phase=ACCEPTED ────────│                            │
     │                            │                            │
     │◄────────────── establish P2P connections ──────────────►│
     │                            │                            │
     ▼                            ▼                            ▼
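The vote step in this sequence can be sketched as a pure function from the collected votes to the new peer's phase. This is a simplified illustration, not PCCL's implementation: the function name is hypothetical, and treating an empty vote list as "no vote held yet" is an assumption made for the example.

```python
from enum import Enum, auto


class ConnectionPhase(Enum):
    REGISTERED = auto()
    ACCEPTED = auto()


def phase_after_vote(current: ConnectionPhase, votes: list) -> ConnectionPhase:
    """Return the new peer's phase after the existing peers have voted.

    Acceptance requires every existing peer to vote yes, matching the
    "all vote YES" step in the diagram. An empty vote list is treated
    as "no vote held yet" (an assumption of this sketch).
    """
    if current is not ConnectionPhase.REGISTERED:
        return current  # already accepted; nothing to decide
    if votes and all(votes):
        return ConnectionPhase.ACCEPTED
    return ConnectionPhase.REGISTERED  # keep waiting for a later vote


print(phase_after_vote(ConnectionPhase.REGISTERED, [True, True, True]).name)
print(phase_after_vote(ConnectionPhase.REGISTERED, [True, False, True]).name)
```

The first call yields ACCEPTED and the second leaves the peer REGISTERED, since a single dissenting vote blocks the transition.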
Level 2: Connection State
| State | Description | Transitions To |
|-------|-------------|----------------|
| IDLE | Ready for any operation | Any other state |
| VOTE_ACCEPT_NEW_PEERS | Voting on whether to accept registered peers | IDLE (after vote) |
| SYNCING_SHARED_STATE | Synchronizing model weights/optimizer state | IDLE (after sync) |
| COLLECTIVE_COMMUNICATIONS_RUNNING | Executing a collective operation | IDLE (after completion) |
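The table above can be transcribed directly into a transition map. The following is a minimal sketch assuming the four states listed in the table are the complete set; the `ALLOWED` map and the `transition` helper are hypothetical names, not part of PCCL.

```python
from enum import Enum, auto


class ConnectionState(Enum):
    IDLE = auto()
    VOTE_ACCEPT_NEW_PEERS = auto()
    SYNCING_SHARED_STATE = auto()
    COLLECTIVE_COMMUNICATIONS_RUNNING = auto()


S = ConnectionState

# Allowed transitions, transcribed from the table:
# IDLE may enter any operation; every operation returns to IDLE.
ALLOWED = {
    S.IDLE: {S.VOTE_ACCEPT_NEW_PEERS, S.SYNCING_SHARED_STATE,
             S.COLLECTIVE_COMMUNICATIONS_RUNNING},
    S.VOTE_ACCEPT_NEW_PEERS: {S.IDLE},
    S.SYNCING_SHARED_STATE: {S.IDLE},
    S.COLLECTIVE_COMMUNICATIONS_RUNNING: {S.IDLE},
}


def transition(current: ConnectionState, target: ConnectionState) -> ConnectionState:
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


# Because the map is finite, every (from, to) pair can be checked exhaustively.
legal = [(a.name, b.name) for a in S for b in S if b in ALLOWED[a]]
print(len(legal), "legal transitions out of", len(S) ** 2, "pairs")
```

Only 6 of the 16 possible pairs are legal, so the remaining 10 are exactly the cases a test suite should confirm are rejected.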
Level 3: Collective State
Each collective operation (identified by a tag) has its own state:
| State | Description | Next State |
|-------|-------------|------------|
| VOTE_INITIATE | Peers voting to start the operation | PERFORMING (if all agree) |
| PERFORMING | Actually executing (reduce-scatter, all-gather) | VOTE_COMPLETE (when done) |
| VOTE_COMPLETE | Peers confirming they finished | (end of operation) |
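Since each tag carries its own state, the progression in the table can be sketched as a small per-tag tracker. The `CollectiveTracker` class, its method names, and the use of integer tags are assumptions for illustration; the real library may organize this differently.

```python
from enum import Enum, auto
from typing import Optional


class CollectiveState(Enum):
    VOTE_INITIATE = auto()
    PERFORMING = auto()
    VOTE_COMPLETE = auto()


# Linear progression from the table; None marks the end of the operation.
NEXT = {
    CollectiveState.VOTE_INITIATE: CollectiveState.PERFORMING,
    CollectiveState.PERFORMING: CollectiveState.VOTE_COMPLETE,
    CollectiveState.VOTE_COMPLETE: None,
}


class CollectiveTracker:
    """Hypothetical tracker holding one state per operation tag."""

    def __init__(self) -> None:
        self.by_tag = {}

    def start(self, tag: int) -> None:
        """Register a new tagged operation in VOTE_INITIATE."""
        if tag in self.by_tag:
            raise ValueError(f"tag {tag} is already in flight")
        self.by_tag[tag] = CollectiveState.VOTE_INITIATE

    def advance(self, tag: int) -> Optional[CollectiveState]:
        """Move the tagged operation one step; drop it once it has ended."""
        nxt = NEXT[self.by_tag[tag]]
        if nxt is None:
            del self.by_tag[tag]
        else:
            self.by_tag[tag] = nxt
        return nxt


tracker = CollectiveTracker()
tracker.start(7)
steps = [tracker.advance(7) for _ in range(3)]
print([s.name if s else None for s in steps])
```

After three calls to `advance`, the operation with tag 7 has passed through PERFORMING and VOTE_COMPLETE and has been removed from the tracker.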
The Golden Rule
⚠️ ONE Operation At A Time!
Within a single peer group, PCCL enforces:
- Only ONE major operation active at any moment
- Cannot accept peers WHILE running collective
- Cannot sync state WHILE accepting peers
- Cannot run collective WHILE syncing state
Why? This restriction makes the state space finite and error paths enumerable!
The One-Operation Rule:
─────────────────────────────────────────────────────────────────────────

ALLOWED:
  IDLE → VOTE_ACCEPT → IDLE → COLLECTIVE → IDLE → SYNC → IDLE
  (sequential, one at a time)

NOT ALLOWED:
  IDLE → VOTE_ACCEPT ─┬─► COLLECTIVE  ✗ CONFLICT!
                      └─► SYNC        ✗ CONFLICT!
  (parallel operations)
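The one-operation rule amounts to a guard that refuses to start anything unless the group is IDLE. Here is a minimal sketch of such a guard; the `PeerGroup` class and its `begin`/`finish` methods are hypothetical names chosen for this example.

```python
from enum import Enum, auto


class ConnectionState(Enum):
    IDLE = auto()
    VOTE_ACCEPT_NEW_PEERS = auto()
    SYNCING_SHARED_STATE = auto()
    COLLECTIVE_COMMUNICATIONS_RUNNING = auto()


class PeerGroup:
    """Hypothetical peer group that allows one major operation at a time."""

    def __init__(self) -> None:
        self.state = ConnectionState.IDLE

    def begin(self, operation: ConnectionState) -> None:
        """Start an operation; only permitted while the group is IDLE."""
        if operation is ConnectionState.IDLE:
            raise ValueError("IDLE is not an operation")
        if self.state is not ConnectionState.IDLE:
            raise RuntimeError(
                f"CONFLICT: {self.state.name} is still active, "
                f"cannot start {operation.name}"
            )
        self.state = operation

    def finish(self) -> None:
        """End the current operation and return to IDLE."""
        self.state = ConnectionState.IDLE


group = PeerGroup()

# ALLOWED: sequential, one at a time.
group.begin(ConnectionState.VOTE_ACCEPT_NEW_PEERS)
group.finish()
group.begin(ConnectionState.COLLECTIVE_COMMUNICATIONS_RUNNING)

# NOT ALLOWED: starting a sync while the collective is still running.
try:
    group.begin(ConnectionState.SYNCING_SHARED_STATE)
    conflict = None
except RuntimeError as err:
    conflict = str(err)
print(conflict)
```

The rejected call leaves the group in COLLECTIVE_COMMUNICATIONS_RUNNING, so the running operation is never disturbed by the conflicting request.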
Micro-Consensus
💡 Every Transition Needs Agreement
Before ANY state transition, ALL peers must agree. Like a jury - unanimous verdict required!
Micro-Consensus Flow:
─────────────────────────────────────────────────────────────────────────

  Master: "Ready to transition to COLLECTIVE_RUNNING?"
     │
     ├───► Peer A: "Ready!" ──┐
     ├───► Peer B: "Ready!" ──┼──► Master: "All agreed!"
     └───► Peer C: "Ready!" ──┘              │
                                             ▼
                                     Broadcast: "GO!"
                                             │
                     ┌───────────────────────┼───────────────────────┐
                     │                       │                       │
                     ▼                       ▼                       ▼
                  Peer A                  Peer B                  Peer C
                transitions             transitions             transitions
              simultaneously          simultaneously          simultaneously
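The flow above can be reduced to a simple rule: the master broadcasts "GO!" only if every peer replied ready, and otherwise nobody moves. The sketch below models that rule with plain dictionaries; the function names and the string state labels are assumptions for illustration, and real message passing and timeouts are left out.

```python
def all_ready(replies: dict) -> bool:
    """True only if at least one peer replied and every reply is 'ready'."""
    return bool(replies) and all(replies.values())


def apply_transition(states: dict, replies: dict, target: str) -> dict:
    """Move every peer to target together, or leave every peer unchanged.

    A missing reply counts as disagreement (an assumption of this sketch).
    """
    if set(replies) != set(states) or not all_ready(replies):
        return dict(states)
    return {peer: target for peer in states}


states = {"A": "IDLE", "B": "IDLE", "C": "IDLE"}

go = apply_transition(
    states, {"A": True, "B": True, "C": True}, "COLLECTIVE_RUNNING")
stay = apply_transition(
    states, {"A": True, "B": False, "C": True}, "COLLECTIVE_RUNNING")

print(go)
print(stay)
```

In the first call all three peers transition together; in the second, Peer B's refusal keeps all three in IDLE, so the group never ends up in a mixed state.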
Complete State Transition Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│                         COMPLETE STATE MACHINE                          │
└─────────────────────────────────────────────────────────────────────────┘

                              ┌─────────────┐
                              │ REGISTERED  │
                              └──────┬──────┘
                                     │ vote passes
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                             ACCEPTED PHASE                              │
│                                                                         │
│    ┌───────────────────────────────────────────────────────────────┐    │
│    │                             IDLE                              │    │
│    └───────────────────────────────┬───────────────────────────────┘    │
│                                    │                                    │
│              ┌─────────────────────┼─────────────────────┐              │
│              │                     │                     │              │
│              ▼                     ▼                     ▼              │
│       ┌─────────────┐       ┌─────────────┐       ┌─────────────┐       │
│       │ VOTE_ACCEPT │       │ SYNC_STATE  │       │ COLLECTIVE  │       │
│       │ _NEW_PEERS  │       │             │       │ _RUNNING    │       │
│       └──────┬──────┘       └──────┬──────┘       └──────┬──────┘       │
│              │                     │                     │              │
│              │                     │                     │              │
│              └─────────────────────┴─────────────────────┘              │
│                                    │                                    │
│                                    ▼                                    │
│                              back to IDLE                               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Exercise: Trace the Path
✏️ Scenario
A new peer wants to join and participate in an all-reduce. Trace the states:
1. New peer connects → Phase: ___
2. Existing peers vote → Phase: ___
3. Peer is accepted → Phase: ___, State: ___
4. All-reduce starts → State: ___, CollectiveState: ___
5. All-reduce completes → State: ___
Answers:
1. REGISTERED
2. REGISTERED (still waiting)
3. ACCEPTED, IDLE
4. COLLECTIVE_RUNNING, VOTE_INITIATE → PERFORMING → VOTE_COMPLETE
5. IDLE
"RAI-SCV" = Registered → Accepted → Idle → Sync/Collective/Vote
The journey of a peer through PCCL states!
Chapter Summary
- Three levels: ConnectionPhase, ConnectionState, CollectiveState
- Finite states: Makes error paths enumerable and testable
- One operation rule: Only one major operation at a time
- Micro-consensus: All peers must agree before any transition
- Predictability: Like traffic lights - you always know what's next