Chapter 2: Architecture
The Master-Client Model That Makes It All Work
The Wedding Planner Analogy
Think of PCCL's architecture like a wedding:
- Master = Wedding Planner - Coordinates everything, but doesn't cook the food
- Peers = Vendors - Caterer, photographer, DJ - they do the actual work
- P2P Connections = Vendors talking directly - "Hey DJ, I'm serving dessert now!"
The Big Picture
```
                    ┌─────────────────────┐
                    │       MASTER        │
                    │  (Wedding Planner)  │
                    │                     │
                    │ • Tracks who's in   │
                    │ • Coordinates ops   │
                    │ • Optimizes ring    │
                    │ • NO data transfer  │
                    └──────────┬──────────┘
                               │
         Coordination only ────┤ (lightweight!)
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
   ┌─────────┐            ┌─────────┐            ┌─────────┐
   │ PEER A  │───────────►│ PEER B  │───────────►│ PEER C  │
   │ (GPU 0) │  P2P Data  │ (GPU 1) │  P2P Data  │ (GPU 2) │
   └─────────┘            └─────────┘            └─────────┘
        ▲                                             │
        └─────────────────── P2P Data ────────────────┘
```

Data flows DIRECTLY between peers - the Master never touches it!
What the Master Tracks
| Responsibility | Details |
|----------------|---------|
| Client Status | REGISTERED vs ACCEPTED phase |
| Ring Topology | Optimal peer ordering (via ATSP) |
| Shared State Hashes | Identifies out-of-sync peers |
| Collective Operations | Consensus on start/complete/abort |
The Golden Rule
⚠️ ONE Operation At A Time!
PCCL enforces: only ONE major operation can be active:
- Accepting new peers, OR
- Synchronizing shared state, OR
- Running a collective (all-reduce)
Why? This makes fault tolerance TRACTABLE. Previous attempts failed with concurrent operations!
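The one-operation rule can be sketched as a tiny state machine. This is a minimal illustration in Rust, not PCCL's actual master code; the `Phase` and `Master` names are invented for the example:

```rust
// Sketch of the one-operation rule: the master holds a single Phase,
// and a new major operation may only begin from Idle.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Idle,
    AcceptingPeers,
    SyncingSharedState,
    RunningCollective,
}

struct Master {
    phase: Phase,
}

impl Master {
    // Begin a major operation, but only if nothing else is in flight.
    fn try_begin(&mut self, next: Phase) -> Result<(), String> {
        if self.phase != Phase::Idle {
            return Err(format!("busy: {:?} already active", self.phase));
        }
        self.phase = next;
        Ok(())
    }

    fn finish(&mut self) {
        self.phase = Phase::Idle;
    }
}

fn main() {
    let mut m = Master { phase: Phase::Idle };
    assert!(m.try_begin(Phase::RunningCollective).is_ok());
    // A second operation is refused until the first completes.
    assert!(m.try_begin(Phase::AcceptingPeers).is_err());
    m.finish();
    assert!(m.try_begin(Phase::AcceptingPeers).is_ok());
}
```

Serializing everything through one phase variable is what keeps recovery simple: when a peer fails, the master always knows exactly which single operation it must unwind.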
SimpleHash: GPU-Parallel Hashing
💡 The Problem
To verify that all peers hold identical weights, we need to hash gigabytes of data. Standard hashes such as SHA-256 are inherently sequential - each block depends on the previous digest - so they map poorly onto thousands of GPU threads.
SimpleHash Algorithm
PCCL uses a custom hash inspired by FNV-1a but designed for GPU parallelism:
SimpleHash Architecture (Warp-Tree Reduce)

```
Step 1: Parallel chunk hashing (one thread per element)

  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │ h0 │ h1 │ h2 │ h3 │ h4 │ h5 │ h6 │ h7 │   ← 8 threads, 8 hashes
  └─┬──┴─┬──┴─┬──┴─┬──┴─┬──┴─┬──┴─┬──┴─┬──┘
    │    │    │    │    │    │    │    │
    └─┬──┘    └─┬──┘    └─┬──┘    └─┬──┘
      │         │         │         │
Step 2: Warp-level reduction (XOR + multiply)

   ┌──┴──┐   ┌──┴──┐   ┌──┴──┐   ┌──┴──┐
   │ h01 │   │ h23 │   │ h45 │   │ h67 │      ← 4 partial hashes
   └──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
      │         │         │         │
      └────┬────┘         └────┬────┘
           │                   │
Step 3: Block-level reduction

       ┌───┴───┐           ┌───┴───┐
       │ h0123 │           │ h4567 │          ← 2 partial hashes
       └───┬───┘           └───┬───┘
           │                   │
           └─────────┬─────────┘
                     │
Step 4: Final hash

              ┌──────┴──────┐
              │ FINAL HASH  │                 ← 64-bit result
              └─────────────┘
```
```cuda
// SimpleHash core (FNV-1a inspired): fold one 64-bit value into the hash.
__device__ uint64_t simple_hash_step(uint64_t hash, uint64_t value) {
    const uint64_t FNV_PRIME = 0x100000001b3ULL;
    hash ^= value;      // mix in the new element
    hash *= FNV_PRIME;  // scramble with the FNV prime
    return hash;
}

// Warp-level reduction using shuffle: tree-combine the 32 lane hashes.
__device__ uint64_t warp_reduce_hash(uint64_t hash) {
    for (int offset = 16; offset > 0; offset /= 2) {
        uint64_t other = __shfl_down_sync(0xffffffff, hash, offset);
        hash = simple_hash_step(hash, other);
    }
    return hash;  // lane 0 now holds the warp's combined hash
}
```
Cross-GPU Determinism: SimpleHash produces IDENTICAL results on GTX 980 Ti through B200. This required careful handling of floating-point bit patterns and avoiding architecture-specific optimizations!
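A CPU reference of the same combine step makes the determinism argument concrete. The sketch below is my own illustration, not PCCL's code: it reproduces the XOR-then-multiply step and a fixed-shape reduction tree mirroring the warp shuffle. Because the tree shape never depends on the hardware, the result is bit-identical on every run and every machine:

```rust
// FNV-1a-style combine step, matching the CUDA kernel above.
const FNV_PRIME: u64 = 0x100000001b3;

fn simple_hash_step(hash: u64, value: u64) -> u64 {
    (hash ^ value).wrapping_mul(FNV_PRIME)
}

// Mirror of the warp shuffle tree: in each round, lane i folds in the
// value from lane i + offset; after log2(n) rounds, lane 0 holds the
// combined hash. The tree shape is fixed, so the result is reproducible.
fn tree_reduce_hash(mut lanes: Vec<u64>) -> u64 {
    assert!(lanes.len().is_power_of_two());
    let mut offset = lanes.len() / 2;
    while offset > 0 {
        for i in 0..offset {
            lanes[i] = simple_hash_step(lanes[i], lanes[i + offset]);
        }
        offset /= 2;
    }
    lanes[0]
}

fn main() {
    let data: Vec<u64> = (1..=8).collect();
    let h1 = tree_reduce_hash(data.clone());
    let h2 = tree_reduce_hash(data);
    assert_eq!(h1, h2); // deterministic: same tree, same result
    println!("hash = {:#018x}", h1);
}
```

Note what is *not* guaranteed: reordering the lanes changes the hash, since XOR-multiply chains are not commutative across tree positions. Determinism holds because every GPU walks the identical tree.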
ATSP: Optimal Ring Ordering
🧠 The Traveling Salesman Problem
For ring all-reduce, peer ORDER matters. If Peer A (Tokyo) is next to Peer B (London) in the ring, every message crosses the ocean = slow!
PCCL solves the Asymmetric TSP because upload ≠ download speed.
Why Asymmetric?
Asymmetric Bandwidth Example:

```
Tokyo ────────────────► London
       100 Mbps upload

Tokyo ◄──────────────── London
      500 Mbps download
```

The cost Tokyo→London ≠ London→Tokyo!
Standard TSP assumes symmetric costs - WRONG for networks.
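A two-line check makes the point concrete, using the Tokyo/London numbers from the example above (illustration only):

```rust
// Directed bandwidth in Mbps, index 0 = Tokyo, 1 = London.
// bw[a][b] is the speed sending FROM a TO b.
fn main() {
    let bw = [
        [0u64, 100], // Tokyo → London: 100 Mbps upload
        [500, 0],    // London → Tokyo: 500 Mbps
    ];
    // The matrix is not symmetric, so tour cost depends on direction -
    // exactly what makes this an *Asymmetric* TSP.
    assert_ne!(bw[0][1], bw[1][0]);
    println!("Tokyo→London: {} Mbps, London→Tokyo: {} Mbps",
             bw[0][1], bw[1][0]);
}
```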
ATSP Solution
ATSP Ring Optimization:

Input: Bandwidth matrix (measured via probing)

```
┌─────────┬───────┬───────┬────────┬───────┐
│ From\To │ Tokyo │ Seoul │ London │  NYC  │
├─────────┼───────┼───────┼────────┼───────┤
│ Tokyo   │   -   │  800  │  100   │  150  │
│ Seoul   │  750  │   -   │  120   │  140  │
│ London  │  100  │  110  │   -    │  500  │
│ NYC     │  140  │  130  │  450   │   -   │
└─────────┴───────┴───────┴────────┴───────┘
                  (Mbps)
```
Output: Optimal ring order

```
Tokyo ──► Seoul ──► London ──► NYC ──► Tokyo
     800       120        500      140
```

The slowest link (Seoul → London at 120 Mbps) caps the ring's throughput, and no other ordering does better: the objective is to maximize the minimum edge.
```rust
// ATSP solver (greedy nearest-neighbor + 2-opt improvement)
fn solve_atsp(bandwidth: &Matrix) -> Vec<usize> {
    // Start with a greedy solution...
    let mut tour = greedy_nearest_neighbor(bandwidth);

    // ...then improve it with 2-opt swaps until no swap helps.
    loop {
        let improved = two_opt_improve(&mut tour, bandwidth);
        if !improved {
            break;
        }
    }
    tour
}

// Objective: maximize the minimum edge (a bottleneck TSP variant),
// since the slowest link in the ring caps all-reduce throughput.
fn tour_cost(tour: &[usize], bw: &Matrix) -> u64 {
    tour.windows(2)
        .map(|w| bw[w[0]][w[1]])
        .min()
        .unwrap()
}
```
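With only four peers, the bottleneck objective can be verified by brute force. This standalone sketch (not part of PCCL) enumerates all six directed rings over the example matrix and keeps the one whose slowest edge is fastest:

```rust
// Bandwidth matrix from the example: Tokyo=0, Seoul=1, London=2, NYC=3.
// BW[a][b] = Mbps sending from a to b; 0 on the diagonal.
const BW: [[u64; 4]; 4] = [
    [0,   800, 100, 150], // Tokyo
    [750, 0,   120, 140], // Seoul
    [100, 110, 0,   500], // London
    [140, 130, 450, 0],   // NYC
];

// Minimum edge bandwidth along the closed ring 0 → r0 → r1 → r2 → 0.
fn bottleneck(rest: &[usize; 3]) -> u64 {
    let tour = [0, rest[0], rest[1], rest[2], 0];
    tour.windows(2).map(|w| BW[w[0]][w[1]]).min().unwrap()
}

fn main() {
    // All (n-1)! = 6 orderings of the non-anchor peers {1, 2, 3}.
    let perms = [[1, 2, 3], [1, 3, 2], [2, 1, 3],
                 [2, 3, 1], [3, 1, 2], [3, 2, 1]];
    let mut best = (0u64, [0usize; 3]);
    for p in perms {
        let b = bottleneck(&p);
        if b > best.0 {
            best = (b, p);
        }
    }
    println!("best bottleneck = {} Mbps via {:?}", best.0, best.1);
}
```

Brute force works here, but it scales as (n-1)!, which is why PCCL uses the greedy-plus-2-opt heuristic above for real peer counts.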
⚠️ When Topology Changes
ATSP is re-solved when:
- New peer joins (need to insert in optimal position)
- Peer leaves (close the gap)
- Bandwidth changes significantly (periodic re-probe)
Shared State Consistency
```
Step 1: Everyone computes SimpleHash of their weights

  ┌─────────┐     ┌─────────┐     ┌─────────┐
  │ Peer A  │     │ Peer B  │     │ Peer C  │
  │hash=abc │     │hash=abc │     │hash=xyz │   ← Different!
  └────┬────┘     └────┬────┘     └────┬────┘
       └───────────────┼───────────────┘
                       ▼
Step 2: Master identifies outlier (majority vote)

                  ┌─────────┐
                  │ MASTER  │
                  │ "C is   │
                  │  wrong!"│
                  └────┬────┘
                       ▼
Step 3: P2P transfer to fix C

  ┌─────────┐                   ┌─────────┐
  │ Peer A  │ ────────────────► │ Peer C  │
  │hash=abc │     "Here's       │hash=abc │   ← Fixed!
  └─────────┘     the data"     └─────────┘
```
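Step 2's majority vote can be sketched in a few lines. Illustrative only: `find_outliers` is an invented name, not PCCL's API.

```rust
use std::collections::HashMap;

// Tally the hashes the master received and flag every peer whose
// hash differs from the most common one.
fn find_outliers(hashes: &[u64]) -> Vec<usize> {
    // Count how often each distinct hash appears.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &h in hashes {
        *counts.entry(h).or_insert(0) += 1;
    }
    // The hash reported by the most peers is taken as ground truth.
    let majority = *counts.iter().max_by_key(|&(_, &c)| c).unwrap().0;
    let mut outliers = Vec::new();
    for (i, &h) in hashes.iter().enumerate() {
        if h != majority {
            outliers.push(i); // this peer must re-sync via P2P
        }
    }
    outliers
}

fn main() {
    // Peers A and B agree; peer C (index 2) is out of sync.
    let hashes = [0xabc_u64, 0xabc, 0xdef];
    assert_eq!(find_outliers(&hashes), vec![2]);
    println!("outliers: {:?}", find_outliers(&hashes));
}
```

Only the 64-bit hashes travel to the master; the gigabytes of weights themselves move peer-to-peer, keeping the master lightweight.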
Low Overhead
| Operation | Latency | What It Does |
|-----------|---------|--------------|
| updateTopology | 0.097 ms | Accept peers, re-solve ATSP |
| syncSharedState | 1.16 ms | SimpleHash + verify consensus |
| allReduceAsync | 0.005 ms | Start all-reduce (non-blocking) |
| awaitAsyncReduce | 7.03 ms | Wait for completion |
✏️ Test Your Understanding
1. Why can't PCCL use SHA-256 for shared state verification?
2. Why is the TSP "asymmetric" for network topology?
3. What happens if SimpleHash produces different results on different GPUs?
Answers:
1. SHA-256 is sequential; SimpleHash parallelizes across GPU threads
2. Upload speed β download speed (asymmetric bandwidth)
3. Peers would appear "out of sync" when they're not - catastrophic!
"SHAT" = SimpleHash + ATSP + Topology
The three pillars of PCCL architecture optimization!
Chapter Summary
- Master-Client: Master coordinates, peers do the work, P2P transfers data
- One Operation Rule: Only one major operation at a time
- SimpleHash: FNV-1a inspired, GPU-parallel, cross-architecture deterministic
- ATSP: Asymmetric TSP finds optimal ring order based on bandwidth
- Low Overhead: Sub-millisecond coordination latency