Chapter 2: Architecture

The Master-Client Model That Makes It All Work

🎪 The Wedding Planner Analogy

Think of PCCL's architecture like a wedding:

The Big Picture

```
                  ┌─────────────────────┐
                  │       MASTER        │
                  │  (Wedding Planner)  │
                  │                     │
                  │ • Tracks who's in   │
                  │ • Coordinates ops   │
                  │ • Optimizes ring    │
                  │ • NO data transfer  │
                  └──────────┬──────────┘
                             │ Coordination only
                             │ (lightweight!)
                             │
       ┌─────────────────────┼─────────────────────┐
       │                     │                     │
       ▼                     ▼                     ▼
  ┌─────────┐           ┌─────────┐           ┌─────────┐
  │ PEER A  │◄─────────►│ PEER B  │◄─────────►│ PEER C  │
  │ (GPU 0) │  P2P Data │ (GPU 1) │  P2P Data │ (GPU 2) │
  └─────────┘           └─────────┘           └─────────┘
       ▲                                           │
       └───────────────── P2P Data ────────────────┘
```

Data flows DIRECTLY between peers - the Master never touches it!

What the Master Tracks

| Responsibility | Details |
|---|---|
| Client Status | REGISTERED vs ACCEPTED phase |
| Ring Topology | Optimal peer ordering (via ATSP) |
| Shared State Hashes | Identifies out-of-sync peers |
| Collective Operations | Consensus on start/complete/abort |
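This bookkeeping can be pictured as a small data shape. The Rust sketch below is purely illustrative - `ClientPhase`, `MasterState`, and all field names are invented for exposition, not PCCL's actual definitions:

```rust
// Illustrative sketch of the per-run state a PCCL master might track.
// All type and field names here are invented, NOT PCCL's definitions.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ClientPhase {
    Registered, // connected, but not yet part of the ring
    Accepted,   // full participant in collective operations
}

#[derive(Debug)]
struct MasterState {
    phases: Vec<ClientPhase>,  // one entry per known client
    ring_order: Vec<usize>,    // ATSP-optimized peer ordering
    state_hashes: Vec<u64>,    // last SimpleHash reported by each peer
    active_op: Option<String>, // at most one major operation at a time
}
```

The point of the shape: everything here is small metadata - hashes, indices, a phase flag - which is why the master stays lightweight while gigabytes flow peer-to-peer.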

The Golden Rule

⚠️ ONE Operation At A Time!

PCCL enforces a simple invariant: only ONE major operation - a topology update, a shared-state sync, or a collective - can be active at any moment.

Why? Serializing operations makes fault tolerance TRACTABLE: when a peer drops, there is exactly one operation in flight to abort and retry. Previous attempts that allowed concurrent operations failed!
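A minimal sketch of how such an exclusivity rule can be enforced - illustrative Rust only, with `Op` and `OpGuard` invented here, not PCCL's implementation:

```rust
// Illustrative sketch of the one-operation-at-a-time rule.
// Op and OpGuard are invented names, not PCCL's implementation.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    UpdateTopology,
    SyncSharedState,
    AllReduce,
}

struct OpGuard {
    active: Option<Op>,
}

impl OpGuard {
    fn new() -> Self {
        OpGuard { active: None }
    }

    /// Refuses to start a new operation while another is in flight.
    fn try_begin(&mut self, op: Op) -> bool {
        if self.active.is_some() {
            return false; // the golden rule: reject concurrent operations
        }
        self.active = Some(op);
        true
    }

    fn finish(&mut self) {
        self.active = None;
    }
}
```

On failure the master only ever has to reason about `active`, a single slot - that is the tractability win.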

SimpleHash: GPU-Parallel Hashing

💡 The Problem

To verify all peers have identical weights, we need to hash gigabytes of data. Standard cryptographic hashes (SHA-256) are inherently sequential - each block's input depends on the previous block's output - so they run slowly on a GPU.

SimpleHash Algorithm

PCCL uses a custom hash inspired by FNV-1a but designed for GPU parallelism:

```
SimpleHash Architecture (Warp-Tree Reduce)
──────────────────────────────────────────
Step 1: Parallel chunk hashing (one thread per element)

┌────┬────┬────┬────┬────┬────┬────┬────┐
│ h0 │ h1 │ h2 │ h3 │ h4 │ h5 │ h6 │ h7 │   ← 8 threads, 8 hashes
└──┬─┴──┬─┴──┬─┴──┬─┴──┬─┴──┬─┴──┬─┴──┬─┘
   └──┬─┘    └──┬─┘    └──┬─┘    └──┬─┘
Step 2: Warp-level reduction (XOR + multiply)
    ┌─┴─┐     ┌─┴─┐     ┌─┴─┐     ┌─┴─┐
    │h01│     │h23│     │h45│     │h67│    ← 4 partial hashes
    └─┬─┘     └─┬─┘     └─┬─┘     └─┬─┘
      └────┬────┘         └────┬────┘
Step 3: Block-level reduction
        ┌──┴──┐             ┌──┴──┐
        │h0123│             │h4567│        ← 2 partial hashes
        └──┬──┘             └──┬──┘
           └─────────┬─────────┘
Step 4: Final hash
              ┌──────┴─────┐
              │ FINAL HASH │               ← 64-bit result
              └────────────┘
```
```cuda
// SimpleHash core (FNV-1a inspired)
__device__ uint64_t simple_hash_step(uint64_t hash, uint64_t value) {
    const uint64_t FNV_PRIME = 0x100000001b3ULL;
    hash ^= value;
    hash *= FNV_PRIME;
    return hash;
}

// Warp-level reduction using shuffle (assumes a full 32-thread warp)
__device__ uint64_t warp_reduce_hash(uint64_t hash) {
    for (int offset = 16; offset > 0; offset /= 2) {
        uint64_t other = __shfl_down_sync(0xffffffff, hash, offset);
        hash = simple_hash_step(hash, other);
    }
    return hash;
}
```

Cross-GPU Determinism: SimpleHash produces IDENTICAL results on GTX 980 Ti through B200. This required careful handling of floating-point bit patterns and avoiding architecture-specific optimizations!
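To see why this determinism is achievable, it helps to mirror the combine step on the CPU. The Rust sketch below applies the same XOR-then-multiply step and reduces pairwise like the warp tree above; the zero seed and the exact tree shape are illustrative assumptions, not PCCL's verified layout:

```rust
// Host-side mirror of the kernel's combine step, reduced as a pairwise
// tree like the warp/block stages above. The seed and the exact tree
// shape are illustrative assumptions, not PCCL's layout.

fn simple_hash_step(hash: u64, value: u64) -> u64 {
    const FNV_PRIME: u64 = 0x100000001b3;
    (hash ^ value).wrapping_mul(FNV_PRIME)
}

// Combine adjacent pairs until one 64-bit hash remains.
// Assumes a non-empty input.
fn tree_reduce(mut hashes: Vec<u64>) -> u64 {
    while hashes.len() > 1 {
        hashes = hashes
            .chunks(2)
            .map(|p| if p.len() == 2 { simple_hash_step(p[0], p[1]) } else { p[0] })
            .collect();
    }
    hashes[0]
}
```

Because every step is pure integer arithmetic (XOR plus wrapping multiply), the result depends only on the inputs and the reduction tree's shape - so fixing that shape in the kernel, rather than letting each architecture pick its own schedule, is what makes the hash portable.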

ATSP: Optimal Ring Ordering

🧠 The Traveling Salesman Problem

For ring all-reduce, peer ORDER matters. If Peer A (Tokyo) is next to Peer B (London) in the ring, every message crosses the ocean = slow!

PCCL solves the Asymmetric TSP (ATSP): the cost of the link A→B need not equal B→A, because upload speed ≠ download speed.

Why Asymmetric?

```
Asymmetric Bandwidth Example:
─────────────────────────────
Tokyo ────────────────► London     100 Mbps upload
Tokyo ◄──────────────── London     500 Mbps download
```

The cost Tokyo→London ≠ London→Tokyo! Standard TSP assumes symmetric costs - WRONG for networks.

ATSP Solution

```
ATSP Ring Optimization:
───────────────────────
Input: Bandwidth matrix (measured via probing, Mbps)

┌─────────┬────────┬────────┬────────┬────────┐
│ From\To │ Tokyo  │ Seoul  │ London │  NYC   │
├─────────┼────────┼────────┼────────┼────────┤
│ Tokyo   │   -    │  800   │  100   │  150   │
│ Seoul   │  750   │   -    │  120   │  140   │
│ London  │  100   │  110   │   -    │  500   │
│ NYC     │  140   │  130   │  450   │   -    │
└─────────┴────────┴────────┴────────┴────────┘

Output: Optimal ring order

Tokyo ──► Seoul ──► London ──► NYC ──► Tokyo
      800       120        500     140

Slowest link = 120 Mbps - the best achievable
over all ring orderings (bottleneck objective)!
```
```rust
// ATSP solver (greedy nearest-neighbor + 2-opt improvement)
fn solve_atsp(bandwidth: &Matrix) -> Vec<usize> {
    // Start with a greedy solution
    let mut tour = greedy_nearest_neighbor(bandwidth);

    // Improve with 2-opt swaps until no swap helps
    while two_opt_improve(&mut tour, bandwidth) {}

    tour
}

// Objective: maximize the minimum edge (bottleneck TSP variant).
// Note the wrap-around edge back to the start - the tour is a ring.
fn tour_cost(tour: &[usize], bw: &Matrix) -> u64 {
    (0..tour.len())
        .map(|i| bw[tour[i]][tour[(i + 1) % tour.len()]])
        .min()
        .unwrap()
}
```
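The helpers referenced above can be sketched concretely. Below is an illustrative greedy seed plus a closed-ring bottleneck cost, with plain `Vec<Vec<u64>>` standing in for the `Matrix` type - a guess at the intent, not PCCL's code (node indices: Tokyo=0, Seoul=1, London=2, NYC=3):

```rust
// Illustrative helpers, with Vec<Vec<u64>> in place of the Matrix type.

// Greedy seed: from the current node, always follow the fastest
// outgoing link to a not-yet-visited node, starting at node 0.
fn greedy_nearest_neighbor(bw: &[Vec<u64>]) -> Vec<usize> {
    let n = bw.len();
    let mut tour = vec![0usize];
    let mut visited = vec![false; n];
    visited[0] = true;
    while tour.len() < n {
        let cur = *tour.last().unwrap();
        let next = (0..n)
            .filter(|&j| !visited[j])
            .max_by_key(|&j| bw[cur][j])
            .unwrap();
        visited[next] = true;
        tour.push(next);
    }
    tour
}

// Bottleneck of the closed ring: the slowest link, INCLUDING the
// wrap-around edge from the last node back to the first.
fn ring_bottleneck(tour: &[usize], bw: &[Vec<u64>]) -> u64 {
    (0..tour.len())
        .map(|i| bw[tour[i]][tour[(i + 1) % tour.len()]])
        .min()
        .unwrap()
}
```

On the example bandwidth matrix, the greedy seed alone produces Tokyo → Seoul → NYC → London with a ring bottleneck of only 100 Mbps (the London→Tokyo link) - the 2-opt pass exists precisely to improve on such seeds.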

⚠️ When Topology Changes

ATSP is re-solved when ring membership changes:

- New peers are accepted into the run (during updateTopology)
- An existing peer leaves or fails and must be evicted from the ring

Shared State Consistency

```
Step 1: Everyone computes SimpleHash of their weights

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Peer A  │     │ Peer B  │     │ Peer C  │
│hash=abc │     │hash=abc │     │hash=xyz │  ← Different!
└────┬────┘     └────┬────┘     └────┬────┘
     └───────────────┼───────────────┘
                     ▼
Step 2: Master identifies outlier (majority vote)

               ┌─────────┐
               │ MASTER  │
               │ "C is   │
               │  wrong!"│
               └────┬────┘
                    ▼
Step 3: P2P transfer to fix C

┌─────────┐                ┌─────────┐
│ Peer A  │ ─────────────► │ Peer C  │
│hash=abc │   "Here's      │hash=abc │  ← Fixed!
└─────────┘   the data"    └─────────┘
```
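Step 2 - the majority vote over reported hashes - can be made concrete. A minimal Rust sketch (`find_outliers` and its return shape are invented here; the real protocol must also handle ties and quorum):

```rust
use std::collections::HashMap;

/// Returns (majority_hash, indices of peers that must re-sync).
/// Illustrative only - not PCCL's actual consensus code.
fn find_outliers(hashes: &[u64]) -> (u64, Vec<usize>) {
    // Count how many peers reported each hash.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &h in hashes {
        *counts.entry(h).or_insert(0) += 1;
    }
    // The hash reported by the most peers wins the vote.
    let (&majority, _) = counts.iter().max_by_key(|&(_, &c)| c).unwrap();
    // Any peer whose hash disagrees is flagged for P2P re-sync.
    let outliers = hashes
        .iter()
        .enumerate()
        .filter(|&(_, &h)| h != majority)
        .map(|(i, _)| i)
        .collect();
    (majority, outliers)
}
```

With reports [abc, abc, xyz] as in the diagram, the majority is abc and peer C (index 2) is flagged for re-sync.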

Low Overhead

| Operation | Latency | What It Does |
|---|---|---|
| updateTopology | 0.097 ms | Accept peers, re-solve ATSP |
| syncSharedState | 1.16 ms | SimpleHash + verify consensus |
| allReduceAsync | 0.005 ms | Start all-reduce (non-blocking) |
| awaitAsyncReduce | 7.03 ms | Wait for completion |

✏️ Test Your Understanding

1. Why can't PCCL use SHA-256 for shared state verification?

2. Why is the TSP "asymmetric" for network topology?

3. What happens if SimpleHash produces different results on different GPUs?

Answers:
1. SHA-256 is sequential; SimpleHash parallelizes across GPU threads
2. Upload speed ≠ download speed (asymmetric bandwidth)
3. Peers would appear "out of sync" when they're not - catastrophic!

"SHAT" = SimpleHash + ATSP + Topology

The three pillars of PCCL architecture optimization!

Chapter Summary