Chapter 2: Architecture

The Master-Client Model That Makes It All Work

🎪 The Wedding Planner Analogy

Think of PCCL's architecture like a wedding:

The Big Picture

```
                  ┌─────────────────────┐
                  │       MASTER        │
                  │  (Wedding Planner)  │
                  │                     │
                  │ • Tracks who's in   │
                  │ • Coordinates ops   │
                  │ • Optimizes ring    │
                  │ • NO data transfer  │
                  └──────────┬──────────┘
                             │ Coordination only
                             │ (lightweight!)
                             │
       ┌─────────────────────┼─────────────────────┐
       │                     │                     │
       ▼                     ▼                     ▼
  ┌─────────┐           ┌─────────┐           ┌─────────┐
  │ PEER A  │◄─────────►│ PEER B  │◄─────────►│ PEER C  │
  │ (GPU 0) │  P2P Data │ (GPU 1) │  P2P Data │ (GPU 2) │
  └─────────┘           └─────────┘           └─────────┘
       ▲                                           │
       └───────────────── P2P Data ────────────────┘
```

Data flows DIRECTLY between peers - the Master never touches it!

What the Master Tracks

| Responsibility | Details |
|---|---|
| Client Status | REGISTERED vs ACCEPTED phase |
| Ring Topology | Optimal peer ordering (via ATSP) |
| Shared State Hashes | Identifies out-of-sync peers |
| Collective Operations | Consensus on start/complete/abort |
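This bookkeeping can be pictured as a small data shape. The Rust sketch below is purely illustrative - `ClientPhase`, `MasterState`, and all field names are invented for exposition, not PCCL's actual definitions:

```rust
// Illustrative sketch of the per-run state a PCCL master might track.
// All type and field names here are invented, NOT PCCL's definitions.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ClientPhase {
    Registered, // connected, but not yet part of the ring
    Accepted,   // full participant in collective operations
}

#[derive(Debug)]
struct MasterState {
    phases: Vec<ClientPhase>,  // one entry per known client
    ring_order: Vec<usize>,    // ATSP-optimized peer ordering
    state_hashes: Vec<u64>,    // last SimpleHash reported by each peer
    active_op: Option<String>, // at most one major operation at a time
}
```

The point of the shape: everything here is small metadata - hashes, indices, a phase flag - which is why the master stays lightweight while gigabytes flow peer-to-peer.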

The Golden Rule

⚠️ ONE Operation At A Time!

PCCL enforces a simple invariant: only ONE major operation - a topology update, a shared-state sync, or a collective - can be active at any moment.

Why? Serializing operations makes fault tolerance TRACTABLE: when a peer drops, there is exactly one operation in flight to abort and retry. Previous attempts that allowed concurrent operations failed!
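A minimal sketch of how such an exclusivity rule can be enforced - illustrative Rust only, with `Op` and `OpGuard` invented here, not PCCL's implementation:

```rust
// Illustrative sketch of the one-operation-at-a-time rule.
// Op and OpGuard are invented names, not PCCL's implementation.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    UpdateTopology,
    SyncSharedState,
    AllReduce,
}

struct OpGuard {
    active: Option<Op>,
}

impl OpGuard {
    fn new() -> Self {
        OpGuard { active: None }
    }

    /// Refuses to start a new operation while another is in flight.
    fn try_begin(&mut self, op: Op) -> bool {
        if self.active.is_some() {
            return false; // the golden rule: reject concurrent operations
        }
        self.active = Some(op);
        true
    }

    fn finish(&mut self) {
        self.active = None;
    }
}
```

On failure the master only ever has to reason about `active`, a single slot - that is the tractability win.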

SimpleHash: GPU-Parallel Hashing

💡 The Problem

To verify all peers have identical weights, we need to hash gigabytes of data. Standard cryptographic hashes (SHA-256) are inherently sequential - each block's input depends on the previous block's output - so they run slowly on a GPU.

SimpleHash Algorithm

PCCL uses a custom hash inspired by FNV-1a but designed for GPU parallelism:

```
SimpleHash Architecture (Warp-Tree Reduce)
──────────────────────────────────────────
Step 1: Parallel chunk hashing (one thread per element)

┌────┬────┬────┬────┬────┬────┬────┬────┐
│ h0 │ h1 │ h2 │ h3 │ h4 │ h5 │ h6 │ h7 │   ← 8 threads, 8 hashes
└──┬─┴──┬─┴──┬─┴──┬─┴──┬─┴──┬─┴──┬─┴──┬─┘
   └──┬─┘    └──┬─┘    └──┬─┘    └──┬─┘
Step 2: Warp-level reduction (XOR + multiply)
    ┌─┴─┐     ┌─┴─┐     ┌─┴─┐     ┌─┴─┐
    │h01│     │h23│     │h45│     │h67│    ← 4 partial hashes
    └─┬─┘     └─┬─┘     └─┬─┘     └─┬─┘
      └────┬────┘         └────┬────┘
Step 3: Block-level reduction
        ┌──┴──┐             ┌──┴──┐
        │h0123│             │h4567│        ← 2 partial hashes
        └──┬──┘             └──┬──┘
           └─────────┬─────────┘
Step 4: Final hash
              ┌──────┴─────┐
              │ FINAL HASH │               ← 64-bit result
              └────────────┘
```
```cuda
// SimpleHash core (FNV-1a inspired)
__device__ uint64_t simple_hash_step(uint64_t hash, uint64_t value) {
    const uint64_t FNV_PRIME = 0x100000001b3ULL;
    hash ^= value;
    hash *= FNV_PRIME;
    return hash;
}

// Warp-level reduction using shuffle (assumes a full 32-thread warp)
__device__ uint64_t warp_reduce_hash(uint64_t hash) {
    for (int offset = 16; offset > 0; offset /= 2) {
        uint64_t other = __shfl_down_sync(0xffffffff, hash, offset);
        hash = simple_hash_step(hash, other);
    }
    return hash;
}
```

Cross-GPU Determinism: SimpleHash produces IDENTICAL results on GTX 980 Ti through B200. This required careful handling of floating-point bit patterns and avoiding architecture-specific optimizations!
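To see why this determinism is achievable, it helps to mirror the combine step on the CPU. The Rust sketch below applies the same XOR-then-multiply step and reduces pairwise like the warp tree above; the zero seed and the exact tree shape are illustrative assumptions, not PCCL's verified layout:

```rust
// Host-side mirror of the kernel's combine step, reduced as a pairwise
// tree like the warp/block stages above. The seed and the exact tree
// shape are illustrative assumptions, not PCCL's layout.

fn simple_hash_step(hash: u64, value: u64) -> u64 {
    const FNV_PRIME: u64 = 0x100000001b3;
    (hash ^ value).wrapping_mul(FNV_PRIME)
}

// Combine adjacent pairs until one 64-bit hash remains.
// Assumes a non-empty input.
fn tree_reduce(mut hashes: Vec<u64>) -> u64 {
    while hashes.len() > 1 {
        hashes = hashes
            .chunks(2)
            .map(|p| if p.len() == 2 { simple_hash_step(p[0], p[1]) } else { p[0] })
            .collect();
    }
    hashes[0]
}
```

Because every step is pure integer arithmetic (XOR plus wrapping multiply), the result depends only on the inputs and the reduction tree's shape - so fixing that shape in the kernel, rather than letting each architecture pick its own schedule, is what makes the hash portable.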

ATSP: Optimal Ring Ordering

🧠 The Traveling Salesman Problem

For ring all-reduce, peer ORDER matters. If Peer A (Tokyo) is next to Peer B (London) in the ring, every message crosses the ocean = slow!

PCCL solves the Asymmetric TSP (ATSP): the cost of the link A→B need not equal B→A, because upload speed ≠ download speed.

Why Asymmetric?

```
Asymmetric Bandwidth Example:
─────────────────────────────
Tokyo ────────────────► London     100 Mbps upload
Tokyo ◄──────────────── London     500 Mbps download
```

The cost Tokyo→London ≠ London→Tokyo! Standard TSP assumes symmetric costs - WRONG for networks.

ATSP Solution

```
ATSP Ring Optimization:
───────────────────────
Input: Bandwidth matrix (measured via probing, Mbps)

┌─────────┬────────┬────────┬────────┬────────┐
│ From\To │ Tokyo  │ Seoul  │ London │  NYC   │
├─────────┼────────┼────────┼────────┼────────┤
│ Tokyo   │   -    │  800   │  100   │  150   │
│ Seoul   │  750   │   -    │  120   │  140   │
│ London  │  100   │  110   │   -    │  500   │
│ NYC     │  140   │  130   │  450   │   -    │
└─────────┴────────┴────────┴────────┴────────┘

Output: Optimal ring order

Tokyo ──► Seoul ──► London ──► NYC ──► Tokyo
      800       120        500     140

Slowest link = 120 Mbps - the best achievable
over all ring orderings (bottleneck objective)!
```
```rust
// ATSP solver (greedy nearest-neighbor + 2-opt improvement)
fn solve_atsp(bandwidth: &Matrix) -> Vec<usize> {
    // Start with a greedy solution
    let mut tour = greedy_nearest_neighbor(bandwidth);

    // Improve with 2-opt swaps until no swap helps
    while two_opt_improve(&mut tour, bandwidth) {}

    tour
}

// Objective: maximize the minimum edge (bottleneck TSP variant).
// Note the wrap-around edge back to the start - the tour is a ring.
fn tour_cost(tour: &[usize], bw: &Matrix) -> u64 {
    (0..tour.len())
        .map(|i| bw[tour[i]][tour[(i + 1) % tour.len()]])
        .min()
        .unwrap()
}
```
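The helpers referenced above can be sketched concretely. Below is an illustrative greedy seed plus a closed-ring bottleneck cost, with plain `Vec<Vec<u64>>` standing in for the `Matrix` type - a guess at the intent, not PCCL's code (node indices: Tokyo=0, Seoul=1, London=2, NYC=3):

```rust
// Illustrative helpers, with Vec<Vec<u64>> in place of the Matrix type.

// Greedy seed: from the current node, always follow the fastest
// outgoing link to a not-yet-visited node, starting at node 0.
fn greedy_nearest_neighbor(bw: &[Vec<u64>]) -> Vec<usize> {
    let n = bw.len();
    let mut tour = vec![0usize];
    let mut visited = vec![false; n];
    visited[0] = true;
    while tour.len() < n {
        let cur = *tour.last().unwrap();
        let next = (0..n)
            .filter(|&j| !visited[j])
            .max_by_key(|&j| bw[cur][j])
            .unwrap();
        visited[next] = true;
        tour.push(next);
    }
    tour
}

// Bottleneck of the closed ring: the slowest link, INCLUDING the
// wrap-around edge from the last node back to the first.
fn ring_bottleneck(tour: &[usize], bw: &[Vec<u64>]) -> u64 {
    (0..tour.len())
        .map(|i| bw[tour[i]][tour[(i + 1) % tour.len()]])
        .min()
        .unwrap()
}
```

On the example bandwidth matrix, the greedy seed alone produces Tokyo → Seoul → NYC → London with a ring bottleneck of only 100 Mbps (the London→Tokyo link) - the 2-opt pass exists precisely to improve on such seeds.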

⚠️ When Topology Changes

ATSP is re-solved when ring membership changes:

- New peers are accepted into the run (during updateTopology)
- An existing peer leaves or fails and must be evicted from the ring

Shared State Consistency

```
Step 1: Everyone computes SimpleHash of their weights

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Peer A  │     │ Peer B  │     │ Peer C  │
│hash=abc │     │hash=abc │     │hash=xyz │  ← Different!
└────┬────┘     └────┬────┘     └────┬────┘
     └───────────────┼───────────────┘
                     ▼
Step 2: Master identifies outlier (majority vote)

               ┌─────────┐
               │ MASTER  │
               │ "C is   │
               │  wrong!"│
               └────┬────┘
                    ▼
Step 3: P2P transfer to fix C

┌─────────┐                ┌─────────┐
│ Peer A  │ ─────────────► │ Peer C  │
│hash=abc │   "Here's      │hash=abc │  ← Fixed!
└─────────┘   the data"    └─────────┘
```
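Step 2 - the majority vote over reported hashes - can be made concrete. A minimal Rust sketch (`find_outliers` and its return shape are invented here; the real protocol must also handle ties and quorum):

```rust
use std::collections::HashMap;

/// Returns (majority_hash, indices of peers that must re-sync).
/// Illustrative only - not PCCL's actual consensus code.
fn find_outliers(hashes: &[u64]) -> (u64, Vec<usize>) {
    // Count how many peers reported each hash.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &h in hashes {
        *counts.entry(h).or_insert(0) += 1;
    }
    // The hash reported by the most peers wins the vote.
    let (&majority, _) = counts.iter().max_by_key(|&(_, &c)| c).unwrap();
    // Any peer whose hash disagrees is flagged for P2P re-sync.
    let outliers = hashes
        .iter()
        .enumerate()
        .filter(|&(_, &h)| h != majority)
        .map(|(i, _)| i)
        .collect();
    (majority, outliers)
}
```

With reports [abc, abc, xyz] as in the diagram, the majority is abc and peer C (index 2) is flagged for re-sync.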

Low Overhead

| Operation | Latency | What It Does |
|---|---|---|
| updateTopology | 0.097 ms | Accept peers, re-solve ATSP |
| syncSharedState | 1.16 ms | SimpleHash + verify consensus |
| allReduceAsync | 0.005 ms | Start all-reduce (non-blocking) |
| awaitAsyncReduce | 7.03 ms | Wait for completion |

✏️ Test Your Understanding

1. Why can't PCCL use SHA-256 for shared state verification?

2. Why is the TSP "asymmetric" for network topology?

3. What happens if SimpleHash produces different results on different GPUs?

Answers:
1. SHA-256 is sequential; SimpleHash parallelizes across GPU threads
2. Upload speed ≠ download speed (asymmetric bandwidth)
3. Peers would appear "out of sync" when they're not - catastrophic!

"SHAT" = SimpleHash + ATSP + Topology

The three pillars of PCCL architecture optimization!

Chapter Summary