Appendix A: Alternatives to PCCL

Know Your Options

Comparison Table

| Library | Fault Tolerant | Dynamic Membership | Bit-Identical | WAN Optimized |
| --- | --- | --- | --- | --- |
| PCCL | βœ… Native | βœ… Runtime | βœ… Guaranteed | βœ… ATSP ring |
| NCCL | ❌ None | ❌ Fixed | ❌ Non-deterministic | ❌ Datacenter only |
| Gloo | ❌ None | ❌ Fixed | ❌ Non-deterministic | ⚠️ Basic TCP |
| Hivemind | βœ… DHT-based | βœ… Fully dynamic | ❌ Eventual | βœ… Designed for WAN |
| Horovod Elastic | ⚠️ Checkpoint | βœ… With restart | ❌ Non-deterministic | ❌ Uses NCCL/Gloo |
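
The "Bit-Identical" column comes down to reduction order. A toy sketch (the function names are illustrative, and this is not PCCL's actual implementation) of why a fixed reduction order is reproducible while arrival-order reduction is not:

```python
import random

def all_reduce_fixed(vals):
    """Sum contributions in a fixed rank order -> same bits every run."""
    total = 0.0
    for v in vals:              # always rank 0, 1, 2, ... order
        total += v              # identical sequence of FP additions
    return total

def all_reduce_arrival(vals):
    """Sum contributions in 'arrival' order -> low-order bits may vary."""
    order = vals[:]
    random.shuffle(order)       # models non-deterministic message arrival
    total = 0.0
    for v in order:
        total += v
    return total

grads = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7]

# Fixed order is reproducible: every run produces the same bits.
assert all_reduce_fixed(grads) == all_reduce_fixed(grads)

# Arrival order frequently produces more than one distinct float
# for the same inputs, because FP addition is not associative.
results = {all_reduce_arrival(grads) for _ in range(100)}
print(len(results))
```

The point is not the toy loop itself but the invariant: if every participant applies the same operations in the same order, the result is bit-identical across runs and across machines.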

NCCL (NVIDIA Collective Communications Library)

```
NCCL Architecture:
─────────────────────────────────────────────────────────────────────────

  GPU 0 ◄══NVLink══► GPU 1 ◄══NVLink══► GPU 2 ◄══NVLink══► GPU 3
    β”‚                  β”‚                                      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     InfiniBand / PCIe

Optimized for: Same machine or datacenter with NVLink/InfiniBand
```
| Pros | Cons |
| --- | --- |
| Fastest for datacenter | No fault tolerance |
| NVLink/InfiniBand optimized | Fixed world size |
| NVIDIA-supported | Non-deterministic reductions |
| Widely adopted | Crashes kill entire job |

⚠️ NCCL Non-Determinism

NCCL uses tree-based reductions whose operation order can vary between runs. Running the same all-reduce twice can produce different results because floating-point addition is not associative:

a + (b + c) β‰  (a + b) + c in floating point.
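
You can see this in plain Python with the textbook example:

```python
# Floating-point addition is not associative: grouping changes the bits.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A tree reduction that groups operands differently from run to run hits exactly this effect, which is why NCCL results are not bit-identical.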

Gloo (Facebook)

| Pros | Cons |
| --- | --- |
| Works on any hardware | Slower than NCCL for GPU |
| Good for CPU training | No fault tolerance |
| Open source | Fixed membership |

Hivemind

🐝 The Beehive Analogy

Hivemind is like a beehive: there is no queen (master), bees come and go freely, and the hive keeps functioning. Great for chaos, but you can't guarantee every bee has the exact same honey recipe!

```
Hivemind Architecture (Fully Decentralized):
─────────────────────────────────────────────────────────────────────────

          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚  DHT (Distributed Hash Table)  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β–²           β–²           β–²
               β”‚           β”‚           β”‚
               β–Ό           β–Ό           β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ Peer A  │◄►│ Peer B  │◄►│ Peer C  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

No master! Peers discover each other via DHT.
```
| Pros | Cons |
| --- | --- |
| No single point of failure | Eventual consistency only |
| Truly decentralized | Complex failure modes |
| Handles heterogeneous hardware | Non-deterministic results |
| Good for volunteer computing | Higher coordination overhead |
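
Hivemind's masterless discovery model can be caricatured in a few lines. This dict-as-DHT and the helper names are purely illustrative, not Hivemind's API; the real DHT is itself distributed across the peers:

```python
# Toy caricature of DHT-based peer discovery: peers announce themselves
# under a shared run key; newcomers read that key to find current peers.
# No master process is involved.
dht = {}  # stands in for the (actually distributed) hash table

def announce(dht, run_id, peer_addr):
    """A peer registers itself under the run's key."""
    dht.setdefault(run_id, set()).add(peer_addr)

def discover(dht, run_id):
    """Any peer can look up who is currently participating."""
    return sorted(dht.get(run_id, set()))

announce(dht, "my-training-run", "peer-a:1337")
announce(dht, "my-training-run", "peer-b:1337")
announce(dht, "my-training-run", "peer-c:1337")
print(discover(dht, "my-training-run"))
# ['peer-a:1337', 'peer-b:1337', 'peer-c:1337']

# A peer leaving is just a missing announcement; the rest keep going.
dht["my-training-run"].discard("peer-b:1337")
print(discover(dht, "my-training-run"))
# ['peer-a:1337', 'peer-c:1337']
```

The cost of this flexibility is the "Eventual" column above: different peers can hold different views of the membership at the same moment, so results are not bit-identical.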

Horovod Elastic

```
Horovod Elastic Recovery:
─────────────────────────────────────────────────────────────────────────

Normal operation:
  Worker 0 ──► Worker 1 ──► Worker 2 ──► Worker 3
                                            β”‚
                                            β–Ό
                                         CRASH!

Recovery (slow):
  1. Detect failure (timeout)        ~30 seconds
  2. Kill all workers                 ~5 seconds
  3. Load checkpoint from disk       ~60 seconds
  4. Restart with new world size     ~30 seconds
  ─────────────────────────────────────────────
  Total recovery time:              ~125 seconds (about 2 minutes)

PCCL recovery: ~250 ms (roughly 500x faster!)
```
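
The step timings in the diagram sum up as follows (these are the diagram's illustrative figures, not benchmarks):

```python
# Illustrative recovery-time arithmetic, using the figures in the diagram.
horovod_steps = {
    "detect failure (timeout)": 30.0,   # seconds
    "kill all workers": 5.0,
    "load checkpoint from disk": 60.0,
    "restart with new world size": 30.0,
}
horovod_total = sum(horovod_steps.values())
pccl_recovery = 0.25  # 250 ms

print(horovod_total)                  # 125.0 seconds (~2 minutes)
print(horovod_total / pccl_recovery)  # 500.0x speedup
```

The difference comes from the recovery mechanism: Horovod Elastic tears everything down and reloads from disk, while PCCL repairs membership in place.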

When to Use What

πŸ’‘ Decision Guide

| Scenario | Best Choice |
| --- | --- |
| Single datacenter, NVIDIA GPUs, stable | NCCL |
| CPU-only training | Gloo |
| Volunteer/heterogeneous compute | Hivemind |
| Cloud spot instances, need reliability | Horovod Elastic |
| WAN, fault tolerance, bit-identical | PCCL |

Why PCCL Was Created

Researcher: "I want to train across multiple datacenters with spot instances."
NCCL: "I only work in one datacenter with InfiniBand."
Gloo: "I can do WAN, but if one node dies, everything dies."
Hivemind: "I handle failures, but your gradients won't be bit-identical."
Horovod: "I can recover, but it takes 2 minutes and loses progress."
Researcher: "I need ALL of: WAN, fault tolerance, bit-identical, fast recovery."
Everyone: "..."
PCCL: "That's why I exist."

Migration Guide

From NCCL to PCCL

```python
# Before (NCCL via PyTorch)
import torch.distributed as dist

dist.init_process_group(backend='nccl')
dist.all_reduce(tensor)
```

```python
# After (PCCL)
import pccl

client = pccl.Client(master_addr, master_port)
client.all_reduce(tensor)  # fault-tolerant!
```

From Hivemind to PCCL

```python
# Before (Hivemind)
import hivemind

dht = hivemind.DHT(initial_peers=[...])
optimizer = hivemind.Optimizer(dht, ...)
```

```python
# After (PCCL) - simpler, deterministic
import pccl

client = pccl.Client(master_addr, master_port)
# Use a standard optimizer; PCCL handles synchronization
```

✏️ Choose the Right Tool

For each scenario, which library would you choose?

  1. Training GPT-4 scale model in a single AWS region with p4d instances
  2. Distributed training across home GPUs donated by volunteers
  3. Training on preemptible GCP instances that can be killed anytime
  4. Research requiring reproducible, bit-identical results

Answers:
1. NCCL (stable datacenter, maximum performance)
2. Hivemind (heterogeneous, unreliable, no central control)
3. PCCL (fault tolerance + fast recovery)
4. PCCL (only option with bit-identical guarantee)