Appendix A: Alternatives to PCCL

Know Your Options

Comparison Table

| Library | Fault Tolerant | Dynamic Membership | Bit-Identical | WAN Optimized |
| --- | --- | --- | --- | --- |
| PCCL | βœ… Native | βœ… Runtime | βœ… Guaranteed | βœ… ATSP ring |
| NCCL | ❌ None | ❌ Fixed | ❌ Non-deterministic | ❌ Datacenter only |
| Gloo | ❌ None | ❌ Fixed | ❌ Non-deterministic | ⚠️ Basic TCP |
| Hivemind | βœ… DHT-based | βœ… Fully dynamic | ❌ Eventual | βœ… Designed for WAN |
| Horovod Elastic | ⚠️ Checkpoint | βœ… With restart | ❌ Non-deterministic | ❌ Uses NCCL/Gloo |
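
The "Bit-Identical" column comes down to reduction order. A toy sketch (the function names are illustrative, and this is not PCCL's actual implementation) of why a fixed reduction order is reproducible while arrival-order reduction is not:

```python
import random

def all_reduce_fixed(vals):
    """Sum contributions in a fixed rank order -> same bits every run."""
    total = 0.0
    for v in vals:              # always rank 0, 1, 2, ... order
        total += v              # identical sequence of FP additions
    return total

def all_reduce_arrival(vals):
    """Sum contributions in 'arrival' order -> low-order bits may vary."""
    order = vals[:]
    random.shuffle(order)       # models non-deterministic message arrival
    total = 0.0
    for v in order:
        total += v
    return total

grads = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7]

# Fixed order is reproducible: every run produces the same bits.
assert all_reduce_fixed(grads) == all_reduce_fixed(grads)

# Arrival order frequently produces more than one distinct float
# for the same inputs, because FP addition is not associative.
results = {all_reduce_arrival(grads) for _ in range(100)}
print(len(results))
```

The point is not the toy loop itself but the invariant: if every participant applies the same operations in the same order, the result is bit-identical across runs and across machines.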

NCCL (NVIDIA Collective Communications Library)

```
NCCL Architecture:
─────────────────────────────────────────────────────────────────────────

  GPU 0 ◄══NVLink══► GPU 1 ◄══NVLink══► GPU 2 ◄══NVLink══► GPU 3
    β”‚                  β”‚                                      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     InfiniBand / PCIe

Optimized for: Same machine or datacenter with NVLink/InfiniBand
```
| Pros | Cons |
| --- | --- |
| Fastest for datacenter | No fault tolerance |
| NVLink/InfiniBand optimized | Fixed world size |
| NVIDIA-supported | Non-deterministic reductions |
| Widely adopted | Crashes kill entire job |

⚠️ NCCL Non-Determinism

NCCL uses tree-based reductions whose operation order can vary between runs. Running the same all-reduce twice can produce different results because floating-point addition is not associative:

a + (b + c) β‰  (a + b) + c in floating point.
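
You can see this in plain Python with the textbook example:

```python
# Floating-point addition is not associative: grouping changes the bits.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A tree reduction that groups operands differently from run to run hits exactly this effect, which is why NCCL results are not bit-identical.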

Gloo (Facebook)

| Pros | Cons |
| --- | --- |
| Works on any hardware | Slower than NCCL for GPU |
| Good for CPU training | No fault tolerance |
| Open source | Fixed membership |

Hivemind

🐝 The Beehive Analogy

Hivemind is like a beehive: there is no queen (master), bees come and go freely, and the hive keeps functioning. Great for chaos, but you can't guarantee every bee has the exact same honey recipe!

```
Hivemind Architecture (Fully Decentralized):
─────────────────────────────────────────────────────────────────────────

          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚  DHT (Distributed Hash Table)  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β–²           β–²           β–²
               β”‚           β”‚           β”‚
               β–Ό           β–Ό           β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ Peer A  │◄►│ Peer B  │◄►│ Peer C  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

No master! Peers discover each other via DHT.
```
| Pros | Cons |
| --- | --- |
| No single point of failure | Eventual consistency only |
| Truly decentralized | Complex failure modes |
| Handles heterogeneous hardware | Non-deterministic results |
| Good for volunteer computing | Higher coordination overhead |
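
Hivemind's masterless discovery model can be caricatured in a few lines. This dict-as-DHT and the helper names are purely illustrative, not Hivemind's API; the real DHT is itself distributed across the peers:

```python
# Toy caricature of DHT-based peer discovery: peers announce themselves
# under a shared run key; newcomers read that key to find current peers.
# No master process is involved.
dht = {}  # stands in for the (actually distributed) hash table

def announce(dht, run_id, peer_addr):
    """A peer registers itself under the run's key."""
    dht.setdefault(run_id, set()).add(peer_addr)

def discover(dht, run_id):
    """Any peer can look up who is currently participating."""
    return sorted(dht.get(run_id, set()))

announce(dht, "my-training-run", "peer-a:1337")
announce(dht, "my-training-run", "peer-b:1337")
announce(dht, "my-training-run", "peer-c:1337")
print(discover(dht, "my-training-run"))
# ['peer-a:1337', 'peer-b:1337', 'peer-c:1337']

# A peer leaving is just a missing announcement; the rest keep going.
dht["my-training-run"].discard("peer-b:1337")
print(discover(dht, "my-training-run"))
# ['peer-a:1337', 'peer-c:1337']
```

The cost of this flexibility is the "Eventual" column above: different peers can hold different views of the membership at the same moment, so results are not bit-identical.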

Horovod Elastic

```
Horovod Elastic Recovery:
─────────────────────────────────────────────────────────────────────────

Normal operation:
  Worker 0 ──► Worker 1 ──► Worker 2 ──► Worker 3
                                            β”‚
                                            β–Ό
                                         CRASH!

Recovery (slow):
  1. Detect failure (timeout)        ~30 seconds
  2. Kill all workers                 ~5 seconds
  3. Load checkpoint from disk       ~60 seconds
  4. Restart with new world size     ~30 seconds
  ─────────────────────────────────────────────
  Total recovery time:              ~125 seconds (about 2 minutes)

PCCL recovery: ~250 ms (roughly 500x faster!)
```
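
The step timings in the diagram sum up as follows (these are the diagram's illustrative figures, not benchmarks):

```python
# Illustrative recovery-time arithmetic, using the figures in the diagram.
horovod_steps = {
    "detect failure (timeout)": 30.0,   # seconds
    "kill all workers": 5.0,
    "load checkpoint from disk": 60.0,
    "restart with new world size": 30.0,
}
horovod_total = sum(horovod_steps.values())
pccl_recovery = 0.25  # 250 ms

print(horovod_total)                  # 125.0 seconds (~2 minutes)
print(horovod_total / pccl_recovery)  # 500.0x speedup
```

The difference comes from the recovery mechanism: Horovod Elastic tears everything down and reloads from disk, while PCCL repairs membership in place.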

When to Use What

πŸ’‘ Decision Guide

| Scenario | Best Choice |
| --- | --- |
| Single datacenter, NVIDIA GPUs, stable | NCCL |
| CPU-only training | Gloo |
| Volunteer/heterogeneous compute | Hivemind |
| Cloud spot instances, need reliability | Horovod Elastic |
| WAN, fault tolerance, bit-identical | PCCL |

Why PCCL Was Created

Researcher: "I want to train across multiple datacenters with spot instances."
NCCL: "I only work in one datacenter with InfiniBand."
Gloo: "I can do WAN, but if one node dies, everything dies."
Hivemind: "I handle failures, but your gradients won't be bit-identical."
Horovod: "I can recover, but it takes 2 minutes and loses progress."
Researcher: "I need ALL of: WAN, fault tolerance, bit-identical, fast recovery."
Everyone: "..."
PCCL: "That's why I exist."

Migration Guide

From NCCL to PCCL

```python
# Before (NCCL via PyTorch)
import torch.distributed as dist

dist.init_process_group(backend='nccl')
dist.all_reduce(tensor)
```

```python
# After (PCCL)
import pccl

client = pccl.Client(master_addr, master_port)
client.all_reduce(tensor)  # fault-tolerant!
```

From Hivemind to PCCL

```python
# Before (Hivemind)
import hivemind

dht = hivemind.DHT(initial_peers=[...])
optimizer = hivemind.Optimizer(dht, ...)
```

```python
# After (PCCL) - simpler, deterministic
import pccl

client = pccl.Client(master_addr, master_port)
# Use a standard optimizer; PCCL handles synchronization
```

✏️ Choose the Right Tool

For each scenario, which library would you choose?

  1. Training GPT-4 scale model in a single AWS region with p4d instances
  2. Distributed training across home GPUs donated by volunteers
  3. Training on preemptible GCP instances that can be killed anytime
  4. Research requiring reproducible, bit-identical results

Answers:
1. NCCL (stable datacenter, maximum performance)
2. Hivemind (heterogeneous, unreliable, no central control)
3. PCCL (fault tolerance + fast recovery)
4. PCCL (only option with bit-identical guarantee)