# Appendix A: Alternatives to PCCL

## Know Your Options

### Comparison Table
| Library | Fault Tolerant | Dynamic Membership | Bit-Identical | WAN Optimized |
|---|---|---|---|---|
| PCCL | ✅ Native | ✅ Runtime | ✅ Guaranteed | ✅ ATSP ring |
| NCCL | ❌ None | ❌ Fixed | ❌ Non-deterministic | ❌ Datacenter only |
| Gloo | ❌ None | ❌ Fixed | ❌ Non-deterministic | ⚠️ Basic TCP |
| Hivemind | ✅ DHT-based | ✅ Fully dynamic | ❌ Eventual | ✅ Designed for WAN |
| Horovod Elastic | ⚠️ Checkpoint | ⚠️ With restart | ❌ Non-deterministic | ❌ Uses NCCL/Gloo |
## NCCL (NVIDIA Collective Communications Library)

```
NCCL Architecture:
─────────────────────────────────────────────────────────
GPU 0 ──NVLink──▶ GPU 1 ──NVLink──▶ GPU 2 ──NVLink──▶ GPU 3
  │                  │                  │
  └──────────────────┴──────────────────┘
           InfiniBand / PCIe
```

Optimized for: same machine or datacenter with NVLink/InfiniBand
| Pros | Cons |
|---|---|
| Fastest for datacenter | No fault tolerance |
| NVLink/InfiniBand optimized | Fixed world size |
| NVIDIA-supported | Non-deterministic reductions |
| Widely adopted | Crashes kill entire job |
⚠️ NCCL Non-Determinism

NCCL uses tree-based reductions with non-deterministic ordering. Running the same all-reduce twice can produce different results because floating-point addition is not associative: a + (b + c) ≠ (a + b) + c.
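The effect is easy to reproduce without any GPU at all; a minimal sketch in plain Python, summing the same three values in two different reduction orders:

```python
# Floating-point addition is not associative, so reduction order matters.
a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c   # one possible reduction tree
right_first = a + (b + c)  # another tree shape, same inputs

print(left_first)                 # 0.6000000000000001
print(right_first)                # 0.6
print(left_first == right_first)  # False
```

This is exactly why a bit-identical guarantee requires fixing the reduction order, not just the reduction operator.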
## Gloo (Facebook)
- CPU-focused collective library
- Supports TCP, shared memory, InfiniBand
- Used by PyTorch for CPU training
- No GPU-direct support (copies through CPU)
| Pros | Cons |
|---|---|
| Works on any hardware | Slower than NCCL for GPU |
| Good for CPU training | No fault tolerance |
| Open source | Fixed membership |
## Hivemind

🐝 The Beehive Analogy

Hivemind is like a beehive - no queen (master), bees come and go freely, and the hive keeps functioning. Great for chaos, but you can't guarantee every bee has the exact same honey recipe!
```
Hivemind Architecture (Fully Decentralized):
─────────────────────────────────────────────────────────
          ┌───────────────────────────┐
          │   DHT (Distributed Hash   │
          │          Table)           │
          └───────────────────────────┘
             ▲          ▲          ▲
             │          │          │
             ▼          ▼          ▼
      ┌────────┐   ┌────────┐   ┌────────┐
      │ Peer A │◀─▶│ Peer B │◀─▶│ Peer C │
      └────────┘   └────────┘   └────────┘
```

No master! Peers discover each other via DHT.
| Pros | Cons |
|---|---|
| No single point of failure | Eventual consistency only |
| Truly decentralized | Complex failure modes |
| Handles heterogeneous hardware | Non-deterministic results |
| Good for volunteer computing | Higher coordination overhead |
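To make DHT-based discovery concrete, here is a toy sketch in plain Python (not Hivemind's actual API): peers and keys are hashed into one ID space, and a key is "owned" by the peer whose ID is numerically closest, so ownership shifts automatically as peers come and go.

```python
import hashlib

def node_id(name: str) -> int:
    """Map a peer name or key into a shared 32-bit hash space."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:4], "big")

class ToyDHT:
    """Minimal DHT-style lookup: a key is owned by the closest peer ID."""
    def __init__(self):
        self.peers = {}  # peer name -> peer ID

    def join(self, name):            # peers can come and go freely
        self.peers[name] = node_id(name)

    def leave(self, name):
        self.peers.pop(name, None)

    def owner(self, key):
        kid = node_id(key)
        return min(self.peers, key=lambda p: abs(self.peers[p] - kid))

dht = ToyDHT()
for p in ["peer-a", "peer-b", "peer-c"]:
    dht.join(p)
owner_before = dht.owner("gradients/step42")
dht.leave(owner_before)                      # the owning peer vanishes...
owner_after = dht.owner("gradients/step42")  # ...and the key finds a new home
print(owner_before != owner_after)  # True
```

No peer is special: any peer can leave and lookups still resolve, which is the "no single point of failure" property in the table above. Real DHTs (like the Kademlia-style one Hivemind uses) add routing tables and replication, but the ownership idea is the same.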
## Horovod Elastic
- Python-level wrapper around NCCL/Gloo
- Fault tolerance via checkpointing
- Workers can join/leave between epochs
- Requires driver process to coordinate
```
Horovod Elastic Recovery:
─────────────────────────────────────────────────────────
Normal operation:
Worker 0 ──▶ Worker 1 ──▶ Worker 2 ──▶ Worker 3
                              │
                              ▼ CRASH!
Recovery (slow):
1. Detect failure (timeout)       ~30 seconds
2. Kill all workers                ~5 seconds
3. Load checkpoint from disk      ~60 seconds
4. Restart with new world size    ~30 seconds
─────────────────────────────────────────────
Total recovery time:              ~2 minutes
```

PCCL recovery: ~250 ms (roughly 500x faster)
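Summing the recovery steps and dividing by PCCL's ~250 ms recovery time gives the speedup, which works out to about 500x; a quick sanity check in Python:

```python
# Horovod Elastic recovery steps, in seconds (from the timeline above)
detect, kill, load_ckpt, restart = 30, 5, 60, 30
horovod_total = detect + kill + load_ckpt + restart
pccl_total = 0.250  # PCCL recovery, in seconds (~250 ms)

print(horovod_total)               # 125 seconds (~2 minutes)
print(horovod_total / pccl_total)  # 500.0
```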
## When to Use What

💡 Decision Guide
| Scenario | Best Choice |
|---|---|
| Single datacenter, NVIDIA GPUs, stable | NCCL |
| CPU-only training | Gloo |
| Volunteer/heterogeneous compute | Hivemind |
| Cloud spot instances, need reliability | Horovod Elastic |
| WAN, fault tolerance, bit-identical | PCCL |
## Why PCCL Was Created

As the comparison table shows, no existing library combines fault tolerance, runtime dynamic membership, bit-identical results, and WAN optimization; PCCL was created to provide all four in a single collective communications library.
## Migration Guide

### From NCCL to PCCL
```python
# Before (NCCL via PyTorch)
import torch.distributed as dist

dist.init_process_group(backend='nccl')
dist.all_reduce(tensor)

# After (PCCL)
import pccl

client = pccl.Client(master_addr, master_port)
client.all_reduce(tensor)  # fault-tolerant!
```
### From Hivemind to PCCL

```python
# Before (Hivemind)
import hivemind

dht = hivemind.DHT(initial_peers=[...])
optimizer = hivemind.Optimizer(dht, ...)

# After (PCCL) - simpler, deterministic
client = pccl.Client(master_addr, master_port)
# Use a standard optimizer; PCCL handles synchronization
```
## ✏️ Choose the Right Tool

For each scenario, which library would you choose?

1. Training a GPT-4-scale model in a single AWS region with p4d instances
2. Distributed training across home GPUs donated by volunteers
3. Training on preemptible GCP instances that can be killed at any time
4. Research requiring reproducible, bit-identical results
Answers:
1. NCCL (stable datacenter, maximum performance)
2. Hivemind (heterogeneous, unreliable, no central control)
3. PCCL (fault tolerance + fast recovery)
4. PCCL (only option with bit-identical guarantee)