Chapter 8: Benchmarks
Real Numbers from the PCCL Paper
⚠️ DISCLAIMER
All benchmark numbers in this chapter are from the original PCCL paper (arXiv:2505.14065).
Results depend heavily on network topology, cloud provider, time of day, congestion, and hardware. Your mileage WILL vary. These are reference points, not guarantees.
The Critical Insight: Multiple Connections
🔴 THIS IS THE MOST IMPORTANT SECTION
A single TCP connection over a WAN is slow. The paper's key finding:
- 1 connection: ~3.6 Gbit/s (Europe West)
- 64 connections: ~45 Gbit/s (Europe West)
That is roughly a 12x improvement. If you deploy PCCL with single connections, you are leaving over 90% of the achievable bandwidth on the table.
Why Multiple Connections Matter
From the paper (Section 6.2):
- TCP receive-window auto-scaling rarely reaches peak on high-latency WAN links
- Per-flow fair queuing in routers means more flows receive a proportionally larger aggregate share of bandwidth
- Multiple concurrent all-reduces can dispatch to connection pool
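PCCL's actual wire protocol is not shown in this chapter, but the striping idea behind the connection pool is easy to sketch: split a buffer into chunks, send each chunk over its own connection concurrently, and reassemble on the receiving side. A minimal loopback illustration (`socketpair` stands in for real WAN connections; all names here are illustrative, not PCCL API):

```python
import socket
import threading

def stripe(data: bytes, n: int) -> list[bytes]:
    """Split a buffer into n contiguous chunks, one per connection."""
    chunk = (len(data) + n - 1) // n
    return [data[i * chunk:(i + 1) * chunk] for i in range(n)]

def send_striped(data: bytes, n_conns: int = 4) -> bytes:
    """Send `data` over n_conns loopback socket pairs concurrently
    and reassemble the chunks in order on the receiving side."""
    pairs = [socket.socketpair() for _ in range(n_conns)]
    chunks = stripe(data, n_conns)
    received = [b""] * n_conns

    def sender(sock, chunk):
        sock.sendall(chunk)
        sock.shutdown(socket.SHUT_WR)   # signal EOF for this stripe

    def receiver(sock, idx):
        buf = b""
        while True:
            part = sock.recv(65536)
            if not part:
                break
            buf += part
        received[idx] = buf

    threads = []
    for i, (a, b) in enumerate(pairs):
        threads.append(threading.Thread(target=sender, args=(a, chunks[i])))
        threads.append(threading.Thread(target=receiver, args=(b, i)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for a, b in pairs:
        a.close()
        b.close()
    return b"".join(received)
```

Over a real WAN each stripe would be a separate TCP flow, so each gets its own congestion window and its own slice under per-flow fair queuing.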
Actual Benchmark Results (From Paper)
Experiment 1: Europe West (6 nodes)
| Metric | Value |
|---|---|
| World Size | 6 peers |
| Reduce Contribution | 1.073 GB |
| Reduce Time (single conn) | 8.3s ± 0.33s |
| Effective Throughput | 129.2 MB/s |
| Bandwidth Utilization | 3.67 Gbit/s |
Locations: Frankfurt, Paris, Belgium, London, Netherlands
Experiment 2: North America (12 nodes)
| Metric | Value |
|---|---|
| World Size | 12 peers |
| Reduce Contribution | 1.073 GB |
| Reduce Time (single conn) | 35.2s ± 0.31s |
| Effective Throughput | 30.48 MB/s |
| Bandwidth Utilization | 897.6 Mbit/s |
Locations: Oregon, Texas, South Carolina, Iowa, Montreal, Toronto, Virginia
Experiment 3: North America + Europe (18 nodes)
⚠️ Cross-Continental is SLOW
The undersea cable is the bottleneck. No amount of software optimization fixes physics.
| Metric | Value |
|---|---|
| World Size | 18 peers |
| Reduce Contribution | 1.073 GB |
| Reduce Time (single conn) | 90.5s ± 0.35s |
| Effective Throughput | 11.85 MB/s |
| Bandwidth Utilization | 358.4 Mbit/s |
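Each "Effective Throughput" figure above is simply the reduce contribution divided by the reduce time; last-digit differences come from rounding of the reported means:

```python
def effective_throughput_mb_s(contribution_gb: float, seconds: float) -> float:
    """Effective throughput in MB/s (1 GB = 1000 MB), as in the tables above."""
    return contribution_gb * 1000 / seconds

# Mean reduce times from the three single-connection experiments.
for name, t in [("Europe West (6 nodes)", 8.3),
                ("North America (12 nodes)", 35.2),
                ("NA + Europe (18 nodes)", 90.5)]:
    print(f"{name}: {effective_throughput_mb_s(1.073, t):.2f} MB/s")
```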
With Concurrent Connections
Europe West (64 concurrent connections)
| Metric | Value |
|---|---|
| Time | 2.6s ± 0.23s |
| Effective Throughput | 1.655 GB/s |
| Bandwidth | 44.47 Gbit/s |
North America (64 concurrent connections)
| Metric | Value |
|---|---|
| Time | 4.9s ± 0.60s |
| Effective Throughput | 0.878 GB/s |
| Bandwidth | 26.08 Gbit/s |
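The bandwidth gain from concurrency follows directly from the single-connection tables earlier in the chapter:

```python
def speedup(single_gbit: float, multi_gbit: float) -> float:
    """Bandwidth ratio between the 64-connection and 1-connection runs."""
    return multi_gbit / single_gbit

# Figures from the tables above (Gbit/s).
print(f"Europe West:   {speedup(3.67, 44.47):.1f}x")
print(f"North America: {speedup(0.8976, 26.08):.1f}x")
```

Note that the Europe West wall-clock improvement is only 8.3s to 2.6s (~3.2x); the larger bandwidth ratio is consistent with multiple concurrent all-reduces sharing the connection pool.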
PCCL vs Gloo Comparison
💡 Honest Assessment
The paper states PCCL is "competitive with Gloo" in HPC benchmarks, not dramatically superior. The advantage is fault tolerance and topology optimization, not raw speed.
| Experiment | PCCL Time | Gloo Time | Improvement |
|---|---|---|---|
| NA + Europe (18 nodes) | 90.5s ± 0.35s | 94.4s ± 1.84s | 4.15% |
| North America (12 nodes) | 35.2s ± 0.31s | 37.6s ± 0.85s | 6.33% |
| Europe West (6 nodes) | 8.3s ± 0.33s | 9.67s ± 0.77s | 14.17% |
Key insight: PCCL's advantage comes from ATSP topology optimization; Gloo constructs its ring in naive rank order, which is usually suboptimal over heterogeneous WAN links.
⚠️ Gloo Cannot Do Concurrent All-Reduces
Gloo doesn't natively support concurrent all-reduce operations. This means it cannot exploit the multiple-connection trick that gives PCCL its biggest WAN advantage.
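The topology effect can be illustrated with a toy latency matrix. A greedy nearest-neighbor tour (a common ATSP heuristic; not necessarily the solver PCCL uses) already beats naive rank order when same-region peers happen to have non-adjacent ranks:

```python
def ring_cost(order: list[int], lat: list[list[int]]) -> int:
    """Total one-way latency around the ring in the given order."""
    n = len(order)
    return sum(lat[order[i]][order[(i + 1) % n]] for i in range(n))

def greedy_ring(lat: list[list[int]]) -> list[int]:
    """Nearest-neighbor ATSP heuristic: always hop to the closest unvisited peer."""
    n = len(lat)
    order, remaining = [0], set(range(1, n))
    while remaining:
        nxt = min(remaining, key=lambda j: lat[order[-1]][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy pairwise latencies (ms): peers 0/2 share a region, as do 1/3,
# so the rank-order ring 0-1-2-3 crosses regions on every hop.
lat = [
    [0, 80, 10, 80],
    [80, 0, 80, 10],
    [10, 80, 0, 80],
    [80, 10, 80, 0],
]
naive = list(range(4))          # Gloo-style rank order
tuned = greedy_ring(lat)        # latency-aware ring
print(ring_cost(naive, lat), ring_cost(tuned, lat))  # → 320 180
```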
HPC Benchmark (Datacenter)
💡 The Real Tradeoff
PCCL sacrifices ~5-10% raw HPC performance for:
- Fault tolerance (abort mid-operation)
- Dynamic membership
- Bit-identical guarantees
If you're in a stable datacenter with InfiniBand, use NCCL. PCCL is for unreliable WAN.
Stress Testing
| Parameter | Value |
|---|---|
| Duration | ~8 hours per run |
| Peer churn interval | 500-1000ms (random) |
| Iteration time | ~100ms |
| Platforms tested | Linux, macOS, Windows |
| Pass criteria | Shared state advances correctly despite churn |
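The churn parameters above can be mimicked in a toy harness. This is a simulation only (no network, no real PCCL peers); the structure and names are illustrative, but the intervals match the table:

```python
import random

def run_churn_test(iterations: int = 50, seed: int = 0) -> int:
    """Toy model of the stress test: peers randomly join or crash at
    500-1000 ms intervals while a shared revision counter must advance
    once per ~100 ms iteration."""
    rng = random.Random(seed)
    peers = {0, 1, 2}                        # starting membership
    revision = 0
    elapsed_ms = 0.0
    next_churn_ms = rng.uniform(500, 1000)   # churn interval from table
    next_peer_id = 3

    for _ in range(iterations):
        elapsed_ms += 100                    # ~100 ms per iteration
        if elapsed_ms >= next_churn_ms:
            elapsed_ms = 0.0
            next_churn_ms = rng.uniform(500, 1000)
            if rng.random() < 0.5 and len(peers) > 1:
                peers.remove(rng.choice(sorted(peers)))   # peer crashes
            else:
                peers.add(next_peer_id)                   # peer joins
                next_peer_id += 1
        revision += 1    # pass criterion: shared state advances despite churn
    return revision
```

In the real suite the "revision" is PCCL's shared state version, and a run passes only if it advances correctly through every join and crash.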