Chapter 8: Benchmarks

Real Numbers from the PCCL Paper

⚠️ DISCLAIMER

All benchmark numbers in this chapter are from the original PCCL paper (arXiv:2505.14065).

Results depend heavily on: network topology, cloud provider, time of day, congestion, and hardware. Your mileage WILL vary. These are reference points, not guarantees.

The Critical Insight: Multiple Connections

🔴 THIS IS THE MOST IMPORTANT SECTION

Single TCP connection over WAN is SLOW. The paper's key finding: moving from a single connection to 64 concurrent connections raises Europe West bandwidth from 3.67 Gbit/s to 44.47 Gbit/s.

That's a ~12x improvement! If you deploy PCCL with single connections, you're leaving roughly 90% of your bandwidth on the table.

Why Multiple Connections Matter

From the paper (Section 6.2):

Bandwidth vs Connection Count (Europe West, from paper):
─────────────────────────────────────────────
Connections    Bandwidth (Gbit/s)
───────────    ──────────────────
  8            11.25
 16            18.17
 32            34.29
 64            44.47
100            44.55
128            45.74
─────────────────────────────────────────────

Plateau at ~64 connections. Beyond that, diminishing returns.
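The plateau shows up clearly in per-connection efficiency, which you can check directly from the table above (a small illustrative script, not part of PCCL):

```python
# Per-connection efficiency from the Europe West measurements above.
# Each connection contributes less as the count grows, which is why
# the paper reports diminishing returns past ~64 connections.
measured = {8: 11.25, 16: 18.17, 32: 34.29, 64: 44.47, 100: 44.55, 128: 45.74}

for conns, gbit in measured.items():
    per_conn = gbit / conns  # Gbit/s contributed by each connection
    print(f"{conns:4d} connections: {gbit:6.2f} Gbit/s total, {per_conn:.3f} Gbit/s each")
```

At 8 connections each connection carries ~1.4 Gbit/s; at 128, only ~0.36 Gbit/s.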

Actual Benchmark Results (From Paper)

Experiment 1: Europe West (6 nodes)

Metric                       Value
──────────────────────────   ──────────────
World Size                   6 peers
Reduce Contribution          1.073 GB
Reduce Time (single conn)    8.3s ± 0.33s
Effective Throughput         129.2 MB/s
Bandwidth Utilization        3.67 Gbit/s

Locations: Frankfurt, Paris, Belgium, London, Netherlands

Experiment 2: North America (12 nodes)

Metric                       Value
──────────────────────────   ──────────────
World Size                   12 peers
Reduce Contribution          1.073 GB
Reduce Time (single conn)    35.2s ± 0.31s
Effective Throughput         30.48 MB/s
Bandwidth Utilization        897.6 Mbit/s

Locations: Oregon, Texas, South Carolina, Iowa, Montreal, Toronto, Virginia

Experiment 3: North America + Europe (18 nodes)

⚠️ Cross-Continental is SLOW

The undersea cable is the bottleneck. No amount of software optimization fixes physics.

Metric                       Value
──────────────────────────   ──────────────
World Size                   18 peers
Reduce Contribution          1.073 GB
Reduce Time (single conn)    90.5s ± 0.35s
Effective Throughput         11.85 MB/s
Bandwidth Utilization        358.4 Mbit/s
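The "Effective Throughput" rows in the three experiments are consistent with a simple definition: per-peer reduce contribution divided by mean reduce time. A quick sanity check (illustrative script, not from the paper):

```python
# Sanity-check of the "Effective Throughput" rows above:
# per-peer contribution (1.073 GB) divided by mean reduce time.
GB = 1e9  # the tables use decimal gigabytes

experiments = {
    "Europe West (6 nodes)": 8.3,
    "North America (12 nodes)": 35.2,
    "NA + Europe (18 nodes)": 90.5,
}

for name, seconds in experiments.items():
    mb_per_s = 1.073 * GB / seconds / 1e6
    print(f"{name}: {mb_per_s:.2f} MB/s")
```

The results match the tables to rounding: ~129 MB/s, ~30.5 MB/s, and ~11.9 MB/s respectively.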

With Concurrent Connections

Europe West (64 concurrent connections)

Metric                       Value
──────────────────────────   ──────────────
Time                         2.6s ± 0.23s
Effective Throughput         1.655 GB/s
Bandwidth                    44.47 Gbit/s

North America (64 concurrent connections)

Metric                       Value
──────────────────────────   ──────────────
Time                         4.9s ± 0.60s
Effective Throughput         0.878 GB/s
Bandwidth                    26.08 Gbit/s

PCCL vs Gloo Comparison

💡 Honest Assessment

The paper states PCCL is "competitive with Gloo" in HPC benchmarks, not dramatically superior. The advantage is fault tolerance and topology optimization, not raw speed.

Experiment                  PCCL Time        Gloo Time        Improvement
─────────────────────────   ─────────────    ─────────────    ───────────
NA + Europe (18 nodes)      90.5s ± 0.35s    94.4s ± 1.84s    4.15%
North America (12 nodes)    35.2s ± 0.31s    37.6s ± 0.85s    6.33%
Europe West (6 nodes)       8.3s ± 0.33s     9.67s ± 0.77s    14.17%

Key insight: PCCL's advantage comes from ATSP topology optimization. Gloo uses naive rank order, which yields a suboptimal ring.
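To see why ring order matters, here is a hypothetical sketch of a greedy nearest-neighbor heuristic over a pairwise latency matrix. It captures the flavor of ATSP-style ring ordering; PCCL's actual solver is more sophisticated than this:

```python
# Hypothetical sketch: greedy nearest-neighbor ring ordering over a
# pairwise latency matrix (ms). NOT PCCL's actual ATSP solver.
def greedy_ring(latency):
    n = len(latency)
    order = [0]
    unvisited = set(range(1, n))
    while unvisited:
        last = order[-1]
        nxt = min(unvisited, key=lambda j: latency[last][j])  # cheapest next hop
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def ring_cost(latency, order):
    # Total latency around the ring, including the closing edge.
    n = len(order)
    return sum(latency[order[i]][order[(i + 1) % n]] for i in range(n))

# Toy matrix: nodes 0 and 2 share a "continent", as do 1 and 3.
# Intra-continent links cost 5 ms, cross-continent links 80 ms.
lat = [
    [0, 80, 5, 80],
    [80, 0, 80, 5],
    [5, 80, 0, 80],
    [80, 5, 80, 0],
]
naive = ring_cost(lat, [0, 1, 2, 3])     # rank order crosses the ocean 4 times
opt = ring_cost(lat, greedy_ring(lat))   # greedy order crosses it only twice
print(naive, opt)  # prints 320 170
```

Naive rank order pays the cross-continent cost on every hop; the optimized ring groups nearby peers and crosses the expensive links only as often as necessary.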

⚠️ Gloo Cannot Do Concurrent All-Reduces

Gloo doesn't natively support concurrent all-reduce operations. This means it cannot exploit the multiple-connection trick that gives PCCL its biggest WAN advantage.
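Conceptually, the trick is to split one large all-reduce into shards and run them as concurrent operations, each over its own TCP connection. A hypothetical sketch of the pattern follows; `all_reduce_async` is a stand-in, not PCCL's actual API, and the "reduction" here is a local placeholder:

```python
# Hypothetical sketch of the multiple-connection trick. NOT PCCL's API:
# `all_reduce_async` stands in for a real collective that would reduce
# its shard across all peers over a dedicated connection.
from concurrent.futures import ThreadPoolExecutor

def all_reduce_async(shard):
    # Placeholder reduction; a real implementation talks to the network.
    return sum(shard)

def sharded_all_reduce(data, num_shards=64):
    # Split the payload so up to `num_shards` operations run concurrently.
    size = max(1, len(data) // num_shards)
    shards = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(all_reduce_async, shards))

# Sharding must not change the result of the reduction.
print(sum(sharded_all_reduce(list(range(1000)))))  # prints 499500
```

Because Gloo exposes no way to run many all-reduces concurrently, it is stuck with whatever throughput a single connection per ring edge can achieve.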

HPC Benchmark (Datacenter)

HPC Reduce Throughput (35.6 Gbit/s Ethernet, 1.073 GB per peer):
─────────────────────────────────────────────
World Size    PCCL (MB/s)    Gloo (MB/s)
──────────    ───────────    ───────────
2             ~1800          ~1900
4             ~1600          ~1700
8             ~1400          ~1500
─────────────────────────────────────────────

Note: PCCL is slightly SLOWER in pure HPC due to abort-checking overhead (futex syscalls). The paper acknowledges this tradeoff.

💡 The Real Tradeoff

PCCL sacrifices ~5-10% raw HPC performance for fault tolerance, dynamic peer membership under churn, and topology optimization.

If you're in a stable datacenter with InfiniBand, use NCCL. PCCL is for unreliable WAN.

Stress Testing

Parameter              Value
────────────────────   ─────────────────────────────────────────────
Duration               ~8 hours per run
Peer churn interval    500-1000ms (random)
Iteration time         ~100ms
Platforms tested       Linux, macOS, Windows
Pass criteria          Shared state advances correctly despite churn

SimpleHash Performance

SimpleHash vs Thrust Reduce (RTX 4090): ───────────────────────────────────────────────────────────────────────── Data Size SimpleHash (GB/s) Thrust (GB/s) Theoretical Max ───────── ───────────────── ───────────── ─────────────── 16 MB ~400 ~350 1008.1 GB/s 64 MB ~700 ~650 256 MB ~850 ~800 1024 MB ~900 ~850 SimpleHash achieves ~90% of theoretical memory bandwidth.