Chapter 8: Benchmarks

Real Numbers from the PCCL Paper

⚠️ DISCLAIMER

All benchmark numbers in this chapter are from the original PCCL paper (arXiv:2505.14065).

Results depend heavily on: network topology, cloud provider, time of day, congestion, and hardware. Your mileage WILL vary. These are reference points, not guarantees.

The Critical Insight: Multiple Connections

🔴 THIS IS THE MOST IMPORTANT SECTION

Single TCP connection over WAN is SLOW. The paper's key finding: moving from a single connection to 64 concurrent connections raises Europe West bandwidth from 3.67 Gbit/s to 44.47 Gbit/s.

That's a ~12x improvement! If you deploy PCCL with single connections, you're leaving roughly 90% of your bandwidth on the table.

Why Multiple Connections Matter

From the paper (Section 6.2):

Bandwidth vs Connection Count (Europe West, from paper):
─────────────────────────────────────────────
Connections    Bandwidth (Gbit/s)
───────────    ──────────────────
  8            11.25
 16            18.17
 32            34.29
 64            44.47
100            44.55
128            45.74
─────────────────────────────────────────────

Plateau at ~64 connections. Beyond that, diminishing returns.
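The plateau shows up clearly in per-connection efficiency, which you can check directly from the table above (a small illustrative script, not part of PCCL):

```python
# Per-connection efficiency from the Europe West measurements above.
# Each connection contributes less as the count grows, which is why
# the paper reports diminishing returns past ~64 connections.
measured = {8: 11.25, 16: 18.17, 32: 34.29, 64: 44.47, 100: 44.55, 128: 45.74}

for conns, gbit in measured.items():
    per_conn = gbit / conns  # Gbit/s contributed by each connection
    print(f"{conns:4d} connections: {gbit:6.2f} Gbit/s total, {per_conn:.3f} Gbit/s each")
```

At 8 connections each connection carries ~1.4 Gbit/s; at 128, only ~0.36 Gbit/s.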

Actual Benchmark Results (From Paper)

Experiment 1: Europe West (6 nodes)

Metric                       Value
──────────────────────────   ──────────────
World Size                   6 peers
Reduce Contribution          1.073 GB
Reduce Time (single conn)    8.3s ± 0.33s
Effective Throughput         129.2 MB/s
Bandwidth Utilization        3.67 Gbit/s

Locations: Frankfurt, Paris, Belgium, London, Netherlands

Experiment 2: North America (12 nodes)

Metric                       Value
──────────────────────────   ──────────────
World Size                   12 peers
Reduce Contribution          1.073 GB
Reduce Time (single conn)    35.2s ± 0.31s
Effective Throughput         30.48 MB/s
Bandwidth Utilization        897.6 Mbit/s

Locations: Oregon, Texas, South Carolina, Iowa, Montreal, Toronto, Virginia

Experiment 3: North America + Europe (18 nodes)

⚠️ Cross-Continental is SLOW

The undersea cable is the bottleneck. No amount of software optimization fixes physics.

Metric                       Value
──────────────────────────   ──────────────
World Size                   18 peers
Reduce Contribution          1.073 GB
Reduce Time (single conn)    90.5s ± 0.35s
Effective Throughput         11.85 MB/s
Bandwidth Utilization        358.4 Mbit/s
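The "Effective Throughput" rows in the three experiments are consistent with a simple definition: per-peer reduce contribution divided by mean reduce time. A quick sanity check (illustrative script, not from the paper):

```python
# Sanity-check of the "Effective Throughput" rows above:
# per-peer contribution (1.073 GB) divided by mean reduce time.
GB = 1e9  # the tables use decimal gigabytes

experiments = {
    "Europe West (6 nodes)": 8.3,
    "North America (12 nodes)": 35.2,
    "NA + Europe (18 nodes)": 90.5,
}

for name, seconds in experiments.items():
    mb_per_s = 1.073 * GB / seconds / 1e6
    print(f"{name}: {mb_per_s:.2f} MB/s")
```

The results match the tables to rounding: ~129 MB/s, ~30.5 MB/s, and ~11.9 MB/s respectively.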

With Concurrent Connections

Europe West (64 concurrent connections)

Metric                       Value
──────────────────────────   ──────────────
Time                         2.6s ± 0.23s
Effective Throughput         1.655 GB/s
Bandwidth                    44.47 Gbit/s

North America (64 concurrent connections)

Metric                       Value
──────────────────────────   ──────────────
Time                         4.9s ± 0.60s
Effective Throughput         0.878 GB/s
Bandwidth                    26.08 Gbit/s

PCCL vs Gloo Comparison

💡 Honest Assessment

The paper states PCCL is "competitive with Gloo" in HPC benchmarks, not dramatically superior. The advantage is fault tolerance and topology optimization, not raw speed.

Experiment                  PCCL Time        Gloo Time        Improvement
─────────────────────────   ─────────────    ─────────────    ───────────
NA + Europe (18 nodes)      90.5s ± 0.35s    94.4s ± 1.84s    4.15%
North America (12 nodes)    35.2s ± 0.31s    37.6s ± 0.85s    6.33%
Europe West (6 nodes)       8.3s ± 0.33s     9.67s ± 0.77s    14.17%

Key insight: PCCL's advantage comes from ATSP topology optimization. Gloo uses naive rank order, which yields a suboptimal ring.
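To see why ring order matters, here is a hypothetical sketch of a greedy nearest-neighbor heuristic over a pairwise latency matrix. It captures the flavor of ATSP-style ring ordering; PCCL's actual solver is more sophisticated than this:

```python
# Hypothetical sketch: greedy nearest-neighbor ring ordering over a
# pairwise latency matrix (ms). NOT PCCL's actual ATSP solver.
def greedy_ring(latency):
    n = len(latency)
    order = [0]
    unvisited = set(range(1, n))
    while unvisited:
        last = order[-1]
        nxt = min(unvisited, key=lambda j: latency[last][j])  # cheapest next hop
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def ring_cost(latency, order):
    # Total latency around the ring, including the closing edge.
    n = len(order)
    return sum(latency[order[i]][order[(i + 1) % n]] for i in range(n))

# Toy matrix: nodes 0 and 2 share a "continent", as do 1 and 3.
# Intra-continent links cost 5 ms, cross-continent links 80 ms.
lat = [
    [0, 80, 5, 80],
    [80, 0, 80, 5],
    [5, 80, 0, 80],
    [80, 5, 80, 0],
]
naive = ring_cost(lat, [0, 1, 2, 3])     # rank order crosses the ocean 4 times
opt = ring_cost(lat, greedy_ring(lat))   # greedy order crosses it only twice
print(naive, opt)  # prints 320 170
```

Naive rank order pays the cross-continent cost on every hop; the optimized ring groups nearby peers and crosses the expensive links only as often as necessary.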

⚠️ Gloo Cannot Do Concurrent All-Reduces

Gloo doesn't natively support concurrent all-reduce operations. This means it cannot exploit the multiple-connection trick that gives PCCL its biggest WAN advantage.
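Conceptually, the trick is to split one large all-reduce into shards and run them as concurrent operations, each over its own TCP connection. A hypothetical sketch of the pattern follows; `all_reduce_async` is a stand-in, not PCCL's actual API, and the "reduction" here is a local placeholder:

```python
# Hypothetical sketch of the multiple-connection trick. NOT PCCL's API:
# `all_reduce_async` stands in for a real collective that would reduce
# its shard across all peers over a dedicated connection.
from concurrent.futures import ThreadPoolExecutor

def all_reduce_async(shard):
    # Placeholder reduction; a real implementation talks to the network.
    return sum(shard)

def sharded_all_reduce(data, num_shards=64):
    # Split the payload so up to `num_shards` operations run concurrently.
    size = max(1, len(data) // num_shards)
    shards = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(all_reduce_async, shards))

# Sharding must not change the result of the reduction.
print(sum(sharded_all_reduce(list(range(1000)))))  # prints 499500
```

Because Gloo exposes no way to run many all-reduces concurrently, it is stuck with whatever throughput a single connection per ring edge can achieve.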

HPC Benchmark (Datacenter)

HPC Reduce Throughput (35.6 Gbit/s Ethernet, 1.073 GB per peer):
─────────────────────────────────────────────
World Size    PCCL (MB/s)    Gloo (MB/s)
──────────    ───────────    ───────────
2             ~1800          ~1900
4             ~1600          ~1700
8             ~1400          ~1500
─────────────────────────────────────────────

Note: PCCL is slightly SLOWER in pure HPC due to abort-checking overhead (futex syscalls). The paper acknowledges this tradeoff.

💡 The Real Tradeoff

PCCL sacrifices ~5-10% raw HPC performance for fault tolerance, dynamic peer membership under churn, and topology optimization.

If you're in a stable datacenter with InfiniBand, use NCCL. PCCL is for unreliable WAN.

Stress Testing

Parameter              Value
────────────────────   ─────────────────────────────────────────────
Duration               ~8 hours per run
Peer churn interval    500-1000ms (random)
Iteration time         ~100ms
Platforms tested       Linux, macOS, Windows
Pass criteria          Shared state advances correctly despite churn

SimpleHash Performance

SimpleHash vs Thrust Reduce (RTX 4090): ───────────────────────────────────────────────────────────────────────── Data Size SimpleHash (GB/s) Thrust (GB/s) Theoretical Max ───────── ───────────────── ───────────── ─────────────── 16 MB ~400 ~350 1008.1 GB/s 64 MB ~700 ~650 256 MB ~850 ~800 1024 MB ~900 ~850 SimpleHash achieves ~90% of theoretical memory bandwidth.