Appendix B: Production Notes
What the Paper Authors Learned the Hard Way
🔴 READ THIS BEFORE DEPLOYING
This appendix contains critical operational details from the PCCL paper that are easy to miss but essential for production deployments.
1. Connection Pool Sizing
⚠️ CRITICAL: Single Connection = 90% Bandwidth Loss
From paper Section 6.2:
- Europe West: 1 conn = 3.6 Gbit/s → 64 conn = 45 Gbit/s
- Cross-continental: 1 conn = ~5 Gbit/s → 100 conn = 40+ Gbit/s
Recommendation: Start with 64 connections, tune based on your network.
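One rough way to turn the paper's measurements into a starting pool size is to assume a fixed per-connection efficiency at scale. The helper below is purely illustrative: the function name and the `efficiency` factor are assumptions for this sketch, not part of PCCL's API, and the factor is derived from the fact that the Europe West numbers (1 conn = 3.6 Gbit/s, 64 conn = 45 Gbit/s) imply roughly 0.7 Gbit/s per connection at scale, i.e. nowhere near linear scaling.

```python
import math

def pool_size_for_target(per_conn_gbit: float, target_gbit: float,
                         efficiency: float = 0.25, max_conns: int = 128) -> int:
    """Estimate how many connections are needed for a target aggregate
    bandwidth. `efficiency` models how far per-connection throughput
    drops once many connections compete for the same path."""
    effective = per_conn_gbit * efficiency     # pessimistic per-conn rate
    return min(max_conns, max(1, math.ceil(target_gbit / effective)))

# With the paper's single-connection measurement, targeting 45 Gbit/s:
print(pool_size_for_target(3.6, 45.0))  # -> 50, in the same ballpark as 64
```

Treat the output as a starting point for tuning, not a prediction; measure your own per-connection throughput first.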
2. Platform-Specific Quirks
From paper Section 5.1:
| Platform | Issue | Workaround |
|---|---|---|
| macOS (XNU) | Send buffer exhaustion handling bugs | PCCL has workarounds; monitor for hangs |
| Windows (WSA) | Drastically different error codes | PCCL handles internally |
| Windows | Half-close drain behavior differs | PCCL handles internally |
| Docker | Socket implementation quirks in containers | Test thoroughly in your container setup |
| All platforms | recv() blocking behavior varies with close()/shutdown() | PCCL handles internally |
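The last table row is easy to observe on POSIX systems. This minimal sketch (a local `socketpair`, nothing PCCL-specific) shows the baseline drain-then-EOF behavior whose platform differences PCCL papers over internally:

```python
import socket

def drain_after_half_close() -> list:
    """What recv() yields after the peer half-closes the connection."""
    a, b = socket.socketpair()
    a.sendall(b"tail data")
    a.shutdown(socket.SHUT_WR)     # half-close: no more bytes from `a`
    chunks = [b.recv(1024), b.recv(1024)]
    a.close()
    b.close()
    return chunks

# Buffered bytes are drained first, then recv() returns b"" (EOF)
# rather than blocking.
print(drain_after_half_close())    # -> [b'tail data', b'']
```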
💡 The Paper's Advice
"We relied on extensive CI testing and long-running stress testing simulating extreme conditions of peer churn to validate behavior."
Translation: They found bugs by running 8-hour stress tests. You should too.
3. Scheduler Sensitivity
⚠️ PCCL is Latency-Sensitive
From paper Section 6.1:
"PCCL must perform many futex syscalls during the course of performing a collective operation. Thus PCCL is more sensitive to OS scheduler-induced latencies and achieved throughput is therefore subject to higher variance."
Implications:
- Avoid noisy neighbors on shared VMs
- Consider CPU pinning for PCCL threads
- Monitor for throughput variance
- Dedicated instances recommended for production
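On Linux, plain process affinity is one way to address the CPU-pinning item above. A hedged sketch (the core selection is a placeholder, and this is standard-library affinity, not a PCCL API):

```python
import os

# Restrict the current process to a couple of dedicated cores so PCCL's
# futex-heavy threads are not migrated between CPUs by the scheduler.
# Core selection below is a placeholder; in production, pick cores kept
# free of noisy neighbours (e.g. isolated via kernel boot parameters).
if hasattr(os, "sched_setaffinity"):            # Linux-only API
    available = sorted(os.sched_getaffinity(0)) # cores we may run on
    pinned = set(available[:2])                 # e.g. the first two cores
    os.sched_setaffinity(0, pinned)
```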
4. Async DiLoCo: The Full Algorithm
The simplified version in Chapter 6 omits critical peer-churn handling. Here's what actually happens:
⚠️ Two Shared State Syncs Required
When peers join, you need TWO syncSharedState calls:
- First sync (`enforcePopular`): initial synchronization when the peer joins
- Second sync (`sendOnly`/`receiveOnly`): provides the result of the all-reduce the newcomer missed
The second sync uses `sendOnly` for existing peers and `receiveOnly` for newcomers, which prevents the newcomers' stale state from being elected as the popular state when newcomers outnumber existing peers.
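A toy simulation makes the failure mode concrete. The peer names, state labels, and the election function below are all illustrative, not PCCL's implementation:

```python
from collections import Counter

def enforce_popular(states: dict) -> str:
    """Majority election: every peer adopts the most common state."""
    return Counter(states.values()).most_common(1)[0][0]

# Three existing peers hold the post-all-reduce state "v2"; four
# newcomers join still holding the stale pre-reduce state "v1".
states = {"p0": "v2", "p1": "v2", "p2": "v2",
          "n0": "v1", "n1": "v1", "n2": "v1", "n3": "v1"}

# A popularity vote alone elects the WRONG state here, because the
# newcomers outnumber the existing peers:
assert enforce_popular(states) == "v1"

# Second sync: direction is forced rather than voted. Existing peers
# are sendOnly, newcomers are receiveOnly, so "v2" always propagates.
result = {s for p, s in states.items() if p.startswith("p")}.pop()
states = {p: result for p in states}
assert set(states.values()) == {"v2"}
```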
5. Quantization Determinism Tradeoff
From paper Section 5.3.1:
// The problem:
D(Q(x)) ≠ x
// Dequantization doesn't recover original values.
// Each peer has higher precision for its OWN contribution.
// Option A: Discard extra precision (bit-identical)
// Option B: Keep extra precision (better approximation as world_size grows)
// PCCL chooses Option A for determinism.
💡 "Lingering Precision"
If the extra precision from local data were kept, the fully-reduced result on peer A would differ from the result on peer B: A retains full precision for its own contribution, which B only has in dequantized form.
PCCL discards this to maintain bit-identical results across all peers.
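The tradeoff can be demonstrated with a toy fixed-point quantizer (8 fractional bits). This scheme is illustrative only, not PCCL's actual quantization:

```python
def q(x: float) -> int:      # toy quantizer: 8 fractional bits
    return round(x * 256)

def dq(n: int) -> float:     # dequantizer
    return n / 256

a, b = 0.123456789, 0.987654321   # one contribution per peer

# Dequantization does not recover the original value:
assert dq(q(a)) != a              # D(Q(x)) != x

# Option A (PCCL's choice): every peer sums only dequantized values,
# so the result is bit-identical on all peers.
sum_on_a = dq(q(a)) + dq(q(b))
sum_on_b = dq(q(a)) + dq(q(b))
assert sum_on_a == sum_on_b

# Option B: each peer keeps full precision for its OWN contribution.
# The result is closer to the exact a + b, but peers now disagree.
sum_on_a_opt_b = a + dq(q(b))     # A's lingering precision for `a`
sum_on_b_opt_b = dq(q(a)) + b     # B's lingering precision for `b`
assert sum_on_a_opt_b != sum_on_b_opt_b
```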
6. PTX Determinism Verification
The paper verified __nv_expf produces identical bits across:
- GTX 980 Ti (sm_52)
- GTX 1060 (sm_61)
- RTX 4090 (sm_89)
- GH 200 (sm_90)
- B200 (sm_100)
Test: Hash output bits across all 2^32 float bit-patterns. Result:
Sum of bits: 4602279786742895247
XOR of bits: 0x45ABDA35
// Identical on ALL tested architectures!
⚠️ Not All PTX is Deterministic
NVIDIA provides only a "soft guarantee" of forward compatibility. Some intrinsics may behave differently across architectures. The paper only verified __nv_expf.
If your outer optimizer uses other intrinsics, verify them yourself.
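A CPU-side sketch of such a verification harness, using Python's `math.exp` as a stand-in for `__nv_expf` and a sparse sample instead of the full 2^32 sweep (so the resulting hashes will not match the paper's numbers):

```python
import math
import struct

F32_INF = 0x7F800000

def expf_bits(pattern: int) -> int:
    """float32 bit pattern -> exp(x) -> float32 result bit pattern."""
    x = struct.unpack("<f", struct.pack("<I", pattern))[0]
    if math.isnan(x):
        return 0                       # canonicalize NaN payloads
    try:
        return struct.unpack("<I", struct.pack("<f", math.exp(x)))[0]
    except OverflowError:              # exp overflowed float32: +inf
        return F32_INF

# Fold every sampled output into a running sum and XOR, the same pair
# of hashes the paper reports for its exhaustive sweep; to reproduce
# the check, run the full sweep on each architecture and compare.
checksum, xor = 0, 0
for p in range(0, 2**32, 2**22):       # sparse sample, ~1k patterns
    bits = expf_bits(p)
    checksum += bits
    xor ^= bits
```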
7. threadpark: Why C++ Primitives Weren't Enough
From paper Section 5.2:
"C++ synchronization primitives such as condition variables proved insufficient in terms of wake-up latency."
PCCL uses a custom threadpark library built on platform-specific parking primitives:
- Linux: `futex`
- macOS: `__ulock_wait`
- Windows: `WakeByAddressSingle`
- FreeBSD/OpenBSD: futex equivalents
Implication: If you're debugging latency issues, standard profiling tools may not show the full picture.
8. Zero-Malloc Policy
From paper Section 5.3:
"The multi-threaded nature of the algorithm paired with a strictly necessary zero-malloc policy for the reduce code path necessitated a custom caching allocator."
Why it matters:
- No `malloc()`/`free()` in the hot path
- No page faults from lazy initialization
- Custom allocator pre-allocates and caches buffers
If you're integrating PCCL with other libraries that allocate during collectives, expect performance degradation.
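The idea behind such an allocator can be sketched in a few lines. This is a toy model of the concept, not PCCL's implementation:

```python
from collections import defaultdict

class CachingAllocator:
    """Buffers are allocated once up-front and recycled afterwards, so
    the hot (reduce) path never touches the system allocator. In C the
    warm-up would also touch every page to avoid lazy-init page faults."""

    def __init__(self):
        self._free = defaultdict(list)          # size -> cached buffers

    def warm_up(self, size: int, count: int) -> None:
        for _ in range(count):                  # off the hot path
            self._free[size].append(bytearray(size))

    def acquire(self, size: int) -> bytearray:
        if not self._free[size]:
            raise RuntimeError("pool exhausted; refusing to malloc in hot path")
        return self._free[size].pop()

    def release(self, buf: bytearray) -> None:
        self._free[len(buf)].append(buf)

pool = CachingAllocator()
pool.warm_up(1 << 20, 4)        # pre-allocate four 1 MiB buffers
buf = pool.acquire(1 << 20)     # hot path: just a list pop
pool.release(buf)               # hot path: just a list append
```

Raising on pool exhaustion, rather than silently falling back to allocation, is the design choice that makes the zero-malloc guarantee enforceable.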
9. Abort Signal Propagation
10. Recovery Time Expectations
⚠️ No Specific Recovery Time Claimed
The paper does NOT claim "250ms recovery time" or similar specific numbers.
What it DOES say:
"PCCL's error paths in comparison are roughly equally fast as the success paths and do not require dedicated rollback mechanisms."
Translation: Recovery is fast, but actual time depends on your network and operation size.
Production Checklist
✅ Before Going Live
- ☐ Connection pool sized appropriately (start with 64)
- ☐ Stress tested for 8+ hours with peer churn
- ☐ Platform-specific quirks understood
- ☐ Scheduler sensitivity addressed (dedicated instances?)
- ☐ Async DiLoCo peer-churn handling implemented correctly
- ☐ Quantization determinism tradeoffs understood
- ☐ Monitoring for throughput variance in place
- ☐ Rollback plan documented