Appendix B: Production Notes
What the Paper Authors Learned the Hard Way
🔴 READ THIS BEFORE DEPLOYING
This appendix contains critical operational details from the PCCL paper that are easy to miss but essential for production deployments.
1. Connection Pool Sizing
⚠️ CRITICAL: Single Connection = 90% Bandwidth Loss
From paper Section 6.2:
- Europe West: 1 conn = 3.6 Gbit/s → 64 conn = 45 Gbit/s
- Cross-continental: 1 conn = ~5 Gbit/s → 100 conn = 40+ Gbit/s
Recommendation: Start with 64 connections, tune based on your network.
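One rough way to turn the paper's measurements into a starting pool size is to assume a fixed per-connection efficiency at scale. The helper below is purely illustrative: the function name and the `efficiency` factor are assumptions for this sketch, not part of PCCL's API, and the factor is derived from the fact that the Europe West numbers (1 conn = 3.6 Gbit/s, 64 conn = 45 Gbit/s) imply roughly 0.7 Gbit/s per connection at scale, i.e. nowhere near linear scaling.

```python
import math

def pool_size_for_target(per_conn_gbit: float, target_gbit: float,
                         efficiency: float = 0.25, max_conns: int = 128) -> int:
    """Estimate how many connections are needed for a target aggregate
    bandwidth. `efficiency` models how far per-connection throughput
    drops once many connections compete for the same path."""
    effective = per_conn_gbit * efficiency     # pessimistic per-conn rate
    return min(max_conns, max(1, math.ceil(target_gbit / effective)))

# With the paper's single-connection measurement, targeting 45 Gbit/s:
print(pool_size_for_target(3.6, 45.0))  # -> 50, in the same ballpark as 64
```

Treat the output as a starting point for tuning, not a prediction; measure your own per-connection throughput first.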
2. Platform-Specific Quirks
From paper Section 5.1:
| Platform | Issue | Workaround |
|---|---|---|
| macOS (XNU) | Send buffer exhaustion handling bugs | PCCL has workarounds; monitor for hangs |
| Windows (WSA) | Drastically different error codes | PCCL handles internally |
| Windows | Half-close drain behavior differs | PCCL handles internally |
| Docker | Socket implementation quirks in containers | Test thoroughly in your container setup |
| All platforms | recv() blocking behavior varies with close()/shutdown() | PCCL handles internally |
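The last table row is easy to observe on POSIX systems. This minimal sketch (a local `socketpair`, nothing PCCL-specific) shows the baseline drain-then-EOF behavior whose platform differences PCCL papers over internally:

```python
import socket

def drain_after_half_close() -> list:
    """What recv() yields after the peer half-closes the connection."""
    a, b = socket.socketpair()
    a.sendall(b"tail data")
    a.shutdown(socket.SHUT_WR)     # half-close: no more bytes from `a`
    chunks = [b.recv(1024), b.recv(1024)]
    a.close()
    b.close()
    return chunks

# Buffered bytes are drained first, then recv() returns b"" (EOF)
# rather than blocking.
print(drain_after_half_close())    # -> [b'tail data', b'']
```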
💡 The Paper's Advice
"We relied on extensive CI testing and long-running stress testing simulating extreme conditions of peer churn to validate behavior."
Translation: They found bugs by running 8-hour stress tests. You should too.
3. Scheduler Sensitivity
⚠️ PCCL is Latency-Sensitive
From paper Section 6.1:
"PCCL must perform many futex syscalls during the course of performing a collective operation. Thus PCCL is more sensitive to OS scheduler-induced latencies and achieved throughput is therefore subject to higher variance."
Implications:
- Avoid noisy neighbors on shared VMs
- Consider CPU pinning for PCCL threads
- Monitor for throughput variance
- Dedicated instances recommended for production
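On Linux, plain process affinity is one way to address the CPU-pinning item above. A hedged sketch (the core selection is a placeholder, and this is standard-library affinity, not a PCCL API):

```python
import os

# Restrict the current process to a couple of dedicated cores so PCCL's
# futex-heavy threads are not migrated between CPUs by the scheduler.
# Core selection below is a placeholder; in production, pick cores kept
# free of noisy neighbours (e.g. isolated via kernel boot parameters).
if hasattr(os, "sched_setaffinity"):            # Linux-only API
    available = sorted(os.sched_getaffinity(0)) # cores we may run on
    pinned = set(available[:2])                 # e.g. the first two cores
    os.sched_setaffinity(0, pinned)
```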
4. Async DiLoCo: The Full Algorithm
The simplified version in Chapter 6 omits critical peer-churn handling. Here's what actually happens:
⚠️ Two Shared State Syncs Required
When peers join, you need TWO syncSharedState calls:
- First sync (`enforcePopular`): initial synchronization when the peer joins
- Second sync (`sendOnly`/`receiveOnly`): provides the result of the all-reduce the newcomer missed
The second sync uses `sendOnly` for existing peers and `receiveOnly` for newcomers, which prevents the newcomers' stale state from being elected as the popular state when newcomers outnumber existing peers.
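A toy simulation makes the failure mode concrete. The peer names, state labels, and the election function below are all illustrative, not PCCL's implementation:

```python
from collections import Counter

def enforce_popular(states: dict) -> str:
    """Majority election: every peer adopts the most common state."""
    return Counter(states.values()).most_common(1)[0][0]

# Three existing peers hold the post-all-reduce state "v2"; four
# newcomers join still holding the stale pre-reduce state "v1".
states = {"p0": "v2", "p1": "v2", "p2": "v2",
          "n0": "v1", "n1": "v1", "n2": "v1", "n3": "v1"}

# A popularity vote alone elects the WRONG state here, because the
# newcomers outnumber the existing peers:
assert enforce_popular(states) == "v1"

# Second sync: direction is forced rather than voted. Existing peers
# are sendOnly, newcomers are receiveOnly, so "v2" always propagates.
result = {s for p, s in states.items() if p.startswith("p")}.pop()
states = {p: result for p in states}
assert set(states.values()) == {"v2"}
```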
5. Quantization Determinism Tradeoff
From paper Section 5.3.1:
// The problem:
D(Q(x)) ≠ x
// Dequantization doesn't recover original values.
// Each peer has higher precision for its OWN contribution.
// Option A: Discard extra precision (bit-identical)
// Option B: Keep extra precision (better approximation as world_size grows)
// PCCL chooses Option A for determinism.
💡 "Lingering Precision"
If the extra precision from local data were kept, the fully-reduced result on peer A would differ from the result on peer B: A retains full precision for its own contribution, which B only has in dequantized form.
PCCL discards this to maintain bit-identical results across all peers.
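The tradeoff can be demonstrated with a toy fixed-point quantizer (8 fractional bits). This scheme is illustrative only, not PCCL's actual quantization:

```python
def q(x: float) -> int:      # toy quantizer: 8 fractional bits
    return round(x * 256)

def dq(n: int) -> float:     # dequantizer
    return n / 256

a, b = 0.123456789, 0.987654321   # one contribution per peer

# Dequantization does not recover the original value:
assert dq(q(a)) != a              # D(Q(x)) != x

# Option A (PCCL's choice): every peer sums only dequantized values,
# so the result is bit-identical on all peers.
sum_on_a = dq(q(a)) + dq(q(b))
sum_on_b = dq(q(a)) + dq(q(b))
assert sum_on_a == sum_on_b

# Option B: each peer keeps full precision for its OWN contribution.
# The result is closer to the exact a + b, but peers now disagree.
sum_on_a_opt_b = a + dq(q(b))     # A's lingering precision for `a`
sum_on_b_opt_b = dq(q(a)) + b     # B's lingering precision for `b`
assert sum_on_a_opt_b != sum_on_b_opt_b
```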
6. PTX Determinism Verification
The paper verified __nv_expf produces identical bits across:
- GTX 980 Ti (sm_52)
- GTX 1060 (sm_61)
- RTX 4090 (sm_89)
- GH 200 (sm_90)
- B200 (sm_100)
Test: Hash output bits across all 2^32 float bit-patterns. Result:
Sum of bits: 4602279786742895247
XOR of bits: 0x45ABDA35
// Identical on ALL tested architectures!
⚠️ Not All PTX is Deterministic
NVIDIA provides only a "soft guarantee" of forward compatibility. Some intrinsics may behave differently across architectures. The paper only verified __nv_expf.
If your outer optimizer uses other intrinsics, verify them yourself.
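A CPU-side sketch of such a verification harness, using Python's `math.exp` as a stand-in for `__nv_expf` and a sparse sample instead of the full 2^32 sweep (so the resulting hashes will not match the paper's numbers):

```python
import math
import struct

F32_INF = 0x7F800000

def expf_bits(pattern: int) -> int:
    """float32 bit pattern -> exp(x) -> float32 result bit pattern."""
    x = struct.unpack("<f", struct.pack("<I", pattern))[0]
    if math.isnan(x):
        return 0                       # canonicalize NaN payloads
    try:
        return struct.unpack("<I", struct.pack("<f", math.exp(x)))[0]
    except OverflowError:              # exp overflowed float32: +inf
        return F32_INF

# Fold every sampled output into a running sum and XOR, the same pair
# of hashes the paper reports for its exhaustive sweep; to reproduce
# the check, run the full sweep on each architecture and compare.
checksum, xor = 0, 0
for p in range(0, 2**32, 2**22):       # sparse sample, ~1k patterns
    bits = expf_bits(p)
    checksum += bits
    xor ^= bits
```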
7. threadpark: Why C++ Primitives Weren't Enough
From paper Section 5.2:
"C++ synchronization primitives such as condition variables proved insufficient in terms of wake-up latency."
PCCL uses a custom threadpark library built on platform-specific parking primitives:
- Linux: `futex`
- macOS: `__ulock_wait`
- Windows: `WakeByAddressSingle`
- FreeBSD/OpenBSD: futex equivalents
Implication: If you're debugging latency issues, standard profiling tools may not show the full picture.
8. Zero-Malloc Policy
From paper Section 5.3:
"The multi-threaded nature of the algorithm paired with a strictly necessary zero-malloc policy for the reduce code path necessitated a custom caching allocator."
Why it matters:
- No `malloc()`/`free()` in the hot path
- No page faults from lazy initialization
- Custom allocator pre-allocates and caches buffers
If you're integrating PCCL with other libraries that allocate during collectives, expect performance degradation.
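The idea behind such an allocator can be sketched in a few lines. This is a toy model of the concept, not PCCL's implementation:

```python
from collections import defaultdict

class CachingAllocator:
    """Buffers are allocated once up-front and recycled afterwards, so
    the hot (reduce) path never touches the system allocator. In C the
    warm-up would also touch every page to avoid lazy-init page faults."""

    def __init__(self):
        self._free = defaultdict(list)          # size -> cached buffers

    def warm_up(self, size: int, count: int) -> None:
        for _ in range(count):                  # off the hot path
            self._free[size].append(bytearray(size))

    def acquire(self, size: int) -> bytearray:
        if not self._free[size]:
            raise RuntimeError("pool exhausted; refusing to malloc in hot path")
        return self._free[size].pop()

    def release(self, buf: bytearray) -> None:
        self._free[len(buf)].append(buf)

pool = CachingAllocator()
pool.warm_up(1 << 20, 4)        # pre-allocate four 1 MiB buffers
buf = pool.acquire(1 << 20)     # hot path: just a list pop
pool.release(buf)               # hot path: just a list append
```

Raising on pool exhaustion, rather than silently falling back to allocation, is the design choice that makes the zero-malloc guarantee enforceable.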
9. Abort Signal Propagation
10. Recovery Time Expectations
⚠️ No Specific Recovery Time Claimed
The paper does NOT claim "250ms recovery time" or similar specific numbers.
What it DOES say:
"PCCL's error paths in comparison are roughly equally fast as the success paths and do not require dedicated rollback mechanisms."
Translation: Recovery is fast, but actual time depends on your network and operation size.
Production Checklist
✅ Before Going Live
- ☐ Connection pool sized appropriately (start with 64)
- ☐ Stress tested for 8+ hours with peer churn
- ☐ Platform-specific quirks understood
- ☐ Scheduler sensitivity addressed (dedicated instances?)
- ☐ Async DiLoCo peer-churn handling implemented correctly
- ☐ Quantization determinism tradeoffs understood
- ☐ Monitoring for throughput variance in place
- ☐ Rollback plan documented