Chapter 6: The DiLoCo Family
From DDP to Async DiLoCo - A Communication Diet
📞 The Long-Distance Relationship Analogy
Imagine you're in a long-distance relationship:
- DDP: Video call 24/7. Great connection, but your internet bill is INSANE.
- DiLoCo: One video call per week. Catch up on everything, then live your life.
- Async DiLoCo: Send a video message while doing other stuff. Watch their reply later.
Same relationship, different communication patterns. Choose based on your "bandwidth budget"!
The Communication Spectrum
Algorithm 1: DDP (Distributed Data Parallel)
⚠️ NOT Recommended for Internet Training!
DDP synchronizes gradients EVERY step. Over the internet, this is painfully slow. Included here as a baseline only.
    # DDP: Communicate every step
    for batch in data_loader:
        # 1. Forward & backward
        loss = model(batch)
        loss.backward()

        # 2. All-reduce gradients (EVERY STEP!)
        all_reduce(model.gradients, op=AVG)

        # 3. Update weights
        optimizer.step()
        optimizer.zero_grad()
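To see why every-step syncing hurts over the internet, a back-of-envelope calculation helps. The model size, gradient precision, and link speed below are illustrative assumptions, not numbers from this chapter:

```python
# Rough per-step communication cost: DDP vs. DiLoCo (illustrative numbers).
param_count = 1_000_000_000             # 1B-parameter model (assumed)
bytes_per_grad = 2                      # fp16/bf16 gradients (assumed)
payload = param_count * bytes_per_grad  # bytes all-reduced per sync

wan_bandwidth = 100e6 / 8               # 100 Mbit/s WAN link, in bytes/s (assumed)
ddp_sync_time = payload / wan_bandwidth          # DDP pays this EVERY step
H = 500
diloco_sync_time = payload / wan_bandwidth / H   # DiLoCo amortizes it over H steps

print(f"DDP:    {ddp_sync_time:.0f} s of transfer per step")
print(f"DiLoCo: {diloco_sync_time:.2f} s amortized per step")
```

With these (made-up but plausible) numbers, DDP spends minutes per step just moving gradients, while DiLoCo's amortized cost is a fraction of a second: the savings factor is exactly H.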
Algorithm 2: DiLoCo (Distributed Low-Communication)
💡 The Key Insight
Instead of syncing gradients, sync parameter differences (pseudo-gradients) every H steps.
Each peer trains locally for H steps, then they compare notes!
    # DiLoCo: Communicate every H steps
    for outer_step in range(max_outer_steps):
        # Update topology, sync shared state
        comm.update_topology()
        comm.sync_shared_state(shared_state)

        # INNER LOOP: H local steps, NO communication
        for i in range(H):
            batch = data_loader.next()
            loss = model(batch)
            loss.backward()
            inner_optimizer.step()
            inner_optimizer.zero_grad()

        # OUTER STEP: Compute and sync pseudo-gradients
        delta = theta_global - theta_local          # Parameter difference (pseudo-gradient)
        all_reduce(delta, op=AVG)
        outer_optimizer.step(theta_global, delta)   # Nesterov momentum on the pseudo-gradient
        theta_local = theta_global.clone()          # Reset local model to the new global
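The loop above can be simulated end to end in plain Python. This toy sketch (the 1-D quadratic loss, worker count, and all hyperparameters are invented for illustration) uses plain SGD for both the inner steps and the outer step, with the pseudo-gradient `theta_global - theta_local`:

```python
# Toy DiLoCo on f(x) = (x - 3)^2 with two workers.
def inner_loop(theta_local, H, inner_lr):
    """H local SGD steps; the gradient of (x - 3)^2 is 2 * (x - 3)."""
    for _ in range(H):
        grad = 2 * (theta_local - 3.0)
        theta_local -= inner_lr * grad
    return theta_local

theta_global = 0.0
H, inner_lr, outer_lr = 10, 0.05, 1.0

for outer_step in range(20):
    # Each worker starts from the same global model and trains locally.
    locals_ = [inner_loop(theta_global, H, inner_lr) for _ in range(2)]
    # Pseudo-gradient: average of (global - local), standing in for the all-reduce.
    delta = sum(theta_global - t for t in locals_) / len(locals_)
    # Outer SGD step treats delta as if it were a gradient.
    theta_global -= outer_lr * delta

print(round(theta_global, 4))  # converges to the minimum at 3.0
```

Note the sign convention: because `delta = theta_global - theta_local` points *away* from where local training went, the outer optimizer subtracts it, moving the global model *toward* the averaged local progress.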
Q: Why two optimizers?
A: Inner optimizer (AdamW) handles local training. Outer optimizer (Nesterov) handles global synchronization. They serve different purposes!
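The outer optimizer is just SGD with Nesterov momentum applied to the pseudo-gradient instead of a true gradient. A minimal sketch of one outer step (the function and variable names are mine, not from any particular library; the lr/momentum values are ones commonly reported for DiLoCo):

```python
def nesterov_outer_step(theta, delta, momentum_buf, lr=0.7, mu=0.9):
    """One outer update treating delta (= theta_global - avg local theta)
    as the gradient. Uses the PyTorch-style Nesterov SGD update rule."""
    buf = mu * momentum_buf + delta          # momentum accumulation
    theta = theta - lr * (delta + mu * buf)  # Nesterov look-ahead step
    return theta, buf

theta, buf = 10.0, 0.0
theta, buf = nesterov_outer_step(theta, delta=4.0, momentum_buf=buf)
```

The momentum buffer persists across outer steps, so drift directions that recur sync after sync get amplified, which is part of why high H still converges well.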
Q: What's a good value for H?
A: Depends on your network! H=1 with SGD outer ≈ DDP. H=500 is common for internet training. Experiment!
Q: Does accuracy suffer with high H?
A: Slightly, but the speedup is worth it. The DiLoCo paper shows it works surprisingly well!
Algorithm 3: Async DiLoCo
🧠 The Brilliant Trick
What if we could do the all-reduce IN THE BACKGROUND while training continues?
Async DiLoCo: Apply the PREVIOUS iteration's result while computing the CURRENT one!
    # Async DiLoCo: Overlap communication with compute
    delta_prev = None  # Previous iteration's pseudo-gradient (may be in flight)
    for outer_step in range(max_outer_steps):
        # Check for new peers (topology changes can't overlap with collectives!)
        if comm.are_peers_pending():
            if delta_prev is not None:
                await_async_all_reduce(delta_prev)  # Finish the in-flight reduce first
            comm.update_topology()
            comm.sync_shared_state(shared_state)

        # INNER LOOP: H local steps, NO communication
        for i in range(H):
            batch = data_loader.next()
            loss = model(batch)
            loss.backward()
            inner_optimizer.step()
            inner_optimizer.zero_grad()

        # Wait for the PREVIOUS all-reduce and apply its result (one-step delay!)
        if delta_prev is not None:
            await_async_all_reduce(delta_prev)
            outer_optimizer.step(theta_global, delta_prev)

        # Compute the CURRENT delta and start its all-reduce in the background
        delta_current = theta_global - theta_local
        async_all_reduce(delta_current)
        delta_prev = delta_current  # Becomes "previous" for the next iteration

        theta_local = theta_global.clone()  # Reset local model to the new global
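The overlap itself can be mimicked in plain Python with a background thread standing in for the async all-reduce. Everything here, including `fake_all_reduce` and the hard-coded peer deltas, is a toy stand-in for a real collective library:

```python
import threading

def fake_all_reduce(deltas, out):
    """Stand-in for an async all-reduce: averages peer deltas into out."""
    out.append(sum(deltas) / len(deltas))

theta_global = 0.0
delta_prev, prev_thread = None, None

for outer_step in range(3):
    # ... H inner steps would run here; pretend each peer drifted by -1.0 ...
    delta_current = [1.0, 1.0]           # theta_global - theta_local, per peer

    # Wait for the PREVIOUS reduce and apply its (one-step delayed) result.
    if prev_thread is not None:
        prev_thread.join()
        theta_global -= delta_prev[0]    # plain-SGD outer step, lr = 1

    # Kick off the CURRENT reduce in the background and keep training.
    delta_prev = []
    prev_thread = threading.Thread(target=fake_all_reduce,
                                   args=(delta_current, delta_prev))
    prev_thread.start()

prev_thread.join()
theta_global -= delta_prev[0]            # drain the final in-flight reduce
```

Each outer step's result only lands one iteration later, which is exactly the one-step delay the pseudocode above describes: communication time is hidden behind the next H steps of compute.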
⚠️ Peer Churn Complicates Things!
When new peers join during async DiLoCo:
- Wait for in-flight all-reduce to complete
- Accept new peers
- Sync shared state (newcomer gets current global state)
- Extra sync so newcomer "eavesdrops" on previous result
It's complex, but PCCL handles it!
Comparison Table
| Algorithm | Communication | Latency Hiding | Complexity | Best For |
|---|---|---|---|---|
| DDP | Every step | None | Simple | LAN/HPC |
| DiLoCo | Every H steps | None | Medium | WAN |
| Async DiLoCo | Every H steps | Complete! | Complex | Slow WAN |
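The table's trade-offs can be put in numbers. Under the simplifying assumption that one sync costs `t_sync` seconds and one training step costs `t_step`, the per-step overhead is roughly `t_sync` for DDP, `t_sync / H` for DiLoCo, and near zero for Async DiLoCo whenever the reduce finishes within H steps of compute (this model and all numbers are illustrative):

```python
def per_step_overhead(t_sync, t_step, H, algo):
    """Approximate communication overhead per training step, in seconds."""
    if algo == "ddp":
        return t_sync                    # blocking sync every step
    if algo == "diloco":
        return t_sync / H                # blocking sync amortized over H steps
    if algo == "async_diloco":
        # fully hidden if the reduce fits inside H steps of compute
        return max(0.0, (t_sync - H * t_step) / H)
    raise ValueError(algo)

t_sync, t_step, H = 120.0, 0.5, 500      # illustrative numbers
for algo in ("ddp", "diloco", "async_diloco"):
    print(algo, per_step_overhead(t_sync, t_step, H, algo))
```

With a 120 s sync and 0.5 s steps, DDP is communication-bound, DiLoCo pays pennies per step, and Async DiLoCo pays nothing, which is the "Complete!" latency hiding in the table.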
✏️ Choose the Right Algorithm
Match the scenario to the best algorithm:
- 8 GPUs in same datacenter, NVLink connected
- 50 nodes across US and Europe, 100ms latency
- 100 spot instances, variable connectivity
1. DDP - Low latency, might as well sync every step
2. DiLoCo - Reduce communication, H=100-500
3. Async DiLoCo - Hide latency, handle churn gracefully
The name "DiLoCo" stands for Distributed Low-Communication. It was introduced by Douillard et al. in 2023 and is inspired by federated learning techniques!
"DDP = Daily, DiLoCo = Weekly, Async = While You Sleep"
Communication frequency analogy!
Chapter Summary
- DDP: Sync every step. Simple but slow over WAN.
- DiLoCo: Sync every H steps. Inner optimizer + outer optimizer.
- Async DiLoCo: Overlap communication with compute. One-step delay.
- Peer Churn: Async DiLoCo needs extra sync steps for newcomers.
- Choose wisely: Match algorithm to your network characteristics!