Chapter 6: The DiLoCo Family
From DDP to Async DiLoCo - A Communication Diet
📞 The Long-Distance Relationship Analogy
Imagine you're in a long-distance relationship:
- DDP: Video call 24/7. Great connection, but your internet bill is INSANE.
- DiLoCo: One video call per week. Catch up on everything, then live your life.
- Async DiLoCo: Send a video message while doing other stuff. Watch their reply later.
Same relationship, different communication patterns. Choose based on your "bandwidth budget"!
The Communication Spectrum
Algorithm 1: DDP (Distributed Data Parallel)
⚠️ NOT Recommended for Internet Training!
DDP synchronizes gradients EVERY step. Over the internet, this is painfully slow. Included here as a baseline only.
    # DDP: Communicate every step
    for batch in data_loader:
        # 1. Forward & backward
        loss = model(batch)
        loss.backward()

        # 2. All-reduce gradients (EVERY STEP!)
        all_reduce(model.gradients, op=AVG)

        # 3. Update weights
        optimizer.step()
        optimizer.zero_grad()
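To see why every-step syncing hurts over the internet, a back-of-envelope calculation helps. The model size, gradient precision, and link speed below are illustrative assumptions, not numbers from this chapter:

```python
# Rough per-step communication cost: DDP vs. DiLoCo (illustrative numbers).
param_count = 1_000_000_000             # 1B-parameter model (assumed)
bytes_per_grad = 2                      # fp16/bf16 gradients (assumed)
payload = param_count * bytes_per_grad  # bytes all-reduced per sync

wan_bandwidth = 100e6 / 8               # 100 Mbit/s WAN link, in bytes/s (assumed)
ddp_sync_time = payload / wan_bandwidth          # DDP pays this EVERY step
H = 500
diloco_sync_time = payload / wan_bandwidth / H   # DiLoCo amortizes it over H steps

print(f"DDP:    {ddp_sync_time:.0f} s of transfer per step")
print(f"DiLoCo: {diloco_sync_time:.2f} s amortized per step")
```

With these (made-up but plausible) numbers, DDP spends minutes per step just moving gradients, while DiLoCo's amortized cost is a fraction of a second: the savings factor is exactly H.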
Algorithm 2: DiLoCo (Distributed Low-Communication)
💡 The Key Insight
Instead of syncing gradients, sync parameter differences (pseudo-gradients) every H steps.
Each peer trains locally for H steps, then they compare notes!
    # DiLoCo: Communicate every H steps
    for outer_step in range(max_outer_steps):
        # Update topology, sync shared state
        comm.update_topology()
        comm.sync_shared_state(shared_state)

        # INNER LOOP: H local steps, NO communication
        for i in range(H):
            batch = data_loader.next()
            loss = model(batch)
            loss.backward()
            inner_optimizer.step()
            inner_optimizer.zero_grad()

        # OUTER STEP: Compute and sync pseudo-gradients
        delta = theta_global - theta_local          # Parameter difference (pseudo-gradient)
        all_reduce(delta, op=AVG)
        outer_optimizer.step(theta_global, delta)   # Nesterov momentum on the pseudo-gradient
        theta_local = theta_global.clone()          # Reset local model to the new global
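The loop above can be simulated end to end in plain Python. This toy sketch (the 1-D quadratic loss, worker count, and all hyperparameters are invented for illustration) uses plain SGD for both the inner steps and the outer step, with the pseudo-gradient `theta_global - theta_local`:

```python
# Toy DiLoCo on f(x) = (x - 3)^2 with two workers.
def inner_loop(theta_local, H, inner_lr):
    """H local SGD steps; the gradient of (x - 3)^2 is 2 * (x - 3)."""
    for _ in range(H):
        grad = 2 * (theta_local - 3.0)
        theta_local -= inner_lr * grad
    return theta_local

theta_global = 0.0
H, inner_lr, outer_lr = 10, 0.05, 1.0

for outer_step in range(20):
    # Each worker starts from the same global model and trains locally.
    locals_ = [inner_loop(theta_global, H, inner_lr) for _ in range(2)]
    # Pseudo-gradient: average of (global - local), standing in for the all-reduce.
    delta = sum(theta_global - t for t in locals_) / len(locals_)
    # Outer SGD step treats delta as if it were a gradient.
    theta_global -= outer_lr * delta

print(round(theta_global, 4))  # converges to the minimum at 3.0
```

Note the sign convention: because `delta = theta_global - theta_local` points *away* from where local training went, the outer optimizer subtracts it, moving the global model *toward* the averaged local progress.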
Q: Why two optimizers?
A: Inner optimizer (AdamW) handles local training. Outer optimizer (Nesterov) handles global synchronization. They serve different purposes!
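The outer optimizer is just SGD with Nesterov momentum applied to the pseudo-gradient instead of a true gradient. A minimal sketch of one outer step (the function and variable names are mine, not from any particular library; the lr/momentum values are ones commonly reported for DiLoCo):

```python
def nesterov_outer_step(theta, delta, momentum_buf, lr=0.7, mu=0.9):
    """One outer update treating delta (= theta_global - avg local theta)
    as the gradient. Uses the PyTorch-style Nesterov SGD update rule."""
    buf = mu * momentum_buf + delta          # momentum accumulation
    theta = theta - lr * (delta + mu * buf)  # Nesterov look-ahead step
    return theta, buf

theta, buf = 10.0, 0.0
theta, buf = nesterov_outer_step(theta, delta=4.0, momentum_buf=buf)
```

The momentum buffer persists across outer steps, so drift directions that recur sync after sync get amplified, which is part of why high H still converges well.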
Q: What's a good value for H?
A: Depends on your network! H=1 with SGD outer ≈ DDP. H=500 is common for internet training. Experiment!
Q: Does accuracy suffer with high H?
A: Slightly, but the speedup is worth it. The DiLoCo paper shows it works surprisingly well!
Algorithm 3: Async DiLoCo
🧠 The Brilliant Trick
What if we could do the all-reduce IN THE BACKGROUND while training continues?
Async DiLoCo: Apply the PREVIOUS iteration's result while computing the CURRENT one!
    # Async DiLoCo: Overlap communication with compute
    delta_prev = None  # Previous iteration's pseudo-gradient (may be in flight)
    for outer_step in range(max_outer_steps):
        # Check for new peers (topology changes can't overlap with collectives!)
        if comm.are_peers_pending():
            if delta_prev is not None:
                await_async_all_reduce(delta_prev)  # Finish the in-flight reduce first
            comm.update_topology()
            comm.sync_shared_state(shared_state)

        # INNER LOOP: H local steps, NO communication
        for i in range(H):
            batch = data_loader.next()
            loss = model(batch)
            loss.backward()
            inner_optimizer.step()
            inner_optimizer.zero_grad()

        # Wait for the PREVIOUS all-reduce and apply its result (one-step delay!)
        if delta_prev is not None:
            await_async_all_reduce(delta_prev)
            outer_optimizer.step(theta_global, delta_prev)

        # Compute the CURRENT delta and start its all-reduce in the background
        delta_current = theta_global - theta_local
        async_all_reduce(delta_current)
        delta_prev = delta_current  # Becomes "previous" for the next iteration

        theta_local = theta_global.clone()  # Reset local model to the new global
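The overlap itself can be mimicked in plain Python with a background thread standing in for the async all-reduce. Everything here, including `fake_all_reduce` and the hard-coded peer deltas, is a toy stand-in for a real collective library:

```python
import threading

def fake_all_reduce(deltas, out):
    """Stand-in for an async all-reduce: averages peer deltas into out."""
    out.append(sum(deltas) / len(deltas))

theta_global = 0.0
delta_prev, prev_thread = None, None

for outer_step in range(3):
    # ... H inner steps would run here; pretend each peer drifted by -1.0 ...
    delta_current = [1.0, 1.0]           # theta_global - theta_local, per peer

    # Wait for the PREVIOUS reduce and apply its (one-step delayed) result.
    if prev_thread is not None:
        prev_thread.join()
        theta_global -= delta_prev[0]    # plain-SGD outer step, lr = 1

    # Kick off the CURRENT reduce in the background and keep training.
    delta_prev = []
    prev_thread = threading.Thread(target=fake_all_reduce,
                                   args=(delta_current, delta_prev))
    prev_thread.start()

prev_thread.join()
theta_global -= delta_prev[0]            # drain the final in-flight reduce
```

Each outer step's result only lands one iteration later, which is exactly the one-step delay the pseudocode above describes: communication time is hidden behind the next H steps of compute.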
⚠️ Peer Churn Complicates Things!
When new peers join during async DiLoCo:
- Wait for in-flight all-reduce to complete
- Accept new peers
- Sync shared state (newcomer gets current global state)
- Extra sync so newcomer "eavesdrops" on previous result
It's complex, but PCCL handles it!
Comparison Table
| Algorithm | Communication | Latency Hiding | Complexity | Best For |
|---|---|---|---|---|
| DDP | Every step | None | Simple | LAN/HPC |
| DiLoCo | Every H steps | None | Medium | WAN |
| Async DiLoCo | Every H steps | Complete! | Complex | Slow WAN |
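The table's trade-offs can be put in numbers. Under the simplifying assumption that one sync costs `t_sync` seconds and one training step costs `t_step`, the per-step overhead is roughly `t_sync` for DDP, `t_sync / H` for DiLoCo, and near zero for Async DiLoCo whenever the reduce finishes within H steps of compute (this model and all numbers are illustrative):

```python
def per_step_overhead(t_sync, t_step, H, algo):
    """Approximate communication overhead per training step, in seconds."""
    if algo == "ddp":
        return t_sync                    # blocking sync every step
    if algo == "diloco":
        return t_sync / H                # blocking sync amortized over H steps
    if algo == "async_diloco":
        # fully hidden if the reduce fits inside H steps of compute
        return max(0.0, (t_sync - H * t_step) / H)
    raise ValueError(algo)

t_sync, t_step, H = 120.0, 0.5, 500      # illustrative numbers
for algo in ("ddp", "diloco", "async_diloco"):
    print(algo, per_step_overhead(t_sync, t_step, H, algo))
```

With a 120 s sync and 0.5 s steps, DDP is communication-bound, DiLoCo pays pennies per step, and Async DiLoCo pays nothing, which is the "Complete!" latency hiding in the table.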
✏️ Choose the Right Algorithm
Match the scenario to the best algorithm:
- 8 GPUs in same datacenter, NVLink connected
- 50 nodes across US and Europe, 100ms latency
- 100 spot instances, variable connectivity
1. DDP - Low latency, might as well sync every step
2. DiLoCo - Reduce communication, H=100-500
3. Async DiLoCo - Hide latency, handle churn gracefully
The name "DiLoCo" stands for Distributed Low-Communication. It was introduced by Douillard et al. in 2023 and is inspired by federated learning techniques!
"DDP = Daily, DiLoCo = Weekly, Async = While You Sleep"
Communication frequency analogy!
Chapter Summary
- DDP: Sync every step. Simple but slow over WAN.
- DiLoCo: Sync every H steps. Inner optimizer + outer optimizer.
- Async DiLoCo: Overlap communication with compute. One-step delay.
- Peer Churn: Async DiLoCo needs extra sync steps for newcomers.
- Choose wisely: Match algorithm to your network characteristics!