Chapter 6: The DiLoCo Family

From DDP to Async DiLoCo - A Communication Diet

📞 The Long-Distance Relationship Analogy

Imagine you're in a long-distance relationship. You could call every single day (DDP), catch up once a week (DiLoCo), or leave voice messages that your partner picks up whenever they're free (Async DiLoCo).

Same relationship, different communication patterns. Choose based on your "bandwidth budget"!

The Communication Spectrum

Communication Frequency:

HIGH ◄──────────────────────────────────────────────────► LOW

     DDP               DiLoCo (H=10)         DiLoCo (H=500)
      │                      │                      │
  Every step           Every 10 steps        Every 500 steps

  Best for:            Best for:             Best for:
  - LAN/HPC            - Fast WAN            - Slow WAN
  - Low latency        - Medium latency      - High latency

Algorithm 1: DDP (Distributed Data Parallel)

⚠️ NOT Recommended for Internet Training!

DDP synchronizes gradients EVERY step. Over the internet, this is painfully slow. Included here as a baseline only.

# DDP: Communicate every step
for batch in data_loader:
    # 1. Forward & backward
    loss = model(batch)
    loss.backward()
    
    # 2. All-reduce gradients (EVERY STEP!)
    # (Real DDP overlaps this with backward, but the traffic is the same.)
    all_reduce(model.gradients, op=AVG)
    
    # 3. Update weights
    optimizer.step()
    optimizer.zero_grad()
DDP: "I sync every step. My gradients are always fresh!"
Internet: "That's 100ms latency per step. Your training will take forever."
DDP: "But... accuracy..."
DiLoCo: "I sync every 500 steps and get similar accuracy. Math checks out."
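To make the "communication diet" concrete, here is a back-of-the-envelope sketch. The model size, step count, and byte widths below are illustrative assumptions, not figures from the DiLoCo paper:

```python
# Back-of-the-envelope communication volume (illustrative numbers).
PARAMS = 1_000_000_000   # assumed 1B-parameter model
BYTES_PER_PARAM = 4      # fp32 gradients / pseudo-gradients
STEPS = 10_000           # assumed total training steps
H = 500                  # DiLoCo sync interval

ddp_bytes = PARAMS * BYTES_PER_PARAM * STEPS            # sync every step
diloco_bytes = PARAMS * BYTES_PER_PARAM * (STEPS // H)  # sync every H steps

print(f"DDP:       {ddp_bytes / 1e12:.2f} TB")
print(f"DiLoCo:    {diloco_bytes / 1e12:.2f} TB")
print(f"Reduction: {ddp_bytes // diloco_bytes}x")  # exactly H
```

The reduction factor is simply H: each peer still computes every step, it just talks H times less often.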

Algorithm 2: DiLoCo (Distributed Low-Communication)

💡 The Key Insight

Instead of syncing gradients, sync parameter differences (pseudo-gradients) every H steps.

Each peer trains locally for H steps, then they compare notes!

DiLoCo Structure:

┌────────────────────────────────────────────────────────────────┐
│ INNER LOOP (H steps): No communication!                        │
│                                                                │
│   for i in range(H):                                           │
│       loss = forward_backward(θ_local, batch)                  │
│       inner_optimizer.step()   # e.g., AdamW                   │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ OUTER STEP: Synchronize!                                       │
│                                                                │
│   Δ = θ_global - θ_local              # "How far did I drift?" │
│   all_reduce(Δ, op=AVG)               # Average everyone's Δ   │
│   outer_optimizer.step(θ_global, Δ)   # e.g., Nesterov         │
│   θ_local = θ_global                  # Reset local to global  │
└────────────────────────────────────────────────────────────────┘
# DiLoCo: Communicate every H steps
for outer_step in range(max_outer_steps):
    # Update topology, sync state
    comm.update_topology()
    comm.sync_shared_state(shared_state)
    
    # INNER LOOP: H local steps, NO communication
    for i in range(H):
        batch = next(data_loader)
        loss = model(batch)
        loss.backward()
        inner_optimizer.step()
        inner_optimizer.zero_grad()
    
    # OUTER STEP: Compute and sync pseudo-gradients
    delta = theta_global - theta_local  # Pseudo-gradient: how far we drifted
    all_reduce(delta, op=AVG)
    
    outer_optimizer.step(theta_global, delta)  # e.g., SGD with Nesterov momentum
    theta_local = theta_global.clone()  # Reset local model to new global
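The `outer_optimizer.step(theta_global, delta)` call treats the averaged Δ as if it were a gradient. A minimal sketch of such an outer step with Nesterov-style momentum, in pure Python on flat parameter lists (the `lr` and `mu` defaults here are illustrative choices, not prescribed values):

```python
def outer_nesterov_step(theta_global, delta, momentum_buf, lr=0.7, mu=0.9):
    """Outer update: treat the averaged pseudo-gradient `delta` as a gradient.

    Nesterov-style: buf = mu * buf + delta, then step along (delta + mu * buf).
    """
    for i in range(len(theta_global)):
        momentum_buf[i] = mu * momentum_buf[i] + delta[i]
        theta_global[i] -= lr * (delta[i] + mu * momentum_buf[i])
    return theta_global

# Toy usage: one parameter; global at 1.0, averaged local drifted to 0.2
theta_global = [1.0]
delta = [theta_global[0] - 0.2]  # pseudo-gradient = 0.8
buf = [0.0]
outer_nesterov_step(theta_global, delta, buf)
# theta_global stepped in the direction the local models moved
```

Because Δ points from the local model back toward the old global model, subtracting lr * Δ moves the global model toward where the peers drifted, with momentum smoothing across outer steps.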

Q: Why two optimizers?

A: Inner optimizer (AdamW) handles local training. Outer optimizer (Nesterov) handles global synchronization. They serve different purposes!

Q: What's a good value for H?

A: Depends on your network! H=1 with SGD outer ≈ DDP. H=500 is common for internet training. Experiment!

Q: Does accuracy suffer with high H?

A: Slightly, but the speedup is worth it. The DiLoCo paper shows it works surprisingly well!
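A rough way to see why H matters: amortize the per-sync cost over the H local steps. The compute and communication timings below are illustrative assumptions, not measurements:

```python
# Amortized communication overhead per training step (illustrative numbers).
compute_per_step = 0.5  # seconds of local compute per step (assumed)
comm_cost = 2.0         # seconds per sync over a slow WAN (assumed)

for H in (1, 10, 500):
    overhead = comm_cost / H  # sync cost spread over H steps
    slowdown = (compute_per_step + overhead) / compute_per_step
    print(f"H={H:>3}: {overhead:.3f}s/step overhead, {slowdown:.3f}x slowdown")
```

With these numbers, H=1 means a 5x slowdown, while H=500 shaves it to under 1%. That is the whole argument for a large H on slow links.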

Algorithm 3: Async DiLoCo

🧠 The Brilliant Trick

What if we could do the all-reduce IN THE BACKGROUND while training continues?

Async DiLoCo: Apply the PREVIOUS iteration's result while computing the CURRENT one!

Async DiLoCo Timeline:

Time ─────────────────────────────────────────────────────────►

Outer Step 1: ├─ Inner training (H steps) ─┤
                                           ├─ All-reduce Δ₁ (background) ─┤
Outer Step 2:                              ├─ Inner training (H steps) ─┤
                                              (apply Δ₁ result here!)   ├─ All-reduce Δ₂ ─┤
Outer Step 3:                                                           ├─ Inner training ─┤
                                                                           (apply Δ₂ here!)

Communication is HIDDEN behind compute!
# Async DiLoCo: Overlap communication with compute
delta_prev = None  # Previous iteration's delta (its all-reduce may be in flight)
handle = None      # Handle for the background all-reduce

for outer_step in range(max_outer_steps):
    # Check for new peers (topology can't change while a collective is in flight!)
    if comm.are_peers_pending():
        if handle is not None:
            handle.wait()  # Finish the in-flight all-reduce first
        comm.update_topology()
        comm.sync_shared_state(shared_state)
    
    # INNER LOOP: Local training, no communication
    for i in range(H):
        batch = next(data_loader)
        loss = model(batch)
        loss.backward()
        inner_optimizer.step()
        inner_optimizer.zero_grad()
    
    # Wait for the PREVIOUS all-reduce to complete
    if handle is not None:
        handle.wait()
    
    # Compute the CURRENT delta...
    delta_current = theta_global - theta_local
    
    # ...and start its all-reduce in the background
    handle = async_all_reduce(delta_current)
    
    # Apply the PREVIOUS (now averaged) result: one outer step of delay!
    if delta_prev is not None:
        outer_optimizer.step(theta_global, delta_prev)
        theta_local = theta_global.clone()
    
    delta_prev = delta_current  # Rotate: current becomes previous
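The background all-reduce pattern above can be mimicked with a plain thread. This is a toy sketch: `fake_all_reduce` and `AsyncHandle` are stand-ins invented for illustration (a real system would use a collective library such as PCCL), but the start-now/wait-later structure is the same:

```python
import threading

def fake_all_reduce(deltas):
    """Toy stand-in for all_reduce(op=AVG): average peer deltas in place."""
    n = len(deltas)
    avg = [sum(vals) / n for vals in zip(*deltas)]
    for d in deltas:
        d[:] = avg

class AsyncHandle:
    """Run a 'collective' in a background thread; wait() joins it."""
    def __init__(self, fn, *args):
        self._t = threading.Thread(target=fn, args=args)
        self._t.start()
    def wait(self):
        self._t.join()

# Two peers' pseudo-gradients; the reduce runs while "training" continues
peer_deltas = [[0.8, 0.2], [0.4, 0.6]]
handle = AsyncHandle(fake_all_reduce, peer_deltas)
# ... inner-loop training would happen here, hiding the communication ...
handle.wait()
print(peer_deltas[0])  # both peers now hold the average
```

The key property to notice: between `AsyncHandle(...)` and `handle.wait()`, the main thread is free to do an entire inner loop of compute.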

⚠️ Peer Churn Complicates Things!

When new peers join during async DiLoCo:

  1. Wait for in-flight all-reduce to complete
  2. Accept new peers
  3. Sync shared state (newcomer gets current global state)
  4. Extra sync so newcomer "eavesdrops" on previous result

It's complex, but PCCL handles it!

Comparison Table

Algorithm     | Communication  | Latency Hiding | Complexity | Best For
--------------|----------------|----------------|------------|---------
DDP           | Every step     | None           | Simple     | LAN/HPC
DiLoCo        | Every H steps  | None           | Medium     | WAN
Async DiLoCo  | Every H steps  | Complete!      | Complex    | Slow WAN

✏️ Choose the Right Algorithm

Match the scenario to the best algorithm:

  1. 8 GPUs in same datacenter, NVLink connected
  2. 50 nodes across US and Europe, 100ms latency
  3. 100 spot instances, variable connectivity
Answers:
1. DDP - Low latency, might as well sync every step
2. DiLoCo - Reduce communication, H=100-500
3. Async DiLoCo - Hide latency, handle churn gracefully

The name "DiLoCo" stands for Distributed Low-Communication. It was introduced by Douillard et al. in 2023 and is inspired by federated learning techniques such as FedAvg's local-update scheme!

"DDP = Daily, DiLoCo = Weekly, Async = While You Sleep"

Communication frequency analogy!

Chapter Summary