Chapter 0: Prerequisites
Are you ready for this journey?
What You Should Already Know
This book assumes you're comfortable with:
- Basic programming (Python, or any language really)
- What a neural network is (layers, weights, training)
- What a gradient is (direction to update weights)
Don't know these? That's okay! We'll give you quick crash courses below.
🎓 60-Second Crash Courses
Crash Course #1: Neural Network Training
Training = Adjusting Knobs
Imagine a massive mixing board with millions of knobs (these are weights/parameters). Your goal: adjust knobs until the output sounds perfect.
Forward pass: Play the current settings, hear the output
Loss: How bad does it sound? (lower = better)
Backward pass: Figure out which knobs to turn and how much
Gradient: The "recipe" for knob adjustments
Optimizer step: Actually turn the knobs
```python
# The training loop in 6 lines
for batch in data:
    output = model(batch)        # Forward pass
    loss = compute_loss(output)  # How wrong are we?
    loss.backward()              # Compute gradients
    optimizer.step()             # Update weights
    optimizer.zero_grad()        # Reset for next batch
```
Crash Course #2: Why Distributed Training?
Q: Why can't I just train on one GPU?
A: Modern models have BILLIONS of parameters. GPT-4 reportedly has ~1.8 trillion. One GPU has maybe 80 GB of memory. Do the math: it doesn't fit!
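Here's that math as a quick back-of-the-envelope sketch. The 1.8-trillion figure is an unconfirmed report, and this counts only the weights themselves; real training also needs memory for gradients, optimizer state, and activations, so the true requirement is several times larger:

```python
# Rough memory estimate for storing model weights alone
params = 1.8e12          # ~1.8 trillion parameters (reported, unconfirmed)
bytes_per_param = 2      # 16-bit (half-precision) weights
weight_bytes = params * bytes_per_param

gpu_bytes = 80e9         # one 80 GB GPU
gpus_needed = weight_bytes / gpu_bytes
print(gpus_needed)       # ~45 GPUs just to hold the weights
```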
Q: So I use multiple GPUs. What's the problem?
A: Each GPU computes gradients on different data. You need to COMBINE these gradients so everyone updates the same way. That's collective communication.
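What "combine these gradients" means can be shown with plain Python lists standing in for per-GPU gradient tensors. The averaging below is exactly what an all-reduce computes for you over the network, so every GPU ends up applying the same update:

```python
# Each "GPU" computed different gradients on its own slice of data
grads_gpu0 = [2.0, -4.0, 1.0]
grads_gpu1 = [4.0, -2.0, 3.0]

# All-reduce (average flavor): everyone gets the same combined gradient
avg_grad = [(a + b) / 2 for a, b in zip(grads_gpu0, grads_gpu1)]
print(avg_grad)  # [3.0, -3.0, 2.0]
```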
Crash Course #3: TCP/IP Basics
TCP = Reliable Mail Service
IP Address: Your house address (where to send stuff)
Port: Which door to knock on (different services use different ports)
TCP: Guaranteed delivery - if packet lost, resend it
Socket: The "phone line" between two computers
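All four pieces appear in a few lines of Python's standard `socket` library. The sketch below opens a listener and a client on localhost (binding to port 0 asks the OS to pick any free "door") and echoes one message back over the reliable TCP "phone line":

```python
import socket
import threading

def serve(listener):
    conn, _addr = listener.accept()    # wait for a client to knock
    with conn:
        conn.sendall(conn.recv(1024))  # echo back whatever arrives

# The listening side: an IP address plus a port
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))        # port 0 = let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve, args=(listener,)).start()

# The connecting side: TCP guarantees these bytes arrive intact and in order
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)
print(reply)  # b'hello'
```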
✅ Self-Assessment Quiz
Are you ready? Answer these:
1. What does "gradient" mean in ML?
2. Why do we need multiple GPUs for large models?
3. What does TCP guarantee?
Got 2+ right? You're ready! Got fewer? Re-read the crash courses above.
🔑 Key Concepts You'll See Everywhere
- World Size: Total number of peers/GPUs in training
- Rank: Each peer's unique ID (0 to world_size-1)
- All-Reduce: Combine data from all peers, give result to all
- Tensor: Multi-dimensional array (a matrix generalized to any number of dimensions)
- Epoch: One complete pass through all training data
- Batch: Subset of data processed at once
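Most of these terms are just counting. A toy sketch (the numbers are made up for illustration) that ties rank, world size, epoch, and batch together:

```python
world_size = 4                       # total GPUs in the job
ranks = list(range(world_size))      # each peer's unique ID: 0..3

dataset_size = 1000
batch_size = 50
batches_per_epoch = dataset_size // batch_size  # one epoch = 20 batches

# A common pattern: rank r handles every world_size-th batch
batches_for_rank0 = [b for b in range(batches_per_epoch) if b % world_size == 0]
print(ranks, batches_per_epoch, batches_for_rank0)
```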
🧮 Math You'll Need
Don't worry - it's just basic arithmetic!
The Only Formula That Matters
new_weights = old_weights - learning_rate × gradient
That's it. Everything else is just making this happen across multiple computers efficiently.
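The formula runs as-is in plain Python, applied element by element (the numbers here are arbitrary, chosen only to illustrate the update):

```python
learning_rate = 0.5
old_weights = [1.0, -2.0, 0.5]
gradient = [0.5, -1.0, 2.0]

# new_weights = old_weights - learning_rate * gradient, element by element
new_weights = [w - learning_rate * g for w, g in zip(old_weights, gradient)]
print(new_weights)  # [0.75, -1.5, -0.5]
```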
🛠️ Tools to Have Ready
| Tool | Why | Install |
|---|---|---|
| Python 3.8+ | Code examples | `brew install python` (macOS) or python.org |
| PyTorch | ML framework | `pip install torch` |
| Rust (optional) | Build your own PCCL | `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \| sh` |
Fun Fact: The name "tensor" comes from Latin "tendere" (to stretch). Tensors were originally used in physics to describe stress and strain in materials. Now they're the backbone of AI!
"FROG" = Forward, Reduce, Optimize, Gradient-zero
The 4 steps of distributed training: Forward pass → Reduce gradients → Optimizer step → Gradient zero
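The four FROG steps can be simulated for two peers in plain Python, with no real network. This is only a toy: the "model" is a single number per rank, the gradient formula is invented for illustration, and the reduce step is a local sum-and-divide standing in for a real all-reduce:

```python
# Two ranks start with identical weights but see different data
weights = {0: 1.0, 1: 1.0}
data = {0: 3.0, 1: 5.0}
lr = 0.1

# Forward: pretend each rank's gradient is (weight - its data point)
grads = {r: weights[r] - data[r] for r in (0, 1)}
# Reduce: average the gradients so every rank agrees
avg_grad = sum(grads.values()) / len(grads)
# Optimize: the identical update keeps all ranks' weights in sync
for r in (0, 1):
    weights[r] -= lr * avg_grad
# Gradient-zero: clear gradients for the next batch
grads = {r: 0.0 for r in (0, 1)}

print(weights)  # both ranks end with the same weights
```

The key property to notice: because every rank applies the same averaged gradient, the replicas stay bit-identical without ever exchanging the weights themselves.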