Chapter 0: Prerequisites
Are you ready for this journey?
What You Should Already Know
This book assumes you're comfortable with:
- Basic programming (Python, or any language really)
- What a neural network is (layers, weights, training)
- What a gradient is (direction to update weights)
Don't know these? That's okay! We'll give you quick crash courses below.
🎓 60-Second Crash Courses
Crash Course #1: Neural Network Training
Training = Adjusting Knobs
Imagine a massive mixing board with millions of knobs (these are weights/parameters). Your goal: adjust knobs until the output sounds perfect.
Forward pass: Play the current settings, hear the output
Loss: How bad does it sound? (lower = better)
Backward pass: Figure out which knobs to turn and how much
Gradient: The "recipe" for knob adjustments
Optimizer step: Actually turn the knobs
```python
# The training loop in 6 lines
for batch in data:
    output = model(batch)        # Forward pass
    loss = compute_loss(output)  # How wrong are we?
    loss.backward()              # Compute gradients
    optimizer.step()             # Update weights
    optimizer.zero_grad()        # Reset for next batch
```
Crash Course #2: Why Distributed Training?
Q: Why can't I just train on one GPU?
A: Modern models have BILLIONS of parameters. GPT-4 reportedly has ~1.8 trillion. One GPU has maybe 80 GB of memory. Do the math: it doesn't fit!
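Here's that math as a quick back-of-the-envelope sketch. The 1.8-trillion figure is an unconfirmed report, and this counts only the weights themselves; real training also needs memory for gradients, optimizer state, and activations, so the true requirement is several times larger:

```python
# Rough memory estimate for storing model weights alone
params = 1.8e12          # ~1.8 trillion parameters (reported, unconfirmed)
bytes_per_param = 2      # 16-bit (half-precision) weights
weight_bytes = params * bytes_per_param

gpu_bytes = 80e9         # one 80 GB GPU
gpus_needed = weight_bytes / gpu_bytes
print(gpus_needed)       # ~45 GPUs just to hold the weights
```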
Q: So I use multiple GPUs. What's the problem?
A: Each GPU computes gradients on different data. You need to COMBINE these gradients so everyone updates the same way. That's collective communication.
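What "combine these gradients" means can be shown with plain Python lists standing in for per-GPU gradient tensors. The averaging below is exactly what an all-reduce computes for you over the network, so every GPU ends up applying the same update:

```python
# Each "GPU" computed different gradients on its own slice of data
grads_gpu0 = [2.0, -4.0, 1.0]
grads_gpu1 = [4.0, -2.0, 3.0]

# All-reduce (average flavor): everyone gets the same combined gradient
avg_grad = [(a + b) / 2 for a, b in zip(grads_gpu0, grads_gpu1)]
print(avg_grad)  # [3.0, -3.0, 2.0]
```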
Crash Course #3: TCP/IP Basics
TCP = Reliable Mail Service
IP Address: Your house address (where to send stuff)
Port: Which door to knock on (different services use different ports)
TCP: Guaranteed delivery - if packet lost, resend it
Socket: The "phone line" between two computers
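All four pieces appear in a few lines of Python's standard `socket` library. The sketch below opens a listener and a client on localhost (binding to port 0 asks the OS to pick any free "door") and echoes one message back over the reliable TCP "phone line":

```python
import socket
import threading

def serve(listener):
    conn, _addr = listener.accept()    # wait for a client to knock
    with conn:
        conn.sendall(conn.recv(1024))  # echo back whatever arrives

# The listening side: an IP address plus a port
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))        # port 0 = let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve, args=(listener,)).start()

# The connecting side: TCP guarantees these bytes arrive intact and in order
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)
print(reply)  # b'hello'
```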
✅ Self-Assessment Quiz
Are you ready? Answer these:
1. What does "gradient" mean in ML?
2. Why do we need multiple GPUs for large models?
3. What does TCP guarantee?
Got 2+ right? You're ready! Got fewer? Re-read the crash courses above.
🔑 Key Concepts You'll See Everywhere
- World Size: Total number of peers/GPUs in training
- Rank: Each peer's unique ID (0 to world_size-1)
- All-Reduce: Combine data from all peers, give result to all
- Tensor: Multi-dimensional array (a matrix generalized to any number of dimensions)
- Epoch: One complete pass through all training data
- Batch: Subset of data processed at once
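Most of these terms are just counting. A toy sketch (the numbers are made up for illustration) that ties rank, world size, epoch, and batch together:

```python
world_size = 4                       # total GPUs in the job
ranks = list(range(world_size))      # each peer's unique ID: 0..3

dataset_size = 1000
batch_size = 50
batches_per_epoch = dataset_size // batch_size  # one epoch = 20 batches

# A common pattern: rank r handles every world_size-th batch
batches_for_rank0 = [b for b in range(batches_per_epoch) if b % world_size == 0]
print(ranks, batches_per_epoch, batches_for_rank0)
```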
🧮 Math You'll Need
Don't worry - it's just basic arithmetic!
The Only Formula That Matters
new_weights = old_weights - learning_rate × gradient
That's it. Everything else is just making this happen across multiple computers efficiently.
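The formula runs as-is in plain Python, applied element by element (the numbers here are arbitrary, chosen only to illustrate the update):

```python
learning_rate = 0.5
old_weights = [1.0, -2.0, 0.5]
gradient = [0.5, -1.0, 2.0]

# new_weights = old_weights - learning_rate * gradient, element by element
new_weights = [w - learning_rate * g for w, g in zip(old_weights, gradient)]
print(new_weights)  # [0.75, -1.5, -0.5]
```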
🛠️ Tools to Have Ready
| Tool | Why | Install |
|---|---|---|
| Python 3.8+ | Code examples | `brew install python` (macOS) or python.org |
| PyTorch | ML framework | `pip install torch` |
| Rust (optional) | Build your own PCCL | `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \| sh` |
Fun Fact: The name "tensor" comes from Latin "tendere" (to stretch). Tensors were originally used in physics to describe stress and strain in materials. Now they're the backbone of AI!
"FROG" = Forward, Reduce, Optimize, Gradient-zero
The 4 steps of distributed training: Forward pass → Reduce gradients → Optimizer step → Gradient zero
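The four FROG steps can be simulated for two peers in plain Python, with no real network. This is only a toy: the "model" is a single number per rank, the gradient formula is invented for illustration, and the reduce step is a local sum-and-divide standing in for a real all-reduce:

```python
# Two ranks start with identical weights but see different data
weights = {0: 1.0, 1: 1.0}
data = {0: 3.0, 1: 5.0}
lr = 0.1

# Forward: pretend each rank's gradient is (weight - its data point)
grads = {r: weights[r] - data[r] for r in (0, 1)}
# Reduce: average the gradients so every rank agrees
avg_grad = sum(grads.values()) / len(grads)
# Optimize: the identical update keeps all ranks' weights in sync
for r in (0, 1):
    weights[r] -= lr * avg_grad
# Gradient-zero: clear gradients for the next batch
grads = {r: 0.0 for r in (0, 1)}

print(weights)  # both ranks end with the same weights
```

The key property to notice: because every rank applies the same averaged gradient, the replicas stay bit-identical without ever exchanging the weights themselves.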