Phase V — Distributed Training

Weeks 21–25 · ~25 hrs

Goal: Train models across multiple GPUs. By the end, you will have implemented DDP from first principles; you will understand ZeRO's three stages of optimizer-, gradient-, and parameter-sharding, know when to use tensor parallelism vs. pipeline parallelism, and be able to design a parallelism strategy for a given model and hardware configuration.

Week 21 primary: PyTorch DDP tutorial + implement allreduce from scratch

Week 22 primary: Rajbhandari et al., ZeRO paper §1–3 + DeepSpeed ZeRO stage 1/2/3

Week 23 primary: Shoeybi et al., Megatron-LM paper §3 (tensor parallelism)

Weeks 24–25 primary: Megatron-LM GitHub pipeline parallelism docs + practical multi-GPU recipe


Week 21 — DDP: Data Parallelism from First Principles

Concepts to understand:

Coding tasks:

Milestone

DDP overhead at 2 GPUs connected via PCIe: communication adds approximately 20–40% to step time for a small model, because the gradient volume is large relative to the per-step compute time. At 2 GPUs connected via NVLink, communication overhead drops to 5–10%. The key metric is communication efficiency = compute_time / (compute_time + communication_time); high efficiency means communication is either small relative to compute or effectively hidden by overlapping it with the backward pass. If your 2-GPU run is slower than 1-GPU, the model is too small: at small scales, communication overhead exceeds the parallelism benefit, and DDP pays off only when compute time >> communication time.
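
As a companion to this milestone (and to the Week 21 allreduce-from-scratch task), here is a minimal ring all-reduce sketch in PyTorch. It assumes a process group is already initialized (e.g. via torchrun) and that the tensor's length is divisible by the world size; a real collective adds padding, bucketing, and compute/communication overlap.

```python
# Minimal ring all-reduce using torch.distributed point-to-point ops.
# Assumes dist.init_process_group() has already run (e.g. under torchrun)
# and that tensor.numel() is divisible by the world size.
import torch
import torch.distributed as dist

def ring_allreduce(tensor: torch.Tensor) -> None:
    """In-place sum all-reduce: a reduce-scatter pass, then an all-gather pass."""
    world = dist.get_world_size()
    rank = dist.get_rank()
    chunks = list(tensor.chunk(world))   # views into `tensor`
    right = (rank + 1) % world
    left = (rank - 1) % world

    # Phase 1: reduce-scatter. After world-1 steps, each rank holds the
    # fully reduced version of exactly one chunk.
    for step in range(world - 1):
        send_idx = (rank - step) % world
        recv_idx = (rank - step - 1) % world
        send_buf = chunks[send_idx].contiguous()
        recv_buf = torch.empty_like(chunks[recv_idx])
        req = dist.isend(send_buf, dst=right)
        dist.recv(recv_buf, src=left)
        req.wait()
        chunks[recv_idx] += recv_buf     # accumulate partial sums

    # Phase 2: all-gather. Circulate the reduced chunks around the ring so
    # every rank ends up with the complete summed tensor.
    for step in range(world - 1):
        send_idx = (rank - step + 1) % world
        recv_idx = (rank - step) % world
        send_buf = chunks[send_idx].contiguous()
        recv_buf = torch.empty_like(chunks[recv_idx])
        req = dist.isend(send_buf, dst=right)
        dist.recv(recv_buf, src=left)
        req.wait()
        chunks[recv_idx].copy_(recv_buf)
```

Launched with torchrun --nproc_per_node=2, the result should match dist.all_reduce exactly, which makes a convenient correctness test before you start timing anything.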


Week 22 — ZeRO: Sharding Optimizer States, Gradients, and Parameters

Primary resource: Rajbhandari et al., ZeRO paper — §1–4; focus on Table 1 (memory reduction per stage)

Concepts to understand:

Coding tasks:

Milestone

For a 100M-parameter model at 2 GPUs, assuming mixed-precision Adam (~16 bytes of model state per parameter: 2 for fp16 params, 2 for fp16 grads, 12 for fp32 optimizer state): DDP uses ~1.6GB per GPU for model state; ZeRO-2 uses ~0.9GB (nearly half, because gradients and optimizer states are sharded); FSDP uses ~0.8GB (parameters are sharded as well). The throughput penalty for FSDP vs. DDP is typically 5–15%, due to all-gather communication during the forward pass. FSDP pays off when model state memory is the binding constraint: for a 10B-parameter model, DDP requires ~160GB per GPU (infeasible), while FSDP across 8 GPUs requires ~20GB per GPU (feasible on A100s).
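
The arithmetic behind these numbers follows the ZeRO paper's mixed-precision Adam accounting. A back-of-envelope sketch (model_state_gb is a made-up helper name; it counts only model state, not activations or temporary buffers):

```python
# Per-GPU model-state memory for each ZeRO stage, using 2 bytes fp16 params
# + 2 bytes fp16 grads + 12 bytes fp32 optimizer state per parameter.
def model_state_gb(n_params: float, world_size: int, stage: int) -> float:
    param, grad, optim = 2.0, 2.0, 12.0   # bytes per parameter
    if stage >= 1:
        optim /= world_size               # ZeRO-1: shard optimizer state
    if stage >= 2:
        grad /= world_size                # ZeRO-2: also shard gradients
    if stage >= 3:
        param /= world_size               # ZeRO-3 / FSDP: shard params too
    return n_params * (param + grad + optim) / 1e9

for stage, label in [(0, "DDP"), (2, "ZeRO-2"), (3, "FSDP/ZeRO-3")]:
    print(f"{label:12s} 100M params, 2 GPUs: {model_state_gb(100e6, 2, stage):.1f} GB")
print(f"{'DDP':12s} 10B params, any N:   {model_state_gb(10e9, 1, 0):.0f} GB")
print(f"{'FSDP/ZeRO-3':12s} 10B params, 8 GPUs:  {model_state_gb(10e9, 8, 3):.0f} GB")
```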


Week 23 — Tensor Parallelism

Primary resource: Shoeybi et al., Megatron-LM paper — §3 (tensor model parallelism)

Concepts to understand:

Coding tasks:

Milestone

Tensor parallelism N=2 on a 12-layer, d_model=512 model: each GPU holds half of each weight matrix, so model state memory per GPU is reduced by approximately 2×. Communication overhead for one forward pass: two all-reduce operations per layer (one after the attention block, one after the MLP) = 24 communications total; each moves batch × seq_len × d_model × 2 bytes of fp16 activations. For batch=8, seq_len=512, d_model=512, each communication is ~4MB, so 24 communications move ~96MB per forward pass. At PCIe bandwidth (~32 GB/s) this takes ~3ms; if the forward pass itself takes 50ms, the overhead is 6%, which is acceptable. As d_model grows, communication volume scales linearly with d_model while the matmul compute scales quadratically, so tensor parallelism becomes more efficient at larger scales.
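
The same arithmetic as a small calculator (a sketch; tp_comm_cost is a made-up name, and the estimate ignores all-reduce algorithm factors and link latency):

```python
# Tensor-parallel communication estimate: two all-reduces per transformer
# layer, each moving an fp16 activation tensor of shape (batch, seq, d_model).
def tp_comm_cost(layers=12, batch=8, seq_len=512, d_model=512,
                 bytes_per_elem=2, bw_gb_s=32.0):
    per_op = batch * seq_len * d_model * bytes_per_elem   # bytes per all-reduce
    n_ops = 2 * layers                                    # 2 all-reduces / layer
    total = per_op * n_ops
    time_ms = total / (bw_gb_s * 1e9) * 1e3               # transfer time at link bw
    return per_op / 2**20, total / 2**20, time_ms

per_mb, total_mb, ms = tp_comm_cost()
print(f"per all-reduce: {per_mb:.0f} MiB, total: {total_mb:.0f} MiB, ~{ms:.0f} ms at PCIe")
```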


Week 24 — Pipeline Parallelism

Concepts to understand:

Coding tasks:

Milestone

At m=1 (no micro-batching), one GPU is active at a time, so utilization is 50% for a 2-stage pipeline. In general, with p stages and m micro-batches, utilization is m/(m+p−1): at m=4 it is 4/(4+1) = 80%, and at m=8 it is 8/9 ≈ 89%. The 20% loss at m=4 is the pipeline bubble, the fill-and-drain period when some stages sit idle, and it translates directly into a 20% throughput loss relative to ideal bubble-free scaling at the same total batch size. This is why pipeline parallelism is used only when necessary (when the layers cannot fit on a single GPU), and why the 1F1B schedule is preferred over GPipe in practice: it achieves similar bubble efficiency with much lower activation memory.
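
The utilization numbers fall straight out of the bubble formula; a tiny script to explore it (pipeline_utilization is a made-up name):

```python
# GPipe-style bubble model: with p stages and m micro-batches, m*p of the
# (m+p-1)*p stage-timeslots do useful work, so utilization = m / (m + p - 1).
def pipeline_utilization(m: int, p: int) -> float:
    return m / (m + p - 1)

for m in (1, 4, 8, 32):
    print(f"p=2, m={m:2d}: utilization = {pipeline_utilization(m, 2):.0%}")
```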


Week 25 — Combining Parallelism Strategies

Concepts to understand:

Coding tasks:

Milestone

Ideal scaling efficiency is 100%; realistic efficiency at 4 GPUs over PCIe is typically 70–85% due to communication overhead. If efficiency falls below 60%, check: (1) are communication and computation overlapped (DDP's gradient bucketing should handle this automatically)? (2) is the model large enough that communication is small relative to compute? A model that takes 10ms to compute one step and 8ms to communicate gradients achieves only ~56% efficiency; at that point, either the model needs more compute per step or the communication volume needs to shrink (e.g. gradient compression, which introduces bias but reduces bandwidth).
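
A sanity-check helper for this diagnosis (a sketch; scaling_efficiency is a made-up name, and overlap_fraction is a simplification of how overlap actually behaves):

```python
# Scaling-efficiency estimate: with no overlap, efficiency is the share of
# step time spent computing. The 10 ms compute / 8 ms comm example lands at
# ~56%, below the 60% threshold that should trigger debugging.
def scaling_efficiency(compute_ms: float, comm_ms: float,
                       overlap_fraction: float = 0.0) -> float:
    exposed = comm_ms * (1.0 - overlap_fraction)   # comm not hidden by compute
    return compute_ms / (compute_ms + exposed)

print(f"no overlap:   {scaling_efficiency(10, 8):.0%}")       # ~56%
print(f"half overlap: {scaling_efficiency(10, 8, 0.5):.0%}")  # ~71%
```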