Goal: Train language models efficiently on a single GPU. By the end, you will understand every knob that affects training speed and stability, know how to profile a training run and identify its bottleneck, and be able to implement the Muon optimizer — the most effective optimizer used in top Parameter Golf submissions.
Expected speedup from BF16 AMP on an A100: 1.5–2.5× reduction in step time. Memory reduction: roughly 40%, because activations are stored in BF16 rather than FP32 (the weights themselves remain FP32). If you see less than 1.2× speedup, check that the entire forward pass, attention included, runs inside the autocast context; if the matmuls are not being cast, autocast is effectively a no-op. Verify by inserting print(q.dtype) inside the attention forward and confirming that the Q/K/V tensors are BF16.
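A minimal sketch of the wrapped step, assuming a nanoGPT-style model(x, y) that returns logits and loss (model, optimizer, x, y are stand-ins for your setup):

```python
import torch

# Forward pass under BF16 autocast; the context must cover every matmul,
# including those inside attention, or the cast never happens.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits, loss = model(x, y)

# backward() and the optimizer step run outside the context; gradients
# and weight updates stay in FP32. BF16 needs no GradScaler (unlike FP16).
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```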
Week 6 — Memory Management: Gradient Checkpointing and Profiling
Concepts to understand:
Coding tasks:
Milestone
In a standard nanoGPT training step at d_model=768, n_layers=12, seq_len=1024, batch_size=8, the two MLP linear layers (d_model × 4d_model and 4d_model × d_model) should dominate GPU time, accounting for roughly 60–70% of the compute. Attention is typically another 20–30%. The embedding lookup is a memory-bound gather and takes negligible time. At long sequence lengths (>2048), attention's quadratic cost starts to dominate instead; that is the motivation for FlashAttention in Phase IV.
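One way to verify this breakdown on your own run is torch.profiler; a short sketch (model, x, y, optimizer are stand-ins for your setup):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a few full steps so backward and optimizer work are included.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        logits, loss = model(x, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

# Rank ops by GPU time; aten::mm / aten::addmm from the MLP should lead.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```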
Week 7 — Data Pipeline Engineering
Concepts to understand:
Coding tasks:
Milestone
A well-configured data pipeline on a modern GPU should spend near-zero time on data loading relative to GPU compute: the prefetched batch should always be ready before the GPU finishes the previous step. The test: log data_time_ms and gpu_time_ms separately in your training loop. If data_time_ms > 0.1 × gpu_time_ms, the pipeline is costing you more than ~10% of throughput and is worth fixing. The usual fixes: increase num_workers, set pin_memory=True, or switch to pre-tokenized memory-mapped files.
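A sketch of the two-timer instrumentation (loader, model, optimizer are stand-ins); torch.cuda.synchronize() is required because CUDA kernels launch asynchronously, so the Python timer would otherwise stop before the GPU work finishes:

```python
import time
import torch

t_data = time.perf_counter()
for x, y in loader:
    data_time_ms = (time.perf_counter() - t_data) * 1000

    t_gpu = time.perf_counter()
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()  # wait for async kernels before stopping the timer
    gpu_time_ms = (time.perf_counter() - t_gpu) * 1000

    print(f"data {data_time_ms:.1f} ms  gpu {gpu_time_ms:.1f} ms")
    t_data = time.perf_counter()
```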
Week 8 — Debugging at Scale
Concepts to understand:
Coding tasks:
Milestone
The off-by-one target bug (Y = X instead of the shifted Y = X[:, 1:]) produces a model that predicts the current token from itself, a trivially easy identity task. Loss drops to near zero almost immediately on both training and validation data. That pattern is the tell: natural language has irreducible entropy, so a genuine language model cannot reach near-zero validation loss. Near-zero loss on both splits signals that the model is learning an identity mapping (or some other form of target leakage), not language statistics.
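For reference, a sketch of correct target construction in the nanoGPT style, assuming data is a 1-D array of token ids such as an np.memmap (function and parameter names are illustrative):

```python
import numpy as np
import torch

def get_batch(data, block_size=1024, batch_size=8):
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i : i + block_size].astype(np.int64))
                     for i in ix])
    # Targets are the same window shifted right by one token, NOT a copy of x.
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64))
                     for i in ix])
    return x, y
```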
Week 9 — Modern Optimizers: AdamW and Muon
Primary resources:
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (the AdamW paper; ~30 min read, focus on §3)
- KellerJordan, modded-nanoGPT — read train_gpt2.py end to end; the Muon optimizer implementation is ~50 lines
Concepts to understand:
Coding tasks:
Milestone
Expected results: Muon should reach the same final validation loss as AdamW in roughly 20–30% fewer steps on small language modeling tasks. The reason: Muon's orthogonalized updates are better conditioned than Adam's, especially early in training, when Adam's second-moment estimates are still noisy. If you see no improvement, check that you are applying Muon only to the 2-D hidden weight matrices and not to the embeddings: in any given step only the rows for tokens present in the batch receive gradient, so Muon would orthogonalize a mostly-zero matrix. See the sketch below.
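A simplified sketch of the core update, assuming only 2-D weight matrices are passed in. The quintic Newton-Schulz coefficients follow the modded-nanoGPT implementation; the real version adds Nesterov momentum and shape-dependent update scaling, and the class skeleton here is illustrative:

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G: push its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # Frobenius norm >= spectral norm,
                                        # so this bounds the spectral norm by 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

class Muon(torch.optim.Optimizer):
    """Momentum plus orthogonalized update; for 2-D hidden weights only."""
    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault(
                    "momentum_buffer", torch.zeros_like(p.grad))
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz(buf), alpha=-group["lr"])
```

In practice you run two optimizers side by side: Muon for the hidden matmul weights, AdamW for embeddings, the output head, and all 1-D parameters (biases, norm gains).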
Week 10 — Learning Rate Schedules and Stability
Concepts to understand:
Coding tasks:
Milestone
Expected ranking of schedules by final val loss (best to worst): (d) > (c) > (b) > (a). The warmdown improvement is often 0.02–0.05 in val loss for a small LM trained for 5000 steps, surprisingly large for a change that touches only the final 10% of training. The intuition: cosine decay is commonly configured to floor at around 10% of the peak LR, which leaves the model still taking moderately large steps at the end of training and prevents full convergence. Warmdown drives the LR to near zero, forcing convergence to a lower-loss solution within the current loss basin.
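A sketch of the warmup / stable / warmdown shape as a plain function of the step index (all hyperparameter values are illustrative):

```python
def get_lr(step, max_lr=6e-4, warmup_steps=250, total_steps=5000, warmdown_frac=0.1):
    """Linear warmup, constant plateau, then linear warmdown to zero."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps           # linear warmup
    if step < warmdown_start:
        return max_lr                                       # stable plateau
    # final fraction of training: drive the LR linearly to zero
    return max_lr * (total_steps - step) / (total_steps - warmdown_start)

# In the training loop, apply it per step:
#     for g in optimizer.param_groups:
#         g["lr"] = get_lr(step)
```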