Goal: Train language models efficiently on a single GPU. By the end, you will understand every knob that affects training speed and stability, know how to profile a training run and identify its bottleneck, and be able to implement the Muon optimizer — the most effective optimizer used in top Parameter Golf submissions.
Expected speedup from BF16 AMP on an A100: 1.5–2.5× reduction in step time. Memory reduction: roughly 40%, because activations are stored in BF16 rather than FP32 (the weights themselves remain FP32). If you see less than 1.2× speedup, check that the entire forward pass, attention included, runs inside the autocast context; if the matmuls are not being cast, autocast is effectively a no-op. Verify by inserting print(q.dtype) inside the attention forward and confirming that the Q/K/V tensors are BF16.
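A minimal sketch of the wrapped step, assuming a nanoGPT-style model(x, y) that returns logits and loss (model, optimizer, x, y are stand-ins for your setup):

```python
import torch

# Forward pass under BF16 autocast; the context must cover every matmul,
# including those inside attention, or the cast never happens.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits, loss = model(x, y)

# backward() and the optimizer step run outside the context; gradients
# and weight updates stay in FP32. BF16 needs no GradScaler (unlike FP16).
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```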
Week 6 — Memory Management: Gradient Checkpointing and Profiling
Concepts to understand:
Coding tasks:
Milestone
In a standard nanoGPT training step at d_model=768, n_layers=12, seq_len=1024, batch_size=8, the two MLP linear layers (d_model × 4d_model and 4d_model × d_model) should dominate GPU time, accounting for roughly 60–70% of the compute. Attention is typically another 20–30%. The embedding lookup is a memory-bound gather and takes negligible time. At long sequence lengths (>2048), attention's quadratic cost starts to dominate instead; that is the motivation for FlashAttention in Phase IV.
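One way to verify this breakdown on your own run is torch.profiler; a short sketch (model, x, y, optimizer are stand-ins for your setup):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a few full steps so backward and optimizer work are included.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        logits, loss = model(x, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

# Rank ops by GPU time; aten::mm / aten::addmm from the MLP should lead.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```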
Week 7 — Data Pipeline Engineering
Concepts to understand:
Coding tasks:
Milestone
A well-configured data pipeline on a modern GPU should spend near-zero time on data loading relative to GPU compute: the prefetched batch should always be ready before the GPU finishes the previous step. The test: log data_time_ms and gpu_time_ms separately in your training loop. If data_time_ms > 0.1 × gpu_time_ms, the pipeline is costing you more than ~10% of throughput and is worth fixing. The usual fixes: increase num_workers, set pin_memory=True, or switch to pre-tokenized memory-mapped files.
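A sketch of the two-timer instrumentation (loader, model, optimizer are stand-ins); torch.cuda.synchronize() is required because CUDA kernels launch asynchronously, so the Python timer would otherwise stop before the GPU work finishes:

```python
import time
import torch

t_data = time.perf_counter()
for x, y in loader:
    data_time_ms = (time.perf_counter() - t_data) * 1000

    t_gpu = time.perf_counter()
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()  # wait for async kernels before stopping the timer
    gpu_time_ms = (time.perf_counter() - t_gpu) * 1000

    print(f"data {data_time_ms:.1f} ms  gpu {gpu_time_ms:.1f} ms")
    t_data = time.perf_counter()
```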
Week 8 — Debugging at Scale
Concepts to understand:
Coding tasks:
Milestone
The off-by-one target bug (Y = X instead of the shifted Y = X[:, 1:]) produces a model that predicts the current token from itself, a trivially easy identity task. Loss drops to near zero almost immediately on both training and validation data. That pattern is the tell: natural language has irreducible entropy, so a genuine language model cannot reach near-zero validation loss. Near-zero loss on both splits signals that the model is learning an identity mapping (or some other form of target leakage), not language statistics.
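For reference, a sketch of correct target construction in the nanoGPT style, assuming data is a 1-D array of token ids such as an np.memmap (function and parameter names are illustrative):

```python
import numpy as np
import torch

def get_batch(data, block_size=1024, batch_size=8):
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i : i + block_size].astype(np.int64))
                     for i in ix])
    # Targets are the same window shifted right by one token, NOT a copy of x.
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64))
                     for i in ix])
    return x, y
```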
Week 9 — Modern Optimizers: AdamW and Muon
Primary resources:
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (the AdamW paper; ~30 min read, focus on §3)
- KellerJordan, modded-nanoGPT — read train_gpt2.py end to end; the Muon optimizer implementation is ~50 lines
Concepts to understand:
Coding tasks:
Milestone
Expected results: Muon should reach the same final validation loss as AdamW in roughly 20–30% fewer steps on small language modeling tasks. The reason: Muon's orthogonalized updates are better conditioned than Adam's, especially early in training, when Adam's second-moment estimates are still noisy. If you see no improvement, check that you are applying Muon only to the 2-D hidden weight matrices and not to the embeddings: in any given step only the rows for tokens present in the batch receive gradient, so Muon would orthogonalize a mostly-zero matrix. See the sketch below.
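A simplified sketch of the core update, assuming only 2-D weight matrices are passed in. The quintic Newton-Schulz coefficients follow the modded-nanoGPT implementation; the real version adds Nesterov momentum and shape-dependent update scaling, and the class skeleton here is illustrative:

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G: push its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # Frobenius norm >= spectral norm,
                                        # so this bounds the spectral norm by 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

class Muon(torch.optim.Optimizer):
    """Momentum plus orthogonalized update; for 2-D hidden weights only."""
    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault(
                    "momentum_buffer", torch.zeros_like(p.grad))
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz(buf), alpha=-group["lr"])
```

In practice you run two optimizers side by side: Muon for the hidden matmul weights, AdamW for embeddings, the output head, and all 1-D parameters (biases, norm gains).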
Week 10 — Learning Rate Schedules and Stability
Concepts to understand:
Coding tasks:
Milestone
Expected ranking of schedules by final val loss (best to worst): (d) > (c) > (b) > (a). The warmdown improvement is often 0.02–0.05 in val loss for a small LM trained for 5000 steps, surprisingly large for a change that touches only the final 10% of training. The intuition: cosine decay is commonly configured to floor at around 10% of the peak LR, which leaves the model still taking moderately large steps at the end of training and prevents full convergence. Warmdown drives the LR to near zero, forcing convergence to a lower-loss solution within the current loss basin.
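A sketch of the warmup / stable / warmdown shape as a plain function of the step index (all hyperparameter values are illustrative):

```python
def get_lr(step, max_lr=6e-4, warmup_steps=250, total_steps=5000, warmdown_frac=0.1):
    """Linear warmup, constant plateau, then linear warmdown to zero."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps           # linear warmup
    if step < warmdown_start:
        return max_lr                                       # stable plateau
    # final fraction of training: drive the LR linearly to zero
    return max_lr * (total_steps - step) / (total_steps - warmdown_start)

# In the training loop, apply it per step:
#     for g in optimizer.param_groups:
#         g["lr"] = get_lr(step)
```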