Phase I — Engineering Foundation
Weeks 1–4 · ~20 hrs
Goal: Bridge the gap between knowing the theory and being able to run it. By the end, you will have a working LM training pipeline with proper checkpointing, a tokenizer you understand from the inside out, and experiment tracking that makes every subsequent phase’s work reproducible. Theory is assumed — this phase is about building engineering muscle memory.
Week 1 primary: Karpathy, nanoGPT video (1h56m) + full model.py / train.py read
Week 2 primary: Sennrich et al., BPE paper + OpenAI tiktoken source + Karpathy, Let’s build the GPT Tokenizer (2h13m)
Week 3 primary: Karpathy, llm.c (Python training script + C main loop)
Week 4 primary: wandb quickstart + Hugging Face accelerate config basics + learning rate finder implementation
Week 1 — nanoGPT: Your Production Baseline
Primary resource: Karpathy — nanoGPT video (1h56m) + nanoGPT source
Watch the video, then read model.py and train.py end to end. Every line. Annotate anything you would not have written yourself. The goal is not to understand the architecture (you already do) but to understand the engineering decisions: why the training loop is structured the way it is, where checkpointing happens, and how the eval loop differs from the training loop.
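As a concrete example of what "annotate the engineering decisions" means, here is a minimal sketch of the checkpoint save/restore pattern you should recognize when you hit it in train.py. The function and variable names are illustrative, not nanoGPT's exact ones.

```python
import torch

def save_checkpoint(model, optimizer, step, best_val_loss, path):
    # Everything needed to resume a run: weights, optimizer state, progress counters.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "best_val_loss": best_val_loss,
    }, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    # map_location keeps loading device-agnostic (e.g. resume a GPU checkpoint on CPU).
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"], ckpt["best_val_loss"]
```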
Concepts to understand:
Coding tasks:
After adding gradient accumulation: run with batch_size=4, grad_accum=4 and with batch_size=16, grad_accum=1. Loss curves should be nearly identical (they are computing the same gradient in expectation). If they diverge significantly, check that you are dividing the loss by grad_accum_steps before each backward call — otherwise you are scaling the gradient by grad_accum_steps and the effective learning rate is too high.
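A minimal sketch of the accumulation loop with that division in place. The model, optimizer, and get_batch arguments are placeholders for your own objects, assuming the nanoGPT convention of the model returning (logits, loss).

```python
import torch

def train_with_accumulation(model, optimizer, get_batch, max_steps, grad_accum_steps=4):
    for step in range(max_steps):
        optimizer.zero_grad(set_to_none=True)
        for micro_step in range(grad_accum_steps):
            x, y = get_batch("train")            # one micro-batch of size batch_size
            logits, loss = model(x, y)
            # Divide before backward so the accumulated gradient is the mean over the
            # full effective batch; without the division the gradient (and hence the
            # effective learning rate) is grad_accum_steps times too large.
            (loss / grad_accum_steps).backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
```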
Week 2 — Tokenization Engineering
Primary resource: Karpathy, Let’s build the GPT Tokenizer (2h13m) + minbpe source
Tokenization is invisible in most tutorials but is one of the largest levers in Parameter Golf: a 4096-token vocabulary uses 4× fewer embedding parameters than a 16384-token vocabulary for the same d_model. The top Parameter Golf entries explicitly tune vocabulary size.
Concepts to understand:
Coding tasks:
Expected observations: at vocab=256 (pure byte-level), average ~1.0 tokens/character; at vocab=4096, ~0.35 tokens/character; at vocab=50257, ~0.25 tokens/character. Smaller vocab → longer sequences → the model needs more context capacity to achieve the same perplexity. The Parameter Golf trade-off is easy to quantify: reducing vocab from 50257 to 4096 saves (50257 - 4096) × d_model embedding parameters, but requires ~1.4× more context length to achieve the same loss. Whether it is worth it depends on how context length affects your architecture’s parameter count.
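A small script for making those measurements yourself. It assumes a minbpe-style BasicTokenizer with train()/encode() methods; the corpus path and d_model value are illustrative.

```python
from minbpe import BasicTokenizer  # assumption: minbpe-style API

def tokens_per_char(text, vocab_size):
    tok = BasicTokenizer()
    tok.train(text, vocab_size)            # vocab_size - 256 merges; 256 = pure byte-level
    return len(tok.encode(text)) / len(text)

def embedding_params_saved(vocab_small, vocab_large, d_model, tied=True):
    # With tied input/output embeddings you save one matrix; untied, two.
    return (1 if tied else 2) * (vocab_large - vocab_small) * d_model

text = open("data/tinyshakespeare.txt", encoding="utf-8").read()
sample = text[:200_000]   # BasicTokenizer is pure Python and slow; a slice gives quick estimates
for v in (256, 1024, 4096):
    print(v, round(tokens_per_char(sample, v), 3))

print(embedding_params_saved(4096, 50257, d_model=256))   # 11,817,216 parameters (tied)
```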
Week 3 — The Full Training Stack
Primary resource: Karpathy — llm.c — read the Python training script and the C implementation’s main loop; don’t worry about the CUDA kernels yet
This week is about understanding every engineering component of a complete training run: data loading, tokenization, batching, forward/backward, optimizer step, logging, and evaluation. The goal is to be able to write the entire stack from memory.
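As a reference point for what "the entire stack" means, here is one possible skeleton with a deliberately tiny stand-in model, so that every component (batching, forward/backward, optimizer step, logging, evaluation) is visible in one place. It is a sketch, not nanoGPT's or llm.c's implementation; swap in your own GPT model with the same (logits, loss) interface.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    # Stand-in model so the skeleton runs end to end; the point is the loop, not the architecture.
    def __init__(self, vocab_size, d_model=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x, y=None):
        h, _ = self.rnn(self.emb(x))
        logits = self.head(h)
        loss = None
        if y is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        return logits, loss

def get_batch(data, batch_size, seq_len, device):
    # nanoGPT-style batching: sample random windows from one flat token array.
    ix = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = torch.from_numpy(np.stack([data[i:i + seq_len] for i in ix]).astype(np.int64))
    y = torch.from_numpy(np.stack([data[i + 1:i + 1 + seq_len] for i in ix]).astype(np.int64))
    return x.to(device), y.to(device)

@torch.no_grad()
def evaluate(model, data, batch_size, seq_len, device, iters=20):
    model.eval()
    losses = [model(*get_batch(data, batch_size, seq_len, device))[1].item() for _ in range(iters)]
    model.train()
    return sum(losses) / len(losses)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Random ids stand in for a tokenized corpus; replace with your memmapped token files.
train_data = np.random.randint(0, 256, size=100_000).astype(np.uint16)
val_data = np.random.randint(0, 256, size=10_000).astype(np.uint16)

model = TinyLM(vocab_size=256).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(201):
    x, y = get_batch(train_data, batch_size=32, seq_len=128, device=device)
    _, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        val = evaluate(model, val_data, 32, 128, device)
        print(f"step {step}: train_loss {loss.item():.3f} val_loss {val:.3f}")
```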
Concepts to understand:
Coding tasks:
A training step at batch_size=32, seq_len=256, d_model=256, n_layers=4 on an A100 should take approximately 15–30ms. If your step time is 200ms+, the bottleneck is almost certainly the data loader (CPU-bound tokenization or disk I/O) rather than the GPU. The fix: pre-tokenize to a binary file and use np.memmap. After the fix, the GPU should be the bottleneck, and you should see MFU increase from <5% to >30%.
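A sketch of that fix, assuming a tokenizer with an encode() method that returns a list of ints; the file paths and function names are illustrative.

```python
import numpy as np

def pretokenize(txt_path, bin_path, tokenizer):
    # One-time preprocessing: tokenize the whole corpus once and write the ids
    # as a flat binary file. Any tokenizer with encode() -> list[int] works here.
    with open(txt_path, "r", encoding="utf-8") as f:
        ids = tokenizer.encode(f.read())
    np.array(ids, dtype=np.uint16).tofile(bin_path)   # uint16 covers vocab sizes < 65536

def load_tokens(bin_path):
    # At training time, memory-map the file: batch sampling becomes cheap array
    # slicing instead of CPU-bound tokenization or per-step disk reads.
    return np.memmap(bin_path, dtype=np.uint16, mode="r")

def sample_batch(data, batch_size, seq_len):
    ix = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = np.stack([data[i:i + seq_len] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1:i + 1 + seq_len] for i in ix]).astype(np.int64)
    return x, y
```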
Week 4 — Experiment Infrastructure
Primary resources: wandb quickstart + hydra config tutorial (or use simple argparse + YAML)
Every experiment from here on should be fully reproducible. This week is about building the scaffolding that makes that possible.
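If you take the argparse + YAML route, the pattern can be as small as this sketch; the field names and config path are illustrative, and the resolved dict is what you later log with the run.

```python
import argparse
import yaml  # pyyaml

def load_config():
    # Defaults live in a YAML file; a handful of CLI flags can override them.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="configs/base.yaml")
    parser.add_argument("--lr", type=float)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--seed", type=int)
    args = parser.parse_args()

    with open(args.config) as f:
        cfg = yaml.safe_load(f)
    for key in ("lr", "batch_size", "seed"):
        if getattr(args, key) is not None:
            cfg[key] = getattr(args, key)
    return cfg

if __name__ == "__main__":
    print(load_config())
```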
Concepts to understand:
Coding tasks:
After adding wandb logging: open the run page during training. You should see val_loss decreasing, grad_norm fluctuating around a roughly constant level (spikes indicate the LR is near the edge of stability), and mfu stable after the first few steps. If mfu is high in the first 10 steps and then drops, you have a data loader issue that is hidden by the GPU warmup. The five-seed reproducibility check should show very small variance between runs: if val loss varies by more than 0.1 between seeds, some source of randomness, most likely initialization or data ordering, is not actually under the control of your seed settings.
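A sketch of the seeding and logging scaffolding behind those checks. The project name and metric names are illustrative, and the wandb.log call assumes you already hold the values computed in your training loop.

```python
import random
import numpy as np
import torch
import wandb

def set_seed(seed):
    # Seed every RNG the run touches; the cudnn flags trade a little speed for determinism.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def init_run(cfg):
    # Logging the resolved config alongside the run is what makes it reproducible later.
    return wandb.init(project="phase1-foundation", config=cfg)

# Inside the training loop, log the metrics discussed above, e.g.:
#   wandb.log({"val_loss": val_loss, "grad_norm": grad_norm, "mfu": mfu}, step=step)
```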
Phase I Consolidation
Engineering checklist — you should be able to do all of these from memory:
You should now be able to reproduce nanoGPT’s Shakespeare training run from memory — not by reading the source, but by writing a training loop that produces the same loss curve. If you can do this, Phase I is complete. The test: open a blank editor, write the full training loop without reference, run it, and get val loss below 1.6 on the character-level Shakespeare dataset in under 10 minutes of training.