Deep Learning Engineering: Language Models
30 weeks · ~5 hrs/wk · ~150 hrs total

Profile: strong theory (transformers, optimization, scaling laws), weak engineering; goal = expert LM practitioner + Parameter Golf competitor.

Focus: engineering over theory. Every week produces running code on real hardware.
Overview
| Phase | Weeks | Theme | File |
|---|---|---|---|
| I | 1–4 | Engineering Foundation | Phase I |
| II | 5–10 | Modern LM Training | Phase II |
| III | 11–15 | Quantization & Compression | Phase III |
| IV | 16–20 | Inference Systems | Phase IV |
| V | 21–25 | Distributed Training | Phase V |
| VI | 26–30 | Novel Architectures & Golf | Phase VI |
Theory is assumed. Attention math, scaling laws, Adam, loss landscapes: none of these are re-taught. The starting point is that you understand the concepts but cannot yet write a production-quality training loop, profile a GPU run, or implement GPTQ. That changes over 30 weeks.
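As a taste of the target skill set, here is a minimal sketch of profiling a GPU run with torch.profiler (covered properly in Phase II). The Linear-layer workload is a hypothetical stand-in; any model and batch would do.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Hypothetical stand-in workload; assumes a CUDA GPU is available
# (drop ProfilerActivity.CUDA and .cuda() to run on CPU).
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Record CPU and CUDA activity for one forward/backward step.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(x).sum()
    loss.backward()

# Top kernels by GPU time: the first thing to read after any training run.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```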
Parameter Golf techniques used by top-50 submissions (the engineering skills this curriculum is designed to unlock; a minimal sketch of one follows the table):
| Technique | Phase where covered |
|---|---|
| GPTQ (post-training quantization) | III |
| FP8 / INT6 quantization-aware training | III |
| Muon optimizer | II |
| Test-Time Training (TTT) architectures | VI |
| Depth recurrence / shared-weight transformers | VI |
| KV cache reduction (GQA/MQA) | IV |
| FlashAttention | IV |
| Vocabulary optimization + bigram hashing | I |
| Brotli/LZMA weight compression | III |
| Distributed training for data-parallel experiments | V |
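To make one of these concrete, below is a minimal sketch of KV cache reduction via grouped-query attention (GQA, Phase IV). The shapes and the repeat_interleave expansion are illustrative assumptions, not the curriculum's reference code.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention: many query heads share fewer KV heads.

    q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d), with
    n_q_heads divisible by n_kv_heads. Only k and v are cached at
    inference time, so the KV cache shrinks by the group factor.
    """
    groups = q.shape[1] // k.shape[1]
    # Expand each KV head across its group of query heads.
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads sharing 2 KV heads -> 4x smaller KV cache.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = gqa_attention(q, k, v)  # (1, 8, 16, 64)
```

MQA is the limit case of a single shared KV head; FlashAttention and PagedAttention (also Phase IV) attack the same memory bottleneck from the kernel and allocator sides.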
Dependency Map
```mermaid
flowchart TD
    subgraph P1["Phase I: Engineering Foundation (Wks 1–4)"]
        nano["nanoGPT full stack<br/>training loop, checkpointing"]
        tok["Tokenization engineering<br/>BPE, vocab optimization"]
        exp["Experiment infrastructure<br/>wandb, configs, LR finder"]
    end
    subgraph P2["Phase II: LM Training Engineering (Wks 5–10)"]
        amp["Mixed precision<br/>BF16, AMP, loss scaling"]
        ckpt["Memory management<br/>grad checkpointing, profiling"]
        pipe["Data pipelines<br/>streaming, WebDataset"]
        muon["Modern optimizers<br/>Muon, AdamW, schedules"]
    end
    subgraph P3["Phase III: Quantization (Wks 11–15)"]
        qfund["Quantization fundamentals<br/>INT8, calibration"]
        gptq["GPTQ<br/>second-order PTQ"]
        qat["QAT<br/>fake quantization"]
        fp8["FP8 training<br/>NF4, low-rank SVD"]
    end
    subgraph P4["Phase IV: Inference Systems (Wks 16–20)"]
        kvcache["KV cache<br/>GQA, MQA, memory"]
        flash["FlashAttention<br/>tiling, IO analysis"]
        spec["Speculative decoding"]
        batch["Continuous batching<br/>PagedAttention, vLLM"]
    end
    subgraph P5["Phase V: Distributed Training (Wks 21–25)"]
        ddp["DDP<br/>allreduce, buckets"]
        fsdp["FSDP / ZeRO<br/>optimizer sharding"]
        tp["Tensor parallelism<br/>Megatron-style"]
        pp["Pipeline parallelism<br/>microbatching"]
    end
    subgraph P6["Phase VI: Novel Architectures (Wks 26–30)"]
        ssm["Mamba / SSMs<br/>selective scan"]
        ttt["TTT layers<br/>test-time training"]
        rec["Depth recurrence<br/>shared weights"]
        golf["Parameter Golf campaign<br/>systematic experiments"]
    end
    nano --> tok
    tok --> exp
    exp --> amp
    amp --> ckpt
    ckpt --> pipe
    pipe --> muon
    muon --> qfund
    qfund --> gptq
    gptq --> qat
    qat --> fp8
    fp8 --> kvcache
    kvcache --> flash
    flash --> spec
    spec --> batch
    batch --> ddp
    ddp --> fsdp
    fsdp --> tp
    tp --> pp
    pp --> ssm
    ssm --> ttt
    ttt --> rec
    rec --> golf
```
References
| Resource | Role |
|---|---|
| Karpathy, nanoGPT (code) | Phase I primary: production LM baseline |
| Karpathy, nanoGPT video (1h56m) | Phase I walkthrough |
| Karpathy, llm.c (code) | Phase II: understanding efficiency from first principles |
| Sennrich et al., BPE paper | Phase I: tokenization |
| Loshchilov & Hutter, AdamW paper | Phase II: optimizer correctness |
| Jordan, Muon optimizer | Phase II: top Parameter Golf optimizer |
| PyTorch, torch.profiler docs | Phase II: profiling |
| PyTorch, AMP tutorial | Phase II: mixed precision |
| Frantar et al., GPTQ paper | Phase III primary |
| Dettmers et al., LLM.int8() paper | Phase III: INT8 quantization |
| Dao et al., FlashAttention paper | Phase IV primary |
| Leviathan et al., Speculative decoding paper | Phase IV |
| Kwon et al., vLLM / PagedAttention paper | Phase IV |
| Rajbhandari et al., ZeRO paper | Phase V primary |
| Shoeybi et al., Megatron-LM paper | Phase V: tensor parallelism |
| Gu & Dao, Mamba paper | Phase VI |
| Sun et al., TTT paper | Phase VI: top Parameter Golf architecture |
| Dehghani et al., Universal Transformers | Phase VI: depth recurrence |
| Modded-nanoGPT, KellerJordan/modded-nanogpt | Phase VI: Parameter Golf reference implementation |