Deep Learning Engineering: Language Models

30 weeks · ~5 hrs/wk · ~150 hrs total
Profile: strong theory (transformers, optimization, scaling laws), weak engineering; goal = expert LM practitioner + Parameter Golf competitor
Focus: engineering over theory; every week produces running code on real hardware


Overview

| Phase | Weeks | Theme | File |
| --- | --- | --- | --- |
| I | 1–4 | Engineering Foundation | Phase I |
| II | 5–10 | Modern LM Training | Phase II |
| III | 11–15 | Quantization & Compression | Phase III |
| IV | 16–20 | Inference Systems | Phase IV |
| V | 21–25 | Distributed Training | Phase V |
| VI | 26–30 | Novel Architectures & Golf | Phase VI |

Theory is assumed. Attention math, scaling laws, Adam, loss landscapes: these are not re-taught. The starting point is: you understand the concepts, but you cannot yet write a production-quality training loop, profile a GPU run, or implement GPTQ. That changes over 30 weeks.
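
To calibrate what "production-quality training loop" means here, a minimal sketch, assuming PyTorch; the toy model and random data are placeholders, not curriculum code. It shows the pieces Phase I drills: LR warmup and cosine decay, bf16 autocast, gradient clipping, and checkpointing.

```python
# Minimal training-loop sketch. The model and batches are stand-ins for a
# real LM; the engineering scaffolding is the point.
import math
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Embedding(256, 64),      # byte-level vocab stand-in
    nn.Flatten(),
    nn.Linear(64 * 32, 256),    # predict the next byte from a 32-byte context
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

max_steps, warmup, base_lr = 1_000, 100, 3e-4
for step in range(max_steps):
    x = torch.randint(0, 256, (16, 32), device=device)  # random "contexts"
    y = torch.randint(0, 256, (16,), device=device)     # random "next bytes"

    # linear warmup, then cosine decay to zero
    if step < warmup:
        lr = base_lr * step / warmup
    else:
        t = (step - warmup) / (max_steps - warmup)
        lr = base_lr * 0.5 * (1 + math.cos(math.pi * t))
    for g in opt.param_groups:
        g["lr"] = lr

    # bf16 autocast on GPU; disabled (a no-op) on CPU
    with torch.autocast(device_type=device, dtype=torch.bfloat16,
                        enabled=(device == "cuda")):
        loss = nn.functional.cross_entropy(model(x), y)

    opt.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()

    if step % 250 == 0:
        torch.save({"step": step, "model": model.state_dict(),
                    "opt": opt.state_dict()}, "ckpt.pt")
        print(f"step {step:4d}  lr {lr:.2e}  loss {loss.item():.3f}")
```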

Parameter Golf techniques used by top-50 submissions (the engineering skills this curriculum is designed to unlock):

| Technique | Phase where covered |
| --- | --- |
| GPTQ (post-training quantization) | III |
| FP8 / INT6 quantization-aware training | III |
| Muon optimizer | II |
| Test-Time Training (TTT) architectures | VI |
| Depth recurrence / shared-weight transformers | VI |
| KV cache reduction (GQA/MQA) | IV |
| FlashAttention | IV |
| Vocabulary optimization + bigram hashing | I |
| Brotli/LZMA weight compression (see the sketch after this table) | III |
| Distributed training for data-parallel experiments | V |
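
A minimal sketch of the weight-compression row above, assuming the `brotli` PyPI package: quantize a weight matrix to int8, entropy-code the raw bytes, and round-trip it. A random tensor is a pessimistic stand-in; trained weights have lower-entropy distributions and compress better.

```python
# Sketch of the "Brotli/LZMA weight compression" idea: int8-quantize, then
# entropy-code the bytes. Requires the `brotli` PyPI package.
import brotli
import torch

w = torch.randn(1024, 1024)            # stand-in for one weight matrix
scale = w.abs().max() / 127            # symmetric per-tensor int8 scale
q = (w / scale).round().clamp(-127, 127).to(torch.int8)

raw = q.numpy().tobytes()
packed = brotli.compress(raw, quality=11)
print(f"fp32: {w.numel() * 4:,} B  int8: {len(raw):,} B  brotli: {len(packed):,} B")

# Round-trip: decompress, reinterpret as int8, rescale back to float.
# Max reconstruction error is bounded by scale / 2.
q2 = torch.frombuffer(bytearray(brotli.decompress(packed)), dtype=torch.int8)
w2 = q2.to(torch.float32).reshape(w.shape) * scale
print("max reconstruction error:", (w - w2).abs().max().item())
```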

Dependency Map

```mermaid
flowchart TD
    subgraph P1["Phase I: Engineering Foundation (Wks 1–4)"]
        nano["nanoGPT full stack<br/>training loop, checkpointing"]
        tok["Tokenization engineering<br/>BPE, vocab optimization"]
        exp["Experiment infrastructure<br/>wandb, configs, LR finder"]
    end
    subgraph P2["Phase II: LM Training Engineering (Wks 5–10)"]
        amp["Mixed precision<br/>BF16, AMP, loss scaling"]
        ckpt["Memory management<br/>grad checkpointing, profiling"]
        pipe["Data pipelines<br/>streaming, WebDataset"]
        muon["Modern optimizers<br/>Muon, AdamW, schedules"]
    end
    subgraph P3["Phase III: Quantization (Wks 11–15)"]
        qfund["Quantization fundamentals<br/>INT8, calibration"]
        gptq["GPTQ<br/>second-order PTQ"]
        qat["QAT<br/>fake quantization"]
        fp8["FP8 training<br/>NF4, low-rank SVD"]
    end
    subgraph P4["Phase IV: Inference Systems (Wks 16–20)"]
        kvcache["KV cache<br/>GQA, MQA, memory"]
        flash["FlashAttention<br/>tiling, IO analysis"]
        spec["Speculative decoding"]
        batch["Continuous batching<br/>PagedAttention, vLLM"]
    end
    subgraph P5["Phase V: Distributed Training (Wks 21–25)"]
        ddp["DDP<br/>allreduce, buckets"]
        fsdp["FSDP / ZeRO<br/>optimizer sharding"]
        tp["Tensor parallelism<br/>Megatron-style"]
        pp["Pipeline parallelism<br/>microbatching"]
    end
    subgraph P6["Phase VI: Novel Architectures (Wks 26–30)"]
        ssm["Mamba / SSMs<br/>selective scan"]
        ttt["TTT layers<br/>test-time training"]
        rec["Depth recurrence<br/>shared weights"]
        golf["Parameter Golf campaign<br/>systematic experiments"]
    end
    nano --> tok
    tok --> exp
    exp --> amp
    amp --> ckpt
    ckpt --> pipe
    pipe --> muon
    muon --> qfund
    qfund --> gptq
    gptq --> qat
    qat --> fp8
    fp8 --> kvcache
    kvcache --> flash
    flash --> spec
    spec --> batch
    batch --> ddp
    ddp --> fsdp
    fsdp --> tp
    tp --> pp
    pp --> ssm
    ssm --> ttt
    ttt --> rec
    rec --> golf
```

References

| Resource | Role |
| --- | --- |
| Karpathy, nanoGPT (code) | Phase I primary: production LM baseline |
| Karpathy, nanoGPT video (1h56m) | Phase I walkthrough |
| Karpathy, llm.c (code) | Phase II: understanding efficiency from first principles |
| Sennrich et al., BPE paper | Phase I: tokenization |
| Loshchilov & Hutter, AdamW paper | Phase II: optimizer correctness |
| Jordan, Muon optimizer | Phase II: top Parameter Golf optimizer |
| PyTorch, torch.profiler docs | Phase II: profiling |
| PyTorch, AMP tutorial | Phase II: mixed precision |
| Frantar et al., GPTQ paper | Phase III primary |
| Dettmers et al., LLM.int8() paper | Phase III: INT8 quantization |
| Dao et al., FlashAttention paper | Phase IV primary |
| Leviathan et al., speculative decoding paper | Phase IV |
| Kwon et al., vLLM / PagedAttention paper | Phase IV |
| Rajbhandari et al., ZeRO paper | Phase V primary |
| Shoeybi et al., Megatron-LM paper | Phase V: tensor parallelism |
| Gu & Dao, Mamba paper | Phase VI |
| Sun et al., TTT paper | Phase VI: top Parameter Golf architecture |
| Dehghani et al., Universal Transformers | Phase VI: depth recurrence |
| Modded-nanoGPT, KellerJordan/modded-nanogpt | Phase VI: Parameter Golf reference implementation |