Deep Learning Engineering: Language Models
30 weeks · ~5 hrs/wk · ~150 hrs total

Profile: strong theory (transformers, optimization, scaling laws), weak engineering; goal = expert LM practitioner + Parameter Golf competitor.

Focus: engineering over theory. Every week produces running code on real hardware.
Overview
| Phase | Weeks | Theme | File |
|---|---|---|---|
| I | 1–4 | Engineering Foundation | Phase I |
| II | 5–10 | Modern LM Training | Phase II |
| III | 11–15 | Quantization & Compression | Phase III |
| IV | 16–20 | Inference Systems | Phase IV |
| V | 21–25 | Distributed Training | Phase V |
| VI | 26–30 | Novel Architectures & Golf | Phase VI |
Theory is assumed. Attention math, scaling laws, Adam, loss landscapes: none of these are re-taught. The starting point is that you understand the concepts but cannot yet write a production-quality training loop, profile a GPU run, or implement GPTQ. That changes over 30 weeks.
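As a taste of the target skill set, here is a minimal sketch of profiling a GPU run with torch.profiler (covered properly in Phase II). The Linear-layer workload is a hypothetical stand-in; any model and batch would do.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Hypothetical stand-in workload; assumes a CUDA GPU is available
# (drop ProfilerActivity.CUDA and .cuda() to run on CPU).
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Record CPU and CUDA activity for one forward/backward step.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(x).sum()
    loss.backward()

# Top kernels by GPU time: the first thing to read after any training run.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```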
Parameter Golf techniques used by top-50 submissions (the engineering skills this curriculum is designed to unlock; a minimal sketch of one follows the table):
| Technique | Phase where covered |
|---|---|
| GPTQ (post-training quantization) | III |
| FP8 / INT6 quantization-aware training | III |
| Muon optimizer | II |
| Test-Time Training (TTT) architectures | VI |
| Depth recurrence / shared-weight transformers | VI |
| KV cache reduction (GQA/MQA) | IV |
| FlashAttention | IV |
| Vocabulary optimization + bigram hashing | I |
| Brotli/LZMA weight compression | III |
| Distributed training for data-parallel experiments | V |
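To make one of these concrete, below is a minimal sketch of KV cache reduction via grouped-query attention (GQA, Phase IV). The shapes and the repeat_interleave expansion are illustrative assumptions, not the curriculum's reference code.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention: many query heads share fewer KV heads.

    q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d), with
    n_q_heads divisible by n_kv_heads. Only k and v are cached at
    inference time, so the KV cache shrinks by the group factor.
    """
    groups = q.shape[1] // k.shape[1]
    # Expand each KV head across its group of query heads.
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads sharing 2 KV heads -> 4x smaller KV cache.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = gqa_attention(q, k, v)  # (1, 8, 16, 64)
```

MQA is the limit case of a single shared KV head; FlashAttention and PagedAttention (also Phase IV) attack the same memory bottleneck from the kernel and allocator sides.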
Dependency Map
```mermaid
flowchart TD
    subgraph P1["Phase I: Engineering Foundation (Wks 1–4)"]
        nano["nanoGPT full stack<br/>training loop, checkpointing"]
        tok["Tokenization engineering<br/>BPE, vocab optimization"]
        exp["Experiment infrastructure<br/>wandb, configs, LR finder"]
    end
    subgraph P2["Phase II: LM Training Engineering (Wks 5–10)"]
        amp["Mixed precision<br/>BF16, AMP, loss scaling"]
        ckpt["Memory management<br/>grad checkpointing, profiling"]
        pipe["Data pipelines<br/>streaming, WebDataset"]
        muon["Modern optimizers<br/>Muon, AdamW, schedules"]
    end
    subgraph P3["Phase III: Quantization (Wks 11–15)"]
        qfund["Quantization fundamentals<br/>INT8, calibration"]
        gptq["GPTQ<br/>second-order PTQ"]
        qat["QAT<br/>fake quantization"]
        fp8["FP8 training<br/>NF4, low-rank SVD"]
    end
    subgraph P4["Phase IV: Inference Systems (Wks 16–20)"]
        kvcache["KV cache<br/>GQA, MQA, memory"]
        flash["FlashAttention<br/>tiling, IO analysis"]
        spec["Speculative decoding"]
        batch["Continuous batching<br/>PagedAttention, vLLM"]
    end
    subgraph P5["Phase V: Distributed Training (Wks 21–25)"]
        ddp["DDP<br/>allreduce, buckets"]
        fsdp["FSDP / ZeRO<br/>optimizer sharding"]
        tp["Tensor parallelism<br/>Megatron-style"]
        pp["Pipeline parallelism<br/>microbatching"]
    end
    subgraph P6["Phase VI: Novel Architectures (Wks 26–30)"]
        ssm["Mamba / SSMs<br/>selective scan"]
        ttt["TTT layers<br/>test-time training"]
        rec["Depth recurrence<br/>shared weights"]
        golf["Parameter Golf campaign<br/>systematic experiments"]
    end
    nano --> tok
    tok --> exp
    exp --> amp
    amp --> ckpt
    ckpt --> pipe
    pipe --> muon
    muon --> qfund
    qfund --> gptq
    gptq --> qat
    qat --> fp8
    fp8 --> kvcache
    kvcache --> flash
    flash --> spec
    spec --> batch
    batch --> ddp
    ddp --> fsdp
    fsdp --> tp
    tp --> pp
    pp --> ssm
    ssm --> ttt
    ttt --> rec
    rec --> golf
```
References
| Resource | Role |
|---|---|
| Karpathy, nanoGPT (code) | Phase I primary: production LM baseline |
| Karpathy, nanoGPT video (1h56m) | Phase I walkthrough |
| Karpathy, llm.c (code) | Phase II: understanding efficiency from first principles |
| Sennrich et al., BPE paper | Phase I: tokenization |
| Loshchilov & Hutter, AdamW paper | Phase II: optimizer correctness |
| Jordan, Muon optimizer | Phase II: top Parameter Golf optimizer |
| PyTorch, torch.profiler docs | Phase II: profiling |
| PyTorch, AMP tutorial | Phase II: mixed precision |
| Frantar et al., GPTQ paper | Phase III primary |
| Dettmers et al., LLM.int8() paper | Phase III: INT8 quantization |
| Dao et al., FlashAttention paper | Phase IV primary |
| Leviathan et al., Speculative decoding paper | Phase IV |
| Kwon et al., vLLM / PagedAttention paper | Phase IV |
| Rajbhandari et al., ZeRO paper | Phase V primary |
| Shoeybi et al., Megatron-LM paper | Phase V: tensor parallelism |
| Gu & Dao, Mamba paper | Phase VI |
| Sun et al., TTT paper | Phase VI: top Parameter Golf architecture |
| Dehghani et al., Universal Transformers | Phase VI: depth recurrence |
| Modded-nanoGPT, KellerJordan/modded-nanogpt | Phase VI: Parameter Golf reference implementation |