ML Systems Curriculum: LLMs & Multimodal Models
52 weeks · ~10 hrs/wk · ~520 hrs total
Profile: Python/PyTorch practitioner, GPU beginner, full-stack goal (inference + training + kernels)
Year Overview
| Part | Weeks | Theme |
|---|---|---|
| I | 1–12 | Foundations |
| II | 13–20 | Hardware Mastery |
| III | 21–28 | Training at Scale |
| IV | 29–35 | Advanced Inference |
| V | 36–42 | Emerging Architectures |
| VI | 43–52 | Compilers & Infrastructure |
Dependency Map
```mermaid
flowchart TD
subgraph P1["① Foundations · Wks 1–12"]
roofline["Roofline & GPU Architecture"]
txArith["Transformer Systems Arithmetic"]
effAttn["Efficient Attention (Flash, GQA, MLA)"]
inference["Quantization & Serving"]
dist["Distributed Training (DP / TP / PP / ZeRO)"]
kernels["Triton Kernels & torch.compile"]
end
subgraph P2["② Hardware Mastery · Wks 13–20"]
advCuda["Advanced CUDA (Warps, Tensor Cores, cp.async)"]
cutlass["CUTLASS / CuTe"]
hopper["Hopper ISA (TMA, WGMMA, Ping-Pong)"]
profiling["Nsight Profiling (Compute + Systems)"]
interconnects["Interconnects & NCCL Basics"]
end
subgraph P3["③ Training at Scale · Wks 21–29"]
seqPar["Sequence & Context Parallelism"]
moeTraining["MoE Training (Expert Parallelism)"]
dataPipe["Data Pipelines & Pretraining Infra"]
peft["LoRA / QLoRA / PEFT"]
rlhf["RLHF Systems (PPO, DPO, GRPO)"]
end
subgraph P4["④ Advanced Inference · Wks 30–37"]
longCtx["Long-Context (YaRN, KV Eviction)"]
advQuant["FP8 & Extreme Quantization"]
advSpec["Advanced Spec Decoding (EAGLE, Medusa)"]
disagg["Disaggregated Inference (DistServe)"]
prodServing["Production Serving & SLO Scheduling"]
end
subgraph P5["⑤ Emerging Architectures · Wks 36–46"]
mamba["Mamba / SSMs & Parallel Scan"]
moeInf["MoE Inference & Expert Offload"]
linAttn["Linear Attention & Hybrid Models"]
diffusion["Diffusion Systems (DiT, Flow Matching)"]
end
subgraph P6["⑥ Compilers & Infra · Wks 43–52"]
mlir["MLIR & Compiler IR Theory"]
tvm["TVM / XLA / Autotuning"]
adSys["Automatic Differentiation Systems"]
memMgmt["CUDA Memory Management"]
ncclDeep["NCCL Deep Dive & Collectives"]
end
%% Part I internal
roofline --> txArith
roofline --> kernels
txArith --> effAttn
effAttn --> inference
effAttn --> dist
%% P1 → P2
roofline --> advCuda
kernels --> cutlass
advCuda --> cutlass
cutlass --> hopper
hopper --> profiling
dist --> interconnects
%% P1 → P3
dist --> seqPar
dist --> moeTraining
txArith --> dataPipe
inference --> peft
peft --> rlhf
seqPar --> rlhf
%% P1 → P4
effAttn --> longCtx
inference --> advQuant
inference --> advSpec
inference --> disagg
inference --> prodServing
%% P2 → P4
hopper --> advQuant
%% P2 → P5
hopper --> mamba
kernels --> mamba
%% P1 → P5
effAttn --> linAttn
txArith --> diffusion
%% P3 → P5
moeTraining --> moeInf
%% P4 → P5
disagg --> moeInf
%% P2 → P6
hopper --> mlir
kernels --> mlir
mlir --> tvm
advCuda --> memMgmt
interconnects --> ncclDeep
kernels --> adSys
```
Part I — Foundations
Goal: Build the hardware intuition, transformer arithmetic, attention theory, inference fundamentals, distributed training basics, and kernel writing skills that underpin everything else.
Week 1 — The Roofline Model
Concepts to understand:
Reading:
Hands-on:
Milestone: Given `matmul(A, B)` with A=(4096,4096), B=(4096,4096) in fp16 on an A100 (312 TFLOP/s, 2 TB/s), compute the arithmetic intensity and determine its roofline regime.
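A minimal worked sketch of this calculation, assuming ideal caching (each operand matrix crosses HBM exactly once):

```python
# Roofline sketch: arithmetic intensity of a square fp16 GEMM vs. the A100 ridge point.
# Assumes each operand matrix is read from (or written to) HBM exactly once.

def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                    # one multiply + one add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem   # read A, read B, write C
    return flops / bytes_moved

ai = gemm_arithmetic_intensity(4096, 4096, 4096)   # ≈ 1365 FLOP/byte
peak_flops = 312e12                                # A100 fp16 tensor-core peak, FLOP/s
peak_bw = 2e12                                     # HBM bandwidth, bytes/s
ridge = peak_flops / peak_bw                       # ≈ 156 FLOP/byte
print(ai, ridge, "compute-bound" if ai > ridge else "memory-bound")
```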
Week 2 — CUDA Mental Model & Memory Hierarchy
Concepts to understand:
Reading:
Hands-on:
Milestone: Explain why a naively written matrix transpose kernel achieves only a fraction of peak memory bandwidth even though it moves only O(N²) bytes, and describe the shared-memory tiling fix for its uncoalesced accesses.
Week 3 — Transformer Systems Arithmetic
Concepts to understand:
Reading:
Hands-on:
Milestone: Why does doubling sequence length quadratically increase attention FLOPs but only linearly increase KV cache memory? Write out the derivation.
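A small sketch of the two scaling laws side by side (the model configuration below is an illustrative assumption, not tied to a specific checkpoint):

```python
# Sketch: per-layer attention FLOPs grow as O(s^2), KV cache grows as O(s).
# Assumed config for illustration: 32 layers, 32 KV heads, d_head = 128, fp16 cache.

def attn_flops_per_layer(s, n_heads=32, d_head=128):
    # QK^T contributes 2*s*s*d_head FLOPs per head; scores @ V contributes another 2*s*s*d_head.
    return n_heads * (2 * s * s * d_head + 2 * s * s * d_head)

def kv_cache_bytes(s, n_layers=32, n_kv_heads=32, d_head=128, bytes_per_elem=2):
    return 2 * n_layers * s * n_kv_heads * d_head * bytes_per_elem   # K and V

for s in (4096, 8192):
    print(s, f"{attn_flops_per_layer(s):.3e} FLOPs/layer", f"{kv_cache_bytes(s)/2**30:.2f} GiB KV")
```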
Week 4 — Efficient Attention
Concepts to understand:
Reading:
Hands-on:
Milestone: For a 70B model, 64 heads, d_head=128, seqlen=8192, batch=1 in fp16 — compute the memory footprint of the full attention matrix under standard MHA. Then state FlashAttention’s peak SMEM usage and explain why it does not grow with sequence length.
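A quick sketch of the MHA half of the milestone:

```python
# Sketch: fp16 memory for the materialized (batch, heads, s, s) attention-score tensor under MHA.
batch, heads, s = 1, 64, 8192
score_bytes = batch * heads * s * s * 2
print(score_bytes / 2**30, "GiB")   # 8 GiB for the scores alone
# FlashAttention never materializes this tensor: it streams K/V tiles through shared memory,
# so on-chip usage is set by the chosen tile sizes, not by sequence length.
```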
Week 5 — Quantization & Speculative Decoding
Concepts to understand:
Reading:
Hands-on:
Week 6 — Serving Systems
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 5–6): For a 7B model on one A100 (80GB), walk through: (1) VRAM available for KV cache after loading int4 weights, (2) concurrent sequences that fit, (3) why continuous batching improves utilization over static batching.
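A rough budgeting sketch; the LLaMA-7B-like configuration and the fixed activation reserve are assumptions:

```python
# Sketch: KV-cache budget for a 7B model on one 80 GB A100 with int4 weights.
# Assumed config: 32 layers, 32 KV heads, d_head = 128, fp16 KV cache, 2K tokens per sequence.
vram = 80e9
weights_int4 = 7e9 * 0.5                  # ~3.5 GB (ignores quantization scales)
reserve = 5e9                             # activations, CUDA context, fragmentation (assumption)
kv_budget = vram - weights_int4 - reserve
per_token_kv = 2 * 32 * 32 * 128 * 2      # K+V * layers * kv_heads * d_head * 2 bytes ≈ 0.5 MB
per_seq_kv = per_token_kv * 2048          # ≈ 1 GiB per 2K-token sequence
print(int(kv_budget // per_seq_kv), "concurrent 2K-token sequences (rough upper bound)")
```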
Week 7 — Data Parallelism, Tensor Parallelism, Pipeline Parallelism
Concepts to understand:
Reading:
Hands-on:
Week 8 — ZeRO, FSDP, Mixed Precision, Gradient Checkpointing
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 7–8): For a 13B model on 8 GPUs with ZeRO-3: calculate per-GPU memory for parameters, gradients, and optimizer states. Estimate the all-gather + reduce-scatter communication volume per step vs. DDP baseline.
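One way to sanity-check the numbers, using the standard 16-bytes-per-parameter mixed-precision breakdown from the ZeRO paper:

```python
# Sketch: ZeRO-3 per-GPU state for a 13B model on 8 GPUs with Adam and bf16/fp32 mixed precision.
P, N = 13e9, 8
params_bytes, grads_bytes, optim_bytes = 2 * P, 2 * P, 12 * P   # bf16 params, bf16 grads, fp32 master + Adam m, v
per_gpu = (params_bytes + grads_bytes + optim_bytes) / N        # ≈ 26 GB, excluding activations
# Communication per step (parameter volume Ψ = 13B elements):
#   DDP: one all-reduce of gradients ≈ 2Ψ on the wire.
#   ZeRO-3: all-gather params for fwd, all-gather for bwd, reduce-scatter grads ≈ 3Ψ (~1.5x DDP).
print(per_gpu / 1e9, "GB per GPU")
```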
Week 9 — Triton Foundations
Concepts to understand:
Reading:
Hands-on:
Week 10 — torch.compile & Inductor
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 9–10): Write a fused layernorm kernel in Triton that computes mean, variance, and normalization in a single pass. Compare throughput and bandwidth to `torch.nn.LayerNorm`. Explain why a single-pass implementation is faster.
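A minimal sketch of one way the kernel could look, assuming each row fits in a single block; a starting point, not a tuned implementation:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def layernorm_fwd(X, Y, W, B, stride, N, eps, BLOCK_SIZE: tl.constexpr):
    # One program per row: load the row once, then compute mean, variance, and the
    # normalized output without re-reading HBM.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    mean = tl.sum(x, axis=0) / N
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(W + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(B + cols, mask=mask, other=0.0).to(tl.float32)
    y = ((x - mean) * rstd * w + b).to(Y.dtype.element_ty)
    tl.store(Y + row * stride + cols, y, mask=mask)

def layernorm(x, weight, bias, eps=1e-5):
    M, N = x.shape
    y = torch.empty_like(x)
    layernorm_fwd[(M,)](x, y, weight, bias, x.stride(0), N, eps,
                        BLOCK_SIZE=triton.next_power_of_2(N))
    return y
```

Benchmark with `triton.testing.do_bench`, and report achieved bandwidth as bytes moved divided by kernel time so the comparison is roofline-aware.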
Week 11 — Vision-Language Model Serving
Concepts to understand:
Reading:
Hands-on:
Milestone: A 336×336 image produces 576 image tokens. For LLaVA-1.5-7B serving 16 concurrent users (one image + 64-token question each), compute effective prefill token count and compare KV cache memory to a text-only baseline.
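A short sketch of the arithmetic; the language-model configuration is an assumption for illustration:

```python
# Sketch: prefill tokens and KV-cache growth for LLaVA-1.5-7B-style serving.
# Assumed LM config: 32 layers, 32 KV heads, d_head = 128, fp16 cache.
users, image_tokens, text_tokens = 16, 576, 64
prefill_vlm = users * (image_tokens + text_tokens)    # 10,240 tokens
prefill_text_only = users * text_tokens               # 1,024 tokens
per_token_kv = 2 * 32 * 32 * 128 * 2                  # ≈ 0.5 MB per token
print(prefill_vlm, f"{prefill_vlm * per_token_kv / 2**30:.1f} GiB KV",
      "vs text-only", f"{prefill_text_only * per_token_kv / 2**30:.2f} GiB")
```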
Week 12 — Part I Capstone
Choose one track:
Deliverable checklist:
Part II — Hardware Mastery
Goal: Go from Triton user to someone who can reason about hardware ISA-level behavior, write CUTLASS kernels, profile with Nsight Compute, and understand the full interconnect stack.
Week 13 — Advanced CUDA: Warp Primitives & Tensor Cores
Concepts to understand:
Reading:
Hands-on:
Week 14 — CUDA Concurrency: Streams, Graphs, Async Copy
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 13–14): Implement a fused LayerNorm with warp-shuffle reduction and double-buffered `cp.async` pipelining. Profile in Nsight Compute and identify: (a) achieved bandwidth as % of theoretical, (b) dominant stall reason.
Week 15 — CUTLASS & CuTe
Concepts to understand:
Reading:
Hands-on:
Week 16 — Hopper Architecture: TMA & WGMMA
Concepts to understand:
Reading:
Hands-on:
Week 17 — FlashAttention-3 as Hopper Case Study
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 15–17): Write a fused FP8 attention kernel in CUTLASS 3.x with TMA + WGMMA + Ping-Pong scheduling. Profile vs. a Triton FlashAttention-2 implementation. Explain the performance delta using roofline analysis.
Week 18 — Profiling Methodology: Nsight Compute & Nsight Systems
Concepts to understand:
Reading:
Hands-on:
Week 19 — Alternative Accelerators
Concepts to understand:
Reading:
Hands-on:
Week 20 — Interconnects, NCCL, & Part II Capstone
Concepts to understand:
Reading:
Hands-on:
Milestone (Part II): End-to-end kernel engineering capstone. Choose one: (a) fused FP8 attention with TMA + WGMMA, (b) INT8/FP8 GEMM with per-block dequantization epilogue vs. cuBLAS, or (c) roofline-guided optimization of an underperforming open-source kernel with 3 distinct improvements verified in Nsight Compute.
Part III — Training at Scale
Goal: Extend beyond the ZeRO/Megatron basics to cover sequence parallelism, MoE training, data infrastructure, fault tolerance, efficient finetuning, and RLHF systems.
Week 21 — Sequence Parallelism & Context Parallelism
Concepts to understand:
Reading:
Hands-on:
Week 22 — MoE Training: Architecture & Routing
Concepts to understand:
Reading:
Hands-on:
Week 23 — MoE Training: Systems & Communication
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 22–23): For a 47B MoE model (Mixtral-style, 8 experts, top-2), compute: (a) active parameters per token, (b) VRAM per GPU with EP=8, (c) all-to-all communication volume per forward pass. Compare to a 13B dense model with equivalent per-token FLOPs.
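A back-of-the-envelope sketch; the split between shared (non-expert) and expert parameters is an assumption, so treat the outputs as order-of-magnitude estimates:

```python
# Sketch: Mixtral-style MoE accounting (47B total, 8 experts, top-2 routing).
total_params, n_experts, top_k = 47e9, 8, 2
shared = 1.3e9                                        # attention, embeddings, router (assumed)
per_expert = (total_params - shared) / n_experts      # ≈ 5.7B
active_per_token = shared + top_k * per_expert        # ≈ 12.7B
vram_ep8_fp16 = (shared + per_expert) * 2 / 1e9       # EP=8: one expert shard per GPU, shared weights replicated
print(f"{active_per_token/1e9:.1f}B active params/token,",
      f"{vram_ep8_fp16:.1f} GB fp16 weights per GPU at EP=8")
```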
Week 24 — Data Pipelines for Pretraining
Concepts to understand:
Reading:
Hands-on:
Week 25 — Fault Tolerance & Large-Scale Reliability
Concepts to understand:
Reading:
Hands-on:
Week 26 — Efficient Finetuning: LoRA, QLoRA, PEFT
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 24–26): Build a complete finetuning pipeline: streaming data loading → QLoRA training with async checkpointing → PEFT weight merge → evaluation. Document the memory and throughput profile at each stage.
Week 27 — RLHF Systems: SFT, Reward Models & PPO
Concepts to understand:
Reading:
Hands-on:
Week 28 — RLHF Algorithm Variants: DPO, GRPO & Beyond
Concepts to understand:
Reading:
Hands-on:
Milestone (Part III): Full RLHF pipeline: streaming data loading → QLoRA SFT → DPO training with async checkpointing → evaluation. Document GPU memory, throughput, and reward statistics at each stage.
Part IV — Advanced Inference
Goal: Go deep on long-context serving, next-generation quantization, model compression, advanced speculative decoding, disaggregated inference, and production-grade serving infrastructure.
Week 29 — Long-Context Inference: Position Encodings
Concepts to understand:
Reading:
Hands-on:
Week 30 — Long-Context Inference: KV Cache Management
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 29–30): For a LLaMA-3-70B model on 4×A100 with tensor parallelism, compute: (a) baseline KV cache size at seqlen=32K, batch=8, (b) memory savings under 50% H2O eviction, (c) latency impact of chunked prefill at different chunk sizes.
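A sketch of part (a); the model configuration (80 layers, 8 GQA KV heads, d_head=128) is an assumption stated up front:

```python
# Sketch: baseline KV-cache size for a LLaMA-3-70B-style model, fp16 cache, sharded over TP=4.
layers, kv_heads, d_head = 80, 8, 128
seqlen, batch = 32_768, 8
kv_bytes = 2 * layers * kv_heads * d_head * 2 * seqlen * batch   # K and V, fp16
print(f"{kv_bytes / 2**30:.0f} GiB total,", f"{kv_bytes / 4 / 2**30:.0f} GiB per GPU under TP=4")
# 50% H2O eviction roughly halves the steady-state figure; chunked prefill trades one large
# prefill latency spike for several smaller chunks interleaved with decode.
```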
Week 31 — Advanced Quantization: FP8, GGUF, and Extreme Low-Bit
Concepts to understand:
Reading:
Hands-on:
Week 32 — Model Compression: Pruning & Distillation
Concepts to understand:
Reading:
Hands-on:
Week 33 — Advanced Speculative Decoding
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 31–33): Given a 70B model with 50% Wanda pruning + AQLM 2-bit quantization + EAGLE-2 speculative decoding: estimate theoretical tokens/sec vs. the uncompressed baseline. Identify which technique delivers the largest throughput gain per unit of quality lost.
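One crude way to frame the estimate is a bandwidth-bound decode model; every number below is a placeholder assumption, not a measurement:

```python
# Sketch: single-stream decode is roughly memory-bandwidth-bound, so
# tokens/sec ≈ HBM bandwidth / weight bytes read per token, times any speculative speedup.
bw = 2e12                                   # A100-class HBM bandwidth, B/s (assumed)
params = 70e9
bytes_fp16 = params * 2
bytes_compressed = params * 0.5 * (2 / 8)   # 50% Wanda sparsity x 2-bit AQLM; idealized, assumes
                                            # pruned weights are truly skipped in storage
spec_speedup = 2.5                          # assumed EAGLE-2 wall-clock gain
baseline_tps = bw / bytes_fp16              # ≈ 14 tokens/s
compressed_tps = spec_speedup * bw / bytes_compressed
print(f"{baseline_tps:.0f} -> {compressed_tps:.0f} tokens/s (idealized, single stream)")
```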
Week 34 — Disaggregated & Distributed Inference
Concepts to understand:
Reading:
Hands-on:
Week 35 — Production Serving Infrastructure
Concepts to understand:
Reading:
Hands-on:
Milestone (Part IV): End-to-end serving system: FP8-quantized 70B model on disaggregated prefill/decode infrastructure, with SLO-aware scheduling, autoscaling, and cost monitoring. Document latency, throughput, GPU utilization, and $/token achieved.
Part V — Emerging Architectures
Goal: Understand Mamba/SSMs, MoE inference, linear attention variants, diffusion systems, and advanced multimodal architectures — always through the systems lens of compute, memory, and parallelism.
Week 36 — State Space Models: Mamba
Concepts to understand:
Reading:
Hands-on:
Week 37 — Mamba-2, Triton Scan Kernels & Hybrid Architectures
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 36–37): Build a Mamba-2 inference server. Measure: (a) per-token VRAM at seqlen 1K vs. 64K vs. 256K (should be constant), (b) throughput vs. an equivalent-parameter transformer at each seqlen, (c) the crossover point where Mamba becomes faster.
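A sketch of why part (a) should come out flat; both configurations below are illustrative assumptions:

```python
# Sketch: Mamba-style inference memory is constant in sequence length.
# Assumed: Mamba-2-like (64 layers, d_model=4096, d_state=128) vs. a GQA transformer
# baseline (64 layers, 8 KV heads, d_head=128), both fp16.
def mamba_state_bytes(layers=64, d_model=4096, d_state=128):
    d_inner = 2 * d_model                       # fixed per-layer SSM state, independent of seqlen
    return layers * d_inner * d_state * 2

def kv_cache_bytes(seqlen, layers=64, kv_heads=8, d_head=128):
    return 2 * layers * kv_heads * d_head * 2 * seqlen

for s in (1_000, 64_000, 256_000):
    print(f"seqlen={s}: Mamba state {mamba_state_bytes()/2**20:.0f} MiB vs "
          f"KV cache {kv_cache_bytes(s)/2**30:.1f} GiB")
```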
Week 38 — MoE Architecture & Inference Serving
Concepts to understand:
Reading:
Hands-on:
Week 39 — Linear Attention & Hybrid Architectures
Concepts to understand:
Reading:
Hands-on:
Week 40 — Diffusion Model Systems: Samplers & Architectures
Concepts to understand:
Reading:
Hands-on:
Week 41 — Diffusion Serving & Distributed Execution
Concepts to understand:
Reading:
Hands-on:
Milestone (Weeks 40–41): For a DiT-XL/2 model serving at 256 requests/sec, design an architecture: (a) step count vs. quality tradeoff using DPM-Solver, (b) batching strategy with variable CFG, (c) multi-GPU with DistriFusion — estimate total GPU count needed to hit 100ms TTFT P99.
Week 42 — Advanced Multimodal: Video & Audio
Concepts to understand:
Reading:
Hands-on:
Milestone (Part V): Implement a Mamba-2 parallel scan kernel and benchmark it end-to-end: compare throughput, memory, and quality against FlashAttention-2 at seqlens from 2K to 256K. Write a one-page analysis of which architecture you’d choose for which use case.
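Before writing the Triton kernel, it can help to pin down the math with a reference parallel scan in plain PyTorch. The combine rule below is the standard composition of affine recurrences, and the sequential loop is only a sanity check:

```python
import torch

def parallel_scan(a, b):
    """Hillis-Steele-style scan for h[t] = a[t] * h[t-1] + b[t], with h[-1] = 0.
    Each step is the affine map h -> a*h + b; composing (a2, b2) after (a1, b1) gives
    (a2*a1, a2*b1 + b2), which is associative, so the scan runs in O(log T) parallel steps."""
    a, b = a.clone(), b.clone()
    step, T = 1, a.shape[0]
    while step < T:
        a_prev, b_prev = a[:-step], b[:-step]
        b[step:] = a[step:] * b_prev + b[step:]   # fold the earlier prefix into the later element
        a[step:] = a[step:] * a_prev
        step *= 2
    return b

# Sanity check against the sequential recurrence.
T, D = 1024, 16
a = torch.rand(T, D, dtype=torch.float64) * 0.9
b = torch.randn(T, D, dtype=torch.float64)
h, ref = torch.zeros(D, dtype=torch.float64), []
for t in range(T):
    h = a[t] * h + b[t]
    ref.append(h.clone())
assert torch.allclose(parallel_scan(a, b), torch.stack(ref))
```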
Part VI — Compilers & Infrastructure
Goal: Understand the full compiler stack from MLIR down to PTX, the internals of automatic differentiation, memory allocation, and production MLOps infrastructure.
Week 43 — Compiler IR Theory & MLIR
Concepts to understand:
Reading:
Hands-on:
Week 44 — TVM, XLA & Autotuning
Concepts to understand:
Reading:
Hands-on:
Week 45 — Automatic Differentiation Systems
Concepts to understand:
Reading:
Hands-on:
Week 46 — Memory Management & Allocators
Concepts to understand:
Reading:
Hands-on:
Week 47 — NCCL Deep Dive
Concepts to understand:
Reading:
Hands-on:
Week 48 — Deployment & Production Infrastructure
Concepts to understand:
Reading:
Hands-on:
Week 49 — Advanced Profiling & Distributed Debugging
Concepts to understand:
Reading:
Hands-on:
Week 50 — Integrating the Compiler Stack
Concepts to understand:
Reading:
Hands-on:
Week 51 — Reading Week & Integration
A structured review week with no new material. Revisit the notes, exercises, and milestone answers from the hardest weeks.
Suggested review targets:
Week 52 — Year-End Capstone
Goal: Ship something production-quality that integrates at least 3 domains from the curriculum. The deliverable should be something you’d be comfortable presenting as a portfolio piece.
Choose one track:
Deliverable checklist:
Reference Lists
Canonical Papers by Week
| Paper | Topic | Week |
|---|---|---|
| Attention Is All You Need | Transformer | prerequisite |
| Scaling Laws for Neural Language Models | Scaling | 3 |
| FlashAttention | Efficient attention | 4 |
| FlashAttention-2 | Efficient attention | 4 |
| GQA | KV compression | 4 |
| DeepSeek-V2 §3 MLA | KV compression | 4 |
| LLM.int8() | Quantization | 5 |
| GPTQ | Quantization | 5 |
| AWQ | Quantization | 5 |
| Speculative Decoding | Inference | 5 |
| PagedAttention / vLLM | Serving | 6 |
| Orca | Serving | 6 |
| SGLang | Serving | 6 |
| Megatron-LM | Distributed | 7 |
| Megatron 3D | Distributed | 7 |
| ZeRO | Distributed | 8 |
| Mixed Precision Training | Training | 8 |
| Activation Recomputation | Training | 8 |
| Triton | Kernels | 9 |
| PyTorch 2 | Compilation | 10 |
| CLIP | Multimodal | 11 |
| LLaVA | Multimodal | 11 |
| FlashAttention-3 | Hardware | 17 |
| Ring Attention | Seq parallelism | 21 |
| DeepSpeed Ulysses | Seq parallelism | 21 |
| Switch Transformers | MoE | 22 |
| GShard | MoE | 22 |
| MegaBlocks | MoE | 23 |
| Chinchilla | Scaling | 24 |
| LoRA | Finetuning | 26 |
| QLoRA | Finetuning | 26 |
| DPO | RLHF | 28 |
| DeepSeekMath (GRPO) | RLHF | 28 |
| YaRN | Long context | 29 |
| SnapKV | KV eviction | 30 |
| QuIP# | Quantization | 31 |
| AQLM | Quantization | 31 |
| Wanda | Pruning | 32 |
| EAGLE | Spec decode | 33 |
| EAGLE-2 | Spec decode | 33 |
| DistServe | Disaggregated | 34 |
| Mamba | SSMs | 36 |
| Mamba-2 / SSD | SSMs | 37 |
| GLA | Linear attn | 39 |
| RWKV | Linear attn | 39 |
| DiT | Diffusion | 40 |
| DPM-Solver | Diffusion | 40 |
| Flow Matching | Diffusion | 40 |
| GSPMD | Compilers | 44 |
| Demystifying NCCL | Networking | 47 |
Key Blogs & References
| Resource | What It’s Good For | Link |
|---|---|---|
| Making Deep Learning Go Brrrr | Compute/memory bottleneck taxonomy | https://horace.io/brrr_intro.html |
| Transformer Math 101 (EleutherAI) | FLOP/memory arithmetic reference | https://blog.eleuther.ai/transformer-math/ |
| JAX Scaling Book | Roofline, sharding, distributed training | https://jax-ml.github.io/scaling-book/ |
| Stas Bekman’s ML Engineering | Practical distributed training cookbook | https://github.com/stas00/ml-engineering |
| GPU MODE Lectures | Advanced CUDA, CUTLASS, Triton | https://github.com/gpu-mode/lectures |
| Lilian Weng’s blog | Broad ML coverage, well-cited | https://lilianweng.github.io |
| Sebastian Raschka’s newsletter | LLM research summaries | https://magazine.sebastianraschka.com |
| CutlassAcademy | CUTLASS 3.x tutorials | https://github.com/MekkCyber/CutlassAcademy |
| flash-linear-attention | Triton kernels for SSMs/linear attn | https://github.com/fla-org/flash-linear-attention |
| LLM Inference Optimization Papers | Curated inference paper list | https://github.com/chenhongyu2048/LLM-inference-optimization-paper |
Last updated: 2026-03-15. Revisit pacing at Part boundaries.