ML Systems Curriculum: LLMs & Multimodal Models
52 weeks · ~10 hrs/wk · ~520 hrs total
Profile: Python/PyTorch practitioner, GPU beginner, full-stack goal (inference + training + kernels)
Year Overview
| Part | Weeks | Theme |
|---|---|---|
| I | 1–12 | Foundations |
| II | 13–20 | Hardware Mastery |
| III | 21–29 | Training at Scale |
| IV | 30–37 | Advanced Inference |
| V | 38–46 | Emerging Architectures |
| VI | 47–52 | Compilers & Infrastructure |
Dependency Map
flowchart TD
subgraph P1["① Foundations · Wks 1–12"]
roofline["Roofline & GPU Architecture"]
txArith["Transformer Systems Arithmetic"]
effAttn["Efficient Attention (Flash, GQA, MLA)"]
inference["Quantization & Serving"]
dist["Distributed Training (DP / TP / PP / ZeRO)"]
kernels["Triton Kernels & torch.compile"]
end
subgraph P2["② Hardware Mastery · Wks 13–20"]
advCuda["Advanced CUDA (Warps, Tensor Cores, cp.async)"]
cutlass["CUTLASS / CuTe"]
hopper["Hopper ISA (TMA, WGMMA, Ping-Pong)"]
profiling["Nsight Profiling (Compute + Systems)"]
interconnects["Interconnects & NCCL Basics"]
end
subgraph P3["③ Training at Scale · Wks 21–29"]
seqPar["Sequence & Context Parallelism"]
moeTraining["MoE Training (Expert Parallelism)"]
dataPipe["Data Pipelines & Pretraining Infra"]
peft["LoRA / QLoRA / PEFT"]
rlhf["RLHF Systems (PPO, DPO, GRPO)"]
end
subgraph P4["④ Advanced Inference · Wks 30–37"]
longCtx["Long-Context (YaRN, KV Eviction)"]
advQuant["FP8 & Extreme Quantization"]
advSpec["Advanced Spec Decoding (EAGLE, Medusa)"]
disagg["Disaggregated Inference (DistServe)"]
prodServing["Production Serving & SLO Scheduling"]
end
subgraph P5["⑤ Emerging Architectures · Wks 38–46"]
mamba["Mamba / SSMs & Parallel Scan"]
moeInf["MoE Inference & Expert Offload"]
linAttn["Linear Attention & Hybrid Models"]
diffusion["Diffusion Systems (DiT, Flow Matching)"]
end
subgraph P6["⑥ Compilers & Infra · Wks 47–52"]
mlir["MLIR & Compiler IR Theory"]
tvm["TVM / XLA / Autotuning"]
adSys["Automatic Differentiation Systems"]
memMgmt["CUDA Memory Management"]
ncclDeep["NCCL Deep Dive & Collectives"]
end
%% Part I internal
roofline --> txArith
roofline --> kernels
txArith --> effAttn
effAttn --> inference
effAttn --> dist
%% P1 → P2
roofline --> advCuda
kernels --> cutlass
advCuda --> cutlass
cutlass --> hopper
hopper --> profiling
dist --> interconnects
%% P1 → P3
dist --> seqPar
dist --> moeTraining
txArith --> dataPipe
inference --> peft
peft --> rlhf
seqPar --> rlhf
%% P1 → P4
effAttn --> longCtx
inference --> advQuant
inference --> advSpec
inference --> disagg
inference --> prodServing
%% P2 → P4
hopper --> advQuant
%% P2 → P5
hopper --> mamba
kernels --> mamba
%% P1 → P5
effAttn --> linAttn
txArith --> diffusion
%% P3 → P5
moeTraining --> moeInf
%% P4 → P5
disagg --> moeInf
%% P2 → P6
hopper --> mlir
kernels --> mlir
mlir --> tvm
advCuda --> memMgmt
interconnects --> ncclDeep
kernels --> adSys
Part I — Foundations
Goal: Build the hardware intuition, transformer arithmetic, attention theory, inference fundamentals, distributed training basics, and kernel writing skills that underpin everything else.
Week 1 — The Roofline Model
Concepts to understand:
- [ ] Arithmetic intensity: what it is and how to compute it for an op
- [ ] The roofline model: compute ceiling vs memory bandwidth ceiling
- [ ] Ridge point: the arithmetic intensity that separates memory-bound from compute-bound
- [ ] HBM vs SRAM: capacity, bandwidth, and latency tradeoffs
- [ ] GPU SM, warp, and thread hierarchy at a high level
- [ ] How to read a PyTorch profiler trace

Reading:
- [ ] Making Deep Learning Go Brrrr — Horace He (1.5 hrs)
- [ ] All About Rooflines — JAX Scaling Book (2 hrs)
- [ ] Transformer Math 101 — EleutherAI (1.5 hrs)
- [ ] NVIDIA Hopper Architecture Whitepaper §1–3 (1.5 hrs)
- [ ] PyTorch Profiler Tutorial (1 hr)
Hands-on:
- [ ] Run PyTorch profiler on a forward pass of GPT2LMHeadModel. Identify the top 3 ops by CUDA time and classify each as compute- or memory-bound. (1.5 hrs)
Milestone: Given matmul(A, B) with A=(4096,4096), B=(4096,4096) in fp16 on an A100 (312 TFLOP/s, 2 TB/s), compute arithmetic intensity and determine its roofline regime.
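The milestone arithmetic is worth automating once; a minimal sketch using the A100 figures quoted above (the 3× factor assumes A, B, and C each cross HBM exactly once, which is the idealized case):

```python
# Roofline check for a (4096,4096) x (4096,4096) fp16 matmul on an A100.
N = 4096
flops = 2 * N**3                 # 2 FLOPs per multiply-accumulate
bytes_moved = 3 * N * N * 2      # read A, read B, write C; 2 bytes/elt in fp16
ai = flops / bytes_moved         # arithmetic intensity, FLOP/byte

peak_flops = 312e12              # A100 dense fp16 Tensor Core peak
peak_bw = 2e12                   # A100 HBM bandwidth, bytes/s
ridge = peak_flops / peak_bw     # ridge point: 156 FLOP/byte

regime = "compute-bound" if ai > ridge else "memory-bound"
print(f"AI = {ai:.0f} FLOP/byte, ridge = {ridge:.0f} -> {regime}")
```

At AI ≈ 1365 FLOP/byte, well above the ridge point of 156, this matmul sits firmly in the compute-bound regime.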
Week 2 — CUDA Mental Model & Memory Hierarchy
Concepts to understand:
- [ ] Thread/block/grid hierarchy and how it maps to GPU hardware
- [ ] Shared memory (SMEM): capacity per SM, latency vs global memory
- [ ] Global memory coalescing: why access pattern matters for bandwidth
- [ ] Occupancy: how register and SMEM usage limits active warps per SM
- [ ] Bank conflicts in shared memory
- [ ] L1/L2 cache behavior

Reading:
- [ ] Programming Massively Parallel Processors Ch. 1–5 — Hwu & Kirk (4 hrs)
- [ ] ECE408/CS483 Lecture Videos Weeks 1–3 — UIUC via NVIDIA (3 hrs)
- [ ] An Even Easier Introduction to CUDA — NVIDIA Blog (1 hr)

Hands-on:
- [ ] Write a naive CUDA vector-add kernel (in C or via Numba). Measure achieved bandwidth vs theoretical peak. (2 hrs)
Milestone: Explain why a naively written matrix transpose kernel achieves far below peak memory bandwidth despite doing no arithmetic — and describe the shared memory tiling fix.
Week 3 — Transformer Systems Arithmetic
Concepts to understand:
- [ ] Parameter count formula for a transformer (embedding, attn projections, MLP, LM head)
- [ ] FLOPs ≈ 6ND for training, ≈ 2ND for inference — where this comes from
- [ ] KV cache memory: 2 · n_layers · n_heads · d_head · seqlen · batch · dtype_bytes
- [ ] Model FLOPs utilization (MFU): definition and how to measure it
- [ ] Scaling laws: compute-optimal training (Chinchilla)
- [ ] Why seqlen causes quadratic FLOPs but linear KV cache growth
Reading:
- [ ] Transformer Math 101 — EleutherAI (re-read carefully, 1.5 hrs)
- [ ] Language Models are Few-Shot Learners §2 — GPT-3 (1 hr)
- [ ] Scaling Laws for Neural Language Models — Kaplan et al. (2 hrs)
- [ ] JAX Scaling Book Ch. 2–3 (2 hrs)

Hands-on:
- [ ] For a 7B-parameter LLaMA-2 in bf16, derive: (a) weight memory, (b) KV cache at batch=32 seqlen=2048, (c) FLOPs per forward pass. (3 hrs)
Milestone: Why does doubling sequence length quadratically increase attention FLOPs but only linearly increase KV cache memory? Write out the derivation.
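The milestone claim can be checked numerically. A sketch using illustrative 7B-ish shapes (32 layers, 32 heads, d_head=128 are assumptions; the scaling behavior is not):

```python
# Attention score/value FLOPs scale as seqlen^2; KV cache bytes scale as seqlen.
def attn_flops(seqlen, n_heads=32, d_head=128):
    # QK^T and attn @ V: two (S x d)-by-(d x S)-shaped matmuls per head,
    # 2 FLOPs per multiply-accumulate
    return 2 * 2 * n_heads * seqlen * seqlen * d_head

def kv_cache_bytes(seqlen, n_layers=32, n_heads=32, d_head=128,
                   batch=1, dtype_bytes=2):
    # leading 2x counts both K and V (same formula as the concepts list)
    return 2 * n_layers * n_heads * d_head * seqlen * batch * dtype_bytes

flops_ratio = attn_flops(4096) / attn_flops(2048)          # quadratic -> 4x
cache_ratio = kv_cache_bytes(4096) / kv_cache_bytes(2048)  # linear -> 2x
```

Doubling seqlen quadruples attention FLOPs (seqlen appears twice in the score matmul) but only doubles KV cache (seqlen appears once in the cache formula).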
Week 4 — Efficient Attention
Concepts to understand:
- [ ] IO complexity of standard attention: why materializing (seqlen × seqlen) is the bottleneck
- [ ] FlashAttention tiling: splitting Q/K/V into blocks to avoid writing the full attention matrix
- [ ] Online softmax: computing softmax incrementally across tiles
- [ ] GQA/MQA: sharing K/V heads across query heads and the memory savings
- [ ] MLA: low-rank KV compression in DeepSeek-V2 and the reported 93.3% cache reduction
- [ ] FlashAttention-2 improvements over v1

Reading:
- [ ] FlashAttention — Dao et al. 2022 (2 hrs)
- [ ] FlashAttention-2 — Dao 2023 (1.5 hrs)
- [ ] GQA: Training Generalized Multi-Query Transformer Models (1.5 hrs)
- [ ] DeepSeek-V2 §3 — MLA (1 hr)
- [ ] FlashAttention-3 blog (1 hr)
Hands-on:
- [ ] Implement FlashAttention forward tiling logic in pure NumPy. Verify against torch.nn.functional.scaled_dot_product_attention. (3 hrs)
Milestone: For a 70B model, 64 heads, d_head=128, seqlen=8192, batch=1 in fp16 — compute the memory footprint of the full attention matrix under standard MHA. Then state FlashAttention’s peak SMEM usage and explain why.
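The online-softmax piece of the hands-on task can be prototyped in isolation before tackling full tiling. A minimal NumPy sketch of the streamed statistics (block size 4 is an arbitrary choice):

```python
import numpy as np

def online_softmax_stats(scores, block=4):
    # Stream over blocks keeping a running max m and running denominator l,
    # rescaling l by exp(m_old - m_new) whenever the max changes -- the same
    # correction FlashAttention applies per tile.
    m, l = -np.inf, 0.0
    for i in range(0, len(scores), block):
        s = scores[i:i + block]
        m_new = max(m, float(s.max()))
        l = l * np.exp(m - m_new) + float(np.exp(s - m_new).sum())
        m = m_new
    return m, l

x = np.random.default_rng(0).normal(size=16)
m, l = online_softmax_stats(x)
streamed = np.exp(x - m) / l                              # softmax from streamed stats
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
```

The full hands-on extends this by carrying a running weighted sum of V tiles alongside (m, l), rescaled by the same correction factor.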
Week 5 — Quantization & Speculative Decoding
Concepts to understand:
- [ ] Post-training quantization (PTQ) vs quantization-aware training (QAT)
- [ ] Why outlier activations break naive int8 quantization (LLM.int8() mixed-precision)
- [ ] Weight-only quantization (GPTQ, AWQ) vs weight+activation quantization
- [ ] AWQ: activation-aware scaling to protect salient weights
- [ ] Speculative decoding: draft model generates K tokens, target verifies in parallel
- [ ] Token acceptance rate and when speculative decoding wins/loses

Reading:
- [ ] LLM.int8() — Dettmers et al. (1.5 hrs)
- [ ] GPTQ (1.5 hrs)
- [ ] AWQ: Activation-aware Weight Quantization (1 hr)
- [ ] Fast Inference via Speculative Decoding — Leviathan et al. (1.5 hrs)
- [ ] Hugging Face quantization guide (1 hr)

Hands-on:
- [ ] Load LLaMA-3-8B in fp16, int8 (bitsandbytes), and GPTQ 4-bit. Measure decode latency, throughput, and perplexity for each. (3 hrs)
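The acceptance-rate tradeoff has a closed form worth internalizing. Under the simplifying assumption that each draft token is accepted independently with probability alpha, the expected tokens per verification pass (from the Leviathan et al. analysis) is:

```python
# Expected tokens emitted per target-model forward pass when the draft model
# proposes k tokens and each is accepted i.i.d. with probability alpha.
def expected_tokens(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

good_draft = expected_tokens(0.8, 4)   # strong draft model: ~3.36 tokens/pass
bad_draft = expected_tokens(0.3, 4)    # weak draft model: ~1.43 tokens/pass
```

Speculative decoding wins only when this multiplier exceeds the relative cost of running the draft model; a poorly aligned draft can make total latency worse than plain decoding.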
Week 6 — Serving Systems
Concepts to understand:
- [ ] Prefill vs decode phases: why prefill is compute-bound and decode is memory-bandwidth-bound
- [ ] TTFT (time-to-first-token) vs TPOT (time-per-output-token)
- [ ] Continuous batching (iteration-level scheduling)
- [ ] PagedAttention: virtual memory for KV cache, block table indirection
- [ ] KV cache fragmentation and how paging limits it to <4%
- [ ] SGLang vs vLLM tradeoffs

Reading:
- [ ] Efficient Memory Management for LLM Serving with PagedAttention — Kwon et al. (2 hrs)
- [ ] Orca: A Distributed Serving System for Transformer-Based LLMs (1.5 hrs)
- [ ] Continuous Batching — 23× LLM Throughput — Anyscale (0.5 hrs)
- [ ] vLLM docs — quickstart + architecture (1 hr)
- [ ] SGLang (1.5 hrs)

Hands-on:
- [ ] Serve LLaMA-3-8B with vLLM. Sweep batch size 1→64. Record throughput and TTFT. At what batch size does the system transition from memory-bound to compute-bound? (3 hrs)
Milestone (Weeks 5–6): For a 7B model on one A100 (80GB), walk through: (1) VRAM available for KV cache after loading int4 weights, (2) concurrent sequences that fit, (3) why continuous batching improves utilization over static batching.
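One way to sanity-check steps (1) and (2) of the milestone, with explicitly assumed shapes (32 layers, 32 KV heads, d_head=128, 2048-token sequences, and a 10 GB runtime reserve; all illustrative, not measured):

```python
# Rough KV-cache budget for a 7B LLaMA-style model on an 80 GB A100.
GB = 1024**3
vram = 80 * GB
weights_int4 = 7e9 * 0.5        # ~0.5 bytes/param at int4
reserve = 10 * GB               # runtime, activations, fragmentation slack (assumed)
kv_budget = vram - weights_int4 - reserve

def kv_bytes_per_token(n_layers=32, n_heads=32, d_head=128, dtype_bytes=2):
    return 2 * n_layers * n_heads * d_head * dtype_bytes   # K and V

per_seq = kv_bytes_per_token() * 2048     # one full 2048-token sequence
concurrent = int(kv_budget // per_seq)    # sequences that fit simultaneously
```

Step (3) follows from this: with static batching, slots sized for the longest sequence sit idle while short requests finish; continuous batching admits new sequences the moment any slot frees, keeping the KV budget (and the GPU) occupied.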
Week 7 — Data Parallelism, Tensor Parallelism, Pipeline Parallelism
Concepts to understand:
- [ ] Data parallelism (DDP): gradient all-reduce, overlap with backward pass
- [ ] Tensor parallelism: column-parallel and row-parallel linear layers
- [ ] Pipeline parallelism: stage assignment, pipeline bubbles, 1F1B schedule
- [ ] Micro-batching to fill the pipeline
- [ ] 3D parallelism: how TP + PP + DP compose
- [ ] Communication primitives: all-reduce, all-gather, reduce-scatter

Reading:
- [ ] Megatron-LM — Shoeybi et al. (1.5 hrs)
- [ ] Efficient Large-Scale Language Model Training on GPU Clusters — Narayanan et al. (2.5 hrs)
- [ ] Everything About Distributed Training and Efficient Finetuning (2 hrs)
- [ ] JAX Scaling Book Ch. 4–5 (2 hrs)
Hands-on:
- [ ] Using torch.distributed, run DDP on 2 GPUs. Profile all-reduce communication time vs. total step time at varying batch sizes. (2 hrs)
Week 8 — ZeRO, FSDP, Mixed Precision, Gradient Checkpointing
Concepts to understand:
- [ ] ZeRO stages 1/2/3: what each stage shards
- [ ] ZeRO-3 / FSDP communication: all-gather before forward, reduce-scatter after backward
- [ ] Mixed precision: fp16/bf16 forward + fp32 master weights, loss scaling
- [ ] Why bf16 is preferred over fp16 for training
- [ ] Gradient checkpointing: recomputing activations vs storing them

Reading:
- [ ] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (2 hrs)
- [ ] PyTorch FSDP2 blog (1 hr)
- [ ] Mixed Precision Training (1.5 hrs)
- [ ] NVIDIA Mixed Precision Training Guide (1 hr)
- [ ] Reducing Activation Recomputation in Large Transformer Models (1 hr)

Hands-on:
- [ ] Train a 1B-parameter toy transformer on 2 GPUs. Record peak memory under: (a) DDP fp32, (b) DDP bf16, (c) FSDP ZeRO-3 bf16. (3 hrs)
Milestone (Weeks 7–8): For a 13B model on 8 GPUs with ZeRO-3: calculate per-GPU memory for parameters, gradients, and optimizer states. Estimate the all-gather + reduce-scatter communication volume per step vs. DDP baseline.
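The model-state half of the milestone reduces to one multiplication; a sketch using the standard mixed-precision Adam breakdown (2 bytes bf16 weights + 2 bytes bf16 grads + 12 bytes fp32 master weights and two moments):

```python
# Per-GPU model-state memory, 13B params on 8 GPUs: DDP replicates all
# 16 bytes/param on every rank; ZeRO-3 shards params, grads, and optimizer
# states, dividing the full 16 bytes/param by the world size.
params, gpus = 13e9, 8
bytes_per_param = 2 + 2 + 12      # bf16 weights + bf16 grads + fp32 Adam states
ddp_gb = params * bytes_per_param / 1e9
zero3_gb = params * bytes_per_param / gpus / 1e9
```

Activations and communication buffers come on top. For the communication comparison, a commonly quoted estimate is ~3× parameter bytes per step for ZeRO-3 (two parameter all-gathers plus one gradient reduce-scatter) vs. ~2× for DDP's ring all-reduce of gradients, i.e. roughly a 1.5× communication overhead in exchange for the 8× memory reduction.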
Week 9 — Triton Foundations
Concepts to understand:
- [ ] What Triton is: a Python DSL compiling to PTX, sitting between CUDA C and PyTorch ops
- [ ] tl.program_id, tl.load, tl.store: the core primitives
- [ ] Block tiling in Triton: how BLOCK_SIZE maps to SMEM usage and occupancy
- [ ] Masking: handling tensor edges when shape isn’t a multiple of BLOCK_SIZE
- [ ] tl.dot: Triton’s matmul primitive
- [ ] Autotuning: @triton.autotune decorator and config search
- [ ] Fusion: why fusing elementwise ops into a matmul saves memory bandwidth
Reading:
- [ ] Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (1.5 hrs)
- [ ] Official Triton tutorials: vector add → fused softmax → matmul (4 hrs)
- [ ] Unleashing the Power of Triton (1 hr)
Hands-on:
- [ ] Implement fused SiLU (x * sigmoid(x)) as a Triton kernel. Benchmark against PyTorch built-in and two-op naive version. Report bandwidth utilization. (3.5 hrs)
Week 10 — torch.compile & Inductor
Concepts to understand:
- [ ] torch.compile pipeline: Dynamo → Inductor → Triton kernels
- [ ] Graph capture modes: reduce-overhead vs max-autotune
- [ ] Fusion in Inductor: which op patterns get fused automatically
- [ ] Graph breaks: what causes them and how to diagnose with TORCH_COMPILE_DEBUG=1
- [ ] When torch.compile is and isn’t worth it (dynamic shapes, small models)
- [ ] How FlashAttention-3 uses Triton + Hopper TMA/WGMMA for ~75% peak FLOP/s
Reading:
- [ ] PyTorch 2 (1.5 hrs)
- [ ] torch.compile tutorial (1.5 hrs)
- [ ] FlashAttention-3 blog (1.5 hrs)
- [ ] Triton blocked matmul tutorial (2 hrs)
Hands-on:
- [ ] Apply torch.compile to a full transformer forward pass. Use TORCH_COMPILE_DEBUG=1 to inspect emitted Triton kernels. Identify which ops fused and which caused graph breaks. (3.5 hrs)
Milestone (Weeks 9–10): Write a fused layernorm kernel in Triton that computes mean, variance, and normalization in a single pass. Compare throughput and bandwidth to torch.nn.LayerNorm. Explain why a single-pass implementation is faster.
Week 11 — Vision-Language Model Serving
Concepts to understand:
- [ ] VLM anatomy: vision encoder (ViT) + projector + LLM backbone
- [ ] How image tokens are injected as “virtual” prefill tokens
- [ ] Resolution scaling: how image size determines token count
- [ ] Two serving regimes: encoder-heavy vs. LLM-decode-heavy
- [ ] Cross-attention injection (Flamingo) vs. token merge/concatenation (LLaVA)
- [ ] KV cache implications: image tokens inflate effective context length

Reading:
- [ ] CLIP (1.5 hrs)
- [ ] LLaVA: Visual Instruction Tuning (1.5 hrs)
- [ ] LLaVA-1.5 (1 hr)
- [ ] Flamingo §3 (1 hr)
- [ ] vLLM multimodal docs (1 hr)

Hands-on:
- [ ] Serve LLaVA-1.5-7B with vLLM. Measure: (a) vision encoder latency vs. LLM prefill latency, (b) how doubling image resolution changes latency and KV cache size. (4 hrs)
Milestone: A 336×336 image produces 576 image tokens. For LLaVA-1.5-7B serving 16 concurrent users (one image + 64-token question each), compute effective prefill token count and compare KV cache memory to a text-only baseline.
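The milestone arithmetic, using the 576-token figure from the prompt:

```python
# 16 concurrent LLaVA-1.5 users, each sending one 336x336 image (576 image
# tokens after the projector) plus a 64-token text question.
users, image_tokens, text_tokens = 16, 576, 64
total_prefill = users * (image_tokens + text_tokens)   # effective prefill tokens
text_only_prefill = users * text_tokens                # baseline without images
kv_inflation = total_prefill / text_only_prefill       # KV cache grows by this factor
```

Since KV cache memory scales linearly with token count, the image tokens inflate both the cache footprint and prefill compute 10× over the text-only baseline, before a single answer token is generated.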
Week 12 — Part I Capstone
Choose one track:
- [ ] Track A — LLM serving optimization: Serve a 7B model on one GPU. Apply quantization, continuous batching, and speculative decoding sequentially. Document measured impact of each.
- [ ] Track B — Custom attention kernel: Implement simplified FlashAttention forward in Triton. Benchmark vs. PyTorch SDPA across seqlen 512→8192. Produce a roofline plot.
- [ ] Track C — Distributed training audit: Train a 1B model on 4 GPUs. Compare DDP, FSDP ZeRO-2, FSDP ZeRO-3. Identify the dominant communication bottleneck.

Deliverable checklist:
- [ ] Baseline profile (PyTorch profiler or Nsight output)
- [ ] Each optimization applied with before/after measurements
- [ ] Written explanation of why each change helped (roofline reasoning)
- [ ] One thing you’d do next with more time or hardware
Part II — Hardware Mastery
Goal: Go from Triton user to someone who can reason about hardware ISA-level behavior, write CUTLASS kernels, profile with Nsight Compute, and understand the full interconnect stack.
Week 13 — Advanced CUDA: Warp Primitives & Tensor Cores
Concepts to understand:
- [ ] Warp shuffle functions: __shfl_sync, __shfl_down_sync, __shfl_xor_sync — register-to-register exchange within a warp without SMEM
- [ ] Warp vote functions: __ballot_sync, __all_sync, __any_sync
- [ ] Warp-level reduce patterns using shuffles
- [ ] Tensor Cores via WMMA API: nvcuda::wmma — fragment types, load_matrix_sync, mma_sync, store_matrix_sync
- [ ] Supported WMMA shapes (16×16×16, 8×32×16) and data types (FP16, BF16, TF32, INT8, FP8)
- [ ] PTX mma.sync for finer control below the WMMA abstraction
Reading:
- [ ] NVIDIA Tensor Core Programming — Lei Mao’s Log Book (3 hrs)
- [ ] Programming Tensor Cores in CUDA 9 (1.5 hrs)
- [ ] Warp Shuffle and Warp Vote Instructions — CSE 599I Slides (1.5 hrs)
- [ ] GPU MODE Lectures 14–20 (3 hrs)

Hands-on:
- [ ] Implement a fused LayerNorm kernel using warp shuffles for the reduction. Benchmark bandwidth against the naive two-pass version. (3 hrs)
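Before writing the CUDA version, the shuffle-based reduction pattern can be simulated lane-by-lane. This Python sketch models the __shfl_down_sync tree-reduction idiom for a sum (the real kernel operates on registers in CUDA C; this is only a behavioral model):

```python
# Tree reduction across a 32-lane warp: at each offset d, lane i adds the
# value held by lane i+d. After log2(32) = 5 steps, lane 0 holds the sum.
def warp_reduce_sum(lanes):
    assert len(lanes) == 32
    vals = list(lanes)
    offset = 16
    while offset > 0:
        # every lane reads before any lane writes, matching the _sync semantics;
        # out-of-range lanes contribute 0, which is harmless for a sum
        read = [vals[i + offset] if i + offset < 32 else 0 for i in range(32)]
        vals = [v + r for v, r in zip(vals, read)]
        offset //= 2
    return vals[0]

warp_sum = warp_reduce_sum(list(range(32)))   # lanes hold 0..31 -> 496
```

The design point: five register-to-register exchanges replace a shared-memory round trip, which is why shuffle reductions beat SMEM-based ones for intra-warp work.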
Week 14 — CUDA Concurrency: Streams, Graphs, Async Copy
Concepts to understand:
- [ ] cp.async (Ampere+): asynchronous SMEM copy from global memory; commit/wait groups; double-buffering pattern to overlap copy and compute
- [ ] CUDA streams: stream semantics, default vs. non-default streams, cudaEventRecord/cudaEventSynchronize
- [ ] Overlapping H2D copy, kernel, and D2H copy across streams
- [ ] CUDA Graphs: graph capture, instantiation, launch; constant-time ~2.5µs launch overhead; when graphs pay off (many small kernels, inference serving)
- [ ] Cooperative Groups: thread-block groups, multi-block cooperative kernels, grid-wide synchronization
Reading:
- [ ] Controlling Data Movement on Ampere (cp.async) (2 hrs)
- [ ] Getting Started with CUDA Graphs (1.5 hrs)
- [ ] Cooperative Groups (2 hrs)
- [ ] CUDA Streams and Synchronization (1 hr)
Hands-on:
- [ ] Add cp.async double-buffering to your LayerNorm kernel from Week 13. Measure the memory bandwidth improvement vs. the synchronous version. (3.5 hrs)
Milestone (Weeks 13–14): Implement a fused LayerNorm with warp-shuffle reduction and double-buffered cp.async pipelining. Profile in Nsight Compute and identify: (a) achieved bandwidth as % of theoretical, (b) dominant stall reason.
Week 15 — CUTLASS & CuTe
Concepts to understand:
- [ ] What CUTLASS is: composable CUDA templates for GEMM and convolution; CUTLASS 2.x vs. 3.x (CuTe-based)
- [ ] CuTe layout algebra: Layout = Shape × Stride; composable tiling; how CuTe eliminates hand-coded index arithmetic
- [ ] CUTLASS 3.x GEMM pipeline: CollectiveMma, CollectiveEpilogue, GemmUniversal
- [ ] Warp specialization: producer warps (TMA-issued loads) vs. consumer warpgroups (WGMMA); PipelineAsync barriers; Ping-Pong scheduling
- [ ] CUTLASS vs. Triton: when to use each; most production kernels in vLLM/FlashAttention-3 are CUTLASS-based
Reading:
- [ ] CUTLASS: Fast Linear Algebra in CUDA C++ (2 hrs)
- [ ] CUTLASS: Principled Abstractions (CuTe overview) (2 hrs)
- [ ] Deep Dive on CUTLASS Ping-Pong GEMM (3 hrs)
- [ ] CutlassAcademy — curated tutorials (3 hrs)
Hands-on:
- [ ] Write a CUTLASS 3.x GEMM using GemmUniversal with a custom epilogue (fused bias + ReLU). Profile against cuBLAS and a hand-rolled Triton kernel. (3 hrs)
Week 16 — Hopper Architecture: TMA & WGMMA
Concepts to understand:
- [ ] Tensor Memory Accelerator (TMA): hardware DMA engine that accepts 5-D tensor descriptors; single-thread issue in producer warp; eliminates per-thread address calculation
- [ ] wgmma.mma_async: asynchronous warpgroup-level (128-thread) MMA; operand B must be in SMEM; requires wgmma.fence / wgmma.commit_group
- [ ] Thread-block clusters: groups of up to 16 CTAs on adjacent SMs sharing Distributed Shared Memory (DSMEM)
- [ ] Persistent kernel patterns: one CTA per SM, producer warpgroup feeds TMA, consumer warpgroup executes WGMMA
- [ ] Ping-pong warp specialization: alternating consumer warpgroups on back-to-back tiles to hide softmax latency
Reading:
- [ ] Benchmarking and Dissecting the NVIDIA Hopper GPU Architecture (4 hrs)
- [ ] Dissecting Hopper via Microbenchmarking (3 hrs)
- [ ] CUTLASS Tutorial: WGMMA on Hopper — Colfax Research (3 hrs)
Hands-on:
- [ ] Implement a double-buffered GEMM on H100 using TMA + WGMMA (without CUTLASS). Read the generated PTX/SASS in Nsight Compute and verify cp.async.bulk and wgmma.mma_async appear in the hot loop. (3 hrs)
Week 17 — FlashAttention-3 as Hopper Case Study
Concepts to understand:
- [ ] How FA3 combines: TMA-pipelined Q/K/V loads, ping-pong WGMMA between consumer warpgroups, and FP8 block quantization
- [ ] Why FA3 achieves ~740 TFLOP/s (~75% SOL) on H100 while FA2 achieves ~35%
- [ ] Inside NVIDIA GPUs: anatomy of a high-performance matmul kernel
- [ ] Blackwell (B200) preview: UMMA, FP4 Tensor Cores, 5th-gen NVLink (1.8 TB/s/GPU)

Reading:
- [ ] FlashAttention-3 (4 hrs)
- [ ] Anatomy of High-Performance Matmul Kernels — Aleksa Gordić (4 hrs)
- [ ] Developing CUDA Kernels for Hopper — Colfax PDF (3 hrs)

Hands-on:
- [ ] Read the FlashAttention-3 Triton reference implementation. Annotate each section: which Hopper feature is it using, and what is the expected performance impact? (3 hrs)
Milestone (Weeks 15–17): Write a fused FP8 attention kernel in CUTLASS 3.x with TMA + WGMMA + Ping-Pong scheduling. Profile vs. a Triton FlashAttention-2 implementation. Explain the performance delta using roofline analysis.
Week 18 — Profiling Methodology: Nsight Compute & Nsight Systems
Concepts to understand:
- [ ] Nsight Compute workflow: capturing a kernel profile with ncu --set full
- [ ] Speed-of-Light (SOL): reading SM throughput vs. memory throughput as % of theoretical peak
- [ ] Memory Workload Analysis: L1/L2 hit rates, global load efficiency, bank conflicts
- [ ] Warp State Statistics: stall reasons (long scoreboard, memory dependency, no instruction, MIO throttle)
- [ ] Scheduler Statistics: issued IPC vs. theoretical IPC; interpreting occupancy
- [ ] Nsight Systems timeline: CPU-GPU synchronization bubbles, kernel gaps, multi-stream overlap
- [ ] Bottleneck taxonomy: memory-bound (bandwidth wall), compute-bound (FLOP wall), latency-bound (small tiles), launch-overhead-bound
Reading:
- [ ] Nsight Compute Profiling Guide (5 hrs)
- [ ] Accelerating HPC with Nsight Compute Roofline Analysis (2 hrs)
- [ ] Nsight Systems User Guide (3 hrs)

Hands-on:
- [ ] Take your WGMMA GEMM from Week 16. Do a full Nsight Compute profile: find the dominant stall reason, interpret the roofline position, implement one optimization, re-profile to verify improvement. (3 hrs)
Week 19 — Alternative Accelerators
Concepts to understand:
- [ ] TPU architecture: systolic array, HBM bandwidth, inter-chip interconnect (ICI) mesh
- [ ] XLA compilation: HLO IR, fusion passes, layout assignment, SPMD partitioning
- [ ] PyTorch/XLA: lazy tensor execution, mark_step(), SPMD mesh partitioning
- [ ] AMD ROCm/HIP: near-identical to CUDA; wavefront size = 64 on CDNA; hipify for porting
- [ ] CDNA (MI300X): 192 GB unified HBM3, no separate VRAM boundary
- [ ] Groq LPU, Cerebras WSE-3, Gaudi: architectural tradeoffs at a conceptual level
Reading:
- [ ] TPU Deep Dive — Henry Hmko (2 hrs)
- [ ] PyTorch/XLA Overview (2 hrs)
- [ ] HIP Programming Model — AMD ROCm Docs (2 hrs)
- [ ] AI Accelerators Beyond GPUs (1.5 hrs)
Hands-on:
- [ ] Port a Triton matmul kernel to HIP using hipify. Run on ROCm (cloud MI300X instance or ROCm Docker). Document any wavefront-size or memory-layout differences. (2.5 hrs)
Week 20 — Interconnects, NCCL, & Part II Capstone
Concepts to understand:
- [ ] NVLink generations: NVLink 4.0 (H100, 900 GB/s bidirectional), NVSwitch fabric (all-to-all within DGX)
- [ ] InfiniBand: NDR/XDR generations, fat-tree and dragonfly topologies, IBTA architecture
- [ ] RDMA basics: memory registration, QP model, one-sided vs. two-sided operations
- [ ] GPUDirect RDMA: NIC reads/writes GPU HBM directly over PCIe — eliminates one bounce copy
- [ ] NCCL ring-AllReduce algorithm and bandwidth-latency tradeoff
- [ ] Tree-AllReduce for small messages; hierarchical collectives (intra-node NVLink + inter-node IB)
- [ ] NVSHMEM: device-initiated one-sided PUT/GET from within CUDA kernels

Reading:
- [ ] Scaling Deep Learning Training with NCCL (2 hrs)
- [ ] Demystifying NCCL (3 hrs)
- [ ] InfiniBand vs. RoCE — Juniper White Paper (2 hrs)
- [ ] Inside Multi-Node Training — Together.ai (1.5 hrs)
Hands-on:
- [ ] Write a custom AllReduce using NCCL primitives (ReduceScatter + AllGather). Benchmark against dist.all_reduce at varying message sizes. Plot effective bus bandwidth vs. message size. (3 hrs)
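The ring algorithm's cost model explains the bandwidth curve you should see. A sketch of the standard 2(N-1)/N factor, which (as I understand it) is also how nccl-tests converts measured time into "bus bandwidth":

```python
# Ring AllReduce on N GPUs: N-1 reduce-scatter steps plus N-1 all-gather
# steps, each moving M/N bytes per GPU -> 2*(N-1)/N * M bytes sent per GPU.
def ring_bytes_per_gpu(msg_bytes, n):
    return 2 * (n - 1) * msg_bytes / n

def bus_bandwidth(msg_bytes, n, elapsed_s):
    # nccl-tests-style normalization: algorithm bandwidth (M / t) rescaled
    # by 2*(N-1)/N so results are comparable across different GPU counts.
    return (msg_bytes / elapsed_s) * 2 * (n - 1) / n

sent_8 = ring_bytes_per_gpu(1e9, 8)   # 1 GB message on 8 GPUs -> 1.75 GB sent/GPU
```

Because per-GPU traffic approaches 2M as N grows, ring AllReduce is bandwidth-optimal for large messages, while its 2(N-1) serialized steps explain why latency-dominated small messages prefer tree algorithms.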
Milestone (Part II): End-to-end kernel engineering capstone. Choose one: (a) fused FP8 attention with TMA + WGMMA, (b) INT8/FP8 GEMM with per-block dequantization epilogue vs. cuBLAS, or (c) roofline-guided optimization of an underperforming open-source kernel with 3 distinct improvements verified in Nsight Compute.
Part III — Training at Scale
Goal: Extend beyond the ZeRO/Megatron basics to cover sequence parallelism, MoE training, data infrastructure, fault tolerance, efficient finetuning, and RLHF systems.
Week 21 — Sequence Parallelism & Context Parallelism
Concepts to understand:
- [ ] Why standard tensor parallelism doesn’t help with O(N²) attention memory at long contexts
- [ ] Megatron-LM sequence parallelism: sharding LayerNorm/Dropout along the sequence dimension, using reduce-scatter/all-gather pairs instead of all-reduce
- [ ] Ring Attention: blockwise attention with KV blocks passed ring-wise; communication overlaps with local attention computation
- [ ] DeepSpeed Ulysses: all-to-all before QKV projection so each device attends to full sequence but a subset of heads; communication stays constant as sequence length and device count scale proportionally
- [ ] Combining sequence parallelism with TP+PP (4D parallelism)
- [ ] Context parallelism in Megatron-Core and TorchTitan

Reading:
- [ ] Ring Attention with Blockwise Transformers (3 hrs)
- [ ] DeepSpeed Ulysses (3 hrs)
- [ ] TorchTitan — docs + architecture (3 hrs)
Hands-on:
- [ ] Implement Ring Attention for a toy transformer using torch.distributed. Measure effective sequence length per GPU vs. standard DP at equal memory budget. (3 hrs)
Week 22 — MoE Training: Architecture & Routing
Concepts to understand:
- [ ] Sparse vs. dense compute: MoE activates K of N experts per token — decouples parameter count from FLOPs
- [ ] Top-K routing, gating network, load balancing auxiliary loss
- [ ] Switch Transformer: top-1 routing, capacity factor, token dropping under overflow
- [ ] GShard: expert parallelism via XLA annotations across 2048 TPUs
- [ ] Expert Choice routing: experts select tokens (inverted routing)
- [ ] Token dropping vs. capacity buffer padding tradeoffs
- [ ] MoE compute efficiency: 4× more compute-efficient than dense at low training budgets

Reading:
- [ ] Switch Transformers (4 hrs)
- [ ] GShard (3 hrs)
- [ ] Mixtral of Experts (2 hrs)
- [ ] Cameron Wolfe: MoE LLMs (2 hrs)

Hands-on:
- [ ] Implement a sparse MoE FFN layer in PyTorch with top-2 routing and load balancing loss. Train on a toy task and measure expert utilization over time. (3 hrs)
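The routing logic is small enough to prototype before the full PyTorch layer. A NumPy sketch of top-2 routing plus a Switch-style auxiliary loss (the shapes and the 64-token batch are arbitrary; the hands-on version needs torch tensors with gradients):

```python
import numpy as np

def top2_route(logits):
    # logits: (tokens, experts). Returns the two chosen expert ids per token
    # (top-1 first) and their softmax-renormalized gate weights.
    top2 = np.argsort(logits, axis=-1)[:, -2:][:, ::-1]
    picked = np.take_along_axis(logits, top2, axis=-1)
    gates = np.exp(picked - picked.max(-1, keepdims=True))
    gates = gates / gates.sum(-1, keepdims=True)
    return top2, gates

def load_balance_loss(logits, top1_ids, n_experts):
    # Switch-style auxiliary loss: n_experts * sum_i f_i * P_i, where f_i is
    # the fraction of tokens routed to expert i and P_i the mean router prob.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs = probs / probs.sum(-1, keepdims=True)
    f = np.bincount(top1_ids, minlength=n_experts) / len(top1_ids)
    return n_experts * float(f @ probs.mean(0))

logits = np.random.default_rng(0).normal(size=(64, 8))
experts, gates = top2_route(logits)
aux = load_balance_loss(logits, experts[:, 0], 8)
```

The auxiliary loss equals 1.0 under perfectly uniform routing and grows as routing concentrates, which is exactly the signal the expert-utilization plot in the hands-on should track.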
Week 23 — MoE Training: Systems & Communication
Concepts to understand:
- [ ] Expert parallelism: sharding experts across devices, all-to-all communication for token dispatch and gather
- [ ] Why all-to-all is latency-sensitive at small batch sizes — the key serving challenge
- [ ] DeepSeek-MoE: fine-grained expert decomposition (mN sub-experts, mK active); shared expert mechanism
- [ ] Megablocks: block-sparse matrix formulation eliminating token dropping; 40% faster than Tutel
- [ ] Interaction between expert parallelism and TP/PP
- [ ] Expert collapse: diagnosing and preventing degenerate routing

Reading:
- [ ] MegaBlocks: Efficient Sparse Training with MoE (3 hrs)
- [ ] DeepSeekMoE (3 hrs)
- [ ] Survey on MoE Inference Optimization — training sections (3 hrs)

Hands-on:
- [ ] Replace naïve token-padded expert dispatch in your MoE from Week 22 with a block-sparse implementation. Measure GPU utilization improvement. (3 hrs)
Milestone (Weeks 22–23): For a 47B MoE model (Mixtral-style, 8 experts, top-2), compute: (a) active parameters per token, (b) VRAM per GPU with EP=8, (c) all-to-all communication volume per forward pass. Compare to a 13B dense model with equivalent per-token FLOPs.
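A back-of-envelope version of the milestone's part (a), using rough public Mixtral-style figures (the 46.7B total and especially the shared-parameter estimate are approximations for illustration, not numbers from the paper):

```python
# Mixtral-style MoE: 8 experts per layer, top-2 routing. Only the shared
# parameters (attention, embeddings, router) plus 2 of the 8 expert FFNs
# are touched for any given token.
total_params, n_experts, top_k = 46.7e9, 8, 2
shared_params = 1.3e9                                    # rough assumption
expert_params = (total_params - shared_params) / n_experts
active_params = shared_params + top_k * expert_params    # ~12.7B per token
```

This lands near the ~13B active parameters commonly quoted for Mixtral, which is why its per-token FLOPs compare to a 13B dense model while its weight memory does not.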
Week 24 — Data Pipelines for Pretraining
Concepts to understand:
- [ ] The data loading bottleneck: CPU throughput vs. GPU compute; profiling with DataLoader workers and async prefetch
- [ ] WebDataset: tar-based sharded format, sequential streaming without random access
- [ ] MosaicML StreamingDataset (MDS): deterministic ordering regardless of GPU count, mid-epoch resumption, multi-cloud support
- [ ] NVIDIA DALI: GPU-accelerated preprocessing pipeline; eliminates CPU bottleneck for image/video/audio
- [ ] Tokenization at scale: pre-tokenizing and caching, memory-mapped numpy arrays for zero-copy loading
- [ ] Dataset mixing and weighting: sampling proportions, upsampling high-quality data
- [ ] Data quality filtering: MinHash LSH deduplication, perplexity filtering, rule-based heuristics, toxic content filtering

Reading:
- [ ] MosaicML StreamingDataset — docs (3 hrs)
- [ ] NVIDIA DALI Documentation (3 hrs)
- [ ] Training Compute-Optimal Large Language Models (Chinchilla) (3 hrs)
- [ ] LLaMA — data pipeline section (2 hrs)

Hands-on:
- [ ] Build a streaming data pipeline using MosaicML StreamingDataset. Profile data loading throughput (samples/sec) and measure how many DataLoader workers are needed to saturate a single GPU. (3 hrs)
Week 25 — Fault Tolerance & Large-Scale Reliability
Concepts to understand:
- [ ] The scale problem: at 1000 GPUs, MTBF is hours not days
- [ ] Full checkpointing vs. sharded checkpointing (PyTorch DCP): save/load cost comparison
- [ ] Async checkpointing: save to host CPU memory in background while training continues
- [ ] NCCL error handling: NCCL_ASYNC_ERROR_HANDLING, timeout detection, error propagation to Python
- [ ] Elastic training: torch.distributed.elastic (torchrun), job preemption and resumption
- [ ] SLURM integration: --signal flag, SIGUSR1 for preemption-aware checkpointing
- [ ] Monitoring: per-GPU throughput tracking, NaN/inf gradient detection, loss anomaly detection
Reading:
- [ ] BLOOM Training Chronicle (3 hrs)
- [ ] OPT-175B Training Logbook (3 hrs)
- [ ] TorchElastic / torchrun documentation (2 hrs)
- [ ] PyTorch Distributed Checkpoint (DCP) documentation (2 hrs)
Hands-on:
- [ ] Implement async sharded checkpointing for a distributed training job. Simulate a mid-training failure. Measure recovery time vs. full checkpoint. (3 hrs)
Week 26 — Efficient Finetuning: LoRA, QLoRA, PEFT
Concepts to understand:
- [ ] Full finetuning memory breakdown: 16 bytes/param total (weights + gradients + Adam states)
- [ ] LoRA: injecting a trainable low-rank update ΔW = BA into each attention projection; rank r reduces trainable params by up to 10,000×; weights merge at inference — no latency cost
- [ ] QLoRA: NF4 quantization of frozen base + double quantization + paged optimizers; enables 65B finetuning on a single 48GB GPU
- [ ] PEFT library internals: get_peft_model(), merge_and_unload(), distributed training with FSDP + PEFT
- [ ] IA3: learns three scaling vectors rather than low-rank matrices — even fewer trainable params than LoRA
Reading:
- [ ] LoRA: Low-Rank Adaptation of Large Language Models (3 hrs)
- [ ] QLoRA: Efficient Finetuning of Quantized LLMs (3 hrs)
- [ ] Hugging Face PEFT Library (3 hrs)
- [ ] TRL smol-course (4 hrs)

Hands-on:
- [ ] Finetune LLaMA-3-8B using: (a) full finetuning with FSDP, (b) LoRA rank-8, (c) QLoRA 4-bit. Record peak memory, training throughput, and eval accuracy for each. (3 hrs)
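The memory gap between configurations (a) and (b) comes almost entirely from trainable-parameter count. A sketch with assumed LLaMA-ish shapes (d_model=4096, 32 layers, adapting q_proj and v_proj, which mirrors the original LoRA paper's setup):

```python
# LoRA adds B (d x r) and A (r x d) per adapted matrix: 2*d*r trainable params.
def lora_trainable(d_model, rank, n_adapted_matrices):
    return 2 * d_model * rank * n_adapted_matrices

trainable = lora_trainable(4096, 8, 2 * 32)   # q_proj + v_proj in 32 layers
fraction = trainable / 7e9                    # vs. full finetuning of a 7B model
```

Optimizer-state savings follow directly: Adam keeps ~12 bytes/param of fp32 state only for trainable parameters, so LoRA's optimizer memory shrinks by the same ~1700× factor as the trainable-parameter count.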
Milestone (Weeks 24–26): Build a complete finetuning pipeline: streaming data loading → QLoRA training with async checkpointing → PEFT weight merge → evaluation. Document the memory and throughput profile at each stage.
Week 27 — RLHF Systems: SFT, Reward Models & PPO
Concepts to understand:
- [ ] Why RLHF changes the training topology: four model instances coexist (Actor, Critic, Reward Model, Reference Model)
- [ ] Reference model and KL penalty: frozen initial policy produces per-token log-probs used in KL term r_θ - λ * KL(π || π_ref)
- [ ] The rollout bottleneck: generating on-policy samples is ~80% of wall-clock time in PPO
- [ ] OpenRLHF architecture: Ray orchestrates model groups; vLLM handles generation; DeepSpeed ZeRO-3 handles training
- [ ] SFT infrastructure: sequence packing, efficient attention masking for packed sequences
- [ ] Reward model training: Bradley-Terry preference model, ranking loss, process reward models (PRMs)
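The Bradley-Terry ranking loss above reduces to a one-liner on reward margins; a minimal numeric sketch:

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise ranking loss: -log sigmoid(r_w - r_l).
    Driving the margin up pushes the loss toward zero."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(bt_loss(2.0, 0.0), 3))  # 0.127 — correct ordering, small loss
print(round(bt_loss(0.0, 2.0), 3))  # 2.127 — inverted ordering, large loss
```

A reward model trained with this loss only learns relative orderings, which is why reward scales drift between runs and must be normalized before PPO.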
Reading: - [ ] Illustrating RLHF — Hugging Face (1.5 hrs) - [ ] OpenRLHF — docs + architecture (5 hrs) - [ ] TRL Documentation (4 hrs)
Hands-on: - [ ] Run a full PPO training loop on a 7B model using OpenRLHF. Profile wall-clock time split between rollout generation and model updates. Measure how vLLM integration affects total throughput. (3 hrs)
Week 28 — RLHF Algorithm Variants: DPO, GRPO & Beyond
Concepts to understand:
- [ ] DPO: reformulates RLHF as a classification loss; no reward model or RL loop at training time; the implicit reward is β·log(π_θ(y|x)/π_ref(y|x)), so the loss margin is β[log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))]
- [ ] GRPO: eliminates critic network; normalizes rewards within a group of sampled outputs; reduces memory by removing the value network
- [ ] RLOO: leave-one-out baseline within a group of K samples; simpler than GRPO
- [ ] KTO: aligns using binary feedback (thumbs up/down); avoids need for paired data
- [ ] ORPO: combines SFT + preference loss in a single stage; no reference model needed
- [ ] Online vs. offline: DPO/KTO are offline; PPO/GRPO/RLOO are online (generate on-policy rollouts every step)
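The DPO loss in the first bullet can be written directly from per-response log-probs; a minimal sketch (β and the log-prob values are illustrative):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO objective for one preference pair:
    -log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)]).
    The bracket is the implicit-reward margin between chosen (w)
    and rejected (l) responses."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already ranks the chosen response higher than the reference
# does -> lower loss than the inverted case:
print(dpo_loss(-10.0, -14.0, -12.0, -12.0) <
      dpo_loss(-14.0, -10.0, -12.0, -12.0))  # True
```

Note the systems payoff: the forward passes are the same as supervised training (policy + frozen reference), with no rollout generation — which is exactly why DPO is so much cheaper than PPO per step.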
Reading: - [ ] Direct Preference Optimization — Rafailov et al. (3 hrs) - [ ] DeepSeekMath — GRPO derivation (3 hrs) - [ ] Putting RL back in RLHF — Hugging Face (1.5 hrs)
Hands-on: - [ ] Train LLaMA-3-8B with DPO on a preference dataset. Compare reward margin and eval performance against the PPO run from Week 27. Measure training memory and time per step. (3 hrs)
Milestone (Part III): Full RLHF pipeline: streaming data loading → QLoRA SFT → DPO training with async checkpointing → evaluation. Document GPU memory, throughput, and reward statistics at each stage.
Part IV — Advanced Inference
Goal: Go deep on long-context serving, next-generation quantization, model compression, advanced speculative decoding, disaggregated inference, and production-grade serving infrastructure.
Week 29 — Long-Context Inference: Position Encodings
Concepts to understand:
- [ ] RoPE mechanics: encoding relative position as a rotation in the complex plane via e^(imθ)
- [ ] Why naive RoPE extrapolation fails beyond training sequence length
- [ ] Position interpolation (PI): linearly scaling position indices to fit within training range
- [ ] NTK-aware interpolation: applying PI non-uniformly across frequency components
- [ ] YaRN: ramp function + temperature correction for improved long-context performance
- [ ] LongRoPE: evolutionary search for non-uniform, per-dimension rescaling factors
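RoPE's rotation view and position interpolation come down to a few lines of complex arithmetic; a sketch (head dim, base, and the 2× scale are the usual defaults, stated here as assumptions):

```python
import cmath

def rope_rotate(pair, pos, dim_idx, d=128, base=10000.0, scale=1.0):
    """Rotate one (even, odd) feature pair by pos * theta_i with
    theta_i = base^(-2i/d) — RoPE as multiplication by e^(i*m*theta).
    Position interpolation (PI) divides positions by `scale`, mapping
    an extended context back into the trained position range."""
    theta = base ** (-2.0 * dim_idx / d)
    z = complex(*pair) * cmath.exp(1j * (pos / scale) * theta)
    return (z.real, z.imag)

# With 2x interpolation, position 8192 lands exactly where position
# 4096 sat during training — no extrapolation beyond trained angles:
a = rope_rotate((1.0, 0.0), pos=8192, dim_idx=0, scale=2.0)
b = rope_rotate((1.0, 0.0), pos=4096, dim_idx=0, scale=1.0)
print(a == b)  # True
```

NTK-aware and YaRN variants differ only in making `scale` a function of `dim_idx` instead of a constant.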
Reading: - [ ] YaRN: Efficient Context Window Extension of Large Language Models (3 hrs) - [ ] EleutherAI blog: Extending the RoPE (1.5 hrs) - [ ] Ring Attention for inference — inference section (1.5 hrs)
Hands-on: - [ ] Extend LLaMA-3-8B to 32K context using YaRN. Measure perplexity at 4K, 8K, 16K, 32K and compare to a model without YaRN. (3 hrs)
Week 30 — Long-Context Inference: KV Cache Management
Concepts to understand:
- [ ] Chunked prefill: splitting long-prompt prefill into chunks to reduce head-of-line blocking
- [ ] Prefix caching: radix-tree KV reuse across requests sharing common prefixes
- [ ] H2O (Heavy-Hitter Oracle): evicting KV entries based on attention mass
- [ ] SnapKV: query-guided one-shot per-layer token selection before generation
- [ ] StreamingLLM: attention sinks + recency window for infinite-length generation with fixed VRAM
- [ ] SAGE-KV: one-shot token/head-level top-k eviction
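A toy version of heavy-hitter eviction clarifies the mechanism: accumulate attention mass per cached position, protect a recency window, and evict the lowest-scored rest. This simplifies H2O considerably (the real method works per head and updates scores online during decode):

```python
def h2o_evict(attn_history, budget, recent=2):
    """Toy H2O-style eviction: score each cached position by its
    accumulated attention mass over past decode steps, always keep
    the `recent` most recent positions, then add highest-scored
    positions until `budget` is reached. Returns kept indices."""
    n = len(attn_history[0])
    scores = [sum(step[i] for step in attn_history) for i in range(n)]
    keep = set(range(n - recent, n))            # recency window
    for i in sorted(range(n), key=lambda i: -scores[i]):
        if len(keep) >= budget:
            break
        keep.add(i)
    return sorted(keep)

# 3 decode steps over a 6-entry cache; position 1 is a heavy hitter.
hist = [[0.1, 0.6, 0.1, 0.1, 0.05, 0.05],
        [0.05, 0.7, 0.05, 0.1, 0.05, 0.05],
        [0.1, 0.5, 0.1, 0.1, 0.1, 0.1]]
print(h2o_evict(hist, budget=3))  # [1, 4, 5]: heavy hitter + recents
```

The observation behind H2O is that `scores` is extremely skewed in practice, so a small budget retains most of the attention mass.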
Reading: - [ ] SnapKV (2 hrs) - [ ] vLLM blog: Anatomy of a High-Throughput LLM Inference System (2 hrs) - [ ] StreamingLLM (2 hrs)
Hands-on: - [ ] Implement H2O KV eviction on top of a vLLM-served model. Measure perplexity degradation vs. memory savings at 50% cache budget. (3 hrs)
Milestone (Weeks 29–30): For a LLaMA-3-70B model on 4×A100 with tensor parallelism, compute: (a) baseline KV cache size at seqlen=32K, batch=8, (b) memory savings under 50% H2O eviction, (c) latency impact of chunked prefill at different chunk sizes.
Week 31 — Advanced Quantization: FP8, GGUF, and Extreme Low-Bit
Concepts to understand:
- [ ] FP8 floating-point formats: E4M3 (precision-oriented: 3 mantissa bits) vs. E5M2 (range-oriented: 5 exponent bits); scaling factors and saturation behavior
- [ ] FP8 inference on Hopper/Ada: NVIDIA TransformerEngine fp8_autocast API
- [ ] SmoothQuant: migrating quantization difficulty from activations to weights via per-channel scaling; joint W8A8
- [ ] GGUF format: K-quant families (Q2_K through Q6_K), block-scale encoding, CPU/GPU hybrid execution
- [ ] QuIP#: Hadamard incoherence processing + E8 lattice codebooks for 2-bit weights
- [ ] AQLM: additive quantization with learned codebooks; training cost vs. inference speed
- [ ] Mixed-precision quantization: per-layer bit selection via sensitivity analysis
- [ ] Calibration dataset design: representativeness, domain shift effects
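The E4M3/E5M2 tradeoff is just exponent-vs-mantissa bit allocation. A sketch deriving the max normal value and the step size near 1.0 for each format, following the OCP FP8 conventions (E4M3 reserves only the all-ones exponent+mantissa pattern for NaN; E5M2 is IEEE-style with the top exponent reserved for inf/NaN):

```python
def fp8_max_and_step(exp_bits, man_bits, ieee_reserved):
    """Return (max normal value, step size near 1.0) for an FP8
    format. ieee_reserved=True means the all-ones exponent is
    reserved for inf/NaN (E5M2); False means only exponent+mantissa
    all-ones is NaN (E4M3)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_reserved:                      # E5M2 style
        max_e = (2 ** exp_bits - 2) - bias
        max_m = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    else:                                  # E4M3 style
        max_e = (2 ** exp_bits - 1) - bias
        max_m = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    return (2 ** max_e) * max_m, 2.0 ** -man_bits

print(fp8_max_and_step(4, 3, ieee_reserved=False))  # (448.0, 0.125)
print(fp8_max_and_step(5, 2, ieee_reserved=True))   # (57344.0, 0.25)
```

E4M3 saturates at 448 but quantizes twice as finely near 1.0 — which is why TransformerEngine uses E4M3 for weights/activations and E5M2 for gradients, whose dynamic range is wider.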
Reading: - [ ] NVIDIA TransformerEngine FP8 Primer (2 hrs) - [ ] QuIP# (3 hrs) - [ ] AQLM (3 hrs) - [ ] Which Quantization Should I Use? (systematic comparison) (2 hrs)
Hands-on: - [ ] Benchmark LLaMA-3-8B in fp16, int8 (SmoothQuant), int4 (GPTQ), FP8 (TE), and GGUF Q4_K_M. Record throughput (tokens/sec), TTFT, and perplexity. (3 hrs)
Week 32 — Model Compression: Pruning & Distillation
Concepts to understand:
- [ ] Unstructured pruning: magnitude pruning, SparseGPT (Hessian-based layer-wise), Wanda (weight × activation magnitude criterion)
- [ ] Structured pruning: head pruning, layer pruning, width pruning; hardware-efficiency tradeoffs
- [ ] 2:4 structured sparsity: what NVIDIA sparse tensor cores require
- [ ] Knowledge distillation at scale: sequence-level vs. token-level losses, forward KL vs. reverse KL (MiniLLM)
- [ ] NVIDIA Minitron: structured pruning + short distillation fine-tune as production recipe
- [ ] Combined pipelines: prune → distill → quantize stacking
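The Wanda criterion is small enough to write out: score each weight by |W_ij| · ‖X_j‖ (magnitude times the input channel's activation norm over a calibration set) and prune lowest-per-row. A toy sketch:

```python
def wanda_scores(W, x_norms):
    """Wanda criterion: score_ij = |W_ij| * ||X_j||, weight magnitude
    scaled by the calibration-set L2 norm of its input channel."""
    return [[abs(w) * x_norms[j] for j, w in enumerate(row)]
            for row in W]

def prune_row(scores_row, sparsity=0.5):
    """Indices to zero out in one row at the given sparsity."""
    k = int(len(scores_row) * sparsity)
    order = sorted(range(len(scores_row)), key=lambda j: scores_row[j])
    return sorted(order[:k])

W = [[0.5, -0.1, 0.3, -0.4]]
x_norms = [0.2, 8.0, 1.0, 0.5]   # column 1 sees large activations
s = wanda_scores(W, x_norms)
# Magnitude alone would prune w=-0.1 first; Wanda keeps it because
# its input channel is hot, pruning columns 0 and 3 instead:
print(prune_row(s[0]))  # [0, 3]
```

This per-row, activation-aware twist is the entire method — no Hessian, no retraining — which is why Wanda is the usual first baseline.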
Reading: - [ ] Wanda: A Simple and Effective Pruning Method (2 hrs) - [ ] MiniLLM: Knowledge Distillation of Large Language Models (2.5 hrs) - [ ] ACM Efficient Compressing and Tuning Methods for LLMs — survey (4 hrs)
Hands-on: - [ ] Apply Wanda unstructured pruning to LLaMA-3-8B at 50% sparsity. Measure throughput change (hint: unstructured sparsity doesn’t help on dense hardware) then apply 2:4 structural pattern and remeasure. (3 hrs)
Week 33 — Advanced Speculative Decoding
Concepts to understand:
- [ ] Review: draft-model speculative decoding, acceptance rate, expected speedup derivation
- [ ] Medusa: multiple decoding heads on a frozen backbone; Medusa-1 (frozen LM) vs. Medusa-2 (joint fine-tune); predicts positions +1…+5
- [ ] Tree-structured candidate verification: constructing a candidate tree, batching the verify forward pass, token acceptance mask
- [ ] EAGLE-1: drafting at the feature level (not token level) using a shallow auto-regressive head on frozen LM embeddings
- [ ] EAGLE-2: context-aware dynamic draft tree; draft-confidence scores approximate acceptance rates; 20–40% faster than EAGLE-1
- [ ] Self-speculative decoding (layer skipping): the same model with skipped layers as the draft — no auxiliary model needed
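The expected-speedup derivation in the first bullet comes down to a geometric series over the per-token acceptance rate α; a sketch (the draft-cost ratio is an illustrative assumption):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass with draft length k
    and i.i.d. per-token acceptance rate alpha (the analysis from
    the original speculative decoding paper):
    E = (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, draft_cost: float = 0.05) -> float:
    """Rough wall-clock speedup if each draft token costs
    `draft_cost` of one target forward pass (illustrative value)."""
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost)

for a in (0.6, 0.8, 0.9):
    print(a, round(expected_tokens_per_step(a, 4), 2),
          round(speedup(a, 4), 2))
```

The table this prints shows why acceptance rate dominates everything: EAGLE's feature-level drafting exists precisely to push α up, and EAGLE-2's dynamic tree to spend draft budget where α is predicted to be high.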
Reading: - [ ] Medusa (2.5 hrs) - [ ] EAGLE (3 hrs) - [ ] EAGLE-2 (2 hrs) - [ ] vLLM blog: Speculative Decoding up to 2.8× (1 hr)
Hands-on: - [ ] Enable Medusa and EAGLE-2 in vLLM for LLaMA-3-8B. Measure speedup at greedy and temperature=1 decoding. Compare acceptance rates. (3 hrs)
Milestone (Weeks 31–33): Given a 70B model with 50% Wanda pruning + AQLM 2-bit quantization + EAGLE-2 speculative decoding: estimate theoretical tokens/sec vs. the uncompressed baseline. Identify which technique gives the highest throughput-per-quality-point tradeoff.
Week 34 — Disaggregated & Distributed Inference
Concepts to understand:
- [ ] Why prefill and decode have fundamentally different compute/memory profiles (roofline)
- [ ] Splitwise: phase splitting onto heterogeneous hardware (H100 prefill / A100 decode), KV migration protocol
- [ ] DistServe: goodput-optimized disaggregation, independent parallelism strategies per phase
- [ ] Tensor parallelism for inference: column/row partition, all-reduce cost
- [ ] Multi-GPU vLLM: --tensor-parallel-size, worker topology, NVLink vs. PCIe bandwidth sensitivity
- [ ] 2025 landscape: disaggregation as the default (NVIDIA Dynamo, SGLang, LMCache)
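The prefill/decode asymmetry in the first bullet falls out of arithmetic intensity: a d×d GEMM over t tokens does 2td² FLOPs against ~2d² bytes of fp16 weight traffic, so AI ≈ t. A sketch with A100-ish peak numbers (illustrative):

```python
def arithmetic_intensity(tokens: int, d: int, bytes_per_param=2) -> float:
    """FLOPs per weight byte for a d×d GEMM over `tokens` rows:
    2*t*d^2 FLOPs / (d^2 * bytes_per_param). Weight traffic only —
    activations ignored, so this is an upper bound."""
    flops = 2 * tokens * d * d
    weight_bytes = d * d * bytes_per_param
    return flops / weight_bytes

# A100-ish ridge point: ~312 TFLOP/s / ~2 TB/s ≈ 156 FLOP/byte.
ridge = 312e12 / 2.0e12
prefill = arithmetic_intensity(4096, 8192)   # long-prompt prefill
decode = arithmetic_intensity(1, 8192)       # one token per step
print(prefill > ridge, decode > ridge)       # True False
```

Prefill sits far above the ridge point (compute-bound) while decode sits at AI ≈ 1 (memory-bound) — which is the entire argument for giving the two phases different hardware and different parallelism strategies.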
Reading: - [ ] DistServe (3 hrs) - [ ] Hao AI Lab: Disaggregated Inference 18 Months Later (1 hr) - [ ] NVIDIA Dynamo announcement (1 hr)
Hands-on: - [ ] Deploy LLaMA-3-70B with tensor parallelism across 2 GPUs using vLLM. Measure TTFT and throughput vs. single-GPU with 4-bit quantization. Identify which strategy gives better throughput/VRAM tradeoff. (3 hrs)
Week 35 — Production Serving Infrastructure
Concepts to understand: - [ ] SLO taxonomy: TTFT, TBT (time-between-tokens), P50/P99 targets - [ ] SLO-aware scheduling: global queue management, preemption, priority classes - [ ] Load balancing across replicas: session affinity for prefix cache hits, least-outstanding-requests - [ ] Autoscaling: request-rate-based vs. queue-depth-based triggers, custom Prometheus metrics, K8s HPA - [ ] Triton Inference Server: model repository, backend API, dynamic batching, ensemble models - [ ] TensorRT-LLM: plugin system, inflight batching, paged KV cache, quantized kernel dispatch - [ ] Cost modeling: $/token, GPU utilization targets, spot instance fault tolerance
Reading: - [ ] SLO-Aware Scheduling for LLM Inferences (2.5 hrs) - [ ] AWS: Multi-node TensorRT-LLM + Triton on EKS (3 hrs) - [ ] A Survey on Inference Engines for LLMs (4 hrs)
Hands-on: - [ ] Deploy LLaMA-3-8B with TensorRT-LLM + Triton Inference Server. Set TTFT P99 < 500ms as an SLO. Measure max throughput (tokens/sec) while holding the SLO. (4 hrs)
Milestone (Part IV): End-to-end serving system: FP8-quantized 70B model on disaggregated prefill/decode infrastructure, with SLO-aware scheduling, autoscaling, and cost monitoring. Document latency, throughput, GPU utilization, and $/token achieved.
Part V — Emerging Architectures
Goal: Understand Mamba/SSMs, MoE inference, linear attention variants, diffusion systems, and advanced multimodal architectures — always through the systems lens of compute, memory, and parallelism.
Week 36 — State Space Models: Mamba
Concepts to understand:
- [ ] Selective SSM recurrence: input-dependent (Δ, B, C) parameters break time-invariance — preventing the FFT/convolution shortcut
- [ ] Parallel associative scan: the core primitive — associativity of the state-update operator, tree reduction, work vs. depth analysis
- [ ] Mamba’s kernel fusion: why the naïve sequential scan is memory-bandwidth-bound; how kernel fusion eliminates HBM materialization (analogous to FlashAttention tiling)
- [ ] Inference memory profile: O(1) KV-cache equivalent — fixed-size recurrent state regardless of sequence length
- [ ] Mamba-2 / SSD (Structured State Space Duality): reformulating the selective SSM as semiseparable matrix multiplication; enables tensor-core utilization; 2–8× speedup over Mamba-1
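The associative-scan idea is worth internalizing in plain Python before touching Triton: the recurrence h_t = a_t·h_{t−1} + b_t composes under (a₁,b₁)∘(a₂,b₂) = (a₁a₂, a₂b₁+b₂), so a log-depth prefix scan reproduces the sequential result exactly. A sketch (Hillis–Steele doubling: O(log n) depth at O(n log n) work; GPU kernels use work-efficient variants):

```python
def seq_scan(a, b):
    """Sequential linear recurrence h_t = a_t*h_{t-1} + b_t, h_{-1}=0."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def par_scan(a, b):
    """Inclusive prefix scan over the associative operator
    (a1,b1) o (a2,b2) = (a1*a2, a2*b1 + b2) via Hillis-Steele
    doubling — the same composition Mamba's kernel parallelizes."""
    n = len(a)
    pairs = list(zip(a, b))
    step = 1
    while step < n:
        nxt = list(pairs)
        for i in range(step, n):
            a1, b1 = pairs[i - step]   # left prefix
            a2, b2 = pairs[i]          # right segment
            nxt[i] = (a1 * a2, a2 * b1 + b2)
        pairs = nxt
        step *= 2
    return [p[1] for p in pairs]

a = [0.9, 0.5, 0.8, 0.7, 0.6]
b = [1.0, 2.0, 0.5, 1.0, 0.3]
print(all(abs(x - y) < 1e-9
          for x, y in zip(seq_scan(a, b), par_scan(a, b))))  # True
```

Checking associativity of the combine operator by hand (apply the composed transform to h and compare) is a good five-minute exercise before the Week 37 Triton version.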
Reading: - [ ] Mamba: Linear-Time Sequence Modeling with Selective State Spaces (4 hrs) - [ ] Tri Dao: SSD Blog series (Parts I–III) (3 hrs) - [ ] Princeton PLI: Mamba-2 Algorithms and Systems (2 hrs)
Hands-on: - [ ] Implement the Mamba selective scan in pure PyTorch. Verify output matches the reference implementation. Profile memory usage at seqlen 4K, 16K, 64K — compare to FlashAttention-2. (3 hrs)
Week 37 — Mamba-2, Triton Scan Kernels & Hybrid Architectures
Concepts to understand:
- [ ] Mamba-2/SSD: block-decomposable structure enabling sequence parallelism across devices
- [ ] Implementing efficient parallel scan in Triton: tile sizes, recomputation vs. checkpointing of states, variable-length batches
- [ ] Hybrid architectures (Jamba, Zamba): the 1:7 attention-to-Mamba ratio; interleaved MoE layers; systems implications for serving
- [ ] flash-linear-attention library: unified Triton kernels for GLA/Mamba/RWKV
Reading: - [ ] Mamba-2 / SSD (5 hrs) - [ ] Mamba: The Hard Way — Sasha Rush (annotated Triton) (6 hrs) - [ ] Jamba: Hybrid Transformer-Mamba Language Model (2 hrs)
Hands-on: - [ ] Implement a simplified parallel selective scan in Triton. Benchmark against the sequential Mamba reference at seqlen 1K→64K. Measure FLOP/s and memory bandwidth utilization. (4 hrs)
Milestone (Weeks 36–37): Build a Mamba-2 inference server. Measure: (a) per-token VRAM at seqlen 1K vs. 64K vs. 256K (should be constant), (b) throughput vs. an equivalent-parameter transformer at each seqlen, (c) the crossover point where Mamba becomes faster.
Week 38 — MoE Architecture & Inference Serving
Concepts to understand:
- [ ] Sparse MoE inference: expert weight storage, all-to-all routing, GPU utilization at small batch sizes
- [ ] Expert offloading: streaming expert weights from CPU DRAM; expert prediction/caching to reduce transfer cost
- [ ] Serving Mixtral 8×7B and DeepSeek-MoE: how vLLM/SGLang handle expert-parallel routing at batching time
- [ ] MoE vs. dense at inference: same active FLOPs per token but ~4× more parameters to hold and stream (Mixtral 8×7B: ~47B total vs. ~13B active) — the memory-bandwidth bottleneck is worse
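The small-batch bandwidth problem in the last bullet can be estimated with a routing model: under uniform top-k routing, the expected number of distinct experts a batch touches is n(1 − (1 − k/n)^B), so expert-weight traffic per token falls as batch grows. A sketch (Mixtral-like 8-expert/top-2 layout; the per-expert byte count is an illustrative assumption, and real routing is far from uniform):

```python
def expected_experts_hit(batch: int, n_experts=8, top_k=2) -> float:
    """Expected distinct experts activated by `batch` tokens under
    uniform top-k routing, per MoE layer: each expert is missed by
    one token with probability (1 - k/n)."""
    return n_experts * (1 - (1 - top_k / n_experts) ** batch)

def weight_bytes_per_token(batch: int, expert_bytes=1.3e10) -> float:
    """Expert weight traffic amortized per token. Ignores shared
    (attention/embedding) layers; expert_bytes is illustrative."""
    return expected_experts_hit(batch) * expert_bytes / batch

for b in (1, 8, 64):
    print(b, f"{weight_bytes_per_token(b) / 1e9:.1f} GB/token")
```

At batch 1 you pay for the full active-expert weights every token; at large batch all experts are resident anyway and the traffic amortizes — which is why MoE throughput is so batch-sensitive and why expert caching/prediction matters for interactive serving.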
Reading: - [ ] Efficient Large Scale Language Modeling with MoE — empirical efficiency study (2 hrs) - [ ] Survey on MoE Inference Optimization — inference sections (3 hrs) - [ ] DeepSeekMoE (3 hrs)
Hands-on: - [ ] Serve Mixtral-8×7B with vLLM. Profile GPU memory and expert routing distribution. Measure how batch size affects expert utilization and throughput. (3 hrs)
Week 39 — Linear Attention & Hybrid Architectures
Concepts to understand:
- [ ] Kernel-trick reformulation of softmax attention: replacing exp(qᵀk) with φ(q)ᵀφ(k) — the root of O(1)-memory inference
- [ ] Why pure linear attention underperforms: missing normalizer term, forgetting over long sequences
- [ ] GLA (Gated Linear Attention): chunked-form training kernel in Triton; faster than FlashAttention-2 at 1K seqlen
- [ ] RWKV dual-mode: parallel training (WKV operator along sequence) vs. pure RNN inference (O(1) per-token, fixed VRAM)
- [ ] RetNet: decay matrix as structured (diagonal + low-rank) operator enabling efficient chunkwise computation
- [ ] flash-linear-attention: how subquadratic kernels share the same tiling and recomputation strategy as FlashAttention
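The kernel-trick bullet can be verified numerically: with feature map φ = identity and the normalizer dropped for clarity, causal attention Σ_j (q_t·k_j)v_j equals q_t applied to a running state S = Σ_j k_j v_jᵀ — a fixed d_k×d_v matrix regardless of sequence length. A toy sketch:

```python
def linear_attn_state(qs, ks, vs):
    """Causal linear attention, identity feature map, no normalizer:
    maintain S = sum_j outer(k_j, v_j) and emit o_t = q_t @ S.
    State size is d_k x d_v whatever the sequence length — the
    O(1)-memory inference property."""
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(dk):              # S += outer(k, v)
            for j in range(dv):
                S[i][j] += k[i] * v[j]
        outs.append([sum(q[i] * S[i][j] for i in range(dk))
                     for j in range(dv)])
    return outs

def quadratic_attn(qs, ks, vs):
    """Same computation, materializing every q_t . k_j score."""
    outs = []
    for t, q in enumerate(qs):
        o = [0.0] * len(vs[0])
        for j in range(t + 1):
            s = sum(qi * ki for qi, ki in zip(q, ks[j]))
            for d in range(len(o)):
                o[d] += s * vs[j][d]
        outs.append(o)
    return outs

qs = [[1.0, 0.0], [0.5, 0.5]]
ks = [[0.2, 0.8], [0.6, 0.4]]
vs = [[1.0], [2.0]]
print(all(abs(x[0] - y[0]) < 1e-12 for x, y in
          zip(linear_attn_state(qs, ks, vs),
              quadratic_attn(qs, ks, vs))))  # True
```

The missing softmax normalizer and the unbounded accumulation in S are exactly the failure modes the second bullet names; GLA's gates and RWKV's decay are different fixes for the same state-forgetting problem.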
Reading: - [ ] GLA: Gated Linear Attention Transformers with Hardware-Efficient Training (3 hrs) - [ ] RWKV: Reinventing RNNs for the Transformer Era (3 hrs) - [ ] flash-linear-attention — code + docs (3 hrs)
Hands-on: - [ ] Benchmark GLA vs. FlashAttention-2 vs. Mamba at seqlen 512, 1K, 2K, 4K, 8K. Plot throughput (tokens/sec) and memory per token. Identify the regime where each is fastest. (3 hrs)
Week 40 — Diffusion Model Systems: Samplers & Architectures
Concepts to understand:
- [ ] DDPM reverse process: N forward passes per image — O(N × model_FLOPs) inference cost
- [ ] DDIM and DPM-Solver: deterministic ODE-based samplers reducing steps from ~1000 to 10–20 without retraining
- [ ] Flow matching: velocity field over straight ODE trajectories — fewer NFEs at inference; numerically better-conditioned
- [ ] Consistency models: self-consistency along ODE trajectory → single-step inference
- [ ] DiT (Diffusion Transformer) vs. U-Net: patching the latent space with a ViT backbone; uniform FLOP distribution per layer; simpler tensor parallelism
Reading: - [ ] DiT: Scalable Diffusion Models with Transformers (3 hrs) - [ ] DPM-Solver (3 hrs) - [ ] Flow Matching for Generative Modeling (3 hrs) - [ ] Efficient Diffusion Models Survey (4 hrs)
Hands-on: - [ ] Profile DiT-XL/2 inference. Measure latency at 50 steps (DDPM) vs. 10 steps (DPM-Solver) vs. 1 step (consistency). Plot quality (FID) vs. latency tradeoff. (3 hrs)
Week 41 — Diffusion Serving & Distributed Execution
Concepts to understand:
- [ ] Batching strategies for diffusion: requests are embarrassingly parallel within a batch; variable CFG guidance scale complicates batching
- [ ] DistriFusion: asynchronous parallel denoising over devices — exploiting temporal redundancy across steps
- [ ] PipeFusion: pipeline parallelism across transformer layers for DiT
- [ ] Long-context sparse attention patterns: BigBird’s random + local + global, Longformer sliding-window + global
- [ ] ALiBi: per-head linear bias on attention logits proportional to distance — no positional embedding parameters; generalizes to unseen lengths
Reading: - [ ] BigBird: Transformers for Longer Sequences (3 hrs) - [ ] Hyena Hierarchy (3 hrs) - [ ] Efficient Attention Mechanisms for LLMs Survey (4 hrs)
Hands-on: - [ ] Implement DistriFusion for DiT-XL/2 inference across 2 GPUs. Compare latency to single-GPU at the same step count. Measure communication overhead. (3 hrs)
Milestone (Weeks 40–41): For a DiT-XL/2 model serving at 256 requests/sec, design an architecture: (a) step count vs. quality tradeoff using DPM-Solver, (b) batching strategy with variable CFG, (c) multi-GPU with DistriFusion — estimate total GPU count needed to hit 100ms TTFT P99.
Week 42 — Advanced Multimodal: Video & Audio
Concepts to understand:
- [ ] ViT FLOP profile: patch count n = (H × W) / p² grows quadratically with resolution — direct analogy to seqlen scaling
- [ ] Image tokenization tradeoffs: continuous patch embeddings vs. discrete VQ-VAE tokens; 32-token tokenization (NeurIPS 2024)
- [ ] Video model systems: naive 3D attention is O((T·H·W)²); space-time factored attention; sliding-window temporal attention; memory management for long video sequences
- [ ] Frame batching strategies: independent frames (cheap, no temporal coherence) vs. tubelet embeddings (3D patch tokens)
- [ ] Audio models (Whisper, EnCodec): spectrogram-to-patch tokenization, streaming inference, causal attention latency constraints
- [ ] Cross-modal fusion: Q-Former / cross-attention layers (BLIP-2, Flamingo) vs. simple projection (LLaVA)
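The patch-count bullet deserves a quick numeric check — doubling resolution quadruples the token count and grows attention cost ~16×:

```python
def vit_tokens(h: int, w: int, patch: int = 16) -> int:
    """Patch-token count n = (H*W)/p^2 — the 'sequence length'
    a ViT's attention operates over; attention cost scales as n^2."""
    return (h // patch) * (w // patch)

for res in (224, 448, 896):
    n = vit_tokens(res, res)
    print(res, n, n * n)   # tokens 4x per doubling, attention 16x
```

The same arithmetic, with a time axis multiplied in (n = T·H·W/p³ for tubelets), explains why naive 3D video attention is untenable and space-time factorization exists.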
Reading: - [ ] An Image is Worth 16×16 Words: ViT (2 hrs) - [ ] An Image is Worth 32 Tokens for Reconstruction and Generation (2 hrs) - [ ] Image and Video Tokenization (ICLR 2025) (2 hrs) - [ ] Vision Transformers on the Edge: Model Compression Survey (3 hrs)
Hands-on: - [ ] Serve a video-language model (e.g., LLaVA-Video or VideoLLaMA). Profile: (a) video tokenization latency vs. LLM prefill vs. decode, (b) how video length and resolution affect KV cache size and TTFT. (4 hrs)
Milestone (Part V): Implement a Mamba-2 parallel scan kernel and benchmark it end-to-end: compare throughput, memory, and quality against FlashAttention-2 at seqlens from 2K to 256K. Write a one-page analysis of which architecture you’d choose for which use case.
Part VI — Compilers & Infrastructure
Goal: Understand the full compiler stack from MLIR down to PTX, the internals of automatic differentiation, memory allocation, and production MLOps infrastructure.
Week 43 — Compiler IR Theory & MLIR
Concepts to understand:
- [ ] SSA (Static Single Assignment): basic blocks, CFGs, dominance frontiers, liveness analysis, φ-functions
- [ ] MLIR architecture: op, attribute, type, region, block; built-in dialects (func, arith, affine, linalg, memref, scf)
- [ ] Dialect conversion framework and the Transform dialect for compiler-controlled transformations
- [ ] Lowering chains: MLIR → LLVM dialect → LLVM IR → PTX
- [ ] torch.compile internals: Dynamo bytecode interception (PEP 523), FX graph, guard mechanism, graph breaks
Reading: - [ ] SSA-Based Compiler Design (free PDF) — selective chapters (5 hrs) - [ ] MLIR Toy Tutorial Chapters 1–6 (8 hrs) - [ ] depyf: decompiles torch.compile bytecode (3 hrs) - [ ] PyTorch Dynamo Deep-Dive (4 hrs)
Hands-on:
- [ ] Complete all 6 chapters of the MLIR Toy tutorial. Then use depyf to inspect the Dynamo FX graph of a transformer forward pass. Identify 3 graph breaks and explain what causes them. (3 hrs)
Week 44 — TVM, XLA & Autotuning
Concepts to understand:
- [ ] TVM / TensorIR: first-class schedulable IR; MetaSchedule stochastic search space (block tiling, loop reordering, vectorization); how tuning records are stored
- [ ] Ansor: program search space for high-performance tensor programs; cost-model-driven auto-scheduling
- [ ] XLA HLO instruction set: algebraic simplification, fusion, layout assignment, buffer assignment
- [ ] XLA SPMD partitioner: sharding annotations, per-op partitioning semantics, automatic collective insertion
- [ ] GSPMD: generalizing SPMD to arbitrary parallelism strategies
Reading: - [ ] Machine Learning Compilation (MLC) course — TensorIR + MetaSchedule chapters (8 hrs) - [ ] OpenXLA GPU Architecture Overview (3 hrs) - [ ] GSPMD (3 hrs) - [ ] Ansor (2 hrs)
Hands-on: - [ ] Autotune a matrix multiplication using TVM MetaSchedule. Compare achieved FLOP/s to: (a) a naive implementation, (b) cuBLAS, (c) your hand-written Triton kernel. (3 hrs)
Week 45 — Automatic Differentiation Systems
Concepts to understand:
- [ ] JVP (forward-mode AD) as Jacobian–vector product; VJP (reverse-mode) as vector–Jacobian product
- [ ] Why reverse mode dominates for ML: the full Jacobian of an n-input, m-output function costs m VJPs (reverse) vs. n JVPs (forward); with a scalar loss m = 1, so a single VJP yields all gradients
- [ ] JAX transformation model: jax.jvp, jax.vjp, jax.grad; Jaxpr as the internal lambda calculus IR
- [ ] vmap as a batch-dimension lifting transformation; how jit + vmap + grad compose
- [ ] Custom derivatives in JAX: custom_jvp and custom_vjp for non-differentiable ops and numerical stability fixes
- [ ] PyTorch autograd internals: dynamic computation graph, Function.forward/backward, AccumulateGrad nodes, C++ engine thread pool
- [ ] AOTAutograd: ahead-of-time joint forward+backward graph capture; why graph breaks hurt backward compilation
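A scalar reverse-mode engine makes the VJP machinery concrete; a minimal sketch (it propagates per-path contributions, which is exponential on deeply shared graphs — fine for intuition, not a real engine, which would topologically sort the tape instead):

```python
class Var:
    """Minimal scalar reverse-mode AD. Each op records its parents
    together with local partial derivatives (the VJP rule); backward()
    pushes per-path contributions down the graph and accumulates them
    into .grad — PyTorch's Function/AccumulateGrad machinery in
    miniature."""
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents

    def __add__(self, other):
        # d(x+y)/dx = 1, d(x+y)/dy = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(x*y)/dx = y, d(x*y)/dy = x
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self):
        stack = [(self, 1.0)]
        while stack:
            node, upstream = stack.pop()
            node.grad += upstream           # AccumulateGrad
            for parent, local in node.parents:
                stack.append((parent, local * upstream))

x, y = Var(3.0), Var(2.0)
z = x * y + x            # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)    # 3.0 3.0
```

Note that `x` receives two contributions (one through the product, one through the add) — the accumulation step is exactly what PyTorch's AccumulateGrad nodes do when a tensor is used in multiple ops.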
Reading: - [ ] JAX Autodiff Cookbook (4 hrs) - [ ] JAX JVP/VJP documentation (2 hrs) - [ ] JAX Custom derivative rules (2 hrs) - [ ] How Computational Graphs are Constructed in PyTorch (2 hrs)
Hands-on:
- [ ] Implement a custom vjp in JAX for numerically stable log-softmax. Verify the gradient matches jax.grad on stable inputs but doesn’t NaN on extreme inputs. Then implement the same in PyTorch with torch.autograd.Function. (3 hrs)
Week 46 — Memory Management & Allocators
Concepts to understand:
- [ ] PyTorch CUDA caching allocator: block splitting and reuse, per-stream caching, rounding policy, when cudaMalloc/cudaFree are actually called
- [ ] Memory fragmentation: external fragmentation from variable-size tensors across streams; max_split_size_mb; PYTORCH_CUDA_ALLOC_CONF knobs
- [ ] cudaMallocAsync backend: CUDA 11.4+ virtual memory pools, stream-ordered allocation semantics; when it beats the native allocator
- [ ] Memory snapshot tooling: torch.cuda.memory._record_memory_history() + _dump_snapshot() + pytorch.org/memory_viz
- [ ] Gradient checkpointing + activation CPU offload: separate CUDA streams for overlap; FSDP + offload combinations
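The allocator knobs above are all set through one environment variable, read once at process start; an example configuration (the values are starting points to sweep against your own memory snapshot, not recommendations):

```shell
# Cap the size of blocks the allocator will split, trading some reuse
# for less external fragmentation, and opt into expandable segments
# (virtual-memory-backed growth in place of fresh cudaMalloc calls):
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128,expandable_segments:True"
```

Re-capture a memory snapshot after each change — the win (or regression) shows up directly in the fragmentation gaps visible in memory_viz.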
Reading:
- [ ] A Guide to PyTorch’s CUDA Caching Allocator — Zach DeVito (2.5 hrs)
- [ ] Understanding GPU Memory 1: Visualizing All Allocations over Time (1.5 hrs)
- [ ] PyTorch CUDA semantics — PYTORCH_CUDA_ALLOC_CONF reference (1 hr)
- [ ] torchtune: Memory Optimization Overview (2 hrs)
Hands-on: - [ ] Capture a memory snapshot for a distributed training run with FSDP ZeRO-3. Use pytorch.org/memory_viz to identify the peak allocation event. Reduce peak memory by 20% through allocator tuning and partial activation offload. (3 hrs)
Week 47 — NCCL Deep Dive
Concepts to understand:
- [ ] NCCL ring-AllReduce: bandwidth-optimal for large messages; cost model 2(n−1)·α + (2(n−1)/n)·M·β (latency hops + per-rank byte traffic)
- [ ] Double binary tree: logarithmic latency with full bandwidth; when to prefer tree over ring
- [ ] NCCL protocols: Simple (bandwidth-optimal), LL (latency-optimal 8-byte writes), LL128 (balanced); dynamic selection based on message size
- [ ] NCCL tuning: NCCL_ALGO, NCCL_PROTO, channel count, thread count; benchmarking with nccl-tests
- [ ] Compute-communication overlap: async collectives on separate CUDA streams; NCCL 2.28 copy-engine collectives
- [ ] SHARP: in-network reduction on InfiniBand switches — eliminating the final merge step
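The ring-vs-tree decision drops out of the α-β cost model directly; a sketch (the tree model is deliberately crude, and the α/β values are illustrative, not measured):

```python
import math

def ring_allreduce(n: int, M: float, alpha: float, beta: float) -> float:
    """Alpha-beta model for ring AllReduce over n ranks, M bytes:
    2(n-1) latency hops plus 2(n-1)/n * M bytes per rank."""
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * M * beta

def tree_allreduce(n: int, M: float, alpha: float, beta: float) -> float:
    """Crude double-binary-tree model: ~2*log2(n) latency terms and
    ~2*M bytes per rank (reduce then broadcast). Illustrative only."""
    return 2 * math.log2(n) * alpha + 2 * M * beta

alpha, beta = 5e-6, 1 / 25e9   # 5 us hop latency, ~200 Gb/s link
for M in (8_192, 1 << 30):     # 8 KB vs 1 GiB
    r, t = ring_allreduce(64, M, alpha, beta), tree_allreduce(64, M, alpha, beta)
    print(f"{M:>12} B: ring {r * 1e3:.3f} ms, tree {t * 1e3:.3f} ms")
```

At 8 KB the ring's 2(n−1) latency hops dominate and the tree wins handily; at 1 GiB both are bandwidth-bound and nearly tie — the same crossover NCCL's dynamic algorithm selection exploits, and the one your nccl-tests sweep should reproduce.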
Reading: - [ ] Understanding NCCL Tuning (2 hrs) - [ ] Fast Multi-GPU Collectives with NCCL (1.5 hrs) - [ ] NCCL 2.28 Copy Engine Collectives (1.5 hrs) - [ ] Demystifying NCCL (3 hrs)
Hands-on:
- [ ] Use nccl-tests to benchmark AllReduce across 2 and 8 GPUs at message sizes 1KB to 1GB. Plot bus bandwidth vs. message size. Identify the crossover between latency-dominated and bandwidth-dominated regimes. (3 hrs)
Week 48 — Deployment & Production Infrastructure
Concepts to understand:
- [ ] Experiment tracking: MLflow (runs, artifacts, model registry) vs. W&B (sweeps, artifact versioning, comparison); when to use which
- [ ] ONNX export pipeline: torch.onnx.export dynamo-based path; opset versioning; ONNX Runtime graph optimization levels (basic, extended, all); execution provider selection (CUDA EP, TensorRT EP)
- [ ] Kubernetes for ML: resource requests/limits for GPU pods (nvidia.com/gpu), Kubeflow Pipelines, KubeRay, autoscaling GPU node pools
- [ ] Cost optimization: spot/preemptible instances, bin-packing, inference-time quantization, request batching
Reading: - [ ] ONNX Runtime — Graph Optimizations (2 hrs) - [ ] ONNX Runtime — Execution Providers (1.5 hrs) - [ ] Ray for ML Infrastructure (4 hrs) - [ ] Full Stack Deep Learning Lecture 6: MLOps (2 hrs)
Hands-on: - [ ] Export LLaMA-3-8B to ONNX. Deploy with ONNX Runtime using the TensorRT EP. Compare latency to vLLM serving. Identify which graph optimizations ONNX Runtime applies automatically. (3 hrs)
Week 49 — Advanced Profiling & Distributed Debugging
Concepts to understand:
- [ ] Nsight Compute kernel analysis: SOL section, Memory Workload Analysis chart, Warp State Statistics stall taxonomy, Scheduler Statistics, Source/SASS views for line-level attribution
- [ ] Nsight Systems: NVTX range annotations, correlating host-side Python/C++ with device execution, multi-node traces for straggler identification
- [ ] Distributed training debugging: NCCL_DEBUG=INFO, rank asymmetry (one rank hangs while others wait), common NCCL error classes (mismatched tensor shapes/dtypes, communicator reuse bugs)
- [ ] PyTorch memory profiler: profile_memory=True, memory snapshot workflow, memory_stats counters
Reading: - [ ] Using Nsight Compute to Inspect Your Kernels (2.5 hrs) - [ ] Debugging NCCL Errors in Distributed Training (1.5 hrs) - [ ] Debugging PyTorch Memory Use with Snapshots — Zach DeVito (1.5 hrs) - [ ] PyTorch Mosaic Memory Profiling Tutorial (1.5 hrs)
Hands-on:
- [ ] Run a multi-GPU training job where one rank is intentionally slower (add a sleep). Use Nsight Systems + NCCL_DEBUG=INFO to identify the straggler. Then fix the bottleneck and verify the speedup. (3 hrs)
Week 50 — Integrating the Compiler Stack
Concepts to understand:
- [ ] Full stack: Python → Dynamo FX graph → AOTAutograd joint graph → Inductor loop IR → Triton / CUTLASS → PTX → SASS
- [ ] How each layer in the stack creates or destroys optimization opportunities
- [ ] TorchInductor design: define-by-run IR, symbolic shapes, persistent reduction fusions
- [ ] When to bypass each layer: hand-written Triton for custom ops, CUTLASS for peak GEMM performance, XLA for TPU/multi-host SPMD
Reading: - [ ] TorchInductor design doc (2 hrs) - [ ] PyTorch 2 paper — re-read for architecture (2 hrs) - [ ] GSPMD paper — re-read for the SPMD big picture (2 hrs)
Hands-on:
- [ ] Write a custom PyTorch operator that falls through the entire stack: implement the forward in Triton, register a custom torch.autograd.Function with a custom_vjp, and verify torch.compile can capture and fuse it. (4 hrs)
Week 51 — Reading Week & Integration
A structured review week with no new material. Revisit the notes, exercises, and milestone answers from the hardest weeks.
Suggested review targets:
- [ ] Re-read FlashAttention-3, Mamba-2, and DistServe papers with fresh eyes
- [ ] Re-derive: ZeRO-3 communication volume, arithmetic intensity for every major op type, speculative decoding speedup formula
- [ ] Review your Nsight Compute profiles from Weeks 18 and 49 — do you understand every row now?
- [ ] Read Taming the Titans: A Survey of Efficient LLM Inference Serving as a capstone survey (4 hrs)
- [ ] Read A Survey on Efficient Inference for Large Language Models (5 hrs)
Week 52 — Year-End Capstone
Goal: Ship something production-quality that integrates at least 3 domains from the curriculum. The deliverable should be something you’d be comfortable presenting as a portfolio piece.
Choose one track:
Deliverable checklist:
- [ ] Architecture diagram showing which curriculum concepts were applied and why
- [ ] Baseline measurement (before optimizations)
- [ ] Each optimization applied with before/after numbers
- [ ] Roofline analysis at the bottleneck point
- [ ] What you’d do next if you had more hardware or time
Reference Lists
Canonical Papers by Week
| Paper | Topic | Week |
|---|---|---|
| Attention Is All You Need | Transformer | prerequisite |
| Scaling Laws for Neural Language Models | Scaling | 3 |
| FlashAttention | Efficient attention | 4 |
| FlashAttention-2 | Efficient attention | 4 |
| GQA | KV compression | 4 |
| DeepSeek-V2 §3 MLA | KV compression | 4 |
| LLM.int8() | Quantization | 5 |
| GPTQ | Quantization | 5 |
| AWQ | Quantization | 5 |
| Speculative Decoding | Inference | 5 |
| PagedAttention / vLLM | Serving | 6 |
| Orca | Serving | 6 |
| SGLang | Serving | 6 |
| Megatron-LM | Distributed | 7 |
| Megatron 3D | Distributed | 7 |
| ZeRO | Distributed | 8 |
| Mixed Precision Training | Training | 8 |
| Activation Recomputation | Training | 8 |
| Triton | Kernels | 9 |
| PyTorch 2 | Compilation | 10 |
| CLIP | Multimodal | 11 |
| LLaVA | Multimodal | 11 |
| FlashAttention-3 | Hardware | 17 |
| Ring Attention | Seq parallelism | 21 |
| DeepSpeed Ulysses | Seq parallelism | 21 |
| Switch Transformers | MoE | 22 |
| GShard | MoE | 22 |
| MegaBlocks | MoE | 23 |
| Chinchilla | Scaling | 24 |
| LoRA | Finetuning | 26 |
| QLoRA | Finetuning | 26 |
| DPO | RLHF | 28 |
| DeepSeekMath (GRPO) | RLHF | 28 |
| YaRN | Long context | 29 |
| SnapKV | KV eviction | 30 |
| QuIP# | Quantization | 31 |
| AQLM | Quantization | 31 |
| Wanda | Pruning | 32 |
| EAGLE | Spec decode | 33 |
| EAGLE-2 | Spec decode | 33 |
| DistServe | Disaggregated | 34 |
| Mamba | SSMs | 36 |
| Mamba-2 / SSD | SSMs | 37 |
| GLA | Linear attn | 39 |
| RWKV | Linear attn | 39 |
| DiT | Diffusion | 40 |
| DPM-Solver | Diffusion | 40 |
| Flow Matching | Diffusion | 40 |
| GSPMD | Compilers | 44 |
| Demystifying NCCL | Networking | 47 |
Key Blogs & References
| Resource | What It’s Good For | Link |
|---|---|---|
| Making Deep Learning Go Brrrr | Compute/memory bottleneck taxonomy | https://horace.io/brrr_intro.html |
| Transformer Math 101 (EleutherAI) | FLOP/memory arithmetic reference | https://blog.eleuther.ai/transformer-math/ |
| JAX Scaling Book | Roofline, sharding, distributed training | https://jax-ml.github.io/scaling-book/ |
| Stas Bekman’s ML Engineering | Practical distributed training cookbook | https://github.com/stas00/ml-engineering |
| GPU MODE Lectures | Advanced CUDA, CUTLASS, Triton | https://github.com/gpu-mode/lectures |
| Lilian Weng’s blog | Broad ML coverage, well-cited | https://lilianweng.github.io |
| Sebastian Raschka’s newsletter | LLM research summaries | https://magazine.sebastianraschka.com |
| CutlassAcademy | CUTLASS 3.x tutorials | https://github.com/MekkCyber/CutlassAcademy |
| flash-linear-attention | Triton kernels for SSMs/linear attn | https://github.com/fla-org/flash-linear-attention |
| LLM Inference Optimization Papers | Curated inference paper list | https://github.com/chenhongyu2048/LLM-inference-optimization-paper |
Last updated: 2026-03-15. Revisit pacing at Part boundaries.