ML Systems Curriculum: LLMs & Multimodal Models

52 weeks · ~10 hrs/wk · ~520 hrs total
Profile: Python/PyTorch practitioner, GPU beginner, full-stack goal (inference + training + kernels)


Year Overview

| Part | Weeks | Theme |
|------|-------|-------|
| I | 1–12 | Foundations |
| II | 13–20 | Hardware Mastery |
| III | 21–29 | Training at Scale |
| IV | 30–37 | Advanced Inference |
| V | 38–46 | Emerging Architectures |
| VI | 47–52 | Compilers & Infrastructure |

Dependency Map

flowchart TD
    subgraph P1["① Foundations · Wks 1–12"]
        roofline["Roofline & GPU Architecture"]
        txArith["Transformer Systems Arithmetic"]
        effAttn["Efficient Attention (Flash, GQA, MLA)"]
        inference["Quantization & Serving"]
        dist["Distributed Training (DP / TP / PP / ZeRO)"]
        kernels["Triton Kernels & torch.compile"]
    end

    subgraph P2["② Hardware Mastery · Wks 13–20"]
        advCuda["Advanced CUDA (Warps, Tensor Cores, cp.async)"]
        cutlass["CUTLASS / CuTe"]
        hopper["Hopper ISA (TMA, WGMMA, Ping-Pong)"]
        profiling["Nsight Profiling (Compute + Systems)"]
        interconnects["Interconnects & NCCL Basics"]
    end

    subgraph P3["③ Training at Scale · Wks 21–29"]
        seqPar["Sequence & Context Parallelism"]
        moeTraining["MoE Training (Expert Parallelism)"]
        dataPipe["Data Pipelines & Pretraining Infra"]
        peft["LoRA / QLoRA / PEFT"]
        rlhf["RLHF Systems (PPO, DPO, GRPO)"]
    end

    subgraph P4["④ Advanced Inference · Wks 30–37"]
        longCtx["Long-Context (YaRN, KV Eviction)"]
        advQuant["FP8 & Extreme Quantization"]
        advSpec["Advanced Spec Decoding (EAGLE, Medusa)"]
        disagg["Disaggregated Inference (DistServe)"]
        prodServing["Production Serving & SLO Scheduling"]
    end

    subgraph P5["⑤ Emerging Architectures · Wks 38–46"]
        mamba["Mamba / SSMs & Parallel Scan"]
        moeInf["MoE Inference & Expert Offload"]
        linAttn["Linear Attention & Hybrid Models"]
        diffusion["Diffusion Systems (DiT, Flow Matching)"]
    end

    subgraph P6["⑥ Compilers & Infra · Wks 47–52"]
        mlir["MLIR & Compiler IR Theory"]
        tvm["TVM / XLA / Autotuning"]
        adSys["Automatic Differentiation Systems"]
        memMgmt["CUDA Memory Management"]
        ncclDeep["NCCL Deep Dive & Collectives"]
    end

    %% Part I internal
    roofline --> txArith
    roofline --> kernels
    txArith --> effAttn
    effAttn --> inference
    effAttn --> dist

    %% P1 → P2
    roofline --> advCuda
    kernels --> cutlass
    advCuda --> cutlass
    cutlass --> hopper
    hopper --> profiling
    dist --> interconnects

    %% P1 → P3
    dist --> seqPar
    dist --> moeTraining
    txArith --> dataPipe
    inference --> peft
    peft --> rlhf
    seqPar --> rlhf

    %% P1 → P4
    effAttn --> longCtx
    inference --> advQuant
    inference --> advSpec
    inference --> disagg
    inference --> prodServing

    %% P2 → P4
    hopper --> advQuant

    %% P2 → P5
    hopper --> mamba
    kernels --> mamba

    %% P1 → P5
    effAttn --> linAttn
    txArith --> diffusion

    %% P3 → P5
    moeTraining --> moeInf

    %% P4 → P5
    disagg --> moeInf

    %% P2 → P6
    hopper --> mlir
    kernels --> mlir
    mlir --> tvm
    advCuda --> memMgmt
    interconnects --> ncclDeep
    kernels --> adSys

Part I — Foundations

Goal: Build the hardware intuition, transformer arithmetic, attention theory, inference fundamentals, distributed training basics, and kernel writing skills that underpin everything else.

Week 1 — The Roofline Model

Concepts to understand:
- [ ] Arithmetic intensity: what it is and how to compute it for an op
- [ ] The roofline model: compute ceiling vs memory bandwidth ceiling
- [ ] Ridge point: the arithmetic intensity that separates memory-bound from compute-bound
- [ ] HBM vs SRAM: capacity, bandwidth, and latency tradeoffs
- [ ] GPU SM, warp, and thread hierarchy at a high level
- [ ] How to read a PyTorch profiler trace

Reading:
- [ ] Making Deep Learning Go Brrrr — Horace He (1.5 hrs)
- [ ] All About Rooflines — JAX Scaling Book (2 hrs)
- [ ] Transformer Math 101 — EleutherAI (1.5 hrs)
- [ ] NVIDIA Hopper Architecture Whitepaper §1–3 (1.5 hrs)
- [ ] PyTorch Profiler Tutorial (1 hr)

Hands-on:
- [ ] Run PyTorch profiler on a forward pass of GPT2LMHeadModel. Identify the top 3 ops by CUDA time and classify each as compute- or memory-bound. (1.5 hrs)

Milestone: Given matmul(A, B) with A=(4096,4096), B=(4096,4096) in fp16 on an A100 (312 TFLOP/s, 2 TB/s), compute arithmetic intensity and determine its roofline regime.
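The milestone's numbers can be checked with a short script. This is a minimal sketch: the 312 TFLOP/s and 2 TB/s peaks come from the milestone statement, and the traffic model assumes each matrix moves through HBM exactly once (no cache reuse).

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte of HBM traffic for C = A @ B."""
    flops = 2 * m * n * k                               # one multiply + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops / traffic

PEAK_FLOPS = 312e12   # A100 fp16 tensor-core peak, FLOP/s
PEAK_BW = 2e12        # A100 HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW    # ridge point, FLOPs/byte

ai = matmul_arithmetic_intensity(4096, 4096, 4096)   # = 4096/3 ≈ 1365
regime = "compute-bound" if ai > ridge else "memory-bound"
print(f"AI = {ai:.0f} FLOPs/byte, ridge = {ridge:.0f} -> {regime}")
```

At 4096³ the arithmetic intensity (~1365 FLOPs/byte) sits far above the A100's ridge point (~156), so this matmul is firmly compute-bound.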


Week 2 — CUDA Mental Model & Memory Hierarchy

Concepts to understand:
- [ ] Thread/block/grid hierarchy and how it maps to GPU hardware
- [ ] Shared memory (SMEM): capacity per SM, latency vs global memory
- [ ] Global memory coalescing: why access pattern matters for bandwidth
- [ ] Occupancy: how register and SMEM usage limits active warps per SM
- [ ] Bank conflicts in shared memory
- [ ] L1/L2 cache behavior

Reading:
- [ ] Programming Massively Parallel Processors Ch. 1–5 — Hwu & Kirk (4 hrs)
- [ ] ECE408/CS483 Lecture Videos Weeks 1–3 — UIUC via NVIDIA (3 hrs)
- [ ] An Even Easier Introduction to CUDA — NVIDIA Blog (1 hr)

Hands-on:
- [ ] Write a naive CUDA vector-add kernel (in C or via Numba). Measure achieved bandwidth vs theoretical peak. (2 hrs)

Milestone: Explain why a naively written matrix transpose kernel achieves only a small fraction of peak memory bandwidth even though it performs the minimum O(N²) reads and writes (hint: uncoalesced column-wise accesses) — and describe the shared memory tiling fix.


Week 3 — Transformer Systems Arithmetic

Concepts to understand:
- [ ] Parameter count formula for a transformer (embedding, attn projections, MLP, LM head)
- [ ] FLOPs ≈ 6ND for training, ≈ 2ND for inference — where this comes from
- [ ] KV cache memory: 2 · n_layers · n_heads · d_head · seqlen · batch · dtype_bytes
- [ ] Model FLOPs utilization (MFU): definition and how to measure it
- [ ] Scaling laws: compute-optimal training (Chinchilla)
- [ ] Why seqlen causes quadratic FLOPs but linear KV cache growth

Reading:
- [ ] Transformer Math 101 — EleutherAI (re-read carefully, 1.5 hrs)
- [ ] Language Models are Few-Shot Learners §2 — GPT-3 (1 hr)
- [ ] Scaling Laws for Neural Language Models — Kaplan et al. (2 hrs)
- [ ] JAX Scaling Book Ch. 2–3 (2 hrs)

Hands-on:
- [ ] For a 7B-parameter LLaMA-2 in bf16, derive: (a) weight memory, (b) KV cache at batch=32 seqlen=2048, (c) FLOPs per forward pass. (3 hrs)

Milestone: Why does doubling sequence length quadratically increase attention FLOPs but only linearly increase KV cache memory? Write out the derivation.
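The KV cache formula and the quadratic-vs-linear scaling can be checked numerically. A minimal sketch, using approximate LLaMA-2-7B shapes (32 layers, 32 heads, d_head 128 — assumptions, not from the hands-on spec) and counting only the two attention matmuls:

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seqlen, batch, dtype_bytes=2):
    # 2x for K and V, per the formula in the list above
    return 2 * n_layers * n_heads * d_head * seqlen * batch * dtype_bytes

def attn_flops(n_layers, d_model, seqlen, batch):
    # QK^T scores + weighted value sum: 2 matmuls of 2*s^2*d FLOPs per layer
    return n_layers * batch * 2 * 2 * seqlen * seqlen * d_model

kv = kv_cache_bytes(32, 32, 128, seqlen=2048, batch=32, dtype_bytes=2)
print(f"KV cache: {kv / 2**30:.0f} GiB")    # linear: doubles when seqlen doubles

ratio = attn_flops(32, 4096, 4096, 1) / attn_flops(32, 4096, 2048, 1)
print(f"FLOPs ratio at 2x seqlen: {ratio}")  # 4.0 -> quadratic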


Week 4 — Efficient Attention

Concepts to understand:
- [ ] IO complexity of standard attention: why materializing (seqlen × seqlen) is the bottleneck
- [ ] FlashAttention tiling: splitting Q/K/V into blocks to avoid writing the full attention matrix
- [ ] Online softmax: computing softmax incrementally across tiles
- [ ] GQA/MQA: sharing K/V heads across query heads and the memory savings
- [ ] MLA: low-rank KV compression in DeepSeek-V2 and the 93.3% KV cache reduction reported in the paper
- [ ] FlashAttention-2 improvements over v1

Reading:
- [ ] FlashAttention — Dao et al. 2022 (2 hrs)
- [ ] FlashAttention-2 — Dao 2023 (1.5 hrs)
- [ ] GQA: Training Generalized Multi-Query Transformer Models (1.5 hrs)
- [ ] DeepSeek-V2 §3 — MLA (1 hr)
- [ ] FlashAttention-3 blog (1 hr)

Hands-on:
- [ ] Implement FlashAttention forward tiling logic in pure NumPy. Verify against torch.nn.functional.scaled_dot_product_attention. (3 hrs)

Milestone: For a 70B model, 64 heads, d_head=128, seqlen=8192, batch=1 in fp16 — compute the memory footprint of the full attention matrix under standard MHA. Then state FlashAttention’s peak SMEM usage and explain why.
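The online-softmax trick at the heart of FlashAttention can be sketched for a single row of scores. This is a teaching sketch, not the kernel: it shows only the running-max/running-denominator rescaling that lets softmax be computed one tile at a time.

```python
import numpy as np

def online_softmax(scores, block=4):
    """Numerically stable softmax computed one tile at a time,
    as in FlashAttention's forward pass (single-row sketch)."""
    m = -np.inf      # running max
    d = 0.0          # running denominator
    for i in range(0, len(scores), block):
        tile = scores[i:i + block]
        m_new = max(m, tile.max())
        # rescale the old partial sum to the new max before adding the tile
        d = d * np.exp(m - m_new) + np.exp(tile - m_new).sum()
        m = m_new
    return np.exp(scores - m) / d

x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```

The same rescaling applied to the running output accumulator is what lets the kernel avoid ever materializing the (seqlen × seqlen) matrix.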


Week 5 — Quantization & Speculative Decoding

Concepts to understand:
- [ ] Post-training quantization (PTQ) vs quantization-aware training (QAT)
- [ ] Why outlier activations break naive int8 quantization (LLM.int8() mixed-precision)
- [ ] Weight-only quantization (GPTQ, AWQ) vs weight+activation quantization
- [ ] AWQ: activation-aware scaling to protect salient weights
- [ ] Speculative decoding: draft model generates K tokens, target verifies in parallel
- [ ] Token acceptance rate and when speculative decoding wins/loses

Reading:
- [ ] LLM.int8() — Dettmers et al. (1.5 hrs)
- [ ] GPTQ (1.5 hrs)
- [ ] AWQ: Activation-aware Weight Quantization (1 hr)
- [ ] Fast Inference via Speculative Decoding — Leviathan et al. (1.5 hrs)
- [ ] Hugging Face quantization guide (1 hr)

Hands-on:
- [ ] Load LLaMA-3-8B in fp16, int8 (bitsandbytes), and GPTQ 4-bit. Measure decode latency, throughput, and perplexity for each. (3 hrs)
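The win/lose tradeoff for speculative decoding has a closed form worth internalizing: if each drafted token is accepted independently with rate α and the draft proposes K tokens, the expected tokens emitted per target-model pass is a geometric series (this is the idealized result from the Leviathan et al. analysis, under the independence assumption):

```python
def expected_tokens_per_step(alpha, k):
    """Mean tokens emitted per target-model verification pass when the
    draft proposes k tokens, each accepted with rate alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# High acceptance -> big wins; low acceptance -> draft overhead dominates.
for alpha in (0.9, 0.5, 0.2):
    print(alpha, round(expected_tokens_per_step(alpha, k=4), 2))
```

At α=0.9 the target model verifies ~4 tokens per pass; at α=0.2 it barely clears 1.2, which usually cannot pay for the draft model's cost.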


Week 6 — Serving Systems

Concepts to understand:
- [ ] Prefill vs decode phases: why prefill is compute-bound and decode is memory-bandwidth-bound
- [ ] TTFT (time-to-first-token) vs TPOT (time-per-output-token)
- [ ] Continuous batching (iteration-level scheduling)
- [ ] PagedAttention: virtual memory for KV cache, block table indirection
- [ ] KV cache fragmentation and how paging limits it to <4%
- [ ] SGLang vs vLLM tradeoffs

Reading:
- [ ] Efficient Memory Management for LLM Serving with PagedAttention — Kwon et al. (2 hrs)
- [ ] Orca: A Distributed Serving System for Transformer-Based LLMs (1.5 hrs)
- [ ] Continuous Batching — 23× LLM Throughput — Anyscale (0.5 hrs)
- [ ] vLLM docs — quickstart + architecture (1 hr)
- [ ] SGLang (1.5 hrs)

Hands-on:
- [ ] Serve LLaMA-3-8B with vLLM. Sweep batch size 1→64. Record throughput and TTFT. At what batch size does the system transition from memory-bound to compute-bound? (3 hrs)

Milestone (Weeks 5–6): For a 7B model on one A100 (80GB), walk through: (1) VRAM available for KV cache after loading int4 weights, (2) concurrent sequences that fit, (3) why continuous batching improves utilization over static batching.
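Step (1) and (2) of the milestone can be sketched numerically. The shapes here (32 layers, 32 KV heads, d_head 128, fp16 KV) are assumed 7B-class values, and the sketch deliberately ignores activations, dequant workspace, and framework overhead:

```python
GB = 2**30
vram = 80 * GB
weights = 7e9 * 0.5          # int4 -> ~0.5 bytes/param
kv_free = vram - weights     # optimistic: ignores activations/overhead

# fp16 KV bytes per token: 2 (K and V) * layers * kv_heads * d_head * 2 bytes
kv_per_token = 2 * 32 * 32 * 128 * 2     # = 512 KiB/token

seqs_2048 = int(kv_free // (kv_per_token * 2048))
print(f"~{kv_free / GB:.0f} GiB for KV -> ~{seqs_2048} concurrent 2048-token sequences")
```

For step (3): static batching holds all those slots for the slowest request in the batch, while continuous batching refills a slot the moment any sequence finishes, so the KV budget (and the GPU) stays near-fully utilized.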


Week 7 — Data Parallelism, Tensor Parallelism, Pipeline Parallelism

Concepts to understand:
- [ ] Data parallelism (DDP): gradient all-reduce, overlap with backward pass
- [ ] Tensor parallelism: column-parallel and row-parallel linear layers
- [ ] Pipeline parallelism: stage assignment, pipeline bubbles, 1F1B schedule
- [ ] Micro-batching to fill the pipeline
- [ ] 3D parallelism: how TP + PP + DP compose
- [ ] Communication primitives: all-reduce, all-gather, reduce-scatter

Reading:
- [ ] Megatron-LM — Shoeybi et al. (1.5 hrs)
- [ ] Efficient Large-Scale Language Model Training on GPU Clusters — Narayanan et al. (2.5 hrs)
- [ ] Everything About Distributed Training and Efficient Finetuning (2 hrs)
- [ ] JAX Scaling Book Ch. 4–5 (2 hrs)

Hands-on:
- [ ] Using torch.distributed, run DDP on 2 GPUs. Profile all-reduce communication time vs. total step time at varying batch sizes. (2 hrs)
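The column-parallel/row-parallel pattern is easy to verify on one process with NumPy shards standing in for ranks. A sketch of the Megatron MLP layout (nonlinearity omitted; the final sum plays the role of the all-reduce):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))       # [batch, d_model]
W1 = rng.standard_normal((64, 256))    # column-parallel: split output dim
W2 = rng.standard_normal((256, 64))    # row-parallel: split input dim

tp = 4
W1_shards = np.split(W1, tp, axis=1)   # each "rank" owns 256/tp output cols
W2_shards = np.split(W2, tp, axis=0)   # and the matching 256/tp input rows

h_shards = [x @ w for w in W1_shards]                 # no communication needed
partials = [h @ w for h, w in zip(h_shards, W2_shards)]
y_tp = np.sum(partials, axis=0)                       # the all-reduce step

assert np.allclose(y_tp, x @ W1 @ W2)
```

Pairing column-split then row-split means only one all-reduce per MLP block, which is why Megatron orders the shards this way.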


Week 8 — ZeRO, FSDP, Mixed Precision, Gradient Checkpointing

Concepts to understand:
- [ ] ZeRO stages 1/2/3: what each stage shards
- [ ] ZeRO-3 / FSDP communication: all-gather before forward, reduce-scatter after backward
- [ ] Mixed precision: fp16/bf16 forward + fp32 master weights, loss scaling
- [ ] Why bf16 is preferred over fp16 for training
- [ ] Gradient checkpointing: recomputing activations vs storing them

Reading:
- [ ] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (2 hrs)
- [ ] PyTorch FSDP2 blog (1 hr)
- [ ] Mixed Precision Training (1.5 hrs)
- [ ] NVIDIA Mixed Precision Training Guide (1 hr)
- [ ] Reducing Activation Recomputation in Large Transformer Models (1 hr)

Hands-on:
- [ ] Train a 1B-parameter toy transformer on 2 GPUs. Record peak memory under: (a) DDP fp32, (b) DDP bf16, (c) FSDP ZeRO-3 bf16. (3 hrs)

Milestone (Weeks 7–8): For a 13B model on 8 GPUs with ZeRO-3: calculate per-GPU memory for parameters, gradients, and optimizer states. Estimate the all-gather + reduce-scatter communication volume per step vs. DDP baseline.
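The per-GPU memory half of the milestone follows directly from the ZeRO paper's mixed-precision Adam accounting. A minimal sketch (weights/grads/optimizer only; activations and buffers excluded):

```python
def zero3_per_gpu_gb(n_params, n_gpus):
    """ZeRO-paper accounting: 2 (bf16 weights) + 2 (bf16 grads)
    + 12 (fp32 master weights + 2 Adam moments) = 16 bytes/param,
    all fully sharded across GPUs under ZeRO-3."""
    return 16 * n_params / n_gpus / 1e9

print(zero3_per_gpu_gb(13e9, 8))   # 13B on 8 GPUs -> 26.0 GB/GPU
```

The same 16 bytes/param without sharding (DDP) would be 208 GB per GPU, which is the comparison the milestone asks you to make before adding the communication-volume estimate.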


Week 9 — Triton Foundations

Concepts to understand:
- [ ] What Triton is: a Python DSL compiling to PTX, sitting between CUDA C and PyTorch ops
- [ ] tl.program_id, tl.load, tl.store: the core primitives
- [ ] Block tiling in Triton: how BLOCK_SIZE maps to SMEM usage and occupancy
- [ ] Masking: handling tensor edges when shape isn’t a multiple of BLOCK_SIZE
- [ ] tl.dot: Triton’s matmul primitive
- [ ] Autotuning: @triton.autotune decorator and config search
- [ ] Fusion: why fusing elementwise ops into a matmul saves memory bandwidth

Reading:
- [ ] Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (1.5 hrs)
- [ ] Official Triton tutorials: vector add → fused softmax → matmul (4 hrs)
- [ ] Unleashing the Power of Triton (1 hr)

Hands-on:
- [ ] Implement fused SiLU (x * sigmoid(x)) as a Triton kernel. Benchmark against PyTorch built-in and two-op naive version. Report bandwidth utilization. (3.5 hrs)
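The program-id/offset/mask pattern that every Triton kernel uses can be modeled in NumPy before touching a GPU. A sketch of the hands-on task's fused SiLU with a ragged edge (length 13, BLOCK_SIZE 8 — hypothetical shapes):

```python
import numpy as np

BLOCK_SIZE = 8
x = np.arange(13, dtype=np.float32)
out = np.empty_like(x)

for pid in range(-(-len(x) // BLOCK_SIZE)):        # ceil-div grid, like tl.program_id
    offs = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE)
    mask = offs < len(x)                           # guard the ragged edge
    # masked load: out-of-range lanes read a clamped address, like tl.load(..., mask=)
    vals = np.where(mask, x[np.minimum(offs, len(x) - 1)], 0.0)
    silu = vals * (1.0 / (1.0 + np.exp(-vals)))    # fused SiLU: x * sigmoid(x)
    out[offs[mask]] = silu[mask]                   # masked store

assert np.allclose(out, x * (1.0 / (1.0 + np.exp(-x))))
```

The fusion win is that `x` is loaded once and `out` stored once; the naive two-op version writes and re-reads the intermediate `sigmoid(x)` through HBM.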


Week 10 — torch.compile & Inductor

Concepts to understand:
- [ ] torch.compile pipeline: Dynamo → Inductor → Triton kernels
- [ ] Graph capture modes: reduce-overhead vs max-autotune
- [ ] Fusion in Inductor: which op patterns get fused automatically
- [ ] Graph breaks: what causes them and how to diagnose with TORCH_COMPILE_DEBUG=1
- [ ] When torch.compile is and isn’t worth it (dynamic shapes, small models)
- [ ] How FlashAttention-3 uses Triton + Hopper TMA/WGMMA for ~75% peak FLOP/s

Reading:
- [ ] PyTorch 2 (1.5 hrs)
- [ ] torch.compile tutorial (1.5 hrs)
- [ ] FlashAttention-3 blog (1.5 hrs)
- [ ] Triton blocked matmul tutorial (2 hrs)

Hands-on:
- [ ] Apply torch.compile to a full transformer forward pass. Use TORCH_COMPILE_DEBUG=1 to inspect emitted Triton kernels. Identify which ops fused and which caused graph breaks. (3.5 hrs)

Milestone (Weeks 9–10): Write a fused layernorm kernel in Triton that computes mean, variance, and normalization in a single pass. Compare throughput and bandwidth to torch.nn.LayerNorm. Explain why a single-pass implementation is faster.
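The single-pass statistics trick the milestone asks for can be prototyped in NumPy: compute E[x] and E[x²] from one read of the row, then normalize. In the real fused kernel the row stays resident in SMEM/registers between the two steps; NumPy necessarily re-reads it, so this is a sketch of the algorithm, not the bandwidth behavior.

```python
import numpy as np

def layernorm_single_pass(x, eps=1e-5):
    """Mean and variance from one logical pass: accumulate sum(x) and
    sum(x*x) together, then var = E[x^2] - E[x]^2."""
    n = x.size
    s = x.sum()
    sq = (x * x).sum()          # same pass over the data in a fused kernel
    mean = s / n
    var = sq / n - mean * mean
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal(4096)
ref = (x - x.mean()) / np.sqrt(x.var() + 1e-5)
assert np.allclose(layernorm_single_pass(x), ref)
```

Note the E[x²] − E[x]² form can lose precision in low-precision accumulators; production kernels often use Welford's algorithm instead, which is a good extension to the milestone.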


Week 11 — Vision-Language Model Serving

Concepts to understand:
- [ ] VLM anatomy: vision encoder (ViT) + projector + LLM backbone
- [ ] How image tokens are injected as “virtual” prefill tokens
- [ ] Resolution scaling: how image size determines token count
- [ ] Two serving regimes: encoder-heavy vs. LLM-decode-heavy
- [ ] Cross-attention injection (Flamingo) vs. token merge/concatenation (LLaVA)
- [ ] KV cache implications: image tokens inflate effective context length

Reading:
- [ ] CLIP (1.5 hrs)
- [ ] LLaVA: Visual Instruction Tuning (1.5 hrs)
- [ ] LLaVA-1.5 (1 hr)
- [ ] Flamingo §3 (1 hr)
- [ ] vLLM multimodal docs (1 hr)

Hands-on:
- [ ] Serve LLaVA-1.5-7B with vLLM. Measure: (a) vision encoder latency vs. LLM prefill latency, (b) how doubling image resolution changes latency and KV cache size. (4 hrs)

Milestone: A 336×336 image produces 576 image tokens. For LLaVA-1.5-7B serving 16 concurrent users (one image + 64-token question each), compute effective prefill token count and compare KV cache memory to a text-only baseline.
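The milestone arithmetic, sketched with assumed 7B-class backbone shapes (32 layers, 32 KV heads, d_head 128, fp16 KV — not specified in the milestone itself):

```python
IMAGE_TOKENS, TEXT_TOKENS, USERS = 576, 64, 16

prefill_per_user = IMAGE_TOKENS + TEXT_TOKENS      # 640 tokens/user
kv_per_token = 2 * 32 * 32 * 128 * 2               # bytes: K+V, fp16

def kv_gib(tokens_per_user):
    return USERS * tokens_per_user * kv_per_token / 2**30

print(f"total prefill tokens: {USERS * prefill_per_user}")        # 10240
print(f"KV cache: {kv_gib(prefill_per_user):.1f} GiB vs "
      f"{kv_gib(TEXT_TOKENS):.2f} GiB text-only")
```

One 336×336 image inflates each request's effective context 10× (640 vs. 64 tokens), so the KV budget, and therefore the achievable concurrency, drops by the same factor.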


Week 12 — Part I Capstone

Choose one track:
- [ ] Track A — LLM serving optimization: Serve a 7B model on one GPU. Apply quantization, continuous batching, and speculative decoding sequentially. Document measured impact of each.
- [ ] Track B — Custom attention kernel: Implement simplified FlashAttention forward in Triton. Benchmark vs. PyTorch SDPA across seqlen 512→8192. Produce a roofline plot.
- [ ] Track C — Distributed training audit: Train a 1B model on 4 GPUs. Compare DDP, FSDP ZeRO-2, FSDP ZeRO-3. Identify the dominant communication bottleneck.

Deliverable checklist:
- [ ] Baseline profile (PyTorch profiler or Nsight output)
- [ ] Each optimization applied with before/after measurements
- [ ] Written explanation of why each change helped (roofline reasoning)
- [ ] One thing you’d do next with more time or hardware


Part II — Hardware Mastery

Goal: Go from Triton user to someone who can reason about hardware ISA-level behavior, write CUTLASS kernels, profile with Nsight Compute, and understand the full interconnect stack.

Week 13 — Advanced CUDA: Warp Primitives & Tensor Cores

Concepts to understand:
- [ ] Warp shuffle functions: __shfl_sync, __shfl_down_sync, __shfl_xor_sync — register-to-register exchange within a warp without SMEM
- [ ] Warp vote functions: __ballot_sync, __all_sync, __any_sync
- [ ] Warp-level reduce patterns using shuffles
- [ ] Tensor Cores via WMMA API: nvcuda::wmma — fragment types, load_matrix_sync, mma_sync, store_matrix_sync
- [ ] Supported WMMA shapes (16×16×16, 32×8×16, 8×32×16) and data types (FP16, BF16, TF32, INT8, FP8)
- [ ] PTX mma.sync for finer control below the WMMA abstraction

Reading:
- [ ] NVIDIA Tensor Core Programming — Lei Mao’s Log Book (3 hrs)
- [ ] Programming Tensor Cores in CUDA 9 (1.5 hrs)
- [ ] Warp Shuffle and Warp Vote Instructions — CSE 599I Slides (1.5 hrs)
- [ ] GPU MODE Lectures 14–20 (3 hrs)

Hands-on:
- [ ] Implement a fused LayerNorm kernel using warp shuffles for the reduction. Benchmark bandwidth against the naive two-pass version. (3 hrs)


Week 14 — CUDA Concurrency: Streams, Graphs, Async Copy

Concepts to understand:
- [ ] cp.async (Ampere+): asynchronous SMEM copy from global memory; commit/wait groups; double-buffering pattern to overlap copy and compute
- [ ] CUDA streams: stream semantics, default vs. non-default streams, cudaEventRecord/cudaEventSynchronize
- [ ] Overlapping H2D copy, kernel, and D2H copy across streams
- [ ] CUDA Graphs: graph capture, instantiation, launch; constant-time ~2.5µs launch overhead; when graphs pay off (many small kernels, inference serving)
- [ ] Cooperative Groups: thread-block groups, multi-block cooperative kernels, grid-wide synchronization

Reading:
- [ ] Controlling Data Movement on Ampere (cp.async) (2 hrs)
- [ ] Getting Started with CUDA Graphs (1.5 hrs)
- [ ] Cooperative Groups (2 hrs)
- [ ] CUDA Streams and Synchronization (1 hr)

Hands-on:
- [ ] Add cp.async double-buffering to your LayerNorm kernel from Week 13. Measure the memory bandwidth improvement vs. the synchronous version. (3.5 hrs)

Milestone (Weeks 13–14): Implement a fused LayerNorm with warp-shuffle reduction and double-buffered cp.async pipelining. Profile in Nsight Compute and identify: (a) achieved bandwidth as % of theoretical, (b) dominant stall reason.


Week 15 — CUTLASS & CuTe

Concepts to understand:
- [ ] What CUTLASS is: composable CUDA templates for GEMM and convolution; CUTLASS 2.x vs. 3.x (CuTe-based)
- [ ] CuTe layout algebra: Layout = Shape × Stride; composable tiling; how CuTe eliminates hand-coded index arithmetic
- [ ] CUTLASS 3.x GEMM pipeline: CollectiveMma, CollectiveEpilogue, GemmUniversal
- [ ] Warp specialization: producer warps (TMA-issued loads) vs. consumer warpgroups (WGMMA); PipelineAsync barriers; Ping-Pong scheduling
- [ ] CUTLASS vs. Triton: when to use each; most production kernels in vLLM/FlashAttention-3 are CUTLASS-based

Reading:
- [ ] CUTLASS: Fast Linear Algebra in CUDA C++ (2 hrs)
- [ ] CUTLASS: Principled Abstractions (CuTe overview) (2 hrs)
- [ ] Deep Dive on CUTLASS Ping-Pong GEMM (3 hrs)
- [ ] CutlassAcademy — curated tutorials (3 hrs)

Hands-on:
- [ ] Write a CUTLASS 3.x GEMM using GemmUniversal with a custom epilogue (fused bias + ReLU). Profile against cuBLAS and a hand-rolled Triton kernel. (3 hrs)


Week 16 — Hopper Architecture: TMA & WGMMA

Concepts to understand:
- [ ] Tensor Memory Accelerator (TMA): hardware DMA engine that accepts 5-D tensor descriptors; single-thread issue in producer warp; eliminates per-thread address calculation
- [ ] wgmma.mma_async: asynchronous warpgroup-level (128-thread) MMA; operand B must be in SMEM; requires wgmma.fence / wgmma.commit_group
- [ ] Thread-block clusters: groups of up to 16 CTAs on adjacent SMs sharing Distributed Shared Memory (DSMEM)
- [ ] Persistent kernel patterns: one CTA per SM, producer warpgroup feeds TMA, consumer warpgroup executes WGMMA
- [ ] Ping-pong warp specialization: alternating consumer warpgroups on back-to-back tiles to hide softmax latency

Reading:
- [ ] Benchmarking and Dissecting the NVIDIA Hopper GPU Architecture (4 hrs)
- [ ] Dissecting Hopper via Microbenchmarking (3 hrs)
- [ ] CUTLASS Tutorial: WGMMA on Hopper — Colfax Research (3 hrs)

Hands-on:
- [ ] Implement a double-buffered GEMM on H100 using TMA + WGMMA (without CUTLASS). Read the generated PTX/SASS in Nsight Compute and verify cp.async.bulk and wgmma.mma_async appear in the hot loop. (3 hrs)


Week 17 — FlashAttention-3 as Hopper Case Study

Concepts to understand:
- [ ] How FA3 combines: TMA-pipelined Q/K/V loads, ping-pong WGMMA between consumer warpgroups, and FP8 block quantization
- [ ] Why FA3 achieves ~740 TFLOP/s (~75% SOL) on H100 while FA2 achieves ~35%
- [ ] Inside NVIDIA GPUs: anatomy of a high-performance matmul kernel
- [ ] Blackwell (B200) preview: UMMA, FP4 Tensor Cores, 5th-gen NVLink (1.8 TB/s/GPU)

Reading:
- [ ] FlashAttention-3 (4 hrs)
- [ ] Anatomy of High-Performance Matmul Kernels — Aleksa Gordić (4 hrs)
- [ ] Developing CUDA Kernels for Hopper — Colfax PDF (3 hrs)

Hands-on:
- [ ] Read the FlashAttention-3 Triton reference implementation. Annotate each section: which Hopper feature is it using, and what is the expected performance impact? (3 hrs)

Milestone (Weeks 15–17): Write a fused FP8 attention kernel in CUTLASS 3.x with TMA + WGMMA + Ping-Pong scheduling. Profile vs. a Triton FlashAttention-2 implementation. Explain the performance delta using roofline analysis.


Week 18 — Profiling Methodology: Nsight Compute & Nsight Systems

Concepts to understand:
- [ ] Nsight Compute workflow: capturing a kernel profile with ncu --set full
- [ ] Speed-of-Light (SOL): reading SM throughput vs. memory throughput as % of theoretical peak
- [ ] Memory Workload Analysis: L1/L2 hit rates, global load efficiency, bank conflicts
- [ ] Warp State Statistics: stall reasons (long scoreboard, memory dependency, no instruction, MIO throttle)
- [ ] Scheduler Statistics: issued IPC vs. theoretical IPC; interpreting occupancy
- [ ] Nsight Systems timeline: CPU-GPU synchronization bubbles, kernel gaps, multi-stream overlap
- [ ] Bottleneck taxonomy: memory-bound (bandwidth wall), compute-bound (FLOP wall), latency-bound (small tiles), launch-overhead-bound

Reading:
- [ ] Nsight Compute Profiling Guide (5 hrs)
- [ ] Accelerating HPC with Nsight Compute Roofline Analysis (2 hrs)
- [ ] Nsight Systems User Guide (3 hrs)

Hands-on:
- [ ] Take your WGMMA GEMM from Week 16. Do a full Nsight Compute profile: find the dominant stall reason, interpret the roofline position, implement one optimization, re-profile to verify improvement. (3 hrs)


Week 19 — Alternative Accelerators

Concepts to understand:
- [ ] TPU architecture: systolic array, HBM bandwidth, inter-chip interconnect (ICI) mesh
- [ ] XLA compilation: HLO IR, fusion passes, layout assignment, SPMD partitioning
- [ ] PyTorch/XLA: lazy tensor execution, mark_step(), SPMD mesh partitioning
- [ ] AMD ROCm/HIP: near-identical to CUDA; wavefront size = 64 on CDNA; hipify for porting
- [ ] CDNA (MI300X): 192 GB unified HBM3, no separate VRAM boundary
- [ ] Groq LPU, Cerebras WSE-3, Gaudi: architectural tradeoffs at a conceptual level

Reading:
- [ ] TPU Deep Dive — Henry Hmko (2 hrs)
- [ ] PyTorch/XLA Overview (2 hrs)
- [ ] HIP Programming Model — AMD ROCm Docs (2 hrs)
- [ ] AI Accelerators Beyond GPUs (1.5 hrs)

Hands-on:
- [ ] Port a CUDA matmul kernel to HIP using hipify. Run on ROCm (cloud MI300X instance or ROCm Docker). Document any wavefront-size or memory-layout differences. (2.5 hrs)


Week 20 — Interconnects, NCCL, & Part II Capstone

Concepts to understand:
- [ ] NVLink generations: NVLink 4.0 (H100, 900 GB/s bidirectional), NVSwitch fabric (all-to-all within DGX)
- [ ] InfiniBand: NDR/XDR generations, fat-tree and dragonfly topologies, IBTA architecture
- [ ] RDMA basics: memory registration, QP model, one-sided vs. two-sided operations
- [ ] GPUDirect RDMA: NIC reads/writes GPU HBM directly over PCIe — eliminates one bounce copy
- [ ] NCCL ring-AllReduce algorithm and bandwidth-latency tradeoff
- [ ] Tree-AllReduce for small messages; hierarchical collectives (intra-node NVLink + inter-node IB)
- [ ] NVSHMEM: device-initiated one-sided PUT/GET from within CUDA kernels

Reading:
- [ ] Scaling Deep Learning Training with NCCL (2 hrs)
- [ ] Demystifying NCCL (3 hrs)
- [ ] InfiniBand vs. RoCE — Juniper White Paper (2 hrs)
- [ ] Inside Multi-Node Training — Together.ai (1.5 hrs)

Hands-on:
- [ ] Write a custom AllReduce using NCCL primitives (ReduceScatter + AllGather). Benchmark against dist.all_reduce at varying message sizes. Plot effective bus bandwidth vs. message size. (3 hrs)
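The ReduceScatter + AllGather decomposition the hands-on task benchmarks can be checked in-process with NumPy arrays standing in for ranks. A sketch (the real version issues two NCCL calls; here both phases are plain Python):

```python
import numpy as np

def allreduce_rs_ag(bufs):
    """AllReduce as ReduceScatter then AllGather over len(bufs) 'ranks'."""
    n = len(bufs)
    chunks = [np.array_split(b, n) for b in bufs]
    # ReduceScatter: rank r ends up owning the sum of everyone's chunk r
    reduced = [sum(chunks[src][r] for src in range(n)) for r in range(n)]
    # AllGather: every rank collects all reduced chunks
    return [np.concatenate(reduced) for _ in range(n)]

ranks = [np.full(8, float(r)) for r in range(4)]   # 4 simulated GPUs
out = allreduce_rs_ag(ranks)
assert all(np.allclose(o, 0 + 1 + 2 + 3) for o in out)
```

Each phase moves (N−1)/N of the buffer per rank, so the composition matches ring-AllReduce's 2(N−1)/N bandwidth term — the quantity to plot against message size in the benchmark.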

Milestone (Part II): End-to-end kernel engineering capstone. Choose one: (a) fused FP8 attention with TMA + WGMMA, (b) INT8/FP8 GEMM with per-block dequantization epilogue vs. cuBLAS, or (c) roofline-guided optimization of an underperforming open-source kernel with 3 distinct improvements verified in Nsight Compute.


Part III — Training at Scale

Goal: Extend beyond the ZeRO/Megatron basics to cover sequence parallelism, MoE training, data infrastructure, fault tolerance, efficient finetuning, and RLHF systems.

Week 21 — Sequence Parallelism & Context Parallelism

Concepts to understand:
- [ ] Why standard tensor parallelism doesn’t help with O(N²) attention memory at long contexts
- [ ] Megatron-LM sequence parallelism: sharding LayerNorm/Dropout along the sequence dimension, using reduce-scatter/all-gather pairs instead of all-reduce
- [ ] Ring Attention: blockwise attention with KV blocks passed ring-wise; communication overlaps with local attention computation
- [ ] DeepSpeed Ulysses: all-to-all on Q/K/V after the projections so each device attends to the full sequence but a subset of heads; communication stays constant as sequence length and device count scale proportionally
- [ ] Combining sequence parallelism with TP+PP (4D parallelism)
- [ ] Context parallelism in Megatron-Core and TorchTitan

Reading:
- [ ] Ring Attention with Blockwise Transformers (3 hrs)
- [ ] DeepSpeed Ulysses (3 hrs)
- [ ] TorchTitan — docs + architecture (3 hrs)

Hands-on:
- [ ] Implement Ring Attention for a toy transformer using torch.distributed. Measure effective sequence length per GPU vs. standard DP at equal memory budget. (3 hrs)


Week 22 — MoE Training: Architecture & Routing

Concepts to understand:
- [ ] Sparse vs. dense compute: MoE activates K of N experts per token — decouples parameter count from FLOPs
- [ ] Top-K routing, gating network, load balancing auxiliary loss
- [ ] Switch Transformer: top-1 routing, capacity factor, token dropping under overflow
- [ ] GShard: expert parallelism via XLA annotations across 2048 TPUs
- [ ] Expert Choice routing: experts select tokens (inverted routing)
- [ ] Token dropping vs. capacity buffer padding tradeoffs
- [ ] MoE compute efficiency: 4× more compute-efficient than dense at low training budgets

Reading:
- [ ] Switch Transformers (4 hrs)
- [ ] GShard (3 hrs)
- [ ] Mixtral of Experts (2 hrs)
- [ ] Cameron Wolfe: MoE LLMs (2 hrs)

Hands-on:
- [ ] Implement a sparse MoE FFN layer in PyTorch with top-2 routing and load balancing loss. Train on a toy task and measure expert utilization over time. (3 hrs)
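The routing half of the hands-on task can be sketched in NumPy before writing the PyTorch layer. Random logits stand in for a trained gating network, and the auxiliary loss follows the Switch-style form (N · Σ_e fraction-of-tokens(e) · mean-gate-prob(e)):

```python
import numpy as np

def top2_route(logits):
    """Top-2 gating plus a Switch-style load-balancing auxiliary loss."""
    n_tokens, n_experts = logits.shape
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    top2 = np.argsort(-probs, axis=-1)[:, :2]    # chosen expert ids per token

    # fraction of routed slots going to each expert
    frac = np.bincount(top2.ravel(), minlength=n_experts) / (2 * n_tokens)
    aux = n_experts * float(frac @ probs.mean(0))
    return top2, aux

rng = np.random.default_rng(0)
ids, aux = top2_route(rng.standard_normal((64, 8)))
print(ids.shape, round(aux, 3))   # aux ~ 1.0 when routing is balanced
```

The loss is minimized (≈1.0) under perfectly uniform routing, so watching it alongside per-expert token counts is a direct way to detect the expert-collapse failure mode covered in Week 23.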


Week 23 — MoE Training: Systems & Communication

Concepts to understand:
- [ ] Expert parallelism: sharding experts across devices, all-to-all communication for token dispatch and gather
- [ ] Why all-to-all is latency-sensitive at small batch sizes — the key serving challenge
- [ ] DeepSeek-MoE: fine-grained expert decomposition (mN sub-experts, mK active); shared expert mechanism
- [ ] Megablocks: block-sparse matrix formulation eliminating token dropping; 40% faster than Tutel
- [ ] Interaction between expert parallelism and TP/PP
- [ ] Expert collapse: diagnosing and preventing degenerate routing

Reading:
- [ ] MegaBlocks: Efficient Sparse Training with MoE (3 hrs)
- [ ] DeepSeekMoE (3 hrs)
- [ ] Survey on MoE Inference Optimization — training sections (3 hrs)

Hands-on:
- [ ] Replace naïve token-padded expert dispatch in your MoE from Week 22 with a block-sparse implementation. Measure GPU utilization improvement. (3 hrs)

Milestone (Weeks 22–23): For a 47B MoE model (Mixtral-style, 8 experts, top-2), compute: (a) active parameters per token, (b) VRAM per GPU with EP=8, (c) all-to-all communication volume per forward pass. Compare to a 13B dense model with equivalent per-token FLOPs.
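Parts (a) and (b) of the milestone can be sketched with assumed Mixtral-8x7B-like shapes (32 layers, d_model 4096, FFN 14336, gated MLP with 3 weight matrices — approximations introduced here, not stated in the milestone):

```python
LAYERS, D_MODEL, D_FFN, N_EXPERTS, TOP_K = 32, 4096, 14336, 8, 2

expert_params = LAYERS * 3 * D_MODEL * D_FFN     # one expert's FFN stack, ~5.6B
total = 46.7e9                                   # published Mixtral total
shared = total - N_EXPERTS * expert_params       # attention + embeddings, ~1.6B

active = shared + TOP_K * expert_params          # params touched per token
print(f"active params/token: {active / 1e9:.1f}B")   # ~12.9B, near a 13B dense model

# (b) bf16 weights per GPU with EP=8: shared weights replicated, one expert each
vram_ep8 = (shared + expert_params) * 2 / 1e9
print(f"weights/GPU at EP=8 (bf16): {vram_ep8:.0f} GB")
```

This is why Mixtral is routinely compared to a 13B dense model: per-token FLOPs track the ~12.9B active parameters, not the 47B total. Part (c), the all-to-all volume, adds 2 dispatches × hidden-state bytes per token on top of this.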


Week 24 — Data Pipelines for Pretraining

Concepts to understand:
- [ ] The data loading bottleneck: CPU throughput vs. GPU compute; profiling with DataLoader workers and async prefetch
- [ ] WebDataset: tar-based sharded format, sequential streaming without random access
- [ ] MosaicML StreamingDataset (MDS): deterministic ordering regardless of GPU count, mid-epoch resumption, multi-cloud support
- [ ] NVIDIA DALI: GPU-accelerated preprocessing pipeline; eliminates CPU bottleneck for image/video/audio
- [ ] Tokenization at scale: pre-tokenizing and caching, memory-mapped numpy arrays for zero-copy loading
- [ ] Dataset mixing and weighting: sampling proportions, upsampling high-quality data
- [ ] Data quality filtering: MinHash LSH deduplication, perplexity filtering, rule-based heuristics, toxic content filtering

Reading:
- [ ] MosaicML StreamingDataset — docs (3 hrs)
- [ ] NVIDIA DALI Documentation (3 hrs)
- [ ] Training Compute-Optimal Large Language Models (Chinchilla) (3 hrs)
- [ ] LLaMA — data pipeline section (2 hrs)

Hands-on:
- [ ] Build a streaming data pipeline using MosaicML StreamingDataset. Profile data loading throughput (samples/sec) and measure how many DataLoader workers are needed to saturate a single GPU. (3 hrs)


Week 25 — Fault Tolerance & Large-Scale Reliability

Concepts to understand:
- [ ] The scale problem: at 1000 GPUs, MTBF is hours not days
- [ ] Full checkpointing vs. sharded checkpointing (PyTorch DCP): save/load cost comparison
- [ ] Async checkpointing: save to host CPU memory in background while training continues
- [ ] NCCL error handling: NCCL_ASYNC_ERROR_HANDLING, timeout detection, error propagation to Python
- [ ] Elastic training: torch.distributed.elastic (torchrun), job preemption and resumption
- [ ] SLURM integration: --signal flag, SIGUSR1 for preemption-aware checkpointing
- [ ] Monitoring: per-GPU throughput tracking, NaN/inf gradient detection, loss anomaly detection

Reading:
- [ ] BLOOM Training Chronicle (3 hrs)
- [ ] OPT-175B Training Logbook (3 hrs)
- [ ] TorchElastic / torchrun documentation (2 hrs)
- [ ] PyTorch Distributed Checkpoint (DCP) documentation (2 hrs)

Hands-on:
- [ ] Implement async sharded checkpointing for a distributed training job. Simulate a mid-training failure. Measure recovery time vs. full checkpoint. (3 hrs)


Week 26 — Efficient Finetuning: LoRA, QLoRA, PEFT

Concepts to understand:
- [ ] Full finetuning memory breakdown: 16 bytes/param total (weights + gradients + Adam states)
- [ ] LoRA: injecting a trainable low-rank update ΔW = BA into each attention projection; rank-r reduces trainable params by up to 10,000×; weights merge at inference — no latency cost
- [ ] QLoRA: NF4 quantization of frozen base + double quantization + paged optimizers; enables 65B finetuning on a single 48GB GPU
- [ ] PEFT library internals: get_peft_model(), merge_and_unload(), distributed training with FSDP + PEFT
- [ ] IA3: learns three scaling vectors rather than low-rank matrices — even fewer trainable params than LoRA

Reading:
- [ ] LoRA: Low-Rank Adaptation of Large Language Models (3 hrs)
- [ ] QLoRA: Efficient Finetuning of Quantized LLMs (3 hrs)
- [ ] Hugging Face PEFT Library (3 hrs)
- [ ] TRL smol-course (4 hrs)

Hands-on:
- [ ] Finetune LLaMA-3-8B using: (a) full finetuning with FSDP, (b) LoRA rank-8, (c) QLoRA 4-bit. Record peak memory, training throughput, and eval accuracy for each. (3 hrs)
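The merge-at-inference property is worth verifying once by hand: the training-time path W·x + B·A·x and the merged path (W + BA)·x are exactly equal. A NumPy sketch with assumed shapes (d=4096, rank 8; `W_merged` here plays the role of PEFT's merge_and_unload()):

```python
import numpy as np

d, r = 4096, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) * 0.02   # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # zero-init so the update starts at 0

x = rng.standard_normal((1, d))
y_unmerged = x @ W.T + (x @ A.T) @ B.T   # training-time path: two small matmuls
W_merged = W + B @ A                     # fold the update into the base weight
assert np.allclose(y_unmerged, x @ W_merged.T)

trainable = A.size + B.size              # 2*r*d vs d*d for full finetuning
print(f"trainable fraction: {trainable / W.size:.4%}")
```

Because B is zero-initialized, step 0 reproduces the base model exactly; and since the merged weight has the original shape, inference pays no extra latency.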

Milestone (Weeks 24–26): Build a complete finetuning pipeline: streaming data loading → QLoRA training with async checkpointing → PEFT weight merge → evaluation. Document the memory and throughput profile at each stage.


Week 27 — RLHF Systems: SFT, Reward Models & PPO

Concepts to understand:
- [ ] Why RLHF changes the training topology: four model instances coexist (Actor, Critic, Reward Model, Reference Model)
- [ ] Reference model and KL penalty: a frozen copy of the initial policy provides per-token log-probs for the KL-shaped reward r_θ − λ · KL(π_θ ‖ π_ref)
- [ ] The rollout bottleneck: generating on-policy samples is ~80% of wall-clock time in PPO
- [ ] OpenRLHF architecture: Ray orchestrates model groups; vLLM handles generation; DeepSpeed ZeRO-3 handles training
- [ ] SFT infrastructure: sequence packing, efficient attention masking for packed sequences
- [ ] Reward model training: Bradley-Terry preference model, ranking loss, process reward models (PRMs)

Reading:
- [ ] Illustrating RLHF — Hugging Face (1.5 hrs)
- [ ] OpenRLHF — docs + architecture (5 hrs)
- [ ] TRL Documentation (4 hrs)

Hands-on:
- [ ] Run a full PPO training loop on a 7B model using OpenRLHF. Profile wall-clock time split between rollout generation and model updates. Measure how vLLM integration affects total throughput. (3 hrs)
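The KL-shaped reward above can be sketched per token: penalize divergence from the reference policy at every position, and add the scalar reward-model score at the final token (a minimal sketch; coefficient and values are illustrative):

```python
def shaped_rewards(logp_policy, logp_ref, terminal_reward, beta=0.1):
    """Per-token reward: -beta * (log pi_theta - log pi_ref); the reward
    model's scalar score is added only at the last generated token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += terminal_reward
    return rewards

# Token 1 drifted above the reference; token 2 matches it exactly.
r = shaped_rewards(logp_policy=[-1.0, -2.0], logp_ref=[-1.5, -2.0],
                   terminal_reward=1.0, beta=0.1)
```

Note the asymmetry this creates for systems design: rewards depend on two full forward passes (policy and reference) over the rollout, which is part of why four model instances must coexist.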


Week 28 — RLHF Algorithm Variants: DPO, GRPO & Beyond

Concepts to understand:
- [ ] DPO: reformulates RLHF as a classification loss; no reward model or RL loop at training time; the implicit reward is β · log(π_θ(y|x)/π_ref(y|x)), and the loss maximizes the margin log(π_θ(y_w|x)/π_ref(y_w|x)) − log(π_θ(y_l|x)/π_ref(y_l|x))
- [ ] GRPO: eliminates the critic network; normalizes rewards within a group of sampled outputs; reduces memory by removing the value network
- [ ] RLOO: leave-one-out baseline within a group of K samples; simpler than GRPO
- [ ] KTO: aligns using binary feedback (thumbs up/down); avoids the need for paired data
- [ ] ORPO: combines SFT + preference loss in a single stage; no reference model needed
- [ ] Online vs. offline: DPO/KTO are offline; PPO/GRPO/RLOO are online (generate on-policy rollouts every step)

Reading:
- [ ] Direct Preference Optimization — Rafailov et al. (3 hrs)
- [ ] DeepSeekMath — GRPO derivation (3 hrs)
- [ ] Putting RL back in RLHF — Hugging Face (1.5 hrs)

Hands-on:
- [ ] Train LLaMA-3-8B with DPO on a preference dataset. Compare reward margin and eval performance against the PPO run from Week 27. Measure training memory and time per step. (3 hrs)
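The DPO loss on a single preference pair reduces to a few lines once the four sequence log-probs are in hand (a minimal scalar sketch; real implementations sum token log-probs per sequence and batch the computation):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * margin), where the margin is the gap between
    the implicit rewards of the chosen (w) and rejected (l) responses."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Chosen response gained log-prob vs. the reference, rejected lost it.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Systems-wise, note what is absent: no rollout generation, no critic, no reward model; only the policy and a frozen reference forward pass, which is why DPO's memory and wall-clock profile looks like supervised training.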

Milestone (Part III): Full RLHF pipeline: streaming data loading → QLoRA SFT → DPO training with async checkpointing → evaluation. Document GPU memory, throughput, and reward statistics at each stage.


Part IV — Advanced Inference

Goal: Go deep on long-context serving, next-generation quantization, model compression, advanced speculative decoding, disaggregated inference, and production-grade serving infrastructure.

Week 29 — Long-Context Inference: Position Encodings

Concepts to understand:
- [ ] RoPE mechanics: encoding relative position as a rotation in the complex plane via e^(imθ)
- [ ] Why naive RoPE extrapolation fails beyond training sequence length
- [ ] Position interpolation (PI): linearly scaling position indices to fit within the training range
- [ ] NTK-aware interpolation: applying PI non-uniformly across frequency components
- [ ] YaRN: ramp function + temperature correction for improved long-context performance
- [ ] LongRoPE: frequency-domain analysis of dimension-wise scaling

Reading:
- [ ] YaRN: Efficient Context Window Extension of Large Language Models (3 hrs)
- [ ] EleutherAI blog: Extending the RoPE (1.5 hrs)
- [ ] Ring Attention for inference — inference section (1.5 hrs)

Hands-on:
- [ ] Extend LLaMA-3-8B to 32K context using YaRN. Measure perplexity at 4K, 8K, 16K, 32K and compare to a model without YaRN. (3 hrs)
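The PI vs. NTK distinction can be made concrete with the rotation angles themselves. A sketch with a tiny head dimension, using the commonly cited NTK base-rescaling base' = base · α^(d/(d−2)) (an approximation; YaRN's ramp and temperature terms are omitted):

```python
def rope_angles(pos, d=8, base=10000.0, pi_scale=1.0, ntk_alpha=1.0):
    """Rotation angle per frequency pair: (pos / pi_scale) * theta_i with
    theta_i = base'^(-2i/d).  PI shrinks the position index uniformly;
    NTK-aware scaling enlarges the base so low frequencies stretch more
    than high ones, preserving short-range resolution."""
    b = base * ntk_alpha ** (d / (d - 2))
    return [(pos / pi_scale) * b ** (-2 * i / d) for i in range(d // 2)]

plain = rope_angles(1024)                       # beyond-training position
pi    = rope_angles(1024, pi_scale=4.0)         # behaves like position 256
ntk   = rope_angles(1024, ntk_alpha=4.0)        # non-uniform stretch
```

The highest-frequency pair (i = 0) is untouched by NTK scaling while the lowest is compressed most, which is exactly the non-uniformity PI lacks.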


Week 30 — Long-Context Inference: KV Cache Management

Concepts to understand:
- [ ] Chunked prefill: splitting long-prompt prefill into chunks to reduce head-of-line blocking
- [ ] Prefix caching: radix-tree KV reuse across requests sharing common prefixes
- [ ] H2O (Heavy-Hitter Oracle): evicting KV entries based on attention mass
- [ ] SnapKV: query-guided one-shot per-layer token selection before generation
- [ ] StreamingLLM: attention sinks + recency window for infinite-length generation with fixed VRAM
- [ ] SAGE-KV: one-shot token/head-level top-k eviction

Reading:
- [ ] SnapKV (2 hrs)
- [ ] vLLM blog: Anatomy of a High-Throughput LLM Inference System (2 hrs)
- [ ] StreamingLLM (2 hrs)

Hands-on:
- [ ] Implement H2O KV eviction on top of a vLLM-served model. Measure perplexity degradation vs. memory savings at 50% cache budget. (3 hrs)

Milestone (Weeks 29–30): For a LLaMA-3-70B model on 4×A100 with tensor parallelism, compute: (a) baseline KV cache size at seqlen=32K, batch=8, (b) memory savings under 50% H2O eviction, (c) latency impact of chunked prefill at different chunk sizes.
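For part (a), the KV-cache arithmetic can be set up as a small calculator, assuming the commonly reported LLaMA-3-70B shape (80 layers, 8 KV heads via GQA, head dim 128) and fp16 cache entries:

```python
def kv_bytes(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2,
             seqlen=32 * 1024, batch=8):
    """Two tensors (K and V) per layer per token, each kv_heads * head_dim."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token, per_token * seqlen * batch

per_token, total = kv_bytes()
gib = total / 2**30        # baseline cache footprint across the TP group
h2o_bytes = total // 2     # part (b): 50% heavy-hitter budget
```

Under these assumptions the cache is 320 KiB per token and 80 GiB total, which is why eviction and chunked prefill matter at this scale; tensor parallelism shards this across the 4 GPUs but does not shrink the aggregate.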


Week 31 — Advanced Quantization: FP8, GGUF, and Extreme Low-Bit

Concepts to understand:
- [ ] FP8 floating-point formats: E4M3 (precision-optimized: 3 mantissa bits) vs. E5M2 (range-optimized: 5 exponent bits), scaling and saturation behavior
- [ ] FP8 inference on Hopper/Ada: NVIDIA TransformerEngine fp8_autocast API
- [ ] SmoothQuant: migrating quantization difficulty from activations to weights via per-channel scaling; joint W8A8
- [ ] GGUF format: K-quant families (Q2_K through Q6_K), block-scale encoding, CPU/GPU hybrid execution
- [ ] QuIP#: Hadamard incoherence processing + E8 lattice codebooks for 2-bit weights
- [ ] AQLM: additive quantization with learned codebooks; training cost vs. inference speed
- [ ] Mixed-precision quantization: per-layer bit selection via sensitivity analysis
- [ ] Calibration dataset design: representativeness, domain shift effects

Reading:
- [ ] NVIDIA TransformerEngine FP8 Primer (2 hrs)
- [ ] QuIP# (3 hrs)
- [ ] AQLM (3 hrs)
- [ ] Which Quantization Should I Use? (systematic comparison) (2 hrs)

Hands-on:
- [ ] Benchmark LLaMA-3-8B in fp16, int8 (SmoothQuant), int4 (GPTQ), FP8 (TE), and GGUF Q4_K_M. Record throughput (tokens/sec), TTFT, and perplexity. (3 hrs)
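The range difference between the two FP8 formats falls out of the bit layout. A sketch computing the largest finite value, following the OCP FP8 convention where E4M3 reserves only the all-ones mantissa at the top exponent for NaN while E5M2 keeps IEEE-style inf/NaN codes:

```python
def fp8_max(exp_bits, man_bits, bias, reserve_top_code=False):
    """Largest finite normal value of a sign+exponent+mantissa format."""
    max_exp = (2 ** exp_bits - 1) - bias
    if reserve_top_code:
        # E4M3-style: top exponent usable, mantissa 111 is NaN -> 1.110
        mantissa = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    else:
        # IEEE-style: top exponent reserved for inf/NaN -> 1.11 at exp-1
        max_exp -= 1
        mantissa = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    return mantissa * 2 ** max_exp

e4m3 = fp8_max(4, 3, bias=7, reserve_top_code=True)   # 448.0
e5m2 = fp8_max(5, 2, bias=15)                         # 57344.0
```

The 128× range gap (448 vs. 57344) is why E5M2 suits gradients while E4M3 suits forward activations and weights, and why per-tensor scaling factors are mandatory either way.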


Week 32 — Model Compression: Pruning & Distillation

Concepts to understand:
- [ ] Unstructured pruning: magnitude pruning, SparseGPT (Hessian-based layer-wise), Wanda (weight × activation magnitude criterion)
- [ ] Structured pruning: head pruning, layer pruning, width pruning; hardware-efficiency tradeoffs
- [ ] 2:4 structured sparsity: what NVIDIA sparse tensor cores require
- [ ] Knowledge distillation at scale: sequence-level vs. token-level losses, forward KL vs. reverse KL (MiniLLM)
- [ ] NVIDIA Minitron: structured pruning + short distillation fine-tune as a production recipe
- [ ] Combined pipelines: prune → distill → quantize stacking

Reading:
- [ ] Wanda: A Simple and Effective Pruning Method (2 hrs)
- [ ] MiniLLM: Knowledge Distillation of Large Language Models (2.5 hrs)
- [ ] ACM Efficient Compressing and Tuning Methods for LLMs — survey (4 hrs)

Hands-on:
- [ ] Apply Wanda unstructured pruning to LLaMA-3-8B at 50% sparsity. Measure throughput change (hint: unstructured sparsity doesn’t help on dense hardware), then apply the 2:4 structured pattern and remeasure. (3 hrs)
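The 2:4 pattern itself is simple: in every contiguous group of four weights, zero the two smallest by some score. A sketch using magnitude as the score (Wanda would instead score by |w| times the input activation norm for that column; row length is assumed divisible by 4):

```python
def two_four(row):
    """Zero the two lowest-magnitude weights in each group of four --
    the fixed pattern NVIDIA sparse tensor cores can accelerate."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        keep = sorted(range(4), key=lambda j: abs(group[j]))[2:]  # top-2 by |w|
        out += [w if j in keep else 0.0 for j, w in enumerate(group)]
    return out

pruned = two_four([0.1, -0.9, 0.05, 0.4, -0.2, 0.3, 0.25, -0.01])
```

This also shows why 2:4 can cost more accuracy than unstructured 50% sparsity: the two survivors are forced to come from each local group of four, even when all four weights are important.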


Week 33 — Advanced Speculative Decoding

Concepts to understand:
- [ ] Review: draft-model speculative decoding, acceptance rate, expected speedup derivation
- [ ] Medusa: multiple decoding heads on a frozen backbone; Medusa-1 (frozen LM) vs. Medusa-2 (joint fine-tune); predicts positions +1…+5
- [ ] Tree-structured candidate verification: constructing a candidate tree, batching the verify forward pass, token acceptance mask
- [ ] EAGLE-1: draft at the feature level (not token level) using a shallow auto-regressive head on frozen LM embeddings
- [ ] EAGLE-2: context-aware dynamic draft tree; confidence scores approximate acceptance rates; 20–40% faster than EAGLE-1
- [ ] Self-speculative decoding (layer skipping): the same model with skipped layers as the draft — no auxiliary model needed

Reading:
- [ ] Medusa (2.5 hrs)
- [ ] EAGLE (3 hrs)
- [ ] EAGLE-2 (2 hrs)
- [ ] vLLM blog: Speculative Decoding up to 2.8× (1 hr)

Hands-on:
- [ ] Enable Medusa and EAGLE-2 in vLLM for LLaMA-3-8B. Measure speedup at greedy and temperature=1 decoding. Compare acceptance rates. (3 hrs)
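The expected-speedup derivation under review reduces to a geometric series: with per-token acceptance rate α and draft length γ, one verify pass yields (1 − α^(γ+1)) / (1 − α) tokens in expectation (Leviathan et al.). A calculator sketch, with a crude knob for draft cost:

```python
def expected_speedup(alpha, gamma, draft_cost=0.0):
    """Expected tokens per verify pass, divided by the relative cost of
    one verify pass plus gamma draft passes (draft_cost is the draft's
    per-pass cost as a fraction of the target model's)."""
    tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return tokens / (1 + gamma * draft_cost)

free_draft = expected_speedup(0.8, 4)          # upper bound: free drafting
realistic = expected_speedup(0.8, 4, 0.05)     # 5%-cost draft model
```

Plugging in measured α from the hands-on run tells you whether acceptance rate or draft overhead is what separates Medusa from EAGLE-2 on your hardware.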

Milestone (Weeks 31–33): Given a 70B model with 50% Wanda pruning + AQLM 2-bit quantization + EAGLE-2 speculative decoding: estimate theoretical tokens/sec vs. the uncompressed baseline. Identify which technique gives the highest throughput-per-quality-point tradeoff.


Week 34 — Disaggregated & Distributed Inference

Concepts to understand:
- [ ] Why prefill and decode have fundamentally different compute/memory profiles (roofline)
- [ ] Splitwise: phase splitting onto heterogeneous hardware (H100 prefill / A100 decode), KV migration protocol
- [ ] DistServe: goodput-optimized disaggregation, independent parallelism strategies per phase
- [ ] Tensor parallelism for inference: column/row partition, all-reduce cost
- [ ] Multi-GPU vLLM: --tensor-parallel-size, worker topology, NVLink vs. PCIe bandwidth sensitivity
- [ ] 2025 landscape: disaggregation as the default (NVIDIA Dynamo, SGLang, LMCache)

Reading:
- [ ] DistServe (3 hrs)
- [ ] Hao AI Lab: Disaggregated Inference 18 Months Later (1 hr)
- [ ] NVIDIA Dynamo announcement (1 hr)

Hands-on:
- [ ] Deploy LLaMA-3-70B with tensor parallelism across 2 GPUs using vLLM. Measure TTFT and throughput vs. single-GPU with 4-bit quantization. Identify which strategy gives the better throughput/VRAM tradeoff. (3 hrs)


Week 35 — Production Serving Infrastructure

Concepts to understand:
- [ ] SLO taxonomy: TTFT, TBT (time-between-tokens), P50/P99 targets
- [ ] SLO-aware scheduling: global queue management, preemption, priority classes
- [ ] Load balancing across replicas: session affinity for prefix cache hits, least-outstanding-requests
- [ ] Autoscaling: request-rate-based vs. queue-depth-based triggers, custom Prometheus metrics, K8s HPA
- [ ] Triton Inference Server: model repository, backend API, dynamic batching, ensemble models
- [ ] TensorRT-LLM: plugin system, inflight batching, paged KV cache, quantized kernel dispatch
- [ ] Cost modeling: $/token, GPU utilization targets, spot instance fault tolerance

Reading:
- [ ] SLO-Aware Scheduling for LLM Inferences (2.5 hrs)
- [ ] AWS: Multi-node TensorRT-LLM + Triton on EKS (3 hrs)
- [ ] A Survey on Inference Engines for LLMs (4 hrs)

Hands-on:
- [ ] Deploy LLaMA-3-8B with TensorRT-LLM + Triton Inference Server. Set TTFT P99 < 500ms as an SLO. Measure max throughput (tokens/sec) while holding the SLO. (4 hrs)

Milestone (Part IV): End-to-end serving system: FP8-quantized 70B model on disaggregated prefill/decode infrastructure, with SLO-aware scheduling, autoscaling, and cost monitoring. Document latency, throughput, GPU utilization, and $/token achieved.


Part V — Emerging Architectures

Goal: Understand Mamba/SSMs, MoE inference, linear attention variants, diffusion systems, and advanced multimodal architectures — always through the systems lens of compute, memory, and parallelism.

Week 36 — State Space Models: Mamba

Concepts to understand:
- [ ] Selective SSM recurrence: input-dependent (Δ, B, C) parameters break time-invariance (the discretized Ā = exp(ΔA) varies per token) — preventing the FFT/convolution shortcut
- [ ] Parallel associative scan: the core primitive — associativity of the state-update operator, tree reduction, work vs. depth analysis
- [ ] Mamba’s kernel fusion: why the naïve sequential scan is memory-bandwidth-bound; how kernel fusion eliminates HBM materialization (analogous to FlashAttention tiling)
- [ ] Inference memory profile: O(1) KV-cache equivalent — fixed-size recurrent state regardless of sequence length
- [ ] Mamba-2 / SSD (Structured State Space Duality): reformulating the selective SSM as semiseparable matrix multiplication; enables tensor-core utilization; 2–8× speedup over Mamba-1

Reading:
- [ ] Mamba: Linear-Time Sequence Modeling with Selective State Spaces (4 hrs)
- [ ] Tri Dao: SSD Blog series (Parts I–III) (3 hrs)
- [ ] Princeton PLI: Mamba-2 Algorithms and Systems (2 hrs)

Hands-on:
- [ ] Implement the Mamba selective scan in pure PyTorch. Verify output matches the reference implementation. Profile memory usage at seqlen 4K, 16K, 64K — compare to FlashAttention-2. (3 hrs)
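The associativity that the parallel scan exploits can be checked on the scalar recurrence h_t = a_t · h_{t-1} + b_t (each SSM channel is such a recurrence after discretization). A sketch of the sequential reference and the associative combine operator it must match:

```python
def sequential_scan(a, b, h0=0.0):
    """Reference: h_t = a_t * h_{t-1} + b_t, one step at a time."""
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def combine(l, r):
    """Associative operator on (a, b) pairs; associativity is what lets
    a tree reduction compute all prefix states in O(log T) depth."""
    (a1, b1), (a2, b2) = l, r
    return (a1 * a2, a2 * b1 + b2)

def scan_by_combine(a, b):
    out, acc = [], (1.0, 0.0)        # identity element of `combine`
    for pair in zip(a, b):
        acc = combine(acc, pair)
        out.append(acc[1])           # b-component is h_t when h0 = 0
    return out

a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, 2.0, 0.5, 1.0]
```

A real kernel applies `combine` in a tree over tiles rather than left-to-right, but any grouping gives the same prefixes precisely because the operator is associative.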


Week 37 — Mamba-2, Triton Scan Kernels & Hybrid Architectures

Concepts to understand:
- [ ] Mamba-2/SSD: block-decomposable structure enabling sequence parallelism across devices
- [ ] Implementing an efficient parallel scan in Triton: tile sizes, recomputation vs. checkpointing of states, variable-length batches
- [ ] Hybrid architectures (Jamba, Zamba): the 1:7 attention-to-Mamba ratio; interleaved MoE layers; systems implications for serving
- [ ] flash-linear-attention library: unified Triton kernels for GLA/Mamba/RWKV

Reading:
- [ ] Mamba-2 / SSD (5 hrs)
- [ ] Mamba: The Hard Way — Sasha Rush (annotated Triton) (6 hrs)
- [ ] Jamba: Hybrid Transformer-Mamba Language Model (2 hrs)

Hands-on:
- [ ] Implement a simplified parallel selective scan in Triton. Benchmark against the sequential Mamba reference at seqlen 1K→64K. Measure FLOP/s and memory bandwidth utilization. (4 hrs)

Milestone (Weeks 36–37): Build a Mamba-2 inference server. Measure: (a) per-token VRAM at seqlen 1K vs. 64K vs. 256K (should be constant), (b) throughput vs. an equivalent-parameter transformer at each seqlen, (c) the crossover point where Mamba becomes faster.


Week 38 — MoE Architecture & Inference Serving

Concepts to understand:
- [ ] Sparse MoE inference: expert weight storage, all-to-all routing, GPU utilization at small batch sizes
- [ ] Expert offloading: streaming expert weights from CPU DRAM; expert prediction/caching to reduce transfer cost
- [ ] Serving Mixtral 8×7B and DeepSeek-MoE: how vLLM/SGLang handle expert-parallel routing at batching time
- [ ] MoE vs. dense at inference: same FLOPs per token, ~4× more parameters — the memory bandwidth bottleneck is worse

Reading:
- [ ] Efficient Large Scale Language Modeling with MoE — empirical efficiency study (2 hrs)
- [ ] Survey on MoE Inference Optimization — inference sections (3 hrs)
- [ ] DeepSeekMoE (3 hrs)

Hands-on:
- [ ] Serve Mixtral-8×7B with vLLM. Profile GPU memory and expert routing distribution. Measure how batch size affects expert utilization and throughput. (3 hrs)


Week 39 — Linear Attention & Hybrid Architectures

Concepts to understand:
- [ ] Kernel-trick reformulation of softmax attention: replacing exp(qᵀk) with φ(q)ᵀφ(k) — the root of O(1)-memory inference
- [ ] Why pure linear attention underperforms: missing normalizer term, forgetting over long sequences
- [ ] GLA (Gated Linear Attention): chunked-form training kernel in Triton; faster than FlashAttention-2 at 1K seqlen
- [ ] RWKV dual-mode: parallel training (WKV operator along the sequence) vs. pure RNN inference (O(1) per-token, fixed VRAM)
- [ ] RetNet: decay matrix as a structured (diagonal + low-rank) operator enabling efficient chunkwise computation
- [ ] flash-linear-attention: how subquadratic kernels share the same tiling and recomputation strategy as FlashAttention

Reading:
- [ ] GLA: Gated Linear Attention Transformers with Hardware-Efficient Training (3 hrs)
- [ ] RWKV: Reinventing RNNs for the Transformer Era (3 hrs)
- [ ] flash-linear-attention — code + docs (3 hrs)

Hands-on:
- [ ] Benchmark GLA vs. FlashAttention-2 vs. Mamba at seqlen 512, 1K, 2K, 4K, 8K. Plot throughput (tokens/sec) and memory per token. Identify the regime where each is fastest. (3 hrs)
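The O(1)-memory decode step falls out of the kernel-trick rewrite: keep a running state S = Σ φ(k)vᵀ and normalizer z = Σ φ(k), and each new token only updates and reads them. A toy sketch with tiny dimensions, taking φ as the identity on positive inputs for brevity (real kernels use elu+1 or similar, plus gating/decay):

```python
def linear_attn_step(S, z, q, k, v):
    """One decode step: fold (k, v) into the running state, then read out
    phi(q)' S / phi(q)' z.  State size is d_k * d_v + d_k, independent of
    how many tokens have been processed -- the O(1) memory claim."""
    d_k, d_v = len(k), len(v)
    for i in range(d_k):
        z[i] += k[i]
        for j in range(d_v):
            S[i][j] += k[i] * v[j]
    num = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    den = sum(q[i] * z[i] for i in range(d_k))
    return [x / den for x in num]

S = [[0.0, 0.0], [0.0, 0.0]]
z = [0.0, 0.0]
# Single stored key aligned with the query: output recovers v exactly.
out1 = linear_attn_step(S, z, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0])
```

The "forgetting" failure mode is also visible here: S accumulates every key-value outer product forever, which is what GLA's gates and RetNet's decay are added to control.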


Week 40 — Diffusion Model Systems: Samplers & Architectures

Concepts to understand:
- [ ] DDPM reverse process: N forward passes per image — O(N × model_FLOPs) inference cost
- [ ] DDIM and DPM-Solver: deterministic ODE-based samplers reducing steps from ~1000 to 10–20 without retraining
- [ ] Flow matching: velocity field over straight ODE trajectories — fewer NFEs at inference; numerically better-conditioned
- [ ] Consistency models: self-consistency along the ODE trajectory → single-step inference
- [ ] DiT (Diffusion Transformer) vs. U-Net: patching the latent space with a ViT backbone; uniform FLOP distribution per layer; simpler tensor parallelism

Reading:
- [ ] DiT: Scalable Diffusion Models with Transformers (3 hrs)
- [ ] DPM-Solver (3 hrs)
- [ ] Flow Matching for Generative Modeling (3 hrs)
- [ ] Efficient Diffusion Models Survey (4 hrs)

Hands-on:
- [ ] Profile DiT-XL/2 inference. Measure latency at 50 steps (DDPM) vs. 10 steps (DPM-Solver) vs. 1 step (consistency). Plot the quality (FID) vs. latency tradeoff. (3 hrs)


Week 41 — Diffusion Serving & Distributed Execution

Concepts to understand:
- [ ] Batching strategies for diffusion: requests are embarrassingly parallel within a batch; variable CFG guidance scale complicates batching
- [ ] DistriFusion: asynchronous parallel denoising over devices — exploiting temporal redundancy across steps
- [ ] PipeFusion: pipeline parallelism across transformer layers for DiT
- [ ] Long-context sparse attention patterns: BigBird’s random + local + global, Longformer’s sliding-window + global
- [ ] ALiBi: per-head linear bias on attention logits proportional to distance — no positional embedding parameters; generalizes to unseen lengths

Reading:
- [ ] BigBird: Transformers for Longer Sequences (3 hrs)
- [ ] Hyena Hierarchy (3 hrs)
- [ ] Efficient Attention Mechanisms for LLMs Survey (4 hrs)

Hands-on:
- [ ] Implement DistriFusion for DiT-XL/2 inference across 2 GPUs. Compare latency to single-GPU at the same step count. Measure communication overhead. (3 hrs)

Milestone (Weeks 40–41): For a DiT-XL/2 model serving at 256 requests/sec, design an architecture: (a) step count vs. quality tradeoff using DPM-Solver, (b) batching strategy with variable CFG, (c) multi-GPU with DistriFusion — estimate total GPU count needed to hit 100ms TTFT P99.


Week 42 — Advanced Multimodal: Video & Audio

Concepts to understand:
- [ ] ViT FLOP profile: patch count n = (H × W) / p² grows quadratically with resolution — a direct analogy to seqlen scaling
- [ ] Image tokenization tradeoffs: continuous patch embeddings vs. discrete VQ-VAE tokens; 32-token tokenization (NeurIPS 2024)
- [ ] Video model systems: naive 3D attention is O((T·H·W)²); space-time factored attention; sliding-window temporal attention; memory management for long video sequences
- [ ] Frame batching strategies: independent frames (cheap, no temporal coherence) vs. tubelet embeddings (3D patch tokens)
- [ ] Audio models (Whisper, EnCodec): spectrogram-to-patch tokenization, streaming inference, causal attention latency constraints
- [ ] Cross-modal fusion: Q-Former / cross-attention layers (BLIP-2, Flamingo) vs. simple projection (LLaVA)

Reading:
- [ ] An Image is Worth 16×16 Words: ViT (2 hrs)
- [ ] An Image is Worth 32 Tokens for Reconstruction and Generation (2 hrs)
- [ ] Image and Video Tokenization (ICLR 2025) (2 hrs)
- [ ] Vision Transformers on the Edge: Model Compression Survey (3 hrs)

Hands-on:
- [ ] Serve a video-language model (e.g., LLaVA-Video or VideoLLaMA). Profile: (a) video tokenization latency vs. LLM prefill vs. decode, (b) how video length and resolution affect KV cache size and TTFT. (4 hrs)

Milestone (Part V): Implement a Mamba-2 parallel scan kernel and benchmark it end-to-end: compare throughput, memory, and quality against FlashAttention-2 at seqlens from 2K to 256K. Write a one-page analysis of which architecture you’d choose for which use case.


Part VI — Compilers & Infrastructure

Goal: Understand the full compiler stack from MLIR down to PTX, the internals of automatic differentiation, memory allocation, and production MLOps infrastructure.

Week 43 — Compiler IR Theory & MLIR

Concepts to understand:
- [ ] SSA (Static Single Assignment): basic blocks, CFGs, dominance frontiers, liveness analysis, φ-functions
- [ ] MLIR architecture: op, attribute, type, region, block; built-in dialects (func, arith, affine, linalg, memref, scf)
- [ ] Dialect conversion framework and the Transform dialect for compiler-controlled transformations
- [ ] Lowering chains: MLIR → LLVM dialect → LLVM IR → PTX
- [ ] torch.compile internals: Dynamo bytecode interception (PEP 523), FX graph, guard mechanism, graph breaks

Reading:
- [ ] SSA-Based Compiler Design (free PDF) — selective chapters (5 hrs)
- [ ] MLIR Toy Tutorial Chapters 1–6 (8 hrs)
- [ ] depyf: decompiles torch.compile bytecode (3 hrs)
- [ ] PyTorch Dynamo Deep-Dive (4 hrs)

Hands-on:
- [ ] Complete all 6 chapters of the MLIR Toy tutorial. Then use depyf to inspect the Dynamo FX graph of a transformer forward pass. Identify 3 graph breaks and explain what causes them. (3 hrs)


Week 44 — TVM, XLA & Autotuning

Concepts to understand:
- [ ] TVM / TensorIR: first-class schedulable IR; MetaSchedule stochastic search space (block tiling, loop reordering, vectorization); how tuning records are stored
- [ ] Ansor: program search space for high-performance tensor programs; cost-model-driven auto-scheduling
- [ ] XLA HLO instruction set: algebraic simplification, fusion, layout assignment, buffer assignment
- [ ] XLA SPMD partitioner: sharding annotations, per-op partitioning semantics, automatic collective insertion
- [ ] GSPMD: generalizing SPMD to arbitrary parallelism strategies

Reading:
- [ ] Machine Learning Compilation (MLC) course — TensorIR + MetaSchedule chapters (8 hrs)
- [ ] OpenXLA GPU Architecture Overview (3 hrs)
- [ ] GSPMD (3 hrs)
- [ ] Ansor (2 hrs)

Hands-on:
- [ ] Autotune a matrix multiplication using TVM MetaSchedule. Compare achieved FLOP/s to: (a) a naive implementation, (b) cuBLAS, (c) your hand-written Triton kernel. (3 hrs)


Week 45 — Automatic Differentiation Systems

Concepts to understand:
- [ ] JVP (forward-mode AD) as Jacobian–vector product; VJP (reverse-mode) as vector–Jacobian product
- [ ] Why reverse mode dominates for ML: a full Jacobian costs O(m) VJPs vs. O(n) JVPs for an n-input, m-output function; with a scalar loss (m = 1), a single VJP yields the entire gradient
- [ ] JAX transformation model: jax.jvp, jax.vjp, jax.grad; Jaxpr as the internal lambda calculus IR
- [ ] vmap as a batch-dimension lifting transformation; how jit + vmap + grad compose
- [ ] Custom derivatives in JAX: custom_jvp and custom_vjp for non-differentiable ops and numerical stability fixes
- [ ] PyTorch autograd internals: dynamic computation graph, Function.forward/backward, AccumulateGrad nodes, C++ engine thread pool
- [ ] AOTAutograd: ahead-of-time joint forward+backward graph capture; why graph breaks hurt backward compilation

Reading:
- [ ] JAX Autodiff Cookbook (4 hrs)
- [ ] JAX JVP/VJP documentation (2 hrs)
- [ ] JAX Custom derivative rules (2 hrs)
- [ ] How Computational Graphs are Constructed in PyTorch (2 hrs)

Hands-on:
- [ ] Implement a custom vjp in JAX for numerically stable log-softmax. Verify the gradient matches jax.grad on stable inputs but doesn’t NaN on extreme inputs. Then implement the same in PyTorch with torch.autograd.Function. (3 hrs)
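Forward-mode AD is small enough to build from scratch, which makes the JVP definition concrete: carry a (value, tangent) pair through every operation. A minimal dual-number sketch (toy, not JAX; only the ops the example needs):

```python
import math

class Dual:
    """Forward-mode AD value: (val, dot) where dot is the directional
    derivative along the seeded tangent.  One forward pass = one JVP."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def dual_exp(x):
    e = math.exp(x.val)
    return Dual(e, e * x.dot)        # chain rule through exp

# f(x) = x * exp(x), so f'(x) = exp(x) * (1 + x); evaluate at x = 1.
x = Dual(1.0, 1.0)                   # seed the tangent direction v = 1
y = x * dual_exp(x)                  # y.dot is the JVP J @ v
```

The O(n)-JVPs observation is visible here: each pass propagates one tangent direction, so recovering the gradient of an n-input function this way takes n passes, which is exactly what reverse mode avoids.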


Week 46 — Memory Management & Allocators

Concepts to understand:
- [ ] PyTorch CUDA caching allocator: block splitting and reuse, per-stream caching, rounding policy, when cudaMalloc/cudaFree are actually called
- [ ] Memory fragmentation: external fragmentation from variable-size tensors across streams; max_split_size_mb; PYTORCH_CUDA_ALLOC_CONF knobs
- [ ] cudaMallocAsync backend: CUDA 11.4+ virtual memory pools, stream-ordered allocation semantics; when it beats the native allocator
- [ ] Memory snapshot tooling: torch.cuda.memory._record_memory_history() + _dump_snapshot() + pytorch.org/memory_viz
- [ ] Gradient checkpointing + activation CPU offload: separate CUDA streams for overlap; FSDP + offload combinations

Reading:
- [ ] A Guide to PyTorch’s CUDA Caching Allocator — Zach DeVito (2.5 hrs)
- [ ] Understanding GPU Memory 1: Visualizing All Allocations over Time (1.5 hrs)
- [ ] PyTorch CUDA semantics — PYTORCH_CUDA_ALLOC_CONF reference (1 hr)
- [ ] torchtune: Memory Optimization Overview (2 hrs)

Hands-on:
- [ ] Capture a memory snapshot for a distributed training run with FSDP ZeRO-3. Use pytorch.org/memory_viz to identify the peak allocation event. Reduce peak memory by 20% through allocator tuning and partial activation offload. (3 hrs)


Week 47 — NCCL Deep Dive

Concepts to understand:
- [ ] NCCL ring-AllReduce: bandwidth-optimal for large messages; time ≈ 2(n−1) × α + 2(n−1)/n × β × M (per-step latency term plus bandwidth term)
- [ ] Double binary tree: logarithmic latency with full bandwidth; when to prefer tree over ring
- [ ] NCCL protocols: Simple (bandwidth-optimal), LL (latency-optimal 8-byte writes), LL128 (balanced); dynamic selection based on message size
- [ ] NCCL tuning: NCCL_ALGO, NCCL_PROTO, channel count, thread count; benchmarking with nccl-tests
- [ ] Compute-communication overlap: async collectives on separate CUDA streams; NCCL 2.28 copy-engine collectives
- [ ] SHARP: in-network reduction on InfiniBand switches — eliminating the final merge step

Reading:
- [ ] Understanding NCCL Tuning (2 hrs)
- [ ] Fast Multi-GPU Collectives with NCCL (1.5 hrs)
- [ ] NCCL 2.28 Copy Engine Collectives (1.5 hrs)
- [ ] Demystifying NCCL (3 hrs)

Hands-on:
- [ ] Use nccl-tests to benchmark AllReduce across 2 and 8 GPUs at message sizes 1KB to 1GB. Plot bus bandwidth vs. message size. Identify the crossover between latency-dominated and bandwidth-dominated regimes. (3 hrs)
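Before running nccl-tests, it helps to predict the crossover with a toy alpha-beta cost model of ring-AllReduce: 2(n−1) latency-bound steps, moving 2(n−1)/n of the message in total (parameter values below are illustrative, not measured):

```python
def ring_allreduce_time(n, msg_bytes, alpha, bus_bytes_per_s):
    """Alpha-beta model: per-step latency plus reduce-scatter +
    all-gather transfer of 2(n-1)/n of the message."""
    latency = 2 * (n - 1) * alpha
    transfer = (2 * (n - 1) / n) * msg_bytes / bus_bytes_per_s
    return latency + transfer

# Hypothetical 8-GPU ring: 5 us per-step latency, 100 GB/s per link.
small = ring_allreduce_time(8, 1024, alpha=5e-6, bus_bytes_per_s=100e9)
large = ring_allreduce_time(8, 2**30, alpha=5e-6, bus_bytes_per_s=100e9)
```

At 1 KB the α term dominates (which is when NCCL prefers tree/LL); at 1 GB the transfer term dominates and ring is near bandwidth-optimal. Compare the model's crossover against the measured bus-bandwidth curve.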


Week 48 — Deployment & Production Infrastructure

Concepts to understand:
- [ ] Experiment tracking: MLflow (runs, artifacts, model registry) vs. W&B (sweeps, artifact versioning, comparison); when to use which
- [ ] ONNX export pipeline: the torch.onnx.export dynamo-based path; opset versioning; ONNX Runtime graph optimization levels (basic, extended, all); execution provider selection (CUDA EP, TensorRT EP)
- [ ] Kubernetes for ML: resource requests/limits for GPU pods (nvidia.com/gpu), Kubeflow Pipelines, KubeRay, autoscaling GPU node pools
- [ ] Cost optimization: spot/preemptible instances, bin-packing, inference-time quantization, request batching

Reading:
- [ ] ONNX Runtime — Graph Optimizations (2 hrs)
- [ ] ONNX Runtime — Execution Providers (1.5 hrs)
- [ ] Ray for ML Infrastructure (4 hrs)
- [ ] Full Stack Deep Learning Lecture 6: MLOps (2 hrs)

Hands-on:
- [ ] Export LLaMA-3-8B to ONNX. Deploy with ONNX Runtime using the TensorRT EP. Compare latency to vLLM serving. Identify which graph optimizations ONNX Runtime applies automatically. (3 hrs)


Week 49 — Advanced Profiling & Distributed Debugging

Concepts to understand:
- [ ] Nsight Compute kernel analysis: SOL section, Memory Workload Analysis chart, Warp State Statistics stall taxonomy, Scheduler Statistics, Source/SASS views for line-level attribution
- [ ] Nsight Systems: NVTX range annotations, correlating host-side Python/C++ with device execution, multi-node traces for straggler identification
- [ ] Distributed training debugging: NCCL_DEBUG=INFO, rank asymmetry (one rank hangs while others wait), common NCCL error classes (mismatched tensor shapes/dtypes, communicator reuse bugs)
- [ ] PyTorch memory profiler: profile_memory=True, the memory snapshot workflow, memory_stats counters

Reading:
- [ ] Using Nsight Compute to Inspect Your Kernels (2.5 hrs)
- [ ] Debugging NCCL Errors in Distributed Training (1.5 hrs)
- [ ] Debugging PyTorch Memory Use with Snapshots — Zach DeVito (1.5 hrs)
- [ ] PyTorch Mosaic Memory Profiling Tutorial (1.5 hrs)

Hands-on:
- [ ] Run a multi-GPU training job where one rank is intentionally slower (add a sleep). Use Nsight Systems + NCCL_DEBUG=INFO to identify the straggler. Then fix the bottleneck and verify the speedup. (3 hrs)


Week 50 — Integrating the Compiler Stack

Concepts to understand:
- [ ] Full stack: Python → Dynamo FX graph → AOTAutograd joint graph → Inductor loop IR → Triton / CUTLASS → PTX → SASS
- [ ] How each layer in the stack creates or destroys optimization opportunities
- [ ] TorchInductor design: define-by-run IR, symbolic shapes, persistent reduction fusions
- [ ] When to bypass each layer: hand-written Triton for custom ops, CUTLASS for peak GEMM performance, XLA for TPU/multi-host SPMD

Reading:
- [ ] TorchInductor design doc (2 hrs)
- [ ] PyTorch 2 paper — re-read for architecture (2 hrs)
- [ ] GSPMD paper — re-read for the SPMD big picture (2 hrs)

Hands-on:
- [ ] Write a custom PyTorch operator that falls through the entire stack: implement the forward in Triton, register a custom torch.autograd.Function with a custom_vjp, and verify torch.compile can capture and fuse it. (4 hrs)


Week 51 — Reading Week & Integration

A structured review week with no new material. Revisit the notes, exercises, and milestone answers from the hardest weeks.

Suggested review targets:
- [ ] Re-read the FlashAttention-3, Mamba-2, and DistServe papers with fresh eyes
- [ ] Re-derive: ZeRO-3 communication volume, arithmetic intensity for every major op type, the speculative decoding speedup formula
- [ ] Review your Nsight Compute profiles from Weeks 18 and 49 — do you understand every row now?
- [ ] Read Taming the Titans: A Survey of Efficient LLM Inference Serving as a capstone survey (4 hrs)
- [ ] Read A Survey on Efficient Inference for Large Language Models (5 hrs)


Week 52 — Year-End Capstone

Goal: Ship something production-quality that integrates at least 3 domains from the curriculum. The deliverable should be something you’d be comfortable presenting as a portfolio piece.

Choose one track:

Deliverable checklist:
- [ ] Architecture diagram showing which curriculum concepts were applied and why
- [ ] Baseline measurement (before optimizations)
- [ ] Each optimization applied with before/after numbers
- [ ] Roofline analysis at the bottleneck point
- [ ] What you’d do next if you had more hardware or time


Reference Lists

Canonical Papers by Week

Paper Topic Week
Attention Is All You Need Transformer prerequisite
Scaling Laws for Neural Language Models Scaling 3
FlashAttention Efficient attention 4
FlashAttention-2 Efficient attention 4
GQA KV compression 4
DeepSeek-V2 §3 MLA KV compression 4
LLM.int8() Quantization 5
GPTQ Quantization 5
AWQ Quantization 5
Speculative Decoding Inference 5
PagedAttention / vLLM Serving 6
Orca Serving 6
SGLang Serving 6
Megatron-LM Distributed 7
Megatron 3D Distributed 7
ZeRO Distributed 8
Mixed Precision Training Training 8
Activation Recomputation Training 8
Triton Kernels 9
PyTorch 2 Compilation 10
CLIP Multimodal 11
LLaVA Multimodal 11
FlashAttention-3 Hardware 17
Ring Attention Seq parallelism 21
DeepSpeed Ulysses Seq parallelism 21
Switch Transformers MoE 22
GShard MoE 22
MegaBlocks MoE 23
Chinchilla Scaling 24
LoRA Finetuning 26
QLoRA Finetuning 26
DPO RLHF 28
DeepSeekMath (GRPO) RLHF 28
YaRN Long context 29
SnapKV KV eviction 30
QuIP# Quantization 31
AQLM Quantization 31
Wanda Pruning 32
EAGLE Spec decode 33
EAGLE-2 Spec decode 33
DistServe Disaggregated 34
Mamba SSMs 36
Mamba-2 / SSD SSMs 37
GLA Linear attn 39
RWKV Linear attn 39
DiT Diffusion 40
DPM-Solver Diffusion 40
Flow Matching Diffusion 40
GSPMD Compilers 44
Demystifying NCCL Networking 47

Key Blogs & References

Resource What It’s Good For Link
Making Deep Learning Go Brrrr Compute/memory bottleneck taxonomy https://horace.io/brrr_intro.html
Transformer Math 101 (EleutherAI) FLOP/memory arithmetic reference https://blog.eleuther.ai/transformer-math/
JAX Scaling Book Roofline, sharding, distributed training https://jax-ml.github.io/scaling-book/
Stas Bekman’s ML Engineering Practical distributed training cookbook https://github.com/stas00/ml-engineering
GPU MODE Lectures Advanced CUDA, CUTLASS, Triton https://github.com/gpu-mode/lectures
Lilian Weng’s blog Broad ML coverage, well-cited https://lilianweng.github.io
Sebastian Raschka’s newsletter LLM research summaries https://magazine.sebastianraschka.com
CutlassAcademy CUTLASS 3.x tutorials https://github.com/MekkCyber/CutlassAcademy
flash-linear-attention Triton kernels for SSMs/linear attn https://github.com/fla-org/flash-linear-attention
LLM Inference Optimization Papers Curated inference paper list https://github.com/chenhongyu2048/LLM-inference-optimization-paper

Last updated: 2026-03-15. Revisit pacing at Part boundaries.