ML Systems Curriculum: LLMs & Multimodal Models

52 weeks · ~10 hrs/wk · ~520 hrs total
Profile: Python/PyTorch practitioner, GPU beginner, full-stack goal (inference + training + kernels)


Year Overview

Part Weeks Theme
I 1–12 Foundations
II 13–20 Hardware Mastery
III 21–28 Training at Scale
IV 29–35 Advanced Inference
V 36–42 Emerging Architectures
VI 43–52 Compilers & Infrastructure

Dependency Map

flowchart TD
    subgraph P1["① Foundations · Wks 1–12"]
        roofline["Roofline & GPU Architecture"]
        txArith["Transformer Systems Arithmetic"]
        effAttn["Efficient Attention (Flash, GQA, MLA)"]
        inference["Quantization & Serving"]
        dist["Distributed Training (DP / TP / PP / ZeRO)"]
        kernels["Triton Kernels & torch.compile"]
    end

    subgraph P2["② Hardware Mastery · Wks 13–20"]
        advCuda["Advanced CUDA (Warps, Tensor Cores, cp.async)"]
        cutlass["CUTLASS / CuTe"]
        hopper["Hopper ISA (TMA, WGMMA, Ping-Pong)"]
        profiling["Nsight Profiling (Compute + Systems)"]
        interconnects["Interconnects & NCCL Basics"]
    end

    subgraph P3["③ Training at Scale · Wks 21–28"]
        seqPar["Sequence & Context Parallelism"]
        moeTraining["MoE Training (Expert Parallelism)"]
        dataPipe["Data Pipelines & Pretraining Infra"]
        peft["LoRA / QLoRA / PEFT"]
        rlhf["RLHF Systems (PPO, DPO, GRPO)"]
    end

    subgraph P4["④ Advanced Inference · Wks 29–35"]
        longCtx["Long-Context (YaRN, KV Eviction)"]
        advQuant["FP8 & Extreme Quantization"]
        advSpec["Advanced Spec Decoding (EAGLE, Medusa)"]
        disagg["Disaggregated Inference (DistServe)"]
        prodServing["Production Serving & SLO Scheduling"]
    end

    subgraph P5["⑤ Emerging Architectures · Wks 36–42"]
        mamba["Mamba / SSMs & Parallel Scan"]
        moeInf["MoE Inference & Expert Offload"]
        linAttn["Linear Attention & Hybrid Models"]
        diffusion["Diffusion Systems (DiT, Flow Matching)"]
    end

    subgraph P6["⑥ Compilers & Infra · Wks 43–52"]
        mlir["MLIR & Compiler IR Theory"]
        tvm["TVM / XLA / Autotuning"]
        adSys["Automatic Differentiation Systems"]
        memMgmt["CUDA Memory Management"]
        ncclDeep["NCCL Deep Dive & Collectives"]
    end

    %% Part I internal
    roofline --> txArith
    roofline --> kernels
    txArith --> effAttn
    effAttn --> inference
    effAttn --> dist

    %% P1 → P2
    roofline --> advCuda
    kernels --> cutlass
    advCuda --> cutlass
    cutlass --> hopper
    hopper --> profiling
    dist --> interconnects

    %% P1 → P3
    dist --> seqPar
    dist --> moeTraining
    txArith --> dataPipe
    inference --> peft
    peft --> rlhf
    seqPar --> rlhf

    %% P1 → P4
    effAttn --> longCtx
    inference --> advQuant
    inference --> advSpec
    inference --> disagg
    inference --> prodServing

    %% P2 → P4
    hopper --> advQuant

    %% P2 → P5
    hopper --> mamba
    kernels --> mamba

    %% P1 → P5
    effAttn --> linAttn
    txArith --> diffusion

    %% P3 → P5
    moeTraining --> moeInf

    %% P4 → P5
    disagg --> moeInf

    %% P2 → P6
    hopper --> mlir
    kernels --> mlir
    mlir --> tvm
    advCuda --> memMgmt
    interconnects --> ncclDeep
    kernels --> adSys

Part I — Foundations

Goal: Build the hardware intuition, transformer arithmetic, attention theory, inference fundamentals, distributed training basics, and kernel writing skills that underpin everything else.

Week 1 — The Roofline Model

Concepts to understand:

Reading:

Hands-on:

Milestone: Given matmul(A, B) with A=(4096,4096), B=(4096,4096) in fp16 on an A100 (312 TFLOP/s, 2 TB/s), compute arithmetic intensity and determine its roofline regime.
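A quick self-check in Python — a minimal sketch assuming ideal reuse, i.e. each of A, B, and C crosses HBM exactly once (real kernels move more):

```python
# Roofline check for the Week 1 milestone. Assumes each matrix
# touches HBM exactly once (ideal reuse).
M = N = K = 4096
flops = 2 * M * N * K                      # one multiply + one add per MAC
bytes_moved = 2 * (M * K + K * N + M * N)  # fp16 = 2 bytes/element

ai = flops / bytes_moved                   # arithmetic intensity, FLOP/byte
ridge = 312e12 / 2e12                      # A100 machine balance: 156 FLOP/byte
print(f"AI = {ai:.0f} FLOP/byte vs ridge {ridge:.0f} -> "
      + ("compute-bound" if ai > ridge else "memory-bound"))
```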


Week 2 — CUDA Mental Model & Memory Hierarchy

Concepts to understand:

Reading:

Hands-on:

Milestone: Explain why a matrix transpose kernel is inherently memory-bandwidth-bound (O(N²) reads and writes, zero FLOPs), why a naive implementation still achieves only a fraction of peak bandwidth (uncoalesced strided accesses) — and describe the shared memory tiling fix.
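For intuition, the roofline arithmetic below (a sketch with an assumed fp32 matrix) shows why no transpose implementation can be anything but bandwidth-bound; the coalescing point is in the comments:

```python
# Transpose moves every element once in and once out and computes nothing,
# so arithmetic intensity is ~0: the kernel lives on the bandwidth-limited
# slope of the roofline no matter how it is written.
N, dtype_bytes = 8192, 4                   # fp32 (assumption)
bytes_moved = 2 * N * N * dtype_bytes      # one read + one write per element
print(f"AI = {0 / bytes_moved} FLOP/byte -> purely bandwidth-bound")
# The naive kernel is additionally slow because either its loads or its
# stores stride by N elements (uncoalesced). Staging a tile in shared
# memory lets both global-memory phases run coalesced, with the tile
# padded by one column to avoid shared-memory bank conflicts.
```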


Week 3 — Transformer Systems Arithmetic

Concepts to understand:

Reading:

Hands-on:

Milestone: Why does doubling the sequence length roughly quadruple attention FLOPs but only double KV cache memory? Write out the derivation.
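A scaling sketch to check the derivation against — the shapes are illustrative assumptions; only the exponents on s matter:

```python
# Week 3 milestone scaling check. b=batch, h=heads, s=seqlen, d=head_dim.
def attn_flops(b, h, s, d):
    return 2 * (2 * b * h * s * s * d)      # QK^T plus attn@V, each O(s^2)

def kv_bytes(b, layers, h, s, d, bytes_per=2):
    return 2 * b * layers * h * s * d * bytes_per   # K and V, each O(s)

for s in (4096, 8192):
    print(s, f"{attn_flops(1, 32, s, 128):.2e} FLOPs",
          f"{kv_bytes(1, 32, 32, s, 128)/2**30:.1f} GiB KV")
# Doubling s: FLOPs x4 (s appears squared), KV cache x2 (s appears once).
```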


Week 4 — Efficient Attention

Concepts to understand:

Reading:

Hands-on:

Milestone: For a 70B model, 64 heads, d_head=128, seqlen=8192, batch=1 in fp16 — compute the memory footprint of the full attention matrix under standard MHA. Then state FlashAttention’s peak SMEM usage and explain why it does not grow with sequence length.
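The headline number, as a sketch (assuming the scores would be materialized in fp16):

```python
# Week 4 milestone: size of the full attention score matrix under MHA.
heads, s, bytes_per = 64, 8192, 2
print(f"full score matrix: {heads * s * s * bytes_per / 2**30:.0f} GiB")  # 8 GiB
# FlashAttention never materializes it: K/V stream through shared memory
# in (block x d_head) tiles with an online softmax, so peak SMEM is set
# by the chosen block sizes (tens of KB per SM), independent of seqlen.
```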


Week 5 — Quantization & Speculative Decoding

Concepts to understand:

Reading:

Hands-on:


Week 6 — Serving Systems

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 5–6): For a 7B model on one A100 (80GB), walk through: (1) VRAM available for KV cache after loading int4 weights, (2) concurrent sequences that fit, (3) why continuous batching improves utilization over static batching.
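A back-of-envelope sketch for parts (1)–(3). The model shapes below are assumptions (Llama-2-7B-style: 32 layers, 32 MHA KV heads, head_dim 128, fp16 cache), as is the runtime overhead term:

```python
vram = 80e9
weights = 7e9 * 0.5                       # int4 ~ 0.5 bytes/param (plus small scale overhead)
overhead = 2e9                            # activations, CUDA context, etc. (rough assumption)
kv_per_token = 2 * 32 * 32 * 128 * 2      # K+V * layers * kv_heads * head_dim * fp16
budget = vram - weights - overhead
seqs = budget / (kv_per_token * 4096)     # at a 4096-token context
print(f"KV budget {budget/1e9:.0f} GB -> ~{seqs:.0f} concurrent 4K sequences")
# (3) Continuous batching keeps this budget full by admitting new sequences
# the moment others finish, instead of idling until a static batch drains.
```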


Week 7 — Data Parallelism, Tensor Parallelism, Pipeline Parallelism

Concepts to understand:

Reading:

Hands-on:


Week 8 — ZeRO, FSDP, Mixed Precision, Gradient Checkpointing

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 7–8): For a 13B model on 8 GPUs with ZeRO-3: calculate per-GPU memory for parameters, gradients, and optimizer states. Estimate the all-gather + reduce-scatter communication volume per step vs. DDP baseline.
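A sketch of the standard accounting, assuming fp16 training with Adam (the usual 16 bytes/param: 2 param + 2 grad + 12 optimizer states):

```python
params, gpus = 13e9, 8
per_gpu = params * (2 + 2 + 12) / gpus
print(f"ZeRO-3 states per GPU: {per_gpu/1e9:.0f} GB "
      f"(vs {params*16/1e9:.0f} GB unsharded)")

# Communication per step (parameter-element counts, not bytes):
# DDP: one all-reduce of gradients ~ 2*params moved per GPU.
# ZeRO-3: all-gather params in forward + all-gather in backward +
# reduce-scatter grads ~ 3*params, i.e. ~1.5x the DDP volume.
print(f"DDP ~{2*params:.2e} vs ZeRO-3 ~{3*params:.2e} elements/GPU/step")
```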


Week 9 — Triton Foundations

Concepts to understand:

Reading:

Hands-on:


Week 10 — torch.compile & Inductor

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 9–10): Write a fused layernorm kernel in Triton that computes mean, variance, and normalization in a single pass. Compare throughput and bandwidth to torch.nn.LayerNorm. Explain why a single-pass implementation is faster.
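A minimal starting point for the milestone kernel — a sketch that assumes each row fits in one block (BLOCK_N ≥ N); a production kernel would also tile larger rows:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def layernorm_kernel(X, Y, W, B, N, stride, eps, BLOCK_N: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    # Single pass over global memory: the row is loaded once into
    # registers; mean, variance, and normalization reuse that copy.
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    mean = tl.sum(x, axis=0) / N
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N
    xhat = (x - mean) / tl.sqrt(var + eps)
    w = tl.load(W + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(B + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(Y + row * stride + cols, xhat * w + b, mask=mask)

def layernorm(x, w, b, eps=1e-5):
    M, N = x.shape
    y = torch.empty_like(x)
    layernorm_kernel[(M,)](x, y, w, b, N, x.stride(0), eps,
                           BLOCK_N=triton.next_power_of_2(N))
    return y
```

The single pass wins because an unfused LayerNorm reads the row from HBM once per stage (mean, variance, normalize) while this kernel reads it once total — the op is bandwidth-bound, so traffic is the whole story.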


Week 11 — Vision-Language Model Serving

Concepts to understand:

Reading:

Hands-on:

Milestone: A 336×336 image produces 576 image tokens. For LLaVA-1.5-7B serving 16 concurrent users (one image + 64-token question each), compute effective prefill token count and compare KV cache memory to a text-only baseline.
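The milestone arithmetic as a sketch, assuming LLaVA-1.5-7B’s Vicuna-7B backbone shapes (32 layers, 32 MHA heads, head_dim 128, fp16 KV):

```python
users, img_tokens, q_tokens = 16, 576, 64
prefill = users * (img_tokens + q_tokens)
kv_per_token = 2 * 32 * 32 * 128 * 2
print(f"effective prefill tokens: {prefill}")             # 10,240
print(f"KV cache: {prefill * kv_per_token / 2**30:.1f} GiB "
      f"vs text-only {users * q_tokens * kv_per_token / 2**30:.2f} GiB")
# The 576 image tokens dominate: the multimodal KV cache is ~10x larger.
```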


Week 12 — Part I Capstone

Choose one track:

Deliverable checklist:


Part II — Hardware Mastery

Goal: Go from Triton user to an engineer who can reason about ISA-level hardware behavior, write CUTLASS kernels, profile with Nsight Compute, and understand the full interconnect stack.

Week 13 — Advanced CUDA: Warp Primitives & Tensor Cores

Concepts to understand:

Reading:

Hands-on:


Week 14 — CUDA Concurrency: Streams, Graphs, Async Copy

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 13–14): Implement a fused LayerNorm with warp-shuffle reduction and double-buffered cp.async pipelining. Profile in Nsight Compute and identify: (a) achieved bandwidth as % of theoretical, (b) dominant stall reason.


Week 15 — CUTLASS & CuTe

Concepts to understand:

Reading:

Hands-on:


Week 16 — Hopper Architecture: TMA & WGMMA

Concepts to understand:

Reading:

Hands-on:


Week 17 — FlashAttention-3 as Hopper Case Study

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 15–17): Write a fused FP8 attention kernel in CUTLASS 3.x with TMA + WGMMA + Ping-Pong scheduling. Profile vs. a Triton FlashAttention-2 implementation. Explain the performance delta using roofline analysis.


Week 18 — Profiling Methodology: Nsight Compute & Nsight Systems

Concepts to understand:

Reading:

Hands-on:


Week 19 — Alternative Accelerators

Concepts to understand:

Reading:

Hands-on:


Week 20 — Interconnects, NCCL, & Part II Capstone

Concepts to understand:

Reading:

Hands-on:

Milestone (Part II): End-to-end kernel engineering capstone. Choose one: (a) fused FP8 attention with TMA + WGMMA, (b) INT8/FP8 GEMM with per-block dequantization epilogue vs. cuBLAS, or (c) roofline-guided optimization of an underperforming open-source kernel with 3 distinct improvements verified in Nsight Compute.


Part III — Training at Scale

Goal: Extend beyond the ZeRO/Megatron basics to cover sequence parallelism, MoE training, data infrastructure, fault tolerance, efficient finetuning, and RLHF systems.

Week 21 — Sequence Parallelism & Context Parallelism

Concepts to understand:

Reading:

Hands-on:


Week 22 — MoE Training: Architecture & Routing

Concepts to understand:

Reading:

Hands-on:


Week 23 — MoE Training: Systems & Communication

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 22–23): For a 47B MoE model (Mixtral-style, 8 experts, top-2), compute: (a) active parameters per token, (b) VRAM per GPU with EP=8, (c) all-to-all communication volume per forward pass. Compare to a 13B dense model with equivalent per-token FLOPs.
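A sketch of part (a)–(c) using Mixtral-8x7B-style dimensions (d_model 4096, d_ff 14336, 32 layers, 3 FFN matrices per expert) — these shapes are assumptions taken from the public Mixtral config:

```python
layers, d_model, d_ff, experts, top_k = 32, 4096, 14336, 8, 2
per_expert = layers * 3 * d_model * d_ff          # ~5.6B params per expert
total_expert = experts * per_expert               # ~45B across all experts
shared = 47e9 - total_expert                      # attention, embeddings, norms
active = shared + top_k * per_expert
print(f"(a) active params/token: {active/1e9:.1f}B")  # ~13B, dense-equivalent

# (b) EP=8: each GPU holds the shared ~2B plus one expert -> ~15 GB in fp16.
# (c) All-to-all per MoE layer routes each token's activation to top_k
# experts and back (dispatch + combine):
tokens, bytes_fp16 = 8192, 2
a2a = layers * tokens * top_k * d_model * bytes_fp16 * 2
print(f"(c) all-to-all volume/forward: {a2a/1e9:.1f} GB for {tokens} tokens")
```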


Week 24 — Data Pipelines for Pretraining

Concepts to understand:

Reading:

Hands-on:


Week 25 — Fault Tolerance & Large-Scale Reliability

Concepts to understand:

Reading:

Hands-on:


Week 26 — Efficient Finetuning: LoRA, QLoRA, PEFT

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 24–26): Build a complete finetuning pipeline: streaming data loading → QLoRA training with async checkpointing → PEFT weight merge → evaluation. Document the memory and throughput profile at each stage.


Week 27 — RLHF Systems: SFT, Reward Models & PPO

Concepts to understand:

Reading:

Hands-on:


Week 28 — RLHF Algorithm Variants: DPO, GRPO & Beyond

Concepts to understand:

Reading:

Hands-on:

Milestone (Part III): Full RLHF pipeline: streaming data loading → QLoRA SFT → DPO training with async checkpointing → evaluation. Document GPU memory, throughput, and reward statistics at each stage.


Part IV — Advanced Inference

Goal: Go deep on long-context serving, next-generation quantization, model compression, advanced speculative decoding, disaggregated inference, and production-grade serving infrastructure.

Week 29 — Long-Context Inference: Position Encodings

Concepts to understand:

Reading:

Hands-on:


Week 30 — Long-Context Inference: KV Cache Management

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 29–30): For a LLaMA-3-70B model on 4×A100 with tensor parallelism, compute: (a) baseline KV cache size at seqlen=32K, batch=8, (b) memory savings under 50% H2O eviction, (c) latency impact of chunked prefill at different chunk sizes.
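Parts (a) and (b) as a sketch, with Llama-3-70B shape assumptions (80 layers, GQA with 8 KV heads, head_dim 128, fp16 cache):

```python
layers, kv_heads, d_head, bytes_fp16 = 80, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * d_head * bytes_fp16   # 320 KB/token
tokens = 32_768 * 8                                          # seqlen * batch
baseline = tokens * kv_per_token
print(f"(a) baseline KV: {baseline/2**30:.0f} GiB "
      f"({baseline/4/2**30:.0f} GiB per GPU under TP=4)")
print(f"(b) after 50% H2O eviction: {baseline/2/2**30:.0f} GiB")
# (c) Chunked prefill bounds per-iteration latency (smaller chunks
# interleave better with decode steps) but re-reads the growing KV
# cache once per chunk, lengthening end-to-end prefill.
```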


Week 31 — Advanced Quantization: FP8, GGUF, and Extreme Low-Bit

Concepts to understand:

Reading:

Hands-on:


Week 32 — Model Compression: Pruning & Distillation

Concepts to understand:

Reading:

Hands-on:


Week 33 — Advanced Speculative Decoding

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 31–33): Given a 70B model with 50% Wanda pruning + AQLM 2-bit quantization + EAGLE-2 speculative decoding: estimate theoretical tokens/sec vs. the uncompressed baseline. Identify which technique gives the highest throughput-per-quality-point tradeoff.
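For the speculative decoding term, the standard expected-length formula (from the original speculative decoding analysis) is sketched below; the acceptance rate is an assumption, not a measured EAGLE-2 number:

```python
# Expected target tokens per verification step with per-token acceptance
# rate a and draft length k: (1 - a**(k+1)) / (1 - a).
def expected_tokens_per_step(a, k):
    return (1 - a ** (k + 1)) / (1 - a)

a, k = 0.8, 4                      # assumed acceptance rate and draft depth
print(f"~{expected_tokens_per_step(a, k):.2f} tokens accepted per step")
# For a rough throughput estimate, combine multiplicatively: pruning and
# 2-bit quantization cut per-step weight traffic, speculation raises
# tokens per step. Quality losses do not compound as cleanly -- that is
# the tradeoff the milestone asks you to weigh.
```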


Week 34 — Disaggregated & Distributed Inference

Concepts to understand:

Reading:

Hands-on:


Week 35 — Production Serving Infrastructure

Concepts to understand:

Reading:

Hands-on:

Milestone (Part IV): End-to-end serving system: FP8-quantized 70B model on disaggregated prefill/decode infrastructure, with SLO-aware scheduling, autoscaling, and cost monitoring. Document latency, throughput, GPU utilization, and $/token achieved.


Part V — Emerging Architectures

Goal: Understand Mamba/SSMs, MoE inference, linear attention variants, diffusion systems, and advanced multimodal architectures — always through the systems lens of compute, memory, and parallelism.

Week 36 — State Space Models: Mamba

Concepts to understand:

Reading:

Hands-on:


Week 37 — Mamba-2, Triton Scan Kernels & Hybrid Architectures

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 36–37): Build a Mamba-2 inference server. Measure: (a) per-token VRAM at seqlen 1K vs. 64K vs. 256K (should be constant), (b) throughput vs. an equivalent-parameter transformer at each seqlen, (c) the crossover point where Mamba becomes faster.
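For part (a), a sketch of why decode-time state is constant in seqlen — the shapes are Mamba-2.8B-style assumptions (64 layers, d_model 2560, expansion 2, d_state 128):

```python
layers, d_model, expand, d_state, bytes_fp16 = 64, 2560, 2, 128, 2
d_inner = expand * d_model
ssm_state = layers * d_inner * d_state * bytes_fp16    # recurrent SSM state
conv_state = layers * d_inner * 4 * bytes_fp16         # short conv window
print(f"decode state: {(ssm_state + conv_state)/2**20:.0f} MiB, "
      f"independent of sequence length")
# A transformer's KV cache instead grows linearly with seqlen, so the
# throughput crossover in part (c) appears once that cache dominates
# memory traffic.
```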


Week 38 — MoE Architecture & Inference Serving

Concepts to understand:

Reading:

Hands-on:


Week 39 — Linear Attention & Hybrid Architectures

Concepts to understand:

Reading:

Hands-on:


Week 40 — Diffusion Model Systems: Samplers & Architectures

Concepts to understand:

Reading:

Hands-on:


Week 41 — Diffusion Serving & Distributed Execution

Concepts to understand:

Reading:

Hands-on:

Milestone (Weeks 40–41): For a DiT-XL/2 model serving at 256 requests/sec, design an architecture: (a) step count vs. quality tradeoff using DPM-Solver, (b) batching strategy with variable CFG, (c) multi-GPU with DistriFusion — estimate total GPU count needed to hit 100ms TTFT P99.
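A rough capacity model to structure the estimate — every number below except the 256 req/sec from the prompt is a placeholder assumption, not a measured DiT-XL/2 figure:

```python
import math

req_rate  = 256      # requests/sec (from the prompt)
steps     = 20       # DPM-Solver step budget (assumed quality/speed point)
step_time = 0.015    # sec per denoising step at this batch size (assumed)
batch     = 8        # images per GPU batch (assumed; CFG doubles the work)

latency_per_image = steps * step_time           # denoising steps are sequential
throughput_per_gpu = batch / latency_per_image  # images/sec/GPU
gpus = math.ceil(req_rate / throughput_per_gpu)
print(f"~{gpus} GPUs at {throughput_per_gpu:.0f} img/s/GPU; "
      f"per-image latency {latency_per_image*1000:.0f} ms")
# DistriFusion then shards a single image across GPUs to pull latency
# toward the 100 ms P99 target, at the cost of extra communication.
```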


Week 42 — Advanced Multimodal: Video & Audio

Concepts to understand:

Reading:

Hands-on:

Milestone (Part V): Implement a Mamba-2 parallel scan kernel and benchmark it end-to-end: compare throughput, memory, and quality against FlashAttention-2 at seqlens from 2K to 256K. Write a one-page analysis of which architecture you’d choose for which use case.


Part VI — Compilers & Infrastructure

Goal: Understand the full compiler stack from MLIR down to PTX, the internals of automatic differentiation, memory allocation, and production MLOps infrastructure.

Week 43 — Compiler IR Theory & MLIR

Concepts to understand:

Reading:

Hands-on:


Week 44 — TVM, XLA & Autotuning

Concepts to understand:

Reading:

Hands-on:


Week 45 — Automatic Differentiation Systems

Concepts to understand:

Reading:

Hands-on:


Week 46 — Memory Management & Allocators

Concepts to understand:

Reading:

Hands-on:


Week 47 — NCCL Deep Dive

Concepts to understand:

Reading:

Hands-on:


Week 48 — Deployment & Production Infrastructure

Concepts to understand:

Reading:

Hands-on:


Week 49 — Advanced Profiling & Distributed Debugging

Concepts to understand:

Reading:

Hands-on:


Week 50 — Integrating the Compiler Stack

Concepts to understand:

Reading:

Hands-on:


Week 51 — Reading Week & Integration

A structured review week with no new material. Revisit the notes, exercises, and milestone answers from the hardest weeks.

Suggested review targets:


Week 52 — Year-End Capstone

Goal: Ship something production-quality that integrates at least 3 domains from the curriculum. The deliverable should be something you’d be comfortable presenting as a portfolio piece.

Choose one track:

Deliverable checklist:


Reference Lists

Canonical Papers by Week

Paper Topic Week
Attention Is All You Need Transformer prerequisite
Scaling Laws for Neural Language Models Scaling 3
FlashAttention Efficient attention 4
FlashAttention-2 Efficient attention 4
GQA KV compression 4
DeepSeek-V2 §3 MLA KV compression 4
LLM.int8() Quantization 5
GPTQ Quantization 5
AWQ Quantization 5
Speculative Decoding Inference 5
PagedAttention / vLLM Serving 6
Orca Serving 6
SGLang Serving 6
Megatron-LM Distributed 7
Megatron 3D Distributed 7
ZeRO Distributed 8
Mixed Precision Training Training 8
Activation Recomputation Training 8
Triton Kernels 9
PyTorch 2 Compilation 10
CLIP Multimodal 11
LLaVA Multimodal 11
FlashAttention-3 Hardware 17
Ring Attention Seq parallelism 21
DeepSpeed Ulysses Seq parallelism 21
Switch Transformers MoE 22
GShard MoE 22
MegaBlocks MoE 23
Chinchilla Scaling 24
LoRA Finetuning 26
QLoRA Finetuning 26
DPO RLHF 28
DeepSeekMath (GRPO) RLHF 28
YaRN Long context 29
SnapKV KV eviction 30
QuIP# Quantization 31
AQLM Quantization 31
Wanda Pruning 32
EAGLE Spec decode 33
EAGLE-2 Spec decode 33
DistServe Disaggregated 34
Mamba SSMs 36
Mamba-2 / SSD SSMs 37
GLA Linear attn 39
RWKV Linear attn 39
DiT Diffusion 40
DPM-Solver Diffusion 40
Flow Matching Diffusion 40
GSPMD Compilers 44
Demystifying NCCL Networking 47

Key Blogs & References

Resource What It’s Good For Link
Making Deep Learning Go Brrrr Compute/memory bottleneck taxonomy https://horace.io/brrr_intro.html
Transformer Math 101 (EleutherAI) FLOP/memory arithmetic reference https://blog.eleuther.ai/transformer-math/
JAX Scaling Book Roofline, sharding, distributed training https://jax-ml.github.io/scaling-book/
Stas Bekman’s ML Engineering Practical distributed training cookbook https://github.com/stas00/ml-engineering
GPU MODE Lectures Advanced CUDA, CUTLASS, Triton https://github.com/gpu-mode/lectures
Lilian Weng’s blog Broad ML coverage, well-cited https://lilianweng.github.io
Sebastian Raschka’s newsletter LLM research summaries https://magazine.sebastianraschka.com
CutlassAcademy CUTLASS 3.x tutorials https://github.com/MekkCyber/CutlassAcademy
flash-linear-attention Triton kernels for SSMs/linear attn https://github.com/fla-org/flash-linear-attention
LLM Inference Optimization Papers Curated inference paper list https://github.com/chenhongyu2048/LLM-inference-optimization-paper

Last updated: 2026-03-15. Revisit pacing at Part boundaries.