Phase IV — Inference Systems

Weeks 16–20 · ~25 hrs

Goal: Understand how LMs are served efficiently at scale. This is the engineering that determines whether a deployed model is fast enough to be usable, and it is increasingly important for Parameter Golf submissions that require low-latency generation. By the end, you will understand the KV cache from first principles, be able to explain FlashAttention’s tiling algorithm, and know how speculative decoding and continuous batching work.

Week 16 primary: Implement KV caching from scratch; read GQA paper (Ainslie et al., 2023)

Week 17 primary: Dao et al., FlashAttention paper — §1–3 (algorithm + memory analysis)

Week 18 primary: Leviathan et al., Speculative decoding paper

Weeks 19–20 primary: Kwon et al., vLLM / PagedAttention paper + quantized inference with llama.cpp/GGUF


Week 16 — The KV Cache

Concepts to understand:

Coding tasks:

Milestone

Expected speedup from the KV cache: at seq_len=256, caching provides roughly a 40× per-token speedup; at seq_len=64, roughly 15×. The quadratic vs. linear scaling is stark: without caching, generating the 256th token means re-running the full forward pass over all 256 positions, recomputing attention for every earlier token (the same cost as prefill); with caching, it requires only a single attention operation between the new token's query and the 255 cached K/V pairs plus its own. If your cached and uncached outputs are not identical, the most likely bug is an incorrect causal mask during cached decoding — the mask shape changes from [T, T] during prefill to [1, T_cached + 1] during cached generation, and since the new token is allowed to attend to every cached position, the single-token step effectively needs no masking at all.
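A minimal sketch you can check your implementation against, assuming a standard multi-head attention layout; the tensor names and toy shapes here are illustrative, not taken from any particular codebase. It shows the prefill pass with a full [T, T] mask and a cached decode step that must reproduce the last row of the uncached output exactly.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal_mask=None):
    # q: [B, H, Tq, Dh], k/v: [B, H, Tk, Dh]
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if causal_mask is not None:
        scores = scores.masked_fill(~causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

B, H, T, Dh = 1, 4, 8, 16
q, k, v = (torch.randn(B, H, T, Dh) for _ in range(3))

# Prefill: full [T, T] causal mask over the whole prompt.
prefill_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
full_out = attention(q, k, v, prefill_mask)

# Cached decode of the last position: the cache holds K/V for the first T-1 tokens;
# the new token attends to all cached positions plus itself, so no mask is needed
# (equivalently, a [1, T] mask of all True).
k_cache, v_cache = k[:, :, :T - 1], v[:, :, :T - 1]
new_q, new_k, new_v = q[:, :, T - 1:], k[:, :, T - 1:], v[:, :, T - 1:]
cached_out = attention(new_q,
                       torch.cat([k_cache, new_k], dim=2),
                       torch.cat([v_cache, new_v], dim=2))  # [B, H, 1, Dh]

# The cached step must match the last position of the full (uncached) output.
print(torch.allclose(full_out[:, :, -1:], cached_out, atol=1e-6))
```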


Week 17 — FlashAttention: IO-Optimal Attention

Primary resource: Dao et al., FlashAttention — §1 (motivation), §2 (GPU memory hierarchy), §3 (algorithm); skip §4+ for now

Concepts to understand:

Coding tasks:

Milestone

At seq_len=4096, batch_size=8, d_model=512, n_heads=8: standard attention requires materializing an [8, 8, 4096, 4096] attention score tensor = 8 × 8 × 4096² × 4 bytes ≈ 4.3GB of HBM — exceeding most consumer GPU memory. FlashAttention uses <100MB of extra memory for the same configuration because it never materializes the full N×N matrix. At seq_len=1024, the score tensor shrinks 16× to roughly 270MB, so standard attention fits comfortably, but FlashAttention is still 2–4× faster because it performs far fewer HBM reads and writes. If torch.nn.functional.scaled_dot_product_attention is not faster than manual attention on your hardware, check your CUDA and PyTorch versions — the fused FlashAttention backend requires PyTorch 2.0+ and CUDA 11.6+.
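A pure-PyTorch sketch of the online-softmax accumulation at the heart of FlashAttention's tiling (§3 of the paper): keys and values are processed block by block, so the full [Tq, Tk] score matrix is never stored. This is a readability-first illustration of the technique, not the fused kernel — it omits causal masking, query tiling, and all the SRAM-level considerations that make the real thing fast.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Single-head attention computed over key/value blocks with the online
    softmax trick: only one [Tq, block_size] score block exists at a time."""
    Tq, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((Tq, 1), float("-inf"), device=q.device)
    row_sum = torch.zeros(Tq, 1, device=q.device)
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]              # [Bk, d]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                   # [Tq, Bk] — one block wide
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        # Rescale the running numerator and denominator to the new row max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))
```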


Week 18 — Speculative Decoding

Primary resource: Leviathan et al., Speculative decoding — §1–3 (algorithm + acceptance analysis)

Concepts to understand:

Coding tasks:

Milestone

Expected acceptance rate at T=1.0 with a draft model trained to minimize KL divergence from the target: ~70–80% per token. With a randomly initialized draft model (worst case), acceptance is essentially zero. The acceptance rate is the dominant factor in speculative decoding speedup: with per-token acceptance rate α and K drafted tokens, the expected number of tokens emitted per target-model call is (1 − α^(K+1)) / (1 − α), so α = 0.7 with K = 4 gives ~2.8 tokens per call, while α = 0.5 gives only ~1.9. Measure acceptance rate first, before measuring wall-clock speedup — if acceptance is low, the draft model needs to be improved or retrained, not the decoding algorithm.
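Two small helpers, as a sketch to build from: the expected-tokens formula above (which assumes i.i.d. per-token acceptance, the paper's simplifying assumption), and the acceptance/resampling step from Leviathan et al. The function names and arguments are my own; p_target and q_draft are assumed to be full next-token distributions at the drafted position.

```python
import torch

def expected_tokens_per_target_call(alpha: float, k: int) -> float:
    """E[tokens emitted per target forward pass] with acceptance rate alpha and
    k drafted tokens: (1 - alpha**(k+1)) / (1 - alpha)."""
    return k + 1 if alpha >= 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.8):
    print(f"alpha={alpha}: {expected_tokens_per_target_call(alpha, k=4):.2f} tokens/call")

def accept_or_resample(p_target, q_draft, drafted_token):
    """One modified-rejection-sampling step: accept x ~ q with probability
    min(1, p(x)/q(x)); on rejection, resample from the renormalized residual
    max(0, p - q). This preserves the target distribution exactly."""
    x = drafted_token
    if torch.rand(()) < torch.clamp(p_target[x] / q_draft[x], max=1.0):
        return x, True
    residual = torch.clamp(p_target - q_draft, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item(), False
```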


Week 19 — Continuous Batching and PagedAttention

Primary resource: Kwon et al., vLLM / PagedAttention — §1–4

Concepts to understand:

Coding tasks:

Milestone

Static batching with padding at 8 concurrent requests of varying lengths (e.g., half finish in 50 tokens, half in 200): the batch runs until the longest request finishes, so each short request occupies its slot for 200 decode steps while doing only 50 steps of useful work — 75% of its slot-time wasted. Continuous batching in the same scenario: short requests leave the batch at step 50, freeing slots for new requests, and GPU utilization stays high throughout. Expected throughput improvement: 2–3× at 8 concurrent requests of highly variable length. With uniform request lengths, continuous batching provides negligible benefit — all requests finish at the same time anyway.
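A toy step-level simulation of the scenario above, assuming 8 slots, request lengths of 50 or 200 decode steps, and an always-full queue. It only counts slot-steps that decode a needed token, so it understates real-world gains (no prefill, memory, or scheduling effects), but it makes the utilization difference concrete.

```python
import random

def simulate(num_steps=10_000, slots=8, continuous=True, lengths=(50, 200)):
    """Return the fraction of slot-steps that do useful decoding work."""
    remaining = [random.choice(lengths) for _ in range(slots)]
    useful = 0
    for _ in range(num_steps):
        for i in range(slots):
            if remaining[i] > 0:
                remaining[i] -= 1
                useful += 1
        if continuous:
            # Finished requests are replaced immediately from the queue.
            remaining = [random.choice(lengths) if r == 0 else r for r in remaining]
        elif all(r == 0 for r in remaining):
            # Static batching: refill only once the whole batch has finished.
            remaining = [random.choice(lengths) for _ in range(slots)]
    return useful / (num_steps * slots)

print("static    :", round(simulate(continuous=False), 2))
print("continuous:", round(simulate(continuous=True), 2))
```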


Week 20 — Quantized Inference with GGUF and llama.cpp

Concepts to understand:

Coding tasks:

Milestone

GGUF Q4_K_M and AutoGPTQ INT4 may produce different perplexity despite the same nominal bit depth. GGUF K-quants use mixed precision (higher bits on the first/last layers and on specific attention layers that are more quantization-sensitive); AutoGPTQ uses uniform 4-bit across all layers by default. The per-layer sensitivity insight — that some layers can tolerate aggressive quantization while others cannot — is the most important practical heuristic in model compression. Measuring per-layer quantization error and allocating more bits to sensitive layers is the approach used in the highest-scoring Parameter Golf quantization entries.
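A hedged sketch of the per-layer sensitivity measurement described above: round-trip each linear layer's weights through symmetric 4-bit quantization and rank layers by reconstruction error. The quantizer and error metric here are deliberately simple placeholders — real pipelines (GPTQ, llama.cpp's K-quants) use calibration data and activation-aware error, not plain weight RMSE — but the ranking is usually enough to spot which layers deserve extra bits.

```python
import torch

def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    # Per-output-channel symmetric quantization to 4 bits (integer levels -8..7).
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
    return (w / scale).round().clamp(-8, 7) * scale

def layer_sensitivity(model: torch.nn.Module):
    """Return (layer_name, rmse) pairs for every Linear layer, worst first."""
    errors = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.detach().float()
            rmse = (w - fake_quantize_int4(w)).pow(2).mean().sqrt().item()
            errors.append((name, rmse))
    return sorted(errors, key=lambda t: t[1], reverse=True)

# Usage: layers at the top of the list are candidates for a higher-precision format.
# for name, err in layer_sensitivity(model)[:10]:
#     print(f"{err:.5f}  {name}")
```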