Phase III — Quantization & Compression
Weeks 11–15 · ~25 hrs
Goal: Understand and implement the quantization techniques that dominate the Parameter Golf leaderboard. GPTQ, FP8, and INT6 QAT are responsible for the majority of top-50 entries. By the end, you will be able to apply post-training quantization (PTQ) and quantization-aware training (QAT) to any transformer, and understand exactly what the accuracy-vs-compression trade-off looks like in practice.
Week 11 primary: Dettmers et al., LLM.int8() paper + quantization fundamentals from scratch
Week 12 primary: Frantar et al., GPTQ paper + AutoGPTQ library
Week 13 primary: PyTorch FX quantization tutorial (QAT section)
Week 14 primary: torchao quantization APIs (int4/NF4 weight-only)
Week 15 primary: Dettmers et al., QLoRA paper + SVD-based low-rank compression
Week 11 — Quantization Fundamentals
Concepts to understand:
Coding tasks:
Expected results: INT8 post-training quantization (PTQ) of weights only (keeping activations in FP16) on a small LM should shrink the model file by ~3.5×, from 4 bytes/param to ~1.15 bytes/param including per-channel scales. Perplexity degradation should be under 1% for a well-calibrated INT8 quantization. If perplexity degrades by more than 5%, check for outlier channels: a single channel with weights 100× larger than the median forces per-tensor quantization to devote nearly its entire range to that one channel, leaving every other channel only a handful of the 256 INT8 levels — roughly 99% of the representational capacity is wasted.
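A minimal sketch of that failure mode, assuming PyTorch — quantize_weight_int8 is an illustrative helper name, not a library function. It plants one 100× outlier channel and compares per-tensor against per-channel reconstruction error:

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-channel INT8: one scale per output channel (row)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(768, 768)
w[0] *= 100.0  # plant one outlier channel, 100x the typical magnitude

# Per-tensor: a single scale for the whole matrix; the outlier sets the range.
s_tensor = w.abs().max() / 127.0
w_tensor = torch.clamp(torch.round(w / s_tensor), -127, 127) * s_tensor

# Per-channel: the outlier only inflates its own row's scale.
q, s = quantize_weight_int8(w)
w_channel = q.float() * s

print("per-tensor  MSE:", torch.mean((w - w_tensor) ** 2).item())
print("per-channel MSE:", torch.mean((w - w_channel) ** 2).item())
```

The per-tensor error should be orders of magnitude larger — exactly the symptom described above.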
Week 12 — GPTQ: Post-Training Quantization with Second-Order Information
Primary resource: Frantar et al., GPTQ paper — read §1–3 carefully (they develop the algorithm); skim §4 (results) for the calibration setup
Concepts to understand:
Coding tasks:
Expected results for a 100M-parameter model: 4-bit GPTQ should achieve ~4× size reduction from FP16 (from ~200MB to ~50MB) with less than 2% perplexity degradation when calibrated on 128 sequences from the training distribution. Naive 4-bit rounding (round-to-nearest without compensation) will show 20–50% perplexity degradation on the same model — the improvement from GPTQ’s second-order compensation is the entire story. For models below 100M parameters, GPTQ’s advantage is smaller because the models are already well-conditioned; it becomes critical at 1B+ parameters.
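To see why the compensation is the whole story, here is a deliberately simplified sketch of the GPTQ inner loop next to round-to-nearest, assuming PyTorch. It follows the column-by-column error-propagation structure of Frantar et al.'s algorithm but omits the Cholesky factorization, lazy batching, and grouping of the real implementation; quant_rtn and quant_gptq_simplified are illustrative names:

```python
import torch

def quant_rtn(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric round-to-nearest per output channel -- the naive baseline."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

def quant_gptq_simplified(w: torch.Tensor, x: torch.Tensor,
                          bits: int = 4, damp: float = 0.01) -> torch.Tensor:
    """w: [rows, cols] weights; x: [n_samples, cols] calibration inputs."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    h = x.T @ x                                  # Hessian proxy from calibration data
    h += damp * torch.diag(h).mean() * torch.eye(h.shape[0])  # damping for stability
    hinv = torch.linalg.inv(h)
    w = w.clone()
    for j in range(w.shape[1]):                  # quantize one column at a time
        q = torch.clamp(torch.round(w[:, j] / scale[:, 0]), -qmax, qmax)
        err = (w[:, j] - q * scale[:, 0]) / hinv[j, j]
        # Spread this column's quantization error onto the not-yet-quantized columns.
        w[:, j:] -= err.unsqueeze(1) * hinv[j, j:].unsqueeze(0)
        w[:, j] = q * scale[:, 0]
    return w

w0, xc = torch.randn(64, 128), torch.randn(256, 128)
for name, wq in [("RTN   ", quant_rtn(w0)),
                 ("GPTQ~ ", quant_gptq_simplified(w0, xc))]:
    print(name, "output MSE:", torch.mean((xc @ (w0 - wq).T) ** 2).item())
```

Note the error is measured on layer outputs over the calibration inputs, not on the weights — that is the quantity GPTQ's objective actually minimizes.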
Week 13 — Quantization-Aware Training (QAT)
Concepts to understand:
Coding tasks:
Expected results: a model trained with INT8 QAT should match the FP32 training loss within 0.5%, and often within 0.1%. The key difference from PTQ: the model has learned to “work around” quantization noise during training, whereas PTQ applies quantization to a model that was never exposed to it. If QAT val loss is worse than FP32 by more than 1%, check: (1) are you applying fake quantization to all weight matrices, including the embedding and output projection? (2) is the straight-through estimator (STE) implemented correctly — gradients should not be zero? (3) is your quantization scale per-channel or per-tensor? Per-channel QAT is significantly more stable.
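A minimal sketch of fake quantization with an STE, assuming PyTorch. FakeQuantSTE and QATLinear are illustrative names, not PyTorch's own QAT API (which lives in torch.ao.quantization); the final assertion checks point (2) from the list above:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake-quantize weights in the forward pass; straight-through backward."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, bits: int = 8) -> torch.Tensor:
        qmax = 2 ** (bits - 1) - 1
        # Per-channel scale, as the stability note above recommends.
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # STE: treat round() as the identity, so gradients pass through unchanged.
        return grad_out, None

class QATLinear(torch.nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(
            x, FakeQuantSTE.apply(self.weight), self.bias
        )

# Sanity check (2): gradients through the fake-quant op must not be zero.
layer = QATLinear(16, 16)
layer(torch.randn(4, 16)).sum().backward()
assert layer.weight.grad is not None and layer.weight.grad.abs().sum() > 0
```

Without the custom backward, the round() would produce zero gradients almost everywhere and training would stall — which is the entire reason the STE exists.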
Week 14 — INT6 and Extreme Quantization
Concepts to understand:
What torchao does under the hood (and where torch.fx fits)
When you call torchao.quantize_(model, int4_weight_only()), torchao walks the module tree, finds every nn.Linear, and swaps its weight for a quantized tensor subclass — no module surgery on your part, and no graph tracing either. PyTorch's FX graph mode quantization (torch.ao.quantization.quantize_fx) is the approach that does trace: torch.fx.symbolic_trace(model) produces a GraphModule whose .graph attribute holds Node objects with node.op ∈ {'placeholder', 'get_attr', 'call_function', 'call_method', 'call_module', 'output'}, and the quantizer rewrites the nn.Linear call-sites in that graph. If FX-based quantization fails on a layer, it is usually because symbolic tracing hit a data-dependent branch — run torch.fx.symbolic_trace(model) yourself and look for a TraceError. Fix it by providing concrete_args or by marking the offending function with torch.fx.wrap.
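A quick sketch of the graph side, using only the torch.fx calls described above — the three-layer model here is just a stand-in:

```python
import torch
import torch.fx

model = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)
gm = torch.fx.symbolic_trace(model)   # GraphModule with a .graph of Nodes

for node in gm.graph.nodes:
    # node.op is one of: 'placeholder', 'get_attr', 'call_function',
    # 'call_method', 'call_module', 'output'
    if node.op == "call_module" and isinstance(
        gm.get_submodule(node.target), torch.nn.Linear
    ):
        print("Linear call-site a quantizer could rewrite:", node.target)
```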
Coding tasks:
Expected Pareto curve for a 50M-parameter LM: FP32 baseline at 32 bits/param (by definition); INT8 PTQ at ~8 bits/param with <1% perplexity loss; INT4 GPTQ at ~4 bits/param with ~2–5% perplexity loss; INT3 GPTQ with group-128 at ~3.25 bits/param (3 bits plus per-group scale overhead) with 5–15% perplexity loss. NF4 should consistently beat INT4 by 1–2% in perplexity at the same bit budget because neural network weights are approximately normally distributed — the NF4 grid is specifically designed for this distribution.
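To make the NF4 intuition concrete, here is a simplified sketch that builds a normal-quantile codebook and compares it against a uniform 4-bit grid on Gaussian weights, assuming PyTorch. The true NF4 values in the QLoRA paper are constructed slightly differently (they pin an exact zero level), so normal_float_grid is an approximation for illustration only:

```python
import torch

def normal_float_grid(bits: int = 4) -> torch.Tensor:
    """Codebook of 2**bits levels placed at quantiles of N(0, 1)."""
    n = 2 ** bits
    normal = torch.distributions.Normal(0.0, 1.0)
    # Evenly spaced probabilities, avoiding 0 and 1 where icdf diverges.
    p = torch.linspace(0.5 / n, 1.0 - 0.5 / n, n)
    grid = normal.icdf(p)
    return grid / grid.abs().max()        # normalize levels into [-1, 1]

def quantize_to_grid(w: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max()                 # per-tensor scale into [-1, 1]
    idx = ((w.flatten().unsqueeze(1) / scale) - grid).abs().argmin(dim=1)
    return grid[idx].reshape(w.shape) * scale

w = torch.randn(256, 256)                 # Gaussian weights: NF4's sweet spot
nf4 = quantize_to_grid(w, normal_float_grid(4))
int4 = quantize_to_grid(w, torch.linspace(-1.0, 1.0, 16))  # uniform 4-bit grid
print("NF4-style MSE:", torch.mean((w - nf4) ** 2).item())
print("uniform   MSE:", torch.mean((w - int4) ** 2).item())
```

The quantile grid spends its 16 levels where Gaussian weights actually live (near zero), which is why it wins at the same bit budget.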
Week 15 — Low-Rank Compression and the Full Compression Pipeline
Concepts to understand:
Coding tasks:
The singular value spectrum of a well-trained weight matrix has a characteristic “elbow”: a few large singular values followed by a long tail of small ones. If the spectrum is flat (all singular values of similar magnitude), the matrix is effectively full-rank and SVD compression will not work well — any meaningful compression will cost large perplexity degradation. Matrices that benefit most from SVD: the output projection of the MLP (often low effective rank after training) and the Q/K projection matrices in attention (the attention mechanism often learns a low-dimensional subspace). Matrices that benefit least: the value projection and the MLP input projection, which tend to be higher-rank.
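A sketch of the basic operation, assuming PyTorch: factor a layer's weight with a truncated SVD and replace it with two smaller linear layers (svd_compress_linear is an illustrative name). The printed spectral-energy number is exactly the “elbow” diagnostic described above:

```python
import torch

def svd_compress_linear(layer: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
    w = layer.weight.data                          # [out_features, in_features]
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    # The "elbow" diagnostic: how much spectral energy the top-r values keep.
    energy = (s ** 2).cumsum(0) / (s ** 2).sum()
    print(f"rank {rank} keeps {energy[rank - 1]:.1%} of spectral energy")
    a = torch.nn.Linear(layer.in_features, rank, bias=False)
    b = torch.nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    a.weight.data = torch.diag(s[:rank]) @ vh[:rank]   # [rank, in]
    b.weight.data = u[:, :rank]                        # [out, rank]
    if layer.bias is not None:
        b.bias.data = layer.bias.data
    return torch.nn.Sequential(a, b)

layer = torch.nn.Linear(768, 768)
compressed = svd_compress_linear(layer, rank=128)
x = torch.randn(4, 768)
print("reconstruction MSE:", torch.mean((layer(x) - compressed(x)) ** 2).item())
print("compression:", (768 * 768) / (2 * 128 * 768), "x")
```

Note that a freshly initialized Linear like this one has a nearly flat spectrum, so the energy printout will itself demonstrate the flat-spectrum warning above; run it on a trained checkpoint to see the elbow.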