Goal: Understand and implement the non-standard architectures that appear in the top Parameter Golf submissions. Mamba / SSMs, Test-Time Training (TTT) layers, depth recurrence, and the Muon optimizer’s relationship to orthogonal gradient updates are the key ideas. By the end of Week 30, you will have run a systematic Parameter Golf campaign with a documented experiment log.
Week 27 primary: Sun et al., TTT paper + TTT-linear implementation
Week 28 primary: Dehghani et al., Universal Transformers + depth recurrence variants in Parameter Golf submissions
Week 29 primary: KellerJordan, modded-nanoGPT — deep read of the full training pipeline; its Muon optimizer and architecture choices make it the top-scoring open-source entry
Week 30: Parameter Golf campaign — no new primary resources; systematic experimentation
Week 26 — State Space Models: Mamba
Primary resources:
- Gu & Dao, Mamba paper — §1 (motivation), §3 (selective SSM mechanism), §4 (hardware-efficient algorithm); skip §2 (prior SSM background) unless you want the deeper theory
- mamba-minimal — ~200 lines of clean PyTorch; read end to end
Concepts to understand:
Coding tasks:
Milestone
At seq_len=256 and seq_len=512, the transformer and Mamba should have similar generation speed (the quadratic factor is small at short contexts). At seq_len=4096, transformer generation is ~16× slower than at seq_len=1024 (quadratic scaling), while Mamba is ~4× slower (linear scaling). The memory difference is more dramatic: at seq_len=4096 with batch=8, the transformer's attention matrices (summed across heads) require ~2GB, while Mamba's state requires only d_state × d_model × 2 bytes × batch = 16 × 256 × 2 × 8 = 65,536 bytes ≈ 65KB, roughly four orders of magnitude smaller. For Parameter Golf tasks that evaluate on long sequences, Mamba-based architectures are parameter-efficient in a way transformers fundamentally cannot be.
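The milestone's memory arithmetic can be sanity-checked directly. A minimal sketch in Python, assuming fp16 activations and 8 attention heads (the head count and dtype are assumptions, not given above):

```python
def attn_matrix_bytes(batch, n_heads, seq_len, bytes_per_el=2):
    # One (seq_len x seq_len) attention score matrix per head, per batch element.
    return batch * n_heads * seq_len * seq_len * bytes_per_el

def mamba_state_bytes(batch, d_model, d_state, bytes_per_el=2):
    # Mamba carries a fixed (d_model x d_state) recurrent state per batch element.
    return batch * d_model * d_state * bytes_per_el

print(attn_matrix_bytes(batch=8, n_heads=8, seq_len=4096) / 2**30)  # 2.0 GiB
print(mamba_state_bytes(batch=8, d_model=256, d_state=16) / 2**10)  # 64.0 KiB
```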
Week 27 — Test-Time Training (TTT) Layers
Primary resource: Sun et al., TTT paper — §1–4; focus on the TTT-Linear and TTT-MLP variants
Concepts to understand:
Coding tasks:
Milestone
TTT-Linear should match or slightly beat transformer perplexity at the same parameter count on short-to-medium contexts (up to ~512 tokens), because the dynamic hidden state provides more effective context compression than the fixed-size state of a standard RNN while using fewer parameters than full attention. On longer contexts, TTT should scale better than transformer (linear in sequence length) but may not match Mamba’s efficiency. Training step time for TTT is typically 1.5–2.5× slower than transformer due to the inner gradient computation — this is a real cost and should be included when comparing parameter efficiency on a compute budget.
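For intuition on where the extra step time comes from, here is a minimal, unbatched sketch of a TTT-Linear-style layer: the hidden state is a linear map W that takes one SGD step of a self-supervised reconstruction loss per token. The three view projections, zero-initialized state, and inner learning rate eta are illustrative choices; the paper's actual implementation computes this in a batched dual form rather than a Python loop.

```python
import torch
import torch.nn as nn

class TTTLinearSketch(nn.Module):
    def __init__(self, d_model, eta=0.1):
        super().__init__()
        self.theta_k = nn.Linear(d_model, d_model, bias=False)  # "train view"
        self.theta_v = nn.Linear(d_model, d_model, bias=False)  # "label view"
        self.theta_q = nn.Linear(d_model, d_model, bias=False)  # "test view"
        self.eta = eta  # inner-loop step size (assumed value)

    def forward(self, x):                                  # x: (batch, seq, d_model)
        b, t, d = x.shape
        W = torch.zeros(b, d, d, device=x.device)          # per-sequence inner state
        outs = []
        for i in range(t):
            k, v, q = self.theta_k(x[:, i]), self.theta_v(x[:, i]), self.theta_q(x[:, i])
            # The gradient of 0.5 * ||k W - v||^2 w.r.t. W is the outer product of k and err.
            err = torch.einsum("bd,bde->be", k, W) - v
            W = W - self.eta * torch.einsum("bd,be->bde", k, err)   # one SGD step
            outs.append(torch.einsum("bd,bde->be", q, W))           # read with updated state
        return torch.stack(outs, dim=1)
```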
Week 28 — Depth Recurrence and Shared-Weight Transformers
Expected result: a fully shared-weight model with L=8 applications of a single block achieves significantly worse perplexity than an 8-layer model with independent weights when the shared block is the same size as each independent layer (the independent model then has roughly 8× the parameters). The fair comparison is a shared-weight model with one large block run 8 times vs. 8 smaller blocks with independent weights at the same total parameter count. In that comparison the shared-weight model often wins at shorter contexts, because the large single block has more capacity per pass. The Parameter Golf insight: shared weights let you get “depth for free”, i.e. a model counted as having one block’s parameters but achieving 8-block-style expressivity through iteration, as sketched below.
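A depth-recurrent wrapper is tiny. A minimal sketch (the `block` argument stands in for any standard transformer block; the names and default iteration count are illustrative, not taken from a specific submission):

```python
import torch.nn as nn

class SharedDepthModel(nn.Module):
    def __init__(self, block: nn.Module, n_iters: int = 8):
        super().__init__()
        self.block = block        # a single set of weights, counted once
        self.n_iters = n_iters    # "depth" obtained by iteration, not parameters

    def forward(self, x):
        for _ in range(self.n_iters):
            x = self.block(x)     # same weights on every pass
        return x
```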
Week 29 — The modded-nanoGPT Reference Implementation
Primary resource: KellerJordan, modded-nanoGPT — read every file; this is the most engineering-dense reference implementation for Parameter Golf techniques
Concepts to understand:
Coding tasks:
Milestone
The modded-nanoGPT architecture choices compound: each individual modification provides a small improvement (typically 0.01–0.05 perplexity), but together they produce a model that is 5–10% more efficient per parameter than standard nanoGPT. The ablation study will show that Muon > QK norm > ReLU² > logit soft-cap in terms of individual contribution. If the Muon improvement is not reproducible (variance across seeds is larger than the signal), your learning rate for Muon is likely wrong: Muon requires a different, separately tuned LR than AdamW, because its orthonormalized updates have unit spectral norm and a much smaller per-element magnitude than Adam's normalized updates (in practice the Muon LR for the hidden matrices ends up considerably larger than a typical AdamW LR).
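To connect Muon to the orthogonal-gradient-update idea from this curriculum's goal: Muon replaces each 2-D weight's momentum-averaged gradient with an approximately orthogonalized version before applying the LR. A minimal sketch using the plain cubic Newton-Schulz iteration (modded-nanoGPT uses a tuned quintic with different coefficients, and keeps embeddings and the head on AdamW):

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    # Push the singular values of G toward 1 via X <- 1.5*X - 0.5*(X @ X.T) @ X.
    X = G / (G.norm() + eps)           # Frobenius norm <= 1 implies spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

# Inside the optimizer step, the orthogonalized matrix replaces the raw momentum
# buffer for each hidden 2-D weight before the learning rate is applied.
```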
Week 30 — Parameter Golf Campaign
No new primary resources — systematic experimentation using all techniques from the curriculum.
Pre-campaign analysis:
Experiment log (fill in during experimentation):
| # | Technique | Params before | Params after | BPB before | BPB after | Δ params | Δ BPB | Notes |
|---|-----------|---------------|--------------|------------|-----------|----------|-------|-------|
| 0 | Baseline  |               |              |            |           | —        | —     |       |
| 1 |           |               |              |            |           |          |       |       |
| 2 |           |               |              |            |           |          |       |       |
| 3 |           |               |              |            |           |          |       |       |
| 4 |           |               |              |            |           |          |       |       |
| 5 |           |               |              |            |           |          |       |       |
Suggested experimentation sequence (one change at a time):
Weight tying: tie wte and lm_head; essentially free parameter savings when the vocabulary is large (see the sketch after this list)
Vocab optimization: train a custom BPE tokenizer at 2048–4096 tokens; measure the embedding-parameter savings against the perplexity cost of weaker tokenization (a smaller vocabulary means more tokens per byte, so a fixed context window covers less text)
Muon optimizer: replace AdamW with Muon + AdamW hybrid; train a fresh model
Architecture: replace transformer blocks with shared-weight Mamba or TTT-Linear blocks
GPTQ quantization: apply 4-bit or 6-bit GPTQ to the trained model; measure bits-per-parameter after compression
Brotli compression: apply Brotli to the quantized weight byte stream; measure final effective parameter count
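As a concrete example of step 1, a minimal weight-tying sketch (module names follow the nanoGPT convention; the vocabulary and model sizes are illustrative):

```python
import torch.nn as nn

class TiedHead(nn.Module):
    def __init__(self, vocab_size=16384, d_model=256):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)               # (vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # (vocab, d_model)
        self.lm_head.weight = self.wte.weight                      # tie: one shared matrix

model = TiedHead()
# PyTorch deduplicates shared parameters, so the tied matrix is counted once:
print(sum(p.numel() for p in model.parameters()))   # 16384 * 256 = 4,194,304
```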
Compatibility sanity checks (run before submitting):
Capstone requirement:
After completing the experiment log, write a one-paragraph analysis of the task’s structure: which component dominated the parameter count, which technique provided the best accuracy-per-parameter trade-off for this specific task, and whether the result matched your prediction before running the experiment. The paragraph should contain at least one specific number (e.g., “weight tying saved 23% of parameters at zero cost in bits-per-byte because the vocabulary size was 16K”).
Milestone
A strong Parameter Golf run will combine at least three techniques from this curriculum. The highest-scoring open-source entry (modded-nanoGPT + GPTQ) achieves its score by: (1) using modded-nanoGPT’s architecture for maximum parameter efficiency during training, (2) applying aggressive GPTQ quantization (INT4 with group-size 32) for maximum compression during submission, (3) using Brotli to compress the quantized weight stream. Each step independently reduces the effective parameter count; the composition is multiplicative. A model that is 2× more efficient architecturally and 4× more efficient via quantization achieves 8× the parameter efficiency of the baseline.
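A minimal sketch of step (3), measuring how much Brotli shrinks an already-quantized weight stream (brotli.compress is the Python brotli package's real API; how the compressed size maps to the task's effective parameter count is left to the scoring rules and is not assumed here):

```python
import brotli

def brotli_ratio(quantized_stream: bytes) -> float:
    # Ratio of compressed size to raw size for the packed, quantized weights.
    return len(brotli.compress(quantized_stream, quality=11)) / len(quantized_stream)

# Usage idea: pack the GPTQ INT4 weights two per byte into a bytes object and
# record brotli_ratio(stream) alongside bits-per-parameter in the experiment log.
```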