Phase VI — Novel Architectures & Parameter Golf

Weeks 26–30 · ~25 hrs

Goal: Understand and implement the non-standard architectures that appear in the top Parameter Golf submissions. Mamba / SSMs, Test-Time Training (TTT) layers, depth recurrence, and the Muon optimizer’s relationship to orthogonal gradient updates are the key ideas. By the end of Week 30, you will have run a systematic Parameter Golf campaign with a documented experiment log.

Week 26 primary: Gu & Dao, Mamba paper §1–3 + mamba-minimal

Week 27 primary: Sun et al., TTT paper + TTT-linear implementation

Week 28 primary: Dehghani et al., Universal Transformers + depth recurrence variants in Parameter Golf submissions

Week 29 primary: KellerJordan, modded-nanoGPT — deep read of the full training pipeline; its Muon optimizer and architecture choices make it the top-scoring open-source entry

Week 30: Parameter Golf campaign — no new primary resources; systematic experimentation


Week 26 — State Space Models: Mamba

Primary resources:

- Gu & Dao, Mamba paper — §1 (motivation), §3 (selective SSM mechanism), §4 (hardware-efficient algorithm); skip §2 (prior SSM background) unless you want deep theory
- mamba-minimal — ~200 lines of clean PyTorch; read end to end
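Before reading mamba-minimal, it can help to see the recurrence in isolation. The following is a toy sequential selective-scan sketch (numpy, hypothetical sizes, random projections in place of learned ones, and a simplified Euler discretization for B rather than exact ZOH) — the real implementation computes B, C, and dt from learned projections of the input and uses the hardware-efficient parallel scan:

```python
import numpy as np

# Toy sequential selective scan: the recurrence mamba-minimal implements,
# without the parallel-scan formulation. A is diagonal; B, C, and the step
# size dt are input-dependent ("selective"), which is the core Mamba idea.
rng = np.random.default_rng(0)
d_model, d_state, seq_len = 4, 3, 10
A = -np.exp(rng.standard_normal((d_model, d_state)))  # negative real -> stable
x = rng.standard_normal((seq_len, d_model))

h = np.zeros((d_model, d_state))
ys = []
for t in range(seq_len):
    # In Mamba these come from learned projections of x[t]; random here.
    B = rng.standard_normal(d_state)
    C = rng.standard_normal(d_state)
    dt = np.exp(rng.standard_normal(d_model)) * 0.1   # per-channel step size
    A_bar = np.exp(dt[:, None] * A)                   # ZOH discretization of A
    B_bar = dt[:, None] * B[None, :]                  # simplified Euler term
    h = A_bar * h + B_bar * x[t][:, None]             # state update
    ys.append(h @ C)                                  # read out per channel
y = np.stack(ys)                                      # (seq_len, d_model)
```

Note that the state `h` has fixed size regardless of `seq_len` — this is the source of the memory advantage discussed in the milestone below.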

Concepts to understand:

Coding tasks:

Milestone

At seq_len=256 and seq_len=512: transformer and Mamba should have similar generation speed (the quadratic factor is small at short contexts). At seq_len=4096: transformer generation is ~16× slower than at seq_len=1024 (quadratic scaling); Mamba is ~4× slower (linear scaling). The memory difference is more dramatic: at seq_len=4096 with batch=8, the transformer attention matrix requires ~2GB; Mamba’s state requires only d_state × d_model × 2 bytes × batch = 16 × 256 × 2 × 8 = 65,536 bytes ≈ 64KB — over four orders of magnitude smaller. For Parameter Golf tasks that evaluate on long sequences, Mamba-based architectures are parameter-efficient in a way transformers fundamentally cannot be.
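The memory claims above are easy to verify with back-of-envelope arithmetic. A sketch, assuming fp16 activations and 8 attention heads (the head count is an assumption — the milestone does not specify it, and it does not change the order of magnitude):

```python
# Sizes from the milestone: batch=8, seq_len=4096, d_model=256, d_state=16.
seq_len, batch, n_heads = 4096, 8, 8
d_model, d_state, bytes_fp16 = 256, 16, 2

# Transformer: one seq_len x seq_len attention matrix per head, per example.
attn_bytes = batch * n_heads * seq_len * seq_len * bytes_fp16
# Mamba: a fixed-size recurrent state, independent of seq_len.
state_bytes = batch * d_state * d_model * bytes_fp16

print(attn_bytes / 2**30, "GiB")   # 2.0 GiB
print(state_bytes / 2**10, "KiB")  # 64.0 KiB
print(attn_bytes // state_bytes)   # 32768x smaller
```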


Week 27 — Test-Time Training (TTT) Layers

Primary resource: Sun et al., TTT paper — §1–4; focus on the TTT-Linear and TTT-MLP variants

Concepts to understand:

Coding tasks:

Milestone

TTT-Linear should match or slightly beat transformer perplexity at the same parameter count on short-to-medium contexts (up to ~512 tokens), because the dynamic hidden state provides more effective context compression than the fixed-size state of a standard RNN while using fewer parameters than full attention. On longer contexts, TTT should scale better than transformer (linear in sequence length) but may not match Mamba’s efficiency. Training step time for TTT is typically 1.5–2.5× slower than transformer due to the inner gradient computation — this is a real cost and should be included when comparing parameter efficiency on a compute budget.
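The inner gradient computation responsible for that 1.5–2.5× overhead is easy to prototype. A toy TTT-Linear step in numpy — the fast-weight matrix W is the hidden state, updated by one gradient step per token on a self-supervised reconstruction loss. The dimensions and the plain-reconstruction objective here are simplifying assumptions; the paper's variants use learned corruption/reconstruction views and mini-batched inner updates:

```python
import numpy as np

def ttt_linear_step(W, x_train, x_query, eta=0.1):
    """One inner-loop step: update fast weights W on the self-supervised
    loss L = 0.5 * ||W x - x||^2, then answer the query with updated W."""
    residual = W @ x_train - x_train
    grad = np.outer(residual, x_train)   # dL/dW
    W = W - eta * grad                   # test-time gradient step
    return W, W @ x_query

rng = np.random.default_rng(0)
d = 16
W = np.zeros((d, d))                     # hidden state = a weight matrix
for t in range(64):
    x = rng.standard_normal(d)
    W, out = ttt_linear_step(W, x, x)    # one inner step per token
```

The extra cost per token is the gradient computation (an outer product here; a backward pass through a small MLP in TTT-MLP), which is why training step time is measurably slower than a plain transformer layer.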


Week 28 — Depth Recurrence and Shared-Weight Transformers

Primary resource: Dehghani et al., Universal Transformers — §1–3

Concepts to understand:

Coding tasks:

Milestone

Expected result: a fully shared-weight model that applies a single block L=8 times achieves significantly worse perplexity than an 8-layer model with independent weights when the shared block is the same size as each independent block — unsurprising, since it has roughly 1/8 the parameters. The fair comparison is: a shared-weight model with 1 large block run 8 times vs. 8 small blocks with independent weights at the same total parameter count. In this comparison, the shared-weight model often wins at shorter contexts because the large single block has more capacity per pass. The Parameter Golf insight: shared weights let you get “depth for free” — a model counted as having 1 block’s parameters but achieving 8-block expressivity through iteration.
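The parameter accounting behind that comparison can be sketched in a few lines (hypothetical sizes; biases and LayerNorm weights ignored):

```python
def block_params(d_model, d_ff):
    """Weight count for one pre-norm transformer block (no biases/norms)."""
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff       # up- and down-projections
    return attn + mlp

d = 256
independent = 8 * block_params(d, 4 * d)   # 8 blocks, independent weights
shared = block_params(d, 4 * d)            # 1 block applied 8 times
print(independent // shared)               # 8: same depth, 1/8 the parameters

# Fair comparison: one larger shared block sized to match the 8 independent
# blocks. Each block here costs 12*d^2 params, so d_big = d * 8**0.5 ≈ 724
# gives one big block with roughly the same total count.
```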


Week 29 — The modded-nanoGPT Reference Implementation

Primary resource: KellerJordan, modded-nanoGPT — read every file; this is the most engineering-dense reference implementation for Parameter Golf techniques

Concepts to understand:

Coding tasks:

Milestone

The modded-nanoGPT architecture choices compound: each individual modification provides a small improvement (typically 0.01–0.05 perplexity), but together they produce a model that is 5–10% more efficient per parameter than standard nanoGPT. The ablation study should show that Muon > QK norm > ReLU² > logit soft-cap in terms of individual contribution. If the Muon improvement is not reproducible (variance across seeds is larger than the signal), your learning rate for Muon is likely wrong — Muon requires a different LR than AdamW (typically lr_muon ≈ 0.1 × lr_adamw because Muon’s updates are orthonormalized and thus larger in spectral norm).
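The core of Muon is replacing the raw momentum update with an approximately orthogonalized one. A numpy sketch of the Newton-Schulz iteration — the quintic coefficients follow modded-nanoGPT's `zeropower_via_newtonschulz5`; the real implementation runs in bfloat16 on GPU and is applied per 2-D weight matrix inside the optimizer step:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Drive G's singular values toward 1 while keeping its singular
    vectors -- the orthogonalization at the heart of Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))            # stand-in for a momentum matrix
O = newton_schulz_orthogonalize(G)
# Singular values of O are now close to 1 (within iteration tolerance),
# which is why Muon's effective update scale differs from AdamW's.
```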


Week 30 — Parameter Golf Campaign

No new primary resources — systematic experimentation using all techniques from the curriculum.

Pre-campaign analysis:

Experiment log (fill in during experimentation):

| # | Technique | Params before | Params after | BPB before | BPB after | Δ params | Δ BPB | Notes |
|---|-----------|---------------|--------------|------------|-----------|----------|-------|-------|
| 0 | Baseline  |               |              |            |           |          |       |       |
| 1 |           |               |              |            |           |          |       |       |
| 2 |           |               |              |            |           |          |       |       |
| 3 |           |               |              |            |           |          |       |       |
| 4 |           |               |              |            |           |          |       |       |
| 5 |           |               |              |            |           |          |       |       |

Suggested experimentation sequence (one change at a time):

  1. Weight tying: tie wte and lm_head; a free win if the vocabulary is large
  2. Vocab optimization: train a custom BPE tokenizer at 2048–4096 tokens; measure the embedding parameter savings vs. the bits-per-byte cost of weaker tokenization (a smaller vocab emits more tokens per byte, effectively shortening the context)
  3. Muon optimizer: replace AdamW with Muon + AdamW hybrid; train a fresh model
  4. Architecture: replace transformer blocks with shared-weight Mamba or TTT-Linear blocks
  5. GPTQ quantization: apply 4-bit or 6-bit GPTQ to the trained model; measure bits-per-parameter after compression
  6. Brotli compression: apply Brotli to the quantized weight byte stream; measure final effective parameter count
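A back-of-envelope sketch for steps 1–2 above (all sizes hypothetical; real counts depend on your architecture, and biases/norms are ignored):

```python
# Embedding vs. block parameters for a small model (hypothetical sizes).
vocab, d_model, n_layer = 16_384, 256, 8
block = 12 * d_model * d_model     # 4d^2 attention + 8d^2 MLP per block

untied = 2 * vocab * d_model + n_layer * block   # separate wte and lm_head
tied = vocab * d_model + n_layer * block         # one shared matrix

saving = (untied - tied) / untied
print(f"weight tying saves {untied - tied:,} params ({saving:.0%})")
# Shrinking the vocab cuts the (tied) embedding linearly -- but a smaller
# vocab emits more tokens per byte, which costs bits-per-byte elsewhere.
```

At these sizes the embedding is the dominant component, which is exactly the situation where steps 1–2 pay off most.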

Compatibility sanity checks (run before submitting):

Capstone requirement:

After completing the experiment log, write a one-paragraph analysis of the task’s structure: which component dominated the parameter count, which technique provided the best accuracy-per-parameter trade-off for this specific task, and whether the result matched your prediction before running the experiment. The paragraph should contain at least one specific number (e.g., “weight tying saved 23% of parameters at zero cost in bits-per-byte because the vocabulary size was 16K”).

Milestone

A strong Parameter Golf run will combine at least three techniques from this curriculum. The highest-scoring open-source entry (modded-nanoGPT + GPTQ) achieves its score by: (1) using modded-nanoGPT’s architecture for maximum parameter efficiency during training, (2) applying aggressive GPTQ quantization (INT4 with group-size 32) for maximum compression during submission, (3) using Brotli to compress the quantized weight stream. Each step independently reduces the effective parameter count; the composition is multiplicative. A model that is 2× more efficient architecturally and 4× more efficient via quantization achieves 8× the parameter efficiency of the baseline.
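The quantize-then-compress measurement in steps (2)–(3) can be sketched end to end. This uses `zlib` from the standard library as a stand-in for Brotli (same compress-a-byte-stream interface; Brotli typically compresses somewhat better), and a synthetic peaked weight distribution in place of real GPTQ output:

```python
import zlib
import numpy as np

# Synthetic "INT4-quantized weights": peaked around the middle code, the
# way real quantized weight distributions tend to be.
rng = np.random.default_rng(0)
codes = np.clip(np.round(rng.normal(8, 1.5, size=1_000_000)), 0, 15).astype(np.uint8)

packed = (codes[0::2] << 4 | codes[1::2]).tobytes()   # two 4-bit codes per byte
compressed = zlib.compress(packed, level=9)

bits_per_param = 8 * len(compressed) / len(codes)
print(f"{bits_per_param:.2f} effective bits per parameter")
# Under 4.0: entropy coding stacks multiplicatively on top of quantization,
# because quantized weights are not uniformly distributed over the 16 codes.
```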