Deep Learning Engineering: Overview

This file is the index for the concepts/deep-learning-engineering/ folder. It maps planned and written notes on practical techniques for training efficiency, inference acceleration, and parameter efficiency in modern deep learning — with emphasis on large language models and Parameter Golf-style optimization.

Notes in This Folder

Written

File	Topic
`weight-tying.md`	✅ Parameter sharing between input embedding and output projection
`gradient-checkpointing.md`	✅ Recompute activations on the backward pass to reduce peak memory
`normalization-free-transformers.md`	✅ DyT and Derf as pointwise LayerNorm replacements; four-property theory
`rotary-embeddings.md`	✅ RoPE — rotating Q/K vectors to encode relative position; NTK-aware scaling and YaRN context extension
`training-loops.md`	✅ Engineering concerns for ideal training loops — step order, scheduling, mixed precision, gradient clipping, tricks

Planned

File	Topic
`mixed-precision.md`	FP16/BF16 training, loss scaling, and master weight copies
`muon.md`	Orthogonalized gradient descent for 2D weights; the top Parameter Golf optimizer
`mup-parametrization.md`	Maximal Update Parametrization — transfer optimal hyperparameters from small to large models
`normalization.md`	RMSNorm vs. LayerNorm; Pre-Norm vs. Post-Norm; training stability at depth
`glu-variants.md`	SwiGLU and gated FFN variants — multiplicative gating for better perplexity
`mamba-ssm.md`	Selective state space models: \(O(N)\) training via parallel scan, \(O(1)\) per-token inference via recurrence
`linear-attention.md`	Gated linear attention (RWKV, RetNet, GLA): \(O(1)\) inference memory with data-dependent decay
`ttt-layers.md`	Test-time training layers — inner model weights as dynamic hidden state
`depth-recurrence.md`	Shared-weight transformers: extra depth at no added parameter cost via weight reuse across layers
`sparse-moe.md`	Sparse Mixture of Experts — decouple parameter count from per-token compute via routing
`mixture-of-depths.md`	Token routing to skip transformer blocks — reduce FLOPs with matched perplexity
`vocab-optimization.md`	BPE engineering, bigram hashing, and small-vocabulary strategies

Written (subfolders)

File	Topic
`knowledge-distillation/knowledge-distillation.md`	✅ Teacher-student distillation — soft targets, dark knowledge, feature-based methods
`sparsity-pruning/`	✅ Subfolder — classical (OBD/OBS/IMP), compression pipelines (Deep Compression/EIE), structured pruning, sparse training (LTH/RigL), LLM pruning (SparseGPT/Wanda)
`spectral-bias/spectral-bias-fourier-kernels.md`	✅ Subfolder — spectral bias via NTK eigenspectrum, Fourier feature maps (Bochner/Rahimi–Recht), cross-attention as adaptive frequency routing (Feng et al. 2025)

Subtopic Map

🏋️ Training Efficiency

Subtopic	Key Idea	Primary Source
Weight tying	Share \(W_{\text{emb}} = W_{\text{out}}^\top\); halves vocab-layer parameters	Press & Wolf (2017)
Gradient checkpointing	Discard activations during forward; recompute on demand during backward	Chen et al. (2016)
Mixed precision	Compute in FP16/BF16; keep FP32 master weights for stable updates	Micikevicius et al. (2018)
Muon optimizer	Orthogonalize gradients via Newton-Schulz before applying; beats AdamW per parameter	Jordan et al. (2024)
µP	Reparametrize init and LR scaling so optimal HPs transfer from small proxy to full model	Yang et al. (2022)
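
As a small worked example of the weight-tying row, the sketch below counts the parameters in the embedding and output projection with and without tying. The function name and the vocabulary and width values are illustrative assumptions, not taken from the notes.

```python
def vocab_layer_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters in the input embedding plus the output projection (no bias)."""
    embedding = vocab_size * d_model
    # With tying, the output projection reuses the embedding matrix (transposed),
    # so it contributes no additional parameters.
    output_projection = 0 if tied else vocab_size * d_model
    return embedding + output_projection


# Illustrative sizes only (assumed for this sketch).
untied = vocab_layer_params(vocab_size=32_000, d_model=512, tied=False)
tied = vocab_layer_params(vocab_size=32_000, d_model=512, tied=True)
print(untied, tied)  # 32768000 16384000
```

For small models the vocabulary layers can be a large share of the total budget, which is why this saving matters in parameter-constrained settings.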

🧱 Architecture Components

Subtopic	Key Idea	Primary Source
RMSNorm + Pre-Norm	Drop mean-centering; normalize before each sublayer — ubiquitous in modern LLMs	Zhang & Sennrich (2019); LLaMA (2023)
Normalization-free Transformers	Replace LayerNorm with DyT (tanh) or Derf (erf); four necessary properties; matches/beats LN	Zhu et al. (2025); Chen et al. (2025)
SwiGLU / GLU variants	Gated FFN; for SwiGLU, \(\text{FFN}(x) = (\text{Swish}(xW_1) \odot xV)\, W_2\) (the original GLU uses \(\sigma\) in place of Swish); consistent perplexity improvement	Shazeer (2020)
Rotary embeddings (RoPE)	Encode relative position via Q/K rotation; extends to longer contexts via YaRN	Su et al. (2021); Peng et al. (2023)
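
The RoPE row says that rotating Q/K vectors encodes relative position. A minimal pure-Python sketch of that property follows: the attention score between a rotated query and key depends only on the offset between their positions. The helper names, the toy vectors, and the base of 10000 are assumptions for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE pairs up dimensions, so d must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # the pair starting at index i uses frequency base^(-i/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]   # toy query (assumed values)
k = [1.0, 0.4, -0.6, 0.9]   # toy key (assumed values)

# Same relative offset (3), different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)  # True
```

Because each pair is only rotated, vector norms are unchanged; the position information lives entirely in the angles.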

🏗️ Sequence Models

Subtopic	Key Idea	Primary Source
Mamba / selective SSMs	Input-dependent \(A, B, C\); parallel scan in training, recurrent in inference	Gu & Dao (2023)
Linear attention / GLA	Data-dependent decay gates; \(O(1)\) inference memory, \(O(N)\) training compute	Peng et al. / Yang et al. (2023–24)
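
Both rows above rely on a recurrence whose state has fixed size at inference time. The sketch below uses a deliberately simplified scalar recurrence \(h_t = a_t h_{t-1} + x_t\) (an assumption for illustration, not the actual Mamba or GLA parameterization) and checks that the constant-memory loop matches the explicitly unrolled sum. The unrolled version is a quadratic-time reference, not a parallel scan.

```python
def recurrent_scan(decays, inputs):
    """Inference-style pass: h_t = a_t * h_{t-1} + x_t with a single running state."""
    h = 0.0
    outputs = []
    for a, x in zip(decays, inputs):
        h = a * h + x  # only h is carried forward, so state memory does not grow with t
        outputs.append(h)
    return outputs


def unrolled(decays, inputs):
    """Reference form: h_t = sum over s <= t of (product of a_r for s < r <= t) * x_s."""
    outputs = []
    for t in range(len(inputs)):
        total = 0.0
        for s in range(t + 1):
            weight = 1.0
            for r in range(s + 1, t + 1):
                weight *= decays[r]
            total += weight * inputs[s]
        outputs.append(total)
    return outputs


decays = [0.9, 0.5, 0.99, 0.1]   # in a gated model these would be computed from the input
inputs = [1.0, -2.0, 0.5, 3.0]

rec = recurrent_scan(decays, inputs)
ref = unrolled(decays, inputs)
print(all(abs(a - b) < 1e-12 for a, b in zip(rec, ref)))  # True
```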

⚙️ Dynamic Computation

Subtopic	Key Idea	Primary Source
Sparse MoE	Route each token to top-\(k\) of \(N\) expert FFNs; total params \(\gg\) active params	Jiang et al. / Dai et al. (2024)
Mixture of Depths	Top-\(k\) token routing to skip entire blocks; up to 50% FLOP reduction	Raposo et al. (2024)
TTT layers	Hidden state = inner model weights; updated by gradient steps at test time	Sun et al. (2024)
Depth recurrence	Run one shared transformer block \(L\) times; \(L\) layers of compute at the parameter cost of one layer	Dehghani et al. (2019)
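
To make the Sparse MoE row concrete, here is a minimal sketch of top-\(k\) routing for a single token. It assumes the gate weights are a softmax over the selected logits only (one common convention; implementations differ), and the toy scalar "experts" are hypothetical stand-ins for expert FFNs.

```python
import math


def top_k_route(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax restricted to the chosen experts (assumed convention for this sketch).
    max_logit = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - max_logit) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy scalar "experts" standing in for expert FFNs (hypothetical).
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
logits = [2.0, 0.5, 1.0, -1.0]

routes = top_k_route(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
print([i for i, _ in routes])  # [0, 2]
```

Only two of the four experts are evaluated for this token, which is the sense in which total parameters can far exceed active parameters.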

📖 Vocabulary and Transfer

Subtopic	Key Idea	Primary Source
Vocabulary optimization	Tune BPE vocab size; bigram hashing for ultra-small vocabularies	Sennrich et al. (2016)
Knowledge distillation	Train student on teacher’s output distribution via reverse KL; on-policy variants	Gu et al. (2023)

🌊 Signal Representation and Spectral Methods (subfolder: `spectral-bias/`)

Subtopic	Key Idea	Primary Source
Spectral bias	MLPs fit low-frequency modes first; NTK eigenvalues decay as \(l^{-(d+1)}\)	Rahaman et al. (2019)
Fourier feature maps	Lift inputs via \(\gamma(v)=[\cos(Bv),\sin(Bv)]\) to equalize NTK eigenspectrum	Tancik et al. (2020); Rahimi & Recht (2007)
Cross-attention frequency routing	Treat Fourier feature bank as key–value tokens; attention reweights frequency contributions adaptively	Feng et al. (2025)
NeRF positional encoding	Log-spaced deterministic Fourier features for 3D coordinate MLPs	Mildenhall et al. (2020)
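
The Fourier feature map \(\gamma(v)=[\cos(Bv),\sin(Bv)]\) from the table can be sketched in a few lines. The example also checks the shift-invariance that underlies the Rahimi–Recht construction: the inner product \(\gamma(v_1)\cdot\gamma(v_2)\) depends only on \(v_1 - v_2\). The Gaussian sampling of \(B\), its scale, and the input values are assumptions for illustration.

```python
import math
import random


def fourier_features(v, B):
    """gamma(v) = [cos(Bv), sin(Bv)], where B is a list of frequency rows."""
    projections = [sum(b_j * v_j for b_j, v_j in zip(row, v)) for row in B]
    return [math.cos(p) for p in projections] + [math.sin(p) for p in projections]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
num_freqs, in_dim, sigma = 16, 2, 3.0   # illustrative values (assumed)
B = [[rng.gauss(0.0, sigma) for _ in range(in_dim)] for _ in range(num_freqs)]

v1, v2 = [0.2, -0.4], [0.5, 0.1]
shift = [1.3, -0.7]
v1_shifted = [a + s for a, s in zip(v1, shift)]
v2_shifted = [a + s for a, s in zip(v2, shift)]

# The induced kernel depends only on v1 - v2, so a common shift leaves it unchanged.
k_original = dot(fourier_features(v1, B), fourier_features(v2, B))
k_shifted = dot(fourier_features(v1_shifted, B), fourier_features(v2_shifted, B))
print(abs(k_original - k_shifted) < 1e-9)  # True
```

The scale of \(B\) (here `sigma`) controls which frequencies the downstream MLP can fit easily, so it is usually treated as a hyperparameter.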

✂️ Model Compression (subfolder: `sparsity-pruning/`, `knowledge-distillation/`)

Subtopic	Key Idea	Primary Source
OBD / OBS	Second-order Hessian saliency for weight deletion; full inverse-Hessian weight correction	LeCun et al. (1990); Hassibi & Stork (1993)
Deep Compression	Prune + k-means quantize + Huffman encode; 35–49× total compression	Han et al. (2016)
EIE	Custom VLSI for compressed-sparse FC inference; 189× CPU speedup	Han et al. (2016)
Structured pruning	Filter/channel/head pruning; hardware-compatible sparsity without sparse kernels	Li et al. (2016); Liu et al. (2017)
Lottery Ticket Hypothesis	Sparse winning-ticket subnetworks found via IMP + weight rewinding	Frankle & Carbin (2019)
RigL	Dynamic sparse training: gradient-guided topology updates at fixed FLOP budget	Evci et al. (2020)
SparseGPT	Layer-wise OBS at LLM scale via Cholesky inverse; 50% sparsity with no retraining	Frantar & Alistarh (2023)
Wanda	Weight × activation-norm saliency; matches SparseGPT with no weight update	Sun et al. (2023)
Knowledge distillation	Soft targets encode inter-class structure (“dark knowledge”); student trained on teacher logits	Hinton et al. (2015)
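
The Wanda row describes a saliency of weight magnitude times input activation norm. A minimal sketch for one output row follows, with plain magnitude pruning as a baseline for comparison. The function names and the toy weights and norms are assumptions; a real implementation would operate on full weight matrices and norms measured on calibration data.

```python
def wanda_prune_row(weights, input_norms, sparsity):
    """Zero the lowest-scoring weights in one output row; score = |w_j| * ||x_j||_2."""
    scores = [abs(w) * n for w, n in zip(weights, input_norms)]
    num_prune = int(len(weights) * sparsity)
    prune_idx = set(sorted(range(len(weights)), key=lambda j: scores[j])[:num_prune])
    return [0.0 if j in prune_idx else w for j, w in enumerate(weights)]


def magnitude_prune_row(weights, sparsity):
    """Baseline for comparison: the same procedure with all input norms set to 1."""
    return wanda_prune_row(weights, [1.0] * len(weights), sparsity)


row = [0.5, -0.2, 0.1, 0.4]            # one output neuron's weights (toy values)
input_norms = [0.1, 10.0, 5.0, 0.2]    # per-input-feature activation L2 norms (toy values)

wanda_result = wanda_prune_row(row, input_norms, 0.5)
magnitude_result = magnitude_prune_row(row, 0.5)
print(wanda_result)      # [0.0, -0.2, 0.1, 0.0]
print(magnitude_result)  # [0.5, 0.0, 0.0, 0.4]
```

In this toy case the two criteria keep disjoint sets of weights: the small weights attached to high-norm inputs survive under the activation-aware score, while magnitude pruning discards them.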

Dependency Graph

flowchart TD
    WT["Weight Tying<br/>weight-tying.md"]
    GC["Gradient Checkpointing<br/>gradient-checkpointing.md"]
    MP["Mixed Precision<br/>mixed-precision.md"]
    MU["Muon<br/>muon.md"]
    UP["muP<br/>mup-parametrization.md"]
    NM["RMSNorm + Pre-Norm<br/>normalization.md"]
    GL["SwiGLU / GLU<br/>glu-variants.md"]
    RO["RoPE<br/>rotary-embeddings.md"]
    MB["Mamba / SSMs<br/>mamba-ssm.md"]
    LA["Linear Attention<br/>linear-attention.md"]
    TTT["TTT Layers<br/>ttt-layers.md"]
    DR["Depth Recurrence<br/>depth-recurrence.md"]
    MoE["Sparse MoE<br/>sparse-moe.md"]
    MoD["Mixture of Depths<br/>mixture-of-depths.md"]
    VO["Vocab Optimization<br/>vocab-optimization.md"]
    KD["Knowledge Distillation<br/>knowledge-distillation/knowledge-distillation.md"]
    SP["Sparsity and Pruning<br/>sparsity-pruning/overview.md"]
    SB["Spectral Bias and Fourier Kernels<br/>spectral-bias/spectral-bias-fourier-kernels.md"]

    MB --> LA
    MB --> TTT
    TTT --> DR
    MoE --> MoD
    KD -.->|"competing paradigm"| SP
    RO -.->|"NTK frequency analysis"| SB

Most notes are self-contained. Notable chains: `mamba-ssm.md` motivates both `linear-attention.md` (shared recurrent framing) and `ttt-layers.md` (generalizing a fixed state to a learned model); `ttt-layers.md` pairs naturally with `depth-recurrence.md` for Parameter Golf; `sparse-moe.md` is prerequisite context for `mixture-of-depths.md`.

Master References

Reference	Authors	Year	What It Covers	Link
Using the Output Embedding to Improve Language Models	Press & Wolf	2017	Weight tying — empirical analysis and motivation	arXiv:1608.05859
Tying Word Vectors and Word Classifiers	Inan et al.	2017	Independent concurrent weight tying; KL-divergence justification	arXiv:1611.01462
ALBERT	Lan et al.	2020	Factored embeddings; cross-layer weight sharing	arXiv:1909.11942
Training Deep Nets with Sublinear Memory Cost	Chen et al.	2016	Gradient checkpointing — \(O(\sqrt{n})\) activation memory	arXiv:1604.06174
Reducing Activation Recomputation in Large Transformer Models	Korthikanti et al.	2022	Selective recomputation — 5x memory reduction	arXiv:2205.05198
Mixed Precision Training	Micikevicius et al.	2018	FP16 training with loss scaling and FP32 master weights	arXiv:1710.03740
A Study of BFLOAT16 for Deep Learning Training	Kalamkar et al.	2019	BF16 as FP32 drop-in — same exponent range, no loss scaling needed	arXiv:1905.12322
Modded-nanoGPT / Muon	Jordan et al.	2024	Muon optimizer — Newton-Schulz orthogonalization for 2D weight gradients	GitHub
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot HP Transfer	Yang et al.	2022	µP — reparametrization enabling HP transfer across model scales	arXiv:2203.03466
Root Mean Square Layer Normalization	Zhang & Sennrich	2019	RMSNorm: drops mean-centering; faster than LayerNorm with comparable stability	arXiv:1910.07467
Transformers without Normalization	Zhu, Chen, He, LeCun, Liu	2025	DyT = γ⊙tanh(αx)+β as drop-in LayerNorm replacement; 8.2% LLaMA-7B training speedup	arXiv:2503.10622
Stronger Normalization-Free Transformers	Chen, Lu, Zhu, Sun, Liu	2025	Four-property framework for pointwise norm replacements; Derf = γ·erf(αx+s)+β outperforms LN and DyT	arXiv:2512.10938
GLU Variants Improve Transformer	Shazeer	2020	SwiGLU and other gated FFN variants — consistent perplexity improvement	arXiv:2002.05202
RoFormer: Enhanced Transformer with Rotary Position Embedding	Su et al.	2021	RoPE — relative position via rotation; decays with distance naturally	arXiv:2104.09864
YaRN: Efficient Context Window Extension	Peng et al.	2023	NTK-aware RoPE frequency interpolation for context length extension	arXiv:2309.00071
Mamba: Linear-Time Sequence Modeling with Selective State Spaces	Gu & Dao	2023	Selective SSM — input-dependent transitions; parallel scan algorithm	arXiv:2312.00752
RWKV: Reinventing RNNs for the Transformer Era	Peng et al.	2023	Linear attention via time-decay recurrence; \(O(1)\) inference memory	arXiv:2305.13048
Gated Linear Attention Transformers with Hardware-Efficient Training	Yang et al.	2024	GLA — data-dependent decay gates; hardware-efficient via chunk-wise scan	arXiv:2312.06635
Learning to (Learn at Test Time)	Sun et al.	2024	TTT layers — inner model weights as dynamic hidden state	arXiv:2407.04620
Universal Transformers	Dehghani et al.	2019	Depth recurrence — one shared transformer block run \(L\) times	arXiv:1807.03819
Mixtral of Experts	Jiang et al.	2024	Sparse MoE at scale: 8 experts per layer, top-2 routing; about 13B active of 47B total parameters per token, matching or outperforming Llama 2 70B on the reported benchmarks	arXiv:2401.04088
DeepSeekMoE	Dai et al.	2024	Fine-grained expert segmentation and shared expert isolation	arXiv:2401.06066
Mixture-of-Depths	Raposo et al.	2024	Token routing to skip entire transformer blocks; 50% FLOP reduction	arXiv:2404.02258
Neural Machine Translation of Rare Words with Subword Units	Sennrich et al.	2016	BPE tokenization — foundational algorithm for vocabulary construction	arXiv:1508.07909
MiniLLM: Knowledge Distillation of Large Language Models	Gu et al.	2023	On-policy distillation via reverse KL — avoids mode-covering pathologies	arXiv:2306.08543
On the Spectral Bias of Neural Networks	Rahaman et al.	2019	Empirical and Fourier-analytic characterization of spectral bias; \(O(\|\xi\|^{-(d+1)})\) convergence rate	arXiv:1806.08734
Fourier Features Let Networks Learn High Frequency Functions	Tancik et al.	2020	Lifting inputs via \(\gamma(v)=[\cos(Bv),\sin(Bv)]\) flattens NTK eigenspectrum; NTK convergence analysis	arXiv:2006.10739
Random Features for Large-Scale Kernel Machines	Rahimi & Recht	2007	Monte Carlo approximation of shift-invariant kernels via Bochner’s theorem	NIPS 2007
Overcoming Spectral Bias via Cross-Attention	Feng et al.	2025	Cross-attention over multiscale Fourier token banks for adaptive frequency routing; AFE and dual-network PDE architecture	arXiv:2512.18586