Deep Learning Engineering: Overview
This file is the index for the concepts/deep-learning-engineering/ folder. It maps planned and written notes on practical techniques for training efficiency, inference acceleration, and parameter efficiency in modern deep learning, with emphasis on large language models and Parameter Golf-style optimization.
Notes in This Folder
Written
| File | Topic |
|---|---|
| weight-tying.md | Parameter sharing between input embedding and output projection |
| gradient-checkpointing.md | Recompute activations on the backward pass to reduce peak memory |
| normalization-free-transformers.md | DyT and Derf as pointwise LayerNorm replacements; four-property theory |
| rotary-embeddings.md | RoPE: rotating Q/K vectors to encode relative position; NTK-aware scaling and YaRN context extension |
| training-loops.md | Engineering concerns for ideal training loops: step order, scheduling, mixed precision, gradient clipping, tricks |
Planned
| File | Topic |
|---|---|
| mixed-precision.md | FP16/BF16 training, loss scaling, and master weight copies |
| muon.md | Orthogonalized gradient descent for 2D weights; the top Parameter Golf optimizer |
| mup-parametrization.md | Maximal Update Parametrization: transfer optimal hyperparameters from small to large models |
| normalization.md | RMSNorm vs. LayerNorm; Pre-Norm vs. Post-Norm; training stability at depth |
| glu-variants.md | SwiGLU and gated FFN variants: multiplicative gating for better perplexity |
| mamba-ssm.md | Selective state space models: \(O(N)\) training via parallel scan, \(O(1)\) inference via recurrence |
| linear-attention.md | Gated linear attention (RWKV, RetNet, GLA); \(O(1)\) inference with data-dependent decay |
| ttt-layers.md | Test-time training layers: inner model weights as dynamic hidden state |
| depth-recurrence.md | Shared-weight transformers: depth for free via weight reuse across layers |
| sparse-moe.md | Sparse Mixture of Experts: decouple parameter count from per-token compute via routing |
| mixture-of-depths.md | Token routing to skip transformer blocks: reduce FLOPs with matched perplexity |
| vocab-optimization.md | BPE engineering, bigram hashing, and small-vocabulary strategies |
| knowledge-distillation.md | Sequence-level and on-policy distillation: training small models from teacher distributions |
Subtopic Map
Training Efficiency
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Weight tying | Share \(W_{\text{emb}} = W_{\text{out}}^\top\); halves vocab-layer parameters | Press & Wolf (2017) |
| Gradient checkpointing | Discard activations during forward; recompute on demand during backward | Chen et al. (2016) |
| Mixed precision | Compute in FP16/BF16; keep FP32 master weights for stable updates | Micikevicius et al. (2018) |
| Muon optimizer | Orthogonalize gradients via Newton-Schulz before applying; beats AdamW per parameter | Jordan et al. (2024) |
| µP | Reparametrize init and LR scaling so optimal HPs transfer from small proxy to full model | Yang et al. (2022) |
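The weight-tying row's claim that tying halves the vocab-layer parameters can be checked with simple arithmetic. The sketch below is illustrative only: the function name `vocab_layer_params` and the sizes used are assumptions, and biases are ignored.

```python
def vocab_layer_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Count parameters in the input embedding plus output projection (no biases)."""
    embedding = vocab_size * d_model          # W_emb has shape (vocab_size, d_model)
    output_projection = d_model * vocab_size  # W_out has shape (d_model, vocab_size)
    # With tying, W_out is W_emb transposed, so it adds no new parameters.
    return embedding if tied else embedding + output_projection


# Hypothetical sizes chosen only for illustration.
untied = vocab_layer_params(vocab_size=32_000, d_model=512, tied=False)
tied = vocab_layer_params(vocab_size=32_000, d_model=512, tied=True)
print(untied, tied)  # 32768000 16384000
```

For small models the vocabulary layers can be a large share of the total parameter budget, which is why this saving matters most in parameter-constrained settings.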
Architecture Components
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| RMSNorm + Pre-Norm | Drop mean-centering; normalize before each sublayer; ubiquitous in modern LLMs | Zhang & Sennrich (2019); LLaMA (2023) |
| Normalization-free Transformers | Replace LayerNorm with DyT (tanh) or Derf (erf); four necessary properties; matches/beats LN | Zhu et al. (2025); Chen et al. (2025) |
| SwiGLU / GLU variants | Gated FFN: \(\text{FFN}(x) = (\text{Swish}(xV) \odot xW_1) W_2\) for SwiGLU (a sigmoid gate \(\sigma\) in place of Swish gives the original GLU); consistent perplexity improvement | Shazeer (2020) |
| Rotary embeddings (RoPE) | Encode relative position via Q/K rotation; extends to longer contexts via YaRN | Su et al. (2021); Peng et al. (2023) |
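To make the normalization rows above concrete, here is a minimal pure-Python sketch contrasting RMSNorm with DyT. It is a simplification under stated assumptions: the gain, `gamma`, and `beta` are scalars here (in practice they are typically learned per-channel vectors), and the default values are illustrative rather than taken from either paper's code.

```python
import math


def rms_norm(x, gain=1.0, eps=1e-6):
    """RMSNorm: rescale by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gain * v / rms for v in x]


def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """DyT: pointwise gamma * tanh(alpha * x) + beta; uses no statistics of x."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]


x = [3.0, -1.0, 0.5, 8.0]  # toy activations with one large value
normed = rms_norm(x)       # output has root mean square close to 1
squashed = dyt(x)          # each element is squashed into (-1, 1) independently
print(normed)
print(squashed)
```

The design difference is that RMSNorm needs a reduction over the feature dimension, while DyT acts on each element on its own.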
Sequence Models
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Mamba / selective SSMs | Input-dependent \(A, B, C\); parallel scan in training, recurrent in inference | Gu & Dao (2023) |
| Linear attention / GLA | Data-dependent decay gates; \(O(1)\) inference memory, \(O(N)\) training compute | Peng et al. / Yang et al. (2023–24) |
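The constant inference memory in the linear-attention row comes from folding the whole history into one fixed-size state matrix. The sketch below illustrates this with a deliberately simplified recurrence: one scalar decay gate per token, supplied directly instead of computed from the input, and no normalization. The function names and toy inputs are assumptions for illustration, not the formulation of any specific paper. A quadratic reference form is included to check that both compute the same outputs.

```python
def gated_linear_attention_recurrent(qs, ks, vs, gates):
    """Process tokens one at a time, keeping only a (d_k x d_v) state matrix."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        # Decay the old state, then add the outer product k v^T.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = g * state[i][j] + k[i] * v[j]
        # Read out: o = q^T S.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def gated_linear_attention_quadratic(qs, ks, vs, gates):
    """Reference form: attend over all earlier tokens with accumulated decay."""
    d_v = len(vs[0])
    outputs = []
    for t in range(len(qs)):
        out = [0.0] * d_v
        for s in range(t + 1):
            decay = 1.0
            for r in range(s + 1, t + 1):
                decay *= gates[r]  # product of gates applied after token s
            score = decay * sum(a * b for a, b in zip(qs[t], ks[s]))
            for j in range(d_v):
                out[j] += score * vs[s][j]
        outputs.append(out)
    return outputs


# Toy inputs; in a real model the gates would be computed from the tokens.
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0], [2.0], [3.0]]
gates = [0.9, 0.5, 0.8]

recurrent_out = gated_linear_attention_recurrent(qs, ks, vs, gates)
quadratic_out = gated_linear_attention_quadratic(qs, ks, vs, gates)
print(recurrent_out)
```

The recurrent form stores only `state`, whose size does not depend on sequence length; the quadratic form revisits every earlier token at each step.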
Dynamic Computation
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Sparse MoE | Route each token to top-\(k\) of \(N\) expert FFNs; total params \(\gg\) active params | Jiang et al. / Dai et al. (2024) |
| Mixture of Depths | Top-\(k\) token routing to skip entire blocks; up to 50% FLOP reduction | Raposo et al. (2024) |
| TTT layers | Hidden state = inner model weights; updated by gradient steps at test time | Sun et al. (2024) |
| Depth recurrence | Run one shared transformer block \(L\) times; \(L\)-layer effective depth at 1-layer parameter cost (compute still scales with \(L\)) | Dehghani et al. (2019) |
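The top-\(k\) routing idea in the Sparse MoE row can be sketched in a few lines. This is a toy illustration under assumptions: the router logits are passed in directly (in a real model they come from a learned layer applied to the token), experts are plain scalar functions, and the names `top_k_route` and `moe_forward` are hypothetical. The weights are a softmax over the selected logits only, one common choice among several.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    max_logit = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - max_logit) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Evaluate only the k routed experts and mix their outputs."""
    routes = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in routes)


# Four toy "experts"; only two of them run for this token.
experts = [lambda x: x, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 1.0]

routes = top_k_route(logits, k=2)
y = moe_forward(3.0, logits, experts, k=2)
print(routes, y)
```

All four experts count toward total parameters, but each token pays the compute cost of only the two selected ones.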
Vocabulary and Transfer
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Vocabulary optimization | Tune BPE vocab size; bigram hashing for ultra-small vocabularies | Sennrich et al. (2016) |
| Knowledge distillation | Train student on teacher's output distribution via reverse KL; on-policy variants | Gu et al. (2023) |
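The choice of reverse KL in the distillation row matters because KL divergence is asymmetric. The sketch below compares forward KL (teacher to student) with reverse KL (student to teacher) on made-up discrete distributions. It illustrates only the asymmetry, not on-policy sampling, and every number in it is hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.49, 0.02, 0.49]           # two modes with a low-probability gap
student_one_mode = [0.96, 0.02, 0.02]  # commits to a single teacher mode
student_spread = [0.34, 0.32, 0.34]    # spreads mass, including over the gap

for name, student in [("one_mode", student_one_mode), ("spread", student_spread)]:
    forward = kl(teacher, student)  # penalizes missing any teacher mode
    reverse = kl(student, teacher)  # penalizes mass where the teacher has little
    print(f"{name}: forward={forward:.3f} reverse={reverse:.3f}")
```

In this toy case forward KL prefers the spread-out student, while reverse KL prefers the student that commits to one mode and avoids the low-probability region.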
Dependency Graph
flowchart TD
WT["Weight Tying<br/>weight-tying.md"]
GC["Gradient Checkpointing<br/>gradient-checkpointing.md"]
MP["Mixed Precision<br/>mixed-precision.md"]
MU["Muon<br/>muon.md"]
UP["muP<br/>mup-parametrization.md"]
NM["RMSNorm + Pre-Norm<br/>normalization.md"]
GL["SwiGLU / GLU<br/>glu-variants.md"]
RO["RoPE<br/>rotary-embeddings.md"]
MB["Mamba / SSMs<br/>mamba-ssm.md"]
LA["Linear Attention<br/>linear-attention.md"]
TTT["TTT Layers<br/>ttt-layers.md"]
DR["Depth Recurrence<br/>depth-recurrence.md"]
MoE["Sparse MoE<br/>sparse-moe.md"]
MoD["Mixture of Depths<br/>mixture-of-depths.md"]
VO["Vocab Optimization<br/>vocab-optimization.md"]
KD["Knowledge Distillation<br/>knowledge-distillation.md"]
MB --> LA
MB --> TTT
TTT --> DR
MoE --> MoD
Most notes are self-contained. Notable chains: mamba-ssm motivates both linear-attention (shared recurrent framing) and ttt-layers (generalizing fixed state to a learned model); ttt-layers pairs naturally with depth-recurrence for Parameter Golf. sparse-moe is prerequisite context for mixture-of-depths.
Master References
| Reference | Authors | Year | What It Covers | Link |
|---|---|---|---|---|
| Using the Output Embedding to Improve Language Models | Press & Wolf | 2017 | Weight tying: empirical analysis and motivation | arXiv:1608.05859 |
| Tying Word Vectors and Word Classifiers | Inan et al. | 2017 | Independent concurrent weight tying; KL-divergence justification | arXiv:1611.01462 |
| ALBERT | Lan et al. | 2020 | Factored embeddings; cross-layer weight sharing | arXiv:1909.11942 |
| Training Deep Nets with Sublinear Memory Cost | Chen et al. | 2016 | Gradient checkpointing: \(O(\sqrt{n})\) activation memory | arXiv:1604.06174 |
| Reducing Activation Recomputation in Large Transformer Models | Korthikanti et al. | 2022 | Selective recomputation: 5x activation memory reduction | arXiv:2205.05198 |
| Mixed Precision Training | Micikevicius et al. | 2018 | FP16 training with loss scaling and FP32 master weights | arXiv:1710.03740 |
| A Study of BFLOAT16 for Deep Learning Training | Kalamkar et al. | 2019 | BF16 as FP32 drop-in: same exponent range, no loss scaling needed | arXiv:1905.12322 |
| Modded-nanoGPT / Muon | Jordan et al. | 2024 | Muon optimizer: Newton-Schulz orthogonalization for 2D weight gradients | GitHub |
| Tensor Programs V: Tuning Large Neural Networks via Zero-Shot HP Transfer | Yang et al. | 2022 | µP: reparametrization enabling HP transfer across model scales | arXiv:2203.03466 |
| Root Mean Square Layer Normalization | Zhang & Sennrich | 2019 | RMSNorm: drops mean-centering; faster than LayerNorm with comparable stability | arXiv:1910.07467 |
| Transformers without Normalization | Zhu, Chen, He, LeCun, Liu | 2025 | DyT = γ·tanh(αx)+β as drop-in LayerNorm replacement; 8.2% LLaMA-7B training speedup | arXiv:2503.10622 |
| Stronger Normalization-Free Transformers | Chen, Lu, Zhu, Sun, Liu | 2025 | Four-property framework for pointwise norm replacements; Derf = γ·erf(αx+s)+β outperforms LN and DyT | arXiv:2512.10938 |
| GLU Variants Improve Transformer | Shazeer | 2020 | SwiGLU and other gated FFN variants; consistent perplexity improvement | arXiv:2002.05202 |
| RoFormer: Enhanced Transformer with Rotary Position Embedding | Su et al. | 2021 | RoPE: relative position via rotation; decays with distance naturally | arXiv:2104.09864 |
| YaRN: Efficient Context Window Extension | Peng et al. | 2023 | NTK-aware RoPE frequency interpolation for context length extension | arXiv:2309.00071 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Gu & Dao | 2023 | Selective SSM: input-dependent transitions; parallel scan algorithm | arXiv:2312.00752 |
| RWKV: Reinventing RNNs for the Transformer Era | Peng et al. | 2023 | Linear attention via time-decay recurrence; \(O(1)\) inference memory | arXiv:2305.13048 |
| Gated Linear Attention Transformers with Hardware-Efficient Training | Yang et al. | 2024 | GLA: data-dependent decay gates; hardware-efficient via chunk-wise scan | arXiv:2312.06635 |
| Learning to (Learn at Test Time) | Sun et al. | 2024 | TTT layers: inner model weights as dynamic hidden state | arXiv:2407.04620 |
| Universal Transformers | Dehghani et al. | 2019 | Depth recurrence: one shared transformer block run \(L\) times | arXiv:1807.03819 |
| Mixtral of Experts | Jiang et al. | 2024 | Sparse MoE at scale: 8 experts per layer, top-2 routing; about 13B active of 47B total parameters per token, matching or outperforming Llama 2 70B on most benchmarks | arXiv:2401.04088 |
| DeepSeekMoE | Dai et al. | 2024 | Fine-grained expert segmentation and shared expert isolation | arXiv:2401.06066 |
| Mixture-of-Depths | Raposo et al. | 2024 | Token routing to skip entire transformer blocks; 50% FLOP reduction | arXiv:2404.02258 |
| Neural Machine Translation of Rare Words with Subword Units | Sennrich et al. | 2016 | BPE tokenization: foundational algorithm for vocabulary construction | arXiv:1508.07909 |
| MiniLLM: Knowledge Distillation of Large Language Models | Gu et al. | 2023 | On-policy distillation via reverse KL; avoids mode-covering pathologies | arXiv:2306.08543 |