Sparsity and Pruning in Deep Learning: Overview
This file is the index for the concepts/deep-learning-engineering/sparsity-pruning/ folder. It lists planned and written subtopic notes, organizes them by theme, and collects the canonical references for the field.
Notes in This Folder
| File | Status | Topic |
|---|---|---|
| classical-pruning.md | ✅ Written | OBD, OBS, magnitude pruning: second-order saliency theory and iterative magnitude pruning |
| compression-pipelines.md | ✅ Written | Deep Compression (prune + quantize + Huffman) and EIE hardware accelerator |
| structured-pruning.md | ✅ Written | Filter/channel pruning, BN-scaling pruning, attention head pruning |
| sparse-training.md | ✅ Written | Lottery Ticket Hypothesis, SNIP, SET, SNFS, RigL: sparse training from scratch |
| llm-pruning.md | ✅ Written | Movement Pruning, SparseGPT, Wanda: LLM-scale compression |
Subtopic Map
Classical Theory: Hessian-Based Saliency
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Optimal Brain Damage | Diagonal-Hessian saliency \(s_i = H_{ii} w_i^2 / 2\); prune low-\(s\) weights | LeCun et al. 1990 |
| Optimal Brain Surgeon | Full inverse-Hessian; exact weight compensation \(\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q\) | Hassibi & Stork 1993 |
| Iterative Magnitude Pruning | Train → threshold → retrain; zeroth-order proxy; competitive at scale | Han et al. 2015 |
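The three saliency rules in the table above can be compared on a toy problem. The following is a minimal NumPy sketch, assuming a small positive-definite Hessian that can be inverted directly and a loss that is exactly quadratic around its minimum. The function names (`obd_saliency`, `obs_prune_one`, `magnitude_mask`) are illustrative and do not come from the cited papers. The OBS selection score \(w_q^2 / (2[H^{-1}]_{qq})\) is the standard one from Hassibi & Stork.

```python
import numpy as np

def obd_saliency(w, hessian):
    """OBD: s_i = H_ii * w_i^2 / 2, using only the Hessian diagonal."""
    return 0.5 * np.diag(hessian) * w ** 2

def obs_prune_one(w, hessian):
    """OBS: remove the weight with the smallest w_q^2 / (2 [H^-1]_qq)
    and compensate the remaining weights with the closed-form update."""
    h_inv = np.linalg.inv(hessian)
    saliency = w ** 2 / (2.0 * np.diag(h_inv))
    q = int(np.argmin(saliency))
    # H^-1 e_q is simply column q of the inverse Hessian.
    delta_w = -(w[q] / h_inv[q, q]) * h_inv[:, q]
    w_new = w + delta_w
    w_new[q] = 0.0  # exact zero; removes floating-point residue
    return w_new, q, float(saliency[q])

def magnitude_mask(w, sparsity):
    """Zeroth-order proxy: drop the `sparsity` fraction with smallest |w|.
    Ties at the threshold are pruned too, so slightly more may be removed."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return np.ones(w.shape, dtype=bool)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.abs(w) > threshold

# Toy quadratic loss L(w) = L* + 0.5 * dw^T H dw around a trained minimum.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
hessian = a @ a.T + 4.0 * np.eye(4)  # symmetric positive definite
w = rng.normal(size=4)

w_obs, q, predicted_increase = obs_prune_one(w, hessian)
delta = w_obs - w
actual_increase = 0.5 * delta @ hessian @ delta
print("OBD saliencies:", obd_saliency(w, hessian))
print("OBS pruned index:", q)
print("predicted vs actual loss increase:", predicted_increase, actual_increase)
print("magnitude mask at 50%:", magnitude_mask(w, 0.5))
```

On an exactly quadratic loss the predicted and actual increases agree, which is what makes the OBS compensation "exact" in the table's sense; on a real network the Hessian is only a local approximation.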
Hardware-Aware Compression Pipelines
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Deep Compression | Prune (9–13×) → k-means weight sharing (8-bit conv, 5-bit FC) → Huffman coding; 35× (AlexNet) to 49× (VGG-16) total | Han et al. 2016 |
| EIE Accelerator | Custom VLSI for compressed sparse FC layers; skips zero weights & activations | Han et al. 2016 |
| SCNN | Exploits both weight and activation sparsity in a tiled dataflow microarchitecture | Parashar et al. 2017 |
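The weight-sharing stage of Deep Compression can be illustrated with a small 1-D k-means over the surviving (non-zero) weights. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `kmeans_quantize`, the fixed iteration count, and the use of linearly spaced initial centroids are choices made here for illustration, and the codebook fine-tuning and Huffman stages are omitted.

```python
import numpy as np

def kmeans_quantize(weights, bits, n_iter=20):
    """Cluster non-zero weights into 2**bits shared values (Lloyd's algorithm).
    Assumes at least one non-zero weight; pruned zeros are left untouched."""
    nonzero = weights != 0
    values = weights[nonzero]
    k = 2 ** bits
    # Illustrative choice: centroids spaced linearly over the weight range.
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            members = values[assign == c]
            if members.size:  # leave empty clusters where they are
                centroids[c] = members.mean()
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    quantized = weights.copy()
    quantized[nonzero] = centroids[assign]
    return quantized, centroids

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))
w[np.abs(w) < 0.8] = 0.0                      # stand-in for magnitude pruning
w_q, codebook = kmeans_quantize(w, bits=3)
print("distinct non-zero values:", np.unique(w_q[w_q != 0]).size)
print("mean abs quantization error:", np.abs(w_q - w)[w != 0].mean())
```

Each surviving weight is then stored as a `bits`-wide index into the codebook rather than as a 32-bit float, which is where the quantization stage gets its share of the overall compression.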
Structured Pruning
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Filter pruning (ℓ₁) | Rank filters by ℓ₁ norm; remove whole filters for cuDNN-compatible sparsity | Li et al. 2016 |
| BN scaling (Network Slimming) | ℓ₁ sparsity on BN γ; prune channels with near-zero γ after training | Liu et al. 2017 |
| Attention head pruning | Most heads are redundant; L0-gate pruning identifies specialized heads | Michel et al. 2019; Voita et al. 2019 |
| Rethinking pruning | For structured methods, training the pruned architecture from random init matches fine-tuning the inherited weights | Liu et al. 2019 |
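The ℓ₁ filter-ranking row above reduces to a few array operations. The sketch below assumes convolution weights laid out as `(out_channels, in_channels, kH, kW)`; the helper names `l1_filter_ranking` and `prune_filters` are hypothetical, and bias terms, batch-norm parameters, and retraining are left out.

```python
import numpy as np

def l1_filter_ranking(conv_weight):
    """Return filter indices sorted from weakest to strongest by l1 norm."""
    norms = np.abs(conv_weight).sum(axis=(1, 2, 3))
    return np.argsort(norms), norms

def prune_filters(conv_weight, next_conv_weight, n_prune):
    """Remove the n_prune weakest filters and the matching input
    channels of the following convolution, yielding smaller dense tensors."""
    order, _ = l1_filter_ranking(conv_weight)
    keep = np.sort(order[n_prune:])  # preserve original channel order
    return conv_weight[keep], next_conv_weight[:, keep], keep

rng = np.random.default_rng(0)
conv1 = rng.normal(size=(8, 3, 3, 3))    # 8 filters over 3 input channels
conv2 = rng.normal(size=(16, 8, 3, 3))   # consumes conv1's 8 output channels
conv1_p, conv2_p, kept = prune_filters(conv1, conv2, n_prune=2)
print(conv1_p.shape, conv2_p.shape, kept)
```

Because whole filters disappear, the result is an ordinary smaller dense layer, which is why this kind of sparsity runs on standard libraries without sparse kernels.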
Sparse Training
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Lottery Ticket Hypothesis | Dense nets contain sparse “winning tickets” that train well from original init | Frankle & Carbin 2019 |
| SNIP | Connection sensitivity saliency at init; prune before training begins | Lee et al. 2019 |
| SET | Sparse Erdős–Rényi topology evolved during training; no dense model needed | Mocanu et al. 2018 |
| SNFS / Sparse Momentum | Magnitude of momentum-smoothed gradients drives topology reallocation; up to ~5× faster training reported | Dettmers & Zettlemoyer 2019 |
| RigL | Instantaneous gradient magnitudes update connectivity periodically; fixed FLOP budget | Evci et al. 2020 |
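The dynamic methods in this table share a drop-and-grow step on a fixed-size mask. The sketch below follows the RigL-style rule from the table (drop by weight magnitude, grow by gradient magnitude) with two simplifications that are assumptions of this example: growth candidates are restricted to connections that were inactive before the update, and the function name `drop_and_grow` and its arguments are illustrative. Swapping the gradient criterion for random selection gives a SET-style update instead.

```python
import numpy as np

def drop_and_grow(w, grad, mask, drop_fraction):
    """One topology update that keeps the number of active weights constant.
    `mask` is a boolean array of the same shape as `w` and `grad`."""
    flat_w, flat_g = w.ravel(), grad.ravel()
    flat_mask = mask.ravel()
    active = np.flatnonzero(flat_mask)
    inactive = np.flatnonzero(~flat_mask)
    n_swap = min(int(drop_fraction * active.size), inactive.size)

    new_mask = flat_mask.copy()
    new_w = flat_w.copy()
    if n_swap > 0:
        # Drop: active connections with the smallest weight magnitude.
        drop = active[np.argsort(np.abs(flat_w[active]))[:n_swap]]
        # Grow: previously inactive connections with the largest |gradient|.
        grow = inactive[np.argsort(-np.abs(flat_g[inactive]))[:n_swap]]
        new_mask[drop] = False
        new_mask[grow] = True
        new_w[drop] = 0.0
        new_w[grow] = 0.0  # new connections start at zero
    return new_mask.reshape(mask.shape), new_w.reshape(w.shape)

rng = np.random.default_rng(0)
mask = np.zeros(36, dtype=bool)
mask[rng.choice(36, size=12, replace=False)] = True
mask = mask.reshape(6, 6)
w = rng.normal(size=(6, 6)) * mask
grad = rng.normal(size=(6, 6))           # dense gradient, as the grow step needs
new_mask, new_w = drop_and_grow(w, grad, mask, drop_fraction=0.25)
print("active before/after:", mask.sum(), new_mask.sum())
```

The active count is unchanged by construction, which is how these methods hold the parameter budget fixed throughout training.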
LLM-Scale Pruning
| Subtopic | Key Idea | Primary Source |
|---|---|---|
| Movement Pruning | Saliency accumulates −(weight × gradient) during fine-tuning, favoring weights that move away from zero; task-adaptive | Sanh et al. 2020 |
| SparseGPT | Layerwise OBS with approximate inverse-Hessian via Cholesky; 50% sparsity at 175B | Frantar & Alistarh 2023 |
| Wanda | Saliency = weight magnitude × activation ℓ₂ norm; no weight update needed | Sun et al. 2023 |
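The Wanda criterion in the last row is simple enough to write out directly. The sketch below assumes a linear layer with weights of shape `(out_features, in_features)` and a batch of calibration activations of shape `(tokens, in_features)`, and compares scores within each output row; the function name `wanda_mask` is illustrative rather than taken from any released code.

```python
import numpy as np

def wanda_mask(weight, activations, sparsity):
    """Score S_ij = |W_ij| * ||X_j||_2 and prune the lowest-scoring
    `sparsity` fraction of inputs independently for each output row."""
    act_norm = np.linalg.norm(activations, axis=0)   # one norm per input feature
    score = np.abs(weight) * act_norm                # broadcasts over rows
    n_prune = int(sparsity * weight.shape[1])
    prune_idx = np.argsort(score, axis=1)[:, :n_prune]
    mask = np.ones(weight.shape, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return mask

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8))
calib = rng.normal(size=(32, 8))
calib[:, 0] *= 10.0                                  # one high-magnitude input feature
mask = wanda_mask(weight, calib, sparsity=0.5)
pruned_weight = weight * mask                        # no weight update afterwards
print(mask.sum(axis=1))
```

Unlike plain magnitude pruning, a small weight attached to a consistently large input feature can survive, since the activation norm raises its score.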
Dependency Graph
flowchart TD
A["Second-Order Methods<br/>concepts/optimization-theory/second-order-methods.md"]
B["Classical Pruning<br/>classical-pruning.md"]
C["Compression Pipelines<br/>compression-pipelines.md"]
D["Structured Pruning<br/>structured-pruning.md"]
E["Sparse Training<br/>sparse-training.md"]
F["LLM Pruning<br/>llm-pruning.md"]
G["Knowledge Distillation<br/>concepts/deep-learning-engineering/knowledge-distillation/knowledge-distillation.md"]
A --> B
B --> C
B --> D
B --> E
B --> F
C --> F
E --> F
G -.->|"competing paradigm"| F
Master References
| Reference | Authors | Year | Sub-theme | Key Contribution | Link |
|---|---|---|---|---|---|
| Optimal Brain Damage | LeCun, Denker, Solla | 1990 | Classical | Diagonal-Hessian saliency scores from a second-order Taylor expansion of the loss | NeurIPS 1989 |
| Optimal Brain Surgeon | Hassibi, Stork | 1993 | Classical | Full inverse-Hessian; exact closed-form weight compensation after pruning | NeurIPS 1992 |
| Learning Weights and Connections | Han, Pool, Tran, Dally | 2015 | Magnitude | IMP pipeline; 9× AlexNet, 13× VGG-16 compression, no accuracy loss | NeurIPS 2015 |
| Deep Compression | Han, Mao, Dally | 2016 | Pipeline | Prune + quantize + Huffman; 35–49× total compression; ICLR 2016 Best Paper | ICLR 2016 |
| EIE | Han et al. | 2016 | Hardware | Custom VLSI for compressed sparse FC; 189× CPU speedup | ISCA 2016 |
| SCNN | Parashar et al. | 2017 | Hardware | Dataflow exploiting both weight and activation sparsity | ISCA 2017 |
| Pruning Filters | Li et al. | 2016 | Structured | ℓ₁-norm filter pruning; hardware-compatible structured sparsity | ICLR 2017 |
| Network Slimming | Liu et al. | 2017 | Structured | BN γ sparsity regularization; channel pruning via near-zero γ | ICCV 2017 |
| Sixteen Heads | Michel, Levy, Neubig | 2019 | Structured | Most attention heads are redundant; head importance scoring | NeurIPS 2019 |
| Analyzing Self-Attention | Voita et al. | 2019 | Structured | L0-gate head pruning; specialized vs. redundant heads | ACL 2019 |
| Lottery Ticket Hypothesis | Frankle, Carbin | 2019 | Sparse training | Winning ticket subnetworks; IMP with reset to the original initialization; ICLR 2019 Best Paper | ICLR 2019 |
| Linear Mode Connectivity | Frankle et al. | 2020 | Sparse training | LTH stability requires rewinding to early training checkpoint, not step 0 | ICML 2020 |
| SNIP | Lee, Ajanthan, Torr | 2019 | Sparse training | Connection sensitivity at init; prune before any training | ICLR 2019 |
| SET | Mocanu et al. | 2018 | Sparse training | Sparse Erdős–Rényi topology evolved online; no dense model | Nature Comms 2018 |
| SNFS | Dettmers, Zettlemoyer | 2019 | Sparse training | Gradient-momentum-based topology reallocation; up to ~5× faster training reported | arXiv 2019 |
| RigL | Evci et al. | 2020 | Sparse training | Periodic gradient-magnitude topology updates; fixed FLOP training | ICML 2020 |
| Movement Pruning | Sanh, Wolf, Rush | 2020 | LLM | Fine-tuning-adaptive saliency; weight × gradient movement | NeurIPS 2020 |
| SparseGPT | Frantar, Alistarh | 2023 | LLM | Layerwise OBS at 175B scale; approximate inverse-Hessian via Cholesky | ICML 2023 |
| Wanda | Sun et al. | 2023 | LLM | Weight × activation-norm saliency; no weight update; competitive with SparseGPT | arXiv 2023 |
| State of Sparsity | Gale, Elsen, Hooker | 2019 | Survey | Magnitude pruning matches complex methods; sparse architectures found by pruning do not reach the same accuracy when trained from scratch | arXiv 2019 |
| Pruning Survey | Blalock et al. | 2020 | Survey | 81-paper meta-survey; community lacks reproducible benchmarks; ShrinkBench | MLSys 2020 |
| Rethinking Pruning | Liu et al. | 2019 | Survey | For structured pruning, architecture > inherited weights; train from random init | ICLR 2019 |