Compression Pipelines: Deep Compression and EIE
Table of Contents
- 1. Motivation: From Pruning to Deployment
- 2. The Deep Compression Pipeline
- 3. EIE: Efficient Inference Engine
- 4. NVIDIA Ampere 2:4 Structured Sparsity
- 5. References
1. Motivation: From Pruning to Deployment
Iterative magnitude pruning (Classical Pruning) produces a network in which 80–90% of weights are exactly zero. Storing or computing with this sparse network naively, however, recovers none of the theoretical efficiency:
- A sparse weight matrix stored as a dense 32-bit float array wastes memory: 90% of the entries are zeros consuming full storage.
- A matrix-vector product over the dense representation executes 10× more multiply-accumulate operations than necessary.
- Modern GPU cores have no native support for irregular unstructured sparsity; a sparse matrix-vector product on a GPU can be slower than its dense counterpart because of irregular memory access patterns.
Deep Compression (Han, Mao, Dally 2016) addresses the storage problem via a three-stage pipeline that reduces AlexNet from 240 MB to 6.9 MB. EIE addresses the compute problem via custom VLSI that directly operates on the compressed sparse representation.
2. The Deep Compression Pipeline
The pipeline applies three compression stages in sequence:
flowchart LR
A["Dense model<br/>32-bit floats<br/>240 MB (AlexNet)"]
B["Sparse model<br/>32-bit floats<br/>~27 MB (9x)"]
C["Sparse + quantized<br/>5- to 8-bit codes<br/>~8.9 MB (27x)"]
D["Sparse + quantized<br/>+ Huffman<br/>6.9 MB (35x)"]
A -->|"Pruning<br/>(IMP)"| B
B -->|"k-means<br/>quantization"| C
C -->|"Huffman<br/>coding"| D
2.1 Stage 1: Pruning
Apply iterative magnitude pruning (see Classical Pruning §6) to remove 80–90% of weights. The output is a sparse weight matrix with a binary mask:
\[W_\text{sparse} = W \odot M, \quad M_{ij} \in \{0, 1\}\]
The mask \(M\) does not need to be stored explicitly; it is implicit in the sparse representation used in subsequent stages.
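As a minimal, dependency-free sketch of the masking step, the hypothetical helper `magnitude_mask` below zeroes the smallest-magnitude fraction of a flat weight list. It shows a single pruning round only, without the retraining that iterative magnitude pruning interleaves; the function name and the tie-handling rule are illustrative assumptions.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest |w|.

    Hypothetical helper for illustration. Ties at the threshold are broken
    by position, so exactly round(sparsity * n) weights are removed.
    """
    n = len(weights)
    n_prune = round(sparsity * n)
    # Indices sorted by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(n), key=lambda i: abs(weights[i]))
    mask = [1] * n
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, 0.2]
mask = magnitude_mask(weights, sparsity=0.6)
# Element-wise product W ⊙ M, written so that pruned entries are exactly 0.0
w_sparse = [w if m else 0.0 for w, m in zip(weights, mask)]
print(mask)      # [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
print(w_sparse)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, 0.0, -0.9, 0.0, 0.0]
```

Only the four largest-magnitude weights survive; a sparse format then needs to record just their values and positions.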
2.2 Stage 2: Weight Quantization
Each layer's non-zero weights are clustered into \(k = 2^b\) groups using \(k\)-means clustering (typically \(b = 4\) or \(5\) bits). Instead of storing each weight as a 32-bit float, we store:
- A codebook of \(k\) centroid values \(\{c_1, \ldots, c_k\}\) (32-bit floats, \(k \cdot 32\) bits total).
- A code index for each non-zero weight: a \(b\)-bit integer pointing to its centroid.
Formal setup. Let \(W_{nz} = \{w_i : w_i \neq 0\}\) be the non-zero weights of a layer. We solve:
\[\min_{c_1,\ldots,c_k,\, z} \sum_{i: w_i \neq 0} \bigl(w_i - c_{z_i}\bigr)^2\]
where \(z_i \in \{1,\ldots,k\}\) is the cluster assignment for weight \(w_i\). This is standard \(k\)-means; Lloyd's algorithm runs in \(O(k \cdot |W_{nz}| \cdot T)\) time for \(T\) iterations.
Gradient update for codebook (fine-tuning). After assigning codes, the network is fine-tuned. Gradients are accumulated per cluster:
\[\frac{\partial L}{\partial c_j} = \sum_{i:\, z_i = j} \frac{\partial L}{\partial w_i}\]
All weights sharing a centroid receive the same gradient update: the centroid moves, and all weights in its cluster move with it.
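The per-cluster accumulation can be sketched in a few lines of plain Python. The helper names `centroid_gradients` and `sgd_step`, the 0-based code indices, and the learning rate `lr` are assumptions made for this illustration; the source only specifies the summed gradient.

```python
def centroid_gradients(weight_grads, codes, k):
    """Sum dL/dw_i over every weight assigned to centroid j (0-based codes)."""
    grads = [0.0] * k
    for g, z in zip(weight_grads, codes):
        grads[z] += g
    return grads


def sgd_step(codebook, centroid_grads, lr):
    """One plain SGD step on the centroids; lr is an illustrative step size."""
    return [c - lr * g for c, g in zip(codebook, centroid_grads)]


codebook = [-1.0, 0.0, 2.0]
codes = [0, 2, 2, 1, 0, 2]                        # cluster index z_i per weight
weight_grads = [0.5, -0.25, 0.75, 1.0, -0.5, 0.5]  # dL/dw_i per weight

g = centroid_gradients(weight_grads, codes, k=3)
new_codebook = sgd_step(codebook, g, lr=0.5)
# Every weight is read back through the codebook, so weights in the same
# cluster stay tied to one shared value after the update.
shared = [new_codebook[z] for z in codes]
print(g)             # [0.0, 1.0, 1.0]
print(new_codebook)  # [-1.0, -0.5, 1.5]
```

Cluster 0 receives gradients that cancel, so its centroid does not move, while the three weights in cluster 2 all shift from 2.0 to 1.5 together.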
Weight distributions after training are approximately Gaussian (for FC layers) or bimodal. Uniform quantization wastes precision on the sparse tails. K-means places more cluster centers in the high-density region near zero, matching the data distribution. Empirically, k-means quantization loses \(\approx 0.5\%\) top-1 accuracy vs. full precision at 4 bits; uniform quantization loses \(\approx 2\%\).
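The density argument can be checked on synthetic data. The sketch below draws Gaussian samples as stand-in weights (an assumption, not real network weights), builds 16 linearly spaced centroids over the observed range as the "uniform" quantizer, and refines them with a small 1-D Lloyd's loop. It compares reconstruction error only, not model accuracy.

```python
import random


def nearest(value, centers):
    """Index of the centroid closest to `value`."""
    return min(range(len(centers)), key=lambda j: (value - centers[j]) ** 2)


def mse(values, centers):
    """Mean squared error when each value is replaced by its nearest centroid."""
    return sum((v - centers[nearest(v, centers)]) ** 2 for v in values) / len(values)


def lloyd_1d(values, centers, iters=20):
    """Plain Lloyd's algorithm in one dimension, starting from `centers`."""
    centers = list(centers)
    for _ in range(iters):
        sums = [0.0] * len(centers)
        counts = [0] * len(centers)
        for v in values:
            j = nearest(v, centers)
            sums[j] += v
            counts[j] += 1
        # Empty clusters keep their previous centroid.
        centers = [s / c if c else old for s, c, old in zip(sums, counts, centers)]
    return centers


random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic "weights"
k = 16
lo, hi = min(values), max(values)
uniform = [lo + (hi - lo) * j / (k - 1) for j in range(k)]  # linear spacing
kmeans = lloyd_1d(values, uniform)

mse_uniform = mse(values, uniform)
mse_kmeans = mse(values, kmeans)
print(f"uniform MSE: {mse_uniform:.5f}")
print(f"k-means MSE: {mse_kmeans:.5f}")
```

Because Lloyd's iterations never increase the squared error, the k-means codebook can only match or beat the linear grid it started from; on bell-shaped data the gap comes from centroids migrating toward the dense region near zero.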
This exercise analyzes the storage cost of the quantization codebook.
Prerequisites: 2.2 Stage 2: Weight Quantization
A fully-connected layer has \(N_{nz} = 4{,}096 \times 4{,}096 \times 0.1 = 1{,}677{,}722\) non-zero weights (90% pruned). We quantize to \(b = 4\) bits (\(k = 16\) clusters).
(a) How many bits does the codebook consume? How many bits do the code indices consume?
(b) What is the total storage cost in MB? Compare to storing the non-zero weights as 32-bit floats.
(c) At what value of \(N_{nz}\) does the codebook overhead become negligible (say, \(< 1\%\) of total storage)?
Key insight: The codebook is fixed-size (\(k \cdot 32\) bits) regardless of \(N_{nz}\); its overhead is \(O(1/N_{nz})\) relative.
(a) Codebook: \(16 \times 32 = 512\) bits \(= 64\) bytes. Code indices: \(N_{nz} \times 4 = 1{,}677{,}722 \times 4 \approx 6.71 \times 10^6\) bits \(\approx 0.84\) MB.
(b) Total: \(64\) bytes \(+ 0.84\) MB \(\approx 0.84\) MB. Full 32-bit storage: \(1.68 \times 10^6 \times 4\) bytes \(\approx 6.7\) MB. Compression ratio from quantization alone: \(\approx 8\times\).
(c) Codebook overhead \(< 1\%\): \(512 / (N_{nz} \cdot 4) < 0.01 \implies N_{nz} > 512/0.04 = 12{,}800\). Any non-trivial layer satisfies this; the codebook is always negligible for realistic layer sizes.
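The arithmetic in this solution can be reproduced with a short script. The helper name `quantized_storage_bits` is hypothetical, and 1 MB is taken as \(10^6\) bytes.

```python
def quantized_storage_bits(n_nonzero, n_bits, float_bits=32):
    """Return (codebook_bits, index_bits) for one quantized layer."""
    codebook_bits = (2 ** n_bits) * float_bits  # k centroids stored as floats
    index_bits = n_nonzero * n_bits             # one b-bit code per weight
    return codebook_bits, index_bits


n_nz = round(4096 * 4096 * 0.1)                 # 1,677,722 non-zero weights
codebook_bits, index_bits = quantized_storage_bits(n_nz, n_bits=4)

total_mb = (codebook_bits + index_bits) / 8 / 1e6
float_mb = n_nz * 32 / 8 / 1e6
ratio = float_mb / total_mb

# Smallest N_nz scale at which the codebook is 1% of the index storage:
# codebook_bits / (N_nz * n_bits) = 0.01  =>  N_nz = 100 * codebook_bits / n_bits
threshold = codebook_bits * 100 // 4

print(codebook_bits, index_bits)                # 512 6710888
print(f"{total_mb:.2f} MB vs {float_mb:.2f} MB, ratio {ratio:.2f}x")
print(threshold)                                # 12800
```

The ratio falls just short of 8 because of the 64-byte codebook, which is the \(O(1/N_{nz})\) overhead named in the key insight.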
2.3 Stage 3: Huffman Coding
Huffman coding assigns variable-length binary codes to symbols based on their frequency: frequent symbols get shorter codes. After quantization, the code indices \(\{z_i\}\) form a discrete distribution over \(\{1,\ldots,k\}\).
Shannon entropy lower bound. The minimum average code length achievable is the entropy:
\[H(\mathbf{z}) = -\sum_{j=1}^k p_j \log_2 p_j \quad \text{bits per symbol}\]
where \(p_j = |\{i : z_i = j\}| / N_{nz}\). Huffman coding achieves average length within \(1\) bit of \(H(\mathbf{z})\).
Storage after Huffman. If the average Huffman code length is \(\bar{\ell}\) bits (typically \(3\)–\(3.5\) bits for 4-bit quantization of a Gaussian weight distribution), total non-zero weight storage is \(N_{nz} \cdot \bar{\ell}\) bits, an additional \(4/\bar{\ell} \approx 1.14\)–\(1.33\times\) compression on top of the 4-bit codes.
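To see the entropy bound in action, the sketch below computes Huffman code lengths for a small, made-up symbol histogram and compares the average length with \(H(\mathbf{z})\). The histogram (four symbols, so the fixed-length baseline is 2 bits) and the helper names are assumptions chosen so the numbers are easy to follow by hand.

```python
import heapq
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (bits per symbol) of an empirical histogram."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code over `counts`."""
    if len(counts) == 1:
        return {s: 1 for s in counts}
    # Heap items: (frequency, unique tie-breaker, symbols in this subtree)
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in counts}
    tie = len(heap)
    while len(heap) > 1:
        fa, _, sa = heapq.heappop(heap)
        fb, _, sb = heapq.heappop(heap)
        for s in sa + sb:
            lengths[s] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (fa + fb, tie, sa + sb))
        tie += 1
    return lengths


codes = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10   # skewed 2-bit codes
counts = Counter(codes)
lengths = huffman_code_lengths(counts)
avg_len = sum(counts[s] * lengths[s] for s in counts) / len(codes)
h = entropy_bits(counts)

print(lengths)                      # {0: 1, 1: 2, 2: 3, 3: 3}
print(f"H = {h:.3f} bits, average Huffman length = {avg_len:.2f} bits")
print(f"gain over fixed 2-bit codes: {2 / avg_len:.2f}x")
```

The average length of 1.75 bits sits just above the entropy of about 1.74 bits, consistent with \(H \le \bar{\ell} < H + 1\).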
The index difference trick. Non-zero weights are stored in sparse format as (value, column-index) pairs. The column indices are stored as differences between consecutive non-zero positions rather than absolute values. Differences are small (average gap = \(1/(1-\text{sparsity})\) columns), so they use few bits. These differences are also Huffman-coded.
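A small sketch of relative indexing follows. It assumes a fixed field width of `bits` per gap and handles gaps that do not fit by inserting filler zero entries, which is the padding approach described in the Deep Compression paper; the function names and the example array are illustrative.

```python
def encode_relative(positions, values, bits):
    """Encode sorted non-zero positions as (gap, value) pairs.

    A gap is the distance from the previous stored entry. Gaps larger than
    what `bits` can hold are bridged with filler entries whose value is 0.
    """
    max_delta = 2 ** bits - 1
    encoded = []
    prev = -1
    for pos, val in zip(positions, values):
        delta = pos - prev
        while delta > max_delta:
            encoded.append((max_delta, 0.0))  # filler zero
            prev += max_delta
            delta = pos - prev
        encoded.append((delta, val))
        prev = pos
    return encoded


def decode_relative(encoded, length):
    """Rebuild the dense array from (gap, value) pairs."""
    dense = [0.0] * length
    pos = -1
    for delta, val in encoded:
        pos += delta
        dense[pos] = val
    return dense


dense = [0.0] * 20
dense[1], dense[4], dense[15] = 3.4, 0.9, 1.7
positions = [i for i, v in enumerate(dense) if v != 0.0]
values = [dense[i] for i in positions]

encoded = encode_relative(positions, values, bits=3)
print(encoded)  # [(2, 3.4), (3, 0.9), (7, 0.0), (4, 1.7)]
assert decode_relative(encoded, len(dense)) == dense
```

The jump from position 4 to 15 exceeds the 3-bit limit of 7, so one filler entry is emitted. Narrower fields save bits per entry but add more fillers, which is the trade-off behind choosing the index width per layer.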
This exercise bounds the Huffman compression factor given the weight distribution.
Prerequisites: 2.3 Stage 3: Huffman Coding
A quantized layer has 4-bit codes (\(k = 16\)). The cluster frequencies follow a Laplace distribution: \(p_j \propto \exp(-|j - 8|/\lambda)\) for \(j = 1, \ldots, 16\) with \(\lambda = 2\).
(a) Normalize \(p_j\) and compute the entropy \(H(\mathbf{z})\).
(b) What is the maximum compression factor (ratio of 4 bits to \(H(\mathbf{z})\)) achievable by Huffman coding?
Key insight: Huffman compression is most effective when the distribution is highly skewed; for a near-uniform distribution, it provides little gain.
(a) Unnormalized: \(\tilde{p}_j = e^{-|j-8|/2}\) for \(j=1,\ldots,16\). The support is not quite symmetric around \(j=8\): the distance \(|j-8|\) runs from 1 to 7 on the left and from 1 to 8 on the right. Hence \(Z = 1 + 2\sum_{d=1}^{7} e^{-d/2} + e^{-4} \approx 1 + 2(0.607 + 0.368 + 0.223 + 0.135 + 0.082 + 0.050 + 0.030) + 0.018 \approx 4.01\). Central probability \(p_8 = 1/Z \approx 0.249\). Since \(-\log_2 p_j = \log_2 Z + \tfrac{\log_2 e}{2}\,|j-8|\), the entropy is \(H = \log_2 Z + \tfrac{\log_2 e}{2}\,\mathbb{E}|j-8| \approx 2.00 + 0.721 \times 1.77 \approx 3.28\) bits, below the uniform value \(H_{uniform} = 4\) bits.
(b) Compression factor \(\approx 4/3.28 \approx 1.22\times\) from Huffman on top of 4-bit quantization. This is an upper bound, because the average Huffman code length is at least \(H\). Combined with 4-bit quantization (8× over 32-bit), total compression from quantization + Huffman is \(\approx 9.8\times\) for this layer.
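A quick numerical check of the normalizer, the entropy, and the resulting bound, using only the standard library (variable names are illustrative):

```python
import math

lam = 2.0
# Unnormalized Laplace-shaped frequencies for clusters j = 1..16, centered at 8
unnorm = [math.exp(-abs(j - 8) / lam) for j in range(1, 17)]
Z = sum(unnorm)
p = [u / Z for u in unnorm]

H = -sum(q * math.log2(q) for q in p)   # entropy in bits per symbol
factor = 4 / H                          # best case relative to fixed 4-bit codes

print(f"Z = {Z:.3f}, p_8 = {p[7]:.3f}, H = {H:.2f} bits, 4/H = {factor:.2f}x")
# Z = 4.008, p_8 = 0.249, H = 3.28 bits, 4/H = 1.22x
```

Changing `lam` shows the key insight directly: a smaller value concentrates the mass near \(j = 8\) and lowers \(H\), while a large value flattens the distribution toward the 4-bit uniform limit.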
2.4 End-to-End Compression Ratios
Han et al. report the following for AlexNet:
| Layer | Parameters | Weights pruned | Code width | Huffman gain | Total compression |
|---|---|---|---|---|---|
| conv1 | 35K | 16% | 8-bit | 1.2× | 6.3× |
| conv2–5 | 1.5M | 62% | 8-bit | 1.3× | 9.6× |
| fc6 | 37.7M | 91% | 5-bit | 1.3× | 27.3× |
| fc7 | 16.8M | 91% | 5-bit | 1.3× | 27.2× |
| fc8 | 4.1M | 75% | 5-bit | 1.3× | 17.9× |
| Total | 60.9M | – | – | – | 35× |
VGG-16: 49× total compression (552 MB → 11.3 MB) with no accuracy loss.
2.5 PyTorch: Deep Compression Pipeline
import torch
import torch.nn as nn
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter
import heapq
from dataclasses import dataclass, field
from typing import Optional
# ---------------------------------------------------------------------
# Stage 2: k-means weight quantization
# ---------------------------------------------------------------------
@dataclass
class QuantizedLayer:
    """Stores the codebook and per-weight code indices for one layer."""
    codebook: torch.Tensor    # (k,) float32 centroid values
    codes: torch.Tensor       # (n_nonzero,) int code indices
    indices: torch.Tensor     # (n_nonzero,) int positions of non-zero weights
    original_shape: tuple     # shape of the original weight tensor
    n_bits: int               # bits per code (log2 k)


def quantize_layer(weight: torch.Tensor, n_bits: int = 4) -> QuantizedLayer:
    """
    K-means quantize the non-zero weights of a (pruned) weight tensor.

    Args:
        weight: weight tensor (may contain zeros from pruning)
        n_bits: bits per code; k = 2**n_bits clusters
    Returns:
        QuantizedLayer with codebook and code indices for non-zero weights
    """
    k = 2 ** n_bits
    # detach() so that nn.Parameter inputs can be converted to NumPy
    flat = weight.detach().flatten()
    nz_mask = flat.ne(0)
    nz_vals = flat[nz_mask].float().cpu().numpy().reshape(-1, 1)
    # Initialize k-means with linear spacing over the non-zero range
    init_centers = np.linspace(nz_vals.min(), nz_vals.max(), k).reshape(-1, 1)
    km = KMeans(n_clusters=k, init=init_centers, n_init=1, max_iter=300)
    km.fit(nz_vals)
    codebook = torch.tensor(km.cluster_centers_.flatten(), dtype=torch.float32)
    codes = torch.tensor(km.labels_, dtype=torch.int32)
    # Keep indices on the CPU so they match the codebook and codes
    indices = nz_mask.nonzero(as_tuple=False).squeeze(1).cpu()
    return QuantizedLayer(
        codebook=codebook,
        codes=codes,
        indices=indices,
        original_shape=tuple(weight.shape),
        n_bits=n_bits,
    )


def dequantize_layer(ql: QuantizedLayer) -> torch.Tensor:
    """Reconstruct a dense weight tensor from a QuantizedLayer."""
    flat = torch.zeros(int(np.prod(ql.original_shape)), dtype=torch.float32)
    flat[ql.indices] = ql.codebook[ql.codes.long()]
    return flat.reshape(ql.original_shape)
def fine_tune_codebook(
    ql: QuantizedLayer,
    grad_weight: torch.Tensor,
    lr: float = 1e-3,
) -> QuantizedLayer:
    """
    Accumulate gradients per cluster and take one SGD step on the centroids.
    Called during the fine-tuning phase after quantization.

    grad_weight: the gradient dL/dW for the full weight tensor
    lr: step size (an illustrative default, not a value from the paper)
    """
    flat_grad = grad_weight.detach().flatten().float().cpu()[ql.indices]
    new_codebook = ql.codebook.clone()
    for j in range(len(ql.codebook)):
        mask = ql.codes == j
        if mask.any():
            # dL/dc_j is the sum of dL/dw_i over the weights in cluster j
            new_codebook[j] -= lr * flat_grad[mask].sum()
    return QuantizedLayer(
        codebook=new_codebook,
        codes=ql.codes,
        indices=ql.indices,
        original_shape=ql.original_shape,
        n_bits=ql.n_bits,
    )
# ---------------------------------------------------------------------
# Stage 3: Huffman coding
# ---------------------------------------------------------------------
@dataclass(order=True)
class _HNode:
    freq: int
    symbol: Optional[int] = field(compare=False, default=None)
    left: Optional["_HNode"] = field(compare=False, default=None)
    right: Optional["_HNode"] = field(compare=False, default=None)


def build_huffman_codebook(symbols: list[int]) -> dict[int, str]:
    """Build a Huffman codebook from a list of symbol occurrences."""
    freq = Counter(symbols)
    heap = [_HNode(f, s) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, _HNode(a.freq + b.freq, left=a, right=b))
    root = heap[0]
    codebook: dict[int, str] = {}

    def _traverse(node: _HNode, code: str) -> None:
        if node.symbol is not None:
            # A single-symbol alphabet still needs a 1-bit code
            codebook[node.symbol] = code or "0"
        else:
            if node.left:
                _traverse(node.left, code + "0")
            if node.right:
                _traverse(node.right, code + "1")

    _traverse(root, "")
    return codebook


def huffman_compress(ql: QuantizedLayer) -> tuple[dict[int, str], float]:
    """
    Build Huffman codes for the code indices and return:
    - huffman_codes: dict mapping code index -> bit string
    - avg_bits: average bits per non-zero weight after Huffman
    """
    symbols = ql.codes.tolist()
    hcodes = build_huffman_codebook(symbols)
    avg_bits = sum(len(hcodes[s]) for s in symbols) / len(symbols)
    return hcodes, avg_bits
def compression_ratio(
    ql: QuantizedLayer,
    huffman_avg_bits: float,
    original_bits: int = 32,
) -> float:
    """
    Compute compression ratio relative to dense 32-bit storage.
    Accounts for: (1) sparsity, (2) quantization, (3) Huffman.
    """
    n_total = int(np.prod(ql.original_shape))
    n_nonzero = len(ql.codes)
    # Storage: n_nonzero * huffman_avg_bits (codes) + n_nonzero * log2(n_total) (indices)
    # Codebook: 2**n_bits * 32 bits (negligible for large layers)
    idx_bits = np.log2(n_total)
    compressed_bits = n_nonzero * (huffman_avg_bits + idx_bits) + (2 ** ql.n_bits) * 32
    original_total_bits = n_total * original_bits
    return original_total_bits / compressed_bits

3. EIE: Efficient Inference Engine
Han, Liu, Mao, Pu, Pedram, Horowitz, Dally (2016). "EIE: Efficient Inference Engine on Compressed Deep Neural Network." ISCA 2016.
3.1 The Sparsity Exploitation Problem
After Deep Compression, an FC layer has:
- 90%+ weight sparsity
- ReLU activation sparsity: \(\approx 50\%\) of activations are zero (ReLU zeroes negative pre-activations)
A GPU cannot efficiently exploit unstructured weight sparsity. The irregular memory access pattern of a compressed sparse row (CSR) matrix-vector product causes memory bandwidth bottlenecks, whereas the GPU's strength is regular, coalesced DRAM access. In practice, a 90%-sparse matrix-vector product on a GPU is often no faster than its dense counterpart.
EIE is a custom ASIC designed to operate natively on the compressed sparse column (CSC) format and to skip zero activations at the hardware level.
3.2 Compressed Sparse Column Representation
For a weight matrix \(W \in \mathbb{R}^{m \times n}\) with sparsity \(s\), the CSC representation stores:
- `val`: vector of \(N_{nz} = (1-s) \cdot mn\) non-zero weight code indices (4-bit integers after quantization), packed two per byte.
- `col_ptr`: vector of \(n+1\) integers; `col_ptr[j]` is the index in `val` of the first non-zero entry in column \(j\).
- `row_idx`: vector of \(N_{nz}\) integers giving the row of each non-zero entry, stored as relative offsets (4-bit differences) between consecutive non-zeros in the same column.
Storage for AlexNet fc6 (\(4096 \times 9216\), 91% pruned):
- Original: \(4096 \times 9216 \times 32 \approx 1.2\) Gbit (\(\approx 151\) MB)
- CSC + 4-bit codes: \(\approx 44\) Mbit (\(\approx 5.5\) MB; 27× compression from pruning + quantization)
3.3 EIE Architecture
EIE implements the matrix-vector product \(y = Wh\) where \(W\) is stored in CSC and \(h\) is the (sparse) activation vector.
flowchart TD
A["Central Control Unit<br/>Broadcasts non-zero h_j values"]
B["PE 0<br/>Rows 0, P, 2P, ..."]
C["PE 1<br/>Rows 1, P+1, 2P+1, ..."]
D["PE P-1<br/>Rows P-1, 2P-1, ..."]
E["Activation Queue<br/>(non-zero h_j)"]
F["Codebook Lookup<br/>(4-bit → float)"]
G["Accumulator Array<br/>(output y_i)"]
A --> E
E --> B
E --> C
E --> D
B --> F
C --> F
D --> F
F --> G
Key design decisions:
1. Row-interleaved work distribution: Each of \(P\) processing elements (PEs) owns a disjoint subset of the rows of \(W\) together with the matching output entries: row \(i\) is assigned to PE \(i \bmod P\), as described in the EIE paper. Each PE stores its slice of every column in CSC form. Interleaving spreads the non-zero weights approximately evenly across PEs.
2. Zero activation skipping: The central control unit broadcasts non-zero \(h_j\) values and their indices. Zero activations are never broadcast, so the PEs skip them entirely. This exploits the \(\approx 50\%\) ReLU activation sparsity.
3. In-place codebook lookup: Each PE stores the 16-entry codebook (16 × 16-bit = 32 bytes) and decodes 4-bit codes on the fly during accumulation.
4. Pointer-driven column scanning: Each PE keeps a `col_ptr` array into its own `val` array. When a non-zero \(h_j\) is broadcast, every PE scans the non-zero entries in its slice of column \(j\) and accumulates \(h_j\) times the decoded weight into its local output accumulators.
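The broadcast-and-accumulate loop can be mimicked in software. The sketch below assumes the row-interleaved partitioning described in the EIE paper (PE \(k\) owns rows \(i\) with \(i \bmod P = k\)) and ignores activation queues, codebook decoding, and timing. The function names and the 4×4 example are hypothetical.

```python
def build_pe_columns(W, num_pes):
    """Give row i to PE i % num_pes; store each PE's slice column by column
    as lists of (local_row, value) pairs, a CSC-like layout."""
    m, n = len(W), len(W[0])
    pes = [[[] for _ in range(n)] for _ in range(num_pes)]
    for i in range(m):
        for j in range(n):
            if W[i][j] != 0:
                pes[i % num_pes][j].append((i // num_pes, W[i][j]))
    return pes


def eie_matvec(W, h, num_pes):
    """Compute y = W h, skipping zero activations; also count MACs per PE."""
    m = len(W)
    pes = build_pe_columns(W, num_pes)
    # Each PE accumulates only the output rows it owns.
    acc = [[0.0] * ((m - k + num_pes - 1) // num_pes) for k in range(num_pes)]
    macs = [0] * num_pes
    for j, a in enumerate(h):
        if a == 0:
            continue  # zero activations are never broadcast
        for k in range(num_pes):          # conceptually in parallel
            for local_row, w in pes[k][j]:
                acc[k][local_row] += w * a
                macs[k] += 1
    y = [acc[i % num_pes][i // num_pes] for i in range(m)]
    return y, macs


W = [[0, 2, 0, 0],
     [1, 0, 0, 3],
     [0, 0, 4, 0],
     [0, 5, 0, 6]]
h = [1, 0, 2, 1]

y, macs = eie_matvec(W, h, num_pes=2)
print(y)     # [0.0, 4.0, 8.0, 6.0]
print(macs)  # [1, 3]
```

Only 4 of the 16 dense multiply-accumulates are executed. The uneven split between the two PEs in this tiny example hints at why load balance matters in the real design.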
3.4 Performance Analysis
Compute: A dense FC layer costs \(mn\) multiply-accumulates (MACs), i.e. \(2mn\) floating-point operations. EIE performs \((1-s_w)(1-s_a) \cdot mn\) MACs, where \(s_w \approx 0.9\) is weight sparsity and \(s_a \approx 0.5\) is activation sparsity. Speedup from sparsity alone: \(\frac{1}{(1-s_w)(1-s_a)} = \frac{1}{0.1 \times 0.5} = 20\times\).
Memory bandwidth: EIE reads 4-bit codes (not 32-bit floats) and skips zeros. Ignoring index overhead, the weight traffic relative to dense 32-bit storage shrinks by \(\approx 8\times\) (4-bit) \(\times 10\times\) (weight sparsity) \(\times 2\times\) (activation sparsity) \(= 160\times\).
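These back-of-the-envelope factors are easy to parameterize. In the sketch below the function names are hypothetical, and the optional `index_bits` argument is an added assumption that shows how a 4-bit relative index per stored weight changes the traffic estimate.

```python
def sparse_speedup(weight_sparsity, activation_sparsity):
    """Ideal MAC reduction from skipping zero weights and zero activations."""
    return 1.0 / ((1.0 - weight_sparsity) * (1.0 - activation_sparsity))


def weight_traffic_reduction(weight_sparsity, activation_sparsity,
                             code_bits, index_bits=0, dense_bits=32):
    """Ideal reduction in weight bits fetched relative to dense storage."""
    bits_ratio = dense_bits / (code_bits + index_bits)
    return bits_ratio * sparse_speedup(weight_sparsity, activation_sparsity)


m, n = 4096, 9216                       # AlexNet fc6 shape
dense_macs = m * n
speedup = sparse_speedup(0.9, 0.5)
sparse_macs = dense_macs / speedup

no_index = weight_traffic_reduction(0.9, 0.5, code_bits=4)
with_index = weight_traffic_reduction(0.9, 0.5, code_bits=4, index_bits=4)

print(f"dense MACs: {dense_macs:,}, sparse MACs: {sparse_macs:,.0f}")
print(f"speedup: {speedup:.0f}x")
print(f"traffic reduction: {no_index:.0f}x without indices, "
      f"{with_index:.0f}x with 4-bit indices")
```

Counting a 4-bit index alongside each 4-bit code halves the idealized 160× figure to 80×, which is why index width is part of the format design.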
Measured results (from paper):
- 189× speedup over Intel Core i7 CPU (dense)
- 13× speedup over NVIDIA GeForce GTX Titan X (dense)
- 24,000× energy efficiency improvement over CPU
- 3,400× energy efficiency improvement over GPU
- EIE chip: 282 GOPS/W, 0.6 W power, 40 mm² area (45 nm CMOS)
A GPU is optimized for regular, high-throughput computation. Its memory system fetches whole cache lines on the assumption of spatial locality, but a sparse matrix-vector product accesses memory in an irregular pattern determined by the non-zero structure. EIE's custom memory organization is designed for exactly this access pattern. GPUs before Turing also cannot exploit 4-bit codes natively, whereas EIE decodes them through its on-chip codebook at negligible cost. See NVIDIA GPU Hardware for GPU memory hierarchy context.
3.5 PyTorch: Sparse Matrix-Vector Product in CSC Format
import torch


def dense_to_csc(W: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Convert a dense weight matrix W (m x n) to CSC format.

    Returns:
        val: (nnz,) non-zero values
        col_ptr: (n+1,) column start pointers into val
        row_idx: (nnz,) row indices of non-zeros
    """
    m, n = W.shape
    col_ptr = torch.zeros(n + 1, dtype=torch.long)
    row_indices = []
    values = []
    for j in range(n):
        nz_rows = W[:, j].nonzero(as_tuple=False).squeeze(1)
        col_ptr[j + 1] = col_ptr[j] + len(nz_rows)
        row_indices.append(nz_rows)
        values.append(W[nz_rows, j])
    val = torch.cat(values) if values else torch.tensor([])
    row_idx = torch.cat(row_indices).long() if row_indices else torch.tensor([], dtype=torch.long)
    return val, col_ptr, row_idx
def csc_matvec(
    val: torch.Tensor,
    col_ptr: torch.Tensor,
    row_idx: torch.Tensor,
    h: torch.Tensor,
    m: int,
) -> torch.Tensor:
    """
    Sparse matrix-vector product y = W h using CSC storage.
    Skips zero entries in h (simulating EIE activation sparsity).

    Args:
        val, col_ptr, row_idx: CSC representation of W
        h: (n,) input activation vector
        m: number of output rows
    Returns:
        y: (m,) output vector
    """
    y = torch.zeros(m, dtype=val.dtype, device=val.device)
    n = len(h)
    for j in range(n):
        if h[j] == 0.0:
            continue  # skip zero activations (EIE's key optimization)
        start = col_ptr[j].item()
        end = col_ptr[j + 1].item()
        if start == end:
            continue  # empty column (pruned)
        rows = row_idx[start:end]
        # Row indices within one column are unique, so += is safe here
        y[rows] += val[start:end] * h[j]
    return y
# PyTorch sparse tensors provide a vectorized alternative:
def sparse_matvec_pytorch(W_sparse: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """
    Sparse matrix-vector product using PyTorch's native sparse tensors.
    More efficient than the Python loop above for large matrices.
    """
    # W_sparse is a torch.sparse_coo_tensor or sparse_csr_tensor
    return torch.mv(W_sparse, h)


def to_pytorch_sparse(W: torch.Tensor) -> torch.Tensor:
    """Convert a dense weight tensor to PyTorch sparse CSR format."""
    return W.to_sparse_csr()

import torch
W_dense = torch.randn(4096, 9216)
# Simulate 90% pruning: keep roughly 10% of the entries
mask = torch.rand_like(W_dense) > 0.9
W_pruned = W_dense * mask
W_sparse = W_pruned.to_sparse_csr()
dense_bytes = W_dense.element_size() * W_dense.numel()
# Sparse CSR storage: values + col_indices + crow_indices
sparse_bytes = (
    W_sparse.values().nbytes
    + W_sparse.col_indices().nbytes
    + W_sparse.crow_indices().nbytes
)
print(f"Dense: {dense_bytes / 1e6:.1f} MB")
print(f"Sparse CSR: {sparse_bytes / 1e6:.1f} MB")
print(f"Compression: {dense_bytes / sparse_bytes:.1f}x")
# Dense: 151.0 MB. With PyTorch's default int64 indices, CSR needs roughly
# 45 MB (about 3.3x): 4 bytes per value plus 8 bytes per column index.
# 4-bit codes with 4-bit relative indices, as in EIE, shrink this much further.

4. NVIDIA Ampere 2:4 Structured Sparsity
EIE targets a custom ASIC. For mainstream GPU deployment, NVIDIA introduced 2:4 structured sparsity in the Ampere architecture (A100, 2020): exactly 2 of every 4 consecutive weights in a row must be zero, enforcing 50% sparsity with a regular pattern that the Sparse Tensor Core can exploit natively.
Format. For each group of 4 weights, store the 2 retained values (16-bit each) plus a 2-bit position index per retained value, i.e. 4 bits of metadata per group. Storage: \(2 \times 16 + 2 \times 2 = 36\) bits per 4-weight group vs. \(4 \times 16 = 64\) bits dense, which is about \(1.78\times\) compression (the metadata keeps it slightly below \(2\times\)).
Hardware. The Ampere Sparse Tensor Core performs the sparse GEMM directly on the 2:4 compressed format, achieving up to \(2\times\) the math throughput of the dense path. This is the first mainstream GPU to offer hardware sparsity acceleration.
Limitations. 2:4 sparsity forces exactly 50% sparsity with a structured pattern, whereas unstructured magnitude pruning produces arbitrary sparsity that does not map to 2:4. In practice, a post-training or during-training mask selection step must enforce the 2:4 constraint, typically with \(< 1\%\) accuracy degradation.
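A minimal sketch of mask selection and packing for one weight row follows. It assumes the simplest selection rule (keep the two largest magnitudes in each group of four) and a logical layout of two values plus two 2-bit positions per group. The helper names are hypothetical, and the code does not reproduce the physical memory layout or any NVIDIA library API.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude weights in every group of 4.

    Returns (values, indices): the kept values and their positions (0..3)
    inside each group, two entries per group.
    """
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])      # report positions in ascending order
        values.extend(group[i] for i in keep)
        indices.extend(keep)                 # each position fits in 2 bits
    return values, indices


def unpack_2_4(values, indices):
    """Expand the compressed form back to a dense row with zeros restored."""
    row = []
    for g in range(0, len(values), 2):
        group = [0.0] * 4
        for v, i in zip(values[g:g + 2], indices[g:g + 2]):
            group[i] = v
        row.extend(group)
    return row


row = [0.1, -0.7, 0.05, 0.4, 0.9, 0.0, -0.2, 0.3]
values, indices = prune_2_4(row)
restored = unpack_2_4(values, indices)

compressed_bits = len(values) * 16 + len(indices) * 2
dense_bits = len(row) * 16
print(values)    # [-0.7, 0.4, 0.9, 0.3]
print(indices)   # [1, 3, 0, 3]
print(restored)  # [0.0, -0.7, 0.0, 0.4, 0.9, 0.0, 0.0, 0.3]
print(f"{dense_bits} bits dense vs {compressed_bits} bits compressed")
```

Magnitude-based selection is only one option; training-time methods that learn the mask would plug into the same packing step.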
Parashar et al. (2017) extended the EIE idea to convolutional layers with SCNN, an accelerator that exploits both weight sparsity (from pruning) and activation sparsity (from ReLU) simultaneously via a novel tiled outer-product dataflow. Unlike EIE (which targets FC layers), SCNN's hardware maps the sparse convolution computation onto a 2D array of multiply-accumulate units, avoiding the input-stationary or weight-stationary limitations of standard CNN accelerators.
5. References
| Reference Name | Brief Summary | Link |
|---|---|---|
| Han, Mao, Dally (2016). "Deep Compression" | Three-stage pipeline (prune + quantize + Huffman); 35–49× compression of AlexNet/VGG-16; ICLR 2016 Best Paper | arXiv:1510.00149 |
| Han et al. (2016). "EIE: Efficient Inference Engine" | Custom VLSI for compressed-sparse FC inference; 189× CPU speedup, 24,000× CPU energy efficiency | arXiv:1602.01528 |
| Parashar et al. (2017). "SCNN" | Dual weight+activation sparsity dataflow for CNNs; ISCA 2017 | arXiv:1708.04485 |
| NVIDIA Ampere Architecture (2020) | 2:4 structured sparsity in Sparse Tensor Cores; 2× throughput at 50% sparsity | NVIDIA Technical Blog |