- KV Cache Efficiency Techniques and DeepSeek Sparse Attention
- Conditional Computation in Transformers: Mixture-of-Depths and Depth-Streaming Attention
- A History of Attention Mechanisms
- Linear Attention, Chunkwise Training, and Neural Memory
- Standard Softmax Attention Mechanisms
- Structured Attention Networks