Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations
Yuhao Yang, Zhi Ji, Zhaopeng Li, Yi Li, Zhonglin Mo, Yue Ding, Kai Chen, Zijian Zhang, Jie Li, Shuanglong Li, Lin Liu — Baidu Inc. ArXiv 2503.02453, 2025.
| Dimension | Prior State | This Paper | Key Result |
|---|---|---|---|
| Information completeness | RQ-VAE semantic IDs lose fine-grained detail via quantization; final residual discarded | Cascaded sparse→dense generation jointly captures both categorical structure and intra-cluster detail | Removing dense branch: −27.1% R@500 on industrial dataset |
| Retrieval diversity | Beam search over code tree confined to a single representation type | BeamFusion combines beam probability and ANN cosine similarity for diversity–precision control | Removing BeamFusion: −22.1% R@500 |
| Dense representation training | Prior work (LIGER) used fixed pre-trained dense embeddings not adapted to recommendation signal | Dense encoder trained end-to-end with contrastive loss over interaction data | Removing sparse ID branch: −33.6% R@500 |
| Industrial deployment | TIGER-style discrete generation | Coarse-to-fine sparse→dense pipeline | +3.60% conversion, +4.15% ARPU at Baidu (200M DAU, Jan 2025) |
Relations
Builds on: HSTU (Zhai et al., 2024), TIGER (Rajput et al., NeurIPS 2023) (no note yet), LIGER (Yang et al., 2024) (no note yet) Concepts used: Generative Recommender Systems — §4 (Generative Retrieval, RQ-VAE, COBRA)
Table of Contents
- 1. Motivation: The Information Loss Problem
- 2. Sparse and Dense Representations
- 3. Cascaded Sequential Modeling
- 4. Training Objectives
- 5. Coarse-to-Fine Inference: BeamFusion
- 6. Experiments
- 7. References
1. Motivation: The Information Loss Problem 📐
Generative retrieval via RQ-VAE semantic IDs encodes each item \(x\) as a tuple \((c_1, c_2, \ldots, c_K)\) of discrete codeword indices. The encoder compresses the item’s dense embedding \(\mathbf{z} \in \mathbb{R}^d\) into \(K\) successive quantization residuals; the decoder reconstructs \(\hat{\mathbf{z}} = \sum_{k=1}^K \mathbf{e}_{c_k}\). By construction, the final residual \(\mathbf{r}_K = \mathbf{z} - \hat{\mathbf{z}}\) is discarded.
The information loss is irreducible. For finite \(K\) and codebook sizes, \(\|\mathbf{r}_K\| > 0\) almost surely. Two items can share an identical semantic ID yet differ in their true embeddings — a collision that makes them indistinguishable to the retrieval model. Increasing \(K\) or codebook size reduces the residual but never eliminates it, and increases inference cost.
LIGER (Yang et al., 2024) addressed this by appending a separately pre-trained dense embedding alongside the sparse ID at each sequence position. But using a frozen, separately-trained dense encoder introduces a representation mismatch: the dense vector is not adapted to the generative pretraining objective and carries no collaborative signal from interaction sequences.
COBRA’s solution is a cascade: generate the sparse ID first, then condition the generation of the dense vector on the produced sparse ID. This ordering allows the dense vector to refine the coarse categorical prediction — recovering intra-cluster detail that the sparse ID discards — while avoiding the mismatch of pre-trained fixed embeddings.
This question establishes why the ordering of the cascade matters for information-theoretic reasons.
Prerequisites: §1
The COBRA factorization generates the sparse ID first and conditions the dense vector on it. Suppose instead you generate the dense vector \(\mathbf{v}_{t+1}\) first, then the sparse ID \(\text{ID}_{t+1}\) conditioned on \(\mathbf{v}_{t+1}\). Write out the corresponding factorization of \(P(\text{ID}_{t+1}, \mathbf{v}_{t+1} \mid S_{1:t})\) and argue whether this ordering is preferable, equivalent, or worse than COBRA’s.
Key insight: The conditional \(P(\text{ID}_{t+1} \mid \mathbf{v}_{t+1}, S_{1:t})\) is computing a quantization (nearest-neighbor lookup in code space) given an already-predicted dense vector. But the dense vector prediction \(P(\mathbf{v}_{t+1} \mid S_{1:t})\) is an unconstrained regression over \(\mathbb{R}^d\) — it has no guarantee of landing near any real item in the catalog. The downstream ID assignment then nearest-neighbor projects an arbitrary point, which may not correspond to the right cluster.
Sketch: In the sparse→dense order, beam search over the sparse ID acts as a constraint that confines the dense prediction to items within the beam’s cluster. The dense head then answers a much easier question: “given that the next item is in cluster \(c_1, c_2, \ldots\), where exactly within that cluster is it?” In dense→sparse order, the first step solves a harder, unconstrained problem; the second step is a deterministic mapping from dense to discrete that adds no capacity. The cascade loses its refinement structure. Empirically, COBRA’s ablation confirms that the sparse branch contributes +33.6% recall — it provides the structural prior that makes the dense prediction tractable.
2. Sparse and Dense Representations
2.1 Sparse IDs via RQ-VAE
The sparse representation follows the standard RQ-VAE construction (see §4.1 of the concept note for full derivation). Item \(x\) is encoded from its textual attribute description via a Transformer text encoder, then quantized hierarchically into a \(K\)-tuple \((c_1, \ldots, c_K)\).
Each level has a codebook of \(V\) entries; \(K\) is the number of quantization levels. COBRA uses \(K=3\), \(V=32\) for public datasets and \(K=2\), \(V=32\) for the industrial dataset. The codebook is trained with EMA updates and random last-level assignment to achieve $$99% code uniqueness (see codebook collapse mitigation).
2.2 End-to-End Learnable Dense Vectors
Definition (COBRA Dense Encoder). A Transformer encoder processes the flattened item attribute tokens prefixed with a [CLS] token. The dense vector for item \(x\) is the [CLS] hidden state:
\[\mathbf{v}_x = \text{TransformerEncoder}([\texttt{[CLS]}, w_1, w_2, \ldots, w_L])_{[0]}\]
where \(w_1, \ldots, w_L\) are the item’s attribute tokens and the subscript \([0]\) selects the [CLS] position.
Critically, this encoder is trained end-to-end with the contrastive loss of §4.2 — not pre-trained and frozen. The dense vector therefore adapts to the recommendation objective: items that co-occur in user interaction sequences are pulled together in dense space, regardless of their attribute similarity.
3. Cascaded Sequential Modeling 🔑
3.1 Probabilistic Factorization
Definition (COBRA Factorization). Let \(S_{1:t} = ((x_1, \mathbf{v}_1), \ldots, (x_t, \mathbf{v}_t))\) be the user’s interaction history where \(x_i\) has sparse ID \(\text{ID}_i\) and dense vector \(\mathbf{v}_i\). COBRA factorizes the joint predictive distribution as:
\[P(\text{ID}_{t+1}, \mathbf{v}_{t+1} \mid S_{1:t}) = \underbrace{P(\text{ID}_{t+1} \mid S_{1:t})}_{\text{sparse prediction}} \cdot \underbrace{P(\mathbf{v}_{t+1} \mid \text{ID}_{t+1}, S_{1:t})}_{\text{dense refinement}}\]
The factorization is the product rule applied with \(\text{ID}_{t+1}\) as the first variable. This is not an independence assumption — the dense prediction is explicitly conditioned on the sparse ID.
Intuition. The sparse ID answers “which cluster does the next item belong to?” The dense vector, conditioned on the sparse ID, answers “where within that cluster is the next item?” The cascade decomposes a single hard prediction into two simpler sequential steps.
3.2 Input Construction and Forward Pass
Each sequence position \(t\) is represented by concatenating the sparse ID embedding with the dense vector:
\[\mathbf{h}_t = [\mathbf{e}(\text{ID}_t);\ \mathbf{v}_t] \in \mathbb{R}^{2d}\]
where \(\mathbf{e}(\text{ID}_t) = \sum_{k=1}^K \mathbf{e}_{c_k^{(t)}}\) is the sum of codeword embeddings. The full sequence history is thus:
\[S_{1:t} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_t] \in \mathbb{R}^{t \times 2d}\]
Sparse ID forward pass. A causal Transformer decoder processes \(S_{1:t}\) and a SparseHead projects the output at position \(t\) to logits over the codebook:
\[\mathbf{z}_{t+1} = \text{SparseHead}(\text{TransformerDecoder}(S_{1:t}))\]
\(\text{TransformerDecoder}(S_{1:t})\) is a stack of causal self-attention layers (GPT-style, decoder-only) that maps the input sequence \(S_{1:t} \in \mathbb{R}^{t \times 2d}\) to a sequence of hidden states \(\mathbf{O}_{1:t} \in \mathbb{R}^{t \times d_{\text{model}}}\). Position \(i\) attends only to positions \(\leq i\) (causal mask). The last hidden state \(\mathbf{o}_t \in \mathbb{R}^{d_{\text{model}}}\) summarizes the full causal context of the sequence — it is the only output that SparseHead reads. The intermediate hidden states \(\mathbf{o}_1, \ldots, \mathbf{o}_{t-1}\) are by-products of the computation and are not used directly by either head.
SparseHead is a linear projection \(W_s \in \mathbb{R}^{V \times d_{\text{model}}}\) applied to the last decoder hidden state \(\mathbf{o}_t\): \[\mathbf{z}_{t+1} = W_s\, \mathbf{o}_t \in \mathbb{R}^V\] The output \(\mathbf{z}_{t+1}\) is a vector of \(V\) unnormalized logits — one per codebook entry at the current quantization level. Since the RQ-VAE semantic ID is a \(K\)-tuple \((c_1, \ldots, c_K)\), the sparse ID is generated autoregressively over levels: SparseHead is applied at each decode step \(k \in [K]\), with the code token \(c_k\) selected (by argmax during training, or by beam search at inference) and appended to the sequence before the next step. At the \(k\)-th decode step the decoder attends to both the history \(S_{1:t}\) and the previously generated code tokens \(c_1, \ldots, c_{k-1}\).
\(\widehat{\text{ID}}_{t+1} = (\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_K)\) is the predicted sparse ID for the next item — the model’s best guess at which code cluster the next item belongs to. It is obtained by running \(K\) autoregressive decode steps through SparseHead: - During training: greedy argmax at each level (or teacher forcing in some implementations), giving a single predicted \(K\)-tuple. - During inference: beam search over the code tree, giving \(M\) candidate \(K\)-tuples each with an associated log-probability \(\phi^k\) (see §5.1).
Crucially, \(\widehat{\text{ID}}_{t+1}\) is a discrete index — not a continuous vector. The embedding \(\mathbf{e}(\widehat{\text{ID}}_{t+1}) = \sum_{k=1}^K \mathbf{e}_{\hat{c}_k}\) converts it back into a continuous vector so it can be appended to the sequence for the dense pass.
Dense vector forward pass. The predicted sparse ID embedding \(\mathbf{e}(\widehat{\text{ID}}_{t+1})\) is appended to the history, and a DenseHead projects the decoder output at the new final position:
\[\bar{S}_{1:t} = [S_{1:t};\ \mathbf{e}(\widehat{\text{ID}}_{t+1})], \qquad \hat{\mathbf{v}}_{t+1} = \text{DenseHead}(\text{TransformerDecoder}(\bar{S}_{1:t}))\]
DenseHead is a separate linear projection \(W_d \in \mathbb{R}^{d \times d_{\text{model}}}\) applied to the decoder’s last hidden state after the extended sequence \(\bar{S}_{1:t}\): \[\hat{\mathbf{v}}_{t+1} = W_d\, \bar{\mathbf{o}}_{t+1} \in \mathbb{R}^d\] where \(\bar{\mathbf{o}}_{t+1}\) is the final hidden state of \(\text{TransformerDecoder}(\bar{S}_{1:t})\) — i.e., the hidden state at the newly appended position \(\mathbf{e}(\widehat{\text{ID}}_{t+1})\). This position attends causally to all of \(S_{1:t}\) plus the sparse ID embedding, so its hidden state encodes both the full interaction history and the coarse cluster constraint imposed by \(\widehat{\text{ID}}_{t+1}\). The output \(\hat{\mathbf{v}}_{t+1} \in \mathbb{R}^d\) lives in the same space as the pre-computed item dense vectors, enabling ANN retrieval.
Why is a second decoder pass needed? Because \(\bar{\mathbf{o}}_{t+1}\) (the hidden state at the appended position) does not exist in the first forward pass — appending \(\mathbf{e}(\widehat{\text{ID}}_{t+1})\) to the sequence creates a new final position whose hidden state is not computed until the decoder processes \(\bar{S}_{1:t}\). The KV cache of the first pass can be reused for the prefix \(S_{1:t}\); only one additional attention step is needed for the new position.
Note that the dense head sees the predicted sparse ID — not the ground-truth — at training time. The model is therefore trained to produce dense vectors conditioned on the predicted cluster, which matches inference behavior.
4. Training Objectives
4.1 Sparse Cross-Entropy Loss
The sparse ID loss is standard cross-entropy over codebook tokens at each level \(k \in [K]\). For a sequence of length \(T\):
\[\mathcal{L}_{\text{sparse}} = -\sum_{t=1}^{T-1} \log \frac{\exp\!\left(z_{t+1}^{\text{ID}_{t+1}}\right)}{\sum_j \exp\!\left(z_{t+1}^j\right)}\]
where \(z_{t+1}^j\) is the \(j\)-th logit of \(\mathbf{z}_{t+1}\).
4.2 Dense Contrastive Loss
The dense loss aligns the predicted vector \(\hat{\mathbf{v}}_{t+1}\) with the ground-truth item dense vector \(\mathbf{v}_{t+1}\) via an InfoNCE objective with in-batch negatives:
\[\mathcal{L}_{\text{dense}} = -\sum_{t=1}^{T-1} \log \frac{\exp\!\left(\cos(\hat{\mathbf{v}}_{t+1}, \mathbf{v}_{t+1}) / \tau\right)}{\sum_{j \in \mathcal{B}} \exp\!\left(\cos(\hat{\mathbf{v}}_{t+1}, \mathbf{v}_j) / \tau\right)}\]
where \(\mathcal{B}\) is the in-batch item set and \(\tau\) is a temperature hyperparameter. The cosine similarity metric is invariant to embedding scale, which stabilizes training when the encoder and decoder operate at different magnitudes.
Combined objective:
\[\mathcal{L} = \mathcal{L}_{\text{sparse}} + \mathcal{L}_{\text{dense}}\]
The two terms are summed without weighting hyperparameters — the ablation shows both are necessary and they operate on disjoint output heads.
5. Coarse-to-Fine Inference: BeamFusion 💡
5.1 Sparse ID Beam Search
At inference, beam search with width \(M\) generates the top-\(M\) sparse ID candidates from the model’s distribution:
\[\{\widehat{\text{ID}}_{T+1}^k\}_{k=1}^M = \text{BeamSearch}\!\left(\text{TransformerDecoder}(S_{1:T}),\ M\right)\]
Each beam \(k\) carries an associated log-probability \(\phi^k = \log P(\widehat{\text{ID}}_{T+1}^k \mid S_{1:T})\).
5.2 Conditional Dense Generation
For each beam \(k\), the sparse ID embedding is appended to the history and the DenseHead generates a conditional dense vector:
\[\hat{\mathbf{v}}_{T+1}^k = \text{DenseHead}\!\left(\text{TransformerDecoder}\!\left([S_{1:T};\ \mathbf{e}(\widehat{\text{ID}}_{T+1}^k)]\right)\right)\]
An ANN index over pre-computed item dense vectors is then queried for each beam:
\[\mathcal{A}_k = \text{ANN}\!\left(\hat{\mathbf{v}}_{T+1}^k,\ \mathcal{C}(\widehat{\text{ID}}_{T+1}^k),\ N\right)\]
where \(\mathcal{C}(\widehat{\text{ID}}_{T+1}^k)\) restricts the search to items in the beam’s code cluster, and \(N\) is the number of retrieved neighbors per beam.
5.3 BeamFusion Ranking Score
Each candidate item \(a \in \bigcup_{k=1}^M \mathcal{A}_k\) is scored by combining the sparse beam probability and the dense cosine similarity:
\[\Phi(\hat{\mathbf{v}}_{T+1}^k, \widehat{\text{ID}}_{T+1}^k, a) = \underbrace{\text{Softmax}(\tau \cdot \phi^k)}_{\text{sparse beam weight}} \cdot \underbrace{\text{Softmax}(\psi \cdot \cos(\hat{\mathbf{v}}_{T+1}^k, \mathbf{a}))}_{\text{dense similarity weight}}\]
The hyperparameters \(\tau\) and \(\psi\) control the precision–diversity trade-off: higher \(\tau\) concentrates weight on high-probability beams (lower diversity); higher \(\psi\) concentrates weight on the dense nearest neighbors (lower recall from lower-probability beams). Empirically, the optimal balance is \(\tau = 0.9\), \(\psi = 16\).
Final retrieval:
\[\mathcal{R} = \text{Top-}K\!\left(\bigcup_{k=1}^M \mathcal{A}_k,\ \Phi\right)\]
This question connects BeamFusion to probabilistic model combination.
Prerequisites: §5.3
The BeamFusion score \(\Phi\) can be interpreted as a product of two softmax distributions over items. Show that \(\log \Phi \propto \tau \phi^k + \psi \cos(\hat{\mathbf{v}}_{T+1}^k, \mathbf{a}) + \text{const}\), and identify the conditions under which this is equivalent to the log-probability of a product of experts model combining the sparse and dense predictions.
Key insight: \(\text{Softmax}(\tau \cdot \phi^k)_i = \exp(\tau \phi^k_i) / Z_\tau\) and \(\text{Softmax}(\psi \cdot \cos(\cdot))_j = \exp(\psi \cos_j) / Z_\psi\). The product is \(\exp(\tau \phi^k_i + \psi \cos_j) / (Z_\tau Z_\psi)\), which is proportional to \(\exp(\tau \phi + \psi \cos)\) — exactly a log-linear combination (product of experts) where the two experts are the sparse beam probability and the dense cosine similarity.
Sketch: A product of experts model defines \(P_{\text{PoE}}(a) \propto \prod_m P_m(a)^{\alpha_m}\), or equivalently \(\log P_{\text{PoE}}(a) = \sum_m \alpha_m \log P_m(a) + \text{const}\). BeamFusion matches this with \(\alpha_{\text{sparse}} = \tau\) and \(\alpha_{\text{dense}} = \psi\). It is a PoE under the assumption that the softmax-normalized scores are valid probabilities — they are not, strictly, since \(\tau\) and \(\psi\) rescale the log-probabilities before normalization. But the form has the same flavor: it rewards items that both beams agree on, with calibration controlled by the temperature parameters.
6. Experiments 📊
6.1 Public Benchmarks
COBRA is evaluated on three Amazon product review datasets (Beauty, Sports and Outdoors, Toys and Games) against TIGER and dense sequential baselines (SASRec, BERT4Rec, RecFormer, LIGER).
| Dataset | Metric | TIGER | LIGER | COBRA | Δ vs. TIGER |
|---|---|---|---|---|---|
| Beauty | Recall@5 | 0.0454 | 0.0471 | 0.0537 | +18.3% |
| Beauty | NDCG@5 | 0.0321 | 0.0341 | 0.0395 | +23.1% |
| Sports | Recall@5 | 0.0264 | 0.0281 | 0.0305 | +15.5% |
| Sports | NDCG@10 | 0.0225 | 0.0236 | 0.0257 | +14.2% |
| Toys | Recall@10 | 0.0712 | 0.0733 | 0.0781 | +9.7% |
| Toys | NDCG@10 | 0.0432 | 0.0449 | 0.0515 | +19.2% |
COBRA improves over TIGER by 9–23% across all metrics, with larger gains on NDCG (which penalizes lower-ranked correct items more heavily) than Recall — consistent with BeamFusion’s improvement in ranking quality, not just top-\(K\) coverage.
6.2 Industrial Ablation
Evaluated on Baidu’s advertising dataset (200M DAU). The ablation isolates the contribution of each component:
| Variant | R@50 | R@100 | R@500 | R@800 |
|---|---|---|---|---|
| Full COBRA | 0.1180 | 0.1737 | 0.3716 | 0.4466 |
| w/o sparse ID branch | 0.0611 | 0.0964 | 0.2466 | 0.3111 |
| w/o dense branch | 0.0690 | 0.1032 | 0.2709 | 0.3273 |
| w/o BeamFusion | 0.0856 | 0.1254 | 0.2455 | 0.2855 |
Both branches are necessary. Removing the sparse branch hurts more at low recall cutoffs (R@50: −48.2%) because beam search over semantic IDs is the primary structuring mechanism. Removing the dense branch hurts more at high recall cutoffs (R@800: −26.7%) because the dense vector is what distinguishes items within a cluster once the cluster is identified.
Online A/B test (January 2025, 10% traffic): +3.60% conversion rate, +4.15% ARPU.
6.3 Recall–Diversity Trade-off
The BeamFusion parameters \(\tau\) and \(\psi\) trade recall (broad candidate coverage) against diversity (distributing candidates across semantic clusters). Optimal performance at \(\tau = 0.9\), \(\psi = 16\); increasing \(\tau\) beyond 1.0 collapses the retrieval toward the single highest-probability beam, reducing diversity with diminishing recall gains.
7. References
| Reference Name | Brief Summary | Link |
|---|---|---|
| Sparse Meets Dense: COBRA (Yang et al., 2025) | This paper | https://arxiv.org/abs/2503.02453 |
| TIGER: Recommender Systems with Generative Retrieval (Rajput et al., NeurIPS 2023) | RQ-VAE semantic IDs + seq2seq beam search; baseline COBRA improves over | https://arxiv.org/abs/2305.05065 |
| HSTU: Actions Speak Louder than Words (Zhai et al., 2024) | Foundational generative ranking architecture; motivates the industrial context | https://arxiv.org/abs/2402.17152 |
| LIGER (Yang et al., 2024) | Simultaneous sparse + fixed dense generation; COBRA’s direct predecessor and key baseline | https://arxiv.org/abs/2406.06485 |
| Generating Diverse High-Fidelity Images with VQ-VAE-2 (Razavi et al., 2019) | Foundational RQ-VAE architecture | https://arxiv.org/abs/1906.00446 |
| Representation Learning with Contrastive Predictive Coding (Oord et al., 2018) | InfoNCE objective used in the dense contrastive loss | https://arxiv.org/abs/1807.03748 |