Q-Learning and Value-Based Reinforcement Learning
Table of Contents
- 1. The MDP Framework
- 2. Value Functions
- 3. Optimal Value Functions
- 4. Value Iteration
- 5. Q-Learning: Stochastic Approximation of Value Iteration
- 6. The Deadly Triad and Deep Q-Learning
- 7. Scope and Limitations
- References
1. The MDP Framework 🎯
Q-learning operates inside the Markov Decision Process (MDP) formalism. An MDP is a tuple \((\mathcal{S}, \mathcal{A}, P, r, \gamma)\):
- \(\mathcal{S}\): state space
- \(\mathcal{A}\): action space
- \(P(s' \mid s, a)\): transition kernel — probability of reaching \(s'\) from \((s,a)\)
- \(r(s, a)\): bounded reward function
- \(\gamma \in [0, 1)\): discount factor
A policy \(\pi: \mathcal{S} \to \Delta(\mathcal{A})\) maps states to distributions over actions. The agent’s goal is to find the policy \(\pi^*\) that maximizes expected discounted return from every state simultaneously.
The Markov property — \(s_{t+1} \perp s_{0:t-1}, a_{0:t-1} \mid s_t, a_t\) — is the assumption that the current state contains all information relevant to the future. This is what makes value functions well-defined.
2. Value Functions 📐
2.1 State-Value and Action-Value Functions
For a fixed policy \(\pi\), define:
State-value function: \[V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \;\bigg|\; s_0 = s\right]\]
Action-value function (Q-function): \[Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \;\bigg|\; s_0 = s,\, a_0 = a\right]\]
The Q-function conditions on a specific first action \(a\), then follows \(\pi\) thereafter. The two functions are related by:
\[V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[Q^\pi(s, a)\right] = \sum_a \pi(a \,|\, s)\,Q^\pi(s, a)\]
The Q-function is the more useful object for control: given \(Q^\pi\), improving the policy is trivial — just act greedily. Given only \(V^\pi\), you need the model \(P\) to compute \(\mathbb{E}[V^\pi(s')]\) and evaluate actions.
2.2 Bellman Consistency Equations
\(Q^\pi\) satisfies a self-consistency (Bellman) equation. Expanding the return by peeling off the first step:
\[Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[r(s,a) + \gamma\,V^\pi(s')\right] = \mathbb{E}_{s'}\!\left[r(s,a) + \gamma\sum_{a'}\pi(a'|s')\,Q^\pi(s',a')\right]\]
Define the Bellman operator for policy \(\pi\):
\[(\mathcal{T}^\pi Q)(s,a) = \mathbb{E}_{s'}\!\left[r(s,a) + \gamma\sum_{a'}\pi(a'|s')\,Q(s',a')\right]\]
Then \(Q^\pi\) is the unique fixed point \(\mathcal{T}^\pi Q^\pi = Q^\pi\).
It transforms the problem of computing an infinite sum into solving a fixed-point equation. The fixed-point can be found by iterating \(\mathcal{T}^\pi\) (policy evaluation) or solved in closed form for tabular MDPs: \(Q^\pi = (I - \gamma P^\pi)^{-1} r\) where \(P^\pi_{(s,a),(s',a')} = P(s'|s,a)\pi(a'|s')\).
This problem derives the Bellman consistency equation directly from the definition of \(Q^\pi\).
Prerequisites: §2.1 State-Value and Action-Value Functions
Starting from \(Q^\pi(s,a) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s, a_0=a]\), split the sum at \(t=0\) and \(t \geq 1\). Use the Markov property to show the \(t \geq 1\) terms collapse to \(\gamma Q^\pi(s', a')\) for some \((s', a')\), and average over the transition and policy to obtain the Bellman equation.
Key insight: The Markov property means the return from time 1 onward, conditioned on \((s_1, a_1)\), is independent of \((s_0, a_0)\) and has the same distribution as \(Q^\pi(s_1, a_1)\).
Sketch: \(Q^\pi(s,a) = \mathbb{E}[r(s_0,a_0)] + \mathbb{E}[\sum_{t=1}^\infty \gamma^t r_t \mid s_0,a_0]\). Factor out \(\gamma\): \(= r(s,a) + \gamma\,\mathbb{E}_{s_1 \sim P}[\mathbb{E}_{a_1 \sim \pi}[\mathbb{E}[\sum_{t=0}^\infty \gamma^t r_{t+1} \mid s_1,a_1]]]\). By the Markov property the inner expectation is \(Q^\pi(s_1,a_1)\), giving \(r(s,a) + \gamma\,\mathbb{E}_{s',a'}[Q^\pi(s',a')]\).
3. Optimal Value Functions 🔑
3.1 The Bellman Optimality Equations
Define the optimal value functions:
\[V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)\]
These exist and are unique for discounted MDPs (by compactness of the policy space and linearity of the value functional in \(\pi\) for fixed \(s,a\)). They satisfy the Bellman optimality equations:
\[Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[r(s,a) + \gamma\,\max_{a'}\,Q^*(s',a')\right]\]
\[V^*(s) = \max_a\,\mathbb{E}_{s'}\!\left[r(s,a) + \gamma\,V^*(s')\right]\]
Key structural fact: Any policy \(\pi^*\) satisfying \(\pi^*(s) \in \arg\max_a Q^*(s,a)\) is optimal. Once \(Q^*\) is known, deriving the optimal policy requires no further computation — just a greedy table lookup.
The standard Bellman optimality equation has a hard \(\max\) over actions. The KL-regularized RL objective studied in rejection sampling as RL replaces \(\max_a\) with a soft max (Boltzmann/Gibbs distribution), producing a soft Bellman equation: \[Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\!\left[\beta\log\sum_{a'}\exp\!\left(\frac{Q^*(s',a')}{\beta}\right)\right]\] The hard case (\(\beta \to 0\)) recovers standard Q-learning.
3.2 The Bellman Operator Is a Contraction
Define the Bellman optimality operator:
\[(\mathcal{T}Q)(s,a) = \mathbb{E}_{s'}\!\left[r(s,a) + \gamma\,\max_{a'}\,Q(s',a')\right]\]
Theorem. \(\mathcal{T}\) is a contraction on \((\ell^\infty(\mathcal{S} \times \mathcal{A}), \|\cdot\|_\infty)\) with factor \(\gamma\):
\[\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \leq \gamma\,\|Q - Q'\|_\infty\]
Proof. For any \((s,a)\):
\[|(\mathcal{T}Q)(s,a) - (\mathcal{T}Q')(s,a)| = \gamma\,\left|\mathbb{E}_{s'}\!\left[\max_{a'}Q(s',a') - \max_{a'}Q'(s',a')\right]\right|\]
Use the fact that \(|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|\) and linearity of expectation:
\[\leq \gamma\,\mathbb{E}_{s'}\!\left[\max_{a'}\left|Q(s',a') - Q'(s',a')\right|\right] \leq \gamma\,\|Q - Q'\|_\infty\]
Taking the sup over \((s,a)\) gives \(\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \leq \gamma\|Q-Q'\|_\infty\). \(\square\)
By Banach’s fixed-point theorem: \(Q^*\) is the unique fixed point of \(\mathcal{T}\), and for any initialization \(Q_0\):
\[\|Q_k - Q^*\|_\infty \leq \gamma^k\,\|Q_0 - Q^*\|_\infty\]
convergence is geometric in \(k\) with rate \(\gamma\).
This problem derives the convergence rate directly from the contraction.
Prerequisites: §3.2 The Bellman Operator Is a Contraction
Let \(Q^* = \mathcal{T}Q^*\) and \(Q_k = \mathcal{T}Q_{k-1}\). Using only the contraction property \(\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \leq \gamma\|Q-Q'\|_\infty\), prove: (a) \(Q^*\) is the unique fixed point; (b) \(\|Q_k - Q^*\|_\infty \leq \gamma^k \|Q_0 - Q^*\|_\infty\); (c) how many iterations of \(\mathcal{T}\) are needed to reduce the initial error by a factor of \(\epsilon\)?
Key insight: Contraction + completeness of \(\ell^\infty\) → Banach fixed-point theorem directly.
Sketch: (a) If \(Q^* = \mathcal{T}Q^*\) and \(Q^{**} = \mathcal{T}Q^{**}\), then \(\|Q^* - Q^{**}\|_\infty = \|\mathcal{T}Q^* - \mathcal{T}Q^{**}\|_\infty \leq \gamma\|Q^*-Q^{**}\|_\infty\). Since \(\gamma < 1\), this forces \(\|Q^*-Q^{**}\|_\infty = 0\). (b) By induction: \(\|Q_k - Q^*\|_\infty = \|\mathcal{T}Q_{k-1} - \mathcal{T}Q^*\|_\infty \leq \gamma\|Q_{k-1}-Q^*\|_\infty \leq \cdots \leq \gamma^k\|Q_0-Q^*\|_\infty\). (c) Need \(\gamma^k \leq \epsilon\), so \(k \geq \log(1/\epsilon)/\log(1/\gamma) = O\!\left(\frac{\log(1/\epsilon)}{1-\gamma}\right)\) — the \(1/(1-\gamma)\) factor reflects that near \(\gamma=1\) (long-horizon problems), many more iterations are needed.
4. Value Iteration 🔄
Value iteration is the direct application of the Bellman operator: start from any \(Q_0\) and iterate \(Q_{k+1} = \mathcal{T}Q_k\). Explicitly, for all \((s,a)\):
\[Q_{k+1}(s,a) = \sum_{s'} P(s'|s,a)\left[r(s,a) + \gamma\,\max_{a'} Q_k(s',a')\right]\]
By the contraction theorem, this converges to \(Q^*\) geometrically. The greedy policy \(\pi_{k+1}(s) = \arg\max_a Q_{k+1}(s,a)\) converges to \(\pi^*\).
Requirement: computing \(\mathcal{T}Q_k\) requires summing over all \(s'\) weighted by \(P(s'|s,a)\) — the transition model. Value iteration is a model-based planning algorithm.
Complexity: each sweep costs \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) — feasible for small tabular MDPs, intractable for large or continuous state spaces.
In a 5×5 grid with 4 actions (N/S/E/W), \(|\mathcal{S}| = 25\) and \(|\mathcal{A}| = 4\). Value iteration converges in \(O(\log(1/\epsilon)/(1-\gamma))\) sweeps. Each sweep visits all 100 state-action pairs and looks up transition probabilities. This is feasible by hand; for Atari with \(|\mathcal{S}| \approx 10^{70}\) pixel states, it is obviously not.
5. Q-Learning: Stochastic Approximation of Value Iteration 💡
Q-learning removes the model requirement by replacing the exact expectation in \(\mathcal{T}\) with a single sampled transition.
5.1 The Update Rule
Given a transition \((s, a, r, s')\) observed from the environment:
\[Q(s,a) \leftarrow Q(s,a) + \alpha_t\,\underbrace{\bigl[r + \gamma\max_{a'} Q(s',a') - Q(s,a)\bigr]}_{\delta_t\;:\;\text{TD error}}\]
The TD error \(\delta_t\) is a noisy estimate of \((\mathcal{T}Q)(s,a) - Q(s,a)\): if \(\delta_t > 0\), \(Q(s,a)\) was too pessimistic about \((s,a)\) given this transition; if \(\delta_t < 0\), too optimistic. The update moves \(Q(s,a)\) in the direction of the one-sample Bellman target \(r + \gamma\max_{a'} Q(s',a')\).
Off-policy. The update uses \(\max_{a'} Q(s',a')\) — the greedy action — regardless of what the behavior policy (the policy used to collect the transition) actually did. This makes Q-learning off-policy: data can come from any distribution over \((s,a)\) pairs (e.g., a replay buffer, a random policy, a human demonstrator) as long as it covers all \((s,a)\).
Contrast with SARSA. SARSA replaces \(\max_{a'} Q(s',a')\) with \(Q(s', a'_\text{behavior})\), where \(a'_\text{behavior} \sim \pi_\text{behavior}(\cdot|s')\) is the action actually taken next. This makes SARSA on-policy — it evaluates the behavior policy rather than the greedy policy.
| Q-Learning | SARSA | |
|---|---|---|
| Target | \(r + \gamma\max_{a'} Q(s',a')\) | \(r + \gamma Q(s', a')\), \(a' \sim \pi_b\) |
| On/off-policy | Off-policy | On-policy |
| Converges to | \(Q^*\) (optimal) | \(Q^{\pi_b}\) (behavior policy) |
| Cliff-walking behavior | Takes risky optimal path | Takes safe suboptimal path |
5.2 Convergence
Theorem (Watkins & Dayan, 1992). Tabular Q-learning with learning rates \(\{\alpha_t\}\) converges to \(Q^*\) almost surely, provided:
- Exploration: every \((s,a)\) pair is visited infinitely often
- Robbins–Monro: \(\sum_t \alpha_t(s,a) = \infty\) and \(\sum_t \alpha_t(s,a)^2 < \infty\) for all \((s,a)\)
- Bounded rewards: \(|r| \leq r_\text{max} < \infty\)
The Robbins–Monro condition (e.g., \(\alpha_t = 1/t\)) ensures the updates are large enough to eventually correct any initialization, but small enough that noise averages out. The proof uses stochastic approximation theory (Borkar, Tsitsiklis): the Q-learning update is an asynchronous stochastic version of value iteration, and the noise (from single-sample transitions) has conditional mean zero given \(Q_t\).
Every \((s,a)\) must be visited infinitely often — this requires the behavior policy to be GLIE (Greedy in the Limit with Infinite Exploration): \(\epsilon\)-greedy with \(\epsilon_t \to 0\) as \(t \to \infty\), but \(\sum_t \epsilon_t = \infty\). Without this, Q-learning converges to \(Q^{\pi_b}\) (the value of the behavior policy), not \(Q^*\).
5.3 SARSA: On-Policy TD
For completeness, SARSA (State-Action-Reward-State-Action) uses the actual next action in the target:
\[Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma Q(s', a') - Q(s,a)\bigr], \quad a' \sim \pi(\cdot|s')\]
SARSA converges to \(Q^{\pi}\) for the current behavior policy \(\pi\). If \(\pi\) is \(\epsilon\)-greedy and \(\epsilon \to 0\), SARSA converges to \(Q^*\) — but more slowly than Q-learning because it evaluates the exploratory policy, not the greedy one.
This problem makes precise the sense in which Q-learning approximates value iteration.
Prerequisites: §4 Value Iteration, §5.1 The Update Rule
Write the Q-learning update as \(Q_{t+1}(s,a) = (1-\alpha_t) Q_t(s,a) + \alpha_t [r + \gamma\max_{a'} Q_t(s',a')]\). (a) Show this is a convex combination of the old estimate and the one-sample Bellman target. (b) Take \(\mathbb{E}[\cdot | Q_t, s, a]\) of the update and show that the expected update direction is \(\mathcal{T}Q_t - Q_t\) (the value iteration step). (c) What does this imply about the noise \(\delta_t - \mathbb{E}[\delta_t | Q_t]\)?
Key insight: Q-learning is exactly value iteration plus zero-mean noise — stochastic approximation in the sense of Robbins–Monro.
Sketch: (a) \(Q_{t+1}(s,a) = (1-\alpha_t)Q_t(s,a) + \alpha_t \cdot \text{target}\): this is an interpolation with weights \((1-\alpha_t, \alpha_t)\) summing to 1, so it’s a convex combination. (b) \(\mathbb{E}[\text{target}|Q_t,s,a] = \mathbb{E}_{s'}[r + \gamma\max_{a'}Q_t(s',a')] = (\mathcal{T}Q_t)(s,a)\). So \(\mathbb{E}[Q_{t+1} - Q_t | Q_t] = \alpha_t [(\mathcal{T}Q_t)(s,a) - Q_t(s,a)]\), which is \(\alpha_t\) times the value iteration update. (c) The noise \(\xi_t = [r + \gamma\max_{a'}Q_t(s',a')] - (\mathcal{T}Q_t)(s,a)\) has \(\mathbb{E}[\xi_t | Q_t] = 0\) — it’s mean-zero, the central condition for stochastic approximation convergence. Robbins–Monro then ensures the noise averages out over time.
6. The Deadly Triad and Deep Q-Learning ⚠️
6.1 The Deadly Triad
Tabular Q-learning converges because \(Q(s,a)\) for each \((s,a)\) is an independent parameter — updating one doesn’t change others. With function approximation \(Q_\phi(s,a)\) (e.g., a neural network), updating the parameters \(\phi\) changes \(Q_\phi\) at all \((s,a)\) simultaneously. Three ingredients combine to cause potential divergence:
1. Function approximation. \(Q_\phi\) is a smooth function of \(\phi\); a gradient step on one \((s,a)\) moves \(Q_\phi\) at nearby states too. The update is no longer a local table write.
2. Bootstrapping. The TD target \(r + \gamma\max_{a'} Q_\phi(s',a')\) depends on \(Q_\phi\) itself. You are fitting \(Q_\phi\) toward a target that moves as \(\phi\) moves — chasing a moving goalposts. The semi-gradient update:
\[\phi \leftarrow \phi + \alpha\,\delta_t\,\nabla_\phi Q_\phi(s,a)\]
does not gradient-descend any fixed loss — \(\delta_t\) depends on \(\phi\), but the gradient is taken only through \(Q_\phi(s,a)\), not through the target. This is not a true gradient step.
3. Off-policy training. Replay buffer samples come from policies different from \(\pi_\text{greedy}\), creating a mismatch between the distribution the function approximator is trained on and the distribution it’s evaluated on.
Any two of these three are benign; all three together can cause divergence (Baird’s counterexample demonstrates this for linear function approximation).
6.2 DQN
DQN (Mnih et al., 2015) stabilizes deep Q-learning with two engineering interventions:
Experience replay. Store transitions \((s,a,r,s')\) in a replay buffer \(\mathcal{B}\). Sample mini-batches uniformly at random for gradient updates. This breaks temporal correlations (which destabilize SGD) and allows data reuse.
Target network. Maintain a target network \(Q_{\phi^-}\) with parameters \(\phi^-\) frozen for \(k\) steps. Use \(Q_{\phi^-}\) for the Bellman target:
\[\mathcal{L}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}}\!\left[\bigl(r + \gamma\max_{a'} Q_{\phi^-}(s',a') - Q_\phi(s,a)\bigr)^2\right]\]
This is a proper least-squares regression problem (for the \(k\) steps the target is frozen). Every \(k\) steps, \(\phi^- \leftarrow \phi\).
The target network slows the bootstrap feedback loop — the goalposts move, but slowly. Neither intervention is theoretically principled; both are empirical stabilizers. DQN still provably can diverge on some MDPs.
6.3 Double DQN: Fixing Overestimation
Q-learning has a systematic overestimation bias. The issue: the target uses \(\max_{a'} Q_\phi(s',a')\), which is the maximum over noisy estimates. For independent noise \(\eta_{a'}\):
\[\mathbb{E}\!\left[\max_{a'} (Q^*(s',a') + \eta_{a'})\right] \geq \max_{a'} Q^*(s',a')\]
(the expectation of the max exceeds the max of the expectations). The bias grows with \(|\mathcal{A}|\) and the noise magnitude.
Double DQN (van Hasselt et al., 2016) decouples action selection from action evaluation:
\[\text{target} = r + \gamma\,Q_{\phi^-}\!\!\left(s',\,\underbrace{\arg\max_{a'} Q_\phi(s',a')}_{\text{online net selects action}}\right)\]
The online network \(Q_\phi\) selects the greedy action; the target network \(Q_{\phi^-}\) evaluates it. Since these are different networks with different noise, the maximum is no longer over correlated estimates — the bias shrinks substantially.
This problem derives the overestimation bound quantitatively.
Prerequisites: §6.3 Double DQN: Fixing Overestimation
Suppose \(Q_\phi(s',a') = Q^*(s',a') + \eta_{a'}\) where \(\eta_{a'} \sim \mathcal{N}(0, \sigma^2)\) i.i.d. for each \(a'\). Let \(|\mathcal{A}| = n\) and denote \(m^* = \max_{a'} Q^*(s',a')\). (a) Show \(\mathbb{E}[\max_{a'} Q_\phi(s',a')] \geq m^*\). (b) For the special case \(Q^*(s',a') = 0\) for all \(a'\), compute \(\mathbb{E}[\max_{a'} \eta_{a'}]\) in terms of \(\sigma\) and \(n\). What does this say about large action spaces?
Key insight: The expected maximum of \(n\) i.i.d. Gaussians grows as \(\Theta(\sigma\sqrt{\log n})\).
Sketch: (a) \(\mathbb{E}[\max_{a'} Q_\phi(s',a')] = \mathbb{E}[\max_{a'}(Q^*(s',a') + \eta_{a'})] \geq \max_{a'} \mathbb{E}[Q^*(s',a') + \eta_{a'}] = \max_{a'} Q^*(s',a') = m^*\) by Jensen (since max is convex). (b) With \(Q^*=0\): \(\mathbb{E}[\max_{a'}\eta_{a'}] = \sigma\,\mathbb{E}[\max_{a'} Z_{a'}]\) where \(Z_{a'} \sim \mathcal{N}(0,1)\). By the Gaussian order-statistic result, \(\mathbb{E}[\max_n Z_i] \approx \sqrt{2\log n}\). So the bias is \(\approx \sigma\sqrt{2\log n}\) — it grows logarithmically in \(|\mathcal{A}|\). For large action spaces (e.g., \(n=1000\)), the overestimation is \(\approx 3.7\sigma\) — substantial enough to cause meaningful policy degradation.
7. Scope and Limitations 🗺️
Q-learning and value-based RL have a specific niche; they are not the universal RL framework.
Works well when: - Action space is discrete and small enough to enumerate (\(\arg\max\) must be computable) - Good exploration is achievable - A replay buffer can be maintained (off-policy is a feature, not a bug)
Fails or struggles when: - Continuous actions: \(\arg\max_a Q_\phi(s,a)\) over continuous \(\mathcal{A}\) requires a nested optimization loop. DDPG and TD3 sidestep this by learning a separate actor network \(\mu_\psi(s) \approx \arg\max_a Q_\phi(s,a)\) — making them actor-critic methods, not pure Q-learning. - High-dimensional state + function approximation + off-policy: the deadly triad. Convergence is not guaranteed. - Reward shaping sensitivity: the greedy policy is sensitive to small errors in \(Q_\phi\). Policy gradient methods, which optimize a smoother objective, can be more robust. - Long horizon / sparse reward: the \(\gamma^T\) discounting means distant rewards contribute negligibly. Reward shaping or hierarchical methods are needed.
The broader RL landscape (for context):
| Paradigm | What it learns | Key strength | Key weakness |
|---|---|---|---|
| Value-based (Q-learning) | \(Q^*(s,a)\) | Off-policy, data efficient | Discrete actions only |
| Policy gradient (REINFORCE, PPO) | \(\pi_\theta\) directly | Continuous actions, stable | On-policy, sample inefficient |
| Actor-critic (SAC, TD3) | Both \(Q_\phi\) and \(\pi_\theta\) | Continuous actions + efficiency | Complex, many hyperparameters |
| Model-based (Dyna, MuZero) | \(P(s'|s,a)\) and/or \(Q^*\) | Sample efficient | Model errors compound |
Value-based methods remain the standard for discrete-action problems (Atari, board games, combinatorial optimization). Policy gradient and actor-critic methods dominate continuous control (robotics, locomotion, LLM post-training).
References
| Reference Name | Brief Summary | Link |
|---|---|---|
| Watkins (1989), Learning from Delayed Rewards | Introduces Q-learning; proves tabular convergence under GLIE and Robbins–Monro conditions | PhD thesis, Cambridge |
| Watkins & Dayan (1992), Q-Learning | Published journal version of the Q-learning convergence proof | Machine Learning 8(3) |
| Mnih et al. (2015), Human-Level Control through Deep Reinforcement Learning | DQN: combines Q-learning with deep nets, experience replay, and target networks; achieves human-level Atari performance | Nature 518 |
| van Hasselt, Guez & Silver (2016), Deep Reinforcement Learning with Double Q-Learning | Identifies overestimation bias in DQN; Double DQN decouples action selection and evaluation to reduce it | AAAI 2016 |
| Sutton & Barto (2018), Reinforcement Learning: An Introduction | Standard textbook; chapters 6–8 cover TD learning, Q-learning, SARSA, and the deadly triad in depth | Online edition |
| Baird (1995), Residual Algorithms | Constructs explicit counterexample (Baird’s counterexample) showing Q-learning with linear function approximation can diverge | ICML 1995 |
| Tsitsiklis & Van Roy (1997), An Analysis of Temporal-Difference Learning | Proves convergence of TD(0) with linear function approximation under on-policy sampling; identifies off-policy instability | IEEE Transactions on Automatic Control |