The Landscape of Reinforcement Learning

Table of Contents


1. The Two True Axes 🗺️

A common framing divides RL into three paradigms — value-based, policy gradient, and actor-critic. This is misleading. Actor-critic is not a third paradigm; it is policy gradient with a value-function baseline. The critic exists purely for variance reduction. Stripping away that conflation, RL algorithm design is governed by two orthogonal axes.

1.1 Axis 1: Model-Based vs. Model-Free

Does the agent learn a model of the environment?

A model here means an explicit representation of the transition dynamics \(P(s' \mid s, a)\) and/or the reward function \(r(s,a)\). With a model, the agent can plan — simulate trajectories mentally before acting.

Model-Free Model-Based
Learns \(Q^*\) or \(\pi\) from data directly \(\hat{P}(s'\|s,a)\), then plans
Sample efficiency Lower Higher
Failure mode Needs many real transitions Model errors compound; Dyna-style agents can overfit to wrong model
Examples Q-learning, PPO, SAC Dyna, MuZero, AlphaZero, Dreamer

Model-based methods can be dramatically more sample-efficient — they extract value from each transition by replaying it through the learned model (Dyna) or by using the model to guide search (MuZero). The cost is that model errors compound: a slightly wrong \(\hat{P}\) generates slightly wrong imagined trajectories, and a policy trained on those can fail badly in the real environment.

AlphaGo / AlphaZero

These are model-based: the environment model is given (the rules of Go), not learned. Monte Carlo Tree Search (MCTS) uses this model to plan. The value network \(V_\theta(s)\) and policy network \(\pi_\theta(s)\) are trained to improve MCTS, but the model itself is exact. MuZero extends this by learning the model from scratch.

1.2 Axis 2: Value-Based vs. Policy-Based

Within model-free methods, how is the policy represented?

Value-based: The policy is implicit. Learn \(Q^*(s,a)\); the policy is \(\pi(s) = \arg\max_a Q^*(s,a)\). No explicit policy parameters — the policy exists only as a derived object. This requires the \(\arg\max\) to be tractable, which restricts to discrete action spaces.

Policy-based: The policy is explicit. Parameterize \(\pi_\theta(a \mid s)\) directly — a distribution over actions — and optimize \(\theta\) to maximize expected return. Works for discrete and continuous actions, but must discard rollouts after each gradient step (on-policy constraint), making it sample-inefficient.

The two approaches target different optimization problems:

\[\text{Value-based: } \min_Q \|Q - \mathcal{T}Q\|^2 \qquad \text{Policy-based: } \max_\theta\,J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \gamma^t r_t\right]\]

1.3 Where Actor-Critic Fits

Actor-critic is the label for policy-based methods that also learn a value function. The value function (the “critic”) does not change the optimization target — the agent is still maximizing \(J(\theta)\) — but it provides a baseline that reduces the variance of the policy gradient estimator.

graph TD
    RL["Reinforcement Learning"]
    MF["Model-Free"]
    MB["Model-Based"]
    VB["Value-Based
Q-learning, DQN"] PB["Policy-Based
REINFORCE, PPO, GRPO"] AC["+ Value Baseline
(Actor-Critic)
A3C, SAC, TD3"] RL --> MF RL --> MB MF --> VB MF --> PB PB --> AC

The “actor” is \(\pi_\theta\); the “critic” is \(V_\phi\) or \(Q_\phi\), used to compute advantages \(A(s,a) = Q(s,a) - V(s)\). Actor-critic is a technique, not a paradigm. PPO — the dominant algorithm in LLM post-training — is actor-critic: the language model is the actor, a small value head trained on the same rollouts is the critic.


Exercise 1: Classification

This problem builds fluency with the two-axis taxonomy.

Prerequisites: §1.1, §1.2

Classify each of the following algorithms along both axes (model-based/free and value/policy-based), and justify each choice: (a) SARSA, (b) AlphaZero, (c) REINFORCE with a learned \(V_\phi\) baseline, (d) RAFT for LLMs, (e) Dyna-Q.

Solution to Exercise 1

Key insight: The model-based/free axis is about whether \(P(s'|s,a)\) is learned or used; the value/policy axis is about how the policy is represented.

Sketch: (a) SARSA: model-free (learns \(Q^\pi\) from samples), value-based (policy derived greedily from \(Q\)). (b) AlphaZero: model-based (uses the exact game rules as a model for MCTS), policy-based (the policy network \(\pi_\theta\) is trained explicitly; MCTS improves it). (c) REINFORCE + \(V_\phi\): model-free, policy-based (actor-critic variant — \(V_\phi\) is a baseline, not the policy representation). (d) RAFT: model-free, policy-based — the policy \(\pi_\theta\) is the LLM, trained by SFT on filtered rollouts. No value function, no model. (e) Dyna-Q: model-based (learns \(\hat{P}\) and uses it to generate imagined transitions), value-based (updates \(Q\) on both real and imagined transitions via Q-learning).


2. Why Value Functions Are Fundamental 💡

Value functions appear even in nominally “model-free, policy-based” methods. This is not an accident — it follows from the mathematical structure of optimal sequential decision-making.

2.1 Bellman’s Principle of Optimality

Bellman (1957):

An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

This principle of optimality is the recursive structure of optimal control. It says: the optimal decision problem decomposes by state. The value function \(V^*(s)\) — the expected return of the optimal policy from state \(s\) — is the quantity that captures this decomposition:

\[V^*(s) = \max_a\,\mathbb{E}_{s'}\!\left[r(s,a) + \gamma\,V^*(s')\right]\]

This is not an algorithm; it is a characterization. Any optimal policy must satisfy it. The Bellman equation is therefore not a design choice — it is a mathematical constraint that any solution to the MDP must obey.

Why this implies value functions are unavoidable. If you want to find \(\pi^*\), you are implicitly looking for the solution to the Bellman equation — whether or not you compute \(V^*\) or \(Q^*\) explicitly. A policy gradient method that finds \(\pi^*\) has, in some sense, implicitly solved for the value function, even if it never represents it.

Historical context

Bellman introduced these equations in the context of dynamic programming (DP) for deterministic optimal control in the 1950s. The RL community independently rediscovered the same structure. TD learning (Sutton, 1988) and Q-learning (Watkins, 1989) are stochastic approximation algorithms for solving the Bellman equation from data — DP without a model.

2.2 Value Functions Inside Policy Gradient

The policy gradient theorem (Sutton et al., 1999) states that for a parameterized policy \(\pi_\theta\) with expected return \(J(\theta) = \mathbb{E}_{\pi_\theta}[\sum_t \gamma^t r_t]\):

\[\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot|s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\cdot Q^{\pi_\theta}(s,a)\right]\]

where \(d^{\pi_\theta}(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t \Pr(s_t = s \mid \pi_\theta)\) is the discounted state visitation measure.

\(Q^{\pi_\theta}\) appears inside the gradient. This is not optional — it is the exact gradient of \(J\). Any policy gradient method must estimate \(Q^{\pi_\theta}(s,a)\) somehow:

  • REINFORCE: estimates \(Q^{\pi_\theta}(s,a) \approx G_t = \sum_{t'=t}^\infty \gamma^{t'-t} r_{t'}\) with the empirical return. Unbiased but high variance.
  • Actor-Critic: estimates \(Q^{\pi_\theta}\) with a learned \(Q_\phi\) or advantage \(A_\phi = Q_\phi - V_\phi\). Lower variance, some bias from \(Q_\phi\) approximation.
  • PPO / GRPO: estimate \(Q^{\pi_\theta}\) via rollout returns and normalize by a group-level baseline — see rejection sampling as RL.

The difference between REINFORCE and actor-critic is not whether to use a value function — it is how accurately to estimate the one that already appears in the gradient.

2.3 Three Practical Reasons

Beyond the theoretical necessity, value functions are indispensable for three engineering reasons:

1. Variance reduction is unavoidable at long horizons. The Monte Carlo return \(G_t = \sum_{t'=t}^T \gamma^{t'-t} r_{t'}\) has variance that grows with \(T\) — noise in any single reward propagates through the entire sum. Subtracting a baseline \(b(s)\) reduces variance without introducing bias (since \(\mathbb{E}_{a \sim \pi}[\nabla_\theta \log\pi_\theta(a|s) \cdot b(s)] = 0\)). The optimal baseline is \(V^{\pi_\theta}(s)\), which minimizes the variance of the gradient estimator. Without it, policy gradient methods become impractical for \(T \gg 1\).

2. Temporal credit assignment. In an episode of length \(T\), reward at step \(T\) must influence the policy gradient at step \(0\). Monte Carlo waits until episode end and then assigns credit backward — expensive and noisy. TD learning does this incrementally via the Bellman equation: \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) propagates the reward signal back one step per update. Accumulated over many steps, this is equivalent to the full return. Credit assignment at scale requires a value function to make this propagation efficient.

3. Off-policy reuse. Value functions can be trained on any transitions, regardless of which policy collected them — the Bellman equation holds for \(Q^*\) regardless of the data distribution. This enables experience replay (reusing old transitions many times), dramatically increasing sample efficiency. Policy gradient methods must use fresh on-policy rollouts because the log-probability ratio \(\pi_\theta(a|s)/\pi_{\theta_\text{old}}(a|s)\) diverges as \(\pi_\theta\) moves — PPO clips this ratio precisely to allow small off-policy excursions, but cannot stray far. Value functions have no such restriction.


Exercise 2: The Optimal Baseline

This problem derives why \(V^{\pi_\theta}(s)\) is the variance-minimizing baseline.

Prerequisites: §2.2 Value Functions Inside Policy Gradient

The policy gradient estimator with baseline \(b(s)\) is \(\hat{g} = \nabla_\theta \log\pi_\theta(a|s) \cdot (Q^{\pi_\theta}(s,a) - b(s))\). (a) Show this estimator is unbiased for any \(b(s)\) that does not depend on \(a\). (b) Compute \(\frac{d}{db}\text{Var}[\hat{g}]\) and set it to zero to find the \(b^*(s)\) that minimizes variance. (c) Show \(b^*(s) \approx V^{\pi_\theta}(s)\) under a simplifying approximation.

Solution to Exercise 2

Key insight: The bias-free condition uses \(\mathbb{E}_a[\nabla_\theta \log\pi_\theta \cdot b] = b\,\mathbb{E}_a[\nabla_\theta\log\pi_\theta] = 0\) (the score function has zero mean). Variance minimization then gives the optimal baseline.

Sketch: (a) \(\mathbb{E}_a[\nabla_\theta\log\pi_\theta(a|s)\cdot b(s)] = b(s)\sum_a \pi_\theta(a|s)\nabla_\theta\log\pi_\theta(a|s) = b(s)\nabla_\theta\sum_a\pi_\theta(a|s) = b(s)\nabla_\theta 1 = 0\). (b) Let \(g_a = \nabla_\theta\log\pi_\theta(a|s)\). \(\text{Var}[\hat{g}] = \mathbb{E}_a[\|g_a\|^2(Q_a - b)^2] - (\mathbb{E}_a[g_a(Q_a-b)])^2\). Differentiating w.r.t. \(b\) and setting to zero: \(b^*(s) = \frac{\mathbb{E}_a[\|g_a\|^2 Q_a]}{\mathbb{E}_a[\|g_a\|^2]}\) — a weighted average of \(Q\) values. (c) If \(\|g_a\|^2 \approx c\) is approximately constant across actions, \(b^*(s) \approx \mathbb{E}_a[Q_a] = V^{\pi_\theta}(s)\). This is the usual actor-critic baseline.


3. The Policy Gradient Theorem 📐

Since the theorem is so central, a brief derivation is warranted.

Goal: compute \(\nabla_\theta J(\theta)\) where \(J(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)\).

Expanding \(V^{\pi_\theta}(s) = \sum_a \pi_\theta(a|s) Q^{\pi_\theta}(s,a)\) and differentiating:

\[\nabla_\theta V^{\pi_\theta}(s) = \sum_a \nabla_\theta\pi_\theta(a|s)\,Q^{\pi_\theta}(s,a) + \sum_a \pi_\theta(a|s)\,\nabla_\theta Q^{\pi_\theta}(s,a)\]

Expanding \(\nabla_\theta Q^{\pi_\theta}\) via the Bellman equation \(Q^{\pi_\theta}(s,a) = r + \gamma\sum_{s'} P(s'|s,a) V^{\pi_\theta}(s')\) and recursively unrolling gives:

\[\nabla_\theta V^{\pi_\theta}(s) = \sum_{s'}\left(\sum_{t=0}^\infty \gamma^t \Pr(s_t=s'|s_0=s,\pi_\theta)\right)\sum_a \nabla_\theta\pi_\theta(a|s')\,Q^{\pi_\theta}(s',a)\]

Using the log-derivative trick \(\nabla_\theta\pi_\theta(a|s) = \pi_\theta(a|s)\nabla_\theta\log\pi_\theta(a|s)\) and averaging over the initial state distribution:

\[\boxed{\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\,a \sim \pi_\theta}\!\left[\nabla_\theta\log\pi_\theta(a|s)\cdot Q^{\pi_\theta}(s,a)\right]}\]

What the theorem says about the landscape

\(\nabla_\theta J\) depends on \(Q^{\pi_\theta}\) — the value function of the current policy, not the optimal \(Q^*\). This means: (1) you only need to evaluate the current policy, not solve for \(Q^*\); (2) as \(\pi_\theta\) improves, \(Q^{\pi_\theta}\) must be re-estimated; (3) actor-critic methods alternate between improving \(\pi_\theta\) (actor step) and updating the estimate of \(Q^{\pi_\theta}\) (critic step).


4. The Full Algorithm Map 📊

Algorithm Model-free/based Value/policy Key idea
Q-learning / DQN Model-free Value-based Stochastic approx. of Bellman optimality; see Q-learning note
SARSA Model-free Value-based On-policy TD; evaluates behavior policy
REINFORCE Model-free Policy-based Monte Carlo PG; high variance, no value function
PPO Model-free Policy-based (AC) Clipped IS ratio + value baseline; dominant in LLM RL
GRPO Model-free Policy-based Group-normalized advantage; no separate value network
A3C / A2C Model-free Policy-based (AC) Async parallel actor-critic; bootstrapped critic
SAC Model-free Policy-based (AC) Entropy-regularized; soft Bellman equations
TD3 Model-free Policy-based (AC) Double Q-trick + delayed policy update for continuous actions
RAFT / RAFT++ Model-free Policy-based Rejection sampling + SFT; see rejection sampling note
Behavioral Cloning — (no reward) Policy-based Imitation; see behavioral cloning note
Dyna Model-based Value-based Real + imagined transitions for Q-learning
MuZero / AlphaZero Model-based Policy-based (AC) Learned/given model + MCTS planning

The throughline: every algorithm in this table either (a) directly optimizes a Bellman equation, or (b) uses a value function to improve the gradient estimate for policy optimization, or (c) uses a model to generate data for (a) or (b). Bellman’s principle is the spine of the entire field.


References

Reference Name Brief Summary Link
Bellman (1957), Dynamic Programming Introduces the principle of optimality and Bellman equations; the mathematical foundation of all value-based RL Princeton University Press
Sutton et al. (1999), Policy Gradient Methods for Reinforcement Learning with Function Approximation Proves the policy gradient theorem; establishes that \(Q^{\pi_\theta}\) appears inside the exact gradient of \(J\) NeurIPS 1999
Sutton & Barto (2018), Reinforcement Learning: An Introduction Standard textbook; chapters 3–4 cover MDPs and Bellman equations; chapter 13 covers the policy gradient theorem Online edition
Schulman et al. (2017), Proximal Policy Optimization PPO: the dominant policy-based algorithm; uses clipped IS ratios + value baseline arXiv:1707.06347
Haarnoja et al. (2018), Soft Actor-Critic SAC: entropy-regularized actor-critic for continuous control; uses soft Bellman equations arXiv:1801.01290
Silver et al. (2017), Mastering the Game of Go without Human Knowledge AlphaZero: model-based (known game rules) + policy-based (explicit \(\pi_\theta\)) + MCTS planning Nature 550