Particle MCMC methods

Posted on August 30, 2020 by Calvin

Note: This blog post is still a rough draft. Read on with caution.

Often in machine learning we want to understand the dynamics of sequences of data, typically ordered in time. In this post we will talk about inference methods for Bayesian dynamic latent variable models, a type of probabilistic graphical model.

importance sampling

In a nutshell, Bayesian inference for nonlinear state-space models is analytically intractable, hence the need for approximate methods. To set notation, we will deal with latent Markov models, in which each latent state depends only on the previous state, not on the full history of states before it. We observe a sequence \(y_t\) for \(t=1,...,T\) from the model– this can be something like the total lab test volume in a given time window, or the number of patients testing positive for an infectious disease (like COVID-19). The main assumption of a latent variable model is that the observations are generated by a set of underlying latent variables \(x_t\), which are themselves time-varying.

In the dynamic latent variable models we’re studying, the dynamics in the model are given by the latent state transitions \(p(x_t|x_{t-1}) = f(x_{t-1}, \theta_\text{trans})\) and observation probabilities \(p(y_t|x_t) = g(x_t, \theta_\text{obs})\). Here the tunable parameters are denoted by \(\theta=(\theta_\text{obs}, \theta_\text{trans})\).
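To make this concrete, here is a minimal sketch (my own running example, not from the post) of such an \(f\) and \(g\) in Python/numpy: a Gaussian random walk on a latent log-rate with Poisson count observations, loosely in the spirit of the case-count example above. The function names and parameter values are illustrative assumptions.

```python
import math
import numpy as np

def transition_sample(x_prev, rng, sigma=0.3):
    # f: the latent log-rate follows a Gaussian random walk
    return x_prev + rng.normal(0.0, sigma)

def observation_logpdf(y, x):
    # g: observed counts are Poisson with rate exp(x)
    return y * x - np.exp(x) - math.lgamma(y + 1)

# simulate T observations from the generative model
rng = np.random.default_rng(0)
T, x, ys = 50, 0.0, []
for _ in range(T):
    x = transition_sample(x, rng)
    ys.append(rng.poisson(np.exp(x)))
```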

Combined we get a full joint distribution

\[ p(x_{0:T}, y_{1:T}, \theta) = \prod_{t=1}^T p(y_t|x_t, \theta_\text{obs})\prod_{t=1}^T p(x_t|x_{t-1}, \theta_\text{trans})\, p(x_0|\theta)\, p(\theta) \]

The goal here is to perform Bayesian inference over the latent variables; that is, we want to compute \(p(x_{0:T}|y_{1:T})\). But as we said before, this is analytically intractable, so the best we can do is approximate it by sampling. However, sampling from \(p(x_{0:T}|y_{1:T})\) is also hard!

Instead, we take a cue from Monte Carlo methods and sample instead from a proposal distribution \(q(x_{0:T}|y_{1:T})\) with larger tails than the model distribution. Importance sampling corrects for this bias by scaling the samples from \(q\) by an appropriate weight. That is, for a given integrable function \(\zeta\), we compute our desired expectations as

\[ \begin{align} \mathbf{E}_{x\sim p(x_{0:T}|y_{1:T})}[\zeta(x)] &= \mathbf{E}_{x\sim q(x_{0:T}|y_{1:T})}\left[\frac{p(x_{0:T}|y_{1:T})}{q(x_{0:T}|y_{1:T})}\zeta(x)\right] \\ &= \mathbf{E}_{x\sim q(x_{0:T}|y_{1:T})}[\omega(x)\zeta(x)] \end{align} \]

where \(\omega(x) = \frac{p(x_{0:T}|y_{1:T})}{q(x_{0:T}|y_{1:T})}\).

What’s the problem here? The weight \(\omega(x)\) requires evaluating \(p(x_{0:T}|y_{1:T})\), which is exactly the intractable posterior we started with. What we can evaluate is the joint \(p(x_{0:T}, y_{1:T})\), but then we would have to deal with the normalizing factor \(p(y_{1:T})\). How do we deal with this?

The main trick here is to realize that the normalizing factor is itself an expectation, and hence can be estimated by Monte Carlo as well. Writing the unnormalized weight as \(\widetilde{\omega}(x) = \frac{p(x_{0:T}, y_{1:T})}{q(x_{0:T}|y_{1:T})} = \omega(x)\, p(y_{1:T})\), we have

\[ p(y_{1:T}) = \int_{x_{0:T}} p(x_{0:T}, y_{1:T})\, dx_{0:T} = \int_{x_{0:T}} \widetilde{\omega}(x)\, q(x_{0:T}|y_{1:T})\, dx_{0:T} \]
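Concretely, drawing \(x^{(i)}\sim q(x_{0:T}|y_{1:T})\) for \(i=1,\dots,N\) and computing the unnormalized weights \(\widetilde{\omega}^{(i)} = \widetilde{\omega}(x^{(i)})\), we estimate the numerator and the normalizing factor from the same samples, which gives the self-normalized estimator

\[ \mathbf{E}_{x\sim p(x_{0:T}|y_{1:T})}[\zeta(x)] \approx \frac{\sum_{i=1}^N \widetilde{\omega}^{(i)}\, \zeta(x^{(i)})}{\sum_{j=1}^N \widetilde{\omega}^{(j)}} \]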

Putting this all together, we get the self-normalized importance sampler, described in (pseudo)code:
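A minimal numpy sketch (the function names and signatures here are my own, not the post's), working in log space for numerical stability:

```python
import numpy as np

def self_normalized_is(sample_q, logpdf_q, logpdf_joint, zeta, n_samples, rng):
    """Estimate E_{p(x|y)}[zeta(x)] using samples from a proposal q(x|y)."""
    xs = [sample_q(rng) for _ in range(n_samples)]                   # x^(i) ~ q(x | y)
    log_w = np.array([logpdf_joint(x) - logpdf_q(x) for x in xs])    # log unnormalized weights
    w = np.exp(log_w - log_w.max())                                  # stabilize before exponentiating
    w /= w.sum()                                                     # self-normalize
    return sum(w_i * zeta(x_i) for w_i, x_i in zip(w, xs))
```

Note that only the joint \(p(x_{0:T}, y_{1:T})\) is ever evaluated; the normalizing factor cancels in the ratio.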

particle filter

The difficulty with importance sampling is choosing a good proposal distribution. If our goal is to compute the filtering marginals \(p(x_t|y_{1:t})\), though, it turns out we don’t need to construct a fancy proposal: we can reuse the transition probabilities of the model itself. This is called the particle filter (or bootstrap filter) method.

The method is straightforward: we want to sample from \(p(x_t|y_{1:t})\), so we perform this iteratively, starting with a prior \(p(x_0)\). Now the trick is that our proposal is the transition probability \(p(x_1|x_0)\), which we assume is known. So sampling \(\widetilde{x}^{(i)}_1\sim p(x_1|x_0)\) for \(i=1,\dots,N\) gives a batch of particles that we will use to represent our sampled distribution.

Calculating the importance weights, we see that

\[ \widetilde{\omega}_1^{(i)} = \frac{p(\widetilde{x}^{(i)}_1|y_1)}{p(\widetilde{x}^{(i)}_1|x_0)} \propto p(y_1|\widetilde{x}^{(i)}_1) = g(\widetilde{x}^{(i)}_1, \theta_\text{obs}) \]

so normalizing \(\omega_1^{(i)} = \frac{\widetilde{\omega}_1^{(i)}}{\sum_{j=1}^N \widetilde{\omega}_1^{(j)}}\) gives the importance weights for the sampled particles. That’s the first trick of the particle filter. The second trick addresses weight degeneracy: as time goes on, most of the weights collapse toward zero and eventually only one particle effectively survives. This is untenable, so we instead resample the particles according to their importance weights

\[ a_1^{(i)} \sim \text{Categorical}(\omega^{(j)}_1\text{ for }j=1...N) \]

where \(a_1\) are the indices of the new resampled particles.
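In numpy, this resampling step is a single categorical draw over particle indices (a toy sketch with made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
particles = np.array([0.1, -0.4, 0.7, 1.2])   # toy particle values x_1^(i)
weights = np.array([0.1, 0.2, 0.3, 0.4])      # normalized importance weights omega_1^(i)
# a_1^(i) ~ Categorical(weights): draw ancestor indices, then gather the survivors
ancestors = rng.choice(len(particles), size=len(particles), p=weights)
particles = particles[ancestors]
```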

Now we repeat this for each new time-step. We sample \(\widetilde{x}^{(i)}_2\sim p(x_2|x^{(i)}_1)\) by propagating each particle, and compute importance weights as normal:

\[ \widetilde{\omega}_2^{(i)} = \frac{p(\widetilde{x}^{(i)}_2|y_2)}{p(\widetilde{x}^{(i)}_2|x^{(i)}_1)} \propto p(y_2|\widetilde{x}^{(i)}_2) = g(\widetilde{x}^{(i)}_2, \theta_\text{obs}) \]

We then normalize and resample.

This is the essence behind a particle filter! We will use this to provide the approximate likelihoods for more advanced particle MCMC methods, such as the particle Metropolis-Hastings algorithm.
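Here is a minimal bootstrap particle filter in this spirit (a sketch under my own naming conventions: `sample_x0`, `transition_sample`, and `observation_logpdf` stand for the prior, \(f\), and \(g\) of the model). It also accumulates a running log-likelihood estimate, which is exactly what particle Metropolis-Hastings will need below.

```python
import numpy as np

def bootstrap_particle_filter(ys, sample_x0, transition_sample, observation_logpdf,
                              n_particles, rng):
    """Bootstrap particle filter: propose with the transition, weight by the observation."""
    particles = np.array([sample_x0(rng) for _ in range(n_particles)])
    log_lik = 0.0          # accumulates the log of the likelihood estimate
    filtered = []          # particle clouds representing p(x_t | y_{1:t})
    for y in ys:
        # propagate each particle through the transition density p(x_t | x_{t-1})
        particles = np.array([transition_sample(x, rng) for x in particles])
        # unnormalized weights are the observation likelihoods g(x_t, theta_obs)
        log_w = np.array([observation_logpdf(y, x) for x in particles])
        shift = log_w.max()
        w = np.exp(log_w - shift)
        log_lik += shift + np.log(w.mean())   # log of (1/N) * sum of unnormalized weights
        w /= w.sum()
        # resample ancestor indices according to the normalized weights
        ancestors = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[ancestors]
        filtered.append(particles.copy())
    return filtered, log_lik
```

With the toy model from earlier, `bootstrap_particle_filter(ys, lambda rng: 0.0, transition_sample, observation_logpdf, 500, rng)` filters the simulated counts and returns a log-likelihood estimate alongside the particle clouds.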

metropolis-hastings and mcmc

Given the state-space model above, our ultimate goal is to perform inference over latent states and parameters, \(p(x_{0:T}, \theta\mid y_{1:T})\). The particle filter above gives us a way to approximate the state posterior \(p(x_{0:T}\mid\theta, y_{1:T})\) (or at least its filtering marginals) for a fixed \(\theta\). So how can we compute the other piece, the parameter posterior \(p(\theta\mid y_{1:T})\)? We can appeal to a standard MCMC method.

Note: Why not importance sampling? Probably because an appropriate proposal distribution \(q(\theta\mid y_{1:T})\) with heavier tails than \(p(\theta\mid y_{1:T})\) is hard to find and construct.

With that out of the way, let’s describe a basic MCMC method– the Metropolis-Hastings algorithm. The central idea of MCMC methods is to construct a Markov chain whose stationary distribution is the posterior distribution we want to sample from. That is, we want to perform a random walk in latent/parameter space such that after some time passes, the places visited by the walk will start to look like samples distributed according to the posterior.

Formally, MCMC seeks to construct a transition kernel \(P_\text{MCMC}(x\to y)\), which describes the Markovian walk from \(x\) to \(y\) in parameter space. Such a kernel satisfies a consistency equation, the Chapman-Kolmogorov equation:

\[ P^{(2)}_\text{MCMC}(x\to z) = \int_y P_\text{MCMC}(x\to y) P_\text{MCMC}(y\to z) dy \]

where the \((2)\) denotes the 2-step transition kernel.

By definition, a stationary distribution for a Markov kernel is a distribution \(\pi(x)\) over the parameter space that is “fixed” by the kernel– that is,

\[ \pi(z) = \int_y \pi(y) P_\text{MCMC}(y\to z) dy \]

We see from the Chapman-Kolmogorov equation that \(P_\text{MCMC}(x\to -)\), viewed as a distribution over the endpoint for a fixed \(x\), is almost a candidate: iterating the kernel reproduces the stationarity equation, except that the number of steps on the two sides differs by one. Under certain mathematical conditions (ergodicity), we can take the limit in the number of steps and show that, for any initial point \(x\), the stationary distribution can be described as

\[ \pi(z) = \lim_{n\to\infty} P_\text{MCMC}^{(n)}(x\to z) \]

This shows the main point of MCMC– starting from any initial point, a random walk described by the kernel \(P_\text{MCMC}(x\to y)\) eventually gives samples distributed according to the stationary distribution \(\pi(x)\).

However, this limit is a fairly difficult condition to check, since the limiting properties of a hand-constructed kernel \(P_\text{MCMC}(x\to z)\) are usually hard to get a grasp on. An easier condition to check is time-reversibility: \(\pi(x) P_\text{MCMC}(x\to y) = \pi(y) P_\text{MCMC}(y\to x)\). A distribution \(\pi\) that satisfies this condition is a stationary distribution of the MCMC kernel.

Proof: Integrate the time-reversibility condition over \(x\):

\[ \begin{align} \int_x \pi(x) P_\text{MCMC}(x\to y) dx &= \int_x \pi(y) P_\text{MCMC}(y\to x) dx \\ &= \pi(y) \end{align} \]

where the last equality uses \(\int_x P_\text{MCMC}(y\to x)\, dx = 1\).

Now, we describe Metropolis-Hastings. Remember, in this setting our goal is to sample from \(p(\theta\mid y_{1:T})\), so this is our stationary distribution. To define the MH kernel, we first start with a proposal distribution \(q(x\to y)\). If \(q\) already satisfies time-reversibility with respect to \(\pi\), then we are done, and \(q(x\to y)\) is our desired kernel. In general, though, it does not.

The trick of Metropolis-Hastings is to introduce an accept-reject step in the walk to correct for the bias. Let \(\alpha(x\to y)\) be an acceptance probability of moving from \(x\) to \(y\). We set our MH kernel to be \(P_\text{MH}(x\to y) = q(x\to y)\alpha(x\to y)\).

What is \(\alpha(x\to y)\)? We choose this probability so that time-reversibility is satisfied. Under the assumption that \(\alpha(x\to y) < 1\), we require

\[ \begin{align} \pi(x) P_\text{MH}(x\to y) &= \pi(x) q(x\to y) \alpha(x\to y) \\ &= \pi(y) q(y\to x) \alpha(y\to x) \\ &= \pi(y) q(y\to x) \end{align} \]

where \(\alpha(y\to x) = 1\). This implies ultimately that

\[ \alpha(x\to y) = \min\left\{1, \frac{\pi(y)q(y\to x)}{\pi(x)q(x\to y)}\right\} \]

This fully describes the Metropolis-Hastings algorithm. In (pseudo)code, it looks like:
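A minimal sketch in numpy, using a symmetric Gaussian random-walk proposal so that the \(q\)-ratio cancels in the acceptance probability (the names and signature are my own, not the post's):

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis-Hastings targeting an unnormalized log-density."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_p = log_target(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step_size * rng.normal(size=theta.shape)  # q(theta -> theta')
        log_p_prop = log_target(proposal)
        # accept with probability min(1, pi(theta') / pi(theta)); the q-ratio cancels here
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
        samples.append(theta.copy())
    return np.array(samples)
```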

The key remark to make here is that the only place where the target distribution \(\pi(x)\) appears is in the acceptance probability, where only a ratio appears. Hence, one upshot of the MH algorithm is that \(\pi(x)\) doesn’t have to be normalized. This is a boon for Bayesian inference, where \(\pi(x)\) can just be an unnormalized posterior, i.e. the likelihood times the prior.

auxiliary variables

However, what if the likelihood itself is hard to compute? In our situation we are trying to compute \(p(\theta\mid y_{1:T})\) via Bayesian inference, so in the MH acceptance probability above we take as our \(\pi(\theta) = p(y_{1:T}\mid\theta)p(\theta)\). But we don’t know how to compute

\[ p(y_{1:T}\mid\theta) = \int_{x_{0:T}} p(y_{1:T}\mid x_{0:T},\theta) p(x_{0:T}\mid\theta) dx_{0:T} \]

exactly. Instead, suppose that we have an approximation to the likelihood \(\widehat{z}\simeq p(y_{1:T}\mid\theta)\). What we will do is treat this as a random variable \(\widehat{z}\sim \psi(z\mid\theta, y_{1:T})\) and append this as an auxiliary variable to the joint distribution. Our joint then becomes

\[ \begin{align} \psi(\theta, z\mid y_{1:T}) &= p(\theta\mid y_{1:T})\,\psi(z\mid\theta, y_{1:T}) \\ &\propto p(y_{1:T}\mid\theta)\,p(\theta)\,\psi(z\mid\theta, y_{1:T}) \\ &\simeq z\, p(\theta)\,\psi(z\mid\theta, y_{1:T}) \end{align} \]

where in the last line we replaced the intractable likelihood by the value \(z\) of its estimator \(\widehat{z}\). This final term is our target distribution, \(\pi(\theta, z\mid y_{1:T}) \propto z\, p(\theta)\,\psi(z\mid\theta, y_{1:T})\). It looks different from our real target \(p(\theta\mid y_{1:T})\), not to mention that we now carry around an extra auxiliary variable \(z\).

But it doesn’t matter! All that matters is that the \(\theta\)-marginal of \(\pi(\theta, z\mid y_{1:T})\) matches \(p(\theta\mid y_{1:T})\). In particular, this holds exactly when \(\widehat{z}\) is an unbiased estimator of \(p(y_{1:T}\mid\theta)\).
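To see why, integrate the auxiliary variable out of the target (a one-line check):

\[ \int \pi(\theta, z\mid y_{1:T})\, dz \propto p(\theta) \int z\, \psi(z\mid\theta, y_{1:T})\, dz = p(\theta)\, \mathbf{E}_\psi[\widehat{z}\mid\theta] \]

so the \(\theta\)-marginal is proportional to \(p(y_{1:T}\mid\theta)\,p(\theta) \propto p(\theta\mid y_{1:T})\) precisely when \(\mathbf{E}_\psi[\widehat{z}\mid\theta] = p(y_{1:T}\mid\theta)\), i.e. when the estimator is unbiased.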

From this, we have all the ingredients we need to perform Metropolis-Hastings. At each step of the random walk, we sample both \(\theta'\) and \(z'\), and accept-or-reject based on the acceptance probability (which one can compute):

\[ \alpha_\text{MH}(\theta', z') = \min\left\{1, \frac{z' p(\theta') q(\theta'\to\theta_\text{prev})}{z_\text{prev} p(\theta_\text{prev}) q(\theta_\text{prev}\to\theta')}\right\} \]

Finally, discarding the auxiliary \(z\)-component and keeping only the \(\theta\)-component of the walk gives the resulting MCMC samples from \(p(\theta\mid y_{1:T})\).

particle metropolis-hastings

Okay, all that sounds good, but– what are we taking as our approximation \(\widehat{z}\simeq p(y_{1:T}\mid\theta)\)? As an integral form, the likelihood we want is

\[ p(y_{1:T}\mid\theta) = \int_{x_{0:T}} p(y_{1:T}\mid x_{0:T},\theta) p(x_{0:T}\mid\theta) dx_{0:T} \]

An idea is to approximate this using a Monte Carlo estimate. Factorizing the integral even further,

\[ p(y_{1:T}\mid\theta) = \prod_{t=1}^T \int_{x_t} p(y_t\mid x_t,\theta)\, p(x_t\mid y_{1:t-1}, \theta)\, dx_t \]

we can perform the MC estimation sequentially, and then multiply the estimates together. Exploiting the sequential nature of the model often decreases the variance of the final estimator.

But note that we already have a sampler for \(p(x_t\mid y_{1:t-1}, \theta)\)– the particle filter! Its propagated particles \(\widetilde{x}^{(i)}_t\) represent exactly this predictive distribution, and the unnormalized weights are \(\widetilde{\omega}^{(i)}_t = p(y_t\mid \widetilde{x}^{(i)}_t, \theta)\). So we can ultimately produce a likelihood estimator \(\widehat{z}\) by performing a full run of the particle filter over the model and letting

\[ \widehat{z} = \prod_{t=1}^T \left[\frac{1}{N}\sum^N_{i=1} \widetilde{\omega}^{(i)}_t\right] \]

It is a standard (though nontrivial) result that \(\widehat{z}\) defined this way is indeed an unbiased estimator of \(p(y_{1:T}\mid\theta)\). Combining this particle filter likelihood estimator with the Metropolis-Hastings method above, we get the particle Metropolis-Hastings algorithm for Bayesian parameter inference. In (pseudo)code this looks like:
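A sketch of the full loop, reusing the `bootstrap_particle_filter` from the particle filter section and a symmetric random-walk proposal on \(\theta\) (so the \(q\)-ratio cancels); `build_model` and `prior_logpdf` are assumed helpers of this sketch, not part of the post:

```python
import numpy as np

def particle_metropolis_hastings(ys, build_model, prior_logpdf, theta0,
                                 n_particles, n_steps, step_size, rng):
    """Pseudo-marginal MH: a particle-filter estimate of p(y | theta) stands in for the likelihood."""
    def log_lik_estimate(theta):
        # build_model(theta) returns (sample_x0, transition_sample, observation_logpdf)
        sample_x0, transition_sample, observation_logpdf = build_model(theta)
        _, log_z = bootstrap_particle_filter(ys, sample_x0, transition_sample,
                                             observation_logpdf, n_particles, rng)
        return log_z

    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_post = log_lik_estimate(theta) + prior_logpdf(theta)         # log of z_hat * p(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step_size * rng.normal(size=theta.shape)  # symmetric random walk on theta
        log_post_prop = log_lik_estimate(proposal) + prior_logpdf(proposal)
        # accept with probability min(1, z' p(theta') / (z p(theta))); keep the old estimate on reject
        if np.log(rng.uniform()) < log_post_prop - log_post:
            theta, log_post = proposal, log_post_prop
        samples.append(theta.copy())
    return np.array(samples)
```

Note that the likelihood estimate for the current \(\theta\) is kept, not recomputed, on rejection; recomputing it at every step would break the pseudo-marginal argument above.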

probabilistic programming

If you think about it, a generative story for a joint distribution can be unraveled into a sequential model of states, and hence a version of particle MCMC methods applies here as well. In this way, the states \(x_t\) refer to the total execution state of the program at that point in time, and the observations \(y_t\) correspond to the observe primitives in the PPL (i.e. conditioning statements).