Metropolis–Hastings Revisited

When we introduced the Metropolis–Hastings algorithm, we never checked that the stationary distribution is indeed the target distribution $\pi$. We do this now using the detailed balance condition.

Detailed Balance

Definition 1. We say a Markov chain $X$ satisfies the detailed balance condition for a distribution $\mu$ if \begin{equation}\label{e:detailedBalance}\tag{DB} \mu(x) P(x, y) = \mu(y) P(y, x) \qquad \text{for all } x, y \in \mathcal X\,. \end{equation}

Put another way, \eqref{e:detailedBalance} is equivalent to the statement that if $X_0 \sim \mu$, then \begin{equation} \P(X_0 = x, X_1 = y) = \P(X_0 = y, X_1 = x) \,. \end{equation}

Proposition 2.

If a Markov chain satisfies the detailed balance condition for a distribution $\pi$, then $\pi$ is a stationary distribution for the chain.

Proof. By detailed balance, \begin{equation} \sum_{x \in \mathcal X} \pi(x) P(x, y) = \sum_{x \in \mathcal X} \pi(y) P(y, x) = \pi(y) \sum_{x \in \mathcal X} P(y, x) = \pi(y) \,, \end{equation} which says exactly that $\pi P = \pi$.

Remark 3. The converse is false: a Markov chain with stationary distribution $\pi$ need not satisfy the detailed balance condition \eqref{e:detailedBalance}. One example is the deterministic cycle on three states, where every state moves one step clockwise with probability $1$. The uniform distribution is stationary, but detailed balance fails: if $y$ is the state clockwise from $x$, then $\pi(x) P(x, y) = 1/3$ while $\pi(y) P(y, x) = 0$.
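To make Remark 3 concrete, here is a minimal numpy sketch checking both properties for the three-state cycle:

```python
import numpy as np

# Transition matrix of the deterministic 3-state clockwise cycle.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
mu = np.full(3, 1.0 / 3.0)          # uniform distribution

# Stationarity: mu P = mu.
print(np.allclose(mu @ P, mu))      # True

# Detailed balance: mu(x) P(x, y) == mu(y) P(y, x) for all x, y.
flux = mu[:, None] * P              # flux[x, y] = mu(x) P(x, y)
print(np.allclose(flux, flux.T))    # False: detailed balance fails
```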

Proposition 4. For $f, g \colon \mathcal X \to \R$ define the inner product \begin{equation} \ip{f, g}_\pi = \sum_{x \in \mathcal X} f(x) g(x) \pi(x) \,. \end{equation} If $P$ satisfies \eqref{e:detailedBalance} then $P$ is self-adjoint with respect to $\ip{\cdot, \cdot}_\pi$. That is, for every $f, g \colon \mathcal X \to \R$ we have \begin{equation} \ip{f, Pg}_\pi = \ip{Pf, g}_\pi \,. \end{equation}

Proof: Direct calculation.
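For completeness, the calculation is: \begin{align*} \ip{f, Pg}_\pi &= \sum_{x \in \mathcal X} f(x) \paren[\Big]{\sum_{y \in \mathcal X} P(x, y) g(y)} \pi(x) = \sum_{x, y} f(x) g(y) \pi(x) P(x, y) \\ &= \sum_{x, y} f(x) g(y) \pi(y) P(y, x) = \sum_{y \in \mathcal X} \paren[\Big]{\sum_{x \in \mathcal X} P(y, x) f(x)} g(y) \pi(y) = \ip{Pf, g}_\pi \,, \end{align*} where the middle step used \eqref{e:detailedBalance}.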

Remark 5. When a chain satisfies the detailed balance condition with its stationary distribution, the chain is called reversible. For reversible chains, the rate of convergence is completely determined by the spectrum of $P$. For non-reversible chains the rate of convergence is only bounded above in terms of the spectrum of $P$; the actual convergence could be faster.

Remark 6. If a Markov chain satisfies the detailed balance condition with the uniform distribution, then the transition matrix is symmetric and hence doubly stochastic (i.e. both row sums and column sums are $1$). There are, of course, many doubly stochastic matrices that are not symmetric; the transition matrix of the three-state cycle in Remark 3 is one example.

Stationary Distribution

Proposition 7.

The stationary distribution of the Metropolis–Hastings chain is $\pi$.

Proof. The transition matrix of the Metropolis–Hastings chain is given by \begin{equation} P(x, y) = \begin{cases} Q(x, y) A(x, y) & y \neq x \,,\\ 1 - \sum_{y' \neq x} Q(x, y') A(x, y') & y = x\,. \end{cases} \end{equation} Pick $x, y \in \mathcal X$ and suppose first that $y \neq x$ and $\pi(x) Q(x, y) \geq \pi(y) Q(y, x)$. In this case \begin{equation} A(x, y) = \frac{\pi(y) Q(y, x) }{\pi(x) Q(x, y)} \qquad\text{and}\qquad A(y, x) = 1\,, \end{equation} and so \begin{align*} \pi(x) P(x, y) &= \pi(x) Q(x, y) A(x, y) = \pi(y) Q(y, x) \\ &= \pi(y) Q(y, x) A(y, x) = \pi(y) P(y, x)\,, \end{align*} and so the detailed balance condition \eqref{e:detailedBalance} holds in this case. By symmetry, when \begin{equation} \pi(x) Q(x, y) \leq \pi(y) Q(y, x) \end{equation} we also have \eqref{e:detailedBalance}. When $y = x$, the detailed balance condition \eqref{e:detailedBalance} is trivially true. Thus by Proposition 2, $\pi$ is a stationary distribution for the Metropolis–Hastings chain.
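For concreteness, here is a minimal Python sketch of a single step of this chain on a finite state space; the names `pi_u` (un-normalized target) and `Q` (proposal matrix) are illustrative:

```python
import numpy as np

def mh_step(x, pi_u, Q, rng):
    """One Metropolis-Hastings step from state x.

    pi_u : un-normalized target probabilities, shape (K,)
    Q    : proposal matrix, Q[x, y] = probability of proposing y from x
    """
    y = rng.choice(len(pi_u), p=Q[x])        # propose y ~ Q(x, .)
    # A(x, y) = min(1, pi(y) Q(y, x) / (pi(x) Q(x, y))); the normalizing
    # constant of pi cancels, so the un-normalized pi_u suffices.
    a = min(1.0, pi_u[y] * Q[y, x] / (pi_u[x] * Q[x, y]))
    return y if rng.random() < a else x      # accept, else stay at x
```

Iterating `mh_step` from any starting state produces a trajectory whose distribution converges to $\pi$, under the irreducibility and aperiodicity conditions of Remark 9 below.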

Remark 8. There are other choices of the acceptance ratio for which the stationary distribution is $\pi$. One choice (Barker 1965) is \begin{equation} A(x, y) = \frac{1}{1 + \frac{\pi(x) Q(x, y)}{\pi(y) Q(y, x)} } = \frac{1}{1 + \frac{\pi_u(x) Q(x, y)}{\pi_u(y) Q(y, x)} } \,. \end{equation} The advantage of the traditional choice (A) is that it minimizes the asymptotic variance \begin{equation} \lim_{N \to \infty} \frac{1}{N} \var\paren[\Big]{ \sum_{n=1}^N f(X_n) } \,, \end{equation} a result due to Peskun (1973).
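In the sketch above, Barker's rule would be a drop-in replacement for the acceptance probability; a minimal (hypothetical) version:

```python
def barker_accept(x, y, pi_u, Q):
    """Barker (1965) acceptance probability; the normalizing constant cancels."""
    r = (pi_u[x] * Q[x, y]) / (pi_u[y] * Q[y, x])
    return 1.0 / (1.0 + r)
```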

Remark 9. When applying the Metropolis–Hastings algorithm, you want to choose a proposal mechanism that ensures the chain is irreducible and aperiodic. If these two conditions are satisfied, then Proposition 7, combined with the convergence theorem for irreducible aperiodic chains, implies that $\dist(X_n) \to \pi$ as $n \to \infty$. This means that if you run the chain long enough, the points you generate are approximately sampled according to the desired target distribution $\pi$.

The practical issue is that this convergence may happen very slowly, and you may have to wait a very long time before $\dist(X_n)$ is close enough to the stationary distribution. In general you want to choose your proposal mechanism in a manner that improves the rate of convergence of $\dist(X_n)$ to the stationary distribution. There is no silver bullet: coming up with good proposal mechanisms is usually problem specific, and estimating the rate of convergence for a given proposal mechanism is not easy.

Independence Sampler

Let $q$ be some probability distribution and suppose we choose our proposal kernel \begin{equation}\label{e:Qindep}\tag{I} Q(x, y) = q(y) \,, \end{equation} independent of $x$. In practice this typically performs poorly, but it is one of the few cases where the convergence rate can be estimated explicitly.

Proposition 10. Suppose there exists $M < \infty$ such that $\pi(x) \leq M q(x)$ for all $x \in \mathcal X$, and $X$ is the Metropolis–Hastings chain with target distribution $\pi$ and proposal kernel \eqref{e:Qindep}. Then the transition kernel is bounded below by \begin{equation} P( x, y ) \geq \frac{\pi(y)}{M} \,, \end{equation} and consequently \begin{equation} \tmix(\epsilon) \leq \frac{\ln \epsilon}{\ln( 1 - \frac{1}{M} ) } \,. \end{equation}

Proof: Will be done on the board.
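A minimal Python sketch of the independence sampler, together with the bound of Proposition 10; the names `pi_u`, `q`, and `tmix_bound` are illustrative, not from the notes:

```python
import numpy as np

def independence_step(x, pi_u, q, rng):
    """One step of the independence sampler: propose y ~ q, independently of x."""
    y = rng.choice(len(q), p=q)
    # With Q(x, y) = q(y), the acceptance ratio simplifies to
    # A(x, y) = min(1, pi(y) q(x) / (pi(x) q(y))).
    a = min(1.0, pi_u[y] * q[x] / (pi_u[x] * q[y]))
    return y if rng.random() < a else x

def tmix_bound(pi, q, eps):
    """Mixing time bound from Proposition 10; pi here is normalized."""
    M = np.max(pi / q)                  # smallest M with pi <= M q
    return np.log(eps) / np.log(1.0 - 1.0 / M)
```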

Remark 11. The acceptance rate of a sampler is \begin{equation} \E A(X, Y) \,, \end{equation} where $X, Y$ are random variables with joint distribution $\pi(x) Q(x, y)$, and $A$ is the acceptance ratio. For the independence sampler, if $\pi \leq M q$ as above, then the acceptance rate is at least $1/M$.
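One way to see the last claim: since $Q(x, y) = q(y)$ and $q \geq \pi / M$, \begin{equation} \E A(X, Y) = \sum_{x, y} \pi(x) q(y) \min\paren[\Big]{1, \frac{\pi(y) q(x)}{\pi(x) q(y)}} = \sum_{x, y} \min\paren[\big]{ \pi(x) q(y), \pi(y) q(x) } \geq \sum_{x, y} \frac{\pi(x) \pi(y)}{M} = \frac{1}{M} \,. \end{equation}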

Exploration vs. Acceptance Trade-off

Suppose we are sampling from a distribution $\pi$ on a finite state space. In our examples so far (e.g. the Ising model, where states are spin configurations on a graph $G$) we typically proposed nearby states (e.g. flipping one spin). You can alternatively propose states by flipping many spins, or even propose a uniformly random spin configuration.

If you flip one spin at a time, it will take a very long time to go from the all-$+$ spin configuration to the all-$-$ spin configuration. This is bad because it makes the chain mix slowly.

On the other hand, you can reduce this travel time considerably by proposing states that flip many spins (or a cluster of like spins) simultaneously. This reduces the number of moves required to go between any two states. However, the probability of accepting the proposed move typically decreases. The typical situation is a trade-off:

  1. The advantage of proposing nearby states is that they tend to get accepted more often.

  2. The trade-off is that the chain explores the state space more slowly.

Glauber Dynamics / Gibbs Sampling

Glauber chains (or Gibbs samplers) are used when the single-site conditional distributions are easy to sample from. One situation is as follows: Let $G, F$ be finite sets and suppose $\mathcal X = \set{ f \colon G \to F }$. In the case of the Ising model $F = \set{\pm 1}$, and $G$ is some lattice / graph. Let $\pi_u \colon \mathcal X \to \R$ be some un-normalized probability distribution that is cheap to compute at any given $x \in \mathcal X$.

For any $v \in G$, $x \in \mathcal X$, define \begin{equation} \mathcal X(x, v) = \set{ y\in \mathcal X \st y(w) = x(w) \text{ for all } w \neq v } \,. \end{equation} Note that $\abs{\mathcal X}$ is large (it equals $\abs{F}^{\abs{G}}$); however $\abs{ \mathcal X(x, v) } = \abs{F}$, which is small. Thus it is easy to sample from $\pi|_{\mathcal X(x, v)}$. The Glauber chain is a Markov chain with the following update rule. Given $X_n = x$, we choose $v \in G$ uniformly at random, and then move to a state $y \in \mathcal X(x, v)$ chosen randomly with distribution $\pi|_{\mathcal X(x, v)}$. Explicitly, the new state $y \in \mathcal X(x, v)$ is chosen with probability \begin{equation} Q_v(x, y) \defeq \frac{\pi_u(y)}{ \sum_{z \in \mathcal X(x, v)} \pi_u(z) } \,. \end{equation}
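To make this concrete, here is a minimal sketch of one Glauber update for the Ising model ($F = \set{\pm 1}$, with $G$ taken to be an $n \times n$ periodic grid for illustration); the inverse temperature `beta` and the grid convention are assumptions, not part of the notes:

```python
import numpy as np

def glauber_step(spins, beta, rng):
    """One Glauber update for the Ising model on an n x n periodic grid.

    Assumes pi_u(x) is proportional to exp(beta * sum of neighboring spin products).
    """
    n = spins.shape[0]
    i, j = rng.integers(n), rng.integers(n)      # choose a site v uniformly
    # Sum of the four neighboring spins (periodic boundary conditions).
    s = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
         + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
    # Conditional law of the spin at v given all other spins:
    # P(+1 | rest) = e^{beta s} / (e^{beta s} + e^{-beta s}).
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
    spins[i, j] = 1 if rng.random() < p_plus else -1
    return spins
```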

Proposition 12. The stationary distribution of the Glauber chain is $\pi$.

Proposition 13. The Glauber chain is simply a Metropolis–Hastings chain, with proposal mechanism chosen so that: \begin{equation} Q(x, y) = \begin{cases} \displaystyle \frac{Q_v(x, y)}{\abs{G}} & \exists v \in G \st x(w) = y(w) \iff w \neq v\,, \\ \displaystyle \frac{1}{\abs{G}} \sum_{v \in G} Q_v(x, x) & y = x \\ 0 & \text{otherwise} \end{cases} \end{equation} Moreover, the acceptance probability of all proposed moves is $1$.
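To see why the acceptance probability is $1$, note that for $y \in \mathcal X(x, v)$ we have $\mathcal X(y, v) = \mathcal X(x, v)$, and hence \begin{equation} \pi(x) Q_v(x, y) = \frac{\pi(x) \pi_u(y)}{\sum_{z \in \mathcal X(x, v)} \pi_u(z)} = \frac{\pi(y) \pi_u(x)}{\sum_{z \in \mathcal X(y, v)} \pi_u(z)} = \pi(y) Q_v(y, x) \,, \end{equation} since $\pi \propto \pi_u$. The factors of $1/\abs{G}$ cancel, so the ratio appearing in the acceptance probability is always $1$.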

Convergence Diagnostics

Suppose $f \colon \mathcal X \to \R$ is a test function and we are interested in knowing whether our MCMC estimate $\hat I_N(f)$ has converged. The idea behind the Gelman–Rubin diagnostic is to take $J$ independent copies of the chain $X^1$, …, $X^J$, and run each of them for $2N$ iterations. The first $N$ iterations are treated as burn-in and discarded, and we write the estimator for the $j$-th chain as \begin{equation} \hat I^j_N(f) = \frac{1}{N} \sum_{n=N+1}^{2N} f(X^j_n) \,. \end{equation} We denote the average over the $J$ chains by \begin{equation} \bar I_N(f) = \frac{1}{J} \sum_{j = 1}^J \hat I^j_N(f) \,. \end{equation}

Now, we expect \begin{equation} B \defeq \frac{1}{J-1} \sum_{j = 1}^{J} (\hat I_N^j(f) - \bar I_N(f) )^2 \approx \frac{J-1}{J} \E ( \hat I_N(f) - I_\pi(f) )^2 \xrightarrow{N \to \infty} 0 \,. \end{equation} On the other hand, \begin{equation} W \defeq \frac{1}{J} \sum_{j = 1}^J \frac{1}{N-1} \sum_{n=N+1}^{2N} \paren{f(X^j_n) - \hat I^j_N(f) }^2 \approx \frac{N - 1}{N} \var_\pi(f) \,. \end{equation} So if the chains have converged, we should have $B / W \to 0$ as $N \to \infty$.

One can use normal asymptotics and hypothesis testing to convert this into a convenient rule of thumb: define \begin{equation} V \defeq \frac{N-1}{N} W + B \quad\text{and}\quad R = \sqrt{\frac{V}{W}} \,. \end{equation} As $N \to \infty$ we expect $R \to 1$. If $R > 1.2$, we detect a lack of convergence with 97.5% certainty.
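A minimal numpy sketch of this diagnostic, assuming the post burn-in values $f(X^j_n)$ have been collected into a (hypothetical) $J \times N$ array `samples`:

```python
import numpy as np

def gelman_rubin(samples):
    """Gelman-Rubin statistic R from a (J, N) array of post burn-in values f(X^j_n)."""
    J, N = samples.shape
    chain_means = samples.mean(axis=1)        # estimators I^j_N(f)
    B = chain_means.var(ddof=1)               # between-chain variance
    W = samples.var(axis=1, ddof=1).mean()    # within-chain variance
    V = (N - 1) / N * W + B
    return np.sqrt(V / W)                     # expect R -> 1 upon convergence
```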

Remark 14. The $J-1$ and $N-1$ normalizations (rather than $J$ and $N$) are Bessel's correction: they make the sample variances unbiased estimators of the true variances.