Markov Chain Monte Carlo (MCMC)
What is MCMC?
The idea behind MCMC (Markov Chain Monte Carlo) is the following. Suppose we want to sample from a target distribution $\pi$.
- We start with an arbitrary state $X_0$.
- Given a state $X_n$, we modify it (randomly) to obtain the next state $X_{n+1}$.
The aim is to design the random modification in the second step so that when $n$ is large, the distribution of $X_n$ is close to our target distribution $\pi$.
Remark 1. The reason this is called MCMC is because typically the modification in the second step above only depends on $X_n$, and not the history. That is, the process $X_n$ forms a Markov chain.
The binary hypercube
Example 2. Let $Q_N = \set{0, 1}^N$, and $\pi$ be the uniform distribution on $Q_N$. Fix $p \in (0, 1)$. Given $X_n = ( b^1_n, b^2_n, \dots, b^N_n )$, choose $k \in \set{1, \dots, N}$ uniformly at random, and choose $U_{n+1} \sim \Unif([0, 1])$ (independently). If $U_{n+1} < p$, set $X_{n+1}$ to be $X_n$ with the $k$-th bit flipped. Otherwise set $X_{n+1} = X_n$. That is, set $b^j_{n+1} = b^j_n$ if $j \neq k$ or $U_{n+1} \geq p$; and set $b^k_{n+1} = 1 - b^k_n$ if $U_{n+1} < p$. We will see later that \begin{equation} \frac{1}{2} \sum_{x \in Q_N} \abs{ \P(X_n = x) - \pi(x) } \leq e^{-cn/N} \,, \end{equation} for some constant $c$ that is independent of $N$. This tells you that after $O(N)$ steps the above MCMC algorithm produces “good samples”.
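A minimal Python sketch of the chain in Example 2 (the function name and default parameters are illustrative choices, not from these notes):

```python
import random

def hypercube_mcmc(N, steps, p=0.5, x0=None):
    """Bit-flip chain on Q_N = {0,1}^N from Example 2.

    At each step, pick a coordinate k uniformly at random and flip it
    with probability p; otherwise stay put.
    """
    x = list(x0) if x0 is not None else [0] * N
    for _ in range(steps):
        k = random.randrange(N)       # uniform coordinate in {0, ..., N-1}
        if random.random() < p:       # U_{n+1} < p: flip the k-th bit
            x[k] = 1 - x[k]
    return x

# After O(N) steps the distribution of the state is close to uniform on Q_N.
sample = hypercube_mcmc(N=20, steps=200)
```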
Remark 3. One can of course exactly sample the uniform distribution on $Q_N$ in $O(N)$ time, by choosing each bit uniformly at random. The MCMC algorithm above requires $O(N)$ steps to produce samples whose distribution is close to the uniform distribution. Moreover, each step of the above algorithm costs $O(\ln N)$, making the total cost $O(N \ln N)$. Thus, MCMC is not the best choice in this situation. We will, however, shortly see situations where MCMC may be your only viable choice.
Question 4. Let $\Gamma_N$ be the set of all $x \in Q_N$ such that no consecutive bits are $1$. That is $x = (x_1, \dots, x_N)$ such that for every $k < N$ where $x_k = 1$, we have $x_{k+1} = 0$. How do you sample the uniform distribution on $\Gamma_N$?
While this is a little more complicated than uniformly sampling from $Q_N$, it can still be done directly with a little thought. Choose $N$-tuples at random, and replace each $1$ bit with $1, 0$, shifting the remaining bits to the right; then truncate to $N$ bits. In more complicated situations one may not be able to find simple direct algorithms. Moreover, one has to take care that the direct algorithms don’t double count or miss points, and this isn’t always easy to check.
Let us briefly discuss other ways to uniformly sample from $\Gamma_N$. One option is to use rejection sampling. The cost of rejection sampling is \begin{equation} \max_x \frac{\text{target PDF}}{\text{proposal PDF}} = \frac{1/\abs{\Gamma_N}}{1/\abs{Q_N}} = \frac{2^N}{\abs{\Gamma_N}} \,. \end{equation} When $N$ is large this is usually prohibitively expensive.
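For concreteness, here is a short Python sketch of this rejection sampler (the helper names are illustrative):

```python
import random

def is_proper(x):
    """True if x has no two consecutive 1 bits, i.e. x is in Gamma_N."""
    return not any(a == 1 and b == 1 for a, b in zip(x, x[1:]))

def rejection_sample_gamma(N):
    """Propose uniform points of Q_N until one lands in Gamma_N.

    The expected number of proposals is 2^N / |Gamma_N|, which grows
    exponentially in N, so this is only practical for small N.
    """
    while True:
        x = tuple(random.randint(0, 1) for _ in range(N))
        if is_proper(x):
            return x

sample = rejection_sample_gamma(N=15)
```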
Another (viable) option is to design an MCMC algorithm:
- Start with an arbitrary $X_0 \in \Gamma_N$ and fix $p \in (0, 1)$.
- Given $X_n \in \Gamma_N$, choose $U_{n+1} \sim \Unif([0, 1])$ (independently). Choose $k \in \set{1, \dots, N}$ uniformly at random and set $X_{n+1}’$ to be $X_n$ with the $k$-th bit flipped. Define \begin{equation} X_{n+1} = \begin{cases} X_{n+1}’ & \text{if } U_{n+1} < p \text{ and } X_{n+1}’ \in \Gamma_N \,, \\ X_n & \text{otherwise} \end{cases} \,. \end{equation}
We will see later that \begin{equation} \lim_{n \to \infty} \frac{1}{2} \sum_{x \in \Gamma_N} \abs[\Big]{\P( X_n = x ) - \frac{1}{\abs{\Gamma_N}} } = 0 \,. \end{equation} In fact the above converges to $0$ exponentially as $n \to \infty$, and in this situation one can even get a bound on the exponential rate. This is important as it tells you how long you need to wait before your MCMC samples become good.
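A minimal Python sketch of the chain above (the function name and the all-zero starting state are illustrative choices):

```python
import random

def gamma_mcmc(N, steps, p=0.5):
    """Chain on Gamma_N described above: propose a single bit flip and
    accept it only if no two consecutive bits become 1."""
    x = [0] * N                          # X_0 = (0, ..., 0) lies in Gamma_N
    for _ in range(steps):
        if random.random() >= p:         # U_{n+1} >= p: stay put
            continue
        k = random.randrange(N)
        x[k] = 1 - x[k]                  # tentative flip
        bad_left = k > 0 and x[k - 1] == 1 and x[k] == 1
        bad_right = k < N - 1 and x[k + 1] == 1 and x[k] == 1
        if bad_left or bad_right:        # proposal left Gamma_N: reject it
            x[k] = 1 - x[k]
    return x

sample = gamma_mcmc(N=30, steps=3000)
```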
Uniform sampling on a graph
A (simple) graph $G = (V, E)$ is a collection of vertices $V$ and edges $E$.
- For $x \in V$, $N(x)$ denotes the neighborhood of $x$, and is defined to be the set of all points $y$ such that $x, y$ are connected by an edge.
- The degree of a vertex, $\deg(x)$, is the number of edges with endpoint $x$.
- $\Delta(G) = \max \set{ \deg(x) \st x \in V }$.
- A graph is connected if every pair of points can be connected by a path of edges.
The binary hypercube is a graph whose vertices are $Q_N$, and edges connect pairs of points that differ in exactly one bit. The set $\Gamma_N$ is a connected subgraph. Our MCMC algorithms above involve choosing a neighbor of $X_n$ uniformly at random, and moving to it with probability $p$. This will produce a uniform sample if the graph is connected and all vertices have the same degree (which is true for $Q_N$).
If the number of neighbors varies from point to point, then the above algorithm won’t yield a process that is eventually uniformly distributed. For instance, in the figure below, choosing a neighbor at random and moving to it with probability $p$ will move $X$ from $A$ to $B$ less often than it moves from $B$ to $A$. Thus one would expect that for large $n$, $\P(X_n = A) > \P(X_n = B )$.
We can account for the varying degree at each point as follows.
Proposition 5. Let $G = (V, E)$ be a connected graph. Fix $p \in (0, 1)$. Given $X_n \in V$, choose $X_{n+1}’ \in N(X_n)$ uniformly at random, and $U_{n+1} \sim \Unif( [0, 1])$ independently. Define \begin{equation} X_{n+1} = \begin{cases} X_{n+1}’ & \displaystyle \text{if } U_{n+1} < \frac{p \deg(X_n)}{\Delta(G)} \,, \\ X_n & \text{otherwise} \,. \end{cases} \end{equation}
There exists a constant $c$ such that for all large $n$ we have \begin{equation} \frac{1}{2} \sum_{x \in V} \abs{ \P(X_n = x) - \pi(x) } \leq e^{-cn} \,, \end{equation} where $\pi$ is the uniform distribution on $V$.
Proof: Will be done later.
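A sketch of the walk in Proposition 5 in Python (the adjacency-list representation and the names are assumptions made for illustration):

```python
import random

def uniform_graph_walk(adj, x0, steps, p=0.5):
    """Walk from Proposition 5 on a connected graph.

    `adj` maps each vertex to the list of its neighbors.  A proposed
    neighbor is accepted with probability p * deg(X_n) / Delta(G), so
    every edge is traversed with probability p / Delta(G) per step and
    the limiting distribution is uniform on the vertices.
    """
    Delta = max(len(nbrs) for nbrs in adj.values())   # maximum degree
    x = x0
    for _ in range(steps):
        y = random.choice(adj[x])                     # uniform neighbor
        if random.random() < p * len(adj[x]) / Delta:
            x = y
    return x

# Toy example: a path A - B - C, where deg(B) = 2 but deg(A) = deg(C) = 1.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
sample = uniform_graph_walk(adj, "A", steps=1000)
```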
Remark 6. In this generality it is not possible to determine the dependence of $c$ on $\abs{G}$. In a few examples, however, this can be computed. It is usually desirable for $c$ to be an inverse polynomial in $\log \abs{G}$.
Hard Sphere Model
Consider a simple hard sphere model on a grid. Suppose \begin{equation} \Lambda = \{1, \ldots, n_1\} \times \{1, \ldots, n_2 \} \subseteq \Z^2 \,. \end{equation} A proper configuration on $\Lambda$ consists of coloring each point either black or white in such a way that no two adjacent points are white. Let $\mathcal X$ denote the set of all proper configurations on $\Lambda$, and $\pi$ be the uniform distribution on $\mathcal X$ so that each proper configuration is equally likely. This model arises in statistical physics and one of the questions of interest is to estimate the typical number of white points in a proper configuration. That is, if $W(x)$ is the number of white points in $x \in \mathcal X$ then we want the value of \begin{equation} \E W=\sum_{x \in \mathcal X}\frac{W(x)}{\abs{\mathcal X}} \,. \end{equation}
We can compute this using an MCMC method. Think of all colorings as a graph $G$, where two colorings are neighbors if the colorings differ at only one point. The degree of every vertex of this graph is $\Delta(G) = n_1 n_2$. Now we construct an MCMC method using Proposition 5 to produce the process $X_n$ that does a random walk on the sub-graph of proper colorings. Then we estimate \begin{equation}\label{e:EW}\tag{1} \E W \approx \frac{1}{N} \sum_{n = 1}^N W(X_n) \,. \end{equation}
Remark 7. The fact that the right hand side of \eqref{e:EW} converges to $\E W$ as $N \to \infty$ is due to the Ergodic theorem, which we will see later. The law of large numbers doesn’t apply here as $X_n$’s are not i.i.d. You can, however, simulate $M$ independent copies of $X$, say $X^{1}$, …, $X^M$, and then use \begin{equation} \E W \approx \frac{1}{M} \sum_{i = 1}^M W(X^i_N) \,, \end{equation} or \begin{equation} \E W \approx \frac{1}{N M} \sum_{i = 1}^M \sum_{n = 1}^N W(X^i_n) \,. \end{equation}
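A minimal sketch of this estimator in Python (the grid sizes, the all-black starting configuration, and the function name are illustrative choices):

```python
import random

def estimate_EW(n1, n2, N, p=0.5):
    """Estimate E[W] for the hard sphere model on an n1 x n2 grid.

    A configuration maps each grid point to 0 (black) or 1 (white);
    a proposed flip that would create two adjacent white points is
    rejected, so the chain stays in the set of proper configurations.
    """
    sites = [(i, j) for i in range(n1) for j in range(n2)]
    x = {s: 0 for s in sites}            # the all-black configuration is proper
    total = 0
    for _ in range(N):
        if random.random() < p:
            i, j = random.choice(sites)  # uniform grid point
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            # turning a point white is allowed only if no neighbor is white
            if x[(i, j)] == 1 or all(x.get(s, 0) == 0 for s in nbrs):
                x[(i, j)] = 1 - x[(i, j)]
        total += sum(x.values())         # W(X_n): number of white points
    return total / N

print(estimate_EW(n1=10, n2=10, N=50_000))
```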
Code for Figures
The code generating all figures on this page is here.