Markov Chains
The Metropolis–Hastings algorithm works because of the following three facts:
- The algorithm outputs a Markov chain.
- Under certain conditions, the distribution of a Markov chain (started arbitrarily) will converge to its unique stationary distribution.
- The stationary distribution of the Metropolis–Hastings Markov chain is the desired target distribution.
We will prove each of these assertions here.
Basic notions
Definition 1. A Markov chain on $\mathcal X$ is a family of random variables $X_0$, $X_1$, … such that for all $n \in \N$ and $x_0, \dots, x_{n+1} \in \mathcal X$ we have \begin{equation} \P(X_{n+1} = x_{n+1} \given \set{ X_k = x_k \st 0 \leq k \leq n} ) = \P(X_{n+1} = x_{n+1} \given X_n = x_n ) \,. \end{equation}
Definition 2. A time homogeneous Markov chain is a Markov chain where \begin{equation} \P( X_{n+1} = y \given X_n = x ) = \P( X_1 = y \given X_0 = x )\,, \end{equation} for all $x, y \in \mathcal X$ and $n \in \N$.
Example 3 (Simple random walk). Let $\mathcal X = \Z$, $\xi_n$ be i.i.d. $\pm 1$ valued random variables (coin flips) and set $X_{n+1} = X_n + \xi_{n+1}$.
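For concreteness, here is a tiny Python sketch simulating this chain; the $\pm 1$ steps are taken to be fair coin flips, which the example above does not insist on.

```python
import random

def simple_random_walk(n_steps, x0=0):
    """Simulate the simple random walk on Z started at x0.
    Each step adds an independent +/-1 (here a fair coin flip)."""
    path = [x0]
    for _ in range(n_steps):
        path.append(path[-1] + random.choice([-1, 1]))
    return path

print(simple_random_walk(10))
```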
Example 4 (Metropolis chain). Given $X_n$, choose $X_{n+1}$ according to the Metropolis–Hastings rule:
- Given $X_n = x \in \mathcal X$, propose $y \in \mathcal X$ with probability $Q(x, y)$.
- Let $X_{n+1} = y$ with probability $A(x,y)$ (defined here), and $X_{n+1} = x$ otherwise.
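For concreteness, here is a minimal Python sketch of one step of this update. The target density `p`, proposal sampler `propose`, and proposal kernel `Q` are hypothetical placeholders, and the acceptance probability used below is the standard Metropolis–Hastings choice $A(x, y) = \min\bigl(1, \frac{p(y) Q(y, x)}{p(x) Q(x, y)}\bigr)$.

```python
import random

def mh_step(x, p, propose, Q):
    """One Metropolis-Hastings step from state x.

    p(x)       -- (possibly unnormalized) target probability of state x
    propose(x) -- sample a proposal y with probability Q(x, y)
    Q(x, y)    -- proposal transition probability
    """
    y = propose(x)
    # Standard Metropolis-Hastings acceptance probability A(x, y).
    a = min(1.0, (p(y) * Q(y, x)) / (p(x) * Q(x, y)))
    # Accept the proposal with probability A(x, y); otherwise stay at x.
    return y if random.random() < a else x
```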
Definition 5. The transition matrix of a (time homogeneous) Markov chain is \begin{equation} P(x, y) = \P( X_1 = y \given X_0 = x )\,. \end{equation}
Remark 6. Notice $P(x, y) \geq 0$ and $\sum_y P(x, y) = 1$. Such matrices are called stochastic matrices.
Proposition 7. For any $n \in \N$, \begin{equation} \P( X_n = y \given X_0 = x ) = P^n(x, y)\,, \end{equation} where $P^n$ means the $n^\text{th}$ power of the matrix $P$.
Proof. Done in class.
Proposition 8. If $\dist(X_0) = \mu_0$ (i.e. $\P(X_0 = x) = \mu_0(x)$ for all $x \in \mathcal X$), then $\dist(X_n) = \mu_0 P^n$ (as a matrix product, treating $\mu_0$ as a row vector).
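To make Propositions 7 and 8 concrete, here is a small NumPy sketch; the two-state transition matrix is made up purely for illustration.

```python
import numpy as np

# A made-up two-state transition matrix (each row sums to 1).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# P^n(x, y) = probability of going from x to y in n steps (Proposition 7).
print(np.linalg.matrix_power(P, 10))

# If dist(X_0) = mu0 (a row vector), then dist(X_n) = mu0 P^n (Proposition 8).
mu0 = np.array([1.0, 0.0])              # start deterministically at state 0
print(mu0 @ np.linalg.matrix_power(P, 10))
```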
Proof. Done in class.
Stationary distribution.
Definition 9. We say a distribution $\pi$ is stationary for the Markov chain $X$ if $\pi P = \pi$. Equivalently, if $X_0 \sim \pi$ then $X_n \sim \pi$ for all $n \in \N$.
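Numerically, a stationary distribution can be found by computing a left eigenvector of $P$ with eigenvalue $1$. A sketch, using the same made-up matrix as above:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# pi P = pi means pi (as a row vector) is a left eigenvector of P with
# eigenvalue 1, i.e. a right eigenvector of P transposed.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = v / v.sum()                         # normalize to a probability distribution

print(pi)                                # stationary distribution, here (0.8, 0.2)
print(pi @ P)                            # equals pi again
```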
Remark 10. We will shortly see that $\pi(x)$ is the long-run fraction of time the Markov chain spends at location $x$. Points where $\pi(x)$ is higher are visited (proportionally) more often by $X$ than points where $\pi(x)$ is lower.
Theorem 11. If $\abs{\mathcal X} < \infty$, then every Markov chain on $\mathcal X$ has a stationary distribution.
Proof. This follows from the Perron–Frobenius theorem. For a direct probabilistic proof one can construct $\pi$ by picking $x_0 \in \mathcal X$ (arbitrarily), and letting $\pi(y)$ be proportional to the expected number of visits to $y$ before the chain, started at $x_0$, returns to $x_0$. See Proposition 1.14 in LP17.
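This construction can be illustrated by simulation (a sketch, not a proof): estimate the expected number of visits to each state during an excursion from $x_0$ back to $x_0$, then normalize. The two-state chain below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

def visits_before_return(P, x0, rng):
    """Count visits to each state during one excursion from x0 back to x0."""
    counts = np.zeros(len(P))
    x = x0
    while True:
        counts[x] += 1
        x = rng.choice(len(P), p=P[x])
        if x == x0:
            return counts

# Average many excursions and normalize; the result approximates pi.
est = np.mean([visits_before_return(P, 0, rng) for _ in range(5000)], axis=0)
print(est / est.sum())                   # close to (0.8, 0.2)
```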
Remark 12. The stationary distribution need not be unique. For example, if $\mathcal X = \set{0, 1}$ and $P = I$, then every distribution on $\mathcal X$ is stationary, so there are infinitely many stationary distributions.
Definition 13. A Markov chain is called irreducible if for any $x, y \in \mathcal X$ there exists $n \in \N$ such that $P^n(x, y) > 0$.
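For a finite chain, irreducibility can be checked mechanically: a shortest path between two states needs at most $\abs{\mathcal X}$ steps, so it suffices to examine $P, P^2, \dots, P^{\abs{\mathcal X}}$. A sketch:

```python
import numpy as np

def is_irreducible(P):
    """Check that for every pair (x, y) some n >= 1 satisfies P^n(x, y) > 0.
    A shortest path between two states has length at most |X|, so only the
    powers P, P^2, ..., P^|X| need to be inspected."""
    n = len(P)
    reach = np.zeros((n, n), dtype=bool)
    Pk = np.eye(n)
    for _ in range(n):
        Pk = Pk @ P
        reach |= Pk > 0
    return bool(reach.all())

print(is_irreducible(np.array([[0.9, 0.1], [0.4, 0.6]])))   # True
print(is_irreducible(np.eye(2)))                            # False (P = I, Remark 12)
```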
Theorem 14. If $P$ is irreducible, the stationary distribution is unique.
Lemma 15. If $P$ is irreducible, and $\pi$ is a stationary distribution, then $\pi(x) > 0$ for all $x \in \mathcal X$.
Proof of Theorem 14. Suppose $\pi_1$ and $\pi_2$ are two stationary distributions. By Lemma 15 both are strictly positive, so the ratio $\pi_1/\pi_2$ is well defined; choose $x_0 \in \mathcal X$ minimizing $\pi_1(x) / \pi_2(x)$. Using the stationarity of $\pi_1$ and $\pi_2$, \begin{equation} \pi_1(x_0) = \sum_{x \in \mathcal X} \frac{\pi_1(x)}{\pi_2(x)} \pi_2(x) P(x, x_0) \geq \frac{\pi_1(x_0)}{\pi_2(x_0)} \pi_2(x_0) = \pi_1(x_0)\,. \end{equation} This implies \begin{equation} \sum_{x \in \mathcal X} \frac{\pi_1(x)}{\pi_2(x)} \pi_2(x) P(x, x_0) = \sum_{x \in \mathcal X} \frac{\pi_1(x_0)}{\pi_2(x_0)} \pi_2(x) P(x, x_0) \,. \end{equation} Since $\pi_1(x) / \pi_2(x) \geq \pi_1(x_0) / \pi_2(x_0)$ for all $x$, this equality can only hold if \begin{equation} \frac{\pi_1(x)}{\pi_2(x)} = \frac{\pi_1(x_0)}{\pi_2(x_0)} \quad\text{for every } x \in \mathcal X \text{ such that } P(x, x_0) > 0\,. \end{equation} Now for any given $x \in \mathcal X$, irreducibility implies we can find $N = N(x) \in \N$ such that $P^N( x, x_0) > 0$. Since $\pi_1$ and $\pi_2$ are also stationary for $P^N$, applying the above argument to $P^N$ in place of $P$ gives \begin{equation} \frac{\pi_1(x)}{\pi_2(x)} = \frac{\pi_1(x_0)}{\pi_2(x_0)} \quad\text{for every } x \in \mathcal X\,. \end{equation} Since $\sum_{x \in \mathcal X} \pi_1(x) = \sum_{x \in \mathcal X} \pi_2(x) = 1$, this implies $\pi_1 = \pi_2$, finishing the proof.
Convergence to the stationary distribution.
Example 16. Irreducibility alone doesn't guarantee convergence to the stationary distribution. For example, the chain with transitions $P(0, 1) = P(1, 0) = 1$ and $P(0, 0) = P(1, 1) = 0$ is irreducible with stationary distribution $\pi = (1/2, 1/2)$; but if $X_0 = 0$, then $\dist(X_n)$ alternates between the point masses at $0$ and $1$, and hence does not converge to $\pi$.
Definition 17. A Markov chain is called aperiodic if for all $x \in \mathcal X$, \begin{equation} \gcd\set{n \geq 1 \st P^n(x, x) > 0} = 1\,. \end{equation}
Example 18. The simple random walk is irreducible, but not aperiodic. If instead $\xi_n$ are i.i.d. random variables such that $\P(\xi_n = 0) = 1/2$ and $\P(\xi_n = \pm 1) = 1/4$, then the lazy random walk defined by $X_{n+1} = X_n + \xi_{n+1}$ is irreducible and aperiodic.
Example 19. The Markov chain with $P=I$ is aperiodic but not irreducible (if $\abs{\mathcal X} > 1$).
Remark 20. If a Markov chain is irreducible but not aperiodic, one common trick is to introduce laziness: flip an independent fair coin; if it lands heads, don't move, and if it lands tails, move according to the original transition kernel. That is, define a new transition matrix $Q = \frac{1}{2}(I + P)$, i.e. \begin{equation} Q(x,y) = \begin{cases} \frac{1}{2}(1 + P(x, x)) & y = x\,,\\ \frac{1}{2} P(x, y) & y \neq x \,. \end{cases} \end{equation} The new chain is both irreducible and aperiodic, and has the same stationary distribution.
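A quick numerical check of this remark, using the periodic chain from Example 16:

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # the chain from Example 16: irreducible, period 2

Q = 0.5 * (np.eye(len(P)) + P)        # lazy version: fair coin decides stay vs. move

pi = np.array([0.5, 0.5])             # stationary for both chains
print(pi @ P, pi @ Q)                 # both print [0.5 0.5]

# Powers of P keep oscillating between I and P, while powers of Q are
# entrywise positive, reflecting aperiodicity of the lazy chain.
print(np.linalg.matrix_power(P, 3))
print(np.linalg.matrix_power(Q, 3))
```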
Theorem 21. If a Markov chain on a finite state space $\mathcal X$ is irreducible and aperiodic, then $\dist(X_n) \to \pi$ in total variation as $n \to \infty$, where $\pi$ is its unique stationary distribution.
Definition 22. For any two probability distributions $\mu, \nu$ we define the total variation distance between $\mu$ and $\nu$ by \begin{equation} \norm{\mu - \nu}_\TV \defeq \frac{1}{2} \sum_{x \in \mathcal X} \abs{\mu(x) - \nu(x) } \,. \end{equation}
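In code this is a one-liner; the sketch below assumes both distributions are given as probability vectors over the same ordering of $\mathcal X$.

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance between two distributions on a finite set,
    given as probability vectors over the same ordering of the states."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

print(tv_distance([0.5, 0.5], [0.9, 0.1]))   # 0.4
```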
Definition 23. We say $\dist(X_n) \to \pi$ in total variation as $n \to \infty$ if $\norm{\dist(X_n) - \pi}_\TV \to 0$ as $n \to \infty$.
Remark 24. Notice $\norm{\dist(X_n) - \pi}_\TV \to 0$ as $n \to \infty$ is equivalent to having \begin{equation} \lim_{n \to \infty} \frac{1}{2} \sum_{x \in \mathcal X} \abs{ \P( X_n = x ) - \pi(x) } = 0\,. \end{equation}
Lemma 25. If a Markov chain is aperiodic, then there exists $N \in \N$ such that $P^n(x, x) > 0$ for all $x \in \mathcal X$ and $n \geq N$.
Lemma 26. If a Markov chain on a finite state space $\mathcal X$ is irreducible and aperiodic, there exists $N \in \N$ such that $P^N(x, y) > 0$ for all $x, y \in \mathcal X$.
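For a finite chain the conclusion of Lemma 26 can be tested directly: by Wielandt's bound, if any power of $P$ is entrywise positive then the power $(\abs{\mathcal X} - 1)^2 + 1$ already is. A sketch:

```python
import numpy as np

def has_positive_power(P):
    """Return True if some power of P is entrywise positive; by Wielandt's
    bound it suffices to check the power (|X| - 1)^2 + 1."""
    n = len(P)
    return bool((np.linalg.matrix_power(P, (n - 1) ** 2 + 1) > 0).all())

P_periodic = np.array([[0.0, 1.0], [1.0, 0.0]])   # irreducible but not aperiodic
P_lazy = 0.5 * (np.eye(2) + P_periodic)           # irreducible and aperiodic

print(has_positive_power(P_periodic))   # False
print(has_positive_power(P_lazy))       # True
```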
Lemma 27. If $Q$ is the transition kernel of a Markov chain and $\mu, \nu$ are any two probability distributions, then \begin{equation} \norm{\mu Q - \nu Q}_\TV = \norm{(\mu - \nu) Q}_\TV \leq \norm{\mu - \nu}_\TV \,. \end{equation}
Lemma 28. Suppose there exist $N \in \N$, $\delta > 0$ and a probability distribution $\nu$ such that for all $x, y \in \mathcal X$ we have \begin{equation} P^N(x, y) \geq \delta \nu(y) \,. \end{equation} Then for every $k \in \N$ and $0 \leq j < N$, \begin{equation} \norm{\dist( X_{k N + j} ) - \pi}_\TV \leq (1 - \delta)^k \,. \end{equation}
Proof. First show that for any two probability distributions $\mu_1$, $\mu_2$ we have \begin{equation} \norm{ (\mu_1 - \mu_2) P^N }_\TV \leq (1 - \delta) \norm{\mu_1 - \mu_2}_\TV \,. \end{equation} Then iterate. (Will be done in class.)
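A numerical sanity check of the bound in Lemma 28 (the chain, $\nu$, and $\delta$ below are made up, with $N = 1$):

```python
import numpy as np

# A made-up chain satisfying the hypothesis with N = 1: every entry of each
# column of P is at least delta * nu(y), with nu taken uniform.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
nu = np.full(2, 0.5)
delta = (P.min(axis=0) / nu).min()   # largest valid delta for this nu; here 0.2
pi = np.array([0.8, 0.2])            # stationary distribution of P (solves pi P = pi)

mu = np.array([1.0, 0.0])            # an arbitrary starting distribution
for k in range(1, 6):
    mu = mu @ P
    tv = 0.5 * np.abs(mu - pi).sum()
    assert tv <= (1 - delta) ** k + 1e-12
    print(k, tv, (1 - delta) ** k)   # observed distance vs. the (1 - delta)^k bound
```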
Proof of Theorem 21. By Lemma 26 there exists $N \in \N$ with $P^N(x, y) > 0$ for all $x, y \in \mathcal X$. Since $\mathcal X$ is finite, the hypothesis of Lemma 28 holds with $\nu$ uniform and $\delta = \abs{\mathcal X} \min_{x, y} P^N(x, y) > 0$, and the conclusion follows.