Markov Chains

The Metropolis–Hastings algorithm works because:

  1. The algorithm outputs a Markov chain.

  2. Under certain conditions, the distribution of a Markov chain (started arbitrarily) will converge to its unique stationary distribution.

  3. The stationary distribution of the Metropolis–Hastings Markov chain is the desired target distribution.

We will prove each of these assertions here.

Basic notions

Definition 1. A Markov chain on $\mathcal X$ is a family of random variables $X_0$, $X_1$, … such that for all $n \in \N$ and $x_0, \dots, x_{n+1} \in \mathcal X$ we have \begin{equation} \P(X_{n+1} = x_{n+1} \given \set{ X_k = x_k \st 0 \leq k \leq n} ) = \P(X_{n+1} = x_{n+1} \given X_n = x_n ) \,. \end{equation}

Definition 2. A time homogeneous Markov chain is a Markov chain where \begin{equation} \P( X_{n+1} = y \given X_n = x ) = \P( X_1 = y \given X_0 = x )\,, \end{equation} for all $x, y \in \mathcal X$ and $n \in \N$.

Example 3 (Simple random walk). Let $\mathcal X = \Z$, $\xi_n$ be i.i.d. $\pm 1$ valued random variables (coin flips) and set $X_{n+1} = X_n + \xi_{n+1}$.
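For concreteness, here is a minimal Python sketch of this walk (the number of steps below is arbitrary):

    import random

    def simple_random_walk(n_steps, x0=0):
        # X_{n+1} = X_n + xi_{n+1}, where the xi are i.i.d. fair +/-1 coin flips
        path = [x0]
        for _ in range(n_steps):
            path.append(path[-1] + random.choice([-1, 1]))
        return path

    print(simple_random_walk(10))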

Example 4 (Metropolis chain). Given $X_n$, choose $X_{n+1}$ according to the Metropolis–Hastings rule:

  1. Given $X_n = x \in \mathcal X$, propose $y \in \mathcal X$ with probability $Q(x, y)$.

  2. Let $X_{n+1} = y$ with probability $A(x,y)$ (defined here), and $X_{n+1} = x$ otherwise; a code sketch of one such update follows this list.
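Here is a minimal Python sketch of one such update, assuming the standard Metropolis–Hastings acceptance probability $A(x,y) = \min\bigl(1, \frac{\pi(y) Q(y,x)}{\pi(x) Q(x,y)}\bigr)$ for a target distribution $\pi$; the target and proposal in the usage example are arbitrary placeholders:

    import random

    def mh_step(x, pi, Q, propose):
        # One Metropolis--Hastings update: propose y ~ Q(x, .), then accept or reject.
        # pi(x): (possibly unnormalized) target probability of x
        # Q(x, y): probability of proposing y from x
        # propose(x): draws a sample y with probability Q(x, y)
        y = propose(x)
        A = min(1.0, (pi(y) * Q(y, x)) / (pi(x) * Q(x, y)))  # acceptance probability
        return y if random.random() < A else x

    # Example usage (arbitrary choices): target pi(x) proportional to x + 1 on {0, ..., 9},
    # proposal = move one step left or right on the cycle Z/10Z.
    pi = lambda x: x + 1
    Q = lambda x, y: 0.5 if (y - x) % 10 in (1, 9) else 0.0
    propose = lambda x: (x + random.choice([-1, 1])) % 10

    x = 0
    for _ in range(1000):
        x = mh_step(x, pi, Q, propose)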

Definition 5. The transition matrix of a (time homogeneous) Markov chain is \begin{equation} P(x, y) = \P( X_1 = y \given X_0 = x )\,. \end{equation}

Remark 6. Notice $P(x, y) \geq 0$ and $\sum_y P(x, y) = 1$. Such matrices are called stochastic matrices.
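For instance, a quick numerical check of these two properties (the two-state matrix below is an arbitrary illustrative example):

    import numpy as np

    # An arbitrary two-state transition matrix: entries are nonnegative and rows sum to 1.
    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
    assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)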

Proposition 7. For any $n \in \N$, \begin{equation} \P( X_n = y \given X_0 = x ) = P^n(x, y)\,, \end{equation} where $P^n$ means the $n^\text{th}$ power of the matrix $P$.

Proof. Done in class.

Proposition 8. If $\dist(X_0) = \mu_0$ (i.e. $\P(X_0 = x) = \mu_0(x)$ for all $x \in \mathcal X$), then $\dist(X_n) = \mu_0 P^n$ (as a matrix product).

Proof. Done in class.
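Both propositions are easy to check numerically; a small sketch using the same illustrative two-state matrix (the choices of $\mu_0$ and $n$ are arbitrary):

    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
    mu0 = np.array([1.0, 0.0])            # start the chain at state 0

    n = 5
    Pn = np.linalg.matrix_power(P, n)     # P^n(x, y) = P(X_n = y | X_0 = x)
    print(Pn)
    print(mu0 @ Pn)                       # dist(X_n) = mu0 P^n, with mu0 a row vector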

Stationary distribution.

Definition 9. We say a distribution $\pi$ is stationary for the Markov chain $X$ if $\pi P = \pi$. That is, if $X_0 \sim \pi$, then $X_n \sim \pi$ for all $n \in \N$.

Remark 10. We will shortly see that $\pi(x)$ is the long-run fraction of time the Markov chain spends at the point $x$. Points where $\pi(x)$ is larger are visited (proportionally) more often by $X$ than points where $\pi(x)$ is smaller.
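For a concrete illustration, one can compute $\pi$ for the illustrative two-state matrix above as a left eigenvector of $P$ with eigenvalue $1$ (normalized to sum to $1$), and compare it with the observed fraction of time a long simulated trajectory spends in each state:

    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

    # Stationary distribution: left eigenvector of P with eigenvalue 1, normalized to sum to 1.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()

    # Fraction of time a long simulated trajectory spends in each state.
    rng = np.random.default_rng(0)
    x, visits, n = 0, np.zeros(2), 100_000
    for _ in range(n):
        x = rng.choice(2, p=P[x])
        visits[x] += 1
    print(pi, visits / n)                 # the two vectors should nearly agree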

Theorem 11. If $\abs{\mathcal X} < \infty$, then every Markov chain on $\mathcal X$ has at least one stationary distribution.

Proof. This follows from the Perron–Frobenius theorem. For a direct probabilistic proof, one can (in the irreducible case) construct $\pi$ by picking $x_0 \in \mathcal X$ arbitrarily and letting $\pi(y)$ be proportional to the expected number of visits to $y$ before returning to $x_0$. See Proposition 1.14 in LP17.

Remark 12. The stationary distribution need not be unique. For example, if $\mathcal X = \set{0, 1}$ and $P = I$, then every probability distribution on $\mathcal X$ is stationary, so there are infinitely many stationary distributions.

Definition 13. A Markov chain is called irreducible if for any $x, y \in \mathcal X$ there exists $n \in \N$ such that $P^n(x, y) > 0$.

Theorem 14. If $P$ is irreducible, the stationary distribution is unique.

Lemma 15. If $P$ is irreducible, and $\pi$ is a stationary distribution, then $\pi(x) > 0$ for all $x \in \mathcal X$.

Proof of Theorem 14. If $\pi_1$ and $\pi_2$ are two stationary distributions, choose $x_0$ that minimizes $\pi_1(x) / \pi_2(x)$ (the ratio is well defined since $\pi_2 > 0$ by Lemma 15). Using stationarity of $\pi_1$ and then of $\pi_2$, \begin{equation} \pi_1(x_0) = \sum_{x \in \mathcal X} \frac{\pi_1(x)}{\pi_2(x)} \pi_2(x) P(x, x_0) \geq \frac{\pi_1(x_0)}{\pi_2(x_0)} \sum_{x \in \mathcal X} \pi_2(x) P(x, x_0) = \frac{\pi_1(x_0)}{\pi_2(x_0)} \pi_2(x_0) = \pi_1(x_0)\,. \end{equation} This implies \begin{equation} \sum_{x \in \mathcal X} \frac{\pi_1(x)}{\pi_2(x)} \pi_2(x) P(x, x_0) = \sum_{x \in \mathcal X} \frac{\pi_1(x_0)}{\pi_2(x_0)} \pi_2(x) P(x, x_0) \,. \end{equation} Since $\pi_1(x) / \pi_2(x) \geq \pi_1(x_0) / \pi_2(x_0)$ for all $x$, the above equality can only hold if \begin{equation} \frac{\pi_1(x)}{\pi_2(x)} = \frac{\pi_1(x_0)}{\pi_2(x_0)} \quad\text{for every } x \in \mathcal X \text{ such that } P(x, x_0) > 0\,. \end{equation} Now for any given $x \in \mathcal X$, irreducibility implies we can find $N = N(x) \in \N$ such that $P^N( x, x_0) > 0$. Since $\pi_1$ and $\pi_2$ are also stationary for $P^N$, the above argument implies \begin{equation} \frac{\pi_1(x)}{\pi_2(x)} = \frac{\pi_1(x_0)}{\pi_2(x_0)} \quad\text{for every } x \in \mathcal X\,. \end{equation} Since $\sum_{x \in \mathcal X} \pi_1(x) = \sum_{x \in \mathcal X} \pi_2(x) = 1$, this implies $\pi_1 = \pi_2$, finishing the proof.

Convergence to the stationary distribution.

Example 16. Irreducibility alone doesn't guarantee convergence to the stationary distribution. For example, the chain on $\mathcal X = \set{0, 1}$ with transitions $P(0, 1) = P(1, 0) = 1$ and $P(1, 1) = P(0, 0) = 0$ is irreducible with stationary distribution $(\frac{1}{2}, \frac{1}{2})$, but if $X_0 = 0$ then $\dist(X_n)$ alternates between the point masses at $0$ and $1$ and never converges.
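Numerically, with $X_0 = 0$:

    import numpy as np

    P = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    mu = np.array([1.0, 0.0])             # X_0 = 0 deterministically
    for n in range(6):
        print(n, mu)                      # alternates [1, 0], [0, 1], ... forever
        mu = mu @ P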

Definition 17. A Markov chain is called aperiodic if for all $x \in \mathcal X$, \begin{equation} \gcd\set{n \geq 1 \st P^n(x, x) > 0} = 1\,. \end{equation}

Example 18. The simple random walk is irreducible, but not aperiodic. If instead $\xi_n$ are i.i.d. random variables such that $\P(\xi_n = 0) = 1/2$ and $\P(\xi_n = \pm 1) = 1/4$, then the lazy random walk defined by $X_{n+1} = X_n + \xi_{n+1}$ is irreducible and aperiodic.

Example 19. The Markov chain with $P=I$ is aperiodic but not irreducible (if $\abs{\mathcal X} > 1$).

Remark 20. If a Markov chain is irreducible but not aperiodic, one common trick is to introduce laziness: Flip an independent fair coin. If it lands heads, don't move. If it lands tails, move according to the original transition kernel. That is, define a new transition matrix $Q = \frac{1}{2}(I + P)$, i.e. \begin{equation} Q(x,y) = \begin{cases} \frac{1}{2}(1 + P(x, x)) & y = x\,,\\ \frac{1}{2} P(x, y) & y \neq x \,. \end{cases} \end{equation} The new chain is both irreducible and aperiodic, and has the same stationary distribution, since $\pi Q = \frac{1}{2}(\pi + \pi P) = \pi$ whenever $\pi P = \pi$.
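A quick numerical check of this, reusing the periodic matrix from Example 16:

    import numpy as np

    P = np.array([[0.0, 1.0],
                  [1.0, 0.0]])             # periodic chain from Example 16
    Q = 0.5 * (np.eye(2) + P)              # lazy version: with probability 1/2 stay put

    pi = np.array([0.5, 0.5])              # stationary distribution of P
    print(pi @ P, pi @ Q)                  # both equal pi, so Q has the same stationary distribution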

Theorem 21. If a Markov chain on a finite state space is irreducible and aperiodic, then $\dist(X_n) \to \pi$ in total variation as $n \to \infty$, where $\pi$ is the unique stationary distribution.

Definition 22. For any two probability distributions $\mu, \nu$ we define the total variation distance between $\mu$ and $\nu$ by \begin{equation} \norm{\mu - \nu}_\TV \defeq \frac{1}{2} \sum_{x \in \mathcal X} \abs{\mu(x) - \nu(x) } \,. \end{equation}
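For distributions stored as arrays over a finite $\mathcal X$, this is a one-line computation:

    import numpy as np

    def tv_distance(mu, nu):
        # ||mu - nu||_TV = (1/2) * sum_x |mu(x) - nu(x)|
        return 0.5 * np.abs(np.asarray(mu, float) - np.asarray(nu, float)).sum()

    print(tv_distance([0.5, 0.5], [1.0, 0.0]))   # prints 0.5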

Definition 23. We say $\dist(X_n) \to \pi$ in total variation as $n \to \infty$ if $\norm{\dist(X_n) - \pi}_\TV \to 0$ as $n \to \infty$.

Remark 24. Notice that $\norm{\dist(X_n) - \pi}_\TV \to 0$ as $n \to \infty$ is equivalent to having \begin{equation} \lim_{n \to \infty} \frac{1}{2} \sum_{x \in \mathcal X} \abs{ \P( X_n = x ) - \pi(x) } = 0\,. \end{equation}

Lemma 25. If a Markov chain is aperiodic, then there exists $N \in \N$ such that $P^n(x, x) > 0$ for all $x \in \mathcal X$ and $n \geq N$.

Lemma 26. If a Markov chain is irreducible and aperiodic, there exists $N \in \N$ such that $P^N(x, y) > 0$ for all $x, y \in \mathcal X$.

Lemma 27. If $Q$ is the transition kernel of a Markov chain and $\mu, \nu$ are any two probability distributions, then \begin{equation} \norm{\mu Q - \nu Q}_\TV = \norm{(\mu - \nu) Q}_\TV \leq \norm{\mu - \nu}_\TV \,. \end{equation}

Lemma 28. Suppose there exists $N \in \N$, $\delta > 0$ and a probability distribution $\nu$ such that for all $x, y \in \mathcal X$ we have \begin{equation} P^N(x, y) \geq \delta \nu(y) \,. \end{equation} Then for every $k, j \in \N$, \begin{equation} \norm{\dist( X_{k N + j} ) - \pi}_\TV \leq (1 - \delta)^k \,. \end{equation}

Proof. First show that for any two probability distributions $\mu_1$, $\mu_2$ we have \begin{equation} \norm{ (\mu_1 - \mu_2) P^N }_\TV \leq (1 - \delta) \norm{\mu_1 - \mu_2}_\TV \,. \end{equation} Then iterate. (Will be done in class.)

Proof of Theorem 21. By Lemma 26 there exists $N \in \N$ such that $P^N(x, y) > 0$ for all $x, y \in \mathcal X$. Setting $\delta = \sum_{y \in \mathcal X} \min_{x \in \mathcal X} P^N(x, y) > 0$ and $\nu(y) = \frac{1}{\delta} \min_{x \in \mathcal X} P^N(x, y)$ gives $P^N(x, y) \geq \delta \nu(y)$ for all $x, y$, so Lemma 28 applies, and letting $n = kN + j \to \infty$ shows $\norm{\dist(X_n) - \pi}_\TV \to 0$.
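As a numerical illustration of Theorem 21, using the same illustrative two-state matrix as before (its eigenvalues are $1$ and $0.4$, so the total variation distance below decays like a multiple of $0.4^n$):

    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
    pi = np.array([5/6, 1/6])             # stationary distribution of this P
    mu = np.array([0.0, 1.0])             # arbitrary initial distribution

    for n in range(1, 8):
        mu = mu @ P                       # dist(X_n) = dist(X_{n-1}) P
        print(n, 0.5 * np.abs(mu - pi).sum())   # total variation distance to pi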