Ergodicity and Mixing

PageRank

Consider a large collection of webpages, with links between pages. Our task is to determine the relative importance of the pages. One simple method is to simply count links: the more links a page has pointing to it, the more important it is. But this is not quite right. If pages $A$ and $B$ both have two links pointing in, then according to this method $A$ and $B$ are ranked equally. However, if the links to $A$ are “more important” than the links to $B$, then $A$ should be ranked higher. (The problem gets more complicated if there are link loops.)

Instead, we could run a Markov chain that randomly follows links, and measure the relative importance using the stationary distribution.

Example 1. Suppose $G = (P, E)$ is a directed graph, and consider the following Markov process: From any node $x \in P$, jump to a direct successor chosen uniformly at random. If there are no direct successors, then jump to any point in $P$, chosen uniformly at random.
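To make Example 1 concrete, here is a short sketch that builds the transition matrix of this chain with NumPy. The function name `surfer_matrix` and the edge-list encoding of $G$ are my own, not part of the original construction.

```python
import numpy as np

def surfer_matrix(n, edges):
    """Transition matrix for the chain of Example 1.

    n     -- number of nodes, labelled 0..n-1 (a hypothetical encoding)
    edges -- list of directed edges (i, j)
    """
    P = np.zeros((n, n))
    succ = [[] for _ in range(n)]
    for i, j in edges:
        succ[i].append(j)
    for i in range(n):
        if succ[i]:                    # jump to a uniform direct successor
            for j in succ[i]:
                P[i, j] += 1 / len(succ[i])
        else:                          # no successors: jump uniformly in P
            P[i, :] = 1 / n
    return P

# tiny graph: 0 -> 1, 0 -> 2, 1 -> 2; node 2 has no out-links
P = surfer_matrix(3, [(0, 1), (0, 2), (1, 2)])
```

Each row of the resulting matrix sums to one, and the row of a node without successors is uniform.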

The Markov chain obtained in this manner may not be irreducible or aperiodic. To combat this we introduce damping.

Example 2. Suppose $G = (P, E)$ is a directed graph, let $\alpha \in (0, 1)$, and consider the following Markov process: Flip an (independent) coin that lands heads with probability $\alpha$. If the coin lands heads and the current point has direct successors, jump to a direct successor chosen uniformly at random. Otherwise, jump to any point in $P$, chosen uniformly at random.

This Markov chain was used by Brin and Page to rank webpages, and their original algorithm chose $\alpha = 0.85$. They found the stationary distribution by iterating (equivalently, finding the principal eigenvector of the transition matrix). In this case, Proposition 5 below (applied with $N = 1$ and $\delta = 1 - \alpha$) tells us that after $n$ iterations we are at most $\alpha^n$ away (in the total variation norm) from the stationary distribution. For $\alpha = 0.85$, doing 50 iterations gets you within $0.85^{50} \approx 0.0003$ (in the total variation norm) of the stationary distribution.
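A minimal sketch of this power iteration (not Brin and Page's actual implementation; the function name `pagerank` and the toy matrix are my own):

```python
import numpy as np

def pagerank(P, alpha=0.85, n_iter=50):
    """Power iteration for the damped chain of Example 2.

    The damped transition matrix is Q = alpha * P + (1 - alpha) * U,
    where P is the matrix of Example 1 and U is the uniform matrix.
    """
    n = P.shape[0]
    Q = alpha * P + (1 - alpha) / n      # broadcasting fills in U
    mu = np.full(n, 1 / n)               # start from the uniform distribution
    for _ in range(n_iter):
        mu = mu @ Q                      # one step of the chain
    return mu

# toy example with three pages; page 2 has no out-links
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1/3, 1/3, 1/3]])
pi = pagerank(P)
```

After 50 iterations the output is, by the discussion above, within roughly $3 \times 10^{-4}$ of the true stationary distribution in total variation.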

In general, estimating the rate of convergence to the stationary distribution isn’t as explicit, or as easy. For instance, if the chain is generated by the Metropolis–Hastings algorithm, then it’s hard to determine the convergence rate. A useful tool to quantify the convergence rate is the mixing time.

Mixing Times

We will now assume we are working with an irreducible and aperiodic Markov chain $X_n$ on a finite state space $\mathcal X$, and let $\pi$ denote its unique stationary distribution.

Definition 3. Let $\epsilon \in (0, 1/2)$. The mixing time, denoted by $\tmix(\epsilon)$, is defined by \begin{equation} \tmix(\epsilon) = \min \set{ n \in \N \st \forall x \in \mathcal X, ~\norm{P^n(x, \cdot) - \pi(\cdot)}_\TV < \epsilon } \,. \end{equation}
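For small state spaces, Definition 3 can be evaluated by brute force: raise the transition matrix to successive powers and track the worst-row total variation distance (which is half the $\ell^1$ distance). The function name `mixing_time` is my own.

```python
import numpy as np

def mixing_time(P, pi, eps):
    """Smallest n with max_x ||P^n(x,.) - pi||_TV < eps (Definition 3).

    Assumes the chain is irreducible and aperiodic, so the loop terminates.
    """
    n, Pn = 0, np.eye(P.shape[0])
    while True:
        n += 1
        Pn = Pn @ P
        tv = 0.5 * np.abs(Pn - pi).sum(axis=1).max()  # worst-case TV distance
        if tv < eps:
            return n

# two-state example with stationary distribution (2/3, 1/3)
P = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2/3, 1/3])
t = mixing_time(P, pi, 0.25)
```

For this two-state chain the distance from row $x$ decays exactly like a constant times $0.7^n$, so the loop stops quickly.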

Theorem 4.

Fix $\epsilon \in (0, 1/2)$ and let $N = \tmix(\epsilon)$ be the mixing time of a Markov chain $X$. Then for any $k, j \in \N$ and any initial distribution $\mu$ we have \begin{equation} \norm{\dist(X_{kN + j}) - \pi}_\TV \leq (2\epsilon)^k \,. \end{equation}

Proof: Let $N = \tmix(\epsilon)$. Show that \begin{equation} \norm{\mu P^N - \pi}_\TV = \norm{(\mu - \pi) P^N}_\TV \leq (2\epsilon) \norm{\mu - \pi}_\TV \,, \end{equation} and iterate. (Will be done in class.)

Proposition 5. Suppose there exist $N \in \N$, $\delta > 0$, and a probability measure $\nu$ such that for every $x, y \in \mathcal X$ we have \begin{equation} P^N(x, y) \geq \delta \nu(y) \,. \end{equation} Then \begin{equation} \tmix(\epsilon) \leq N \ceil[\Big]{ \frac{\ln \epsilon}{\ln( 1 - \delta )} } \,. \end{equation}
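The hypothesis of Proposition 5 (a Doeblin-type minorization) can be checked numerically on a toy chain; the matrix and all numbers below are made up for illustration.

```python
import numpy as np

# A toy chain whose one-step transitions are strictly positive, so the
# hypothesis of Proposition 5 holds with N = 1 and nu uniform.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
nu = np.full(3, 1/3)

# Largest delta with P(x, y) >= delta * nu(y) for all x, y:
delta = (P.min(axis=0) / nu).min()        # here delta = 0.3

# Proposition 5 then bounds the mixing time:
eps = 0.01
n_bound = int(np.ceil(np.log(eps) / np.log(1 - delta)))
```

One can confirm the bound directly by comparing $P^{n}$ for $n$ equal to the bound against the stationary distribution.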

Corollary 6. For the PageRank example (Example 2), we have \begin{equation} \tmix(\epsilon) \leq \ceil{\log_\alpha \epsilon} \,. \end{equation} This follows from Proposition 5 with $N = 1$, $\delta = 1 - \alpha$, and $\nu$ uniform, since at every step the chain jumps to a uniformly random point with probability at least $1 - \alpha$.

Proposition 7. If $0 < \delta < \epsilon$, then \begin{equation} \tmix( \delta) \leq \tmix(\epsilon) \ceil{\log_{2\epsilon} \delta} \,. \end{equation}

In general, estimating mixing times is hard and is the subject of many advanced books (e.g. LP17).

Remark 8. One can show (see LP17, Chapter 15) that the mixing time of the Glauber dynamics for the Ising model on a graph with $N$ vertices and maximal degree $\Delta$ satisfies \begin{equation} \tmix(\epsilon) \leq \ceil[\Big]{ \frac{N \paren{\ln N + \abs{\ln \epsilon}}}{1 - \Delta \tanh(\beta)} } \,, \end{equation} provided $\Delta \tanh(\beta) < 1$.
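For concreteness, one step of the Glauber dynamics can be sketched as follows; the update rule is the standard heat-bath rule, while the function name and graph encoding are my own.

```python
import numpy as np

def glauber_step(sigma, neighbors, beta, rng):
    """One heat-bath (Glauber) update for the Ising model.

    sigma     -- numpy array of spins in {-1, +1}
    neighbors -- list of neighbor lists describing the graph
    beta      -- inverse temperature
    """
    v = rng.integers(len(sigma))                 # pick a uniform vertex
    s = sum(sigma[w] for w in neighbors[v])      # local field at v
    p_plus = np.exp(beta * s) / (2 * np.cosh(beta * s))
    sigma[v] = 1 if rng.random() < p_plus else -1
    return sigma

# 4-cycle example: Delta = 2, so the bound above needs 2 tanh(beta) < 1
neighbors = [[1, 3], [0, 2], [1, 3], [0, 2]]
rng = np.random.default_rng(0)
sigma = np.ones(4, dtype=int)
for _ in range(1000):
    glauber_step(sigma, neighbors, 0.4, rng)
```

Here $\beta = 0.4$ satisfies $2 \tanh(0.4) \approx 0.76 < 1$, so the mixing-time bound of the remark applies.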

Remark 9. Practically, if we initialize an MCMC algorithm arbitrarily, then for small $n$ the distribution of $X_n$ may be far from the stationary distribution. For $n \geq \tmix(\epsilon)$, however, we know $\norm{\dist(X_n) - \pi}_\TV < \epsilon$. Thus when running MCMC algorithms one discards a certain number of initial iterations – this is often called the burn-in time.
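A small illustration of burn-in on a two-state chain; the chain, the initial state, and the burn-in length are made up for illustration.

```python
import numpy as np

# Two-state chain with stationary distribution pi = (2/3, 1/3).
P = np.array([[0.9, 0.1], [0.2, 0.8]])

rng = np.random.default_rng(0)
x, path = 1, []                  # deliberately start in the less likely state
for _ in range(20_000):
    path.append(x)
    x = rng.choice(2, p=P[x])

n_burn = 50                      # discard roughly a few multiples of tmix
freq = np.mean(np.array(path[n_burn:]) == 0)   # should be close to 2/3
```

Discarding the first `n_burn` iterations removes the samples whose distribution is still far from $\pi$.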

Ergodicity

Finally, we mention that the law of large numbers doesn’t directly apply to Markov chains, as $X_1, X_2, \dots$ are typically neither independent nor identically distributed. However, the conclusion still holds, and in this context it is often called the ergodic theorem instead of the law of large numbers.

Theorem 10 (Ergodic theorem). Let $X_n$ be an irreducible Markov chain with stationary distribution $\pi$. Let $f \colon \mathcal X \to \R$ be any function. Then \begin{equation} \P \paren[\Big]{ \lim_{N \to \infty} \frac{1}{N} \sum_{n = 1}^N f(X_n) = \sum_{x \in \mathcal X} f(x) \pi(x) } = 1\,. \end{equation}
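A quick numerical illustration of Theorem 10; the chain and the function $f$ are made up for illustration.

```python
import numpy as np

# Two-state chain with stationary distribution pi = (2/3, 1/3).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2/3, 1/3])
f = np.array([1.0, 4.0])         # an arbitrary function on the state space

rng = np.random.default_rng(1)
x, total, N = 0, 0.0, 50_000
for _ in range(N):
    total += f[x]
    x = rng.choice(2, p=P[x])

time_avg = total / N             # (1/N) sum_n f(X_n)
space_avg = float(f @ pi)        # sum_x f(x) pi(x) = 2.0
```

By the ergodic theorem the time average converges almost surely to the space average as $N \to \infty$.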

Proof. There’s a one page proof in the appendix of LP17, which we might cover in class if time permits.

Remark 11. Define \begin{equation} \mathcal I_\pi(f) = \sum_{x \in \mathcal X} f(x) \pi(x) \,, \end{equation} and \begin{equation} \hat{\mathcal I}_N(f) = \frac{1}{N} \sum_{n = 0}^{N-1} f(X_n) \,. \end{equation} Typically $\mathcal I_\pi(f)$ is a quantity we want to compute, and $\hat{\mathcal I}_N(f)$ is our estimate for it. This estimate is biased in the sense that, in general, \begin{equation} \E \hat{ \mathcal I}_N(f) \neq \mathcal I_\pi(f) \,. \end{equation} However, the ergodic theorem shows that the estimator is consistent, and hence asymptotically unbiased as $N \to \infty$.

Corollary 12.

Let $X$ be an irreducible Markov chain with stationary distribution $\pi$. For every $x \in \mathcal X$ we have \begin{equation} \pi(x) = \lim_{N \to \infty} \frac{1}{N} \abs{ \set{ n \leq N-1 \st X_n = x } } \end{equation} almost surely.

Theorem 13 (Central limit theorem).

Suppose $X$ is an irreducible and ergodic Markov chain on a finite state space $\mathcal X$, and let $f\colon \mathcal X \to \R$. Then \begin{equation} \sqrt{N} \paren[\Big]{ \hat{\mathcal I}_N(f) - \mathcal I_\pi(f) } \xrightarrow{d} \mathcal N(0, \tau^2) \,, \end{equation} where \begin{equation} \tau^2 = \lim_{N \to \infty} N \var \paren{\hat{\mathcal I}_N(f)} \,. \end{equation}
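In practice $\tau^2$ is commonly estimated from a single run of the chain, for instance by the batch-means method. A sketch on a toy two-state chain follows; the chain, the function $f$, and the batch sizes are arbitrary illustrative choices.

```python
import numpy as np

# Two-state chain; f is the indicator of state 1, so I_pi(f) = 1/3.
P = np.array([[0.9, 0.1], [0.2, 0.8]])

rng = np.random.default_rng(2)
x, path = 0, []
for _ in range(50_000):
    path.append(float(x == 1))
    x = rng.choice(2, p=P[x])
path = np.array(path)

# Batch means: Var(batch mean) is roughly tau^2 / batch_len, so
# batch_len * var(batch means) estimates tau^2.
n_batch, batch_len = 50, 1000
means = path.reshape(n_batch, batch_len).mean(axis=1)
tau2_hat = batch_len * means.var(ddof=1)
```

For this chain $\tau^2$ can be computed exactly (it is about $1.26$, much larger than the i.i.d. variance $2/9$ because of the chain's positive autocorrelation), and the batch-means estimate should land in that vicinity.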

Remark 14. When the state space is not finite, one additionally needs to assume that $\E \abs{f(Y)}^{2 + \epsilon} < \infty$ for $Y \sim \pi$ and some $\epsilon > 0$, and that the chain is geometrically ergodic.