Simulated Annealing and Tempering
Algorithm and Intuition
Suppose we want to sample from $\pi(x) \propto e^{-\beta f(x)}$. We will shortly see that MCMC algorithms tend to get stuck near local minima of $f$. Tempering attempts to avoid this.
Requires: $f$, $\beta$.
Returns: (Better quality) samples from the distribution with probability density proportional to $e^{-\beta f}$.
- Fix a sequence $\beta_n$ that increases to $\beta$ (called the annealing schedule)
- Start with some $X_0 \in \mathcal X$ arbitrary.
- Generate $X_{n+1}$ by using an MCMC algorithm to sample from the (un-normalized) probability distribution $\pi_u = e^{-\beta_{n+1} f}$ starting from $X_n$
- Repeat for a large number of iterations.
Advantages: Better at escaping from neighborhoods of local minima of $f$.
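To make this concrete, here is a minimal Python sketch of tempering with a nearest-neighbor Metropolis kernel. The function $f$, the proposal, and the schedule below are placeholder choices, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(x, beta, f, h=0.01):
    """One Metropolis step targeting pi_u = exp(-beta * f):
    propose a neighbor x +/- h, accept with prob min(1, exp(-beta*(f(y)-f(x))))."""
    y = x + h * rng.choice([-1.0, 1.0])
    if np.log(rng.random()) < -beta * (f(y) - f(x)):
        return y
    return x

def tempered_sample(f, beta, n_iter=10_000, x0=0.0):
    """Sample from exp(-beta * f) by running MCMC while increasing beta_n up to beta."""
    betas = np.geomspace(1e-3, beta, n_iter)   # one possible annealing schedule
    x = x0
    for beta_n in betas:
        x = metropolis_step(x, beta_n, f)
    return x

# Toy example: a double-well potential whose deeper well is near x = -0.5.
f = lambda x: (x**2 - 0.25)**2 + 0.1 * x
samples = [tempered_sample(f, beta=10.0) for _ in range(100)]
```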
Remark 1. Often $\beta = 1/T$, where $T$ is called the temperature and $\beta$ is called the inverse temperature.
Before discussing the need for and advantages of tempering, we address a second question. Suppose we want to minimize a function $f$ in a large (high dimensional) space. An elementary stochastic valley descent algorithm for this is:
- Start at a point $x \in \mathcal X$.
- Choose a close by point $y$ randomly.
- If $f(y) < f(x)$, move to $y$. Otherwise stay at $x$.
- Repeat until your computational budget is exhausted, or until $f$ isn't decreasing much anymore.
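As a minimal sketch (the function and the Gaussian proposal below are placeholder choices, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def valley_descent(f, x0, n_iter=10_000, step=0.05):
    """Propose a nearby random point; move only if f decreases."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        y = x + step * rng.standard_normal(x.shape)  # a close-by random point
        if f(y) < f(x):
            x = y
    return x

# Toy example: the descent gets stuck in whichever local minimum is nearest to the start.
f = lambda x: float((x[0]**2 - 0.25)**2 + 0.1 * x[0])
print(valley_descent(f, x0=[0.4]))
```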
Remark 2. If you want to maximize $F$ instead, then you move only if $F(y) > F(x)$. In this case the algorithm is called a stochastic hill climb instead.
Clearly valley descent will get stuck at local minima forever. In practice, it also often takes too long to go down into narrow canyons. When you're working with functions whose derivative you can compute easily, stochastic gradient descent may provide better results.
An alternate approach is simulated annealing – this may make you climb at certain points, but is better at avoiding getting stuck in local minima. Conventionally, simulated annealing is always stated to minimize a function; replace the function by its negative if you want to maximize it instead.
Requires: $f$. Returns: minima of $f$.
- Fix a sequence $\beta_n \to \infty$.
- Start with some $X_0 \in \mathcal X$.
- Generate $X_{n+1}$ from $X_n$ by using an MCMC algorithm to sample from the (un-normalized) probability distribution $\pi_u = e^{-\beta_{n+1} f}$.
- Repeat for a large number of iterations.
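A minimal sketch of the loop (with a placeholder proposal and schedule; in principle the schedule should have $\beta_n \to \infty$):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_annealing(f, x0, betas, h=0.01):
    """At stage n take one Metropolis step targeting exp(-beta_n * f),
    keeping track of the best point seen so far."""
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for beta in betas:
        y = x + h * rng.choice([-1.0, 1.0])
        fy = f(y)
        if np.log(rng.random()) < -beta * (fy - fx):
            x, fx = y, fy
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x

# Toy double-well example; the global minimum is near x = -0.5.
f = lambda x: (x**2 - 0.25)**2 + 0.1 * x
betas = np.geomspace(1e-3, 1e3, 50_000)   # beta_n increasing (to "infinity", truncated)
print(simulated_annealing(f, x0=0.5, betas=betas))
```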
The final $X$ should (theoretically) be uniformly distributed on the global minima of $f$; but in practice it is typically distributed very close to local minima of $f$.
Remark 3. Many authors prefer stating the above algorithm with $T_n = 1/\beta_n$ to be the temperature.
Remark 4. The choice of temperatures, called the annealing schedule (or cooling schedule), is situation dependent. In practice one chooses $T_0$ to be something large (so that $\beta_0 = 1/T_0$ is small), and $T_N$ to be very small, and allows $T_n$ to decrease geometrically from $T_0$ to $T_N$.
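For example (illustrative values, not a prescription), a geometric cooling schedule can be generated as:

```python
import numpy as np

def geometric_schedule(T0, TN, N):
    """N+1 temperatures decreasing geometrically from T0 down to TN."""
    return T0 * (TN / T0) ** (np.arange(N + 1) / N)

temperatures = geometric_schedule(T0=100.0, TN=0.01, N=10_000)
betas = 1.0 / temperatures
```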
Remark 5. While theoretically $X$ should be uniformly distributed on the global minima of $f$, in practice we see something different. First, we can't ever send $T_n \to 0$, so $X$ will take values in some small neighborhood of the minima of $f$. Second, many MCMC algorithms tend to get stuck at local minima of $f$. As a result the return value $X$ may take values in small neighborhoods of local minima of $f$, and not just global minima.
The basis of simulated annealing is the following elementary observation.
Lemma 6. Let $f \colon \mathcal X \to \R$ be some function, $\beta > 0$, and define an un-normalized probability distribution $\pi_u = e^{-\beta f}$. Let $Z = \sum_{\mathcal X} \pi_u$ be the normalization constant, and $\pi = \pi_u / Z$ be the normalized probability distribution. When $\beta = 0$, $\pi$ is uniformly distributed on all of $\mathcal X$. As $\beta \to \infty$, $\pi$ converges to a probability distribution that is uniformly distributed on the global minima of the function $f$.
Proof. Done on the board.
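For reference, the key computation is short: write $m = \min_{\mathcal X} f$ and factor $e^{-\beta m}$ out of both $\pi_u$ and $Z$, so that \begin{equation} \pi(x) = \frac{e^{-\beta f(x)}}{\sum_{y \in \mathcal X} e^{-\beta f(y)}} = \frac{e^{-\beta (f(x) - m)}}{\sum_{y \in \mathcal X} e^{-\beta (f(y) - m)}} \,. \end{equation} When $\beta = 0$ every term equals $1$, so $\pi$ is uniform on $\mathcal X$. As $\beta \to \infty$, each term $e^{-\beta(f(y) - m)}$ tends to $1$ if $f(y) = m$ and to $0$ otherwise, so $\pi$ converges to the uniform distribution on the set of global minima of $f$.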
Example: Escaping from local minima
To see why MCMC algorithms get stuck near local minima, let’s choose a one dimensional function $f$ that has two separated local minima.
Choose $\Lambda$ to be a discrete grid of points on $[-1, 1]$. At every point on this grid we propose one of its neighbors (uniformly at random), and then accept/reject it according to the Metropolis–Hastings rule to sample from the (un-normalized) probability distribution $e^{-\beta f}$. We start the chain at the local minimum $X_0 = 0.5$. After 10,000 iterations with $\beta = 10$ the left figure below shows that the chain hasn't left the neighborhood of the local minimum, and returns samples that are not yet close to our desired target distribution.
On the other hand, if we use tempering and gradually increase $\beta$ from $10^{-3}$ to $10$, then after the same number of iterations, more realizations of $X$ find the region around $-0.5$ (the global minimum of $f$), and fewer are stuck near $x = 0.5$. The final output is now much closer to our desired target distribution, but still not close enough. In this situation, adding more iterations and choosing a slower cooling schedule gets you very close in practical amounts of time.
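A minimal sketch of this experiment follows; the double-well $f$ below is an illustrative stand-in for the one used in the figures, and the grid is $[-1, 1]$ with spacing $h = 0.01$. It counts how many chains end up in the left (deeper) well, with and without tempering.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 0.01                                          # grid spacing on [-1, 1]
f = lambda x: 16 * (x**2 - 0.25)**2 + 0.3 * x     # local min near 0.5, global min near -0.5

def run_chain(betas, x0=0.5):
    """Nearest-neighbor Metropolis chain targeting exp(-beta_n * f) at step n."""
    x = x0
    for beta in betas:
        y = float(np.clip(x + h * rng.choice([-1.0, 1.0]), -1.0, 1.0))
        if np.log(rng.random()) < -beta * (f(y) - f(x)):
            x = y
    return x

n_iter, n_chains = 10_000, 100
fixed = [run_chain(np.full(n_iter, 10.0)) for _ in range(n_chains)]
tempered = [run_chain(np.geomspace(1e-3, 10.0, n_iter)) for _ in range(n_chains)]
print("fraction ending in the left well (fixed beta = 10):", np.mean(np.array(fixed) < 0))
print("fraction ending in the left well (tempered):       ", np.mean(np.array(tempered) < 0))
```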
To understand why this happens, let $x_0$ be a local minimum and $h$ be the grid spacing of our lattice $\Lambda$. Suppose now we fix $\beta$ to be large, and just run the Metropolis–Hastings chain that proposes nearest neighbors and accepts/rejects accordingly to sample from $\pi_u \propto e^{-\beta f}$. If the chain is at $x_0$, then \begin{equation} P(x_0, x_0 \pm h) = \frac{e^{-\beta( f(x_0 \pm h) - f(x_0) )}}{2} \approx \frac{e^{-\beta h^2 f^{\prime\prime}(x_0) / 2}}{2} \,. \end{equation} When $\beta$ is large, the chain will be slow to leave $x_0$.
Moreover, once it leaves $x_0$, it gets pushed back towards $x_0$. Indeed, for $x$ in a small neighborhood to the left of $x_0$ we have \begin{equation} P(x, x + h) = \frac{1}{2} \,, \quad\text{and}\quad P(x, x - h) = \frac{e^{-\beta(f(x-h)-f(x))}}{2} \approx \frac{e^{\beta h f'(x)}}{2} \,. \end{equation} Thus if the chain is at such an $x$, it is extremely unlikely for it to move left towards $0$, and much more likely to move back towards $x_0$. This makes it very unlikely for the chain to leave the neighborhood of $x_0$, and find the global minimum located at $-x_0$. In this case one can show that the mixing time satisfies \begin{align} \tmix(\epsilon) &\leq \paren[\Big]{ \frac{2\pi}{\beta f^{\prime\prime}(-x_0)} }^{1/2} \paren[\Big]{\frac{e^{\beta(f(0) - f(-x_0))}}{h}} \paren[\Big]{ \frac{1}{2}\abs[\Big]{\ln \paren[\Big]{ \frac{2 \pi \epsilon^2}{\beta f^{\prime\prime}(-x_0) h^2} }} + \beta( f(0) - f(-x_0) ) } \\ &\leq \frac{e^{C\beta(f(0) - f(-x_0))}}{h} \abs{\ln (h \epsilon)} \,. \end{align}
The above shows that when $\beta$ is large (or the temperature is small), the Metropolis–Hastings chain may get stuck at local minima for exponentially long periods of time. Simulated annealing and tempering attempt to avoid this: by choosing $T$ large (or $\beta$ small) initially, you explore the space rapidly and don't get stuck at local minima for too long. As $T$ gets smaller (or $\beta$ gets larger), you start concentrating around local minima, and in good situations you haven't been trapped in one large basin of attraction and missed an entire region of the phase space. This can be seen clearly in this figure.
Example: Travelling salesman.
Given $N$ points on the plane (cities), the travelling salesman problem asks for the shortest route that travels through each city exactly once and returns to the starting point. This is a classic problem which is known to be NP-hard, and you can read more about it on Wikipedia.
This has been extensively studied, and there are several well known combinatorial algorithms that yield results close to the optimal path in practical amounts of time. Simulated annealing was originally proposed in order to solve the traveling salesman problem.
The simplest algorithm for the traveling salesman problem is the greedy nearest neighbor algorithm: Start anywhere, and simply travel to the nearest unvisited city. There are certain scenarios where this algorithm performs badly; but in most configurations this gives you a travel distance that is within 25% of the minimum.
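A sketch of the greedy nearest-neighbor heuristic on randomly placed cities (illustrative, not the notes' notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((50, 2))               # 50 random cities in the unit square

def tour_length(order):
    """Total length of the closed tour visiting cities in the given order."""
    pts = cities[order]
    return float(np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1).sum())

def greedy_tour(start=0):
    """Start anywhere and repeatedly travel to the nearest unvisited city."""
    unvisited = set(range(len(cities))) - {start}
    order = [start]
    while unvisited:
        last = cities[order[-1]]
        nxt = min(unvisited, key=lambda i: np.linalg.norm(cities[i] - last))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

print("greedy tour length:", tour_length(greedy_tour()))
```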
We can improve this as follows:
- Given a tour $\sigma$ (which is just some permutation of the $N$ cities), let $C(\sigma)$ be the total length (or cost) of the tour. We will use simulated annealing to minimize $C$ over all tours.
- To use simulated annealing, we need an MCMC algorithm to sample from $e^{-\beta C}$. This can be done with the Metropolis–Hastings algorithm; a standard proposal mechanism is to modify the current tour locally, for instance by reversing the order of a randomly chosen segment of the tour (or swapping two cities).
- Choose a cooling schedule (by experimenting) and run simulated annealing to minimize $C$ with the above proposal mechanism; a sketch appears after this list.
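For concreteness, here is a minimal sketch of what such an implementation might look like, using segment-reversal proposals and a geometric cooling schedule. This is an illustrative stand-in, not the redacted notebook code.

```python
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((50, 2))               # 50 random cities in the unit square

def cost(tour):
    """C(sigma): total length of the closed tour."""
    pts = cities[tour]
    return float(np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1).sum())

def anneal(n_iter=50_000, T0=1.0, TN=1e-3):
    tour = rng.permutation(len(cities))
    betas = 1.0 / (T0 * (TN / T0) ** (np.arange(n_iter) / n_iter))  # geometric cooling
    c = cost(tour)
    for beta in betas:
        i, j = sorted(rng.integers(0, len(cities), size=2))
        proposal = tour.copy()
        proposal[i:j + 1] = tour[i:j + 1][::-1]   # reverse a random segment of the tour
        c_new = cost(proposal)
        # Metropolis-Hastings acceptance for the target exp(-beta * C)
        if np.log(rng.random()) < -beta * (c_new - c):
            tour, c = proposal, c_new
    return tour, c

tour, c = anneal()
print("annealed tour length:", c)
```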
Here is what the numerical results look like, along with a comparison with direct valley descent and with blind random trials. Simulated annealing usually gives a small improvement over the greedy algorithm. The notebook can be downloaded here; the part implementing the simulated annealing algorithm is redacted, and filling it in is part of your homework.
Example: Cracking substitution ciphers.
A substitution cipher is one where you create a key that is a permutation of the alphabet (e.g. $A \mapsto K$, $B \mapsto Z$, etc.). Using this key, you can encode and decode a message. At first sight this might seem uncrackable by brute force – your key is one of $28!$ possible permutations (26 letters, plus a period and a space as punctuation).
This is a needle in an enormous haystack. If you could examine $10^{12}$ keys in a second (which is a generous overestimate), then it would still take you billions of years to crack this code by brute force. Nevertheless, if the message is sufficiently long (a few paragraphs of readable text), the cipher can be cracked in seconds using simulated annealing (or even just a stochastic hill climb).
To crack a substitution cipher, we need to first define a fitness function. Download a few long English books, and compute frequencies of letter sequences. That is, compute how often 'a' occurs, how often 'as' occurs, how often 'ast' occurs, and so on. Using these frequencies, define a fitness function $F$ that takes as input a string of symbols (e.g. a message), and outputs a real number. The closer the symbol frequencies match English, the higher the fitness should be. (There's an elegant and clever way to do this, which I'm not describing here as finding it is part of your homework.)
We will now crack substitution ciphers as follows:
- Given a guess at a key $\sigma$, define $f(\sigma)$ to be the fitness of the coded message $M$ decoded with key $\sigma$. (We want to maximize $f(\sigma)$.)
- Given a (random) key $\sigma$, our proposal mechanism generates a new key $\tau$ by randomly swapping two symbols in $\sigma$.
- Run a stochastic hill climb to maximize $f(\sigma)$, or simulated annealing to minimize $-f(\sigma)$. (Both work really well!)
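A minimal sketch of the stochastic hill climb, treating the fitness function as an assumed black box `fitness(text)` (defining it is part of the homework) and using a hypothetical `decode` helper:

```python
import random, string

random.seed(0)
SYMBOLS = string.ascii_uppercase + ". "          # 26 letters, period, space

def decode(message, key):
    """Decode a message with a (decoding) key given as a permutation of SYMBOLS."""
    return message.translate(str.maketrans(SYMBOLS, key))

def crack(message, fitness, n_iter=20_000):
    """Stochastic hill climb over keys: swap two symbols in the key,
    keep the swap only if the fitness of the decoded message improves."""
    key = list(SYMBOLS)
    random.shuffle(key)
    best = fitness(decode(message, "".join(key)))
    for _ in range(n_iter):
        i, j = random.sample(range(len(SYMBOLS)), 2)
        key[i], key[j] = key[j], key[i]          # propose a key with two symbols swapped
        f_new = fitness(decode(message, "".join(key)))
        if f_new > best:
            best = f_new
        else:
            key[i], key[j] = key[j], key[i]      # reject: undo the swap
    return "".join(key)

# usage (once fitness is implemented): key_guess = crack(cipher_text, fitness)
```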
As an example, we took a passage from Arthur Conan Doyle’s Sherlock Holmes, and coded it up with a randomly chosen key. We then downloaded five different books from Project Gutenberg, and ran simulated annealing. Here’s the decoded message after every 100 iterations. The $F=\dots$ shows the fitness of the decoded message with the current guess of the key, and the $D=\dots$ shows the number of symbols where our guessed key and actual key differ. We find the correct key in a few thousand iterations!
F=-470.33, D=27: . R.SLDBERVW.LREPDS.SLD.GS.FEXFNS. LD.XRPFZO.G.LFAD.SDEMRP.LDFBM.LGP.PDZ GRZ.LDB
F=-374.97, D=24: STFS OPUEFHJSOFEMP S OPSR SNEANB STOPSAFMNLCSRSONVPS PEIFMSOPNUISORMSMPLTRFLSOPU
F=-328.22, D=24: RE SANUCEJZ AECMNS SAN IS PCGPHS RAN GEMPOB I APXN SNCTEM ANPUT AIM MNORIEO ANU
F=-301.30, D=21: CO SANULOVF AOLMNS SAN IS ELHE.S CAN HOMERT I AEZN SNLGOM ANEUG AIM MNRCIOR ANU
F=-294.07, D=17: CO SANULOVG AOLMNS SAN IS ELHE.S CAN HOMERK I AEQN SNLDOM ANEUD AIM MNRCIOR ANU
F=-290.06, D=18: CO SANULOFG AOLMNS SAN IS ELVE.S CAN VOMERK I AEZN SNLDOM ANEUD AIM MNRCIOR ANU
F=-246.34, D=13: DO NSERLOCK SOLMEN NSE IN ALVA.N DSE VOMAUY I SAWE NELTOM SEART SIM MEUDIOU SER
F=-245.93, D=13: DO NSERLOCK SOLMEN NSE IN ALFA.N DSE FOMAUY I SAWE NELTOM SEART SIM MEUDIOU SER
F=-222.94, D=11: TO NSERLOCK SOLMEN NSE IN ALUAYN TSE UOMAF. I SAWE NELDOM SEARD SIM MEFTIOF SER
F=-140.26, D= 8: TO FHERLOCK HOLMEF FHE IF ALWAYF THE WOMAN. I HAUE FELDOM HEARD HIM MENTION HER
F=-141.88, D= 6: TO FHERLOCK HOLMEF FHE IF ALWAYF THE WOMAN. I HAJE FELDOM HEARD HIM MENTION HER
F=-112.05, D= 4: TO FHERLOCK HOLMEF FHE IF ALWAYF THE WOMAN. I HAXE FELDOM HEARD HIM MENTION HER
F= -34.58, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAZE SELDOM HEARD HIM MENTION HER
F= -26.99, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.95, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.64, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.64, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -28.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.64, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -28.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -28.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.99, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.54, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.88, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.44, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.88, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.54, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -24.30, D= 0: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
A redacted version of the notebook that does the above is here. The cells defining the fitness function and implementing simulated annealing have been removed, as implementing them is part of the homework. When these functions are implemented, the output of the completed notebook is here.