This post is part of a series, Markov Musings, about the mathematical analysis of Markov chains. See here for the first post in the series.
In the previous post, we proved the following convergence result for a reversible Markov chain. Let

$$1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge -1$$

denote the eigenvalues of the transition matrix $P$. Then the distance of the chain to its stationary distribution $\pi$ after $k$ steps decays (up to constants) at the rate $\max(\lambda_2, -\lambda_n)^k$. To show that the chain converges quickly, we must therefore bound two quantities: the second-largest eigenvalue $\lambda_2$ and the smallest eigenvalue $\lambda_n$.

Having to prove separate bounds for two eigenvalues is inconvenient. In the next post, we will develop tools to bound $\lambda_2$. But what should we do about $\lambda_n$?
Fortunately, there is a trick.
It Pays to be Lazy
Call a Markov chain lazy if, at every step, the chain has at least a 50% chance of staying put. That is, for every state $x$, $P_{xx} \ge 1/2$.
Any Markov chain can be made into a lazy chain. At every step of the chain, flip a coin. If the coin is heads, perform a step of the Markov chain as normal, drawing the next step randomly according to the transition matrix $P$. If the coin comes up tails, do nothing and stay put.
Algebraically, the lazy chain described in the previous paragraph corresponds to replacing the original transition matrix $P$ with the lazy transition matrix

$$P_{\text{lazy}} = \frac{1}{2}(I + P),$$

where $I$ denotes the identity matrix. It is easy to see that the stationary distribution $\pi$ for $P$ is also a stationary distribution for $P_{\text{lazy}}$:

$$\pi P_{\text{lazy}} = \frac{1}{2}\pi(I + P) = \frac{1}{2}(\pi + \pi P) = \frac{1}{2}(\pi + \pi) = \pi.$$

The payoff of laziness comes from the eigenvalues. If $\lambda$ is an eigenvalue of $P$, then $(1+\lambda)/2$ is an eigenvalue of $P_{\text{lazy}}$. Since every eigenvalue of $P$ lies in $[-1,1]$, every eigenvalue of the lazy chain lies in $[0,1]$: the lazy chain has no negative eigenvalues at all, and its convergence is controlled by the second-largest eigenvalue $(1+\lambda_2)/2$ alone.
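As a quick numerical check, here is a minimal Python sketch. The three-state chain below is a made-up example chosen only for illustration; the sketch verifies that $\pi$ remains stationary for the lazy chain and that laziness maps each eigenvalue $\lambda$ of $P$ to the nonnegative eigenvalue $(1+\lambda)/2$ of $P_{\text{lazy}}$:

```python
import numpy as np

# A made-up reversible chain (a birth-death chain on three states)
# with a large negative eigenvalue, used only for illustration.
P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])
pi = np.array([0.25, 0.50, 0.25])  # its stationary distribution (pi @ P == pi)

# The lazy chain: flip a fair coin; move according to P on heads, stay put on tails.
P_lazy = 0.5 * (np.eye(3) + P)

# pi is stationary for the lazy chain too: pi P_lazy = (pi + pi P)/2 = pi.
print(np.allclose(pi @ P_lazy, pi))              # True

# Laziness maps each eigenvalue lambda to (1 + lambda)/2, all nonnegative.
print(np.sort(np.linalg.eigvals(P).real))        # approx [-0.8, 0.1, 1.0]
print(np.sort(np.linalg.eigvals(P_lazy).real))   # approx [ 0.1, 0.55, 1.0]
```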
Should You be Lazy?
We’ve seen one theoretical argument for why it pays to use lazy Markov chains. But should we use lazy Markov chains in practice? The answer ultimately depends on how we want to use the Markov chain.
Here are two very different ways we could use a Markov chain:
- One-shot sampling. Run the Markov chain once for a fixed number of steps $N$, and use the result as an approximate sample from $\pi$.
For this use case, it is important that the output is genuinely a sample from $\pi$, and the possibility of large negative eigenvalues can significantly degrade the convergence. In the extreme case where $-1$ is an eigenvalue, the chain will fail to converge to stationarity at all. (For instance, a chain that deterministically alternates between two states has $-1$ as an eigenvalue and never settles down.) For this application, the lazy approach may make sense.
- Averaging. Another way we can use a Markov chain is to compute the average value of a function $f$ over a random variable $x$ drawn from the stationary distribution, $\mathbb{E}_{x \sim \pi}[f(x)]$. To do so, we estimate this expectation by the average of the values of $f$ over the first $N$ states $x_0, x_1, \ldots, x_{N-1}$ of the chain:

$$\hat{f}_N = \frac{1}{N} \sum_{k=0}^{N-1} f(x_k).$$

(A concrete sketch of this procedure appears after this list.)
As we shall see, this averaging process converges at an (asymptotically) slower rate for the lazy chain than the original chain. Therefore, for this application, we should typically just run the chain as-is, and not worry about making it lazy.
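Here is a minimal sketch of the averaging procedure; the three-state reversible chain and the test function $f$ are made-up examples for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])   # made-up reversible chain
pi = np.array([0.25, 0.50, 0.25])   # its stationary distribution
f = np.array([1.0, -1.0, 2.0])      # an arbitrary test function on the states

exact = pi @ f                      # exact stationary mean, for comparison

# Run the chain and average f over its first N states.
N, x = 100_000, 0
total = 0.0
for _ in range(N):
    total += f[x]
    x = rng.choice(3, p=P[x])       # draw the next state from row x of P
print(total / N, "vs exact", exact) # the average should be close to pi @ f
```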
The Case Against Laziness
To see why being lazy hurts us for the averaging problem, we will use the Markov chain central limit theorem. As with many random averaging processes, we expect that in the limit of a large number of steps $N$, the Markov chain average $\hat{f}_N$ becomes approximately normally distributed:
Informal theorem (Markov chain central limit theorem). Let $x_0, x_1, x_2, \ldots$ denote the states of a Markov chain initialized in the stationary distribution $x_0 \sim \pi$. For a large number $N$ of steps, $\hat{f}_N$ is approximately normally distributed with mean $\mathbb{E}_{x \sim \pi}[f(x)]$ and variance $\sigma^2(f)/N$, where

(1) $\displaystyle \sigma^2(f) = \operatorname{Var}[f(x_0)] + 2 \sum_{k=1}^{\infty} \operatorname{Cov}\bigl(f(x_0), f(x_k)\bigr).$
The Markov chain CLT shows that the variance of the Markov chain average depends not only on the variance of $f$ in the stationary distribution, but also on the covariances between $f(x_0)$ and the later values $f(x_k)$. The faster these covariances decrease, the smaller $\sigma^2(f)$ will be, and thus the smaller the error of the Markov chain average.
More formally, the Markov chain CLT is given by the following result:
Theorem (Markov chain central limit theorem). As $N \to \infty$, $\sqrt{N}\bigl(\hat{f}_N - \mathbb{E}_{x \sim \pi}[f(x)]\bigr)$ converges in distribution to a normal random variable with mean zero and variance $\sigma^2(f)$, as defined in (1).
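The theorem is easy to probe numerically. The sketch below (using the same made-up chain as earlier) estimates $\sigma^2(f)$ by truncating the series (1), computing the covariances as $\mathbb{E}_\pi[f \cdot P^k f] - (\mathbb{E}_\pi f)^2$ (a formula we will derive in Step 3 below), and compares it with the empirical variance of $\sqrt{N}(\hat{f}_N - \mathbb{E}_{x \sim \pi}[f(x)])$ over many replicate runs:

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])   # made-up reversible chain
pi = np.array([0.25, 0.50, 0.25])   # its stationary distribution
f = np.array([1.0, -1.0, 2.0])      # an arbitrary test function
mean = pi @ f

# sigma^2(f) from the series (1), truncated at a large cutoff K = 200.
# Cov(f(x_0), f(x_k)) = E_pi[f * (P^k f)] - (E_pi f)^2; see Step 3 below.
sigma2 = pi @ f**2 - mean**2        # Var[f(x_0)]
Pkf = f.copy()
for _ in range(200):
    Pkf = P @ Pkf                   # advance to P^k f
    sigma2 += 2 * (pi @ (f * Pkf) - mean**2)

# Empirical variance of sqrt(N) * (f_hat_N - mean) over replicate chains.
N, reps = 1_000, 1_000
errs = np.empty(reps)
for r in range(reps):
    x, total = rng.choice(3, p=pi), 0.0   # start in stationarity
    for _ in range(N):
        total += f[x]
        x = rng.choice(3, p=P[x])
    errs[r] = np.sqrt(N) * (total / N - mean)
print(sigma2, errs.var())           # the two should roughly agree
```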
To compare the lazy and non-lazy chains, let's compute the variance parameter $\sigma^2(f)$ in the Markov chain CLT in terms of the eigenvalues of the chain. For the remainder of this post, let $f$ be any function, which we treat interchangeably as a vector with entries $f(x)$.
Step 1: Spectral Decomposition
Recall that the transition matrix $P$ of a reversible Markov chain is self-adjoint with respect to the $\pi$-inner product

$$\langle g, h \rangle_\pi = \mathbb{E}_{x \sim \pi}[g(x) h(x)] = \sum_x \pi_x\, g(x) h(x).$$

Therefore, by the spectral theorem, $P$ has eigenvectors $\varphi_1, \ldots, \varphi_n$, with eigenvalues $1 = \lambda_1 \ge \cdots \ge \lambda_n \ge -1$, that are orthonormal in the $\pi$-inner product. Consequently, we may expand the function $f$ in the basis of eigenvectors:

(2) $\displaystyle f = \sum_{i=1}^n c_i \varphi_i, \quad \text{where } c_i = \langle f, \varphi_i \rangle_\pi.$
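Numerically, one standard way to compute such a decomposition is to symmetrize: for a reversible chain, the matrix $D^{1/2} P D^{-1/2}$ with $D = \operatorname{diag}(\pi)$ is symmetric, so a symmetric eigensolver applies and the eigenvectors transform back. A sketch with the same made-up example chain:

```python
import numpy as np

P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])   # made-up reversible chain
pi = np.array([0.25, 0.50, 0.25])

# For a reversible chain, D^{1/2} P D^{-1/2} is symmetric (D = diag(pi)),
# so we can use a symmetric eigensolver and transform back.
d = np.sqrt(pi)
lam, V = np.linalg.eigh((P * d[:, None]) / d[None, :])
Phi = V / d[:, None]                 # pi-orthonormal eigenvectors of P

print(np.allclose(P @ Phi, Phi * lam))                     # P phi_i = lam_i phi_i
print(np.allclose(Phi.T @ np.diag(pi) @ Phi, np.eye(3)))   # <phi_i, phi_j>_pi = delta_ij

# Coefficients of a function f in this basis: c_i = <f, phi_i>_pi.
f = np.array([1.0, -1.0, 2.0])
c = Phi.T @ (pi * f)
print(np.allclose(Phi @ c, f))       # f = sum_i c_i phi_i
```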
Step 2: Applying the Transition Matrix to a Function
The transition matrix is defined so that if the Markov chain has probability distribution $\rho_t$ (viewed as a row vector) at time $t$, then the chain has distribution $\rho_{t+1} = \rho_t P$ at time $t+1$. In other words, multiplying by $P$ on the right advances a distribution forward one step in time. This leads us to ask: what is the interpretation of multiplying a function by $P$ on the left? That is, is there an interpretation to the matrix–vector product $Pf$?
Indeed, there is such an interpretation: the $x$th coordinate of $Pf$ is the expectation of $f(x_1)$ conditional on the chain starting at $x_0 = x$:

(3) $\displaystyle (Pf)(x) = \sum_y P_{xy} f(y) = \mathbb{E}[f(x_1) \mid x_0 = x].$

Iterating, multiplying by the $k$th power of $P$ computes the conditional expectation $k$ steps in the future:

$$(P^k f)(x) = \mathbb{E}[f(x_k) \mid x_0 = x].$$
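This interpretation is easy to sanity-check: compare the matrix power computation against a direct Monte Carlo simulation (made-up example chain as before):

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])   # made-up reversible chain
f = np.array([1.0, -1.0, 2.0])      # an arbitrary test function
k, x0 = 3, 0

# The x-th entry of P^k f is E[f(x_k) | x_0 = x].
Pkf = np.linalg.matrix_power(P, k) @ f

# Monte Carlo check: run many k-step trajectories from x0 and average f(x_k).
trials, total = 100_000, 0.0
for _ in range(trials):
    x = x0
    for _ in range(k):
        x = rng.choice(3, p=P[x])
    total += f[x]
print(total / trials, "vs", Pkf[x0])  # should agree up to Monte Carlo error
```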
Step 3: Computing the Covariance
With the help of the formula for $P^k f$ and the spectral decomposition of $f$, we are ready to compute the covariances appearing in (1). Let $k \ge 1$. Then

(4) $\displaystyle \operatorname{Cov}\bigl(f(x_0), f(x_k)\bigr) = \mathbb{E}[f(x_0) f(x_k)] - \mathbb{E}[f(x_0)]\,\mathbb{E}[f(x_k)].$
Let’s first compute $\mathbb{E}[f(x_0)]$ and $\mathbb{E}[f(x_k)]$. Since $\varphi_1 = \mathbf{1}$ is the vector/function of all ones, we have

$$\mathbb{E}[f(x_0)] = \mathbb{E}_{x \sim \pi}[f(x)] = \langle f, \mathbf{1} \rangle_\pi = \langle f, \varphi_1 \rangle_\pi = c_1.$$

Since we assume the chain is initialized in the stationary distribution $x_0 \sim \pi$, it remains in stationarity at time $k$, $x_k \sim \pi$, so we have $\mathbb{E}[f(x_k)] = c_1$ as well.
Now, let’s compute $\mathbb{E}[f(x_0) f(x_k)]$. Use the law of total expectation to write

$$\mathbb{E}[f(x_0) f(x_k)] = \mathbb{E}\bigl[f(x_0)\, \mathbb{E}[f(x_k) \mid x_0]\bigr].$$

Now, invoke (3), the definition of the $\pi$-inner product, and the expansion (2):

$$\mathbb{E}[f(x_0) f(x_k)] = \mathbb{E}_{x_0 \sim \pi}\bigl[f(x_0)\, (P^k f)(x_0)\bigr] = \langle f, P^k f \rangle_\pi = \sum_{i=1}^n \lambda_i^k c_i^2.$$

Combining this formula with our earlier observation that $\mathbb{E}[f(x_0)] = \mathbb{E}[f(x_k)] = c_1$ and plugging into (4), we obtain

(5) $\displaystyle \operatorname{Cov}\bigl(f(x_0), f(x_k)\bigr) = \sum_{i=1}^n \lambda_i^k c_i^2 - c_1^2 = \sum_{i=2}^n \lambda_i^k c_i^2,$

where the $i = 1$ term cancels because $\lambda_1 = 1$.
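As a numerical check of formula (5), we can compare the spectral expression against the direct computation $\langle f, P^k f \rangle_\pi - c_1^2$:

```python
import numpy as np

P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])   # made-up reversible chain
pi = np.array([0.25, 0.50, 0.25])
f = np.array([1.0, -1.0, 2.0])
k = 3

# pi-orthonormal eigendecomposition of P (as in Step 1) and coefficients c_i.
d = np.sqrt(pi)
lam, V = np.linalg.eigh((P * d[:, None]) / d[None, :])
Phi = V / d[:, None]
c = Phi.T @ (pi * f)

# Direct route: Cov(f(x_0), f(x_k)) = <f, P^k f>_pi - c_1^2.
mean = pi @ f
direct = pi @ (f * (np.linalg.matrix_power(P, k) @ f)) - mean**2

# Formula (5): sum of lam_i^k c_i^2 over the non-unit eigenvalues.
keep = ~np.isclose(lam, 1.0)
spectral = np.sum(lam[keep]**k * c[keep]**2)
print(direct, spectral)  # the two should match
```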
Step 4: Wrapping it Up
With the formula (5) for the covariance in hand, we are ready to evaluate the variance parameter $\sigma^2(f)$ in the Markov chain CLT. First, note that the variance is

$$\operatorname{Var}[f(x_0)] = \mathbb{E}[f(x_0)^2] - \bigl(\mathbb{E}[f(x_0)]\bigr)^2 = \langle f, f \rangle_\pi - c_1^2 = \sum_{i=2}^n c_i^2.$$

Now, apply the formula for the sum of an infinite geometric series to obtain

(6) $\displaystyle \sigma^2(f) = \sum_{i=2}^n c_i^2 + 2 \sum_{k=1}^\infty \sum_{i=2}^n \lambda_i^k c_i^2 = \sum_{i=2}^n \left(1 + \frac{2\lambda_i}{1 - \lambda_i}\right) c_i^2 = \sum_{i=2}^n \frac{1 + \lambda_i}{1 - \lambda_i}\, c_i^2.$
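In code, formula (6) and the defining series (1), truncated at a large cutoff, give matching answers:

```python
import numpy as np

P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])   # made-up reversible chain
pi = np.array([0.25, 0.50, 0.25])
f = np.array([1.0, -1.0, 2.0])

# Eigenvalues and coefficients, keeping only i = 2, ..., n (lambda_i != 1).
d = np.sqrt(pi)
lam, V = np.linalg.eigh((P * d[:, None]) / d[None, :])
c = (V / d[:, None]).T @ (pi * f)
keep = ~np.isclose(lam, 1.0)
lam2, c2 = lam[keep], c[keep]

# Formula (6).
sigma2 = np.sum((1 + lam2) / (1 - lam2) * c2**2)

# The defining series (1), truncated at a large cutoff.
var = np.sum(c2**2)                                   # Var[f(x_0)]
covs = sum(np.sum(lam2**k * c2**2) for k in range(1, 200))
print(sigma2, var + 2 * covs)                         # the two should agree
```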
Conclusion: Laziness is (Asymptotically) Bad for Averaging
Before we get to the conclusions, let’s summarize where we are. We are seeking to use the Markov chain average $\hat{f}_N$ to approximate the stationary expectation $\mathbb{E}_{x \sim \pi}[f(x)]$. The Markov chain central limit theorem tells us that, for a large number of steps $N$, the error $\hat{f}_N - \mathbb{E}_{x \sim \pi}[f(x)]$ is approximately normally distributed with mean zero and variance $\sigma^2(f)/N$.
After some work, we were able to derive a formula (6) for $\sigma^2(f)$ in terms of the eigenvalues $\lambda_2, \ldots, \lambda_n$ of the transition matrix $P$. The formula for $\sigma^2_{\text{lazy}}(f)$ for the lazy chain is identical, except with each eigenvalue $\lambda_i$ replaced by $(1 + \lambda_i)/2$. Thus, we have

$$\sigma^2_{\text{lazy}}(f) = \sum_{i=2}^n \frac{1 + (1 + \lambda_i)/2}{1 - (1 + \lambda_i)/2}\, c_i^2 = \sum_{i=2}^n \frac{3 + \lambda_i}{1 - \lambda_i}\, c_i^2.$$

From these formulas, it is clear that the lazy Markov chain has a larger variance parameter than the original chain, since $(3 + \lambda)/(1 - \lambda) > (1 + \lambda)/(1 - \lambda)$ for every $-1 < \lambda < 1$. A larger variance parameter means a larger asymptotic error for the averaging problem.
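Continuing the running numerical example, we can compute both variance parameters explicitly; for this made-up chain, whose extreme eigenvalue is negative, the lazy chain's variance parameter comes out several times larger:

```python
import numpy as np

P = np.array([[0.10, 0.90, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.90, 0.10]])   # made-up reversible chain
pi = np.array([0.25, 0.50, 0.25])
f = np.array([1.0, -1.0, 2.0])

d = np.sqrt(pi)
lam, V = np.linalg.eigh((P * d[:, None]) / d[None, :])
c = (V / d[:, None]).T @ (pi * f)
keep = ~np.isclose(lam, 1.0)
lam2, c2 = lam[keep], c[keep]

sigma2 = np.sum((1 + lam2) / (1 - lam2) * c2**2)        # original chain
sigma2_lazy = np.sum((3 + lam2) / (1 - lam2) * c2**2)   # lazy chain
print(sigma2, sigma2_lazy)   # the lazy variance parameter is larger
```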
To compare the non-lazy and lazy chains in another way, consider the plot below. The blue solid line shows the amplification factor $(1 + \lambda)/(1 - \lambda)$ of an eigenvalue $\lambda$, which represents the amount by which the squared coefficient $c_i^2$ is scaled in $\sigma^2(f)$. The red dashed line shows the corresponding amplification factor $(3 + \lambda)/(1 - \lambda)$ for the corresponding eigenvalue $(1 + \lambda)/2$ of the lazy chain. We see that at every $\lambda$ value, the lazy chain has a higher amplification factor than the original chain.
[Plot: amplification factor versus eigenvalue $\lambda$ for the original chain, $(1+\lambda)/(1-\lambda)$ (blue, solid), and the lazy chain, $(3+\lambda)/(1-\lambda)$ (red, dashed).]
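For completeness, the plot can be reproduced with a few lines of matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Amplification factors for the original and lazy chains as functions of lambda.
lam = np.linspace(-0.99, 0.95, 500)
plt.plot(lam, (1 + lam) / (1 - lam), "b-", label="original chain")
plt.plot(lam, (3 + lam) / (1 - lam), "r--", label="lazy chain")
plt.xlabel(r"eigenvalue $\lambda$")
plt.ylabel("amplification factor")
plt.legend()
plt.show()
```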
Remember that our original motivation for using the lazy chain was to remove the possibility of slow convergence of the chain to stationarity because of negative eigenvalues. But for the averaging application, negative eigenvalues are great. The process of Markov chain averaging shrinks the influence of negative eigenvalues on $\sigma^2(f)$, whereas positive eigenvalues are amplified. For the averaging application, negative eigenvalues for the chain are a feature, not a bug.
Moral of the Story
Much of Markov chain theory, particularly in theoretical parts of computer science, mathematics, and machine learning, is centered around proving convergence to stationarity. Negative eigenvalues are a genuine obstruction to convergence to stationarity, and using the lazy chain in practice may be a sensible idea if one truly needs a sample from the stationary distribution in a given application.
But one-shot sampling is not the only, or even the most common, use for Markov chains in computational practice. For other applications, such as averaging, the negative eigenvalues are actually a help. Using the lazy chain in practice for these problems would be a poor idea.
To me, the broader lesson in this story is that, as applied mathematicians, it can be inadvisable to become too fixated on one particular mathematical benchmark for designing and analyzing algorithms. Proving rapid convergence to stationarity with respect to the total variation distance is one nice way to analyze Markov chains. But it isn’t the only way, and chains not possessing this property because of large negative eigenvalues may actually be better in practice for some problems. Ultimately, applied mathematics should, at least in part, be guided by applications, and paying attention to how algorithms are used in practice should inform how we build and evaluate new ones.