This post is part of a series, Markov Musings, about the mathematical analysis of Markov chains. See here for the first post in the series.
In the last post, we saw a proof of the fundamental theorem of Markov chains using the method of couplings. In this post, we will use couplings to provide quantitative bounds on the rate of Markov chain convergence.
Mixing Time
Let us begin where the last post ended by recalling some facts about the mixing time of a Markov chain. Consider a Markov chain with stationary distribution $\pi$. Recall that the mixing time is

\[
\tau_{\rm mix} = \min \left\{ n \ge 0 : \max_{\rho_0} \| \rho_n - \pi \|_{\rm TV} \le \frac{1}{4} \right\}.
\]

Here, $\rho_n$ denotes the distribution of the chain after $n$ steps when initialized in the distribution $\rho_0$, the maximum is taken over all initial distributions $\rho_0$, and $\| \cdot \|_{\rm TV}$ denotes the total variation distance.
At the end of last post, we saw the following theorem, showing that the mixing time controls the rate of Markov chain convergence.
Theorem (mixing time as convergence rate). For any starting distribution $\rho_0$,

\[
\| \rho_n - \pi \|_{\rm TV} \le 2^{-\lfloor n / \tau_{\rm mix} \rfloor}.
\]

Consequently, if $n \ge \tau_{\rm mix} \cdot \lceil \log_2(1/\epsilon) \rceil$, then $\| \rho_n - \pi \|_{\rm TV} \le \epsilon$.

This result motivates us to develop techniques to bound the mixing time $\tau_{\rm mix}$.
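To make the definition concrete, here is a minimal sketch (not from the original post) that computes the mixing time of a toy two-state chain by brute force and numerically checks the convergence-rate theorem; the transition matrix below is made up purely for illustration.

```python
# A minimal sketch: brute-force mixing time of a hypothetical 2-state chain.
import numpy as np

# Made-up transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution: left eigenvector of P with eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

def tv_distance(mu, nu):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(mu - nu).sum()

def worst_case_tv(n):
    """Worst case over initial distributions; attained at a point mass."""
    Pn = np.linalg.matrix_power(P, n)
    return max(tv_distance(Pn[i, :], pi) for i in range(P.shape[0]))

# Mixing time: first n at which the worst-case TV distance drops to 1/4
tau_mix = next(n for n in range(1, 1000) if worst_case_tv(n) <= 0.25)
print("mixing time:", tau_mix)

# Convergence-rate theorem: at n = 3 * tau_mix, the distance is at most 2^(-3)
print(worst_case_tv(3 * tau_mix), "<=", 2.0 ** -3)
```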
Couplings and Convergence
In the previous post, we made essential use of couplings of probability distributions. In this post, we will extend this notion by using couplings of entire Markov chains.
Definition (coupling of Markov chains). A coupling of Markov chains with transition matrix $P$ is a pair of processes $(X_n)_{n \ge 0}$ and $(Y_n)_{n \ge 0}$ such that (A) each process is individually a Markov chain with transition matrix $P$: in particular,

\[
\mathbb{P} \{ X_{n+1} = j \mid X_n = i \} = \mathbb{P} \{ Y_{n+1} = j \mid Y_n = i \} = P_{ij},
\]

and (B) if $X_n = Y_n$, then $X_{n+1} = Y_{n+1}$.
A coupling of Markov chains thus consists of a pair of Markov chains which become glued together should they ever touch.
There are several ways we can use couplings to bound the mixing time of Markov chains. Here is one way. Let $(X_n)_{n \ge 0}$ and $(Y_n)_{n \ge 0}$ be a coupling of Markov chains for which $X_0$ is initialized in some initial distribution $\rho_0$ and $Y_0$ is initialized in the stationary distribution $\pi$. Let $\tau_{\rm meet}$ be the time at which $X_n$ and $Y_n$ meet:

\[
\tau_{\rm meet} = \min \{ n \ge 0 : X_n = Y_n \}.
\]

Theorem (convergence from couplings). Then, for every $n \ge 0$,

\[
\| \rho_n - \pi \|_{\rm TV} \le \mathbb{P} \{ \tau_{\rm meet} > n \}.
\]
This result is an immediate application of the coupling lemma from last post. Using this bound, we can bound the mixing time $\tau_{\rm mix}$.

Corollary (mixing time from couplings). $\tau_{\rm mix} \le 4 \max_{\rho_0} \mathbb{E}[\tau_{\rm meet}]$. The maximum is taken over all initial distributions $\rho_0$.

Indeed, by the above theorem and Markov's inequality (for more on concentration inequalities such as Markov's inequality, see my post on the subject),

\[
\| \rho_n - \pi \|_{\rm TV} \le \mathbb{P} \{ \tau_{\rm meet} > n \} \le \frac{\mathbb{E}[\tau_{\rm meet}]}{n}.
\]

The right-hand side is at most $1/4$ once $n \ge 4 \max_{\rho_0} \mathbb{E}[\tau_{\rm meet}]$, which proves the corollary.
Example: Bit Strings
There are a number of examples of proving bounds on mixing times using the method of couplings (see, for instance, section 5.3 of Markov Chains and Mixing Times), but we will focus on a simple but illustrative one: a lazy random walk on the set of bit strings.
A bit string is a sequence of 0’s and 1’s. Bit strings are used to store and represent information on digital computers. For instance, the character “a” is stored as “1100001” using the ASCII encoding.
The Chain
Our example is a Markov chain whose states consist of all bit strings of length $N$. For each step of the chain,
- With 50% probability, we do nothing and leave the state unchanged.
- With 50% probability, we choose a (uniform) random position from $1$ to $N$ and flip the bit in that position, changing 0 to 1 or 1 to 0.
One can think of this Markov chain as modeling a random process in which a bit string is corrupted by errors which flip a single bit. The stationary distribution for this chain is the uniform distribution on all bit strings (i.e., each bit string of length $N$ has the same probability $2^{-N}$). Therefore, one can also think of this Markov chain as a very inefficient way of generating a string of $N$ random bits. (Another way of thinking about this Markov chain is as a random walk on the vertices of an $N$-dimensional hypercube graph: with 50% probability, we stay put, and with 50% probability, we move along a random edge.) Because we have a 50% probability of doing nothing, we call this process a lazy random walk.
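Here is a minimal sketch (not from the original post) of one step of this chain in Python; the helper name `lazy_step` is my own, the bit string is stored as a list of 0s and 1s, and positions are indexed from 0 rather than 1.

```python
# A minimal sketch of one step of the lazy random walk on bit strings of length N.
import random

def lazy_step(x):
    """Apply one step of the lazy random walk to the bit string x (a list of 0/1)."""
    x = list(x)
    if random.random() < 0.5:
        return x                      # lazy move: leave the state unchanged
    i = random.randrange(len(x))      # uniform random position 0, ..., N-1
    x[i] = 1 - x[i]                   # flip the selected bit
    return x

# Example: corrupt the ASCII bit string for "a" by a few random steps
x = [1, 1, 0, 0, 0, 0, 1]
for _ in range(5):
    x = lazy_step(x)
print(x)
```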
Designing the Coupling
Let us use the method of couplings to determine how fast this chain mixes. First, observe that there's an alternative way of stating the transition rule for this Markov chain:
- Pick a (uniform) random position from $1$ to $N$ and set the bit in that position to 0 with 50% probability and to 1 with 50% probability.
Indeed, under this rule, half of the time we leave the state unchanged because we set the bit in the selected position to the value it already had. For the other half of the time, we flip the selected bit.
Now, we construct a coupling. Initialize $X_0$ in an arbitrary distribution $\rho_0$ and $Y_0$ in the stationary distribution (uniform over all bit strings). The key to constructing a coupling will be to use the same randomness to update the state of both the "$X$" chain and the "$Y$" chain. For this chain, there's a natural way to do this:

At time $n$, pick a (uniform) random position $1 \le p_n \le N$ and a (uniform) random bit $b_n \in \{0, 1\}$. Set $X_{n+1}$ by changing the $p_n$th bit of $X_n$ to $b_n$, and set $Y_{n+1}$ by changing the $p_n$th bit of $Y_n$ to $b_n$.

We couple the two chains by setting the same position $p_n$ in both chains to the same bit $b_n$.
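Here is a minimal sketch (not from the original post) of this coupled update in Python; the starting states below are chosen just for illustration, with the $X$ chain starting at the all-zeros string and the $Y$ chain starting uniformly at random.

```python
# A minimal sketch of the coupling: both chains set the same random position
# to the same random bit at every step, and they stay glued once they agree.
import random

def coupled_step(X, Y, rng=random):
    """Apply one coupled update to the bit strings X and Y (lists of 0/1), in place."""
    p = rng.randrange(len(X))   # shared random position
    b = rng.randint(0, 1)       # shared random bit
    X[p] = b
    Y[p] = b

N = 8
X = [0] * N                                    # arbitrary deterministic start
Y = [random.randint(0, 1) for _ in range(N)]   # stationary (uniform) start
n = 0
while X != Y:
    coupled_step(X, Y)
    n += 1
print("chains met at time", n)
```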
Bounding the Mixing Time
To bound the mixing time, we need to understand when these two chains meet. Observe that if we pick position $p$ to set at some point, the two Markov chains have the same bit in position $p$ at every later time. Consequently,

If all of the positions $1, \ldots, N$ have been chosen by time $n$, then $X_n = Y_n$.

The two chains might meet before all of the positions have been chosen, but they are guaranteed to have met after.

Set $\tau_{\rm pick}$ to be the time at which all positions $1, \ldots, N$ have been selected at least once. Then our observation is equivalent to saying that the meeting time $\tau_{\rm meet}$ is at most $\tau_{\rm pick}$:

\[
\tau_{\rm meet} \le \tau_{\rm pick}.
\]
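The following minimal sketch (not from the original post) checks this observation empirically: in every coupled run, the meeting time never exceeds the first time at which all positions have been selected.

```python
# A minimal sketch: in each coupled run, record the meeting time tau_meet and the
# time tau_pick at which every position has been chosen, and confirm tau_meet <= tau_pick.
import random

def meeting_and_pick_times(N, rng=random):
    X = [0] * N                                   # arbitrary deterministic start
    Y = [rng.randint(0, 1) for _ in range(N)]     # stationary (uniform) start
    chosen = set()
    tau_meet = tau_pick = None
    t = 0
    while tau_meet is None or tau_pick is None:
        if tau_meet is None and X == Y:
            tau_meet = t
        if tau_pick is None and len(chosen) == N:
            tau_pick = t
        p = rng.randrange(N)     # shared random position
        b = rng.randint(0, 1)    # shared random bit
        X[p] = b
        Y[p] = b
        chosen.add(p)
        t += 1
    return tau_meet, tau_pick

N = 16
runs = [meeting_and_pick_times(N) for _ in range(1000)]
print(all(m <= p for m, p in runs))                    # always True
print(sum(m for m, _ in runs) / len(runs),             # average meeting time
      sum(p for _, p in runs) / len(runs))             # average pick time
```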
Collecting Coupons
Computing the expected value of $\tau_{\rm pick}$ is actually an example of a very famous problem in probability theory known as the coupon collector problem. The classic version goes like this:

A cereal company is running a promotion where each box of cereal contains a coupon, which is equally likely to be any one of $N$ different colors. A coupon of each color can be traded in for a toy. What is the expected number of boxes we need to open in order to qualify for the toy?

The coupon collector problem is exactly the same as our problem, with the $N$ different colors of coupons playing the same role as the $N$ different positions in our bit strings. Going forward, let's use the language of coupons and boxes to describe this problem, as it's more colorful.
When tackling a hard question like computing $\mathbb{E}[\tau_{\rm pick}]$, it can be helpful to break it into easier pieces. Let's start with the simplest possible question: How many picks do we need to collect the first coupon? Just one. We always get a coupon in our first box.

With that question answered, let's ask the next-simplest question: How many picks do we need to collect the second coupon, differently colored than the first? There are $N$ colors and $N - 1$ of them are different from the first. So the probability of picking a new coupon in the second box is $(N-1)/N$. In fact,

Until we succeed at picking the second unique coupon, every box has an independent $(N-1)/N$ chance of containing such a coupon.

The number of boxes needed is therefore a geometric random variable with success probability $(N-1)/N$, and its mean is $N/(N-1)$. On average, we need $N/(N-1)$ picks to collect the second coupon.
The reasoning is the same for the third coupon. There are $N - 2$ coupons we haven't collected and $N$ total, so the probability of picking the third coupon in each box is $(N-2)/N$. Thus, the number of picks is a geometric random variable with success probability $(N-2)/N$ and mean $N/(N-2)$.
In general, we need an expected $N/(N-j+1)$ picks to collect the $j$th coupon. Adding up the expected number of picks for each $j = 1, 2, \ldots, N$, we obtain that the total number of picks required to collect all the coupons is

\[
\mathbb{E}[\tau_{\rm pick}] = \sum_{j=1}^{N} \frac{N}{N-j+1} = N \sum_{i=1}^{N} \frac{1}{i} = N H_N,
\]

where $H_N = 1 + \tfrac{1}{2} + \cdots + \tfrac{1}{N}$ denotes the $N$th harmonic number.
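Here is a minimal sketch (not from the original post) that checks the formula $\mathbb{E}[\tau_{\rm pick}] = N H_N$ against a Monte Carlo simulation of the coupon collector problem; the helper name `boxes_to_collect_all` is my own.

```python
# A minimal sketch: compare a Monte Carlo estimate of E[tau_pick] with N * H_N.
import random

def boxes_to_collect_all(N, rng=random):
    """Open boxes with uniformly random coupon colors until all N colors are seen."""
    seen, boxes = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        boxes += 1
    return boxes

N, trials = 20, 5000
empirical = sum(boxes_to_collect_all(N) for _ in range(trials)) / trials
exact = N * sum(1.0 / i for i in range(1, N + 1))   # N * H_N
print("Monte Carlo estimate of E[tau_pick]:", empirical)
print("N * H_N:", exact)
```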
Conclusion
Putting all of these ingredients together, we obtain

\[
\tau_{\rm mix} \le 4 \max_{\rho_0} \mathbb{E}[\tau_{\rm meet}] \le 4 \, \mathbb{E}[\tau_{\rm pick}] = 4 N H_N \le 4 N (\ln N + 1).
\]

Thus, the lazy random walk on bit strings of length $N$ mixes in at most roughly $4 N \ln N$ steps.
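As a final illustration (not from the original post), here is a short sketch evaluating the bound $4 N H_N$ for a few bit string lengths alongside the simpler estimate $4 N (\ln N + 1)$.

```python
# A minimal sketch: evaluate the mixing time bound 4 * N * H_N for several N.
import math

for N in (8, 64, 512, 4096):
    H_N = sum(1.0 / i for i in range(1, N + 1))
    print(f"N = {N:5d}:  4*N*H_N = {4 * N * H_N:10.1f},  "
          f"4*N*(ln N + 1) = {4 * N * (math.log(N) + 1):10.1f}")
```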