This post is part of a series, Markov Musings, about the mathematical analysis of Markov chains. See here for the first post in the series.
In the last post, we saw a proof of the fundamental theorem of Markov chains using the method of couplings. In this post, we will use couplings to provide quantitative bounds on the rate of Markov chain convergence.
Mixing Time
Let us begin where the last post ended by recalling some facts about the mixing time of a Markov chain. Consider a Markov chain $x_0, x_1, x_2, \ldots$ with stationary distribution $\pi$. Recall that the mixing time is

$$\tau_{\rm mix} := \min \left\{ t \ge 0 : \max_{\rho_0} \|\rho_t - \pi\|_{\rm TV} \le \frac{1}{4} \right\}.$$

Here, $\rho_t$ denotes the distribution of the state $x_t$ of the chain at time $t$, $\|\cdot\|_{\rm TV}$ is the total variation distance, and the maximum is taken over all initial distributions $\rho_0$. For more on the total variation distance, see the previous post.
At the end of the last post, we saw the following theorem, showing that the mixing time controls the rate of Markov chain convergence.
Theorem (mixing time as convergence rate). For any starting distribution $\rho_0$ and any $t \ge 0$,

$$\|\rho_t - \pi\|_{\rm TV} \le 2^{-\lfloor t / \tau_{\rm mix} \rfloor}.$$

Consequently, if $t \ge \tau_{\rm mix} \cdot \lceil \log_2(1/\varepsilon) \rceil$, then $\|\rho_t - \pi\|_{\rm TV} \le \varepsilon$.
This result motivates us to develop techniques to bound the mixing time $\tau_{\rm mix}$.
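Before turning to couplings, it may help to see the definition of the mixing time in action. Here is a minimal numerical sketch (my own illustration, using an arbitrary toy 3-state chain) that finds $\tau_{\rm mix}$ by brute force; it uses the fact that the maximum of $\|\rho_t - \pi\|_{\rm TV}$ over initial distributions is attained at a point mass.

```python
import numpy as np

# Toy 3-state transition matrix (rows sum to 1); any ergodic chain would do.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

# Stationary distribution: the left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def tv_distance(mu, nu):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(mu - nu).sum()

def mixing_time(P, pi, eps=0.25, t_max=10_000):
    """Smallest t with max over starting distributions of TV(rho_t, pi) <= eps.
    The worst case is attained at a point mass, so we track one distribution
    per starting state (the rows of P^t)."""
    dists = np.eye(len(P))
    for t in range(1, t_max + 1):
        dists = dists @ P
        if max(tv_distance(row, pi) for row in dists) <= eps:
            return t
    return None

print("mixing time:", mixing_time(P, pi))  # prints 1 for this fast-mixing toy chain
```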
Couplings and Convergence
In the previous post, we made essential use of couplings of probability distributions. In this post, we will extend this notion by using couplings of entire Markov chains.
Definition (coupling of Markov chains). A coupling of Markov chains with transition matrix $P$ is a pair of processes $x_0, x_1, x_2, \ldots$ and $y_0, y_1, y_2, \ldots$ such that (A) each process is individually a Markov chain with transition matrix $P$; in particular,

$$\mathbb{P}\{x_{t+1} = j \mid x_t = i\} = \mathbb{P}\{y_{t+1} = j \mid y_t = i\} = P_{ij};$$

and (B) if $x_t = y_t$, then $x_{t+1} = y_{t+1}$.
A coupling of Markov chains thus consists of a pair of Markov chains which become glued together should they ever touch.
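In code, a coupling might look something like the following sketch (my own illustration, not from the post; letting the chains move independently before they meet is just one valid choice of coupling):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(state, P):
    """Take one step of a finite-state Markov chain with transition matrix P."""
    return rng.choice(len(P), p=P[state])

def coupled_run(P, x0, y0, t_max=10_000):
    """A coupling of two chains with the same transition matrix P: the chains
    move independently until they meet and are glued together afterwards.
    Returns the meeting time, or None if they never meet within t_max steps."""
    x, y = x0, y0
    for t in range(t_max):
        if x == y:
            return t  # glued: from this point on, both chains make the same moves
        x = step(x, P)
        y = step(y, P)  # independent updates before meeting
    return None
```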
There are several ways we can use couplings to bound the mixing time of Markov chains. Here is one way. Let $x_0, x_1, \ldots$ and $y_0, y_1, \ldots$ be a coupling of Markov chains for which $x_0$ is initialized in some initial distribution $\rho_0$ and $y_0$ is initialized in the stationary distribution $\pi$. Let $\tau_{\rm meet}$ be the first time at which $x_t$ and $y_t$ meet:

$$\tau_{\rm meet} := \min \{ t \ge 0 : x_t = y_t \}.$$

The following theorem shows that the tails of the random variable $\tau_{\rm meet}$ control the convergence of the Markov chain to stationarity.

Theorem (convergence from couplings). For every $t \ge 0$, $\|\rho_t - \pi\|_{\rm TV} \le \mathbb{P}\{\tau_{\rm meet} > t\}$.
This result is an immediate application of the coupling lemma from the last post. Using this bound, we can bound the mixing time:
Corollary (mixing time from couplings). $\tau_{\rm mix} \le 4 \max_{\rho_0} \mathbb{E}[\tau_{\rm meet}]$, where the maximum is taken over all initial distributions $\rho_0$.
Indeed, by the above theorem and Markov's inequality (for more on concentration inequalities such as Markov's inequality, see my post on the subject),

$$\|\rho_t - \pi\|_{\rm TV} \le \mathbb{P}\{\tau_{\rm meet} > t\} \le \frac{\mathbb{E}[\tau_{\rm meet}]}{t}.$$

Take a maximum over initial distributions $\rho_0$ and set $t := 4 \max_{\rho_0} \mathbb{E}[\tau_{\rm meet}]$. Then the left-hand side is at most $1/4$, so

$$\max_{\rho_0} \|\rho_t - \pi\|_{\rm TV} \le \frac{1}{4}.$$

By the definition of the mixing time, $\tau_{\rm mix} \le 4 \max_{\rho_0} \mathbb{E}[\tau_{\rm meet}]$, which is the corollary.
Bit Strings
There are a number of examples of using the method of couplings to prove bounds on mixing times (see, for instance, section 5.3 of Markov Chains and Mixing Times), but we will focus on a simple but illustrative example: a lazy random walk on the set of bit strings.
A bit string is a sequence of 0’s and 1’s. Bit strings are used to store and represent information on digital computers. For instance, the character “a” is stored as “1100001” using the ASCII encoding.
The Chain
Our example is a Markov chain whose states consist of all bit strings of length $n$. For each step of the chain,
- With 50% probability, we do nothing and leave the state unchanged.
- With 50% probability, we choose a (uniform) random position from $1$ to $n$ and flip the bit in that position, changing a 0 to a 1 or a 1 to a 0.
One can think of this Markov chain as modeling a random process in which a bit string is corrupted by errors that flip a single bit. The stationary distribution for this chain is the uniform distribution on all bit strings (i.e., each bit string of length $n$ has the same probability $2^{-n}$). Therefore, one can also think of this Markov chain as a very inefficient way of generating a string of random bits. (Another way of thinking about this Markov chain is as a random walk on the vertices of an $n$-dimensional hypercube graph: with 50% probability, we stay put, and with 50% probability, we move along a random edge.) Because we have a 50% probability of doing nothing, we call this process a lazy random walk.
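Here is a minimal simulation of this chain (my own sketch, not from the post; the string length 8 and step count are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def lazy_step(x):
    """One step of the lazy random walk on bit strings of length n = len(x)."""
    x = x.copy()
    if rng.random() < 0.5:
        return x                    # with 50% probability, do nothing
    i = rng.integers(len(x))        # otherwise, flip a uniformly random bit
    x[i] ^= 1
    return x

x = np.zeros(8, dtype=int)          # start from the all-zeros string
for _ in range(100):
    x = lazy_step(x)
print(x)                            # after many steps, approximately uniform
```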
Designing the Coupling
Let us use the method of couplings to determine how fast this chain mixes. First, observe that there’s an alternative way of stating the transition rule for this Markov chain:
- Pick a (uniform) random position from $1$ to $n$ and set the bit in that position to 0 with 50% probability and to 1 with 50% probability.
Indeed, under this rule, half of the time we leave the state unchanged, because we set the bit in the selected position to the value it already had. The other half of the time, we flip the selected bit.
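As a sanity check (my own, not from the post), we can verify with exact arithmetic that the two rules induce the same transition probabilities: the chain stays put with probability $1/2$ and moves to each of the $n$ neighboring strings with probability $1/(2n)$.

```python
from fractions import Fraction

n = 8  # arbitrary string length for this check

# Rule 1 (lazy): stay with prob 1/2; flip each position with prob 1/2 * 1/n.
stay_lazy = Fraction(1, 2)
flip_lazy = Fraction(1, 2) * Fraction(1, n)

# Rule 2 (set a random position to a random bit): the chain stays put when the
# drawn bit equals the current one (prob 1/n * 1/2, summed over the n positions),
# and flips position i when position i is chosen and the opposite bit is drawn.
stay_set = n * (Fraction(1, n) * Fraction(1, 2))
flip_set = Fraction(1, n) * Fraction(1, 2)

assert stay_lazy == stay_set and flip_lazy == flip_set
print(stay_lazy, flip_lazy)  # 1/2 and 1/(2n)
```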
Now, we construct a coupling. Initialize $x_0$ in an arbitrary distribution $\rho_0$ and $y_0$ in the stationary distribution $\pi$ (uniform over all bit strings). The key to constructing a coupling will be to use the same randomness to update the state of both the "$x$" chain and the "$y$" chain. For this chain, there's a natural way to do this:
At time $t$, pick a (uniform) random position $i_t \in \{1, \ldots, n\}$ and a (uniform) random bit $b_t \in \{0, 1\}$. Set $x_{t+1}$ by changing the $i_t$th bit of $x_t$ to $b_t$, and set $y_{t+1}$ by changing the $i_t$th bit of $y_t$ to $b_t$.

We couple the two chains by setting the same position $i_t$ to the same bit $b_t$ in both chains.
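In code, the coupling might look like this (my own sketch; the start at the all-zeros string and the choices $n = 16$ and 200 trials are arbitrary, and the printed average estimates $\mathbb{E}[\tau_{\rm meet}]$ for that start):

```python
import numpy as np

rng = np.random.default_rng(0)

def meeting_time(n, t_max=100_000):
    """Simulate the coupled chains: x starts at the all-zeros string and y at
    a uniformly random string (the stationary distribution). Each step, both
    chains receive the same random position and the same random bit. Returns
    the first time the chains coincide."""
    x = np.zeros(n, dtype=int)
    y = rng.integers(2, size=n)
    for t in range(t_max):
        if np.array_equal(x, y):
            return t
        i = rng.integers(n)   # the same random position for both chains...
        b = rng.integers(2)   # ...and the same random bit
        x[i] = b
        y[i] = b
    return None

n = 16
times = [meeting_time(n) for _ in range(200)]
print("average meeting time:", np.mean(times))
```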
Bounding the Mixing Time
To bound the mixing time, we need to understand when these two chains meet. Observe that if we pick position $i$ to set at some time, the two Markov chains have the same bit in position $i$ at every later time. Consequently,

If all of the positions $1, \ldots, n$ have been chosen by time $t$, then $x_t = y_t$.
The two chains might meet before all of the positions have been chosen, but they are guaranteed to meet after.
Set $\tau_{\rm all}$ to be the first time at which all positions $1, \ldots, n$ have been selected at least once. Then our observation is equivalent to saying that the meeting time $\tau_{\rm meet}$ is at most $\tau_{\rm all}$:

$$\tau_{\rm meet} \le \tau_{\rm all}.$$
Thus, to bound the mixing time, it is sufficient to compute the expected value of $\tau_{\rm all}$.
Collecting Coupons
Computing the expected value of $\tau_{\rm all}$ is actually an example of a very famous problem in probability theory known as the coupon collector problem. The classic version goes like this:
A cereal company is running a promotion where each box of cereal contains a coupon, which is equally likely to be any one of $n$ different colors. A coupon of each color can be traded in for a toy. What is the expected number of boxes we need to open in order to qualify for the toy?
The coupon collector problem is exactly the same as our problem, with the $n$ different colors of coupons playing the same role as the $n$ different positions in our bit strings. Going forward, let's use the language of coupons and boxes to describe this problem, as it's more colorful.

When tackling a hard question like computing $\mathbb{E}[\tau_{\rm all}]$, it can be helpful to break it into easier pieces. Let's start with the simplest possible question: How many picks do we need to collect the first coupon? Just one. We always get a coupon in our first box.
With that question answered, let's ask the next-simplest question: How many picks do we need to collect the second coupon, colored differently from the first? There are $n$ colors and $n - 1$ of them are different from the first. So the probability of picking a new coupon in the second box is $(n-1)/n$. In fact,

Until we succeed at picking the second unique coupon, every box has an independent $(n-1)/n$ chance of containing such a coupon.

The number of boxes needed is therefore a geometric random variable with success probability $(n-1)/n$, and its mean is $n/(n-1)$. On average, we need $n/(n-1)$ picks to collect the second coupon.
The reasoning is the same for the third coupon. There are $n - 2$ coupons we haven't collected and $n$ total, so the probability of picking the third coupon in each box is $(n-2)/n$. Thus, the number of picks is a geometric random variable with success probability $(n-2)/n$ and mean $n/(n-2)$.
In general, we need an expected $n/(n-j+1)$ picks to collect the $j$th coupon. Adding up the expected number of picks for each $j = 1, 2, \ldots, n$, we obtain that the total expected number of picks required to collect all the coupons is

$$\mathbb{E}[\tau_{\rm all}] = \sum_{j=1}^{n} \frac{n}{n-j+1} = n \sum_{j=1}^{n} \frac{1}{j}.$$

More concisely, if we let $H_n$ denote the $n$th harmonic number

$$H_n := \sum_{j=1}^{n} \frac{1}{j},$$

then $\mathbb{E}[\tau_{\rm all}] = n H_n$. Since the $n$th harmonic number is at most $\ln n + 1$, we have

$$\mathbb{E}[\tau_{\rm all}] \le n (\ln n + 1).$$
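A quick simulation (my own sketch; $n = 64$ and the trial count are arbitrary) confirms the formula $\mathbb{E}[\tau_{\rm all}] = n H_n$ and the logarithmic bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def coupon_collector(n):
    """Number of boxes opened until all n coupon colors have been seen."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(int(rng.integers(n)))  # each box holds a uniformly random color
        boxes += 1
    return boxes

n = 64
trials = [coupon_collector(n) for _ in range(2000)]
H_n = sum(1.0 / j for j in range(1, n + 1))
print("simulated mean :", np.mean(trials))
print("n * H_n        :", n * H_n)
print("n * (ln n + 1) :", n * (np.log(n) + 1))
```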
Conclusion
Putting all of these ingredients together, we obtain

$$\tau_{\rm mix} \le 4 \max_{\rho_0} \mathbb{E}[\tau_{\rm meet}] \le 4\, \mathbb{E}[\tau_{\rm all}] = 4 n H_n \le 4 n (\ln n + 1).$$

The mixing time is (at most) proportional to $n \log n$.
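For example, for bit strings of length $n = 64$, this bound gives $\tau_{\rm mix} \le 4 \cdot 64 \cdot (\ln 64 + 1) \approx 1321$ steps.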