Tyler Chen has just released an excellent monograph on the Lanczos method, one of the most powerful algorithms for extracting information from a symmetric matrix $A$. I would definitely recommend checking it out.
One of the subjects touched on by this book is stochastic trace estimation, a topic I’ve written about several times before on this blog. In his book, Tyler uses stochastic trace estimates based on random vectors on the sphere (a great choice), and he provides a simple, but often pessimistic, mathematical analysis of this approach. To complement Tyler’s amazing book, I felt it was a good opportunity to share a sharper analysis that I came up with using a more sophisticated argument. I would not be surprised if this argument appears elsewhere in the literature, but I am not aware of a source.
Let $x_1,\ldots,x_m$ be random vectors in $\mathbb{R}^n$ drawn uniformly at random from the sphere of radius $\sqrt{n}$,[1] let $A \in \mathbb{R}^{n\times n}$ be a real symmetric matrix, and define the quadratic form

$$q := x_1^\top A x_1$$

and the Girard–Hutchinson trace estimator

$$\hat{\mathrm{tr}} := \frac{1}{m} \sum_{i=1}^m x_i^\top A x_i.$$

[1] Informally, we define the uniform distribution on the sphere to be such that the probability that $x_i$ belongs to a set $U$ is the ratio of the (hyper-)surface area of $U$ to the (hyper-)surface area of the full sphere. Formally, the uniform distribution on the sphere can be defined multiple ways. First, treating the sphere as a subset of $\mathbb{R}^n$, we can define the uniform distribution as the $(n-1)$-dimensional Hausdorff measure, normalized so that the sphere has unit measure. Second, observe that the sphere is acted upon in a natural way by the group of orthogonal matrices. The uniform distribution is the unique probability measure which is invariant to this group action.
Both $q$ and $\hat{\mathrm{tr}}$ are unbiased estimates for the trace of $A$: $\mathbb{E}[q] = \mathbb{E}[\hat{\mathrm{tr}}] = \operatorname{tr} A$. The goal of this post will be to bound the probability of these quantities being much smaller or larger than the trace of $A$.
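To make the setup concrete, here is a minimal numerical sketch (the helper names `sample_sphere` and `girard_hutchinson` are my own, purely illustrative). It samples uniformly from the sphere of radius $\sqrt{n}$ by normalizing Gaussian vectors and averages the quadratic forms:

```python
import numpy as np

def sample_sphere(n, m, rng):
    """Draw m vectors (columns) uniformly from the sphere of radius sqrt(n) in R^n."""
    g = rng.standard_normal((n, m))
    return np.sqrt(n) * g / np.linalg.norm(g, axis=0)  # normalize each column

def girard_hutchinson(A, m, rng):
    """Girard-Hutchinson trace estimate: average of m quadratic forms x_i^T A x_i."""
    x = sample_sphere(A.shape[0], m, rng)
    return np.mean(np.sum(x * (A @ x), axis=0))

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
A = (A + A.T) / 2  # a random symmetric test matrix
print(girard_hutchinson(A, m=100, rng=rng), np.trace(A))
```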
Hanson–Wright on the Sphere
Observe that $\hat{\mathrm{tr}}$ is an average of $m$ independent copies of the random variable $q$. Therefore, we will begin our analysis with the quadratic form $q$ and discuss the Girard–Hutchinson estimate $\hat{\mathrm{tr}}$ at the end of the post.
Centering
Begin by introducing the centered matrix

$$\bar{A} := A - \frac{\operatorname{tr} A}{n}\, I.$$
The transformation $A \mapsto \bar{A}$ has the effect of shifting $A$'s eigenvalues to have mean zero. Consequently, since the trace is the sum of the eigenvalues, $\bar{A}$ has trace zero.
Now, rewrite $q$ in terms of the centered matrix $\bar{A}$:

$$q = x_1^\top A x_1 = x_1^\top \bar{A} x_1 + \frac{\operatorname{tr} A}{n}\, x_1^\top x_1 = x_1^\top \bar{A} x_1 + \operatorname{tr} A.$$
In the final equality, we use the fact that $x_1$ is on the sphere of radius $\sqrt{n}$ so that $x_1^\top x_1 = \|x_1\|^2 = n$. Rearranging, we see that the error satisfies

$$q - \operatorname{tr} A = x_1^\top \bar{A} x_1.$$
We have shown that the error $q - \operatorname{tr} A$ depends only on the centered matrix $\bar{A}$ rather than the original matrix $A$. This observation is at the root of the reason that quadratic forms with vectors on the sphere have smaller tail probabilities than quadratic forms with vectors from other distributions like the Gaussian distribution. Indeed, $\bar{A}$ is smaller than $A$ when measured using the Frobenius norm,

$$\|\bar{A}\|_F^2 = \|A\|_F^2 - \frac{(\operatorname{tr} A)^2}{n} \le \|A\|_F^2,$$

and the spectral norm is never much larger, $\|\bar{A}\| \le 2\|A\|$. These observations will be important later.
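These facts are easy to sanity-check numerically; here is a short continuation of the sketch above (illustrative only):

```python
Abar = A - (np.trace(A) / n) * np.eye(n)   # centered matrix
x = sample_sphere(n, 1, rng)[:, 0]
q = x @ A @ x
assert np.isclose(q - np.trace(A), x @ Abar @ x)                     # error depends only on Abar
assert np.linalg.norm(Abar, "fro") <= np.linalg.norm(A, "fro")       # Frobenius norm shrinks
assert np.linalg.norm(Abar, 2) <= 2 * np.linalg.norm(A, 2) + 1e-8    # spectral norm at most doubles
```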
Laplace Transform Method and Comparison to Standard Gaussian Vectors
To derive exponential concentration inequalities, we make use of the Laplace transform method. For a random variable $X$, define the cumulant generating function (cgf)

$$\xi_X(\lambda) := \log \mathbb{E} \exp(\lambda X).$$
Bounds on the cgf immediately yield bounds on the probability of the random variable taking large values, so it will behoove us to estimate this quantity for $q - \operatorname{tr} A$.
To estimate the cgf of $q - \operatorname{tr} A$, we will make use of a comparison between Gaussian vectors and random vectors on the sphere. Recall that a standard Gaussian vector $g \in \mathbb{R}^n$ is a vector with independent standard Gaussian entries. Standard Gaussian random vectors are spherically symmetric, meaning they can be written in the form $g = a \cdot x$, where $x$ is a uniform random vector on the sphere of radius $\sqrt{n}$ and $a$ is a scaling factor.[2] The scaling factor $a$ is independent of $x$ and, in this case, satisfies $\mathbb{E}[a^2] = 1$.[3]

[2] Concretely, $na^2 = \|g\|^2$ is a $\chi^2$ random variable with $n$ degrees of freedom.

[3] Indeed, the expected squared length of a standard Gaussian random vector is $\mathbb{E}\|g\|^2 = n$. But also, $\mathbb{E}\|g\|^2 = \mathbb{E}[a^2] \cdot \mathbb{E}\|x\|^2 = \mathbb{E}[a^2] \cdot n$. Therefore, $\mathbb{E}[a^2] = 1$.
Using this relationship, the error $q - \operatorname{tr} A$ can be connected to the standard Gaussian random vector $g = a \cdot x_1$ as follows:

$$g^\top \bar{A} g = a^2 \cdot x_1^\top \bar{A} x_1 = a^2\, (q - \operatorname{tr} A).$$
Intuitively, we might expect that multiplying the random error $q - \operatorname{tr} A$ by a scaling factor $a^2$ of mean one should only have the effect of increasing the variability of $q - \operatorname{tr} A$ and thus increasing its probability of taking large values. Based on this intuition, we might conjecture that the cgfs are related:

$$\xi_{q - \operatorname{tr} A}(\lambda) \le \xi_{g^\top \bar{A} g}(\lambda) \quad \text{for all } \lambda \in \mathbb{R}.$$
Indeed, this conjecture is true.
Proposition (Scaling hurts). Let $X$ be a random variable and let $b$ be a random variable with expectation $\mathbb{E}\, b = 1$, independent of $X$. Then

$$\xi_X(\lambda) \le \xi_{bX}(\lambda) \quad \text{for all } \lambda \in \mathbb{R}.$$
To prove this result, let $\mathbb{E}_X$ and $\mathbb{E}_b$ denote expectations taken over the randomness in $X$ and $b$ separately. The total expectation is $\mathbb{E} = \mathbb{E}_X \mathbb{E}_b$. Begin with the right-hand side and invoke Jensen's inequality over $b$:

$$\xi_{bX}(\lambda) = \log \mathbb{E}_X \mathbb{E}_b \exp(\lambda b X) \ge \log \mathbb{E}_X \exp\big(\lambda\, (\mathbb{E}_b\, b)\, X\big) = \log \mathbb{E}_X \exp(\lambda X) = \xi_X(\lambda).$$
In the last line, we use the hypothesis $\mathbb{E}_b\, b = 1$ and the definition of the cgf.
Having established this proposition, we apply it with $X = q - \operatorname{tr} A$ and $b = a^2$ to conclude that $\xi_{q - \operatorname{tr} A}(\lambda) \le \xi_{g^\top \bar{A} g}(\lambda)$ for all $\lambda \in \mathbb{R}$.
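As an entirely optional numerical illustration of this conclusion (continuing the sketches above, with my own choice of $\lambda$), one can compare empirical moment generating functions for the two quadratic forms; the Gaussian average should dominate:

```python
# Compare empirical mgfs of q - tr(A) (sphere) and g^T Abar g (Gaussian) at one lambda.
lam = 0.1 / np.linalg.norm(Abar, 2)
trials = 20000
xs = sample_sphere(n, trials, rng)
gs = rng.standard_normal((n, trials))
err_sphere = np.sum(xs * (Abar @ xs), axis=0)    # samples of q - tr(A)
chaos_gauss = np.sum(gs * (Abar @ gs), axis=0)   # samples of g^T Abar g
print(np.mean(np.exp(lam * err_sphere)), "<=", np.mean(np.exp(lam * chaos_gauss)))
```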
Completing the Analysis
Finally, invoking a standard bound for the cgf of $g^\top \bar{A} g$ for a trace-zero matrix $\bar{A}$, we obtain

$$\xi_{q - \operatorname{tr} A}(\lambda) \le \xi_{g^\top \bar{A} g}(\lambda) \le \frac{\lambda^2 \|\bar{A}\|_F^2}{1 - 2\lambda \|\bar{A}\|} \quad \text{for all } 0 \le \lambda < \frac{1}{2\|\bar{A}\|}.$$
A cgf bound of this form immediately yields the following tail bound (see Boucheron, Lugosi, and Massart's Concentration Inequalities page 29 and Exercise 2.8):

$$\mathbb{P}\{q - \operatorname{tr} A \ge t\} \le \exp\left(-\frac{t^2}{4(\|\bar{A}\|_F^2 + \|\bar{A}\|\, t)}\right) \quad \text{for all } t \ge 0.$$
One can also control the lower tail event $\{q - \operatorname{tr} A \le -t\}$ by instantiating this result with $-A$ in place of $A$. Union bounding over the upper and lower tails gives the symmetric bound

$$\mathbb{P}\{|q - \operatorname{tr} A| \ge t\} \le 2\exp\left(-\frac{t^2}{4(\|\bar{A}\|_F^2 + \|\bar{A}\|\, t)}\right) \quad \text{for all } t \ge 0.$$
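Continuing the numerical sketch from above, one can check this symmetric bound against empirical tail frequencies at an illustrative threshold $t$ (my own choice):

```python
fro2 = np.linalg.norm(Abar, "fro") ** 2
spec = np.linalg.norm(Abar, 2)
t = 2 * np.sqrt(fro2)  # a test threshold on the order of the standard deviation
bound = 2 * np.exp(-t**2 / (4 * (fro2 + spec * t)))
empirical = np.mean(np.abs(err_sphere) >= t)  # reuse the sphere samples from above
print(f"empirical tail {empirical:.4f} <= bound {bound:.4f}")
```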
Is This Bound Good?
Tail bounds for quadratic forms such as $x_1^\top A x_1$ or $g^\top A g$ are known as Hanson–Wright inequalities. For vectors on the sphere, our bound shows the tail probabilities decay like $\exp(-t^2/(4\|\bar{A}\|_F^2))$ for small $t$ and like $\exp(-t/(4\|\bar{A}\|))$ for large $t$. This pattern of results, "subgaussian scaling for small $t$, subexponential scaling for large $t$", is typical for Hanson–Wright inequalities.
The Hanson–Wright inequality for vectors on the sphere exhibits a faster decay rate than the corresponding inequality for standard Gaussian vectors. Indeed, a standard Gaussian random vector $g$ obeys a Hanson–Wright inequality of the form

$$\mathbb{P}\{|g^\top A g - \operatorname{tr} A| \ge t\} \le 2\exp\left(-\min\left\{\frac{c_1 t^2}{\|A\|_F^2},\, \frac{c_2 t}{\|A\|}\right\}\right)$$
for positive constants $c_1$ and $c_2$. For vectors on the sphere, $\|A\|_F$ and $\|A\|$ have been replaced by the centered quantities $\|\bar{A}\|_F$ and $\|\bar{A}\|$: the Frobenius norm $\|\bar{A}\|_F$ is always smaller than $\|A\|_F$ (and sometimes much smaller), and the spectral norm $\|\bar{A}\|$ is never more than twice $\|A\|$. The smaller tail probabilities for vectors on the sphere versus Gaussian vectors are a real phenomenon, not an artifact of the mathematical analysis.
As another way to evaluate the quality of these bounds, recall that a Gaussian random variable with mean zero and variance $\sigma^2$ has tail probabilities roughly of size $\exp(-t^2/(2\sigma^2))$. For small $t$, our Hanson–Wright inequality for vectors on the sphere has the same form with $\sigma^2 = 2\|\bar{A}\|_F^2$. The true variance of $q$ is

$$\operatorname{Var}(q) = \frac{2n}{n+2}\, \|\bar{A}\|_F^2,$$
which is close to $2\|\bar{A}\|_F^2$ for large $n$. (See this post of mine for a derivation of the variance formula.) Thus, even up to the constants, the Hanson–Wright inequality we've derived for random vectors on the sphere is nearly as good as one could possibly hope (at least for small $t$).
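The variance formula is likewise easy to confirm by Monte Carlo, reusing the sphere samples from the sketch above:

```python
predicted = (2 * n / (n + 2)) * fro2   # Var(q) = 2n/(n+2) * ||Abar||_F^2
empirical = np.var(err_sphere)         # Var(q - tr A) = Var(q), since tr A is constant
print(f"empirical {empirical:.1f} vs predicted {predicted:.1f}")
```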
Girard–Hutchinson Estimator on the Sphere
Using our results for one quadratic form, tail bounds for the Girard–Hutchinson estimator easily follow. Indeed, the cgf is additive for independent random variables and satisfies the identity $\xi_{cX}(\lambda) = \xi_X(c\lambda)$ for a constant $c$, so

$$\xi_{\hat{\mathrm{tr}} - \operatorname{tr} A}(\lambda) = \sum_{i=1}^m \xi_{(x_i^\top A x_i - \operatorname{tr} A)/m}(\lambda) = m\, \xi_{q - \operatorname{tr} A}\!\left(\frac{\lambda}{m}\right) \le \frac{\lambda^2\, \|\bar{A}\|_F^2 / m}{1 - 2\lambda \|\bar{A}\|/m} \quad \text{for all } 0 \le \lambda < \frac{m}{2\|\bar{A}\|}.$$
This cgf bound leads to tail bounds

$$\mathbb{P}\{|\hat{\mathrm{tr}} - \operatorname{tr} A| \ge t\} \le 2\exp\left(-\frac{m\, t^2}{4(\|\bar{A}\|_F^2 + \|\bar{A}\|\, t)}\right) \quad \text{for all } t \ge 0.$$
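As a practical corollary, inverting this bound gives a sufficient number of samples $m$ for a target error $t$ and failure probability $\delta$. Here is a sketch of that calculation (the function `samples_needed` and the parameter choices are my own, derived by solving the displayed bound for $m$):

```python
def samples_needed(A, t, delta):
    """Smallest m for which the bound gives P(|trace est - tr A| >= t) <= delta."""
    n = A.shape[0]
    Abar = A - (np.trace(A) / n) * np.eye(n)
    fro2 = np.linalg.norm(Abar, "fro") ** 2
    spec = np.linalg.norm(Abar, 2)
    return int(np.ceil(4 * (fro2 + spec * t) * np.log(2 / delta) / t**2))

print(samples_needed(A, t=0.05 * np.linalg.norm(A, "fro"), delta=0.01))
```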