Note to Self: Hanson–Wright and Trace Estimation with Random Vectors on the Sphere

Tyler Chen has just released an excellent monograph on the Lanczos method, one of the most powerful algorithms for extracting information from a symmetric matrix A. I would definitely recommend checking it out.

One of the subjects touched on by this book is stochastic trace estimation, a topic I’ve written about several times before on this blog. In his book, Tyler uses stochastic trace estimates based on random vectors on the sphere (a great choice), and he provides a simple, but often pessimistic, mathematical analysis of this approach. To complement Tyler’s amazing book, I felt it was a good opportunity to share a sharper analysis that I came up with using a more sophisticated argument. I would not be surprised if this argument appears elsewhere in the literature, but I am not aware of a source.

Let x,x_1,\ldots,x_m be random vectors in \real^n drawn uniformly at random from the sphere of radius \sqrt{n},1Informally, we define the uniform distribution on the sphere to be such that the probability that x belongs to a set U is the ratio of the (hyper-)surface area of U to the (hyper-)surface area of the full sphere. Formally, the uniform distribution on the sphere can be defined multiple ways. First, treating the sphere as a subset of \real^n, we can define the uniform distribution as the (n-1)-dimensional Hausdorff measure, normalized so that the sphere has unit measure. Second, observe that the sphere is acted upon in a natural way by the group of orthogonal matrices. The uniform distribution is the unique probability measure which is invariant to this group action. let A be a real symmetric matrix, and define the quadratic form

    \[q \coloneqq x^\top Ax\]

and the Girard–Hutchinson trace estimator

    \[\hat{\tr}\coloneqq \frac{1}{m}\sum_{i=1}^m x_i^\top Ax_i.\]

Both q and \hat{\tr} are unbiased estimates for the trace of A, \expect[q] = \expect[\hat{\tr}] = \tr A. The goal of this post will be to bound the probability of these quantities being much smaller or larger than the trace of A.

Hanson–Wright on the Sphere

Observe that \hat{\tr} is an average of m independent copies of the random variable q. Therefore, we will begin our analysis with the quadratic form q and discuss the Girard–Hutchinson estimate \hat{\tr} at the end of the post.

Centering

Begin by introducing the centered matrix

    \[\overline{A} = A - \frac{\tr A}{n}I.\]

The transformation A\mapsto\overline{A} has the effect of shifting A’s eigenvalues to have mean zero. Consequently, since the trace is the sum of the eigenvalues, \overline{A} has trace zero.

Now, rewrite q in terms of the centered matrix \overline{A}:

    \[q = x^\top A x = x^\top \overline{A} x + \frac{\tr A}{n} \cdot x^\top I x = x^\top \overline{A} x + \tr A.\]

In the final equality, we use the fact that x is on the sphere of radius \sqrt{n} so that x^\top I x = \norm{x}^2 = n. Rearranging, we see that the error \overline{q}\coloneqq q-\tr A satisfies

    \[\overline{q} = q-\tr A = x^\top \overline{A}x.\]

We have shown that the error \overline{q} depends only on the centered matrix \overline{A} rather than the original matrix A. This observation is at the root of the reason that quadratic forms with vectors on the sphere have smaller tail probabilities than other distributions like the Gaussian distribution. Indeed, \overline{A} is smaller than A when measured using the Frobenius norm \norm{\overline A}_{\rm F} \le \norm{A}_{\rm F}, and the spectral norm is never much larger \norm{\overline{A}} \le 2\norm{A}. These observations will be important later.

Laplace Transform Method and Comparison to Standard Gaussian Vectors

To derive exponential concentration inequalities, we make use of the Laplace transform method. For a random variable z, define the cumulant generating function (cgf)

    \[\xi_z(\theta) = \log (\expect [\exp(\theta z)]).\]

Bounds on the cgf immediately yield bounds on the random variable taking large values, so it will behoove us to estimate this quantity for \overline{q}.

To estimate the cgf of \overline{q}, we will make use of a comparison between Gaussian vectors and random vectors on the sphere. Recall that a standard Gaussian vector g is a vector with independent standard Gaussian entries. Standard Gaussian random vectors are spherically symmetric, meaning they can be written in the form g = a\cdot x, where x is a uniform random vector on the sphere and a is a scaling factor.2Concretely, a\sqrt{n} is a \chi random variable with n degrees of freedom. The scaling factor a is independent of x and, in this case, satisfies \expect[a^2] = 1.3Indeed, the expected squared length of a standard Gaussian random vector is \expect[\norm{g}^2] = \sum_{i=1}^n \expect[g_i^2] = n. But also, \expect[\norm{g}^2] = \expect[a^2] \cdot \expect[\norm{x}^2] = \expect[a^2] \cdot n. Therefore, \expect[a^2]=1.

Using this relationship, the error \overline{q} can be connected to the standard Gaussian random vector g as follows

    \[g^\top \overline{A} g = a^2 \cdot x^\top \overline{A} x = a^2 \cdot \overline{q}.\]

Intuitively, we might expect that multiplying the random error \overline{q} by a scaling factor of mean one should only have the effect of increasing the variability of \overline{q} and thus increasing its probability of taking large values. Based on this intuition, we might conjecture that the cgfs are related

    \[\xi_{\overline{q}}(\theta) \le \xi_{g^\top A g}(\theta) \quad \text{for all } \theta \in \real.\]

Indeed, this conjecture is true.

Proposition (Scaling hurts). Let z be a random variable and let b be a random variable with expectation 1, independent of z. Then

    \[\xi_{z}(\theta) \le \xi_{bz}(\theta) \quad \text{for all } \theta \in \real.\]

To prove this result, let \expect_b and \expect_z denote expectations take over the randomness in b and z separately. The total expectation is \expect[\cdot] = \expect_z[\expect_b[\cdot ]]. Begin with the right-hand side and invoke Jensen’s inequality over b:

    \[\xi_{bz}(\theta) = \xi_z(\theta) = \log (\expect_z [\expect_b [\exp(\theta bz)]]) \ge \log (\expect_z [\exp(\theta \expect_b[b]z)]) = \xi_z(\theta).\]

In the last line, we use the hypothesis \expect[b] = 1 and the definition of the cgf.

Having established this proposition, we conclude that \xi_{\overline{q}}(\theta) \le \xi_{g^\top \overline{A} g}(\theta) for all \theta.

Completing the Analysis

Finally, invoking a standard bound for the cgf of g^\top \overline{A}g for a trace-zero matrix \overline{A}, we obtain

    \[\xi_{\overline{q}}(\theta) \le \xi_{g^\top \overline{A} g}(\theta) \le \frac{\theta^2 \norm{\overline{A}}_{\rm F}^2}{1 - 2\theta \norm{\overline{A}}}.\]

A cgf bound of this form immediately yields the following tail bound (see BoucheronLugosi, and Massart’s Concentration Inequalities page 29 and Exercise 2.8):

    \[\prob \{ x^\top A x - \tr(A) \ge t \} \le \exp \left( - \frac{t^2/2}{2 \norm{\overline{A}}_{\rm F}^2 + 2 \overline{\norm{\overline{A}}t}} \right).\]

One can also control the lower tail event x^\top A x - \tr(A) \le -t by instantiating this result with -A. Union bounding over the upper and lower tails gives the symmetric bound

    \[\prob \{ |x^\top A x - \tr(A)| \ge t \} \le 2\exp \left( - \frac{t^2/2}{2 \norm{\overline{A}}_{\rm F}^2 + 2 \overline{\norm{\overline{A}}t}} \right).\]

Is This Bound Good?

Tail bounds for quadratic forms such as x^\top A x or g^\top A g are known as Hanson–Wright inequalities. For vectors on the sphere, our bound shows the tail probabilities decay like \exp(-O(t^2/\norm{\overline{A}_{\rm F}}^2)) for small t > 0 and \exp(-O(t/\norm{\overline{A}})) for large t>0. This pattern of results “subgaussian scaling for small t, subexponential scaling for large t” is typical for Hanson–Wright inequalities.

The Hanson–Wright inequalities for vectors on the sphere exhibit faster decay rate than for standard Gaussian vectors. Indeed, a standard Gaussian random vector g obeys a Hanson–Wright inequality of the form

    \[\prob \{ g^\top A g - \tr A \ge t \} \le \exp \left( -\frac{t^2/2}{C\norm{A}_{\rm F}^2 + c t \norm{A}}\right)\]

for positive constants C,c > 0. For vectors on the sphere, \norm{A}_{\rm F}^2 and \norm{A} have been replaced by \norm{\overline{A}}_{\rm F}^2 and \norm{\overline{A}} which are always smaller (and sometimes much smaller). The smaller tail probabilities for x^\top A x versus g^\top A g is a real phenomenon, not an artifact of the mathematical analysis.

As another way to evaluate the quality of these bounds, recall that a Gaussian random variable with mean zero and variance v has tail probabilities roughly of size \exp(-t^2/2v). For small t, our Hanson–Wright inequality for vectors on the sphere has the same form with v = 2\norm{\overline{A}}_{\rm F}^2. The true variance of q = x^\top A x is

    \[\Var(q) = \frac{n}{n+2} \cdot 2\norm{\overline{A}}_{\rm F}^2,\]

which is close to v = 2\norm{\overline{A}}_{\rm F}^2 for large n. (See this post of mine for a derivation of the variance formula.) Thus, even up to the constants, the Hanson–Wright inequality we’ve derived for random vectors on the sphere is nearly as good as one could possibly hope (at least for small t).

Girard–Hutchinson Estimator on the Sphere

Using our results for one quadratic form, tail bounds for the Girard–Hutchinson estimator \overline{\tr} = m^{-1} \sum_{i=1}^m x_i^\top A x_i easily follow. Indeed, the cgf is additive for independent random variables and satisfies the identity \xi_{cz}(\theta) = \xi_z(c\theta) for constant $c, so

    \[\xi_{\overline{\tr}}(\theta) = \sum_{i=1}^m \xi_{x_i^\top Ax_i}(\theta/m) \le m \cdot \frac{(\theta/m)^2 \norm{\overline{A}}_{\rm F}^2}{1 - 2(\theta/m)\norm{\overline{A}}}.\]

This cgf bound leads to tail bounds

    \begin{align*}\prob \{ \overline{\tr} - \tr(A) \ge t \} &\le \exp \left( - \frac{t^2/2}{2m \norm{\overline{A}}_{\rm F}^2 + 2m \overline{\norm{\overline{A}}t}} \right), \\\prob \{ |\overline{\tr} - \tr(A)| \ge t \} &\le 2\exp \left( - \frac{t^2/2}{2m \norm{\overline{A}}_{\rm F}^2 + 2m \overline{\norm{\overline{A}}t}} \right).\end{align*}

Leave a Reply

Your email address will not be published. Required fields are marked *