Note to Self: Hanson–Wright and Trace Estimation with Random Vectors on the Sphere

Tyler Chen has just released an excellent monograph on the Lanczos method, one of the most powerful algorithms for extracting information from a symmetric matrix $A$ . I would definitely recommend checking it out.

One of the subjects touched on by this book is stochastic trace estimation, a topic I’ve written about several times before on this blog. In his book, Tyler uses stochastic trace estimates based on random vectors on the sphere (a great choice), and he provides a simple, but often pessimistic, mathematical analysis of this approach. To complement Tyler’s amazing book, I felt it was a good opportunity to share a sharper analysis that I came up with using a more sophisticated argument. I would not be surprised if this argument appears elsewhere in the literature, but I am not aware of a source.

Let $x,x_1,\ldots,x_m$ be random vectors in $\real^n$ drawn uniformly at random from the sphere of radius $\sqrt{n}$ ,¹ let $A$ be a real symmetric matrix, and define the quadratic form

$q \coloneqq x^\top Ax$

and the Girard–Hutchinson trace estimator

$\hat{\tr}\coloneqq \frac{1}{m}\sum_{i=1}^m x_i^\top Ax_i.$

Both $q$ and $\hat{\tr}$ are unbiased estimates of the trace of $A$: $\expect[q] = \expect[\hat{\tr}] = \tr A$. The goal of this post is to bound the probability of these quantities being much smaller or larger than the trace of $A$.
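As a concrete illustration of the setup, here is a minimal NumPy sketch. It assumes the usual recipe of normalizing a standard Gaussian vector to obtain a uniform point on the sphere; the helper names `sample_sphere` and `girard_hutchinson` and the test matrix are hypothetical choices made for this example.

```python
import numpy as np

def sample_sphere(n, m, rng):
    """Return an (m, n) array whose rows are uniform on the sphere of radius sqrt(n)."""
    g = rng.standard_normal((m, n))
    # Normalizing a Gaussian vector gives a uniformly random direction
    return np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)

def girard_hutchinson(A, m, rng):
    """Average of m quadratic forms x_i^T A x_i with x_i uniform on the sphere."""
    X = sample_sphere(A.shape[0], m, rng)
    # Row i of (X @ A) * X sums to x_i^T A x_i
    return ((X @ A) * X).sum(axis=1).mean()

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
A = (B + B.T) / 2  # an arbitrary symmetric test matrix
estimate = girard_hutchinson(A, 2000, rng)
print(estimate, np.trace(A))
```

With $m = 2000$ samples the estimate should land close to the trace, though how close depends on the matrix.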

Hanson–Wright on the Sphere

Observe that $\hat{\tr}$ is an average of $m$ independent copies of the random variable $q$ . Therefore, we will begin our analysis with the quadratic form $q$ and discuss the Girard–Hutchinson estimate $\hat{\tr}$ at the end of the post.

Centering

Begin by introducing the centered matrix

$\overline{A} = A - \frac{\tr A}{n}I.$

The transformation $A\mapsto\overline{A}$ has the effect of shifting $A$ ’s eigenvalues to have mean zero. Consequently, since the trace is the sum of the eigenvalues, $\overline{A}$ has trace zero.

Now, rewrite $q$ in terms of the centered matrix $\overline{A}$ :

$q = x^\top A x = x^\top \overline{A} x + \frac{\tr A}{n} \cdot x^\top I x = x^\top \overline{A} x + \tr A.$

In the final equality, we use the fact that $x$ is on the sphere of radius $\sqrt{n}$ so that $x^\top I x = \norm{x}^2 = n$ . Rearranging, we see that the error $\overline{q}\coloneqq q-\tr A$ satisfies

$\overline{q} = q-\tr A = x^\top \overline{A}x.$

We have shown that the error $\overline{q}$ depends only on the centered matrix $\overline{A}$ rather than the original matrix $A$. This observation is the root reason that quadratic forms with vectors on the sphere have smaller tail probabilities than quadratic forms with other distributions, such as the Gaussian distribution. Indeed, $\overline{A}$ is never larger than $A$ when measured in the Frobenius norm, $\norm{\overline A}_{\rm F} \le \norm{A}_{\rm F}$, and its spectral norm is never much larger, $\norm{\overline{A}} \le 2\norm{A}$. These observations will be important later.
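The centering identity and the two norm comparisons are easy to check numerically. The following is a small sketch on one arbitrary symmetric matrix (shifted to have a large trace); the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
B = rng.standard_normal((n, n))
A = (B + B.T) / 2 + 5.0 * np.eye(n)  # symmetric matrix with a large trace

# Centered matrix: subtract the mean eigenvalue times the identity
A_bar = A - (np.trace(A) / n) * np.eye(n)

# One uniform random vector on the sphere of radius sqrt(n)
g = rng.standard_normal(n)
x = np.sqrt(n) * g / np.linalg.norm(g)

error = x @ A @ x - np.trace(A)   # q - tr(A)
centered_form = x @ A_bar @ x     # x^T A_bar x

fro_A, fro_A_bar = np.linalg.norm(A, "fro"), np.linalg.norm(A_bar, "fro")
spec_A, spec_A_bar = np.linalg.norm(A, 2), np.linalg.norm(A_bar, 2)

print(error, centered_form)       # agree up to rounding
print(fro_A_bar, fro_A)           # centered Frobenius norm is no larger
print(spec_A_bar, 2 * spec_A)     # centered spectral norm is at most twice as large
```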

Laplace Transform Method and Comparison to Standard Gaussian Vectors

To derive exponential concentration inequalities, we make use of the Laplace transform method. For a random variable $z$ , define the cumulant generating function (cgf)

$\xi_z(\theta) = \log (\expect [\exp(\theta z)]).$

Bounds on the cgf immediately yield bounds on the random variable taking large values, so it will behoove us to estimate this quantity for $\overline{q}$ .

To estimate the cgf of $\overline{q}$ , we will make use of a comparison between Gaussian vectors and random vectors on the sphere. Recall that a standard Gaussian vector $g$ is a vector with independent standard Gaussian entries. Standard Gaussian random vectors are spherically symmetric, meaning they can be written in the form $g = a\cdot x$ , where $x$ is a uniform random vector on the sphere and $a$ is a scaling factor.² The scaling factor $a$ is independent of $x$ and, in this case, satisfies $\expect[a^2] = 1$ .³

Using this relationship, the error $\overline{q}$ can be connected to the standard Gaussian random vector $g$ as follows

$g^\top \overline{A} g = a^2 \cdot x^\top \overline{A} x = a^2 \cdot \overline{q}.$

Intuitively, we might expect that multiplying the random error $\overline{q}$ by a scaling factor of mean one should only have the effect of increasing the variability of $\overline{q}$ and thus increasing its probability of taking large values. Based on this intuition, we might conjecture that the cgfs are related

$\xi_{\overline{q}}(\theta) \le \xi_{g^\top \overline{A} g}(\theta) \quad \text{for all } \theta \in \real.$

Indeed, this conjecture is true.

Proposition (Scaling hurts). Let $z$ be a random variable and let $b$ be a random variable with expectation $1$ , independent of $z$ . Then
$\xi_{z}(\theta) \le \xi_{bz}(\theta) \quad \text{for all } \theta \in \real.$

To prove this result, let $\expect_b$ and $\expect_z$ denote expectations taken over the randomness in $b$ and $z$ separately. The total expectation is $\expect[\cdot] = \expect_z[\expect_b[\cdot ]]$. Begin with the right-hand side and invoke Jensen’s inequality over $b$:

$\xi_{bz}(\theta) = \log (\expect_z [\expect_b [\exp(\theta bz)]]) \ge \log (\expect_z [\exp(\theta \expect_b[b]z)]) = \xi_z(\theta).$

In the last step, we use the hypothesis $\expect[b] = 1$ and the definition of the cgf.

Having established this proposition, we conclude that $\xi_{\overline{q}}(\theta) \le \xi_{g^\top \overline{A} g}(\theta)$ for all $\theta$ .
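The "scaling hurts" proposition can be sanity-checked exactly on a toy example with finitely many outcomes, where the cgfs are finite sums. The distributions below are hypothetical and chosen only so that $\expect[b] = 1$; the helper `cgf` is written for this sketch.

```python
import numpy as np

def cgf(values, probs, theta):
    """Cumulant generating function log E[exp(theta * Z)] of a finite discrete variable."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    return np.log(np.sum(probs * np.exp(theta * values)))

# Toy distributions (assumptions for illustration): z in {-1, 2}, b in {0.5, 1.5}
z_vals, z_probs = [-1.0, 2.0], [0.6, 0.4]
b_vals, b_probs = [0.5, 1.5], [0.5, 0.5]   # E[b] = 1

# By independence, b*z takes the value b_i * z_j with probability P(b_i) * P(z_j)
bz_vals = np.outer(b_vals, z_vals).ravel()
bz_probs = np.outer(b_probs, z_probs).ravel()

thetas = np.linspace(-3, 3, 61)
gaps = np.array([cgf(bz_vals, bz_probs, t) - cgf(z_vals, z_probs, t) for t in thetas])
print(gaps.min())  # nonnegative up to rounding, as the proposition predicts
```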

Completing the Analysis

Finally, invoking a standard bound for the cgf of $g^\top \overline{A}g$ for a trace-zero matrix $\overline{A}$ , we obtain

$\xi_{\overline{q}}(\theta) \le \xi_{g^\top \overline{A} g}(\theta) \le \frac{\theta^2 \norm{\overline{A}}_{\rm F}^2}{1 - 2\theta \norm{\overline{A}}} \quad \text{for } 0 \le \theta < \frac{1}{2\norm{\overline{A}}}.$

A cgf bound of this form immediately yields the following tail bound (see Boucheron, Lugosi, and Massart’s Concentration Inequalities page 29 and Exercise 2.8):

$\prob \{ x^\top A x - \tr(A) \ge t \} \le \exp \left( - \frac{t^2/2}{2 \norm{\overline{A}}_{\rm F}^2 + 2 \norm{\overline{A}}t} \right).$
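For readers who want to see where the constants come from, here is a brief sketch of the matching step. The symbols $v$ and $c$ are notation introduced only for this remark, following the sub-gamma form in Boucheron, Lugosi, and Massart. A random variable $z$ whose cgf satisfies

$\xi_z(\theta) \le \frac{\theta^2 v}{2(1 - c\theta)} \quad \text{for } 0 \le \theta < \frac{1}{c}$

obeys the tail bound

$\prob \{ z \ge t \} \le \exp \left( -\frac{t^2}{2(v + ct)} \right).$

Comparing with the cgf bound for $\overline{q}$ gives $v = 2\norm{\overline{A}}_{\rm F}^2$ and $c = 2\norm{\overline{A}}$, and substituting these values yields the exponent $\frac{t^2/2}{2\norm{\overline{A}}_{\rm F}^2 + 2\norm{\overline{A}}t}$ appearing in the upper tail bound for $x^\top A x - \tr(A)$.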

One can also control the lower tail event $x^\top A x - \tr(A) \le -t$ by instantiating this result with $-A$ . Union bounding over the upper and lower tails gives the symmetric bound

$\prob \{ |x^\top A x - \tr(A)| \ge t \} \le 2\exp \left( - \frac{t^2/2}{2 \norm{\overline{A}}_{\rm F}^2 + 2 \norm{\overline{A}}t} \right).$
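To get a feel for the upper tail bound, one can compare it against an empirical tail frequency. The sketch below uses a hypothetical diagonal test matrix and a made-up helper `sphere_tail_bound`; it is a Monte Carlo illustration, not a proof.

```python
import numpy as np

def sphere_tail_bound(A, t):
    """Upper-tail bound exp(-(t^2/2) / (2 ||A_bar||_F^2 + 2 ||A_bar|| t))."""
    n = A.shape[0]
    A_bar = A - (np.trace(A) / n) * np.eye(n)
    fro2 = np.linalg.norm(A_bar, "fro") ** 2
    spec = np.linalg.norm(A_bar, 2)
    return np.exp(-(t ** 2 / 2) / (2 * fro2 + 2 * spec * t))

rng = np.random.default_rng(2)
n, trials = 40, 20000
A = np.diag(np.linspace(0.0, 1.0, n))  # assumed test matrix

# Rows of X are uniform on the sphere of radius sqrt(n)
G = rng.standard_normal((trials, n))
X = np.sqrt(n) * G / np.linalg.norm(G, axis=1, keepdims=True)
errors = ((X @ A) * X).sum(axis=1) - np.trace(A)

results = []
for t in [1.0, 2.0, 4.0]:
    empirical = np.mean(errors >= t)
    results.append((t, empirical, sphere_tail_bound(A, t)))
    print(t, empirical, sphere_tail_bound(A, t))
```

In each row, the empirical frequency should sit below the bound.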

Is This Bound Good?

Tail bounds for quadratic forms such as $x^\top A x$ or $g^\top A g$ are known as Hanson–Wright inequalities. For vectors on the sphere, our bound shows the tail probabilities decay like $\exp(-O(t^2/\norm{\overline{A}}_{\rm F}^2))$ for small $t > 0$ and $\exp(-O(t/\norm{\overline{A}}))$ for large $t>0$. This pattern of results, “subgaussian scaling for small $t$, subexponential scaling for large $t$”, is typical for Hanson–Wright inequalities.

The Hanson–Wright inequalities for vectors on the sphere exhibit a faster decay rate than those for standard Gaussian vectors. Indeed, a standard Gaussian random vector $g$ obeys a Hanson–Wright inequality of the form

$\prob \{ g^\top A g - \tr A \ge t \} \le \exp \left( -\frac{t^2/2}{C\norm{A}_{\rm F}^2 + c t \norm{A}}\right)$

for positive constants $C,c > 0$. For vectors on the sphere, $\norm{A}_{\rm F}^2$ and $\norm{A}$ have been replaced by $\norm{\overline{A}}_{\rm F}^2$ and $\norm{\overline{A}}$. The centered Frobenius norm is never larger than the original (and is sometimes much smaller), while the centered spectral norm is at most twice the original. The smaller tail probabilities for $x^\top A x$ versus $g^\top A g$ are a real phenomenon, not an artifact of the mathematical analysis.

As another way to evaluate the quality of these bounds, recall that a Gaussian random variable with mean zero and variance $v$ has tail probabilities roughly of size $\exp(-t^2/2v)$ . For small $t$ , our Hanson–Wright inequality for vectors on the sphere has the same form with $v = 2\norm{\overline{A}}_{\rm F}^2$ . The true variance of $q = x^\top A x$ is

$\Var(q) = \frac{n}{n+2} \cdot 2\norm{\overline{A}}_{\rm F}^2,$

which is close to $v = 2\norm{\overline{A}}_{\rm F}^2$ for large $n$ . (See this post of mine for a derivation of the variance formula.) Thus, even up to the constants, the Hanson–Wright inequality we’ve derived for random vectors on the sphere is nearly as good as one could possibly hope (at least for small $t$ ).
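The variance formula can be checked by simulation. The following sketch compares the sample variance of $q$ over many draws with $\frac{n}{n+2}\cdot 2\norm{\overline{A}}_{\rm F}^2$; the matrix, dimension, and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 200000
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                               # arbitrary symmetric test matrix
A_bar = A - (np.trace(A) / n) * np.eye(n)       # centered matrix

# Rows of X are uniform on the sphere of radius sqrt(n)
G = rng.standard_normal((trials, n))
X = np.sqrt(n) * G / np.linalg.norm(G, axis=1, keepdims=True)
q = ((X @ A) * X).sum(axis=1)                   # one quadratic form per row

formula = n / (n + 2) * 2 * np.linalg.norm(A_bar, "fro") ** 2
print(q.var(), formula)
```

The two printed numbers should agree to within Monte Carlo error.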

Girard–Hutchinson Estimator on the Sphere

Using our results for one quadratic form, tail bounds for the Girard–Hutchinson estimator $\hat{\tr} = m^{-1} \sum_{i=1}^m x_i^\top A x_i$ easily follow. Indeed, the cgf is additive for independent random variables and satisfies the identity $\xi_{cz}(\theta) = \xi_z(c\theta)$ for constant $c$, so

$\xi_{\hat{\tr} - \tr A}(\theta) = \sum_{i=1}^m \xi_{x_i^\top \overline{A}x_i}(\theta/m) \le m \cdot \frac{(\theta/m)^2 \norm{\overline{A}}_{\rm F}^2}{1 - 2(\theta/m)\norm{\overline{A}}}.$

This cgf bound leads to tail bounds

$\begin{align*}\prob \{ \hat{\tr} - \tr(A) \ge t \} &\le \exp \left( - \frac{mt^2/2}{2\norm{\overline{A}}_{\rm F}^2 + 2 \norm{\overline{A}}t} \right), \\ \prob \{ |\hat{\tr} - \tr(A)| \ge t \} &\le 2\exp \left( - \frac{mt^2/2}{2\norm{\overline{A}}_{\rm F}^2 + 2 \norm{\overline{A}}t} \right).\end{align*}$
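As a cross-check on the Laplace transform step, the Chernoff bound $\inf_\theta \exp(\xi(\theta) - \theta t)$ can be evaluated numerically from the cgf bound for $m$ samples. This is a rough sketch: the function name `chernoff_bound`, the grid search, and the numerical values of $\norm{\overline{A}}_{\rm F}^2$ and $\norm{\overline{A}}$ are assumptions made for illustration, and the cgf bound is used only on the range $0 \le \theta < m/(2\norm{\overline{A}})$ where its denominator is positive.

```python
import numpy as np

def chernoff_bound(fro2, spec, m, t, grid=20000):
    """Minimize exp(xi(theta) - theta * t) over a grid of theta values.

    xi(theta) = m * (theta/m)^2 * fro2 / (1 - 2 * (theta/m) * spec) is the
    cgf bound for the average of m quadratic forms.
    """
    # Stay strictly inside 0 <= theta < m / (2 * spec) so the denominator is positive
    thetas = np.linspace(0.0, m / (2 * spec), grid, endpoint=False)
    s = thetas / m
    xi = m * s ** 2 * fro2 / (1 - 2 * s * spec)
    return np.exp(np.min(xi - thetas * t))

# Hypothetical values of ||A_bar||_F^2 and ||A_bar||, and a target deviation t
fro2, spec, t = 3.5, 0.5, 1.0
ms = [1, 10, 100]
bounds = [chernoff_bound(fro2, spec, m, t) for m in ms]
for m, b in zip(ms, bounds):
    print(m, b)
```

The printed bounds shrink as $m$ grows, reflecting the benefit of averaging more quadratic forms.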