Stochastic Trace Estimation

I am delighted to share that my paper with Joel A. Tropp and Robert J. Webber, XTrace: Making the Most of Every Sample in Stochastic Trace Estimation, has recently been released as a preprint on arXiv. In it, we consider the implicit trace estimation problem:

Implicit trace estimation problem: Given access to a square matrix A via the matrix–vector product operation \omega \mapsto A\omega, estimate its trace \tr A = \sum_{i=1}^n A_{ii}.

Algorithms for this task have many uses such as log-determinant computations in machine learning, partition function calculations in statistical physics, and generalized cross validation for smoothing splines. I described another application to counting triangles in a large network in a previous blog post.

Our paper presents new trace estimators XTrace and XNysTrace which are highly efficient, producing accurate trace approximations using a small budget of matrix–vector products. In addition, these algorithms are fast to run and are supported by theoretical results which explain their excellent performance. I really hope that you will check out the paper to learn more about these estimators!

For the rest of this post, I’m going to talk about the most basic stochastic trace estimation algorithm, the Girard–Hutchinson estimator. This seemingly simple algorithm exhibits a number of nuances and forms the backbone for more sophisticated trace estimates such as Hutch++, Nyström++, XTrace, and XNysTrace. Toward the end, this blog post will be fairly mathematical, but I hope that the beginning will be fairly accessible to all.

Girard–Hutchinson Estimator: The Basics

The Girard–Hutchinson estimator for the trace of a square matrix A is

(1)   \[\hat{\tr} = \frac{1}{m} \sum_{i=1}^m \omega_i^* A \omega_i. \]

Here, \omega_1,\ldots,\omega_m are random vectors, usually chosen to be statistically independent, and {}^* denotes the conjugate transpose of a vector or matrix. The Girard–Hutchinson estimator only depends on the matrix A through the matrix–vector products A\omega_1,\ldots,A\omega_m.
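To make formula (1) concrete, here is a minimal sketch in Python with NumPy. The names `girard_hutchinson` and `matvec`, the choice of random-sign test vectors, and the test matrix are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def girard_hutchinson(matvec, n, m, rng):
    """Estimate tr(A) using m matrix-vector products with random-sign test vectors."""
    total = 0.0
    for _ in range(m):
        omega = rng.choice([-1.0, 1.0], size=n)   # isotropic: E[omega omega^T] = I
        total += omega @ matvec(omega)            # one sample of omega^T A omega
    return total / m

# Hypothetical test matrix: symmetric, with the diagonal shifted so that tr(A) = 10.
n = 50
rng = np.random.default_rng(0)
G = rng.standard_normal((n, n))
A = (G + G.T) / (2 * n)
A += (10.0 - np.trace(A)) / n * np.eye(n)

# The estimator only touches A through the product v -> A @ v.
estimate = girard_hutchinson(lambda v: A @ v, n, m=200, rng=rng)
print(np.trace(A), estimate)
```

The estimator never inspects the entries of A, so `matvec` could equally wrap a sparse matrix or a linear solver.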

Unbiasedness

Provided the random vectors are isotropic

(2)   \[\mathbb{E} [\omega_i\omega_i^*] = I, \]

the Girard–Hutchinson estimator is unbiased:

(3)   \[\mathbb{E} [\hat{\tr}] = \tr A.\]

Let us confirm this claim in some detail. First, we use linearity of expectation to evaluate

(4)   \[\mathbb{E} [\hat{\tr}] = \mathbb{E} \left[ \frac{1}{m} \sum_{i=1}^m \omega_i^*A\omega_i \right] = \frac{1}{m} \sum_{i=1}^m \mathbb{E} \left[ \omega_i^* A \omega_i\right]. \]

Therefore, to prove that \mathbb{E} [\hat{\tr}] = \tr A, it is sufficient to prove that \mathbb{E} \left[\omega_i^*A\omega_i\right] = \tr A for each i.

When working with traces, there are two tricks that solve 90% of derivations. The first trick is that, if we view a number as a 1\times 1 matrix, then a number equals its trace, x = \tr x. The second trick is the cyclic property: For a k\times p matrix B and a p\times k matrix C, we have \tr (BC) = \tr (CB). The cyclic property should be handled with care when one works with a product of three or more matrices. For instance, we have

    \[\tr[BCD] = \tr[(BC)D] = \tr[D(BC)] = \tr[DBC].\]

However,

    \[\tr [BCD] \ne \tr[CBD] \quad \text{in general}.\]

One should think of the matrix product BCD as beads on a closed loop of string. One can move the last bead D to the front of the other two, \tr [BCD] = \tr[DBC], but not interchange two beads, \tr[BCD] \ne \tr[CBD].
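The beads-on-a-string picture can be checked on a tiny example. The three 2×2 matrices below are chosen by hand purely for illustration, so that the traces are easy to verify on paper.

```python
import numpy as np

# Hand-picked 2x2 matrices (illustrative only).
B = np.array([[0.0, 1.0], [0.0, 0.0]])
C = np.array([[0.0, 0.0], [1.0, 0.0]])
D = np.array([[1.0, 0.0], [0.0, 0.0]])

t_BCD = np.trace(B @ C @ D)   # 1.0
t_DBC = np.trace(D @ B @ C)   # 1.0: cyclic shift, same trace
t_CBD = np.trace(C @ B @ D)   # 0.0: two beads swapped, different trace

print(t_BCD, t_DBC, t_CBD)
```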

With these tricks in hand, let’s return to proving that \mathbb{E} \left[\omega_i^*A\omega_i\right] = \tr A for every i. Apply our two tricks:

    \[\mathbb{E} \left[\omega_i^*A\omega_i\right] = \mathbb{E} \tr \left[\omega_i^*A\omega_i\right] = \mathbb{E} \tr \left[A\omega_i\omega_i^*\right].\]

The expectation is a linear operation and the matrix A is non-random, so we can bring the expectation into the trace as

    \[\mathbb{E} \left[\omega_i^*A\omega_i\right] = \mathbb{E} \tr \left[A\omega_i\omega_i^*\right] = \tr(A \mathbb{E}[\omega_i\omega_i^*] ).\]

Invoke the isotropy condition (2) and conclude:

    \[\mathbb{E} \left[\omega_i^*A\omega_i\right] = \tr(A \mathbb{E}[\omega_i\omega_i^*] ) = \tr(A\cdot I) = \tr A.\]

Plugging this into (4) confirms the unbiasedness claim (3).
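For random-sign test vectors, unbiasedness can be verified exactly on a small example, because the expectation is a plain average over all 2^n sign patterns. The 3×3 matrix below is an arbitrary illustration; note that it need not be symmetric.

```python
import itertools
import numpy as np

A = np.array([[2.0, -1.0, 0.5],
              [3.0,  0.0, 4.0],
              [1.0,  2.0, -1.0]])   # arbitrary nonsymmetric example with tr(A) = 1

# Every sign vector is equally likely, so E[omega^T A omega] is the average over all 8.
samples = [np.array(s) @ A @ np.array(s)
           for s in itertools.product([-1.0, 1.0], repeat=3)]
mean = sum(samples) / len(samples)

print(mean, np.trace(A))
```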

Variance

Continue to assume that the \omega_i’s are isotropic (2) and now assume that \omega_1,\ldots,\omega_m are independent. By independence, the variance can be written as

    \[\Var(\hat{\tr}) = \frac{1}{m^2} \sum_{i=1}^m \Var(\omega_i^*A\omega_i).\]

Assuming that \omega_1,\ldots,\omega_m are identically distributed \omega_1,\ldots,\omega_m \sim \omega, we then get

    \[\Var(\hat{\tr}) = \frac{1}{m} \Var(\omega^*A\omega).\]

The variance decreases like 1/m, which is characteristic of Monte Carlo-type algorithms. Since \hat{\tr} is unbiased (i.e., (3) holds), the mean square error equals the variance and therefore also decays like 1/m, so the average error (more precisely, the root-mean-square error) decays like

    \[\left| \hat{\tr} - \tr A \right| \lessapprox \frac{\mathrm{const}}{\sqrt{m}}.\]

This type of convergence is very slow. If I want to decrease the error by a factor of 10, I must do 100\times the work!
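The 100× rule can be seen in a small simulation. This is a sketch under assumptions of my choosing (a random 5×5 symmetric matrix, random-sign test vectors, 2000 repeated trials); it measures the root-mean-square error with 4 samples and with 400 samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, m_max = 5, 2000, 400
G = rng.standard_normal((n, n))
A = (G + G.T) / 2                                    # hypothetical symmetric test matrix

Omega = rng.choice([-1.0, 1.0], size=(trials, m_max, n))
q = np.einsum("tmi,ij,tmj->tm", Omega, A, Omega)     # single-sample values omega^T A omega

def rmse(m):
    """Root-mean-square error of the m-sample estimator, measured across trials."""
    estimates = q[:, :m].mean(axis=1)
    return np.sqrt(np.mean((estimates - np.trace(A)) ** 2))

ratio = rmse(4) / rmse(400)
print(ratio)   # expected to be near sqrt(400 / 4) = 10, up to sampling noise
```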

Variance-reduced trace estimators like Hutch++ and our new trace estimator XTrace improve the rate of convergence substantially. Even in the worst case, Hutch++ and XTrace reduce the variance at a rate 1/m^2 and (root-mean-square) error at rates 1/m:

    \[\Var(\hat{\tr}_{\text{H++ or X}}) \le \frac{\mathrm{const}}{m^2},\quad \left| \hat{\tr}_{\text{H++ or X}} - \tr A \right| \lessapprox \frac{\mathrm{const}}{m}.\]

For matrices with rapidly decreasing singular values, the variance and error can decrease much faster than this.

Variance Formulas

As the rate of convergence for the Girard–Hutchinson estimator is so slow, it is imperative to pick a distribution on test vectors \omega that makes the variance of the single-sample estimate \omega^*A\omega as low as possible. In this section, we will provide several explicit formulas for the variance of the Girard–Hutchinson estimator. Derivations of these formulas will appear at the end of this post. These variance formulas help illuminate the benefits and drawbacks of different test vector distributions.

To express the formulas, we will need some notation. For a complex number z = a + bi we use \Re(z) = a and \Im(z) = b to denote the real and imaginary parts. The variance of a random complex number z is

    \[\Var(z) := \mathbb{E} |z - \mathbb{E} z|^2 = \Var(\Re z) + \Var(\Im z).\]

The Frobenius norm of a matrix A is

    \[\left\|A\right\|_{\rm F}^2 = \sum_{i,j} |A_{ij}|^2.\]

If A is real symmetric or complex Hermitian with (real) eigenvalues \lambda_1,\ldots,\lambda_n, we have

(5)   \[\left\|A\right\|_{\rm F}^2 = \sum_{i=1}^n \lambda_i^2. \]

A^\top denotes the ordinary transpose of A and A^* denotes the conjugate transpose of A.

Real-Valued Test Vectors

We first focus on real-valued test vectors \omega. Since \omega is real, we can use the ordinary transpose {}^\top rather than the conjugate transpose {}^*. Since \omega^\top A\omega is a number, it is equal to its own transpose:

    \[\omega^\top A \omega = (\omega^\top A \omega)^\top = \omega^\top A^\top \omega.\]

Therefore,

    \[\omega^\top A\omega = \frac{\omega^\top A \omega + \omega^\top A^\top \omega}{2} = \omega^\top \left( \frac{A + A^\top}{2} \right)\omega.\]

The Girard–Hutchinson trace estimator applied to A is the same as the Girard–Hutchinson estimator applied to the symmetric part of A, (A+A^\top)/2.

For the following results, assume A is symmetric, A = A^\top.

  1. Real Gaussian: \omega_1,\ldots,\omega_m are independent standard normal random vectors.

        \[\Var(\omega^\top A\omega) = 2 \left\|A\right\|_{\rm F}^2.\]

  2. Uniform signs (Rademachers): \omega_1,\ldots,\omega_m are independent random vectors with uniform \pm 1 coordinates.

        \[\Var(\omega^\top A \omega) = 2\sum_{i\ne j} |A_{ij}|^2.\]

  3. Real sphere: \omega_1,\ldots,\omega_m are independent and uniformly distributed on the real sphere of radius \sqrt{n}: \omega \sim \text{Uniform} \{x\in \mathbb{R}^n : x^\top x = n\}.

        \[\Var(\omega^\top A\omega) = \frac{2n}{n+2} \left( \left\|A\right\|_{\rm F}^2 - \frac{1}{n} |\tr A|^2 \right).\]

These formulas continue to hold for nonsymmetric A by replacing A by its symmetric part (A+A^\top)/2 on the right-hand sides of these variance formulas.
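The three real-valued formulas are easy to tabulate side by side. The helper below is a sketch; the function name and the two example matrices are mine, chosen so the numbers can be checked by hand.

```python
import numpy as np

def variance_formulas_real(A):
    """Single-sample variances of omega^T A omega for the three real distributions."""
    S = (A + A.T) / 2                           # only the symmetric part matters
    n = S.shape[0]
    fro2 = np.sum(S ** 2)                       # ||S||_F^2
    offdiag2 = fro2 - np.sum(np.diag(S) ** 2)   # sum over i != j of S_ij^2
    return {
        "gaussian": 2 * fro2,
        "signs": 2 * offdiag2,
        "sphere": 2 * n / (n + 2) * (fro2 - np.trace(S) ** 2 / n),
    }

v_identity = variance_formulas_real(np.eye(4))                     # 8, 0, 0
v_offdiag = variance_formulas_real(np.ones((4, 4)) - np.eye(4))    # 24, 24, 16
print(v_identity, v_offdiag)
```

For the identity matrix, signs and the sphere give zero variance because \omega^\top \omega = n exactly, while Gaussian vectors still fluctuate in length.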

Complex-Valued Test Vectors

We now move our focus to complex-valued test vectors \omega. As a rule of thumb, one should typically expect that the variance for complex-valued test vectors applied to a real symmetric matrix A is about half that of the natural real counterpart. For example, complex Gaussians give half the variance of real Gaussians.

A square complex matrix has a Cartesian decomposition

    \[A = A^{\rm H} + i A^{\rm SH}\]

where

    \[A^{\rm H} = \frac{A+A^*}{2} ,\quad A^{\rm SH} = \frac{A - A^*}{2i}\]

denote the Hermitian and skew-Hermitian parts of A. Similar to how the imaginary part of a complex number is real, the skew-Hermitian part of a complex matrix is Hermitian (and i A^{\rm SH} is skew-Hermitian). Since A^{\rm H} and A^{\rm SH} are both Hermitian, we have

    \[\Re(\omega^* A\omega) = \omega^* A^{\rm H} \omega, \quad \Im (\omega^* A \omega) = \omega^* A^{\rm SH} \omega.\]

Consequently, the variance of \omega^*A \omega can be broken into Hermitian and skew-Hermitian parts:

    \[\Var(\omega^* A\omega) = \Var(\omega^* A^{\rm H}\omega) + \Var(\omega^* A^{\rm SH}\omega).\]

For this reason, we will state the variance formulas only for Hermitian A, with the formula for general A following from the Cartesian decomposition.
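The Cartesian decomposition and the split of \omega^*A\omega into real and imaginary parts can be confirmed numerically. The random matrix and vector below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
omega = rng.standard_normal(n) + 1j * rng.standard_normal(n)

A_H = (A + A.conj().T) / 2       # Hermitian part
A_SH = (A - A.conj().T) / 2j     # "skew-Hermitian part" in the convention above (itself Hermitian)

q = omega.conj() @ A @ omega
q_H = omega.conj() @ A_H @ omega
q_SH = omega.conj() @ A_SH @ omega

print(np.isclose(q.real, q_H.real), np.isclose(q.imag, q_SH.real))
```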

For the following results, assume A is Hermitian, A = A^*.

  1. Complex Gaussian: \omega_1,\ldots,\omega_m are independent standard complex Gaussian random vectors, i.e., each \omega_i has iid entries distributed as (g_1+ig_2)/\sqrt{2} for independent standard normal random variables g_1,g_2.

        \[\Var(\omega^* A\omega) = \left\|A\right\|_{\rm F}^2.\]

  2. Uniform phases (Steinhauses): \omega_1,\ldots,\omega_m are independent random vectors whose entries are independent and uniform on the complex unit circle \{ z \in \complex : |z| = 1 \}.

        \[\Var(\omega^* A \omega) = \sum_{i\ne j} |A_{ij}|^2.\]

  3. Complex sphere: \omega_1,\ldots,\omega_m are independent and uniformly distributed on the complex sphere of radius \sqrt{n}: \omega \sim \text{Uniform} \{x\in \complex^n : x^* x = n\}.

        \[\Var(\omega^* A\omega) = \frac{n}{n+1} \left( \left\|A\right\|_{\rm F}^2 - \frac{1}{n} |\tr A|^2 \right).\]
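The factor-of-two rule of thumb for Gaussians can be observed in a quick Monte Carlo experiment. This is a sketch with an arbitrary 5×5 real symmetric matrix and 200,000 samples per distribution; the sample variances only approximate the formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 200_000
G = rng.standard_normal((n, n))
A = (G + G.T) / 2                                    # real symmetric test matrix

W_real = rng.standard_normal((N, n))
W_cplx = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2)

q_real = np.einsum("ki,ij,kj->k", W_real, A, W_real)
q_cplx = np.einsum("ki,ij,kj->k", W_cplx.conj(), A, W_cplx).real

fro2 = np.sum(A ** 2)
ratio_real = q_real.var() / fro2    # formula predicts 2
ratio_cplx = q_cplx.var() / fro2    # formula predicts 1
print(ratio_real, ratio_cplx)
```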

Optimality Properties

Let us finally address the question of what the best choice of test vectors is for the Girard–Hutchinson estimator. We will state two results with different restrictions on \omega_1,\ldots,\omega_m.

Our first result, due to Hutchinson, is valid for real symmetric matrices with real test vectors.

Optimality (independent test vectors with independent coordinates). If the test vectors \omega_1,\ldots,\omega_m \in \mathbb{R}^n are isotropic (2), independent from each other, and have independent entries, then for any fixed real symmetric matrix A, the minimum variance for \hat{\tr} is obtained when \omega_1,\ldots,\omega_m are populated with random signs (\omega_i)_j \sim \textnormal{Uniform} \{\pm 1\}.

The next optimality results will have real and complex versions. To present the results for \mathbb{R}-valued and \complex-valued test vectors on unified footing, let \field denote either \mathbb{R} or \complex. We let a \field-Hermitian matrix be either a real symmetric matrix (if \field = \mathbb{R}) or a complex Hermitian matrix (if \field = \complex). Let a \field-unitary matrix be either a real orthogonal matrix (if \field = \mathbb{R}) or a complex unitary matrix (if \field = \complex).

The condition that the vectors \omega_1,\ldots,\omega_m have independent entries is often too restrictive in practice. It rules out, for instance, the case of uniform vectors on the sphere. If we relax this condition, we get a different optimal distribution:

Optimality (independent test vectors). Consider any set \mathscr{A} of \field-Hermitian matrices which is invariant under \field-unitary similarity transformations:

    \[\text{If $A \in \mathscr{A}$ and $U$ is $\field$-unitary, then $U^*AU \in \mathscr{A}$.}\]

Assume that the test vectors \omega_1,\ldots,\omega_m are independent and isotropic (2). The worst-case variance \sup_{A \in \mathscr{A}} \Var(\hat{\tr}) is minimized by choosing \omega_1,\ldots,\omega_m uniformly on the \field-sphere: \omega_1,\ldots,\omega_m \sim \text{Uniform} \{ x \in \field^n : x^*x =n \}.

More simply, if you want your stochastic trace estimator to be effective for a class of inputs \mathscr{A} (closed under \field-unitary similarity transformations) rather than a single input matrix A, then the best choice is to draw the test vectors uniformly from the sphere. Examples of classes of matrices \mathscr{A} include:

  • Fixed eigenvalues. For fixed real eigenvalues \lambda_1,\ldots,\lambda_n \in \mathbb{R}, the set of all \field-Hermitian matrices with these eigenvalues.
  • Density matrices. The class of all trace-one psd matrices.
  • Frobenius norm ball. The class of all \field-Hermitian matrices of Frobenius norm at most 1.

Derivation of Formulas

In this section, we provide derivations of the variance formulas. I have chosen to focus on derivations which are shorter but use more advanced techniques rather than derivations which are longer but use fewer tricks.

Real Gaussians

First assume A is real. Since A is real symmetric, A has an eigenvalue decomposition A = Q\Lambda Q^\top, where Q is orthogonal and \Lambda is a diagonal matrix reporting A’s eigenvalues. Since the real Gaussian distribution is invariant under orthogonal transformations, \omega^\top A\omega = (Q^\top \omega)^\top \Lambda (Q^\top\omega) has the same distribution as \omega^\top \Lambda \omega. Therefore,

    \[\Var(\omega^\top A \omega) = \Var(\omega^\top \Lambda \omega) = \Var \left( \sum_{i=1}^n \lambda_i \omega_i^2 \right) = \sum_{i=1}^n \lambda_i^2 \Var(\omega_i^2) = 2\sum_{i=1}^n \lambda_i^2 = 2\left\|A\right\|_{\rm F}^2.\]

Here, we used that the variance of a squared standard normal random variable is two.
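The value two follows from the standard moments of a standard normal random variable g, namely \mathbb{E}[g^2] = 1 and \mathbb{E}[g^4] = 3:

    \[\Var(g^2) = \mathbb{E}[g^4] - (\mathbb{E}[g^2])^2 = 3 - 1 = 2.\]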

If A is not real, we can break the matrix A into its entrywise real and imaginary parts A = \mathfrak{R}(A) + i \, \mathfrak{I}(A). Thus,

    \[\Var(\omega^\top A \omega) = \Var(\omega^\top \mathfrak{R}(A) \omega) + \Var(\omega^\top \mathfrak{I}(A) \omega) = 2\left\|\mathfrak{R}(A)\right\|_{\rm F}^2 + 2\left\|\mathfrak{I}(A)\right\|_{\rm F}^2 = 2\left\|A\right\|_{\rm F}^2.\]

Uniform Signs

First, compute

    \[\omega^\top A \omega - \mathbb{E}[\omega^\top A \omega] = \sum_{i,j=1}^n A_{ij} \omega_i\omega_j - \sum_{i=1}^n A_{ii} = \sum_{i\ne j} A_{ij} \omega_i\omega_j + \sum_{i=1}^n A_{ii}(\omega_i^2-1).\]

For a vector \omega of uniform random signs, we have \omega_i^2 = 1 for every i, so the second sum vanishes. Note that we have assumed A symmetric, so the sum over i\ne j can be replaced by two times the sum over i < j:

    \[\omega^\top A \omega - \mathbb{E}[\omega^\top A \omega] = 2\sum_{i< j} A_{ij} \omega_i\omega_j.\]

Note that \{ \omega_i \omega_j : i < j\} are pairwise independent. As a simple exercise, one can verify that the identity

    \[\Var(a_1 X_1+\cdots+a_kX_k) = |a_1|^2 \Var(X_1) + \cdots + |a_k|^2 \Var(X_k)\]

holds for any pairwise independent family of random variables X_1,\ldots,X_k and numbers a_1,\ldots,a_k. Ergo,

    \begin{align*}\Var(\omega^\top A\omega) &= \Var(\omega^\top A \omega - \mathbb{E}[\omega^\top A \omega]) \\&= \Var\left(\sum_{i< j} 2A_{ij} \omega_i\omega_j\right) \\&= \sum_{i<j} 4 |A_{ij}|^2 \Var(\omega_i\omega_j) \\&= \sum_{i<j} 4 |A_{ij}|^2 \\&= 2 \sum_{i\ne j} |A_{ij}|^2.\end{align*}

In the second-to-last line, we use the fact that \omega_i\omega_j is a uniform random sign, which has variance 1. The final line is a consequence of the symmetry of A.
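Because a sign vector of length n takes only 2^n values, the variance for uniform signs can be computed exactly by enumeration and compared with the formula. The 4×4 symmetric matrix below is an arbitrary example.

```python
import itertools
import numpy as np

A = np.array([[ 1.0, 2.0, -1.0,  0.5],
              [ 2.0, 0.0,  3.0,  1.0],
              [-1.0, 3.0,  2.0, -2.0],
              [ 0.5, 1.0, -2.0,  4.0]])   # arbitrary symmetric example

signs = np.array(list(itertools.product([-1.0, 1.0], repeat=4)))   # all 16 sign vectors
q = np.einsum("ki,ij,kj->k", signs, A, signs)                      # omega^T A omega for each one

exact_var = np.mean((q - q.mean()) ** 2)                 # exact variance over the 16 outcomes
formula = 2 * (np.sum(A ** 2) - np.sum(np.diag(A) ** 2)) # 2 * sum_{i != j} A_ij^2 = 77
print(exact_var, formula)
```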

Uniform on the Real Sphere

The simplest proof I know is by the “camel principle”. Here’s the story (a lightly edited quotation from MathOverflow):

A father left 17 camels to his three sons and, according to the will, the eldest son was to be given a half of the camels, the middle son one-third, and the youngest son the one-ninth. The sons did not know what to do since 17 is not evenly divisible into either two, three, or nine parts, but a wise man helped the sons: he added his own camel, the oldest son took 18/2=9 camels, the second son took 18/3=6 camels, the third son 18/9=2 camels and the wise man took his own camel and went away.

We are interested in a vector \omega which is uniform on the sphere of radius \sqrt{n}. Performing averages on the sphere is hard, so we add a camel to the problem by “upgrading” \omega to a spherically symmetric vector g which has a random length. We want to pick a distribution for which the computation \Var(g^\top A g) is easy. Fortunately, we already know such a distribution, the Gaussian distribution, for which we already calculated \Var(g^\top A g) = 2\left\|A\right\|_{\rm F}^2.

The Gaussian vector g and the uniform vector \omega on the sphere are related by

    \[g = \sqrt{\frac{a}{n}} \omega,\]

where a is the squared length of the Gaussian vector g. In particular, a has the distribution of the sum of n squared Gaussian random variables, which is known as a \chi^2 random variable with n degrees of freedom.

Now, we take the camel back. Compute the variance of g^\top A g using the chain rule for variance:

    \[\Var(g^\top A g) = \mathbb{E}[\Var(g^\top A g \mid a)] + \Var(\mathbb{E}[g^\top A g \mid a]).\]

Here, \Var(\cdot \mid a) and \mathbb{E}[ \cdot \mid a] denote the conditional variance and conditional expectation with respect to the random variable a. The quick and dirty way of working with these is to treat the random variable a “like a constant” with respect to the conditional variance and expectation.

Plugging in the formula g = \sqrt{a/n} \cdot \omega and treating a “like a constant”, we obtain

    \begin{align*}\Var(g^\top A g) &= \mathbb{E}[\Var(a/n \cdot \omega^\top A \omega \mid a)] + \Var(\mathbb{E}[a/n \cdot \omega^\top A \omega \mid a]) \\&=\mathbb{E}[(a/n)^2\Var(\omega^\top A \omega)] + \Var(a/n \cdot \mathbb{E}[\omega^\top A \omega]) \\&= \frac{1}{n^2} \mathbb{E}[a^2] \cdot \Var(\omega^\top A \omega) + \frac{1}{n^2} \Var(a) |\mathbb{E} [\omega^\top A \omega]|^2.\end{align*}

As we mentioned, a is a \chi^2 random variable with n degrees of freedom and \mathbb{E}[a^2] and \Var(a) are known quantities that can be looked up:

    \[\mathbb{E}[a^2] = n(n+2), \quad \Var(a) = 2n.\]
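These values can also be derived in one line. Write a = g_1^2 + \cdots + g_n^2 for independent standard normal random variables g_1,\ldots,g_n, each with \mathbb{E}[g_i^2] = 1 and \Var(g_i^2) = 2. Then

    \[\mathbb{E}[a] = n, \quad \Var(a) = \sum_{i=1}^n \Var(g_i^2) = 2n, \quad \mathbb{E}[a^2] = \Var(a) + (\mathbb{E}[a])^2 = 2n + n^2 = n(n+2).\]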

We know \Var(g^\top A g) = 2\left\|A\right\|_{\rm F}^2 and \mathbb{E} [\omega^\top A \omega] = \tr A. Plugging these all in, we get

    \[2\left\|A\right\|_{\rm F}^2 = \frac{n+2}{n} \Var(\omega^\top A\omega) + \frac{2}{n} |\tr A|^2.\]

Rearranging, we obtain

    \[\Var(\omega^\top A\omega) = \frac{2n}{n+2} \left( \left\|A\right\|_{\rm F}^2 - \frac{1}{n}|\tr A|^2\right).\]
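The relation g = \sqrt{a/n}\,\omega also gives a practical recipe for sampling from the sphere: draw a Gaussian vector and rescale it to length \sqrt{n}. The sketch below uses that recipe to compare a Monte Carlo variance with the formula; the matrix, dimension, and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 200_000
G = rng.standard_normal((n, n))
A = (G + G.T) / 2                                   # arbitrary symmetric test matrix

g = rng.standard_normal((N, n))
omega = np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the sphere of radius sqrt(n)

q = np.einsum("ki,ij,kj->k", omega, A, omega)
predicted = 2 * n / (n + 2) * (np.sum(A ** 2) - np.trace(A) ** 2 / n)
print(q.var(), predicted)   # these should agree up to Monte Carlo error
```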

Complex Gaussians

The trick is the same as for real Gaussians. By invariance of complex Gaussian random vectors under unitary transformations, we can reduce to the case where A is a diagonal matrix populated with eigenvalues \lambda_1,\ldots,\lambda_n. Then

    \[\Var(\omega^*A \omega) = \Var \left( \sum_{i=1}^n \lambda_i |\omega_i|^2 \right) = \sum_{i=1}^n \Var(|\omega_i|^2) \lambda_i^2 = \sum_{i=1}^n \lambda_i^2 = \left\|A\right\|_{\rm F}^2.\]

Here, we use the fact that 2|\omega_i|^2 is a \chi^2 random variable with two degrees of freedom, which has variance four.

Random Phases

The trick is the same as for uniform signs. A short calculation (remembering that A is Hermitian and thus \overline{A_{ij}} = A_{ji}) reveals that

    \[\Var\left( \omega^* A \omega \right) = \Var \left( \sum_{i<j} 2 \Re(A_{ij} \overline{\omega_i} \omega_j) \right).\]

The random variables \{\overline{\omega_i} \omega_j : i < j\} are pairwise independent so we have

    \[\Var\left( \omega^* A \omega \right) = \Var \left( \sum_{i<j} 2 \Re(A_{ij} \overline{\omega_i} \omega_j) \right) = 4\sum_{i<j} \Var \left( \Re(A_{ij} \overline{\omega_i} \omega_j) \right).\]

Since \overline{\omega}_i \omega_j is uniformly distributed on the complex unit circle, we can assume without loss of generality that A_{ij} = |A_{ij}|. Thus, letting \phi be uniform on the complex unit circle,

    \[\Var\left( \omega^* A \omega \right) = 4\sum_{i<j} \Var \left( |A_{ij}|\Re(\phi) \right) = 4\Var\left( \Re(\phi) \right)\sum_{i<j}|A_{ij}|^2.\]

The real and imaginary parts of \phi have the same distribution so

    \[1 = \Var(\phi) = \Var(\Re \phi) + \Var(\Im \phi) = 2 \Var(\Re \phi)\]

so \Var(\Re \phi) = 1/2. Thus

    \[\Var\left( \omega^* A \omega \right) = 2 \sum_{i<j}|A_{ij}|^2 = \sum_{i\ne j} |A_{ij}|^2.\]
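A Monte Carlo experiment gives a quick check of the random-phase formula. This is a sketch with an arbitrary 4×4 Hermitian matrix; uniform phases are generated as e^{2\pi i u} for u uniform on [0,1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 200_000
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                                    # arbitrary Hermitian test matrix

omega = np.exp(2j * np.pi * rng.random((N, n)))             # iid uniform phases on the unit circle
q = np.einsum("ki,ij,kj->k", omega.conj(), A, omega).real   # omega^* A omega is real for Hermitian A

predicted = np.sum(np.abs(A) ** 2) - np.sum(np.abs(np.diag(A)) ** 2)   # sum_{i != j} |A_ij|^2
print(q.var(), predicted)   # these should agree up to Monte Carlo error
```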

Uniform on the Complex Sphere: Derivation 1 by Reduction to Real Case

There are at least three simple ways of deriving this result: the camel trick, reduction to the real case, and Haar integration. Each of these techniques illustrates a trick that is useful in its own right beyond the context of trace estimation. Since we have already seen an example of the camel trick for the real sphere, I will present the other two derivations.

Let us begin with the reduction to the real case. Let \mathfrak{R}(\cdot) and \mathfrak{I}(\cdot) denote the real and imaginary parts of a vector or matrix, taken entrywise. The key insight is that if \omega is a uniform random vector on the complex sphere of radius \sqrt{n}, then

    \[\mathscr{R}(\omega) := \twobyone{\mathfrak{R}(\omega)}{\mathfrak{I}(\omega)}\in\real^{2n} \quad \text{is a uniform random vector on the real sphere of radius $\sqrt{n}$}.\]

We’ve converted the complex vector \omega into a real vector \mathscr{R}(\omega).

Now, we need to convert the complex matrix A into a real matrix \mathscr{R}(A). To do this, recall that one way of representing complex numbers is by 2\times 2 matrices:

    \[a + bi \iff \twobytwo{a}{-b}{b}{a}.\]

Using this correspondence, addition and multiplication of complex numbers can be carried out by addition and multiplication of the corresponding matrices.

To convert complex matrices to real matrices, we use a matrix-version of the same representation:

    \[\mathscr{R}(A) = \twobytwo{\mathfrak{R}(A)}{-\mathfrak{I}(A)}{\mathfrak{I}(A)}{\mathfrak{R}(A)}.\]

One can check that addition and multiplication of complex matrices can be carried out by addition and multiplication of the corresponding “realified” matrices, i.e.,

    \[\mathscr{R}(A + B) = \mathscr{R}(A) + \mathscr{R}(B), \quad \mathscr{R}(A\cdot B) = \mathscr{R}(A) \cdot \mathscr{R}(B)\]

holds for all complex matrices A and B.

We’ve now converted complex matrix A and vector \omega into real matrix \mathscr{R}(A) and vector \mathscr{R}(\omega). Let’s compare \omega^*A\omega to \mathscr{R}(\omega)^\top\mathscr{R}(A)\mathscr{R}(\omega). A short calculation reveals

    \[\omega^*A\omega = \mathscr{R}(\omega)^\top \mathscr{R}(A)\mathscr{R}(\omega) .\]

Since \mathscr{R}(\omega) is a uniform random vector on the sphere of radius \sqrt{n}, \sqrt{2}\cdot \mathscr{R}(\omega) is a uniform random vector on the sphere of radius \sqrt{2n}. Thus, by the variance formula for the real sphere, we get

    \[\Var(\omega^*A\omega) = \Var[(\sqrt{2}\mathscr{R}(\omega))^\top (\mathscr{R}(A)/2)(\sqrt{2}\mathscr{R}(\omega) )] = \frac{4n}{2n+2} \left[ \|\mathscr{R}(A)/2\|_{\rm F}^2 - \frac{1}{8n}(\tr\mathscr{R}(A))^2 \right].\]

A short calculation verifies that \tr \mathscr{R}(A) = 2\tr A and \norm{\mathscr{R}(A)}_{\rm F}^2 = 2\|A\|_{\rm F}^2. Plugging this in, we obtain

    \[\Var(\omega^*A\omega)= \frac{n}{n+1} \left[ \|A\|_{\rm F}^2 - \frac{1}{n}(\tr A)^2  \right].\]
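The “short calculations” in this derivation are all easy to confirm numerically. In the sketch below, `realify_matrix` and `realify_vector` are names I have made up for the maps \mathscr{R}(A) and \mathscr{R}(\omega); the test matrices are random.

```python
import numpy as np

def realify_matrix(A):
    """Real 2n x 2n representation [[Re A, -Im A], [Im A, Re A]] of a complex matrix."""
    return np.block([[A.real, -A.imag], [A.imag, A.real]])

def realify_vector(w):
    """Stack the real and imaginary parts of a complex vector."""
    return np.concatenate([w.real, w.imag])

rng = np.random.default_rng(0)
n = 3
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                                   # Hermitian
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)

RA, RB, Rw = realify_matrix(A), realify_matrix(B), realify_vector(w)

print(np.allclose(realify_matrix(A @ B), RA @ RB))               # multiplication is preserved
print(np.isclose((w.conj() @ A @ w).real, Rw @ RA @ Rw))         # quadratic forms agree
print(np.isclose(np.trace(RA), 2 * np.trace(A).real))            # tr R(A) = 2 tr A
print(np.isclose(np.sum(RA ** 2), 2 * np.sum(np.abs(A) ** 2)))   # ||R(A)||_F^2 = 2 ||A||_F^2
```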

Uniform on the Complex Sphere: Derivation 2 by Haar Integration

The proof by reduction to the real case requires some cumbersome calculations and requires that we have already computed the variance in the real case by some other means. The method of Haar integration is more slick, but it requires some pretty high-power machinery. Haar integration may be a little bit overkill for this problem, but this technique is worth learning as it can handle some truly nasty expected value computations that appear, for example, in quantum information.

We seek to compute

    \[\mathbb{E} [(\omega^*A \omega)^2].\]

The first trick will be to write this expression using a single matrix trace using the tensor (Kronecker) product \otimes. For those unfamiliar with the tensor product, the main properties we will be using are

(6)   \[(A\otimes B) (C\otimes D) = (AC) \otimes (BD), \quad \tr(A\otimes B) = \tr A \cdot \tr B. \]

We saw in the proof of unbiasedness that

    \[\omega^* A \omega = \tr (\omega^*A\omega) = \tr (A \omega\omega^*).\]

Therefore, by (6),

    \[(\omega^*A\omega)^2 = (\tr [A \omega\omega^*])^2 = \tr [A\omega\omega^* \otimes A\omega\omega^*] = \tr [(A\otimes A) (\omega\omega^* \otimes \omega\omega^*)].\]
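NumPy's `np.kron` implements the Kronecker product, so the mixed-product rule (A\otimes B)(C\otimes D) = (AC)\otimes(BD), the trace rule, and the identity just derived can all be checked on random matrices. The specific matrices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((3, 3)) for _ in range(4))
omega = rng.standard_normal(3)
W = np.outer(omega, omega)                      # omega omega^T

mixed_lhs = np.kron(A, B) @ np.kron(C, D)
mixed_rhs = np.kron(A @ C, B @ D)

squared_form = (omega @ A @ omega) ** 2
single_trace = np.trace(np.kron(A, A) @ np.kron(W, W))

print(np.allclose(mixed_lhs, mixed_rhs))
print(np.isclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B)))
print(np.isclose(squared_form, single_trace))
```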

Thus, to evaluate \mathbb{E}[(\omega^*A\omega)^2], it will be sufficient to evaluate \mathbb{E}[\omega\omega^* \otimes \omega\omega^*]. Fortunately, there is a useful formula for this expectation provided by a field of mathematics known as representation theory (see Lemma 1 in this paper):

    \[\mathbb{E}[ \omega\omega^* \otimes \omega\omega^*] = \frac{2n}{n+1} \operatorname{Proj}_{\operatorname{Sym}^2(\complex^n)}.\]

Here, \operatorname{Proj}_{\operatorname{Sym}^2(\complex^n)} is the orthogonal projection onto the space of symmetric two-tensors \operatorname{Sym}^2(\complex^n) = \operatorname{span} \{ v \otimes v : v \in \complex^n \}. Therefore, we have that

    \[\mathbb{E}[(\omega^*A\omega)^2] = \tr [(A\otimes A) \mathbb{E}(\omega\omega^* \otimes \omega\omega^*)] = \frac{2n}{n+1} \tr [(A\otimes A) \operatorname{Proj}_{\operatorname{Sym}^2(\complex^n)}].\]

To evaluate the trace on the right-hand side of this equation, there is another formula (see Lemma 6 in this paper):

    \[\tr \left[(A\otimes B) \operatorname{Proj}_{\operatorname{Sym}^2(\complex^n)}\right] = \frac{1}{2} \left( \tr(AB) + \tr A \cdot \tr B \right).\]

Therefore, we conclude

    \begin{align*}\Var(\omega^* A \omega) &= \mathbb{E}[(\omega^*A\omega)^2] - (\mathbb{E}[\omega^*A\omega])^2 \\&= \frac{2n}{n+1}\tr [(A\otimes A) \operatorname{Proj}_{\operatorname{Sym}^2(\complex^n)}] - (\tr A)^2 \\&= \frac{n}{n+1}\left[ \tr A^2 + (\tr A)^2 \right] - (\tr A)^2 \\&= \frac{n}{n+1}\left[ \left\|A\right\|_{\rm F}^2 - \frac{1}{n} (\tr A)^2 \right].\end{align*}
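The projection can be written down explicitly, which makes the whole calculation checkable. The sketch below assumes the standard representation \operatorname{Proj}_{\operatorname{Sym}^2(\complex^n)} = (I + F)/2, where F is the swap operator F(x\otimes y) = y \otimes x; this representation is not stated in the post, and the helper name `swap_operator` is mine.

```python
import numpy as np

def swap_operator(n):
    """Matrix F acting on C^n (x) C^n with F (x (x) y) = y (x) x, in np.kron ordering."""
    F = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            F[i * n + j, j * n + i] = 1.0
    return F

rng = np.random.default_rng(0)
n = 3
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                         # arbitrary Hermitian test matrix
B = rng.standard_normal((n, n))

P = (np.eye(n * n) + swap_operator(n)) / 2       # assumed form of the projection onto Sym^2

lhs = np.trace(np.kron(A, B) @ P)
rhs = (np.trace(A @ B) + np.trace(A) * np.trace(B)) / 2

second_moment = 2 * n / (n + 1) * np.trace(np.kron(A, A) @ P).real
variance = second_moment - np.trace(A).real ** 2
closed_form = n / (n + 1) * (np.sum(np.abs(A) ** 2) - np.trace(A).real ** 2 / n)
print(np.isclose(lhs, rhs), np.isclose(variance, closed_form))
```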

Proof of Optimality Properties

In this section, we provide proofs of the two optimality properties.

Optimality: Independent Vectors with Independent Coordinates

Assume A is real and symmetric and suppose that \omega is isotropic (2) with independent coordinates. The isotropy condition

    \[\mathbb{E}[\omega\omega^\top] = I\]

implies that \mathbb{E}[\omega_i\omega_j] = \delta_{ij}, where \delta is the Kronecker symbol. Using this fact, we compute the second moment:

    \begin{align*}\mathbb{E}[ (\omega^*A \omega)^2] &= \mathbb{E}\left[ \left( \sum_{i=1}^n A_{ii} \omega_i^2 +2 \sum_{i<j} A_{ij}\omega_i\omega_j \right)^2\right] \\&= \sum_{i=1}^n A_{ii}^2 \mathbb{E}[\omega_i^4] + \sum_{i<j} (2A_{ii}A_{jj}+4A_{ij}^2) \mathbb{E}[\omega_i^2]\mathbb{E}[\omega_j^2] \\&= \sum_{i=1}^n A_{ii}^2 \mathbb{E}[\omega_i^4] + \sum_{i<j} (2A_{ii}A_{jj}+4A_{ij}^2) .\end{align*}

Thus

    \[\Var(\omega^*A\omega) = \mathbb{E}[ (\omega^*A \omega)^2] - (\mathbb{E}[\omega^* A \omega])^2 = \sum_{i=1}^n A_{ii}^2 (\mathbb{E}[|\omega_i|^4]-1) + 4\sum_{i<j} A_{ij}^2.\]

The variance is minimized by choosing \omega with \mathbb{E} \omega_i^4 as small as possible. Since \mathbb{E} \omega_i^2 = 1, the smallest possible value for \mathbb{E} \omega_i^4 is \mathbb{E} \omega_i^4 = 1, which is obtained by populating \omega with random signs.
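As a consistency check, this variance expression reproduces the two formulas stated earlier for test vectors with independent entries. The bound \mathbb{E}\,\omega_i^4 \ge 1 is just \Var(\omega_i^2) \ge 0. For random signs, \mathbb{E}\,\omega_i^4 = 1 and the diagonal term vanishes:

    \[\Var(\omega^\top A\omega) = 4\sum_{i<j} A_{ij}^2 = 2\sum_{i\ne j} A_{ij}^2.\]

For real Gaussians, \mathbb{E}\,\omega_i^4 = 3, so using \left\|A\right\|_{\rm F}^2 = \sum_{i} A_{ii}^2 + 2\sum_{i<j} A_{ij}^2,

    \[\Var(\omega^\top A\omega) = 2\sum_{i=1}^n A_{ii}^2 + 4\sum_{i<j} A_{ij}^2 = 2\left\|A\right\|_{\rm F}^2.\]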

Optimality: Independent Vectors

This result appears to have first been proven by Richard Kueng in unpublished work. We use an argument suggested to me by Robert J. Webber.

Assume \mathscr{A} is a class of \field-Hermitian matrices closed under \field-unitary similarity transformations and that \omega is an isotropic random vector (2). Decompose the test vector as

    \[\omega = a \cdot s \quad \text{for} \quad a \in [0,+\infty), \: s \in\{x\in \field^n : x^*x = n \}.\]

First, we shall show that the variance is reduced by replacing s with a vector t drawn uniformly from the sphere:

(7)   \[\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega}) \le \sup_{A\in\mathscr{A}} \Var(\omega^*A\omega) \]

where

(8)   \[\tilde{\omega} = a\cdot t \quad \text{and}\quad t\sim \text{Uniform} \{ x \in \field^n :x^*x = n \} \quad \text{is independent of $a$}. \]

Note that such a t can be generated as t = Qs for a uniformly random \field-unitary matrix Q drawn independently of a and s, so that \tilde{\omega}^*A\tilde{\omega} = a^2 \cdot s^*(Q^*AQ)s. Therefore, we have

    \begin{align*}\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega})&= \sup_{A\in\mathscr{A}} \left[\mathbb{E}[(\tilde{\omega}^*A\tilde{\omega})^2] - (\tr A)^2\right]\\&= \sup_{A\in\mathscr{A}} \left[\mathbb{E}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right].\end{align*}

Now apply Jensen’s inequality only over the randomness in Q to obtain

    \begin{align*}\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega})&= \sup_{A\in\mathscr{A}} \left[\mathbb{E}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right] \\&\le \mathbb{E}_Q \sup_{A\in\mathscr{A}} \left[\mathbb{E}_{a,s}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right].\end{align*}

Finally, note that since \mathscr{A} is closed under \field-unitary similarity transformations, the supremum over Q^*AQ for A \in \mathscr{A} is the same as the supremum of A \in \mathscr{A}, so we obtain

    \begin{align*}\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega})&\le \mathbb{E}_Q \sup_{A\in\mathscr{A}} \left[\mathbb{E}_{a,s}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right] \\&= \mathbb{E}_Q \sup_{A\in\mathscr{A}} \left[\mathbb{E}_{a,s}[(a^2 \cdot s^*As)^2] - (\tr A)^2\right] \\&= \sup_{A\in\mathscr{A}} \Var(\omega^*A\omega).\end{align*}

We have successfully proven (7). This argument is a specialized version of a far more general result which appears as Proposition 4.1 in this paper.

Next, we shall prove

(9)   \[\sup_{A\in\mathscr{A}} \Var(t^*At) \le \sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega}), \]

where t is still defined as in (8). Indeed, using the chain rule for variance, we obtain

    \begin{align*}\Var(\tilde{\omega}^*A\tilde{\omega})&= \Var(a^2\cdot t^*At) \\&= \mathbb{E}[\Var(a^2\cdot t^* A t \mid a)] + \Var(\mathbb{E}[a^2\cdot t^* A t \mid a]) \\&= \mathbb{E}[a^4]\Var(t^* A t )+ (\tr A)^2\Var(a^2) \\&\ge \mathbb{E}[a^4]\Var(t^* A t ).\end{align*}

Here, we have used that t is uniform on the sphere and thus \mathbb{E}[t^*At] = \tr A. By definition, a is the length of \omega divided by \sqrt{n}. Therefore,

    \[\mathbb{E}[a^2] = \frac{1}{n}\mathbb{E}[\omega^*\omega] = \frac{1}{n} \mathbb{E}[\tr (\omega\omega^*)] = \frac{1}{n} \tr (\mathbb{E}[\omega\omega^*]) = \frac{\tr I}{n} = 1.\]

Therefore, by Jensen’s inequality,

    \[\mathbb{E}[a^4] = \mathbb{E}[(a^2)^2] \ge (\mathbb{E}[a^2])^2 = 1.\]

Thus

    \[\Var(\tilde{\omega}^*A\tilde{\omega}) \ge \mathbb{E}[a^4]\Var(t^* A t ) \ge \Var(t^*At) \quad \text{for every }A,\]

which proves (9).
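As a sanity check in the real case, the variance formulas stated earlier confirm that the sphere never does worse than Gaussian test vectors g. For every real symmetric A,

    \[\Var(t^\top A t) = \frac{2n}{n+2} \left( \left\|A\right\|_{\rm F}^2 - \frac{1}{n} (\tr A)^2 \right) \le \frac{2n}{n+2} \left\|A\right\|_{\rm F}^2 \le 2\left\|A\right\|_{\rm F}^2 = \Var(g^\top A g).\]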

