I am delighted to share that my paper with Joel A. Tropp and Robert J. Webber, XTrace: Making the Most of Every Sample in Stochastic Trace Estimation, has recently been released as a preprint on arXiv. In it, we consider the implicit trace estimation problem:
Implicit trace estimation problem: Given access to a square matrix $A$ via the matrix–vector product operation $\omega \mapsto A\omega$, estimate its trace $\tr A = \sum_{i=1}^n A_{ii}$.
Algorithms for this task have many uses such as log-determinant computations in machine learning, partition function calculations in statistical physics, and generalized cross validation for smoothing splines. I described another application to counting triangles in a large network in a previous blog post.
Our paper presents new trace estimators XTrace and XNysTrace which are highly efficient, producing accurate trace approximations using a small budget of matrix–vector products. In addition, these algorithms are fast to run and are supported by theoretical results which explain their excellent performance. I really hope that you will check out the paper to learn more about these estimators!
For the rest of this post, I’m going to talk about the most basic stochastic trace estimation algorithm, the Girard–Hutchinson estimator. This seemingly simple algorithm exhibits a number of nuances and forms the backbone for more sophisticated trace estimators such as Hutch++, Nyström++, XTrace, and XNysTrace. Toward the end, this blog post will be fairly mathematical, but I hope that the beginning will be accessible to all.
Girard–Hutchinson Estimator: The Basics
The Girard–Hutchinson estimator for the trace of a square matrix $A$ is

(1)   \[\hat{\tr} = \frac{1}{m} \sum_{i=1}^m \omega_i^* A \omega_i. \]

Here, $\omega_1,\ldots,\omega_m$ are random vectors, usually chosen to be statistically independent, and ${}^*$ denotes the conjugate transpose of a vector or matrix. The Girard–Hutchinson estimator only depends on the matrix $A$ through the matrix–vector products $A\omega_1,\ldots,A\omega_m$.
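Before going further, here is a minimal sketch of the estimator in Python with NumPy. The function name, the Gaussian choice of test vectors, and the example matrix are my own choices for illustration; the point is that the routine touches $A$ only through a user-supplied matrix–vector product.

```python
import numpy as np

def girard_hutchinson(matvec, n, m, rng=None):
    """Estimate tr(A) from m matrix-vector products with A.

    matvec : callable implementing w -> A @ w
    n      : dimension of A
    m      : number of random test vectors
    """
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(m):
        w = rng.standard_normal(n)   # real Gaussian test vector (one choice among several)
        total += w @ matvec(w)       # quadratic form w^T (A w)
    return total / m

# Example: estimate the trace of a random symmetric matrix
rng = np.random.default_rng(0)
n = 1000
G = rng.standard_normal((n, n))
A = (G + G.T) / 2
print(girard_hutchinson(lambda w: A @ w, n, m=100), np.trace(A))
```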
Unbiasedness
Provided the random vectors are isotropic, meaning

(2)   \[\mathbb{E} [\omega_i\omega_i^*] = I, \]

the Girard–Hutchinson estimator is unbiased:

(3)   \[\mathbb{E} [\hat{\tr}] = \tr A.\]
Let us confirm this claim in some detail. First, we use linearity of expectation to evaluate
(4)   \[\mathbb{E} [\hat{\tr}] = \mathbb{E} \left[ \frac{1}{m} \sum_{i=1}^m \omega_i^*A\omega_i \right] = \frac{1}{m} \sum_{i=1}^m \mathbb{E} \left[ \omega_i^* A \omega_i\right]. \]
Therefore, to prove that $\mathbb{E}[\hat{\tr}] = \tr A$, it is sufficient to prove that $\mathbb{E}[\omega_i^*A\omega_i] = \tr A$ for each $i$.

When working with traces, there are two tricks that solve 90% of derivations. The first trick is that, if we view a number as a $1\times 1$ matrix, then a number equals its trace, $x = \tr x$. The second trick is the cyclic property: for a $p\times q$ matrix $B$ and a $q\times p$ matrix $C$, we have $\tr(BC) = \tr(CB)$. The cyclic property should be handled with care when one works with a product of three or more matrices. For instance, we have
\[\tr[BCD] = \tr[(BC)D] = \tr[D(BC)] = \tr[DBC].\]

However,

\[\tr [BCD] \ne \tr[CBD] \quad \text{in general}.\]
One should think of the matrix product $BCD$ as beads on a closed loop of string. One can move the last bead $D$ to the front of the other two, $\tr[BCD] = \tr[DBC]$, but not interchange two beads, $\tr[BCD] \ne \tr[CBD]$.
With these tricks in hand, let’s return to proving that $\mathbb{E}[\omega_i^*A\omega_i] = \tr A$ for every $i$. Apply our two tricks:

\[\mathbb{E} \left[\omega_i^*A\omega_i\right] = \mathbb{E} \tr \left[\omega_i^*A\omega_i\right] = \mathbb{E} \tr \left[A\omega_i\omega_i^*\right].\]
The expectation is a linear operation and the matrix $A$ is non-random, so we can bring the expectation into the trace as

\[\mathbb{E} \left[\omega_i^*A\omega_i\right] = \mathbb{E} \tr \left[A\omega_i\omega_i^*\right] = \tr(A \, \mathbb{E}[\omega_i\omega_i^*] ).\]
Invoke the isotropy condition (2) and conclude:
\[\mathbb{E} \left[\omega_i^*A\omega_i\right] = \tr(A \, \mathbb{E}[\omega_i\omega_i^*] ) = \tr(A\cdot I) = \tr A.\]
Plugging this into (4) confirms the unbiasedness claim (3).
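Unbiasedness is also easy to confirm numerically. Here is a throwaway sanity check of my own, using Rademacher test vectors (which are isotropic) and a nonsymmetric matrix:

```python
import numpy as np

# Empirical check of unbiasedness (3): E[w^T A w] = tr A for isotropic w
rng = np.random.default_rng(0)
n, samples = 20, 500_000
A = rng.standard_normal((n, n))                 # any square matrix works

W = rng.choice([-1.0, 1.0], size=(n, samples))  # Rademacher test vectors
quad = np.sum(W * (A @ W), axis=0)              # w^T A w for each sample
print(np.mean(quad), np.trace(A))               # the two numbers should be close
```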
Variance
Continue to assume that the $\omega_i$’s are isotropic (2) and now assume that $\omega_1,\ldots,\omega_m$ are independent. By independence, the variance can be written as

\[\Var(\hat{\tr}) = \frac{1}{m^2} \sum_{i=1}^m \Var(\omega_i^*A\omega_i).\]
Assuming that $\omega_1,\ldots,\omega_m$ are identically distributed with common distribution $\omega_i \sim \omega$, we then get

\[\Var(\hat{\tr}) = \frac{1}{m} \Var(\omega^*A\omega).\]
The variance decreases like $1/m$, which is characteristic of Monte Carlo-type algorithms. Since $\hat{\tr}$ is unbiased (i.e., (3) holds), this means that the mean-square error decays like $1/m$, so the average error (more precisely, the root-mean-square error) decays like

\[\left| \hat{\tr} - \tr A \right| \lessapprox \frac{\mathrm{const}}{\sqrt{m}}.\]
This type of convergence is very slow. If I want to decrease the error by a factor of 10, I must do $100\times$ the work!
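To see this slow Monte Carlo rate in action, here is a small experiment of my own, using a random symmetric test matrix and Gaussian test vectors. Increasing $m$ by a factor of 100 shrinks the root-mean-square error only by a factor of about 10.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
G = rng.standard_normal((n, n))
A = (G + G.T) / 2
true_trace = np.trace(A)

def gh_estimate(m):
    W = rng.standard_normal((n, m))               # m Gaussian test vectors
    return np.mean(np.sum(W * (A @ W), axis=0))   # average of w_i^T A w_i

for m in [10, 100, 1000]:
    errors = [gh_estimate(m) - true_trace for _ in range(200)]
    print(f"m = {m:5d}   RMS error = {np.sqrt(np.mean(np.square(errors))):.2f}")
```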
Variance-reduced trace estimators like Hutch++ and our new trace estimator XTrace improve the rate of convergence substantially. Even in the worst case, Hutch++ and XTrace reduce the variance at a rate $1/m^2$ and the (root-mean-square) error at a rate $1/m$:

\[\Var(\hat{\tr}_{\text{H++ or X}}) \le \frac{\mathrm{const}}{m^2},\quad \left| \hat{\tr}_{\text{H++ or X}} - \tr A \right| \lessapprox \frac{\mathrm{const}}{m}.\]
For matrices with rapidly decreasing singular values, the variance and error can decrease much faster than this.
Variance Formulas
As the rate of convergence for the Girard–Hutchinson estimator is so slow, it is imperative to pick a distribution on test vectors $\omega$ that makes the variance of the single-sample estimate $\omega^*A\omega$ as low as possible. In this section, we will provide several explicit formulas for the variance of the Girard–Hutchinson estimator. Derivations of these formulas will appear at the end of this post. These variance formulas help illuminate the benefits and drawbacks of different test vector distributions.
To express the formulas, we will need some notation. For a complex number $z$, we use $\Re z$ and $\Im z$ to denote the real and imaginary parts. The variance of a random complex number $z$ is

\[\Var(z) := \mathbb{E} |z - \mathbb{E} z|^2 = \Var(\Re z) + \Var(\Im z).\]
The Frobenius norm of a matrix $A$ is

\[\left\|A\right\|_{\rm F}^2 = \sum_{i,j} |A_{ij}|^2.\]
If $A$ is real symmetric or complex Hermitian with (real) eigenvalues $\lambda_1,\ldots,\lambda_n$, we have

(5)   \[\left\|A\right\|_{\rm F}^2 = \sum_{i=1}^n \lambda_i^2. \]

Finally, $A^\top$ denotes the ordinary transpose of $A$ and $A^*$ denotes the conjugate transpose of $A$.
Real-Valued Test Vectors
We first focus on real-valued test vectors $\omega \in \mathbb{R}^n$. Since $\omega$ is real, we can use the ordinary transpose $\omega^\top$ rather than the conjugate transpose $\omega^*$. Since $\omega^\top A\omega$ is a number, it is equal to its own transpose:

\[\omega^\top A \omega = (\omega^\top A \omega)^\top = \omega^\top A^\top \omega.\]

Therefore,

\[\omega^\top A\omega = \frac{\omega^\top A \omega + \omega^\top A^\top \omega}{2} = \omega^\top \left( \frac{A + A^\top}{2} \right)\omega.\]

The Girard–Hutchinson trace estimator applied to $A$ is the same as the Girard–Hutchinson estimator applied to the symmetric part of $A$, $(A+A^\top)/2$.
For the following results, assume $A$ is symmetric, $A = A^\top$.

- Real Gaussian: $\omega_1,\ldots,\omega_m$ are independent standard normal random vectors. Then \[\Var(\omega^\top A\omega) = 2 \left\|A\right\|_{\rm F}^2.\]
- Uniform signs (Rademachers): $\omega_1,\ldots,\omega_m$ are independent random vectors with uniform $\pm 1$ coordinates. Then \[\Var(\omega^\top A \omega) = 2\sum_{i\ne j} |A_{ij}|^2.\]
- Real sphere: Assume $\omega_1,\ldots,\omega_m$ are uniformly distributed on the real sphere of radius $\sqrt{n}$: $\omega_i \sim \text{Uniform}\{x \in \mathbb{R}^n : x^\top x = n\}$. Then \[\Var(\omega^\top A\omega) = \frac{2n}{n+2} \left( \left\|A\right\|_{\rm F}^2 - \frac{1}{n} |\tr A|^2 \right).\]
These formulas continue to hold for nonsymmetric $A$ by replacing $A$ with its symmetric part $(A+A^\top)/2$ on the right-hand sides of these variance formulas.
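These formulas are easy to test numerically. The sketch below is my own experiment, not something from the paper: it samples each of the three distributions and compares the empirical variance of $\omega^\top A\omega$ against the formulas above for a random symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
G = rng.standard_normal((n, n))
A = (G + G.T) / 2                       # real symmetric test matrix

def sample(kind, num):
    if kind == "gaussian":
        return rng.standard_normal((n, num))
    if kind == "signs":                 # uniform +/-1 coordinates
        return rng.choice([-1.0, 1.0], size=(n, num))
    if kind == "sphere":                # uniform on the sphere of radius sqrt(n)
        W = rng.standard_normal((n, num))
        return np.sqrt(n) * W / np.linalg.norm(W, axis=0)

fro2 = np.linalg.norm(A, "fro") ** 2
predicted = {
    "gaussian": 2 * fro2,
    "signs": 2 * (fro2 - np.sum(np.diag(A) ** 2)),
    "sphere": 2 * n / (n + 2) * (fro2 - np.trace(A) ** 2 / n),
}

for kind in ["gaussian", "signs", "sphere"]:
    W = sample(kind, 100_000)
    quad = np.sum(W * (A @ W), axis=0)   # single-sample estimates w^T A w
    print(kind, np.var(quad), predicted[kind])
```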
Complex-Valued Test Vectors
We now move our focus to complex-valued test vectors $\omega \in \mathbb{C}^n$. As a rule of thumb, one should typically expect that the variance for complex-valued test vectors applied to a real symmetric matrix $A$ is about half that of the natural real counterpart; e.g., with complex Gaussians, you get about half the variance you would get with real Gaussians.
A square complex matrix has a Cartesian decomposition

\[A = A^{\rm H} + i A^{\rm SH},\]

where

\[A^{\rm H} = \frac{A+A^*}{2} ,\quad A^{\rm SH} = \frac{A - A^*}{2i}\]

denote the Hermitian and skew-Hermitian parts of $A$. Similar to how the imaginary part of a complex number is real, the skew-Hermitian part of a complex matrix is Hermitian (and $iA^{\rm SH}$ is skew-Hermitian). Since $A^{\rm H}$ and $A^{\rm SH}$ are both Hermitian, we have

\[\Re(\omega^* A\omega) = \omega^* A^{\rm H} \omega, \quad \Im (\omega^* A \omega) = \omega^* A^{\rm SH} \omega.\]
Consequently, the variance of $\omega^*A\omega$ can be broken into Hermitian and skew-Hermitian parts:

\[\Var(\omega^* A\omega) = \Var(\omega^* A^{\rm H}\omega) + \Var(\omega^* A^{\rm SH}\omega).\]
For this reason, we will state the variance formulas only for Hermitian $A$, with the formula for general $A$ following from the Cartesian decomposition.

For the following results, assume $A$ is Hermitian, $A = A^*$.
- Complex Gaussian: $\omega_1,\ldots,\omega_m$ are independent standard complex Gaussian random vectors, i.e., each $\omega_i$ has iid entries distributed as $(g_1 + i g_2)/\sqrt{2}$ for $g_1,g_2$ independent standard normal random variables. Then \[\Var(\omega^* A\omega) = \left\|A\right\|_{\rm F}^2.\]
- Uniform phases (Steinhauses): $\omega_1,\ldots,\omega_m$ are independent random vectors whose entries are independent and uniform on the complex unit circle $\{z \in \mathbb{C} : |z| = 1\}$. Then \[\Var(\omega^* A \omega) = \sum_{i\ne j} |A_{ij}|^2.\]
- Complex sphere: Assume $\omega_1,\ldots,\omega_m$ are uniformly distributed on the complex sphere of radius $\sqrt{n}$: $\omega_i \sim \text{Uniform}\{x \in \mathbb{C}^n : x^*x = n\}$. Then \[\Var(\omega^* A\omega) = \frac{n}{n+1} \left( \left\|A\right\|_{\rm F}^2 - \frac{1}{n} |\tr A|^2 \right).\]
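As with the real case, a quick experiment (again my own, for illustration) compares the empirical variance of $\omega^*A\omega$ with these formulas for a random Hermitian matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                   # complex Hermitian test matrix

def sample(kind, num):
    if kind == "gaussian":                  # entries (g1 + i g2)/sqrt(2)
        return (rng.standard_normal((n, num))
                + 1j * rng.standard_normal((n, num))) / np.sqrt(2)
    if kind == "phases":                    # entries uniform on the unit circle
        return np.exp(2j * np.pi * rng.random((n, num)))
    if kind == "sphere":                    # uniform on the sphere of radius sqrt(n)
        W = (rng.standard_normal((n, num))
             + 1j * rng.standard_normal((n, num)))
        return np.sqrt(n) * W / np.linalg.norm(W, axis=0)

fro2 = np.linalg.norm(A, "fro") ** 2
predicted = {
    "gaussian": fro2,
    "phases": fro2 - np.sum(np.abs(np.diag(A)) ** 2),
    "sphere": n / (n + 1) * (fro2 - np.abs(np.trace(A)) ** 2 / n),
}

for kind in ["gaussian", "phases", "sphere"]:
    W = sample(kind, 100_000)
    quad = np.real(np.sum(W.conj() * (A @ W), axis=0))  # w^* A w (real for Hermitian A)
    print(kind, np.var(quad), predicted[kind])
```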
Optimality Properties
Let us finally address the question of what the best choice of test vectors is for the Girard–Hutchinson estimator. We will state two results with different restrictions on the test vectors $\omega_1,\ldots,\omega_m$.
Our first result, due to Hutchinson, is valid for real symmetric matrices with real test vectors.
Optimality (independent test vectors with independent coordinates). If the test vectors $\omega_1,\ldots,\omega_m$ are isotropic (2), independent from each other, and have independent entries, then for any fixed real symmetric matrix $A$, the minimum variance for $\hat{\tr}$ is obtained when $\omega_1,\ldots,\omega_m$ are populated with random signs $\pm 1$.
The next optimality result has real and complex versions. To present the results for $\mathbb{R}$-valued and $\mathbb{C}$-valued test vectors on a unified footing, let $\mathbb{F}$ denote either $\mathbb{R}$ or $\mathbb{C}$. We let an $\mathbb{F}$-Hermitian matrix be either a real symmetric matrix (if $\mathbb{F} = \mathbb{R}$) or a complex Hermitian matrix (if $\mathbb{F} = \mathbb{C}$). Let an $\mathbb{F}$-unitary matrix be either a real orthogonal matrix (if $\mathbb{F} = \mathbb{R}$) or a complex unitary matrix (if $\mathbb{F} = \mathbb{C}$).
The condition that the vectors $\omega_i$ have independent entries is often too restrictive in practice. It rules out, for instance, the case of uniform vectors on the sphere. If we relax this condition, we get a different optimal distribution:
Optimality (independent test vectors). Consider any set $\mathscr{A}$ of $\mathbb{F}$-Hermitian matrices which is invariant under $\mathbb{F}$-unitary similarity transformations:

\[\text{If $A \in \mathscr{A}$ and $U$ is $\mathbb{F}$-unitary, then $U^*AU \in \mathscr{A}$.}\]

Assume that the test vectors $\omega_1,\ldots,\omega_m$ are independent and isotropic (2). The worst-case variance $\sup_{A\in\mathscr{A}} \Var(\hat{\tr})$ is minimized by choosing $\omega_1,\ldots,\omega_m$ uniformly on the $\mathbb{F}$-sphere: $\omega_i \sim \text{Uniform}\{x \in \mathbb{F}^n : x^*x = n\}$.
More simply, if you want your stochastic trace estimator to be effective for a class of inputs $\mathscr{A}$ (closed under $\mathbb{F}$-unitary similarity transformations) rather than a single input matrix $A$, then the best distribution is test vectors drawn uniformly from the sphere. Examples of classes of matrices $\mathscr{A}$ include:

- Fixed eigenvalues. For fixed real eigenvalues $\lambda_1,\ldots,\lambda_n$, the set of all $\mathbb{F}$-Hermitian matrices with these eigenvalues.
- Density matrices. The class of all trace-one psd matrices.
- Frobenius norm ball. The class of all $\mathbb{F}$-Hermitian matrices of Frobenius norm at most 1.
Derivation of Formulas
In this section, we provide derivations of the variance formulas. I have chosen to focus on derivations which are shorter but use more advanced techniques rather than derivations which are longer but use fewer tricks.
Real Gaussians
First assume $A$ is real. Since $A$ is real symmetric, $A$ has an eigenvalue decomposition $A = Q\Lambda Q^\top$, where $Q$ is orthogonal and $\Lambda$ is a diagonal matrix reporting $A$’s eigenvalues $\lambda_1,\ldots,\lambda_n$. Since the real Gaussian distribution is invariant under orthogonal transformations, $\omega^\top A\omega = (Q^\top\omega)^\top \Lambda (Q^\top\omega)$ has the same distribution as $\omega^\top \Lambda \omega$. Therefore,

\[\Var(\omega^\top A \omega) = \Var(\omega^\top \Lambda \omega) = \Var \left( \sum_{i=1}^n \lambda_i \omega_i^2 \right) = \sum_{i=1}^n \lambda_i^2 \Var(\omega_i^2) = 2\sum_{i=1}^n \lambda_i^2 = 2\left\|A\right\|_{\rm F}^2.\]

Here, we used that the variance of a squared standard normal random variable is two.
For a non-real matrix $A$, we can break the matrix $A$ into its entrywise real and imaginary parts $A = \mathfrak{R}(A) + i\,\mathfrak{I}(A)$. Thus,

\[\Var(\omega^\top A \omega) = \Var(\omega^\top \mathfrak{R}(A) \omega) + \Var(\omega^\top \mathfrak{I}(A) \omega) = 2\left\|\mathfrak{R}(A)\right\|_{\rm F}^2 + 2\left\|\mathfrak{I}(A)\right\|_{\rm F}^2 = 2\left\|A\right\|_{\rm F}^2.\]
Uniform Signs
First, compute
\[\omega^\top A \omega - \mathbb{E}[\omega^\top A \omega] = \sum_{i,j=1}^n A_{ij} \omega_i\omega_j - \sum_{i=1}^n A_{ii} = \sum_{i\ne j} A_{ij} \omega_i\omega_j + \sum_{i=1}^n A_{ii}(\omega_i^2-1).\]
For a vector $\omega$ of uniform random signs, we have $\omega_i^2 = 1$ for every $i$, so the second sum vanishes. Note that we have assumed $A$ symmetric, so the sum over $i\ne j$ can be replaced by two times the sum over $i < j$:

\[\omega^\top A \omega - \mathbb{E}[\omega^\top A \omega] = 2\sum_{i< j} A_{ij} \omega_i\omega_j.\]
Note that the random variables $\{\omega_i\omega_j : i < j\}$ are pairwise independent. As a simple exercise, one can verify that the identity

\[\Var(a_1 X_1+\cdots+a_kX_k) = |a_1|^2 \Var(X_1) + \cdots + |a_k|^2 \Var(X_k)\]

holds for any pairwise independent family of random variables $X_1,\ldots,X_k$ and numbers $a_1,\ldots,a_k$. Ergo,

\begin{align*}
\Var(\omega^\top A\omega) &= \Var(\omega^\top A \omega - \mathbb{E}[\omega^\top A \omega]) \\
&= \Var\left(\sum_{i< j} 2A_{ij} \omega_i\omega_j\right) \\
&= \sum_{i<j} 4 |A_{ij}|^2 \Var(\omega_i\omega_j) \\
&= \sum_{i<j} 4 |A_{ij}|^2 \\
&= 2 \sum_{i\ne j} |A_{ij}|^2.
\end{align*}
In the second-to-last line, we use the fact that $\omega_i\omega_j$ is a uniform random sign, which has variance one. The final line is a consequence of the symmetry of $A$.
Uniform on the Real Sphere
The simplest proof I know is by the “camel principle”. Here’s the story (a lightly edited quotation from MathOverflow):
A father left 17 camels to his three sons and, according to the will, the eldest son was to be given a half of the camels, the middle son one-third, and the youngest son one-ninth. The sons did not know what to do since 17 is not evenly divisible into either two, three, or nine parts, but a wise man helped the sons: he added his own camel, the oldest son took 9 camels, the second son took 6 camels, the third son 2 camels, and the wise man took his own camel and went away.
We are interested in a vector $\omega$ which is uniform on the sphere of radius $\sqrt{n}$. Performing averages on the sphere is hard, so we add a camel to the problem by “upgrading” $\omega$ to a spherically symmetric vector $g$ which has a random length. We want to pick a distribution for which the computation $\Var(g^\top A g)$ is easy. Fortunately, we already know such a distribution, the Gaussian distribution, for which we already calculated $\Var(g^\top A g) = 2\left\|A\right\|_{\rm F}^2$.
The Gaussian vector $g$ and the uniform vector $\omega$ on the sphere are related by

\[g = \sqrt{\frac{a}{n}} \omega,\]

where $a = \|g\|^2$ is the squared length of the Gaussian vector $g$. In particular, $a$ has the distribution of the sum of $n$ squared standard normal random variables, which is known as a $\chi^2$ random variable with $n$ degrees of freedom.
Now, we take the camel back. Compute the variance of $g^\top A g$ using the chain rule for variance:

\[\Var(g^\top A g) = \mathbb{E}[\Var(g^\top A g \mid a)] + \Var(\mathbb{E}[g^\top A g \mid a]).\]

Here, $\Var(\cdot \mid a)$ and $\mathbb{E}[\cdot \mid a]$ denote the conditional variance and conditional expectation with respect to the random variable $a$. The quick and dirty way of working with these is to treat the random variable $a$ “like a constant” with respect to the conditional variance and expectation.
Plugging in the formula $g^\top A g = (a/n)\cdot \omega^\top A\omega$ and treating $a$ “like a constant”, we obtain

\begin{align*}
\Var(g^\top A g) &= \mathbb{E}[\Var(a/n \cdot \omega^\top A \omega \mid a)] + \Var(\mathbb{E}[a/n \cdot \omega^\top A \omega \mid a]) \\
&=\mathbb{E}[(a/n)^2\Var(\omega^\top A \omega)] + \Var(a/n \cdot \mathbb{E}[\omega^\top A \omega]) \\
&= \frac{1}{n^2} \mathbb{E}[a^2] \cdot \Var(\omega^\top A \omega) + \frac{1}{n^2} \Var(a) |\mathbb{E} [\omega^\top A \omega]|^2.
\end{align*}
As we mentioned, $a$ is a $\chi^2$ random variable with $n$ degrees of freedom, and $\mathbb{E}[a^2]$ and $\Var(a)$ are known quantities that can be looked up:

\[\mathbb{E}[a^2] = n(n+2), \quad \Var(a) = 2n.\]
We know $\Var(g^\top A g) = 2\left\|A\right\|_{\rm F}^2$ and $\mathbb{E}[\omega^\top A\omega] = \tr A$. Plugging these all in, we get

\[2\left\|A\right\|_{\rm F}^2 = \frac{n+2}{n} \Var(\omega^\top A\omega) + \frac{2}{n} |\tr A|^2.\]
Rearranging, we obtain
\[\Var(\omega^\top A\omega) = \frac{2n}{n+2} \left( \left\|A\right\|_{\rm F}^2 - \frac{1}{n}|\tr A|^2\right).\]
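If you don’t want to take the $\chi^2$ moments on faith, they are easy to check numerically; here is a throwaway sketch of mine:

```python
import numpy as np

# Quick empirical check of the chi^2 moments used above
rng = np.random.default_rng(0)
n, samples = 10, 1_000_000
a = np.sum(rng.standard_normal((samples, n)) ** 2, axis=1)   # chi^2 with n degrees of freedom
print(np.mean(a ** 2), n * (n + 2))    # E[a^2] = n(n+2)
print(np.var(a), 2 * n)                # Var(a) = 2n
```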
Complex Gaussians
The trick is the same as for real Gaussians. By invariance of complex Gaussian random vectors under unitary transformations, we can reduce to the case where $A = \Lambda$ is a diagonal matrix populated with eigenvalues $\lambda_1,\ldots,\lambda_n$. Then

\[\Var(\omega^*A \omega) = \Var \left( \sum_{i=1}^n \lambda_i |\omega_i|^2 \right) = \sum_{i=1}^n \Var(|\omega_i|^2) \lambda_i^2 = \sum_{i=1}^n \lambda_i^2 = \left\|A\right\|_{\rm F}^2.\]

Here, we use the fact that $2|\omega_i|^2$ is a $\chi^2$ random variable with two degrees of freedom, which has variance four, so $\Var(|\omega_i|^2) = 1$.
Random Phases
The trick is the same as for uniform signs. A short calculation (remembering that $A$ is Hermitian and thus $A_{ji} = \overline{A_{ij}}$) reveals that

\[\Var\left( \omega^* A \omega \right) = \Var \left( \sum_{i<j} 2 \Re(A_{ij} \overline{\omega_i} \omega_j) \right).\]
The random variables $\{\overline{\omega_i} \omega_j : i < j\}$ are pairwise independent, so we have

\[\Var\left( \omega^* A \omega \right) = \Var \left( \sum_{i<j} 2 \Re(A_{ij} \overline{\omega_i} \omega_j) \right) = 4\sum_{i<j} \Var \left( \Re(A_{ij} \overline{\omega_i} \omega_j) \right).\]
Since $\overline{\omega_i}\omega_j$ is uniformly distributed on the complex unit circle, we can assume without loss of generality that $A_{ij} = |A_{ij}| \ge 0$. Thus, letting $\phi$ be uniform on the complex unit circle,

\[\Var\left( \omega^* A \omega \right) = 4\sum_{i<j} \Var \left( |A_{ij}|\Re(\phi) \right) = 4\Var\left( \Re(\phi) \right)\sum_{i<j}|A_{ij}|^2.\]
The real and imaginary parts of $\phi$ have the same distribution, so

\[1 = \Var(\phi) = \Var(\Re \phi) + \Var(\Im \phi) = 2 \Var(\Re \phi),\]

so $\Var(\Re \phi) = 1/2$. Thus

\[\Var\left( \omega^* A \omega \right) = 2 \sum_{i<j}|A_{ij}|^2 = \sum_{i\ne j} |A_{ij}|^2.\]
Uniform on the Complex Sphere: Derivation 1 by Reduction to Real Case
There are at least three simple ways of deriving this result: the camel trick, reduction to the real case, and Haar integration. Each of these techniques illustrates a trick that is useful in its own right beyond the context of trace estimation. Since we have already seen an example of the camel trick for the real sphere, I will present the other two derivations.
Let us begin with the reduction to the real case. Let $\mathfrak{R}(\cdot)$ and $\mathfrak{I}(\cdot)$ denote the real and imaginary parts of a vector or matrix, taken entrywise. The key insight is that if $\omega$ is a uniform random vector on the complex sphere of radius $\sqrt{n}$, then

\[\mathscr{R}(\omega) := \begin{bmatrix} \mathfrak{R}(\omega) \\ \mathfrak{I}(\omega) \end{bmatrix} \in\mathbb{R}^{2n} \quad \text{is a uniform random vector on the real sphere of radius } \sqrt{n}.\]

We’ve converted the complex vector $\omega$ into a real vector $\mathscr{R}(\omega)$.
Now, we need to convert the complex matrix $A$ into a real matrix $\mathscr{R}(A)$. To do this, recall that one way of representing complex numbers is by $2\times 2$ matrices:

\[a + bi \iff \begin{bmatrix} a & -b \\ b & a \end{bmatrix}.\]
Using this correspondence, addition and multiplication of complex numbers can be carried out by addition and multiplication of the corresponding matrices.
To convert complex matrices to real matrices, we use a matrix version of the same representation:

\[\mathscr{R}(A) = \begin{bmatrix} \mathfrak{R}(A) & -\mathfrak{I}(A) \\ \mathfrak{I}(A) & \mathfrak{R}(A) \end{bmatrix}.\]
One can check that addition and multiplication of complex matrices can be carried out by addition and multiplication of the corresponding “realified” matrices, i.e.,

\[\mathscr{R}(A + B) = \mathscr{R}(A) + \mathscr{R}(B), \quad \mathscr{R}(A\cdot B) = \mathscr{R}(A) \cdot \mathscr{R}(B)\]

holds for all complex matrices $A$ and $B$.
We’ve now converted the complex matrix $A$ and vector $\omega$ into a real matrix $\mathscr{R}(A)$ and vector $\mathscr{R}(\omega)$. Let’s compare $\omega^*A\omega$ to $\mathscr{R}(\omega)^\top \mathscr{R}(A)\mathscr{R}(\omega)$. A short calculation reveals

\[\omega^*A\omega = \mathscr{R}(\omega)^\top \mathscr{R}(A)\mathscr{R}(\omega) .\]
Since $\mathscr{R}(\omega)$ is a uniform random vector in $\mathbb{R}^{2n}$ on the sphere of radius $\sqrt{n}$, $\sqrt{2}\,\mathscr{R}(\omega)$ is a uniform random vector on the sphere of radius $\sqrt{2n}$. Thus, by the variance formula for the real sphere, we get

\[\Var(\omega^*A\omega) = \Var[(\sqrt{2}\mathscr{R}(\omega))^\top (\mathscr{R}(A)/2)(\sqrt{2}\mathscr{R}(\omega) )] = \frac{4n}{2n+2} \left[ \|\mathscr{R}(A)/2\|_{\rm F}^2 - \frac{1}{8n}(\tr\mathscr{R}(A))^2 \right].\]
A short calculation verifies that $\left\|\mathscr{R}(A)\right\|_{\rm F}^2 = 2\left\|A\right\|_{\rm F}^2$ and $\tr\mathscr{R}(A) = 2\tr A$. Plugging these in, we obtain

\[\Var(\omega^*A\omega)= \frac{n}{n+1} \left[ \|A\|_{\rm F}^2 - \frac{1}{n}(\tr A)^2 \right].\]
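Here is a quick numerical check of the realification map and the identities used in this derivation. The helper functions are my own names for the map $\mathscr{R}(\cdot)$, written only as an illustrative sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                   # Hermitian matrix
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)

def realify_matrix(M):
    # [[Re(M), -Im(M)], [Im(M), Re(M)]]
    return np.block([[M.real, -M.imag], [M.imag, M.real]])

def realify_vector(v):
    return np.concatenate([v.real, v.imag])

RA, Rw = realify_matrix(A), realify_vector(w)

# Multiplicativity of the realification map
print(np.allclose(realify_matrix(A @ B), realify_matrix(A) @ realify_matrix(B)))

# Quadratic-form identity  w^* A w = R(w)^T R(A) R(w)  (for Hermitian A)
print(np.allclose((w.conj() @ A @ w).real, Rw @ RA @ Rw))

# Norm and trace relations used at the end of the derivation
print(np.isclose(np.linalg.norm(RA, "fro") ** 2, 2 * np.linalg.norm(A, "fro") ** 2))
print(np.isclose(np.trace(RA), 2 * np.trace(A).real))
```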
Uniform on the Complex Sphere: Derivation 2 by Haar Integration
The proof by reduction to the real case requires some cumbersome calculations and assumes that we have already computed the variance in the real case by some other means. The method of Haar integration is more slick, but it requires some pretty high-powered machinery. Haar integration may be a little bit overkill for this problem, but this technique is worth learning as it can handle some truly nasty expected value computations that appear, for example, in quantum information.
We seek to compute

\[\mathbb{E} [(\omega^*A \omega)^2].\]

The first trick will be to write this expression as a single matrix trace using the tensor (Kronecker) product $\otimes$. For those unfamiliar with the tensor product, the main properties we will be using are

(6)   \[(A\otimes B) (C\otimes D) = (AB) \otimes (CD), \quad \tr(A\otimes B) = \tr A \cdot \tr B. \]
We saw in the proof of unbiasedness that
\[\omega^* A \omega = \tr (\omega^*A\omega) = \tr (A \omega\omega^*).\]
Therefore, by (6),
\[(\omega^*A\omega)^2 = (\tr [A \omega\omega^*])^2 = \tr [A\omega\omega^* \otimes A\omega\omega^*] = \tr [(A\otimes A) (\omega\omega^* \otimes \omega\omega^*)].\]
Thus, to evaluate $\mathbb{E}[(\omega^*A\omega)^2]$, it will be sufficient to evaluate $\mathbb{E}[\omega\omega^* \otimes \omega\omega^*]$. Fortunately, there is a useful formula for this expectation provided by a field of mathematics known as representation theory (see Lemma 1 in this paper):

\[\mathbb{E}[ \omega\omega^* \otimes \omega\omega^*] = \frac{2n}{n+1} \operatorname{Proj}_{\operatorname{Sym}^2(\mathbb{C}^n)}.\]
Here, $\operatorname{Proj}_{\operatorname{Sym}^2(\mathbb{C}^n)}$ is the orthogonal projection onto the space of symmetric two-tensors $\operatorname{Sym}^2(\mathbb{C}^n) \subseteq \mathbb{C}^n \otimes \mathbb{C}^n$. Therefore, we have that

\[\mathbb{E}[(\omega^*A\omega)^2] = \tr [(A\otimes A) \, \mathbb{E}(\omega\omega^* \otimes \omega\omega^*)] = \frac{2n}{n+1} \tr [(A\otimes A) \operatorname{Proj}_{\operatorname{Sym}^2(\mathbb{C}^n)}].\]
To evaluate the trace on the right-hand side of this equation, there is another formula (see Lemma 6 in this paper):

\[\tr \left[(A\otimes B) \operatorname{Proj}_{\operatorname{Sym}^2(\mathbb{C}^n)}\right] = \frac{1}{2} \left( \tr(AB) + \tr A \cdot \tr B \right).\]
Therefore, we conclude
\begin{align*}
\Var(\omega^* A \omega) &= \mathbb{E}[(\omega^*A\omega)^2] - (\mathbb{E}[\omega^*A\omega])^2 \\
&= \frac{2n}{n+1}\tr [(A\otimes A) \operatorname{Proj}_{\operatorname{Sym}^2(\mathbb{C}^n)}] - (\tr A)^2 \\
&= \frac{n}{n+1}\left[ \tr A^2 + (\tr A)^2 \right] - (\tr A)^2 \\
&= \frac{n}{n+1}\left[ \left\|A\right\|_{\rm F}^2 - \frac{1}{n} (\tr A)^2 \right].
\end{align*}
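For the skeptical, the trace formula from Lemma 6 can be checked numerically. The sketch below uses the standard fact that $\operatorname{Proj}_{\operatorname{Sym}^2(\mathbb{C}^n)} = (I + S)/2$, where $S$ is the swap operator on $\mathbb{C}^n \otimes \mathbb{C}^n$; that representation is an assumption of mine beyond what is stated in the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                       # Hermitian matrix

# Swap operator S on C^n (x) C^n:  S (x tensor y) = y tensor x
S = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        S[i * n + j, j * n + i] = 1.0
proj_sym2 = (np.eye(n * n) + S) / 2            # projection onto Sym^2(C^n)

lhs = np.trace(np.kron(A, A) @ proj_sym2)
rhs = (np.trace(A @ A) + np.trace(A) ** 2) / 2
print(np.allclose(lhs, rhs))
```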
Proof of Optimality Properties
In this section, we provide proofs of the two optimality properties.
Optimality: Independent Vectors with Independent Coordinates
Assume $A$ is real and symmetric and suppose that $\omega$ is isotropic (2) with independent coordinates. The isotropy condition

\[\mathbb{E}[\omega\omega^\top] = I\]

implies that $\mathbb{E}[\omega_i\omega_j] = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker symbol. Using this fact, we compute the second moment:

\begin{align*}
\mathbb{E}[ (\omega^*A \omega)^2] &= \mathbb{E}\left[ \left( \sum_{i=1}^n A_{ii} \omega_i^2 +2 \sum_{i<j} A_{ij}\omega_i\omega_j \right)^2\right] \\
&= \sum_{i=1}^n A_{ii}^2 \mathbb{E}[\omega_i^4] + \sum_{i<j} (2A_{ii}A_{jj}+4A_{ij}^2) \mathbb{E}[\omega_i^2]\mathbb{E}[\omega_j^2] \\
&= \sum_{i=1}^n A_{ii}^2 \mathbb{E}[\omega_i^4] + \sum_{i<j} (2A_{ii}A_{jj}+4A_{ij}^2) .
\end{align*}
Thus
\[\Var(\omega^*A\omega) = \mathbb{E}[ (\omega^*A \omega)^2] - (\mathbb{E}[\omega^* A \omega])^2 = \sum_{i=1}^n A_{ii}^2 (\mathbb{E}[|\omega_i|^4]-1) + 4\sum_{i<j} A_{ij}^2.\]
The variance is minimized by choosing $\omega$ with $\mathbb{E}[\omega_i^4]$ as small as possible. Since $\mathbb{E}[\omega_i^2] = 1$, we always have $\mathbb{E}[\omega_i^4] \ge (\mathbb{E}[\omega_i^2])^2 = 1$, so the smallest possible value for $\mathbb{E}[\omega_i^4]$ is one, which is obtained by populating $\omega$ with random signs $\pm 1$.
Optimality: Independent Vectors
This result appears to have first been proven by Richard Kueng in unpublished work. We use an argument suggested to me by Robert J. Webber.
Assume $\mathscr{A}$ is a class of $\mathbb{F}$-Hermitian matrices closed under $\mathbb{F}$-unitary similarity transformations and that $\omega$ is an isotropic random vector (2). Decompose the test vector as

\[\omega = a \cdot s \quad \text{for} \quad a \in [0,+\infty), \: s \in\{x\in \mathbb{F}^n : x^*x = n \}.\]
First, we shall show that the worst-case variance is reduced by replacing the direction $s$ with a vector $t$ drawn uniformly from the sphere:

(7)   \[\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega}) \le \sup_{A\in\mathscr{A}} \Var(\omega^*A\omega), \]

where

(8)   \[\tilde{\omega} = a\cdot t \quad \text{and}\quad t\sim \text{Uniform} \{ x \in \mathbb{F}^n :x^*x = n \} \quad \text{is independent of $a$}. \]
Note that such a $t$ can be generated as $t = Qs$ for a uniformly random (Haar-distributed) $\mathbb{F}$-unitary matrix $Q$, drawn independently of $a$ and $s$. Therefore, we have

\begin{align*}
\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega})&= \sup_{A\in\mathscr{A}} \left[\mathbb{E}[(\tilde{\omega}^*A\tilde{\omega})^2] - (\tr A)^2\right]\\
&= \sup_{A\in\mathscr{A}} \left[\mathbb{E}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right].
\end{align*}
Now apply Jensen’s inequality only over the randomness in $Q$ to obtain

\begin{align*}
\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega})&= \sup_{A\in\mathscr{A}} \left[\mathbb{E}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right] \\
&\le \mathbb{E}_Q \sup_{A\in\mathscr{A}} \left[\mathbb{E}_{a,s}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right].
\end{align*}
Finally, note that since $\mathscr{A}$ is closed under $\mathbb{F}$-unitary similarity transformations, the supremum over $A \in \mathscr{A}$ of an expression involving $Q^*AQ$ is the same as the supremum of that expression involving $A$, so we obtain

\begin{align*}
\sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega})&\le \mathbb{E}_Q \sup_{A\in\mathscr{A}} \left[\mathbb{E}_{a,s}[(a^2 \cdot s^*(Q^*AQ)s)^2] - (\tr (Q^*AQ))^2\right] \\
&= \mathbb{E}_Q \sup_{A\in\mathscr{A}} \left[\mathbb{E}_{a,s}[(a^2 \cdot s^*As)^2] - (\tr A)^2\right] \\
&= \sup_{A\in\mathscr{A}} \Var(\omega^*A\omega).
\end{align*}
We have successfully proven (7). This argument is a specialized version of a far more general result which appears as Proposition 4.1 in this paper.
Next, we shall prove
(9)   \[\sup_{A\in\mathscr{A}} \Var(t^*At) \le \sup_{A\in\mathscr{A}} \Var(\tilde{\omega}^*A\tilde{\omega}), \]

where $\tilde{\omega}$ is still defined as in (8). Indeed, using the chain rule for variance, we obtain

\begin{align*}
\Var(\tilde{\omega}^*A\tilde{\omega})&= \Var(a^2\cdot t^*At) \\
&= \mathbb{E}[\Var(a^2\cdot t^* A t \mid a)] + \Var(\mathbb{E}[a^2\cdot t^* A t \mid a]) \\
&= \mathbb{E}[a^4]\Var(t^* A t )+ (\tr A)^2\Var(a^2) \\
&\ge \mathbb{E}[a^4]\Var(t^* A t ).
\end{align*}
Here, we have used that $t$ is uniform on the sphere and thus $\mathbb{E}[t^*At] = \tr A$. By definition, $a$ is the length of $\omega$ divided by $\sqrt{n}$. Therefore,

\[\mathbb{E}[a^2] = \frac{1}{n}\mathbb{E}[\omega^*\omega] = \frac{1}{n} \mathbb{E}[\tr (\omega\omega^*)] = \frac{1}{n} \tr (\mathbb{E}[\omega\omega^*]) = \frac{\tr I}{n} = 1.\]
Therefore, by Jensen’s inequality,
\[\mathbb{E}[a^4] = \mathbb{E}[(a^2)^2] \ge (\mathbb{E}[a^2])^2 = 1.\]
Thus
\[\Var(\tilde{\omega}^*A\tilde{\omega}) \ge \mathbb{E}[a^4]\Var(t^* A t ) \ge \Var(t^*At) \quad \text{for every }A,\]
which proves (9).