Note to Self: How Accurate is Sketch and Solve?

February 12, 2025 by Ethan N. Epperly Leave a comment

Suppose we’re interested in solving an overdetermined linear least-squares problem

(1) $x = \operatorname*{argmin}_{x \in \real^n} \norm{b-Ax}, \quad A \in \real^{m\times n}, \quad b \in \real^m.$

Sketch-and-solve is a popular method for getting a fast, approximate solution to (1). Let $S \in \real^{d\times m}$ be a sketching matrix for $\onebytwo{A}{b}$ of distortion $\eta$ (see these previous posts of mine for a refresher on sketching if needed):

(2) $(1-\eta)\norm{y} \le \norm{Sy} \le (1+\eta)\norm{y} \quad \text{for every } y \in \operatorname{range}(\onebytwo{A}{b}).$

The sketch-and-solve solution to (1) is given by

(3) $\hat{x} = \operatorname*{argmin}_{\hat{x}\in\real^n} \norm{Sb-(SA)\hat{x}}.$

How small is the residual $\norm{b-A\hat{x}}$ of the sketch-and-solve solution? Here’s a one bound:

Proposition 1 (Sketch-and-solve, $1+\order(\eta)$ bound). The sketch-and-solve solution (3) satisfies the bound
$\norm{b-A\hat{x}} \le \frac{1+\eta}{1-\eta} \cdot \norm{b-Ax} = (1 + 2\eta + \order(\eta^2))\norm{b-Ax}.$

Let’s prove this bound. The residual $b-A\hat{x}$ is in the range of $\onebytwo{A}{b}$ . Therefore, by (2),

$(1-\eta) \norm{b-A\hat{x}} \le \norm{S(b-A\hat{x})}.$

Rearranging gives

$\norm{b-A\hat{x}} \le \frac{1}{1-\eta}\norm{S(b-A\hat{x})}.$

By the optimality of the sketch-and-solve solution (3), $\norm{S(b-A\hat{x})}$ is minimized for the value $\hat{x}$ . Thus, its value can only increase by replacing $\hat{x}$ by $x$ :

$\norm{b-A\hat{x}} \le \frac{1}{1-\eta}\norm{S(b-A\hat{x})}\le \frac{1}{1-\eta}\norm{S(b-Ax)}.$

Invoke property (2) a final time to obtain

(4) $\norm{b-A\hat{x}} \le \frac{1}{1-\eta}\norm{S(b-A\hat{x})}\le \frac{1}{1-\eta}\norm{S(b-Ax)}\le \frac{1+\eta}{1-\eta}\cdot \norm{b-Ax}.$

We’ve completed the proof of Proposition 1. In fact, were we to want to write the proof concisely, equation (4) represents a single-line proof of Proposition 1.

The conclusion of Proposition 1 appears to be that the residual norm $\norm{b-A\hat{x}}$ may be a factor $1+\order(\eta^1)$ larger than the minimal least-squares residual $\norm{b-Ax}$ . Interestingly, this conclusion is not sharp. In fact, the residual for sketch-and-solve can only be at most $1+\order(\eta^2)$ larger than optimal. This fact has been known at least since the work of Drineas, Mahoney, Muthukrishnan, and Sarlós (2007). The note of Larsen and Kolda (2022) provides a nice summary of Drineas et al.’s argument. My goal in this post is to simplify this argument even further, making it as elementary as possible.

Notation

To make our lives easier, we’ll begin by defining some notation. Let

(5) $A = UC \quad \text{for } U\in\real^{m\times n}, \: B \in \real^{n\times n}$

be any factorization of $A$ into a matrix $U$ with orthonormal columns and a square nonsingular matrix $C$ . Such a factorization can be computed using a QR or polar decomposition. Let

$r = b-Ax$

denote the optimal least-squares residual. We will have occasion to scale $r$ so that it is a unit vector

$\overline{r} = \frac{r}{\norm{r}}.$

Note that the optimality condition for the least-squares problem (1) is that the residual $r$ is orthogonal to $\operatorname{range}(A)$ . Consequently,

(6) $\onebytwo{U}{\overline{r}} \quad \text{is an orthonormal basis for } \operatorname{range}(\onebytwo{A}{b}).$

Sketching the basis

As saw in (6), the matrix $\onebytwo{U}{\overline{r}}$ is an orthonormal basis for $\operatorname{range}(\onebytwo{A}{b})$ . To get a sharper analysis of sketch-and-solve, we will need the following result, which shows that $S\onebytwo{U}{\overline{r}}$ is an “almost orthonormal basis”.

Lemma 2 (Sketching the basis): The matrix $S\onebytwo{U}{\overline{r}}$ has nearly orthonormal columns in the sense that

(7) $\sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) \ge 1-\eta, \quad \sigma_{\rm max}(S\onebytwo{U}{\overline{r}}) \le 1+\eta.$

Consequently,

(8) $\norm{\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}} - I} \le 2\eta+\eta^2.$

This result is a pretty direct reformulation of the sketching property (2). By the variational characterization of the minimum singular value, we have

$\sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) = \min_{\norm{z}=1} \norm{S\onebytwo{U}{\overline{r}}z}.$

The vector $\onebytwo{U}{\overline{r}}z$ is in the range of $\onebytwo{A}{b}$ . Thus, by (2),

$\sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) = \min_{\norm{z}=1} \norm{S\onebytwo{U}{\overline{r}}z} \ge (1-\eta) \min_{\norm{z}=1} \norm{\onebytwo{U}{\overline{r}}z} = 1-\eta.$

For the last equality, we used the fact that $\onebytwo{U}{\overline{r}}$ has orthonormal columns and thus

$\norm{\onebytwo{U}{\overline{r}}z} = \norm{z} \quad \text{for every } z \in \real^{n+1}.$

This proves the first inequality in (7). The second inequality follows along similar lines.

For a tall matrix $F$ , the eigenvalues of $F^\top F$ are equal to the squared singular values of $F$ . Applying this result to $F = \onebytwo{U}{\overline{r}}$ gives

$(1-\eta)^2 \le \lambda_{\rm min}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}) \le \lambda_{\rm max}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}) \le (1+\eta)^2.$

Subtracting the identity matrix shifts all the eigenvalues down by $1$ :

$-2\eta+\eta^2 \le \lambda_{\rm min}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}-I) \le \lambda_{\rm max}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}-I) \le 2\eta+\eta^2.$

The spectral norm is equal to the largest eigenvalue in absolute value. Thus

$\norm{\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}-I} \le \max(2\eta+\eta^2,-(-2\eta+\eta^2)) = 2\eta+\eta^2.$

This proves (8).

Sketch-and-Solve: Better Analysis

With Lemma 2 in hand, we are ready to state and prove our main result:

Theorem 3 (Sketch-and-solve, $1+\order(\eta^2)$ bound). The sketch-and-solve solution (3) satisfies the bound
(9) $\norm{b-A\hat{x}}^2 \le \left(1 + \frac{(2\eta+\eta^2)^2}{(1-\eta)^4}\right) \norm{b-Ax}^2 \le \left(1 + \frac{9\eta^2}{(1-\eta)^4}\right)\norm{b-Ax}^2.$
As a consequence,
(10) $\norm{b-A\hat{x}}^2 \le (1+4\eta^2+\order(\eta^3)) \norm{b-Ax}^2$
and
(11) $\norm{b-A\hat{x}} \le (1+2\eta^2+\order(\eta^3)) \norm{b-Ax}.$

Let’s prove Theorem 3. Begin by decomposing the sketch-and-solve residual $b - A\hat{x}$ :

$b - A\hat{x} = b - Ax + Ax - A\hat{x} = r + A(\hat{x}-x).$

The residual $r$ is orthogonal to the range of $A$ . Thus, by the Pythagorean theorem,

(12) $\norm{b-A\hat{x}}^2 = \norm{r}^2 + \norm{A(\hat{x}-x)}^2.$

To bound $\norm{A(\hat{x}-x)}^2$ , it will help us to have a more convenient formula for $\hat{x}-x$ . To this end, reparametrize the sketch-and-solve least-squares problem as an optimization problem over the error $e \coloneqq \hat{x} - x$ :

$\hat{x} = \operatorname*{argmin}_{\hat{x}\in\real^n} \norm{S(b - Ax)-SA(\hat{x}-x)} \implies \hat{x} - x = \operatorname*{argmin}_{e\in\real^n} \norm{Sr - SAe}.$

Applying the normal equations to this least-squares problem, we get

$\hat{x} - x = (A^\top S^\top SA)^{-1} A^\top S^\top Sr.$

Now, invoke the factorization (5):

$\hat{x} - x = C^{-1} (U^\top S^\top SU)^{-1} U^\top S^\top Sr.$

Thus,

$A(\hat{x} - x) = U(U^\top S^\top SU)^{-1} U^\top S^\top Sr.$

Taking norms and using the definition $r = \norm{r} \cdot \overline{r}$ , we obtain

(13) $\norm{A(\hat{x} - x)} \le \norm{(U^\top S^\top S U)^{-1}} \cdot \norm{U^\top S^\top S \overline{r}} \cdot \norm{r}.$

We now work to bound the two factors in (10). The matrix $SU$ is a submatrix of $S\onebytwo{U}{\overline{r}}$ . Thus, by (7),

$\sigma_{\rm min}(SU) \ge \sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) \ge 1-\eta.$

Consequently,

$\norm{(U^\top S^\top S U)^{-1}} = \sigma_{\rm min}^2(SU) \le \frac{1}{(1-\eta)^2}.$

The matrix $U^\top S^\top S\overline{r}$ is a submatrix of $\onebytwo{U}{\overline{r}}S^\top S\onebytwo{U}{\overline{r}} - I$ . Thus, by (8),

$\norm{U^\top S^\top S \overline{r}} \le \norm{\onebytwo{U}{\overline{r}}S^\top S\onebytwo{U}{\overline{r}} - I}\le 2\eta + \eta^2.$

Substituting the two previous displays into (13) yields

$\norm{A(\hat{x} - x)} \le \frac{2\eta+\eta^2}{(1-\eta)^2} \cdot \norm{r}.$

Plugging this display into (12) proves (9). Equations (10) and (11) follow from (9) and a Taylor series expansion. This proves Theorem 3.

Can We Do Better?

In the limit $\eta \to 0$ , Theorem 3 identifies the correct $1+\order(\eta^2)$ scaling for sketch-and-solve, consistent both with numerical experiments, exact computations for Gaussian embeddings, and lower bounds for solving least-squares problems by row subselection. However, comparing results for Gaussian embeddings shows there may be room for improvement. Here is an informal statement of the Gaussian results:

Informal Theorem 4 (Gaussian sketch-and-solve): Appropriately scaled, a $d\times m$ Gaussian matrix is an $\eta$ -sketching matrix provided $d\approx n/\eta^2$ . The expected least-squares residual is
(14) $\expect \norm{b - A\hat{x}}^2 \approx \left(1 + \frac{\eta^2}{(1+\eta)(1-\eta)}\right) \norm{b - Ax}^2.$

See this post of mine for a formal version of this result, its proof, and a discussion of the history of this result. We see that the bound (9) for general embeddings takes the form $1 + \order((1-\eta)^{-4})$ in the limit $\eta\to 1$ , whereas the bound (14) for Gaussian embeddings scales like $1 + \order((1-\eta)^{-1})$ . We leave it as a conjecture/open problem whether there is an improved argument for general subspace embeddings:

Conjecture 5 (Sketch-and-solve, large $\eta$ ): The sketch-and-solve solution (3) satisfies the bound
$\norm{b-A\hat{x}}^2 \le \left(1 + \frac{C\eta^2}{1-\eta}\right)\norm{b-Ax}^2$
for an absolute constant $C > 0$ .

Notes: For another proof demonstrating the correction $\order(\eta^2)$ scaling for a different type of dimensionality reduction (leverage score sampling), check out this article by Raphael Meyer.

Note to Self: Sketch-and-Solve with a Gaussian Embedding

November 19, 2024 by Ethan N. Epperly Leave a comment

In the comments to my post Does Sketching Work?, I was asked a very interesting question, which I will paraphrase as follows:

Is the sketch-and-solve least-squares solution $\hat{x} = \operatorname*{argmin}_{\hat{x}} \norm{(SA)\hat{x} - Sb}$ an unbiased estimate of the true least squares solution of $x = \operatorname*{argmin}_x \norm{Ax - b}$ ?

In this post, I will answer this question for the special case in which $S$ is a Gaussian sketch, and I will also compute the expected least-squares residual $\expect \norm{A\hat{x}-b}^2$ . Throughout this post, I will assume knowledge of sketching; see my previous two posts for a refresher if needed. For this post, $A$ will have dimensions $n\times d$ and $S$ will have dimensions $\ell\times n$ , and these parameters are related $n\ge \ell \ge d$ .

I thought I knew the answer to this question—that sketch-and-solve suffers from inversion bias. Indeed, the fact I remembered is that

(1) $\expect\Big[\big((SA)^\top (SA)\big)^{-1}\Big] \ne \big(A^\top A\big)^{-1}$

for the sketching matrices $S$ in common use. The issue of the inversion bias (1) is discussed in this paper.

However, the inversion bias (1) does not imply that sketch-and-solve is biased. In fact, for a Gaussian sketch¹

(2) $S \in \real^{\ell\times n} \quad \text{with } S_{ij} \sim \text{Normal}(0,1) \text{ iid},$

sketch-and-solve is unbiased.

Theorem (Gaussian sketch-and-solve is unbiased). Let $S$ be a Gaussian sketch (2) and let $A \in \real^{n\times d}$ be a full column-rank matrix. Then
(3) $\expect[(SA)^\dagger(Sb)] = A^\dagger b.$
That is, the sketch-and-solve solution $(SA)^\dagger(Sb)$ is an unbiased estimate for the least-squares solution $A^\dagger b$ .

Here, ${}^\dagger$ is the Moore–Penrose pseudoinverse, which effectuates the least-squares solution:

$A^\dagger b = \operatorname*{argmin}_{x\in\real^d} \norm{Ax - b}.$

In particular, by the normal equations, we have the relation

(4) $A^\dagger = (A^\top A)^{-1} A^\top,$

valid for any matrix $A$ with full column-rank.

Let us prove this theorem. The Gaussian sketch (1) is orthogonally invariant: that is, $V^\top SU$ has the same distribution as $S$ for any orthogonal matrices $U$ and $V$ . Thus, we are free to reparametrize. Let

$A = U\twobyone{\Sigma}{0}V^\top$

be a (full) singular value decomposition of $A$ . Make the changes of variables

$A \mapsto U^\top A V, \quad S \mapsto V^\top SU, \quad b \mapsto U^\top b.$

After this change of variables, $S$ still has the same distribution (2) and

$A = \twobyone{\Sigma}{0}.$

Partition the reparametrized $b$ and $S$ conformally with $A$ :

$b = \twobyone{b_1}{b_2}, \quad S = \onebytwo{S_1}{S_2}.$

Under this change of variables, the least-squares solution is

(5) $A^\dagger b = \Sigma^{-1}b_1.$

Now, we are ready to prove the claim (3). Observe that $SA = S_1\Sigma$ . Begin by using the normal equations (4) to write out the sketch-and-solve solution

$(SA)^\dagger (Sb) = [(SA)^\top (SA)]^{-1}(SA)^\top (Sb).$

Observe that $SA = S_1\Sigma$ . Thus,

$(SA)^\dagger (Sb) = [\Sigma S_1^\top S_1\Sigma]^{-1}\Sigma S_1^\top \onebytwo{S_1}{S_2}\twobyone{b_1}{b_2}.$

Simplify to obtain

$(SA)^\dagger (Sb) = \Sigma^{-1}(S_1^\top S_1)^{-1} S_1^\top (S_1b_1 + S_2 b_2).$

Using the normal equations (4) and the solution formula (5), we obtain

(6) $(SA)^\dagger (Sb) = \Sigma^{-1}b_1 + \Sigma^{-1} S_1^\dagger S_2b_2 = A^\dagger b + \Sigma^{-1} S_1^\dagger S_2b_2.$

By the definition (2) of $S$ , $S_1$ and $S_2$ are independent and $\expect[S_2] = 0$ . Thus, taking expectations, we obtain

$\expect[(SA)^\dagger (Sb)] = A^\dagger b + \Sigma^{-1} \expect[S_1^\dagger] \expect[S_2]b_2 = A^\dagger b.$

The desired claim (3) is proven.

Note that the solution formula (6) holds for any sketching matrix $S$ . We only use Gaussianity at the last step, where we invoke three properties (1) $S$ is mean zero; (2) $S$ is orthogonally invariant; and (3) conditional on the first $d$ columns of $S$ , the last $n-d$ columns of $S$ are mean-zero.

The Least-Squares Residual for Gaussian Sketch-and-Solve

The solution formula (6) can be used to understand other properties of the Gaussian sketch-and-solve solution. In this section, we use (6) to understand the expected squared residual norm

$\expect \norm{A\hat{x} - b}^2,$

where $\hat{x} = (SA)^\dagger (Sb)$ is the sketch-and-solve solution. We write the expected residual norm as

$\expect\norm{A\hat{x} - b}^2 = \expect\norm{\twobyone{b_1 + S_1^\dagger S_2b_2}{0} - \twobyone{b_1}{b_2}}^2 = \norm{b_2}^2 + \expect\norm{S_1^\dagger S_2b_2}^2.$

Now, use the Gaussian–inverse Gaussian matrix product formula to compute the expectation, obtaining

$\expect\norm{A\hat{x} - b}^2 = \left(1+\frac{d}{\ell-d-1}\right)\norm{b_2}^2.$

Finally, we recognize $\norm{b_2}$ is the optimal least-squares residual $\min_x \norm{Ax - b}$ . Thus, we have shown

$\expect\norm{A\hat{x} - b}^2 = \left(1+\frac{d}{\ell-d-1}\right)\min_{x\in\real^d} \norm{Ax-b}^2.$

Note that this is an exact formula for the expected squared residual for the sketch-and-solve solution! Thus, the sketch-and-solve solution has an expected squared residual that factor $1 + d/(\ell-d-1)$ larger than the optimal value. In particular,

$\expect\norm{A\hat{x} - b}^2 = \left(1+\varepsilon\right)\min_{x\in\real^d} \norm{Ax-b}^2 \quad \text{when } \ell = \frac{d}{\varepsilon} + d + 1.$

Remark (Existing literature): When I originally wrote this post, I had not seen these results anywhere in the literature. Thanks to Michał Dereziński, who provided me with the following references: the following lecture notes of Mert Pilanci, equation 1 in this recent paper of Michał’s, and this recent survey by Michał and Michael Mahoney. I find it striking that these basic and beautiful results appear to only have been explicitly recorded very recently. Very big thanks also to Rob Webber for providing help in preparing this post.

Update (11/21/24): After this post was initially released, Mert Pilanci provided me with an earlier reference (2020); see also this paper for extensions. He also mentions an even earlier paper (2017) that does similar calculations in a statistical setting, but does not obtain exactly these formulas. See also this paper for exact and asymptomatically exact formulas for sketch-and-solve in the statistical setting.

Note to Self: Hanson–Wright and Trace Estimation with Random Vectors on the Sphere

November 4, 2024 by Ethan N. Epperly Leave a comment

Tyler Chen has just released an excellent monograph on the Lanczos method, one of the most powerful algorithms for extracting information from a symmetric matrix $A$ . I would definitely recommend checking it out.

One of the subjects touched on by this book is stochastic trace estimation, a topic I’ve written about several times before on this blog. In his book, Tyler uses stochastic trace estimates based on random vectors on the sphere (a great choice), and he provides a simple, but often pessimistic, mathematical analysis of this approach. To complement Tyler’s amazing book, I felt it was a good opportunity to share a sharper analysis that I came up with using a more sophisticated argument. I would not be surprised if this argument appears elsewhere in the literature, but I am not aware of a source.

Let $x,x_1,\ldots,x_m$ be random vectors in $\real^n$ drawn uniformly at random from the sphere of radius $\sqrt{n}$ ,¹ let $A$ be a real symmetric matrix, and define the quadratic form

$q \coloneqq x^\top Ax$

and the Girard–Hutchinson trace estimator

$\hat{\tr}\coloneqq \frac{1}{m}\sum_{i=1}^m x_i^\top Ax_i.$

Both $q$ and $\hat{\tr}$ are unbiased estimates for the trace of $A$ , $\expect[q] = \expect[\hat{\tr}] = \tr A$ . The goal of this post will be to bound the probability of these quantities being much smaller or larger than the trace of $A$ .

Hanson–Wright on the Sphere

Observe that $\hat{\tr}$ is an average of $m$ independent copies of the random variable $q$ . Therefore, we will begin our analysis with the quadratic form $q$ and discuss the Girard–Hutchinson estimate $\hat{\tr}$ at the end of the post.

Centering

Begin by introducing the centered matrix

$\overline{A} = A - \frac{\tr A}{n}I.$

The transformation $A\mapsto\overline{A}$ has the effect of shifting $A$ ’s eigenvalues to have mean zero. Consequently, since the trace is the sum of the eigenvalues, $\overline{A}$ has trace zero.

Now, rewrite $q$ in terms of the centered matrix $\overline{A}$ :

$q = x^\top A x = x^\top \overline{A} x + \frac{\tr A}{n} \cdot x^\top I x = x^\top \overline{A} x + \tr A.$

In the final equality, we use the fact that $x$ is on the sphere of radius $\sqrt{n}$ so that $x^\top I x = \norm{x}^2 = n$ . Rearranging, we see that the error $\overline{q}\coloneqq q-\tr A$ satisfies

$\overline{q} = q-\tr A = x^\top \overline{A}x.$

We have shown that the error $\overline{q}$ depends only on the centered matrix $\overline{A}$ rather than the original matrix $A$ . This observation is at the root of the reason that quadratic forms with vectors on the sphere have smaller tail probabilities than other distributions like the Gaussian distribution. Indeed, $\overline{A}$ is smaller than $A$ when measured using the Frobenius norm $\norm{\overline A}_{\rm F} \le \norm{A}_{\rm F}$ , and the spectral norm is never much larger $\norm{\overline{A}} \le 2\norm{A}$ . These observations will be important later.

Laplace Transform Method and Comparison to Standard Gaussian Vectors

To derive exponential concentration inequalities, we make use of the Laplace transform method. For a random variable $z$ , define the cumulant generating function (cgf)

$\xi_z(\theta) = \log (\expect [\exp(\theta z)]).$

Bounds on the cgf immediately yield bounds on the random variable taking large values, so it will behoove us to estimate this quantity for $\overline{q}$ .

To estimate the cgf of $\overline{q}$ , we will make use of a comparison between Gaussian vectors and random vectors on the sphere. Recall that a standard Gaussian vector $g$ is a vector with independent standard Gaussian entries. Standard Gaussian random vectors are spherically symmetric, meaning they can be written in the form $g = a\cdot x$ , where $x$ is a uniform random vector on the sphere and $a$ is a scaling factor.² The scaling factor $a$ is independent of $x$ and, in this case, satisfies $\expect[a^2] = 1$ .³

Using this relationship, the error $\overline{q}$ can be connected to the standard Gaussian random vector $g$ as follows

$g^\top \overline{A} g = a^2 \cdot x^\top \overline{A} x = a^2 \cdot \overline{q}.$

Intuitively, we might expect that multiplying the random error $\overline{q}$ by a scaling factor of mean one should only have the effect of increasing the variability of $\overline{q}$ and thus increasing its probability of taking large values. Based on this intuition, we might conjecture that the cgfs are related

$\xi_{\overline{q}}(\theta) \le \xi_{g^\top A g}(\theta) \quad \text{for all } \theta \in \real.$

Indeed, this conjecture is true.

Proposition (Scaling hurts). Let $z$ be a random variable and let $b$ be a random variable with expectation $1$ , independent of $z$ . Then
$\xi_{z}(\theta) \le \xi_{bz}(\theta) \quad \text{for all } \theta \in \real.$

To prove this result, let $\expect_b$ and $\expect_z$ denote expectations take over the randomness in $b$ and $z$ separately. The total expectation is $\expect[\cdot] = \expect_z[\expect_b[\cdot ]]$ . Begin with the right-hand side and invoke Jensen’s inequality over $b$ :

$\xi_{bz}(\theta) = \xi_z(\theta) = \log (\expect_z [\expect_b [\exp(\theta bz)]]) \ge \log (\expect_z [\exp(\theta \expect_b[b]z)]) = \xi_z(\theta).$

In the last line, we use the hypothesis $\expect[b] = 1$ and the definition of the cgf.

Having established this proposition, we conclude that $\xi_{\overline{q}}(\theta) \le \xi_{g^\top \overline{A} g}(\theta)$ for all $\theta$ .

Completing the Analysis

Finally, invoking a standard bound for the cgf of $g^\top \overline{A}g$ for a trace-zero matrix $\overline{A}$ , we obtain

$\xi_{\overline{q}}(\theta) \le \xi_{g^\top \overline{A} g}(\theta) \le \frac{\theta^2 \norm{\overline{A}}_{\rm F}^2}{1 - 2\theta \norm{\overline{A}}}.$

A cgf bound of this form immediately yields the following tail bound (see Boucheron, Lugosi, and Massart’s Concentration Inequalities page 29 and Exercise 2.8):

$\prob \{ x^\top A x - \tr(A) \ge t \} \le \exp \left( - \frac{t^2/2}{2 \norm{\overline{A}}_{\rm F}^2 + 2 \overline{\norm{\overline{A}}t}} \right).$

One can also control the lower tail event $x^\top A x - \tr(A) \le -t$ by instantiating this result with $-A$ . Union bounding over the upper and lower tails gives the symmetric bound

$\prob \{ |x^\top A x - \tr(A)| \ge t \} \le 2\exp \left( - \frac{t^2/2}{2 \norm{\overline{A}}_{\rm F}^2 + 2 \overline{\norm{\overline{A}}t}} \right).$

Is This Bound Good?

Tail bounds for quadratic forms such as $x^\top A x$ or $g^\top A g$ are known as Hanson–Wright inequalities. For vectors on the sphere, our bound shows the tail probabilities decay like $\exp(-O(t^2/\norm{\overline{A}_{\rm F}}^2))$ for small $t > 0$ and $\exp(-O(t/\norm{\overline{A}}))$ for large $t>0$ . This pattern of results “subgaussian scaling for small $t$ , subexponential scaling for large $t$ ” is typical for Hanson–Wright inequalities.

The Hanson–Wright inequalities for vectors on the sphere exhibit faster decay rate than for standard Gaussian vectors. Indeed, a standard Gaussian random vector $g$ obeys a Hanson–Wright inequality of the form

$\prob \{ g^\top A g - \tr A \ge t \} \le \exp \left( -\frac{t^2/2}{C\norm{A}_{\rm F}^2 + c t \norm{A}}\right)$

for positive constants $C,c > 0$ . For vectors on the sphere, $\norm{A}_{\rm F}^2$ and $\norm{A}$ have been replaced by $\norm{\overline{A}}_{\rm F}^2$ and $\norm{\overline{A}}$ which are always smaller (and sometimes much smaller). The smaller tail probabilities for $x^\top A x$ versus $g^\top A g$ is a real phenomenon, not an artifact of the mathematical analysis.

As another way to evaluate the quality of these bounds, recall that a Gaussian random variable with mean zero and variance $v$ has tail probabilities roughly of size $\exp(-t^2/2v)$ . For small $t$ , our Hanson–Wright inequality for vectors on the sphere has the same form with $v = 2\norm{\overline{A}}_{\rm F}^2$ . The true variance of $q = x^\top A x$ is

$\Var(q) = \frac{n}{n+2} \cdot 2\norm{\overline{A}}_{\rm F}^2,$

which is close to $v = 2\norm{\overline{A}}_{\rm F}^2$ for large $n$ . (See this post of mine for a derivation of the variance formula.) Thus, even up to the constants, the Hanson–Wright inequality we’ve derived for random vectors on the sphere is nearly as good as one could possibly hope (at least for small $t$ ).

Girard–Hutchinson Estimator on the Sphere

Using our results for one quadratic form, tail bounds for the Girard–Hutchinson estimator $\overline{\tr} = m^{-1} \sum_{i=1}^m x_i^\top A x_i$ easily follow. Indeed, the cgf is additive for independent random variables and satisfies the identity $\xi_{cz}(\theta) = \xi_z(c\theta)$ for constant $c, so

$\xi_{\overline{\tr}}(\theta) = \sum_{i=1}^m \xi_{x_i^\top Ax_i}(\theta/m) \le m \cdot \frac{(\theta/m)^2 \norm{\overline{A}}_{\rm F}^2}{1 - 2(\theta/m)\norm{\overline{A}}}.$

This cgf bound leads to tail bounds

$\begin{align*}\prob \{ \overline{\tr} - \tr(A) \ge t \} &\le \exp \left( - \frac{mt^2/2}{2\norm{\overline{A}}_{\rm F}^2 + 2m \overline{\norm{\overline{A}}t}} \right), \\\prob \{ |\overline{\tr} - \tr(A)| \ge t \} &\le 2\exp \left( - \frac{mt^2/2}{2\norm{\overline{A}}_{\rm F}^2 + 2m \overline{\norm{\overline{A}}t}} \right).\end{align*}$

Note to Self: Gaussian Hypercontractivity

February 4, 2024 by Ethan N. Epperly Leave a comment

The purpose of this note is to describe the Gaussian hypercontractivity inequality. As an application, we’ll obtain a weaker version of the Hanson–Wright inequality.

The Noise Operator

We begin our discussion with the following question:

Let $f : \real^n \to \real$ be a function. What happens to $f$ , on average, if we perturb its inputs by a small amount of Gaussian noise?

Let’s be more specific about our noise model. Let $x \in \real^n$ be an input to the function $f$ and fix a parameter $0 \le \varrho \le 1$ (think of $\varrho$ as close to 1). We’ll define the noise corruption of $x$ to be

(1) $\tilde{x}_\varrho = \varrho \cdot x + \sqrt{1-\varrho^2} \cdot g, \quad \text{where } g\sim \operatorname{Normal}(0,I).$

Here, $\operatorname{Normal}(0,I)$ is the standard multivariate Gaussian distribution. In our definition of $\tilde{x}_\varrho$ , we both add Gaussian noise $\sqrt{1-\varrho^2} \cdot g$ and shrink the vector $x$ by a factor $\varrho$ . In particular, we highlight two extreme cases:

No noise. If $\varrho = 1$ , then there is no noise and $\tilde{x}_\varrho = x$ .
All noise. If $\varrho = 0$ , then there is all noise and $\tilde{x}_\varrho = g$ . The influence of the original vector $x$ has been washed away completely.

The noise corruption (1) immediately gives rise to the noise operator¹ $T_\varrho$ . Let $f : \real^n \to \real$ be a function. The noise operator $T_\varrho$ is defined to be:

(2) $(T_\varrho f)(x) = \expect[f(\tilde{x}_\varrho)] = \expect_{g\sim \operatorname{Normal}(0,I)}[f( \varrho \cdot x + \sqrt{1-\varrho^2}\cdot g)].$

The noise operator computes the average value of $f$ when evaluated at the noisy input $\tilde{x}_\varrho$ . Observe that the noise operator maps a function $f : \real^n \to \real$ to another function $T_\varrho f : \real^n \to \real$ . Going forward, we will write $T_\varrho f(x)$ to denote $(T_\varrho f)(x)$ .

To understand how the noise operator acts on a function $f$ , we can write the expectation in the definition (2) as an integral:

$T_\varrho f(x) = \int_{\real^d} f(\varrho x + y) \frac{1}{(2\pi (1-\varrho^2))^{d/2}}\e^{-\frac{|y|^2}{2(1-\varrho^2)}} \, \mathrm{d} y.$

Here, $|y|$ denotes the (Euclidean) length of $y\in\real^d$ . We see that $T_\varrho f$ is the convolution of $f(\varrho x)$ with a Gaussian density. Thus, $T_\varrho$ acts to smooth the function $f$ .

See below for an illustration. The red solid curve is a function $f$ , and the blue dashed curve is $T_{0.95}f$ .

As we decrease $\varrho$ from $1$ to $0$ , the function $T_\varrho f$ is smoothed more and more. When we finally reach $\varrho = 0$ , $T_\varrho f$ has been smoothed all the way into a constant.

Random Inputs

The noise operator converts a function $f$ to another function $T_\varrho f$ . We can evaluate these two functions at a Gaussian random vector $x\sim \operatorname{Normal}(0,I)$ , resulting in two random variables $f(x)$ and $T_\varrho f(x)$ .

We can think of $T_\varrho f(x)$ as a modification of the random variable $f(x)$ where “a $1-\varrho^2$ fraction of the variance of $x$ has been averaged out”. We again highlight the two extreme cases:

No noise. If $\varrho = 1$ , $T_\varrho f(x) = f(x)$ . None of the variance of $x$ has been averaged out.
All noise. If $\varrho=0$ , $T_\varrho f(x) = \expect_{g\sim\operatorname{Normal}(0,I)}[f(g)]$ is a constant random variable. All of the variance of $x$ has been averaged out.

Just as decreasing $\varrho$ smoothes the function $T_\varrho f$ until it reaches a constant function at $\varrho = 0$ , decreasing $\varrho$ makes the random variable $T_\varrho f(x)$ more and more “well-behaved” until it becomes a constant random variable at $\varrho = 0$ . This “well-behavingness” property of the noise operator is made precise by the Gaussian hypercontractivity theorem.

Moments and Tails

In order to describe the “well-behavingness” properties of the noise operator, we must answer the question:

How can we measure how well-behaved a random variable is?

There are many answers to this question. For this post, we will quantify the well-behavedness of a random variable by using the $L_p$ norm.²

The $L_p$ norm of a ( $\real$ -valued) random variable $y$ is defined to be

(3) $\norm{y}_p \coloneqq \left( \expect[|y|^p] \right)^{1/p}.$

The $p$ th power of the $L_p$ norm $\norm{y}_p^p$ is sometimes known as the $p$ th absolute moment of $y$ .

The $L_p$ norms of random variables control the tails of a random variable—that is, the probability that a random variable is large in magnitude. A random variables with small tails is typically thought of as a “nice” or “well-behaved” random variable. Random quantities with small tails are usually desirable in applications, as they are more predictable—unlikely to take large values.

The connection between tails and $L_p$ norms can be derived as follows. First, write the tail probability $\prob \{|y| \ge t\}$ for $t > 0$ using $p$ th powers:

$\prob \{|y| \ge t\} = \prob\{ |y|^p \ge t^p \}.$

Then, we apply Markov’s inequality, obtaining

(4) $\prob \{|y| \ge t\} = \prob \{ |y|^p \ge t^p \} \le \frac{\expect [|y|^p]}{t^p} = \frac{\norm{y}_p^p}{t^p}.$

We conclude that a random variable with finite $L_p$ norm (i.e., $\norm{y}_p < +\infty$ ) has tails that decay at at a rate $1/t^p$ or faster.

Gaussian Contractivity

Before we introduce the Gaussian hypercontractivity theorem, let’s establish a weaker property of the noise operator, contractivity.

Proposition 1 (Gaussian contractivity). Choose a noise level $0 < \varrho \le 1$ and a power $p\ge 1$ , and let $x\sim \operatorname{Normal}(0,I)$ be a Gaussian random vector. Then $T_\varrho$ contracts the $L_p$ norm of $f(x)$ :
$\norm{T_\varrho f(x)}_p \le \norm{f(x)}_p.$

This result shows that the noise operator makes the random variable $T_\varrho f(x)$ no less nice than $f(x)$ was.

Gaussian contractivity is easy to prove. Begin using the definition of the noise operator (2) and $L_p$ norm (3):

$\norm{T_\varrho f(x)}_p^p = \expect_{x\sim \operatorname{Normal}(0,I)} \left[ \left|\expect_{g\sim \operatorname{Normal}(0,I)}[f(\varrho x + \sqrt{1-\varrho^2}\cdot g)]\right|^p\right]$

Now, we can apply Jensen’s inequality to the convex function $t\mapsto t^p$ , obtaining

$\norm{T_\varrho f(x)}_p^p \le \expect_{x,g\sim \operatorname{Normal}(0,I)} \left[ \left|f(\varrho x + \sqrt{1-\varrho^2}\cdot g)\right|^p\right].$

Finally, realize that for the independent normal random vectors $x,g\sim \operatorname{Normal}(0,I)$ , we have

$\varrho x + \sqrt{1-\varrho^2}\cdot g \sim \operatorname{Normal}(0,I).$

Thus, $\varrho x + \sqrt{1-\varrho^2}\cdot g$ has the same distribution as $x$ . Thus, using $x$ in place of $\varrho x + \sqrt{1-\varrho^2}\cdot g$ , we obtain

$\norm{T_\varrho f(x)}_p^p \le \expect_{x\sim \operatorname{Normal}(0,I)} \left[ \left|f(x)\right|^p\right] = \norm{f(x)}_p^p.$

Gaussian contractivity (Proposition 1) is proven.

Gaussian Hypercontractivity

The Gaussian contractivity theorem shows that $T_\varrho f(x)$ is no less well-behaved than $f(x)$ is. In fact, $T_\varrho f(x)$ is more well-behaved than $f$ is. This is the content of the Gaussian hypercontractivity theorem:

Theorem 2 (Gaussian hypercontractivity): Choose a noise level $0 < \varrho \le 1$ and a power $p\ge 1$ , and let $x\sim \operatorname{Normal}(0,I)$ be a Gaussian random vector. Then
$\norm{T_\varrho f(x)}_{1+(p-1)/\varrho^2} \le \norm{f(x)}_p.$
In particular, for $p=2$ ,
$\norm{T_\varrho f(x)}_{1+\varrho^{-2}} \le \norm{f(x)}_2.$

We have highlighted the $p=2$ case because it is the most useful in practice.

This result shows that as we take $\varrho$ smaller, the random variable $T_\varrho f(x)$ becomes more and more well-behaved, with tails decreasing at a rate

$\prob \{ |T_\varrho f(x)| \ge t \} \le \frac{\norm{T_\varrho f(x)}_{1+(p-1)/\varrho^2}^{1+(p-1)/\varrho^2}}{t^{1 + (p-1)/\varrho^2}} \le \frac{\norm{f(x)}_p^{1+(p-1)/\varrho^2}}{t^{1 + (p-1)/\varrho^2}}.$

The rate of tail decrease becomes faster and faster as $\varrho$ becomes closer to zero.

We will prove the Gaussian hypercontractivity at the bottom of this post. For now, we will focus on applying this result.

Multilinear Polynomials

A multilinear polynomial $f(x) = f(x_1,\ldots,x_d)$ is a multivariate polynomial in the variables $x_1,\ldots,x_d$ in which none of the variables $x_1,\ldots,x_d$ is raised to a power higher than one. So,

(5) $1+x_1x_2$

is multilinear, but

$1+x_1+x_1x_2^2$

is not multilinear (since $x_2$ is squared).

For multilinear polynomials, we have the following very powerful corollary of Gaussian hypercontractivity:

Corollary 3 (Absolute moments of a multilinear polynomial of Gaussians). Let $f$ be a multilinear polynomial of degree $k$ . (That is, at most $k$ variables $x_{i_1}\cdots x_{i_k}$ occur in any monomial of $f$ .) Then, for a Gaussian random vector $x\sim \operatorname{Normal}(0,I)$ and for all $q \ge 2$ ,
$\norm{f(x)}_q \le (q-1)^{k/2} \norm{f(x)}_2.$

Let’s prove this corollary. The first observation is that the noise operator has a particularly convenient form when applied to a multilinear polynomial. Let’s test it out on our example (5) from above. For

$f(x) = 1+x_1x_2,$

we have

$\begin{align*}T_\varrho f(x) &= \expect_{g_1,g_2 \sim \operatorname{Normal}(0,1)} \left[1+ (\varrho x_1 + \sqrt{1-\varrho^2}\cdot g_1)(\varrho x_2 + \sqrt{1-\varrho^2}\cdot g_2)\right].\\&= 1 + \expect[\varrho x_1 + \sqrt{1-\varrho^2}\cdot g_1]\expect[\varrho x_2 + \sqrt{1-\varrho^2}\cdot g_2]\\&= 1+ (\varrho x_1)(\varrho x_2) \\&= f(\varrho x).\end{align*}$

We see that the expectation applies to each variable separately, resulting in each $x_i$ replaced by $\varrho x_i$ . This trend holds in general:

Proposition 4 (noise operator on multilinear polynomials). For any multilinear polynomial $f$ , $T_\varrho f(x) = f(\varrho x)$ .

We can use Proposition 4 to obtain bounds on the $L_p$ norms of multilinear polynomials of a Gaussian random variable. Indeed, observe that

$f(x) = f(\varrho \cdot x/\varrho) = T_\varrho f(x/\varrho).$

Thus, by Gaussian hypercontractivity, we have

$\norm{f(x)}_{1+\varrho^{-2}}=\norm{T_\varrho f(x/\varrho)}_{1+\varrho^{-2}} \le \norm{f(x/\varrho)}_2.$

The final step of our argument will be to compute $\norm{f(x/\varrho)}_2$ . Write $f$ as

$f(x) = \sum_{i_1,\ldots,i_s} a_{i_1,\ldots,i_s} x_{i_1} \cdots x_{i_s}.$

Since $f$ is multilinear, $i_j\ne i_\ell$ for $j\ne \ell$ . Since $f$ is degree- $k$ , $s\le k$ . The multilinear monomials $x_{i_1}\cdots x_{i_s}$ are orthonormal with respect to the $L_2$ inner product:

$\expect[(x_{i_1}\cdots x_{i_s}) \cdot (x_{i_1'}\cdots x_{i_s'})] = \begin{cases} 0 &\text{if } \{i_1,\ldots,i_s\} \ne \{i_1',\ldots,i_{s'}\}, \\1, & \text{otherwise}.\end{cases}$

(See if you can see why!) Thus, by the Pythagorean theorem, we have

$\norm{f(x)}_2^2 = \sum_{i_1,\ldots,i_s} a_{i_1,\ldots,i_s}^2.$

Similarly, the coefficients of $f(x/\varrho)$ are $a_{i_1,\ldots,i_s} / \varrho$ . Thus,

$\norm{f(x/\varrho)}_2^2 = \sum_{i_1,\ldots,i_s} \varrho^{-2s} a_{i_1,\ldots,i_s}^2 \le \varrho^{-2k} \sum_{i_1,\ldots,i_s} a_{i_1,\ldots,i_s}^2 = \varrho^{-2k}\norm{f(x)}_2^2.$

Thus, putting all of the ingredients together, we have

$\norm{f(x)}_{1+\varrho^{-2}}=\norm{T_\varrho f(x/\varrho)}_p \le \norm{f(x/\varrho)}_2 \le \varrho^{-k} \norm{f(x)}_2.$

Setting $q = 1+\varrho^{-2}$ (equivalently $\varrho = 1/\sqrt{q-1}$ ), Corollary 3 follows.

Hanson–Wright Inequality

To see the power of the machinery we have developed, let’s prove a version of the Hanson–Wright inequality.

Theorem 5 (suboptimal Hanson–Wright). Let $A$ be a symmetric matrix with zero on its diagonal and $x\sim \operatorname{Normal}(0,I)$ be a Gaussian random vector. Then

$\prob \{|x^\top A x| \ge t \} \le \exp\left(- \frac{t}{\sqrt{2}\mathrm{e}\norm{A}_{\rm F}} \right) \quad \text{for } t\ge \sqrt{2}\mathrm{e}\norm{A}_{\rm F}.$

Hanson–Wright has all sorts of applications in computational mathematics and data science. One direct application is to obtain probabilistic error bounds for the error incurred by a stochastic trace estimation formulas.

This version of Hanson–Wright is not perfect. In particular, it does not capture the Bernstein-type tail behavior of the classical Hanson–Wright inequality

$\prob\{|x^\top Ax| \ge t\} \le 2\exp \left( -\frac{t^2}{4\norm{A}_{\rm F}^2+4\norm{A}t} \right).$

But our suboptimal Hanson–Wright inequality is still pretty good, and it requires essentially no work to prove using the hypercontractivity machinery. The hypercontractivity technique also generalizes to settings where some of the proofs of Hanson–Wright fail, such as multilinear polynomials of degree higher than two.

Let’s prove our suboptimal Hanson–Wright inequality. Set $f(x) = x^\top Ax$ . Since $A$ has zero on its diagonal, $f$ is a multilinear polynomial of degree two in the entries of $x$ . The random variable $f(x)$ is mean-zero, and a short calculation shows its $L_2$ norm is

$\norm{f(x)}_2 = \sqrt{\Var(f(x))} = \sqrt{2} \norm{A}_{\rm F}.$

Thus, by Corollary 3,

(6) $\norm{f(x)}_q \le (q-1) \norm{f(x)}_2 \le \sqrt{2} q \norm{A}_{\rm F} \quad \text{for every } q\ge 2.$

In fact, since the $L_q$ norms are monotone, (6) holds for $1\le q\le 2$ as well. Therefore, the standard tail bound for $L_q$ norms (4) gives

(7) $\prob \{|x^\top A x| \ge t \} \le \frac{\norm{f(x)}_q^q}{t^q} \le \left( \frac{\sqrt{2}q\norm{A}_{\rm F}}{t} \right)^q\quad \text{for }q\ge 1.$

Now, we must optimize the value of $q$ to obtain the sharpest possible bound. To make this optimization more convenient, introduce a parameter

$\alpha \coloneqq \frac{\sqrt{2}q\norm{A}_{\rm F}}{t}.$

In terms of the $\alpha$ parameter, the bound (7) reads

$\prob \{|x^\top A x| \ge t \} \le \exp\left(- \frac{t}{\sqrt{2}\norm{A}_{\rm F}} \alpha \ln \frac{1}{\alpha} \right) \quad \text{for } t\ge \frac{\sqrt{2}\norm{A}_{\rm F}}{\alpha}.$

The tail bound is minimized by taking $\alpha = 1/\mathrm{e}$ , yielding the claimed result

$\prob \{|x^\top A x| \ge t \} \le \exp\left(- \frac{t}{\sqrt{2}\mathrm{e}\norm{A}_{\rm F}} \right) \quad \text{for } t\ge \sqrt{2}\mathrm{e}\norm{A}_{\rm F}.$

Proof of Gaussian Hypercontractivity

Let’s prove the Gaussian hypercontractivity theorem. For simplicity, we will stick with the $d = 1$ case, but the higher-dimensional generalizations follow along similar lines. The key ingredient will be the Gaussian Jensen inequality, which made a prominent appearance in a previous blog post of mine. Here, we will only need the following version:

Theorem 6 (Gaussian Jensen). Let $b : \real^2 \to \real$ be a twice differentiable function and let $(x,\tilde{x}) \sim \operatorname{Normal}(0,\Sigma)$ be jointly Gaussian random variables with covariance matrix $\Sigma$ . Then
(8) $b(\expect[h_1(x)], \expect[h_2(\tilde{x})]) \ge \expect [b(h_1(x),h_2(\tilde{x}))]$
holds for all test functions $h_1,h_2 : \real \to \real$ if, and only if,
(9) $\Sigma \circ \nabla^2 b \quad\text{is negative semidefinite on all of $\real^2$}.$

Here, $\circ$ denotes the entrywise product of matrices and $\nabla^2 b : \real^2\to \real^{2\times 2}$ is the Hessian matrix of the function $b$ .

To me, this proof of Gaussian hypercontractivity using Gaussian Jensen (adapted from Paata Ivanishvili‘s excellent post) is amazing. First, we reformulate the Gaussian hypercontractivity property a couple of times using some functional analysis tricks. Then we do a short calculation, invoke Gaussian Jensen, and the theorem is proved, almost as if by magic.

Part 1: Tricks

Let’s begin with “tricks” part of the argument.

Trick 1. To prove Gaussian hypercontractivity holds for all functions $f$ , it is sufficient to prove for all nonnegative functions $f\ge 0$ .

Indeed, suppose Gaussian hypercontractivity holds for all nonnegative functions $f$ . Then, for any function $f$ , apply Jensen’s inequality to conclude

$\begin{align*} T_\varrho |f|(x) &= \expect_{g\sim \operatorname{Normal}(0,1)} \left| f(\varrho x+\sqrt{1-\varrho^2}\cdot g)\right| \\&\ge \left| \expect_{g\sim \operatorname{Normal}(0,1)} f(\varrho x+\sqrt{1-\varrho^2}\cdot g)\right| \\&= |T_\varrho f(x)|.\end{align*}$

Thus, assuming hypercontractivity holds for the nonnegative function $|f|$ , we have

$\norm{T_\varrho f(x)}_{1+(p-1)/\varrho^2} \le \norm{T_\varrho |f|(x)}_{1+(p-1)/\varrho^2} \le \norm{|f|(x)}_p = \norm{f}_p.$

Thus, the conclusion of the hypercontractivity theorem holds for $f$ as well, and the Trick 1 is proven.

Trick 2. To prove Gaussian hypercontractivity for all $f\ge 0$ , it is sufficient to prove the following “bilinearized” Gaussian hypercontractivity result:
$\expect[g(x) \cdot T_\varrho f(x)]\le \norm{g(x)}_{q'} \norm{f(x)}_p$
holds for all $g\ge 0$ with $\norm{g(x)}_{q'} < +\infty$ . Here, $q'=q/(q-1)$ is the Hölder conjugate to $q = 1+(p-1)/\varrho^2$ .

Indeed, this follows³ from the dual characterization of the norm of $T_\varrho f(x)$ :

$\norm{T_\varrho f(x)}_q = \sup_{\substack{\norm{g(x)} < +\infty \\ g\ge 0}} \frac{\expect[g(x) \cdot T_\varrho f(x)]}{\norm{g(x)}_{q'}}.$

Trick 2 is proven.

Trick 3. Let $x,\tilde{x}$ be a pair of standard Gaussian random variables with correlation $\rho$ . Then the bilinearized Gaussian hypercontractivity statement is equivalent to
$\expect[g(x) f(\tilde{x})]\le (\expect[(g(x)^{q'})])^{1/q'} (\expect[(f(\tilde{x})^{p})])^{1/p}.$

Indeed, define $\tilde{x} = \varrho x + \sqrt{1-\varrho^2} \cdot g$ for the random variable in the definition of the noise operator $T_\varrho$ . The random variable $\tilde{x}$ is standard Gaussian and has correlation $\varrho$ with $f$ , concluding the proof of Trick 3.

Finally, we apply a change of variables as our last trick:

Trick 4. Make the change of variables $u \coloneqq f^p$ and $v \coloneqq g^{q'}$ , yielding the final equivalent version of Gaussian hypercontractivity:
$\expect[v(x)^{1/q'} u(\tilde{x})^{1/p}]\le (\expect[v(x)])^{1/q'} (\expect[u(\tilde{x}))])^{1/p}$
for all functions $u$ and $v$ (in the appropriate spaces).

Part 2: Calculation

We recognize this fourth equivalent version of Gaussian hypercontractivity as the conclusion (8) to Gaussian Jensen with

$b(u,v) = u^{1/p}v^{1/q'}$

. Thus, to prove Gaussian hypercontractivity, we just need to check the hypothesis (9) of the Gaussian Jensen inequality (Theorem 6).

We now enter the calculation part of the proof. First, we compute the Hessian of $b$ :

$\nabla^2 b(u,v) = u^{1/p}v^{1/q'}\cdot\begin{bmatrix} - \frac{1}{pp'} u^{-2} & \frac{1}{pq'} u^{-1}v^{-1} \\ \frac{1}{pq'} u^{-1}v^{-1} & - \frac{1}{qq'} v^{-2}\end{bmatrix}.$

We have written $p'$ for the Hölder conjugate to $p$ . By Gaussian Jensen, to prove Gaussian hypercontractivity, it suffices to show that

$\nabla^2 b(u,v)\circ \twobytwo{1}{\varrho}{\varrho}{1}= u^{1/p}v^{1/q'}\cdot\begin{bmatrix} - \frac{1}{pp'} u^{-2} & \frac{\varrho}{pq'} u^{-1}v^{-1} \\ \frac{\varrho}{pq'} u^{-1}v^{-1} & - \frac{1}{qq'} v^{-2}\end{bmatrix}$

is negative semidefinite for all $u,v\ge 0$ . There are a few ways we can make our lives easier. Write this matrix as

$\nabla^2 b(u,v)\circ \twobytwo{1}{\varrho}{\varrho}{1}= u^{1/p}v^{1/q'}\cdot B^\top\begin{bmatrix} - \frac{p}{p'} & \varrho \\ \varrho & - \frac{q'}{q} \end{bmatrix}B \quad \text{for } B = \operatorname{diag}(p^{-1}u^{-1},(q')^{-1}v^{-1}).$

Scaling $A\mapsto \alpha \cdot A$ by nonnegative $\alpha$ and conjugation $A\mapsto B^\top A B$ both preserve negative semidefiniteness, so it is sufficient to prove

$H = \begin{bmatrix} - \frac{p}{p'} & \varrho \\ \varrho & - \frac{q'}{q} \end{bmatrix} \quad \text{is negative semidefinite}.$

Since the diagonal entries of $H$ are negative, at least one of $H$ ‘s eigenvalues is negative.⁴ Therefore, to prove $H$ is negative semidefinite, we can prove that its determinant (= product of its eigenvalues) is nonnegative. We compute

$\det H = \frac{pq'}{p'q} - \varrho^2 .$

Now, just plug in the values for $p'=p/(p-1)$ , $q=1+(p-1)/\varrho^2$ , $q'=q/(q-1)$ :

$\det H = \frac{pq'}{p'q} - \varrho^2 = \frac{p-1}{q-1} - \varrho^2 = \frac{p-1}{(p-1)/\varrho^2} - \varrho^2 = 0.$

Thus, $\det H \ge 0$ . We conclude $H$ is negative semidefinite, proving the Gaussian hypercontractivity theorem.

Note to Self: Norm of a Gaussian Random Vector

February 21, 2023 by Ethan N. Epperly Leave a comment

Let $g$ be a standard Gaussian vector—that is, a vector populated by independent standard normal random variables. What is the expected length $\mathbb{E} \|g\|$ of $g$ ? (Here, and throughout, $\|\cdot\|$ denotes the Euclidean norm of a vector.) The length of $g$ is the square root of the sum of $n$ independent standard normal random variables

$\|g\| = \sqrt{g_1^2 + \cdots + g_n^2},$

which is known as a $\chi$ random variable with $n$ degrees of freedom. (Not to be confused with a $\chi^\mathbf{2}$ random variable!) Its mean value is given by the rather unpleasant formula

$\mathbb{E} \|g\| = \sqrt{2} \frac{\Gamma((n+1)/2)}{\Gamma(n/2)},$

where $\Gamma(\cdot)$ is the gamma function. If you are familiar with the definition of the gamma function, the derivation of this formula is not too hard—it follows from a change of variables to $n$ -dimensional spherical coordinates.

This formula can be difficult to interpret and use. Fortunately, we have the rather nice bounds

(1) $\sqrt{n-1} < \frac{n}{\sqrt{n+1}} < \mathbb{E} \|g\| < \sqrt{n}.$

This result appears, for example, page 11 of this paper. These bounds show that, for large $n$ , $\mathbb{E} \|g\|$ is quite close to $\sqrt{n}$ . The authors of the paper remark that this inequality can be proved by induction. I had difficulty reproducing the inductive argument for myself. Fortunately, I found a different proof which I thought was very nice, so I thought I would share it here.

Our core tool will be Wendel’s inequality (see (7) in Wendel’s original paper): For $x > 0$ and $0 < s < 1$ , we have

(2) $\frac{x}{(x+s)^{1-s}} < \frac{\Gamma(x+s)}{\Gamma(x)} < x^s.$

Let us first use Wendel’s inequality to prove (1). Indeed, invoke Wendel’s inequality with $x = n/2$ and $s = 1/2$ and multiply by $\sqrt{2}$ to obtain

$\frac{\sqrt{2} \cdot n/2}{(n/2+1/2)^{1/2}} < \sqrt{2}\frac{\Gamma((n+1)/2)}{\Gamma(n/2)} = \mathbb{E}\|g\| < \sqrt{2}\cdot \sqrt{n/2},$

which simplifies directly to (1).

Now, let’s prove Wendel’s inequality (2). The key property for us will be the strict log-convexity of the gamma function: For real numbers $x,y > 0$ and $0 < s < 1$ ,

(3) $\Gamma((1-s)x + sy) < \Gamma(x)^{1-s} \Gamma(y)^s.$

We take this property as established and use it to prove Wendel’s inequality. First, use the log-convexity property (3) with $y = x+1$ to obtain

$\Gamma(x+s) = \Gamma((1-s)x + s(x+1)) < \Gamma(x)^{1-s} \Gamma(x+1)^s.$

Divide by $\Gamma(x)$ and use the property that $\Gamma(x+1)/\Gamma(x) = x$ to conclude

(4) $\frac{\Gamma(x+s)}{\Gamma(x)} < \left( \frac{\Gamma(x+1)}{\Gamma(x)} \right)^s = x^s.$

This proves the upper bound in Wendel’s inequality (2). To prove the lower bound, invoke the upper bound (4) with $x+s$ in place of $x$ and $1-s$ in place of $s$ to obtain

$\frac{\Gamma(x+1)}{\Gamma(x+s)} < (x+s)^{1-s}.$

Multiplying by $\Gamma(x+s)$ , dividing by $(x+s)^{1-s}\Gamma(x)$ , and using $\Gamma(x+1)/\Gamma(x) = x$ again yields

$\frac{\Gamma(x+s)}{\Gamma(x)} > \frac{\Gamma(x+1)}{\Gamma(x)} \cdot \frac{1}{(x+s)^{1-s}} = \frac{x}{(x+s)^{1-s}},$

finishing the proof of Wendel’s inequality.

Notes. The upper bound in (1) can be proven directly by Lyapunov’s inequality: $\mathbb{E} \|g\| \le (\mathbb{E} \|g\|^2)^{1/2} = n^{1/2}$ , where we use the fact that $\|g\|^2 = g_1^2 + \cdots + g_n^2$ is the sum of $n$ random variables with mean one. The weaker lower bound $\mathbb{E} \|g\| \ge \sqrt{n-1}$ follows from a weaker version of Wendel’s inequality, Gautschi’s inequality.

After the initial publication of this post, Sam Buchanan mentioned another proof of the lower bound $\mathbb{E} \|g\| \ge \sqrt{n-1}$ using the Gaussian Poincaré inequality. This inequality states that, for a function $f : \real^n \to \real$ ,

$\Var(f(g)) \le \mathbb{E} \| \nabla f(g)\|^2.$

To prove the lower bound, set $f(g) := \|g\|$ which has gradient $\nabla f(g) = g/\|g\|$ . Thus,

$\mathbb{E} \| \nabla f(g)\|^2 = 1 \ge \Var(f(g)) = \mathbb{E} \|g\|^2 - (\mathbb{E} \|g\|)^2 = n - (\mathbb{E} \|g\|)^2.$

Rearrange to obtain $\mathbb{E} \|g\| \ge \sqrt{n-1}$ .

Note to Self: Hanson–Wright Inequality

October 4, 2022 by Ethan N. Epperly 7 Comments

This post is part of a new series for this blog, Note to Self, where I collect together some notes about an idea related to my research. This content may be much more technical than most of the content of this blog and of much less wide interest. My hope in sharing this is that someone will find this interesting and useful for their own work.

This post is about a fundamental tool of high-dimensional probability, the Hanson–Wright inequality. The Hanson–Wright inequality is a concentration inequality for quadratic forms of random vectors—that is, expressions of the form $x^\top A x$ where $x$ is a random vector. Many statements of this inequality in the literature have an unspecified constant $c > 0$ ; our goal in this post will be to derive a fairly general version of the inequality with only explicit constants.

The core object of the Hanson–Wright inequality is a subgaussian random variable. A random variable $Y$ is subgaussian if the probability it exceeds a threshold $t$ in magnitude decays as

(1) $\mathbb{P}\{|Y|\ge t\} \le \mathrm{e}^{-t^2/a} \quad \text{for some $a>0$ and for all sufficiently large $t$.}$

The name subgaussian is appropriate as the tail probabilities of Gaussian random variables exhibit the same square-exponential decrease $\mathrm{e}^{-t^2/a}$ .

A (non-obvious) fact is that if $Y$ is subgaussian in the sense (1) and centered ( $\mathbb{E} Y = 0$ ), then $Y$ ‘s cumulant generating function (cgf)

$\xi_Y(t) := \log \mathbb{E} \exp(tY).$

is subquadratic: There is a constant $c > 0$ (independent of $Y$ and $a$ ), for which

(2) $\xi_Y(t) \le ca t^2 \quad \text{for all $t\in\mathbb{R}$}.$

Moreover,¹ a subquadratic cgf (2) also implies the subgaussian tail property (1), with a different parameter $a > 0$ .

Since properties (1) and (2) are equivalent (up to a change in the parameter $a$ ), we are free to fix a version of property (2) as our definition for a (centered) subgaussian random variable.

Definition (subgaussian random variable): A centered random variable $X$ is said to be $v$ -subgaussian or subgaussian with variance proxy $v$ if its cgf is subquadratic:
(3) $\xi_{x}(t) \le\frac{1}{2} vt^2 \quad \text{for all $t\in\mathbb{R}$.}$

For instance, a mean-zero Gaussian random variable $X$ with variance $\sigma^2$ has cgf

(4) $\xi_X(t) = \frac{1}{2} \sigma^2 t^2,$

and is thus subgaussian with variance proxy $v = \sigma^2$ equal to its variance.

Here is a statement of the Hanson–Wright inequality as it typically appears with unspecified constants (see Theorem 6.2.1 of Vershynin’s High-Dimensional Probability):

Theorem (Hanson–Wright): Let $x$ be a random vector with independent centered $v$ -subgaussian entries and let $A$ be a square matrix. Then
$\mathbb{P}\left\{\left|x^\top Ax - \mathbb{E} \left[x^\top A x\right]\right|\ge t \right\} \le 2\exp\left(- \frac{c\cdot t^2}{v^2\left\|A\right\|_{\rm F}^2 + v\left\|A\right\|t} \right),$
where $c>0$ is a constant (not depending on $v$ , $x$ , $t$ , or $A$ ).²

This type of concentration is exactly the same type as provided by Bernstein’s inequality (which I discussed in my post on concentration inequalities). In particular, for small deviations $t$ , the tail probabilities decay are subgaussian with variance proxy $\approx v^2\left\|A\right\|_{\rm F}^2$ :

$\mathbb{P}\left\{\left|x^\top Ax - \mathbb{E}\left[x^\top Ax\right]\right|\ge t \right\} \stackrel{\text{small $t$}}{\lessapprox} 2\exp\left(- \frac{c\cdot t^2}{v^2\left\|A\right\|_{\rm F}^2} \right)$

For large deviations $t$ , this switches to subexponential tail probabilities with decay rate $\approx v\|A\|$ :

$\mathbb{P}\left\{\left|x^\top Ax - \mathbb{E}\left[x^\top Ax\right]\right|\ge t \right\} \stackrel{\text{large $t$}}{\lessapprox} 2\exp\left(- \frac{c\cdot t}{v\|A\|} \right).$

Mediating these two parameter regimes are the size of the matrix $A$ , as measured by its Frobenius and spectral norms, and the degree of subgaussianity of $x$ , measured by the variance proxy $v$ .

Diagonal-Free Hanson–Wright

Now we come to a first version of the Hanson–Wright inequality with explicit constants, first for a matrix which is diagonal-free—that is, having all zeros on the diagonal. I obtained this version of the inequality myself, though I am very sure that this version of the inequality or an improvement thereof appears somewhere in the literature.

Theorem (Hanson–Wright, explicit constants, diagonal-free): Let $x$ random vector with independent centered $v$ -subguassian entries and let $A$ be a diagonal-free square matrix. Then we have the cgf bound
$\xi_{x^\top Ax}(t) \le \frac{16v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-4v\left\|A\right\|t)}.$
As a consequence, we have the concentration bound
$\mathbb{P} \{ x^\top A x \ge t \} \le \exp\left( -\frac{t^2/2}{16v^2 \left\|A\right\|_{\rm F}^2+4v\left\|A\right\|t} \right).$
Similarly, we have the lower tail
$\mathbb{P} \{ x^\top A x \le -t \} \le \exp\left( -\frac{t^2/2}{16v^2 \left\|A\right\|_{\rm F}^2+4v\left\|A\right\|t} \right)$
and the two-sided bound
$\mathbb{P} \{ |x^\top A x| \ge t \} \le 2\exp\left( -\frac{t^2/2}{16v^2 \left\|A\right\|_{\rm F}^2+4v\left\|A\right\|t} \right).$

Let us begin proving this result. Our proof will follow the same steps as Vershynin’s proof in High-Dimensional Probability (which in turn is adapted from an article by Rudelson and Vershynin), but taking care to get explicit constants. Unfortunately, proving all of the relevant tools from first principles would easily triple the length of this post, so I make frequent use of results from the literature.

We begin by the decoupling bound (Theorem 6.1.1 in Vershynin’s High-Dimensional Probability), which allows us to replace one $x$ with an independent copy $\tilde{x}$ at the cost of a factor of four:

(5) $\xi_{x^\top Ax}(t) \le \xi_{\tilde{x}^\top Ax}(4t).$

We seek to compare the bilinear form $\tilde{x}^\top Ax$ to the Gaussian bilinear form $\tilde{g}^\top Ag$ where $\tilde{g}$ and $g$ are independent standard Gaussian vectors. We begin with the following cgf bound for the Gaussian quadratic form $g^\top Ag$ :

$\xi_{g^\top Ag}(t) \le \frac{\left\|A\right\|_{\rm F}^2 \, t^2}{1-2\|A\|\, t}.$

This equation is the result of Example 2.12 in Boucheron, Lugosi, and Massart’s Concentration Inequalities. By applying this result to the Hermitian dilation of $A$ in $A$ ‘s place, one obtains a similar result for the decoupled bilinear form $\tilde{g}^\top Ag$ :

(6) $\xi_{\tilde{g}^\top Ag}(t) \le \frac{\left\|A\right\|_{\rm F}^2 \, t^2}{2(1-\|A\|\, t)}.$

We now seek to compare $\xi_{\tilde{x}^\top Ax}(t)$ to $\xi_{\tilde{g}^\top Ag}(t)$ . To do this, we first evaluate the cgf of $\tilde{x}^\top Ax$ only over the randomness in $\tilde{x}$ . Since we’re only taking an expectation over the random variable $\tilde{x}$ , we can apply the subquadratic tail condition (3) to obtain

(7) $\log \mathbb{E}_{\tilde{x}} \exp(t \, \tilde{x}^\top Ax) = \sum_{i=1}^n \log \mathbb{E}_{\tilde{x}} \exp(t \,\tilde{x}_i (Ax)_i) \le \frac{1}{2} v \left(\sum_{i=1}^n (Ax)_i^2\right)t^2 \le \frac{1}{2} v\left\|Ax\right\|^2 \, t^2.$

Now we perform a similar computation for the quantity $\tilde{g}^\top Ax$ in which $\tilde{x}$ has been replaced by the Gaussian vector $\tilde{g}$ :

$\log \mathbb{E}_{\tilde{g}} \exp((\sqrt{v} t) \, \tilde{g}^\top Ax) = \frac{1}{2} v \left\|Ax\right\|^2 \, t^2.$

We stress that this is an equality since the cgf of a Gaussian random variable is given by (4). Thus we can substitute the left-hand side of the above display into the right-hand side of (7), yielding

(8) $\log \mathbb{E}_{\tilde{x}} \exp(t \, \tilde{x}^\top Ax) \le \log \mathbb{E}_{\tilde{g}} \exp((\sqrt{v} t) \, \tilde{g}^\top Ax).$

We now perform this same trick again using the randomness in $x$ :

(9) $\log \mathbb{E}_{\tilde{g},x} \exp((\sqrt{v} t) \, \tilde{g}^\top Ax) \le \log \mathbb{E}_{\tilde{g}} \exp \left(\frac{1}{2} v^2 \left\|A^\top \tilde{g}\right\|^2t^2\right) = \log \mathbb{E}_{\tilde{g},g} \exp(v t \, \tilde{g}^\top Ag).$

Packaging up (8) and (9) gives

(10) $\xi_{\tilde{x}^\top Ax}(t)\le \xi_{\tilde{g}^\top Ag}(vt).$

Combining all these results (5), (6), and (10), we obtain

$\xi_{x^\top Ax}(t) \le \xi_{\tilde{x}^\top Ax}(4t) \le \xi_{\tilde{g}^\top Ag}(4vt) \le \frac{16v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-4v\left\|A\right\|t)}.$

This cgf implies the desired probability bound on the upper tail as a consequence of the following fact (see Boucheron, Lugosi, and Massart’s Concentration Inequalities page 29 and Exercise 2.8):

Fact (Bernstein concentration from Bernstein cgf bound): Suppose that a random variable $X$ satisfies the cgf bound $\xi_X(t) \le \tfrac{vt^2}{2(1-ct)}$ for $0 < t < 1/c$ . Then
$\mathbb{P} \left\{ X\ge t \right\} \le \exp\left( -\frac{t^2/2}{v+ct} \right).$

To get the bound on the lower tail, apply the result for the upper tail to the matrix $-A$ to obtain

$\mathbb{P} \{ x^\top A x \le -t \} = \mathbb{P} \{ x^\top (-A) x \ge t \} \le \exp\left( -\frac{t^2/2}{16v^2 \left\|A\right\|_{\rm F}^2+4v\left\|A\right\|t} \right).$

Finally, to obtain the two-sided bound, use a union bound over the upper and lower tails:

$\mathbb{P} \{ |x^\top A x| \ge t \} \le \mathbb{P} \{ x^\top A x \ge t \} + \mathbb{P} \{ x^\top A x \le -t \} \le 2\exp\left( -\frac{t^2/2}{16v^2 \left\|A\right\|_{\rm F}^2+4v\left\|A\right\|t} \right).$

General Hanson–Wright

Now, here’s a more general result (with worse constants) which permits the matrix $A$ to possess a diagonal.

Theorem (Hanson–Wright, explicit constants): Let $x$ random vector with independent centered $v$ -subguassian entries and let $A$ be an arbitrary square matrix. Then we have the cgf bound
$\xi_{x^\top Ax-\mathbb{E} [x^\top A x]}(t) \le \frac{40v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-8v\left\|A\right\|t)}.$
As a consequence, we have the concentration bound
$\mathbb{P} \{ x^\top A x-\mathbb{E} [x^\top A x] \ge t \} \le \exp\left( -\frac{t^2/2}{40v^2 \left\|A\right\|_{\rm F}^2+8v\left\|A\right\|t} \right).$
Left tail and two-sided bounds versions of this bound also hold:
$\mathbb{P} \{ x^\top A x-\mathbb{E} [x^\top A x] \le -t \} \le \exp\left( -\frac{t^2/2}{40v^2 \left\|A\right\|_{\rm F}^2+8v\left\|A\right\|t} \right)$
and
$\mathbb{P} \{ |x^\top A x-\mathbb{E} [x^\top A x]| \ge t \} \le 2\exp\left( -\frac{t^2/2}{40v^2 \left\|A\right\|_{\rm F}^2+8v\left\|A\right\|t} \right).$

Decompose the matrix $A = D+F$ into its diagonal and off-diagonal portions. For any two random variables $X$ and $Y$ (possibly highly dependent), we can bound the cgf of their sum using the following “union bound”:

(11) $\begin{align*} \xi_{X+Y}(t) &= \log \mathbb{E} \left[\exp(tX)\exp(tY)\right] \\&\le \log \left(\left[\mathbb{E} \exp(2tX)\right]^{1/2}\left[\mathbb{E}\exp(2tY)\right]^{1/2}\right) \\&=\frac{1}{2} \xi_X(2t) + \frac{1}{2}\xi_Y(2t). \end{align*}$

The two equality statements are the definition of the cumulant generating function and the inequality is Cauchy–Schwarz.

Using the “union bound”, it is sufficient to obtain bounds for the cgfs of the diagonal and off-diagonal parts $x^\top D x - \mathbb{E}[x^\top Ax]$ and $x^\top F x$ . We begin with the diagonal part. We compute

(12) $\begin{align*}\xi_{x^\top D x - \mathbb{E}[x^\top Ax]}(t) &= \log \mathbb{E} \exp\left(t \sum_{i=1}^n A_{ii}(x_i^2 - \mathbb{E}[x_i^2]) \right) \\ &= \sum_{i=1}^n \log \mathbb{E} \exp\left((t A_{ii})\cdot(x_i^2 - \mathbb{E}[x_i^2]) \right). \end{align*}$

For the cgf of $x_i^2 - \mathbb{E}[x_i^2]$ , we use the following bound, taken from Appendix B of the following paper:

$\log \mathbb{E} \exp\left(t(x_i^2 - \mathbb{E}[x_i^2]) \right) \le \frac{8v^2t^2}{1-2v|t|}.$

Substituting this result into (12) gives

(13) $\xi_{x^\top D x - \mathbb{E}[x^\top Ax]}(t) \le \sum_{i=1}^n \frac{8v^2|A_{ii}|^2t^2}{1-2v|A_{ii}|t} \le \frac{8v^2\|A\|_{\rm F}^2t^2}{1-2v\|A\|t}\quad \text{for $t>0$}.$

For the second inequality, we used the facts that $\max_i |A_{ii}| \le \|A\|$ and $\sum_i |A_{ii}|^2 \le \|A\|_{\rm F}^2$ .

We now look at the off-diagonal part $x^\top F x$ . We use a version of the decoupling bound (5) where we compare $x^\top F x$ to $\tilde{x}^\top A x$ , where we’ve both replaced one copy of $x$ with an independent copy and reinstated the diagonal of $A$ (see Remark 6.1.3 in Vershynin’s High-Dimensional Probability):

$\xi_{x^\top F x}(t) \le \xi_{\tilde{x}^\top Ax}(4t).$

We can now just repeat the rest of the argument for the diagonal-free Hanson–Wright inequality, yielding the same conclusion

(14) $\xi_{x^\top Fx}(t) \le \frac{16v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-4v\left\|A\right\|t)}.$

Combining (11), (13), and (14), we obtain

$\begin{align*}\xi_{x^\top Ax-\mathbb{E} [x^\top A x]} &\le \frac{1}{2} \xi_{x^\top D x - \mathbb{E}[x^\top Ax]}(2t) + \frac{1}{2} \xi_{x^\top Fx}(2t) \\&\le \frac{8v^2\|A\|_{\rm F}^2t^2}{2(1-4v\|A\|t)} + \frac{32v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-8v\left|A\right|t)} \\&\le \frac{8v^2\|A\|_{\rm F}^2t^2}{2(1-4v\|A\|t)} + \frac{32v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-8v\left\|A\right\|t)} \\&\le \frac{40v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-8v\left\|A\right\|t)}.\end{align*}$

As with above, this cgf bound implies the desired probability bound.