Note to Self: How Accurate is Sketch and Solve?

Suppose we’re interested in solving an overdetermined linear least-squares problem

(1)   \[x = \operatorname*{argmin}_{x \in \real^n} \norm{b-Ax}, \quad A \in \real^{m\times n}, \quad b \in \real^m. \]

Sketch-and-solve is a popular method for getting a fast, approximate solution to (1). Let S \in \real^{d\times m} be a sketching matrix for \onebytwo{A}{b} of distortion \eta (see these previous posts of mine for a refresher on sketching if needed):

(2)   \[(1-\eta)\norm{y} \le \norm{Sy} \le (1+\eta)\norm{y} \quad \text{for every } y \in \operatorname{range}(\onebytwo{A}{b}). \]
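
To make property (2) concrete, here is a small numpy experiment (my own illustration, with arbitrary dimensions and a Gaussian sketching matrix): since every y \in \operatorname{range}(\onebytwo{A}{b}) can be written as Qz for an orthonormal basis Q of that range, the smallest \eta for which (2) holds is the largest deviation of the singular values of SQ from 1.

    import numpy as np

    # Hedged example: estimate the distortion eta of a Gaussian sketch on range([A b]).
    rng = np.random.default_rng(0)
    m, n, d = 2000, 20, 400
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    Q, _ = np.linalg.qr(np.column_stack([A, b]))   # orthonormal basis for range([A b])
    S = rng.standard_normal((d, m)) / np.sqrt(d)   # Gaussian sketching matrix, E[S^T S] = I

    # (2) holds with eta equal to the largest deviation of the singular values of SQ from 1.
    sigma = np.linalg.svd(S @ Q, compute_uv=False)
    eta = max(1 - sigma.min(), sigma.max() - 1)
    print("empirical distortion eta =", eta)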

The sketch-and-solve solution to (1) is given by

(3)   \[\hat{x} = \operatorname*{argmin}_{\hat{x}\in\real^n} \norm{Sb-(SA)\hat{x}}. \]
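
In code, sketch-and-solve is only a few lines. Here is a minimal numpy sketch, assuming a Gaussian sketching matrix and arbitrary problem dimensions (any subspace embedding would work just as well):

    import numpy as np

    # Hedged example: sketch-and-solve (3) versus the exact solution of (1).
    rng = np.random.default_rng(0)
    m, n, d = 10000, 50, 1000
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    S = rng.standard_normal((d, m)) / np.sqrt(d)            # sketching matrix
    x_hat = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]    # sketch-and-solve solution (3)
    x = np.linalg.lstsq(A, b, rcond=None)[0]                # exact least-squares solution (1)

    print("residual ratio:", np.linalg.norm(b - A @ x_hat) / np.linalg.norm(b - A @ x))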

How small is the residual \norm{b-A\hat{x}} of the sketch-and-solve solution? Here’s one bound:

Proposition 1 (Sketch-and-solve, 1+\order(\eta) bound). The sketch-and-solve solution (3) satisfies the bound

    \[\norm{b-A\hat{x}} \le \frac{1+\eta}{1-\eta} \cdot \norm{b-Ax} = (1 + 2\eta + \order(\eta^2))\norm{b-Ax}.\]

Let’s prove this bound. The residual b-A\hat{x} is in the range of \onebytwo{A}{b}. Therefore, by (2),

    \[(1-\eta) \norm{b-A\hat{x}} \le \norm{S(b-A\hat{x})}.\]

Rearranging gives

    \[\norm{b-A\hat{x}} \le \frac{1}{1-\eta}\norm{S(b-A\hat{x})}.\]

By the optimality of the sketch-and-solve solution (3), \hat{x} minimizes the sketched residual norm \norm{Sb - (SA)w} over all w \in \real^n. Thus, its value can only increase if we replace \hat{x} by x:

    \[\norm{b-A\hat{x}} \le \frac{1}{1-\eta}\norm{S(b-A\hat{x})}\le \frac{1}{1-\eta}\norm{S(b-Ax)}.\]

Invoke property (2) a final time to obtain

(4)   \[\norm{b-A\hat{x}} \le \frac{1}{1-\eta}\norm{S(b-A\hat{x})}\le \frac{1}{1-\eta}\norm{S(b-Ax)}\le \frac{1+\eta}{1-\eta}\cdot \norm{b-Ax}. \]

We’ve completed the proof of Proposition 1. In fact, written out concisely, equation (4) is a single-line proof of Proposition 1.
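
As a quick sanity check, here is a hedged numerical experiment of my own (not part of the proof) comparing the observed residual ratio to the (1+\eta)/(1-\eta) bound; in runs like this one, the observed ratio tends to sit much closer to 1 than Proposition 1 suggests.

    import numpy as np

    # Hedged example: compare the observed residual ratio to the Proposition 1 bound.
    rng = np.random.default_rng(1)
    m, n, d = 5000, 30, 300
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    S = rng.standard_normal((d, m)) / np.sqrt(d)

    # Distortion eta of S on range([A b]).
    Q, _ = np.linalg.qr(np.column_stack([A, b]))
    sigma = np.linalg.svd(S @ Q, compute_uv=False)
    eta = max(1 - sigma.min(), sigma.max() - 1)

    x_hat = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    ratio = np.linalg.norm(b - A @ x_hat) / np.linalg.norm(b - A @ x)

    print("observed residual ratio:", ratio)
    print("Proposition 1 bound    :", (1 + eta) / (1 - eta))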

The conclusion of Proposition 1 appears to be that the residual norm \norm{b-A\hat{x}} may be a factor 1+\order(\eta) larger than the minimal least-squares residual \norm{b-Ax}. Interestingly, this conclusion is not sharp. In fact, the residual for sketch-and-solve can be at most a factor 1+\order(\eta^2) larger than optimal. This fact has been known at least since the work of Drineas, Mahoney, Muthukrishnan, and Sarlós (2007). The note of Larsen and Kolda (2022) provides a nice summary of Drineas et al.’s argument. My goal in this post is to simplify this argument even further, making it as elementary as possible.

Notation

To make our lives easier, we’ll begin by defining some notation. Let

(5)   \[A = UC \quad \text{for } U\in\real^{m\times n}, \: C \in \real^{n\times n} \]

be any factorization of A into a matrix U with orthonormal columns and a square nonsingular matrix C. Such a factorization can be computed using a QR or polar decomposition. Let

    \[r = b-Ax\]

denote the optimal least-squares residual. We will have occasion to scale r so that it is a unit vector

    \[\overline{r} = \frac{r}{\norm{r}}.\]

Note that the optimality condition for the least-squares problem (1) is that the residual r is orthogonal to \operatorname{range}(A). Consequently,

(6)   \[\onebytwo{U}{\overline{r}} \quad \text{is an orthonormal basis for } \operatorname{range}(\onebytwo{A}{b}). \]
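
To make these objects concrete, here is a small numpy sketch (assuming, as one arbitrary choice, a QR factorization for (5) and Gaussian test data) that builds U, r, and \overline{r} and verifies numerically that \onebytwo{U}{\overline{r}} has orthonormal columns, as claimed in (6).

    import numpy as np

    # Hedged example: construct U, r, rbar and verify (6) numerically.
    rng = np.random.default_rng(0)
    m, n = 1000, 15
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    U, C = np.linalg.qr(A)                      # A = U C with U orthonormal, C square (5)
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    r = b - A @ x                               # optimal least-squares residual
    rbar = r / np.linalg.norm(r)                # unit residual

    B = np.column_stack([U, rbar])              # the basis [U rbar] of (6)
    print("||[U rbar]^T [U rbar] - I|| =", np.linalg.norm(B.T @ B - np.eye(n + 1), 2))  # ~ 0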

Sketching the basis

As we saw in (6), the matrix \onebytwo{U}{\overline{r}} is an orthonormal basis for \operatorname{range}(\onebytwo{A}{b}). To get a sharper analysis of sketch-and-solve, we will need the following result, which shows that S\onebytwo{U}{\overline{r}} is an “almost orthonormal basis”.

Lemma 2 (Sketching the basis): The matrix S\onebytwo{U}{\overline{r}} has nearly orthonormal columns in the sense that

(7)   \[\sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) \ge 1-\eta, \quad \sigma_{\rm max}(S\onebytwo{U}{\overline{r}}) \le 1+\eta. \]

Consequently,

(8)   \[\norm{\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}} - I} \le 2\eta+\eta^2. \]

This result is a pretty direct reformulation of the sketching property (2). By the variational characterization of the minimum singular value, we have

    \[\sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) = \min_{\norm{z}=1} \norm{S\onebytwo{U}{\overline{r}}z}.\]

The vector \onebytwo{U}{\overline{r}}z is in the range of \onebytwo{A}{b}. Thus, by (2),

    \[\sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) = \min_{\norm{z}=1} \norm{S\onebytwo{U}{\overline{r}}z} \ge (1-\eta) \min_{\norm{z}=1} \norm{\onebytwo{U}{\overline{r}}z} = 1-\eta.\]

For the last equality, we used the fact that \onebytwo{U}{\overline{r}} has orthonormal columns and thus

    \[\norm{\onebytwo{U}{\overline{r}}z} = \norm{z} \quad \text{for every } z \in \real^{n+1}.\]

This proves the first inequality in (7). The second inequality follows along similar lines.

For a tall matrix F, the eigenvalues of F^\top F are equal to the squared singular values of F. Applying this result to F = S\onebytwo{U}{\overline{r}} gives

    \[(1-\eta)^2 \le \lambda_{\rm min}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}) \le \lambda_{\rm max}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}) \le (1+\eta)^2.\]

Subtracting the identity matrix shifts all the eigenvalues down by 1:

    \[-2\eta+\eta^2 \le \lambda_{\rm min}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}-I) \le \lambda_{\rm max}(\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}-I) \le 2\eta+\eta^2.\]

The spectral norm of a symmetric matrix is equal to its largest eigenvalue in absolute value. Thus

    \[\norm{\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}}-I} \le \max(2\eta+\eta^2,-(-2\eta+\eta^2)) = 2\eta+\eta^2.\]

This proves (8).
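
Here is a hedged numerical check of Lemma 2, again with a Gaussian sketching matrix as an arbitrary choice of embedding: it computes the distortion \eta of S on \operatorname{range}(\onebytwo{A}{b}) and verifies that the Gram matrix of S\onebytwo{U}{\overline{r}} is within 2\eta+\eta^2 of the identity, as in (8).

    import numpy as np

    # Hedged check of Lemma 2 with a Gaussian sketching matrix (arbitrary choice).
    rng = np.random.default_rng(0)
    m, n, d = 2000, 20, 400
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    S = rng.standard_normal((d, m)) / np.sqrt(d)

    U, _ = np.linalg.qr(A)                       # orthonormal basis for range(A)
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    rbar = (b - A @ x) / np.linalg.norm(b - A @ x)
    B = np.column_stack([U, rbar])               # [U rbar], orthonormal basis of range([A b])

    SB = S @ B
    sigma = np.linalg.svd(SB, compute_uv=False)
    eta = max(1 - sigma.min(), sigma.max() - 1)  # smallest eta for which (7) holds

    # (8): the Gram matrix of S[U rbar] is close to the identity.
    gram_err = np.linalg.norm(SB.T @ SB - np.eye(n + 1), 2)
    print("||[U rbar]^T S^T S [U rbar] - I|| =", gram_err)
    print("2*eta + eta**2                    =", 2 * eta + eta**2)   # gram_err <= this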

Sketch-and-Solve: Better Analysis

With Lemma 2 in hand, we are ready to state and prove our main result:

Theorem 3 (Sketch-and-solve, 1+\order(\eta^2) bound). The sketch-and-solve solution (3) satisfies the bound

(9)   \[\norm{b-A\hat{x}}^2 \le \left(1 + \frac{(2\eta+\eta^2)^2}{(1-\eta)^4}\right) \norm{b-Ax}^2 \le \left(1 + \frac{9\eta^2}{(1-\eta)^4}\right)\norm{b-Ax}^2. \]

As a consequence,

(10)   \[\norm{b-A\hat{x}}^2 \le (1+4\eta^2+\order(\eta^3)) \norm{b-Ax}^2 \]

and

(11)   \[\norm{b-A\hat{x}} \le (1+2\eta^2+\order(\eta^3)) \norm{b-Ax}. \]

Let’s prove Theorem 3. Begin by decomposing the sketch-and-solve residual b - A\hat{x}:

    \[b - A\hat{x} = b - Ax + Ax - A\hat{x} = r + A(\hat{x}-x).\]

The residual r is orthogonal to the range of A. Thus, by the Pythagorean theorem,

(12)   \[\norm{b-A\hat{x}}^2 = \norm{r}^2 + \norm{A(\hat{x}-x)}^2. \]

To bound \norm{A(\hat{x}-x)}^2, it will help us to have a more convenient formula for \hat{x}-x. To this end, reparametrize the sketch-and-solve least-squares problem as an optimization problem over the error e \coloneqq \hat{x} - x:

    \[\hat{x} = \operatorname*{argmin}_{\hat{x}\in\real^n} \norm{S(b - Ax)-SA(\hat{x}-x)} \implies \hat{x} - x = \operatorname*{argmin}_{e\in\real^n} \norm{Sr - SAe}.\]

Applying the normal equations to this least-squares problem (valid since SA = (SU)C has full column rank; indeed, \sigma_{\rm min}(SU) \ge 1-\eta > 0, as we will see below), we get

    \[\hat{x} - x = (A^\top S^\top SA)^{-1} A^\top S^\top Sr.\]

Now, invoke the factorization (5):

    \[\hat{x} - x = C^{-1} (U^\top S^\top SU)^{-1} U^\top S^\top Sr.\]

Thus,

    \[A(\hat{x} - x) = U(U^\top S^\top SU)^{-1} U^\top S^\top Sr.\]

Taking norms and using the definition r = \norm{r} \cdot \overline{r}, we obtain

(13)   \[\norm{A(\hat{x} - x)} \le \norm{(U^\top S^\top S U)^{-1}} \cdot \norm{U^\top S^\top S \overline{r}} \cdot \norm{r}. \]


We now work to bound the two factors in (13). The matrix SU is a submatrix of S\onebytwo{U}{\overline{r}}. Thus, by (7),

    \[\sigma_{\rm min}(SU) \ge \sigma_{\rm min}(S\onebytwo{U}{\overline{r}}) \ge 1-\eta.\]

Consequently,

    \[\norm{(U^\top S^\top S U)^{-1}} = \frac{1}{\sigma_{\rm min}^2(SU)} \le \frac{1}{(1-\eta)^2}.\]

The matrix U^\top S^\top S\overline{r} is a submatrix of \onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}} - I. (Indeed, since U^\top \overline{r} = 0, the top-right block of \onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}} - I is exactly U^\top S^\top S\overline{r}.) Since the spectral norm of a submatrix is at most the spectral norm of the full matrix, (8) gives

    \[\norm{U^\top S^\top S \overline{r}} \le \norm{\onebytwo{U}{\overline{r}}^\top S^\top S\onebytwo{U}{\overline{r}} - I}\le 2\eta + \eta^2.\]

Substituting the two previous displays into (13) yields

    \[\norm{A(\hat{x} - x)} \le \frac{2\eta+\eta^2}{(1-\eta)^2} \cdot \norm{r}.\]

Plugging this display into (12) proves (9). Equations (10) and (11) follow from (9) and a Taylor series expansion. This proves Theorem 3.
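
For completeness, here is a hedged numpy check of Theorem 3 (Gaussian sketch and dimensions are my arbitrary choices): it measures the distortion \eta, verifies the explicit formula for A(\hat{x}-x) used in the proof, and compares the squared residual ratio to the bound (9) and to the squared Proposition 1 bound.

    import numpy as np

    # Hedged check of Theorem 3 with a Gaussian sketching matrix (arbitrary choice).
    rng = np.random.default_rng(2)
    m, n, d = 5000, 30, 300
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    S = rng.standard_normal((d, m)) / np.sqrt(d)

    # Distortion eta of S on range([A b]).
    Q, _ = np.linalg.qr(np.column_stack([A, b]))
    sigma = np.linalg.svd(S @ Q, compute_uv=False)
    eta = max(1 - sigma.min(), sigma.max() - 1)

    x = np.linalg.lstsq(A, b, rcond=None)[0]
    x_hat = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

    # Check the explicit error formula A(x_hat - x) = U (U^T S^T S U)^{-1} U^T S^T S r.
    U, _ = np.linalg.qr(A)
    r = b - A @ x
    err = U @ np.linalg.solve(U.T @ (S.T @ (S @ U)), U.T @ (S.T @ (S @ r)))
    print("formula discrepancy   :", np.linalg.norm(A @ (x_hat - x) - err))

    sq_ratio = (np.linalg.norm(b - A @ x_hat) / np.linalg.norm(r)) ** 2
    print("squared residual ratio:", sq_ratio)
    print("Theorem 3 bound (9)   :", 1 + (2 * eta + eta**2) ** 2 / (1 - eta) ** 4)
    print("Prop. 1 bound, squared:", ((1 + eta) / (1 - eta)) ** 2)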

Can We Do Better?

In the limit \eta \to 0, Theorem 3 identifies the correct 1+\order(\eta^2) scaling for sketch-and-solve, consistent with numerical experiments, exact computations for Gaussian embeddings, and lower bounds for solving least-squares problems by row subselection. However, comparing results for Gaussian embeddings shows there may be room for improvement. Here is an informal statement of the Gaussian results:

Informal Theorem 4 (Gaussian sketch-and-solve): Appropriately scaled, a d\times m Gaussian matrix is an \eta-sketching matrix provided d\approx n/\eta^2. The expected squared least-squares residual is

(14)   \[\expect \norm{b - A\hat{x}}^2 \approx \left(1 + \frac{\eta^2}{(1+\eta)(1-\eta)}\right) \norm{b - Ax}^2.\]
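
Here is a rough Monte Carlo check of (14) (my own illustration; it assumes the scaling d \approx n/\eta^2 and a Gaussian embedding with independent entries of variance 1/d):

    import numpy as np

    # Hedged Monte Carlo check of (14) for a Gaussian embedding with d = n/eta^2.
    rng = np.random.default_rng(0)
    m, n, eta = 2000, 20, 0.25
    d = int(np.ceil(n / eta**2))
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    x = np.linalg.lstsq(A, b, rcond=None)[0]
    opt_sq = np.linalg.norm(b - A @ x) ** 2     # optimal squared residual

    ratios = []
    for _ in range(200):                        # average over draws of the sketch
        S = rng.standard_normal((d, m)) / np.sqrt(d)
        x_hat = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
        ratios.append(np.linalg.norm(b - A @ x_hat) ** 2 / opt_sq)

    print("mean squared-residual ratio:", np.mean(ratios))
    print("prediction (14)            :", 1 + eta**2 / ((1 + eta) * (1 - eta)))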

See this post of mine for a formal version of this result, its proof, and a discussion of the history of this result. We see that the bound (9) for general embeddings takes the form 1 + \order((1-\eta)^{-4}) in the limit \eta\to 1, whereas the bound (14) for Gaussian embeddings scales like 1 + \order((1-\eta)^{-1}). We leave it as a conjecture/open problem whether there is an improved argument for general subspace embeddings:

Conjecture 5 (Sketch-and-solve, large \eta): The sketch-and-solve solution (3) satisfies the bound

    \[\norm{b-A\hat{x}}^2 \le \left(1 + \frac{C\eta^2}{1-\eta}\right)\norm{b-Ax}^2\]

for an absolute constant C > 0.


Notes: For another proof demonstrating the correct \order(\eta^2) scaling for a different type of dimensionality reduction (leverage score sampling), check out this article by Raphael Meyer.
