Low-Rank Approximation Toolbox: The Gram Correspondence

I am delighted to share that my paper Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations, joint with Yifan Chen, Joel Tropp, and Robert Webber, has been published online in Communications on Pure and Applied Mathematics. To celebrate this occasion, I want to share one of my favorite tricks in the design of low-rank approximation algorithms, which I will call the Gram correspondence.

Projection Approximation and Nyström Approximation

When we construct a low-rank approximation to a matrix, the type of approximation we use is typically dictated by the size and properties of the matrix. For a rectangular matrix B \in \real^{M\times N}, one of the standard techniques is the randomized SVD algorithm (a short code sketch follows the steps):

  1. Generate a Gaussian random matrix \Omega \in \real^{N\times k}.
  2. Form the product Y = B\Omega.
  3. Compute an (economy-size) QR decomposition Y = QR.
  4. Evaluate the SVD Q^\top B = \tilde{U} \Sigma V^\top.
  5. Output the low-rank approximation \hat{B} = U\Sigma V^\top for U = Q\tilde{U}.
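For concreteness, here is a minimal NumPy sketch of these five steps. It is an illustration only: the function name, matrix sizes, and random seed below are placeholder choices of mine, not part of the original recipe.

    import numpy as np

    def randomized_svd(B, k, rng=np.random.default_rng(0)):
        """Minimal sketch of steps 1-5 above (illustrative, not production code)."""
        M, N = B.shape
        Omega = rng.standard_normal((N, k))      # 1. Gaussian test matrix
        Y = B @ Omega                            # 2. Y = B * Omega
        Q, _ = np.linalg.qr(Y)                   # 3. economy-size QR factorization
        U_tilde, sigma, Vt = np.linalg.svd(Q.T @ B, full_matrices=False)  # 4. SVD of Q^T B
        U = Q @ U_tilde                          # 5. assemble the left factor
        return U, sigma, Vt                      # B_hat = U @ diag(sigma) @ Vt

    # Example usage on a random 1000 x 300 matrix:
    B = np.random.default_rng(1).standard_normal((1000, 300))
    U, sigma, Vt = randomized_svd(B, k=20)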

Succinctly, the output of the randomized SVD is given by the formula \hat{B} = \Pi_{B\Omega} B, where \Pi_F denotes the orthogonal projector onto the column span of a matrix F. This motivates the general definition:

Definition (projection approximation): Given a test matrix \Omega \in \real^{N\times k}, the projection approximation to the matrix B is \hat{B} = \Pi_{B\Omega} B.

The class of projection approximations is much richer than merely the approximation constructed by the standard randomized SVD. Indeed, low-rank approximations computed by randomized subspace iteration, randomized block Krylov iteration, column-pivoted QR decompositions, etc. all fit under the umbrella of projection approximations.

Many matrices in applications have additional structure such as symmetry or sparsity, and it can be valuable to make use of low-rank approximations that take advantage of those properties. An especially important type of structure is positive semidefiniteness. For our purposes, a positive semidefinite matrix A \in \real^{N\times N} is one that is symmetric and possesses nonnegative eigenvalues, and we will abbreviate “positive semidefinite” as “psd”. Psd matrices arise in applications as covariance matrices, kernel matrices, and as discretizations of certain differential and integral operators. Further, any rectangular matrix B \in \real^{M\times N} gives rise to its psd Gram matrix A = B^\top B; we will have much more to say about Gram matrices below.

To compute a low-rank approximation to a psd matrix, the preferred format is usually the Nyström approximation:

Definition (Nyström approximation): Given a test matrix \Omega \in \real^{N\times k}, the Nyström approximation is1Throughout this post, the inverse {}^{-1} is interpreted as the Moore–Penrose pseudoinverse if its argument is not invertible.

    \[\hat{A} \coloneqq (A\Omega) (\Omega^\top A \Omega)^{-1}(A\Omega)^\top.\]

As discussed in a previous post, the general class of Nyström approximations includes many useful specializations depending on how the matrix \Omega is selected.
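As a quick illustration, here is a hedged NumPy sketch of the defining formula, with the pseudoinverse used for the core matrix as in the footnote above. A careful implementation would add shifts and orthogonalization for numerical stability (see the single-pass Nyström discussion below); the sizes here are arbitrary.

    import numpy as np

    def nystrom(A, Omega):
        """Sketch of the Nystrom approximation (A Omega)(Omega^T A Omega)^+ (A Omega)^T."""
        Y = A @ Omega
        core = np.linalg.pinv(Omega.T @ Y)       # pseudoinverse of Omega^T A Omega
        return Y @ core @ Y.T

    # Example: psd matrix A = G G^T with a Gaussian test matrix.
    rng = np.random.default_rng(0)
    G = rng.standard_normal((200, 200))
    A = G @ G.T
    A_hat = nystrom(A, rng.standard_normal((200, 20)))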

The Gram Correspondence

The Gram correspondence is a connection between projection approximation and Nyström approximation:

The Gram correspondence: Let B\in\real^{M\times N} be any rectangular matrix and consider the Gram matrix A = B^\top B. Fix a test matrix \Omega, and define the Nyström approximation \hat{A} = A\langle \Omega\rangle to A and the projection approximation \hat{B} = \Pi_{B\Omega}B to B. Then

    \[\hat{A} = \smash{\hat{B}}^\top \hat{B}.\]

That is, the Gram matrix \hat{B}^\top\hat{B} of the projection approximation \hat{B}\approx B is the Nyström approximation \hat{A} \approx A.

As we will see, this correspondence has many implications:

  1. Special cases. The correspondence contains several important facts as special cases. Examples include the equivalence between the randomized SVD and single-pass Nyström approximation and the equivalence of (partial) column-pivoted QR factorization and (partial) pivoted Cholesky decomposition.
  2. Algorithm design. Since Nyström approximations and projection approximations are closely related, one can often convert algorithms for one type of approximation into algorithms for the other. These conversions can lead one to discover new algorithms. We will provide a historical example with the discovery of randomly pivoted Cholesky.
  3. Error bounds. The Gram correspondence is immensely helpful in the analysis of algorithms, as it makes error bounds for projection approximations and Nyström approximations easily derivable from each other.

For those interested, we give a short proof of the Gram correspondence below.

Proof of Gram correspondence
The Gram correspondence is straightforward to derive, so let’s do so before moving on. Because \Pi_{B\Omega} is a projection matrix, it satisfies \Pi_{B\Omega}^2 = \Pi_{B\Omega}. Thus,

    \[\hat{B}^\top \hat{B} = B^\top \Pi_{B\Omega}^2B = B^\top \Pi_{B\Omega}B.\]

The projector \Pi_F has the formula F (F^\top F)^{-1} F^\top. Using this formula, we obtain

    \[\hat{B}^\top \hat{B} = B^\top B\Omega (\Omega^\top B^\top B \Omega)^{-1} \Omega^\top B^\top B.\]

This is precisely the formula for the Nyström approximation (B^\top B)\langle \Omega\rangle, confirming the Gram correspondence.

Aside: Gram Square Roots

Before we go forward, let me highlight a point of potential confusion and introduce some helpful terminology. When thinking about algorithms for a psd matrix A, it will often be helpful to conjure a matrix B for which A = B^\top B. Given a psd matrix A, there is always a matrix B for which A = B^\top B, but this B is not unique. Indeed, if A = B^\top B, then we have the infinite family of decompositions A = (UB)^\top (UB) generated by every matrix U with orthonormal columns. This motivates the following definition:

Definition (Gram square root): A matrix B is a Gram square root for A if A = B^\top B.

A Gram square root B for A need not be a square root in the traditional sense that B^2 = A. Indeed, a Gram square root B can be rectangular, so B^2 need not even be defined.2The Gram square root should be contrasted with a different type of square root, the matrix square root written A^{1/2}. The matrix square root A^{1/2} is square and psd, and satisfies A = (A^{1/2})^2. Moreover, it is the unique matrix with these properties. Using the Gram square root terminology, the Gram correspondence can be written more succinctly:

Gram correspondence (concise): If B is any Gram square root of A, then the projection approximation \hat{B} = \Pi_{B\Omega}B is a Gram square root of the Nyström approximation \hat{A} = A\langle \Omega \rangle.
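A quick numerical check of the correspondence, written as a sketch with arbitrary illustrative sizes, is below; both approximations are built from the same test matrix \Omega, and the two sides agree up to roundoff.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((60, 40))              # any Gram square root
    A = B.T @ B                                    # its Gram matrix
    Omega = rng.standard_normal((40, 8))           # shared test matrix

    Q, _ = np.linalg.qr(B @ Omega)
    B_hat = Q @ (Q.T @ B)                          # projection approximation Pi_{B Omega} B

    Y = A @ Omega
    A_hat = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T  # Nystrom approximation A<Omega>

    print(np.allclose(B_hat.T @ B_hat, A_hat))     # True, up to roundoff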

Examples of the Gram Correspondence

Before going further, let us see a couple of explicit examples of the Gram correspondence.

Randomized SVD and Single-Pass Nyström Approximation

The canonical example of the Gram correspondence is the equivalence of the randomized SVD algorithm and the single-pass Nyström approximation. Let B be a Gram square root for a psd matrix A. The randomized SVD of B is given by the following steps, which we saw in the introduction:

  1. Generate a Gaussian random matrix \Omega \in \real^{N\times k}.
  2. Form the product Y = B\Omega.
  3. Compute a QR decomposition Y = QR.
  4. Evaluate the SVD Q^\top B = \tilde{U} \Sigma V^\top.
  5. Output the low-rank approximation \hat{B} = U\Sigma V^\top for U = Q\tilde{U}.

Now, imagine taking the same Gaussian matrix \Omega from the randomized SVD and using it to compute the Nyström approximation \hat{A} = A\langle \Omega\rangle. Notice that this Nyström approximation can be computed in a single pass over the matrix A. Namely, use a single pass over A to compute Y=A\Omega and use the following formula:3In practice, one should be careful about the implementation of a single-pass Nyström approximation; see this paper for details.

    \[\hat{A} = Y (\Omega^\top Y)^{-1} Y^\top.\]

For this reason, we call \hat{A} the single-pass Nyström approximation.

The Gram correspondence tells us that the randomized SVD and single-pass Nyström approximation are closely related, in the sense that \hat{A} = \hat{B}^\top \hat{B}. The randomized SVD approximation \hat{B} is a Gram square root of the single-pass Nyström approximation \hat{A}.

Column-Pivoted QR Factorization and Pivoted Cholesky Decomposition

A more surprising consequence of the Gram correspondence is the connection between low-rank approximations produced by partial column-pivoted QR factorization of a rectangular matrix B and a partial pivoted Cholesky decomposition of A. Let’s begin by describing these two approximation methods.

Let’s begin with pivoted partial Cholesky decomposition. Let A \in \real^{N\times N} be a psd matrix, and initialize the zero approximation \hat{A} \gets 0. For i=1,2,\ldots,k, perform the following steps:

  1. Choose a column index 1 \le s_i \le N. These indices are referred to as pivot indices or, more simply, pivots.
  2. Update the low-rank approximation

        \[\hat{A} \gets \hat{A} + \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

  3. Update the matrix

        \[A \gets A - \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

Here, we are using MATLAB notation to index the matrix A. The output of this procedure is

    \[\hat{A} = A(:,S) A(S,S)^{-1} A(S,:),\]

where S = \{s_1,\ldots,s_k\}. This type of low-rank approximation is known as a column Nyström approximation and is a special case of the general Nyström approximation with test matrix \Omega = I(:,S) equal to a subset of columns of the identity matrix I. For an explanation of why this procedure is called a “pivoted partial Cholesky decomposition” and the relation to the usual notion of Cholesky decomposition, see this previous post of mine.
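Here is a minimal code sketch of steps 1–3. The pivot rule is not specified by the procedure above; for illustration the sketch greedily picks the largest diagonal entry of the residual, which is only one of many possible choices, and it assumes k does not exceed the rank of A so that the pivot diagonal entries stay positive.

    import numpy as np

    def pivoted_partial_cholesky(A, k):
        """Sketch of the pivoted partial Cholesky procedure (greedy diagonal pivoting)."""
        A = np.array(A, dtype=float)             # residual matrix, modified in place
        N = A.shape[0]
        A_hat = np.zeros((N, N))
        pivots = []
        for _ in range(k):
            s = int(np.argmax(np.diag(A)))       # 1. pivot rule (illustrative choice)
            update = np.outer(A[:, s], A[s, :]) / A[s, s]
            A_hat += update                      # 2. grow the approximation
            A -= update                          # 3. eliminate the pivot from the residual
            pivots.append(s)
        return A_hat, pivots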

Given the Gram correspondence, we expect that this pivoted partial Cholesky procedure for computing a Nyström approximation to a psd matrix A should have an analog for computing a projection approximation to a rectangular matrix B. This analog is given by the partial column-pivoted QR factorization, which produces a low-rank approximation as follows. Let B \in \real^{M\times N} be a rectangular matrix, and initialize the zero approximation \hat{B} \gets 0. For i=1,2,\ldots,k, perform the following steps:

  1. Choose a pivot index 1 \le s_i \le N.
  2. Update the low-rank approximation by adding the projection of B onto the selected column: \hat{B} \gets \hat{B} + B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.
  3. Update the matrix by removing the projection of B onto the selected column: B \gets B - B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.

The output of this procedure is the column projection approximation \Pi_{B(:,S)} B, which is an example of the general projection approximation with \Omega = I(:,S).4The column projection approximation is often presented in factorized form in one of two ways. First, \hat{B} can be expressed as \hat{B} = QR, where Q is a matrix with orthonormal columns and R is an upper trapezoidal matrix, up to a permutation of the rows. This factorized form is easy to compute roughly following steps 1–3 above, which explains why we call that procedure a “QR factorization”. Second, \hat{B} can be factorized as \hat{B} = B(:,S) W for a weight matrix W \in \real^{k\times N}. This type of factorization is known as an interpolative decomposition.

The pivoted partial Cholesky and QR factorizations are very traditional ways of computing a low-rank approximation in numerical linear algebra. The Gram correspondence tells us immediately that these two approaches are closely related:

Let B be a Gram square root of A. Compute a column projection approximation \hat{B} to B using a partially column-pivoted QR factorization with pivots s_1,\ldots,s_k, and compute a column Nyström approximation \hat{A} to A using a partial pivoted Cholesky decomposition with the same set of pivots s_1,\ldots,s_k. Then \hat{B}^\top \hat{B} = \hat{A}.

I find this remarkable: the equivalence of the randomized SVD and single-pass randomized Nyström approximation and the equivalence of (partial pivoted) Cholesky and QR factorizations are both consequences of the same general principle, the Gram correspondence.

Using the Gram Correspondence to Discover Algorithms

The Gram correspondence is more than just an interesting way of connecting existing types of low-rank approximations: It can be used to discover new algorithms. We can illustrate this with an example, the randomly pivoted Cholesky algorithm.

Background: The Randomly Pivoted QR Algorithm

In 2006, Deshpande, Rademacher, Vempala, and Wang discovered an algorithm that they called adaptive sampling for computing a projection approximation to a rectangular matrix B \in \real^{M\times N}. Beginning from the trivial initial approximation \hat{B} \gets 0, the algorithm proceeds as follows for i=1,2,\ldots,k:

  1. Choose a column index 1 \le s_i \le N randomly according to the squared column norm distribution:

        \[\prob\{s_i = j\} = \frac{\norm{B(:,j)}^2}{\norm{B}_{\rm F}^2} \quad \text{for } j=1,2,\ldots,N.\]

  2. Update the low-rank approximation by adding the projection of B onto the selected column: \hat{B} \gets \hat{B} + B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.
  3. Update the matrix by removing the projection of B onto the selected column: B \gets B - B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.

Given our discussion above, we recognize this as an example of partial column-pivoted QR factorization with a particular randomized procedure for selecting the pivots s_i.5Interestingly, Deshpande and coauthors did not make the connection between their approach and pivoted QR algorithms in their work. Therefore, it can also be convenient to call this algorithm randomly pivoted QR.

Randomly pivoted QR is a nice algorithm for rectangular low-rank approximation. Each step requires a full pass over the matrix and \order(MN) operations, so the full procedure requires \order(MNk) operations. This makes the cost of the algorithm similar to other methods for rectangular low-rank approximation such as the randomized SVD, but it has the advantage that it computes a column projection approximation.6There are other algorithms for computing a high-quality column projection approximation in \order(MNk) operations, such as sketchy pivoting. Variants of these approaches are compared in this recent paper. However, randomly pivoted QR is not a particularly effective algorithm for computing a low-rank approximation to a psd matrix, since—as we shall see—there are faster procedures available.
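As a sketch, the adaptive sampling / randomly pivoted QR iteration can be written as follows; the dense-matrix representation, the fixed number of steps k, and the function name are my own illustrative simplifications.

    import numpy as np

    def randomly_pivoted_qr(B, k, rng=np.random.default_rng(0)):
        """Sketch of randomly pivoted QR (adaptive sampling of Deshpande et al.)."""
        B = np.array(B, dtype=float)             # residual matrix, modified in place
        B_hat = np.zeros_like(B)
        for _ in range(k):
            sq_norms = np.sum(B**2, axis=0)      # squared column norms of the residual
            s = rng.choice(B.shape[1], p=sq_norms / sq_norms.sum())  # 1. random pivot
            col = B[:, [s]]
            update = col @ (col.T @ B) / sq_norms[s]
            B_hat += update                      # 2. add projection onto pivot column
            B -= update                          # 3. remove it from the residual
        return B_hat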

Following the Gram correspondence, we expect there should be an analog of the randomly pivoted QR algorithm for computing a low-rank approximation of a psd matrix. That algorithm, which we call randomly pivoted Cholesky, is derived from randomly pivoted QR in the following optional section:

Derivation of Randomly Pivoted Cholesky Algorithm
Let A \in \real^{N\times N} be a psd matrix. For conceptual purposes, let us also consider a Gram square root B\in \real^{M\times N} of A; this matrix B will only be a conceptual device that we will use to help us derive the appropriate algorithm, not something that will be required to run the algorithm itself. Let us now walk through the steps of the randomly pivoted QR algorithm on B, and see how they lead to a randomly pivoted Cholesky algorithm for computing a low-rank approximation to A.

Step 1 of randomly pivoted QR. Randomly pivoted QR begins by drawing a random pivot s_i according to the rule

    \[\prob\{s_i = j\} = \frac{\norm{B(:,j)}^2}{\norm{B}_{\rm F}^2} \quad \text{for } j=1,2,\ldots,N.\]

Since A = B^\top B, we may compute

    \[\norm{B(:,j)}^2 = B(:,j)^\top B(:,j) = (B^\top B)(j,j) = A(j,j).\]

The squared column norms of B are the diagonal entries of the matrix A. Similarly, \norm{B}_{\rm F}^2 = \tr(A). Therefore, we can write the probability distribution for the random pivot s_i using only the matrix A as

    \[\prob\{s_i = j\} = \frac{A(j,j)}{\tr A} \quad \text{for } j=1,2,\ldots,N.\]

Step 2 of randomly pivoted QR. The randomly pivoted QR update rule is \hat{B} \gets \hat{B} + B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2. Therefore, the update rule for \hat{A} = \hat{B}^\top\hat{B} is 

    \[\hat{A} \gets \left(\hat{B} + \frac{B(:,s_i) (B(:,s_i)^\top B)}{\norm{B(:,s_i)}^2}\right)^\top \left(\hat{B} + \frac{B(:,s_i) (B(:,s_i)^\top B)}{\norm{B(:,s_i)}^2}\right).\]

After a short derivation,7Using the relations \hat{A} = \hat{B}^\top \hat{B}, A = B^\top B, and A(s_i,s_i) = \norm{B(:,s_i)}^2 = B(:,s_i)^\top B(:,s_i), this simplifies to \hat{A} \gets \hat{A} + \frac{B^\top B(:,s_i) B(:,s_i)^\top \hat{B}}{A(s_i,s_i)} + \frac{\hat{B}^\top B(:,s_i) B(:,s_i)^\top B}{A(s_i,s_i)} + \frac{B^\top B(:,s_i) B(:,s_i)^\top B}{A(s_i,s_i)}.The matrix \hat{B} is a projection approximation and B is the residual to that approximation. Therefore, the columns of these matrices are orthogonal to each other, \hat{B}^\top B = 0, leading the second and third term to vanish. Finally, using the relation A = B^\top B again, we obtain the update rule \hat{A} \gets \hat{A} +\frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}., this simplifies to the update rule

    \[\hat{A} \gets \hat{A} +\frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

Two remarkable things have happened. First, we have obtained an update rule for \hat{A} that depends only on the psd matrices \hat{A} and A, and all occurrences of the Gram square roots \hat{B} and B have vanished. Second, this update rule is exactly the update rule for a partial Cholesky decomposition.

Step 3 of randomly pivoted QR. Using a similar derivation to step 2, we obtain an update rule for the residual matrix A:

    \[A \gets A -\frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

This completes the derivation.

Randomly Pivoted Cholesky

That derivation was a bit complicated, so let’s summarize. We began with the randomly pivoted QR algorithm for computing a projection approximation to a rectangular matrix B, and we used it to derive a randomly pivoted Cholesky algorithm for computing a column Nyström approximation to a psd matrix A. Removing the cruft of the derivation, this algorithm is very simple to state:

  1. Choose a column index 1 \le s_i \le N randomly according to the diagonal distribution:

        \[\prob\{s_i = j\} = \frac{A(j,j)}{\tr(A)} \quad \text{for } j=1,2,\ldots,N.\]

  2. Update the low-rank approximation

        \[\hat{A} \gets \hat{A} + \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

  3. Update the matrix

        \[A \gets A - \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

I find this to be remarkably cool. We started with a neat algorithm (randomly pivoted QR) for approximating rectangular matrices, and we used the Gram correspondence to derive a different randomly pivoted Cholesky algorithm for psd matrix approximation!

And randomly pivoted Cholesky allows us to pull a cool trick that we couldn’t with randomly pivoted QR. Observe that the randomly pivoted Cholesky algorithm only ever interacts with the residual matrix A through the selected pivot columns A(:,s_i) and through the diagonal entries A(j,j). Therefore, we can derive an optimized version of the randomly pivoted Cholesky algorithm that only reads (k+1)N entries of the matrix A (k columns plus the diagonal) and requires only \order(k^2N) operations! See our paper for details.
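A sketch of this optimized idea is below. It stores the approximation as a factor F with \hat{A} = FF^\top, touches only the sampled columns and the diagonal of A, and omits the numerical safeguards discussed in the paper; treat it as an illustration rather than a reference implementation.

    import numpy as np

    def randomly_pivoted_cholesky(A, k, rng=np.random.default_rng(0)):
        """Sketch of randomly pivoted Cholesky; returns F with A approx= F @ F.T."""
        N = A.shape[0]
        d = np.array(np.diag(A), dtype=float)    # diagonal of the residual
        F = np.zeros((N, k))
        for i in range(k):
            s = rng.choice(N, p=d / d.sum())     # pivot ~ diagonal distribution
            g = A[:, s] - F[:, :i] @ F[s, :i]    # column s of the current residual
            F[:, i] = g / np.sqrt(g[s])          # new factor column
            d = np.maximum(d - F[:, i]**2, 0)    # update diagonal; clip roundoff
        return F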

So we started with the randomly pivoted QR algorithm, which requires \order(kMN) operations, and we used it to derive the randomly pivoted Cholesky algorithm that runs in \order(k^2N) operations. Let’s make this concrete with some specific numbers. Setting k = 100 and M = N = 10^6, randomly pivoted QR requires roughly 10^{14} (100 trillion) operations and randomly pivoted Cholesky requires roughly 10^{10} (10 billion) operations, a factor of 10,000 smaller operation count!

Randomly pivoted Cholesky has an interesting history. It seems to have first appeared in print in the work of Musco and Woodruff (2017), who used the Gram correspondence to derive the algorithm from randomly pivoted QR. It is remarkable that it took a full 11 years after Deshpande and coauthors’ original work on randomly pivoted QR for randomly pivoted Cholesky to be discovered. Even after Musco and Woodruff’s paper, the algorithm appears to have largely been overlooked for computation in practice, and I am unaware of any paper documenting computational experiments with randomly pivoted Cholesky before our paper was initially released in 2022.8A partial exception to this statement is the work of Poulson (2020), who used randomly pivoted Cholesky to sample from determinantal point processes. Poulson’s setting is quite different from ours: his input A is always a rank-k orthogonal projection matrix, he runs for exactly k steps, and he uses the pivot set S as output. Our paper reexamined the randomly pivoted Cholesky procedure, providing computational experiments comparing it to alternatives and using it for scientific machine learning applications, and provided new error bounds.

Other Examples of Algorithm Design by the Gram Correspondence

The Gram correspondence gives a powerful tool for inventing new algorithms or showing equivalence between existing algorithms. There are many additional examples, such as block (and rejection sampling-accelerated) versions of randomly pivoted QR/Cholesky, Nyström versions of randomized block Krylov iteration, and column projection/Nyström approximations generated by determinantal point process sampling. Indeed, if you invent a new low-rank approximation algorithm, it is always worth checking: Using the Gram correspondence, can you get another algorithm for free?

Transference of Error Bounds

The Gram correspondence also allows us to transfer error bounds between projection approximations and Nyström approximations. Let’s begin with the simplest manifestation of this principle:

Transference of Error Bounds I: Let B be a Gram square root of A, and let \hat{B} and \hat{A} be the corresponding projection and Nyström approximations generated using the same test matrix \Omega. Then

    \[\norm{B - \hat{B}}_{\rm F}^2 = \tr(A - \hat{A}).\]

This shows that the Frobenius norm error of a projection approximation and the trace error of a Nyström approximation are the same.9Note that, for any Nyström approximation \hat{A} to A, the residual A - \hat{A} is psd. Therefore, the trace error \tr(A - \hat{A}) is nonnegative and satisfies \tr(A - \hat{A}) = 0 if and only if A = \hat{A}; these statements justify why \tr(A-\hat{A}) is an appropriate expression for measuring the error of the approximation A \approx \hat{A}. Using this result, we can immediately transfer error bounds for projection approximations to Nyström approximations and vice versa. For instance, in this previous post, we proved the following error bound for the rank-k randomized SVD (with a Gaussian test matrix \Omega):

    \[\expect \norm{B - \hat{B}}_{\rm F}^2 \le \min_{r\le k-2} \left(1 + \frac{r}{k-(r+1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2,\]

Here, \lowrank{B}_r denotes the best rank-r approximation to B. This result shows that the error of the rank-k randomized SVD is comparable to the best rank-r approximation for any r \le k-2. Using the transference principle, we immediately obtain a corresponding bound for rank-k Nyström approximation (with a Gaussian test matrix \Omega):10Note that in order to transfer the bound from randomized SVD to single-pass Nyström, we also needed to use the identity \norm{B - \lowrank{B}_r}_{\rm F}^2 = \tr(A - \lowrank{A}_r). This identity, too, follows by the transference principle, since \lowrank{B}_r and \lowrank{A}_r are, themselves, projection approximations and Nyström approximations corresponding to choosing \Omega to be the dominant r right singular vectors of B!

    \[\expect \tr(A - \hat{A}) \le \min_{r\le k-2} \left(1 + \frac{r}{k-(r+1)} \right) \tr(A - \lowrank{A}_r).\]

With no additional work, we have converted an error bound for the randomized SVD to an error bound for single-pass randomized Nyström approximation!

For those interested, one can extend the transference of error bounds to general unitarily invariant norms. See the following optional section for details:

Transference of Error Bounds for General Unitarily Invariant Norms
We begin by recalling the notions of (quadratic) unitarily invariant norms; see this section of a previous post for a longer refresher. A norm \norm{\cdot}_{\rm UI} is unitarily invariant if

    \[\norm{UCV}_{\rm UI} = \norm{C}_{\rm UI}\]

for every matrix C and orthogonal matrices U and V.11As the name “unitarily invariant” suggests, if C is complex-valued we require this condition to hold for the larger class of all unitary matrices U and V. Examples of unitarily invariant norms include the nuclear norm \norm{C}_* = \sum_i \sigma_i(C), the Frobenius norm \norm{C}_{\rm F}^2 = \sum_i \sigma_i^2(C), and the spectral norm \norm{C} = \max_i \sigma_i(C). (It is not coincidental that all these unitarily invariant norms can be expressed only in terms of the singular values \sigma_i(C) of the matrix C.) Associated to every unitarily invariant norm is its associated quadratic unitarily invariant norm, defined as

    \[\norm{C}_{\rm Q}^2 = \norm{C^\top C}_{\rm UI} = \norm{CC^\top}_{\rm UI}.\]

For example,

  • The quadratic unitarily invariant norm associated to the nuclear norm \norm{\cdot}_{\rm UI} = \norm{\cdot}_* is the Frobenius norm \norm{\cdot}_{\rm Q} = \norm{\cdot}_{\rm F}.
  • The spectral norm is its own associated quadratic unitarily invariant norm \norm{\cdot}_{\rm UI} = \norm{\cdot} = \norm{\cdot}_{\rm Q}.

This leads us to the more general version of the transference principle:

Transference of Error Bounds II: Let B be a Gram square root of A, and let \hat{B} and \hat{A} be the corresponding projection and Nyström approximations generated using the same test matrix \Omega. Then

    \[\norm{B - \hat{B}}_{\rm Q}^2 = \norm{A - \hat{A}}_{\rm UI}\]

for every unitarily invariant norm \norm{\cdot}_{\rm UI} and its associated quadratic unitarily invariant norm \norm{\cdot}_{\rm Q}.

This version of the principle is more general, since the nuclear norm of a psd matrix is the trace, whence

    \[\norm{B - \hat{B}}_{\rm F}^2 = \norm{A - \hat{A}}_* = \tr(A - \hat{A}).\]

We now also have more examples. For instance, for the spectral norm, we have

    \[\norm{B - \hat{B}}^2 = \norm{A - \hat{A}}.\]

Let us quickly prove the transference of error principle. Write the error B - \hat{B} = B - \Pi_{B\Omega}B = (I-\Pi_{B\Omega})B.  Thus,

    \[\norm{B - \hat{B}}_{\rm Q}^2 = \norm{(I-\Pi_{B\Omega})B}_{\rm Q}^2 = \norm{B^\top (I-\Pi_{B\Omega})^2B}_{\rm UI}.\]

The orthogonal projections \Pi_{B\Omega} and I-\Pi_{B\Omega} are equal to their own square, so

    \begin{align*}\norm{B - \hat{B}}_{\rm Q}^2 &= \norm{B^\top (I-\Pi_{B\Omega})B}_{\rm UI} = \norm{B^\top B - B^\top\Pi_{B\Omega}B}_{\rm UI} \\ &= \norm{B^\top B - (\Pi_{B\Omega}B)^\top(\Pi_{B\Omega}B)}_{\rm UI}.\end{align*}

Finally, write \hat{B} = \Pi_{B\Omega}B and use the Gram correspondence A = B^\top B, \hat{A} = \hat{B}^\top\hat{B} to conclude

    \[\norm{B - \hat{B}}_{\rm Q}^2 = \norm{A - \hat{A}}_{\rm UI},\]

as desired.

Conclusion

This has been a long post about a simple idea. Many of the low-rank approximation algorithms in the literature output a special type of approximation, either a projection approximation or a Nyström approximation. As this post has shown, these two types of approximation are equivalent, with algorithms and error bounds for one type of approximation transferring immediately to the other format. This equivalence is a powerful tool for the algorithm designer, allowing us to discover new algorithms from old ones, like randomly pivoted Cholesky from randomly pivoted QR.

Low-Rank Approximation Toolbox: Analysis of the Randomized SVD

In the previous post, we looked at the randomized SVD. In this post, we continue this discussion by looking at the analysis of the randomized SVD. Our approach is adapted from a new analysis of the randomized SVD by Joel A. Tropp and Robert J. Webber.

There are many types of analysis one can do for the randomized SVD. For instance, letting B be a matrix and X the rank-k output of the randomized SVD, natural questions include:

  • What is the expected error \mathbb{E} \norm{X-B} measured in some norm \norm{\cdot}? What about expected squared error \mathbb{E} \left\|X-B\right\|^2? How do these answers change with different norms?
  • How large can the randomized SVD error \norm{X - B} get, except with some small failure probability \delta?
  • How close is the randomized SVD truncated to rank \rho compared to the best rank-\rho approximation to B?
  • How close are the singular values and vectors of the randomized SVD approximation X compared to those of B?

Implicitly and explicitly, the analysis of Tropp and Webber provides satisfactory answers to a number of these questions. In the interest of simplicity, we shall focus our presentation on just one of these questions, proving the following result:

    \[\mathbb{E} \left\|B-X\right\|_{\rm F}^2 \le \left( 1 + \frac{r}{k-(r+1)}\right) \left\|B - \lowrank{B}_r \right\|_{\rm F}^2 \quad \text{for every $r \le k-2$}.\]

Here, \left\|\cdot\right\|_{\rm F} is the Frobenius norm. We encourage the interested reader to check out Tropp and Webber’s paper to see the methodology we summarize here used to prove numerous facts about the randomized SVD and its extensions using subspace iteration and block Krylov iteration.

Projection Formula

Let us recall the randomized SVD as we presented it in last post:

  1. Collect information. Form Y = B\Omega where \Omega is an appropriately chosen n\times k random matrix.
  2. Orthogonalize. Compute an orthonormal basis Q for the column space of Y by, e.g., thin QR factorization.
  3. Project. Form C \coloneqq Q^*B, where {}^* denotes the conjugate transpose.

The randomized SVD is the approximation

    \[B\approx X \coloneqq QC.\]

It is easy to upgrade X = QC into a compact SVD form X = U\Sigma V^*, as we did in steps 4 and 5 of the previous post.

For the analysis of the randomized SVD, it is helpful to notice that the approximation X takes the form

    \[X = QC = QQ^*B = \Pi_{B\Omega} B.\]

Here, \Pi_{B\Omega} denotes the orthogonal projection onto the column space of the matrix B\Omega. We call X = \Pi_{B\Omega} B the projection formula for the randomized SVD.1This formula bears a good deal of similarity to the projection formula for the Nyström approximation. This is not a coincidence.

Aside: Quadratic Unitarily Invariant Norms

To state our error bounds in the greatest generality, we adopt the language of quadratic unitarily invariant norms. Recall that a square matrix U is unitary if

    \[U^*U = UU^* = I.\]

You may recall from my post on Nyström approximation that a norm \left\|\cdot\right\|_{\rm UI} on matrices is unitarily invariant if

    \[\left\|UBV\right\|_{\rm UI} = \left\|B\right\|_{\rm UI} \quad \text{for all unitary matrices $U$, $V$ and any matrix $B$}.\]

A unitarily invariant norm \left\|\cdot\right\|_{\rm Q} is said to be quadratic if there exists another unitarily invariant norm \left\|\cdot\right\|_{\rm UI} such that

(1)   \[\left\|B\right\|_{\rm Q}^2 = \left\|B^*B\right\|_{\rm UI} = \left\|BB^*\right\|_{\rm UI} \quad \text{for every matrix $B$.} \]

Many examples of quadratic unitarily invariant norms are found among the Schatten p-norms, defined as

    \[\left\|B\right\|_{S_p}^p \coloneqq \sum_i \sigma_i(B)^p.\]

Here, \sigma_1(B) \ge \sigma_2(B) \ge \cdots denote the decreasingly ordered singular values of B. The Schatten 2-norm \left\|\cdot\right\|_{S_2} is the Frobenius norm of a matrix. The spectral norm (i.e., operator 2-norm) is the Schatten \infty-norm, defined to be

    \[\norm{B} = \left\|B\right\|_{S_\infty} \coloneqq \max_i \sigma_i(B) = \sigma_1(B).\]

The Schatten p-norms are unitarily invariant norms for every 1\le p \le \infty. However, the Schatten p-norms are quadratic unitarily invariant norms only for 2\le p \le \infty since

    \[\left\|B\right\|_{S_p}^2 = \left\|B^*B\right\|_{S_{p/2}}.\]

For the remainder of this post, we let \left\|\cdot\right\|_{\rm Q} and \left\|\cdot\right\|_{\rm UI} be a quadratic unitarily invariant norm pair satisfying (1).
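As a small illustration of these definitions, the Schatten p-norm can be computed directly from the singular values; the helper below is a hedged sketch, and its name and interface are mine rather than the post's.

    import numpy as np

    def schatten_norm(B, p):
        """Schatten p-norm from singular values; p = np.inf gives the spectral norm."""
        s = np.linalg.svd(B, compute_uv=False)
        return s.max() if np.isinf(p) else (s**p).sum() ** (1.0 / p)

    # The Schatten 2-norm is the Frobenius norm:
    B = np.arange(6.0).reshape(2, 3)
    print(np.isclose(schatten_norm(B, 2), np.linalg.norm(B, 'fro')))  # True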

Error Decomposition

The starting point of our analysis is the following decomposition of the error of the randomized SVD:

(2)   \[\left\|B - X\right\|_{\rm Q}^2 \le \left\|B - \lowrank{B}_r\right\|_{\rm Q}^2 + \left\|\lowrank{B}_r - \Pi_{B\Omega} \lowrank{B}_r\right\|_{\rm Q}^2. \]

Recall that we have defined \lowrank{B}_r to be an optimal rank-r approximation to B:

    \[\left\|B - \lowrank{B}_r\right\|_{\rm Q} \le \left\|B - C\right\|_{\rm Q} \quad \text{for any rank-$r$ matrix $C$}.\]

Thus, the error decomposition says that the excess error is bounded by

    \[\left\|B - X\right\|_{\rm Q}^2 - \left\|B - \lowrank{B}_r\right\|_{\rm Q}^2 \le \left\|\lowrank{B}_r - \Pi_{B\Omega} \lowrank{B}_r\right\|_{\rm Q}^2.\]

Using the projection formula, we can prove the error decomposition (2) directly:

    \begin{align*}\left\|B - X\right\|_{\rm Q}^2 &= \left\|B - \Pi_{B\Omega} B \right\|_{\rm Q}^2\\&= \norm{(I-\Pi_{B\Omega})BB^*(I-\Pi_{B\Omega})}_{\rm UI}  \\&\le \norm{(I-\Pi_{B\Omega})(BB^* - \lowrank{B}_r\lowrank{B}_r^*)(I-\Pi_{B\Omega})}_{\rm UI} \\&\qquad + \norm{(I-\Pi_{B\Omega})\lowrank{B}_r\lowrank{B}_r^*(I-\Pi_{B\Omega})}_{\rm UI}\\&\le \left\|BB^* - \lowrank{B}_r\lowrank{B}_r^*\right\|_{\rm UI} + \left\|\lowrank{B}_r - \Pi_{B\Omega} \lowrank{B}_r\right\|_{\rm Q}^2 \\&\le \left\|B - \lowrank{B}_r \right\|_{\rm Q}^2 + \left\|\lowrank{B}_r- \Pi_{B\Omega} \lowrank{B}_r\right\|_{\rm Q}^2.\end{align*}

The first line is the projection formula, the second is the relation between \norm{\cdot}_{\rm Q} and \norm{\cdot}_{\rm UI}, and the third is the triangle inequality. For the fifth line, we use the fact that I - \Pi_{B\Omega} is an orthoprojector and thus has spectral norm \norm{I - \Pi_{B\Omega}} \le 1, together with the fact that

    \[\left\|BCD\right\|_{\rm UI} \le \norm{B} \cdot \left\|C\right\|_{\rm UI}\cdot \norm{D} \quad \text{for all matrices $B,C,D$.}\]

For the final inequality, we used the commutation rule

    \[(B - \lowrank{B}_r)(B - \lowrank{B}_r)^* = BB^* - \lowrank{B}_r\lowrank{B}_r^*.\]

Bounding the Error

In this section, we shall continue to refine the error decomposition (2). Consider a thin SVD of B, partitioned as

    \[B = \onebytwo{U_1}{U_2} \twobytwo{\Sigma_1}{0}{0}{\Sigma_2}\onebytwo{V_1}{V_2}^*\]

where

  • \onebytwo{U_1}{U_2} and \onebytwo{V_1}{V_2} both have orthonormal columns,
  • \Sigma_1 and \Sigma_2 are square diagonal matrices with nonnegative entries,
  • U_1 and V_1 have r columns, and
  • \Sigma_1 is an r\times r matrix.

Under this notation, the best rank-r approximation is \lowrank{B}_r = U_1\Sigma_1V_1^*. Define

    \[\twobyone{\Omega_1}{\Omega_2} \coloneqq \twobyone{V_1^*}{V_2^*} \Omega.\]

We assume throughout that \Omega_1 is full-rank (i.e., \rank \Omega_1 = r).

Applying the error decomposition (2), we get

(3)   \begin{align*}\left\|B - X\right\|_{\rm Q}^2&\le \norm{ \onebytwo{U_1}{U_2} \twobytwo{0}{0}{0}{\Sigma_2}\onebytwo{V_1}{V_2}^*}_{\rm Q}^2 + \norm{(I-\Pi_{B\Omega})U_1\Sigma_1V_1^*}_{\rm Q}^2 \\&= \left\| \Sigma_2 \right\|_{\rm Q}^2 + \norm{(I-\Pi_{B\Omega})U_1\Sigma_1}_{\rm Q}^2. \end{align*}

For the second line, we used the unitary invariance of \left\|\cdot\right\|_{\rm Q}. Observe that the column space of B\Omega\Omega_1^\dagger is a subset of the column space of B\Omega, so

(4)   \[\norm{(I-\Pi_{B\Omega})U_1\Sigma_1}_{\rm Q}^2 \le \norm{(I-\Pi_{B\Omega\Omega_1^\dagger})U_1\Sigma_1}_{\rm Q}^2 = \norm{\Sigma_1U_1^*(I-\Pi_{B\Omega\Omega_1^\dagger})U_1\Sigma_1}_{\rm UI}. \]

Let’s work on simplifying the expression

    \[\Sigma_1U_1^*(I-\Pi_{B\Omega\Omega_1^\dagger})U_1\Sigma_1.\]

First, observe that

    \[B\Omega\Omega_1^\dagger = B\onebytwo{V_1}{V_2}\twobyone{\Omega_1}{\Omega_2}\Omega_1^\dagger = \onebytwo{U_1}{U_2} \twobytwo{\Sigma_1}{0}{0}{\Sigma_2} \twobyone{I}{\Omega_2\Omega_1^\dagger} = U_1\Sigma_1 + U_2\Sigma_2\Omega_2\Omega_1^\dagger.\]

Thus, the projector \Pi_{B\Omega\Omega_1^\dagger} takes the form

    \begin{align*}\Pi_{B\Omega\Omega_1^\dagger} &= (B\Omega\Omega_1^\dagger) \left[(B\Omega\Omega_1^\dagger)^*(B\Omega\Omega_1^\dagger)\right]^\dagger (B\Omega\Omega_1^\dagger)^* \\&= (U_1\Sigma_1 + U_2\Sigma_2\Omega_2\Omega_1^\dagger) \left[\Sigma_1^2 + (\Sigma_2\Omega_2\Omega_1^\dagger)^*(\Sigma_2\Omega_2\Omega_1^\dagger)\right]^\dagger (U_1\Sigma_1 + U_2\Sigma_2\Omega_2\Omega_1^\dagger)^*.\end{align*}

Here, {}^\dagger denotes the Moore–Penrose pseudoinverse, which reduces to the ordinary matrix inverse for a square nonsingular matrix. Thus,

(5)   \[\Sigma_1U_1^*(I-\Pi_{B\Omega\Omega_1^\dagger})U_1\Sigma_1 = \Sigma_1^2 - \Sigma_1^2 \left[\Sigma_1^2 + (\Sigma_2\Omega_2\Omega_1^\dagger)^*(\Sigma_2\Omega_2\Omega_1^\dagger)\right]^\dagger\Sigma_1^2. \]

Remarkably, this seemingly convoluted combination of matrices actually is a well-studied operation in matrix theory, the parallel sum. The parallel sum of positive semidefinite matrices A and H is defined as

(6)   \[A : H \coloneqq A - A(A+H)^\dagger A. \]

We will have much more to say about the parallel sum soon. For now, we use this notation to rewrite (5) as

    \[\Sigma_1U_1^*(I-\Pi_{B\Omega\Omega_1^\dagger})U_1\Sigma_1 = \Sigma_1^2 : [(\Sigma_2\Omega_2\Omega_1^\dagger)^*(\Sigma_2\Omega_2\Omega_1^\dagger)].\]

Plugging this back into (4) and then (3), we obtain the error bound

(7)   \[\left\|B - X\right\|_{\rm Q}^2\le \left\| \Sigma_2 \right\|_{\rm Q}^2 + \left\|\Sigma_1^2 : [(\Sigma_2\Omega_2\Omega_1^\dagger)^*(\Sigma_2\Omega_2\Omega_1^\dagger)]\right\|_{\rm UI}. \]

Simplifying this expression further will require more knowledge about parallel sums, which we shall discuss now.

Aside: Parallel Sums

Let us put aside the randomized SVD for a moment and digress to the seemingly unrelated topic of electrical circuits. Suppose we have a battery of voltage v and we connect the ends using a wire of resistance a. The current \mathrm{curr}_a is given by Ohm’s law

    \[v = a \cdot \mathrm{curr}_a.\]

Similarly, if the wire is replaced by a different wire with resistance h, the current \mathrm{curr}_h is then

    \[v = h \cdot \mathrm{curr}_h.\]

Now comes the interesting question. What if we connect the battery by both wires (resistances a and h) simultaneously in parallel, so that current can flow through either wire? Since wires of resistances a and h still have voltage v, Ohm’s law still applies and thus the total current is

    \[\mathrm{curr}_{\rm total} = \mathrm{curr}_a + \mathrm{curr}_h = \frac{v}{a} + \frac{v}{h} = v \left(a^{-1}+h^{-1}\right).\]

The net effect of connecting resisting wires of resistances a and h is the same as a single resisting wire of resistance

(8)   \[a:h \coloneqq \frac{v}{\mathrm{curr}_{\rm total}} = \left( a^{-1}+h^{-1}\right)^{-1}. \]

We call the operation a \mathbin{:} h the parallel sum of a and h because it describes how resistances combine when connected in parallel.

The parallel sum operation : can be extended to all nonnegative numbers a and h by continuity:

    \[a:h \coloneqq \lim_{b\downarrow a, \: k\downarrow h} b:k = \begin{cases}\left( a^{-1}+h^{-1}\right)^{-1}, & a,h > 0 ,\\0, & \textrm{otherwise}.\end{cases}\]

Electrically, this definition states that one obtains a short circuit (resistance 0) if either of the wires carries zero resistance.

Parallel sums of matrices were introduced by William N. Anderson, Jr. and Richard J. Duffin as a generalization of the parallel sum operation from electrical circuits. There are several equivalent definitions. The natural definition (8) applies for positive definite matrices

    \[A:H = (A^{-1}+H^{-1})^{-1} \quad \text{for $A,H$ positive definite.}\]

This definition can be extended to positive semidefinite matrices by continuity. It can be useful to have an explicit definition which applies even to positive semidefinite matrices. To do so, observe that the (scalar) parallel sum satisfies the identity

    \[a:h = \frac{1}{a^{-1}+h^{-1}} = \frac{ah}{a+h} = \frac{a(a+h)-a^2}{a+h} = a - a(a+h)^{-1}a.\]

To extend to matrices, we capitalize the letters and use the Moore–Penrose pseudoinverse in place of the inverse:

    \[A : H = A - A(A+H)^\dagger A.\]

This is the definition (6) of the parallel sum that we gave above and that arose naturally in our randomized SVD error bound.
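A quick numerical check (an illustration, not from the original text) that, for positive definite inputs, the pseudoinverse formula above matches the inverse formula (8); the random positive definite matrices are placeholder examples.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_pd(n):
        G = rng.standard_normal((n, n))
        return G @ G.T + n * np.eye(n)           # comfortably positive definite

    A, H = random_pd(5), random_pd(5)
    psum_pinv = A - A @ np.linalg.pinv(A + H) @ A                   # definition (6)
    psum_inv = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(H))   # (A^-1 + H^-1)^-1
    print(np.allclose(psum_pinv, psum_inv))      # True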

The parallel sum enjoys a number of beautiful properties, some of which we list here:

  1. Symmetry. A:H = H:A.
  2. Simplified formula. A:H = (A^{-1}+H^{-1})^{-1} for positive definite matrices.
  3. Bounds. A:H \preceq A and A:H \preceq H.
  4. Monotone. A\mapsto (A:H) is monotone in the Loewner order.
  5. Concavity. The map (A,H) \mapsto A:H is (jointly) concave (with respect to the Loewner order).
  6. Conjugation. For any square C, C^*(A:H)C = (C^*AC):(C^*HC).
  7. Trace. \tr(A:H) \le (\tr A):(\tr H).2In fact, this property holds with the trace replaced by any positive linear map. That is, if \Phi is a linear map from square matrices to square matrices satisfying \Phi(A) \succeq 0 if A\succeq 0, then \Phi(A:H) \preceq \Phi(A):\Phi(H).
  8. Unitarily invariant norms. \left\|A:H\right\|_{\rm UI} \le \left\|A\right\|_{\rm UI} : \left\|H\right\|_{\rm UI}.

For completionists, we include a proof of the last property for reference.

Proof of Property 8
Let \left\|\cdot\right\|_{\rm UI}' be the unitarily invariant norm dual to \left\|\cdot\right\|_{\rm UI}.3Dual with respect to the Frobenius inner product, that is. By duality, we have

    \begin{align*}\left\|A:H\right\|_{\rm UI} &= \max_{M\succeq 0,\: \left\|M\right\|_{\rm UI}'\le 1} \tr((A:H)M) \\&= \max_{M\succeq 0,\: \left\|M\right\|_{\rm UI}'\le 1} \tr(M^{1/2}(A:H)M^{1/2})\\&= \max_{M\succeq 0,\: \left\|M\right\|_{\rm UI}'\le 1} \tr((M^{1/2}AM^{1/2}):(M^{1/2}HM^{1/2}))  \\&\le \max_{M\succeq 0,\: \left\|M\right\|_{\rm UI}'\le 1} \left[\tr(M^{1/2}AM^{1/2}):\tr(M^{1/2}HM^{1/2}) \right] \\&\le \left[\max_{M\succeq 0,\: \left\|M\right\|_{\rm UI}'\le 1} \tr(M^{1/2}AM^{1/2})\right]:\left[\max_{M\succeq 0,\: \left\|M\right\|_{\rm UI}'\le 1} \tr(M^{1/2}HM^{1/2})\right]\\&= \left\|A\right\|_{\rm UI} : \left\|H\right\|_{\rm UI}.\end{align*}


The first line is duality, the second holds because M\succeq 0, the third line is property 6, the fourth line is property 7, the fifth line is property 4, and the sixth line is duality.

Nearing the Finish: Enter the Randomness

Equipped with knowledge about parallel sums, we are now prepared to return to bounding the error of the randomized SVD. Apply property 8 of the parallel sum to the randomized SVD bound (7) to obtain

    \begin{align*}\left\|B - X\right\|_{\rm Q}^2&\le \left\| \Sigma_2 \right\|_{\rm Q}^2 + \left\|\Sigma_1^2 : [(\Sigma_2\Omega_2\Omega_1^\dagger)^*(\Sigma_2\Omega_2\Omega_1^\dagger)]\right\|_{\rm UI} \\&\le \left\| \Sigma_2 \right\|_{\rm Q}^2 + \left\|\Sigma_1^2\right\|_{\rm UI} : \left\|[(\Sigma_2\Omega_2\Omega_1^\dagger)^*(\Sigma_2\Omega_2\Omega_1^\dagger)]\right\|_{\rm UI} \\&= \left\| \Sigma_2 \right\|_{\rm Q}^2 + \left\|\Sigma_1\right\|_{\rm Q}^2 : \left\|\Sigma_2\Omega_2\Omega_1^\dagger\right\|^2_{\rm Q}.\end{align*}

This bound is totally deterministic: It holds for any choice of test matrix \Omega, random or otherwise. We can use this bound to analyze the randomized SVD in many different ways for different kinds of random (or non-random) matrices \Omega. Tropp and Webber provide a number of examples instantiating this general bound in different ways.

We shall present only the simplest error bound for randomized SVD. We make the following assumptions:

  • The test matrix \Omega is standard Gaussian.4By which, we mean the entries of \Omega are independent and standard normal.
  • The norm \left\|\cdot\right\|_{\rm Q} is the Frobenius norm \left\|\cdot\right\|_{\rm F}.
  • The randomized SVD rank k satisfies k \ge r+2 (equivalently, r \le k-2).

Under these assumptions and using the property a : h \le h, we obtain the following expression for the expected error of the randomized SVD

(9)   \[\mathbb{E} \left\|B - X\right\|_{\rm F}^2\le \left\| B - \lowrank{B}_r \right\|_{\rm F}^2 + \mathbb{E}\left\|\Sigma_2\Omega_2\Omega_1^\dagger\right\|^2_{\rm F}. \]

All that is left is to compute the expectation of \left\|\Sigma_2\Omega_2\Omega_1^\dagger\right\|_{\rm F}^2. Remarkably, this can be done in closed form.

Aside: Expectation of Gaussian–Inverse Gaussian Matrix Product

The final ingredient in our randomized SVD analysis is the following computation involving Gaussian random matrices. Let G and \Gamma be independent standard Gaussian random matrices with \Gamma of size r\times k, k\ge r+2. Then

    \[\mathbb{E} \left\|SG\Gamma^\dagger R\right\|_{\rm F}^2 = \frac{1}{k-(r+1)}\left\|S\right\|_{\rm F}^2\left\|R\right\|_{\rm F}^2.\]

For the probability lovers, we will now go on to prove this. For those who are less probabilistically inclined, this result can be treated as a black box.

Proof of Random Matrix Formula
To prove this, first consider a simpler case where \Gamma^\dagger does not appear:

    \[\mathbb{E} \left\|SGW\right\|_{\rm F}^2 = \mathbb{E} \sum_{ij} \left(\sum_{k\ell} s_{ik}g_{k\ell}w_{\ell j} \right)^2 = \mathbb{E} \sum_{ijk\ell} s_{ik}^2 w_{\ell j}^2 = \left\|S\right\|_{\rm F}^2\left\|W\right\|_{\rm F}^2.\]



Here, we used the fact that the entries g_{k\ell} are independent, mean-zero, and variance-one. Thus, applying this result conditionally on \Gamma with W = \Gamma^\dagger R, we get

(10)   \[\mathbb{E} \left\|SG\Gamma^\dagger R\right\|_{\rm F}^2 =\mathbb{E}\left[ \mathbb{E}\left[ \left\|SG\Gamma^\dagger R\right\|_{\rm F}^2 \,\middle|\, \Gamma\right]\right] =\left\|S\right\|_{\rm F}^2\cdot \mathbb{E} \left\|\Gamma^\dagger R\right\|_{\rm F}^2. \]



To compute \mathbb{E} \left\|\Gamma^\dagger R\right\|_{\rm F}^2, we rewrite using the trace

    \[\mathbb{E} \left\|\Gamma^\dagger R\right\|_{\rm F}^2 = \mathbb{E} \tr \left(\Gamma^\dagger RR^* \Gamma^{*\dagger} \right)= \tr \left(RR^* \mathbb{E}[\Gamma^{*\dagger}\Gamma^\dagger] \right) = \tr \left(RR^* \mathbb{E}[(\Gamma\Gamma^*)^{-1}] \right).\]



The matrix (\Gamma\Gamma^*)^{-1} is known as an inverse-Wishart matrix and is a well-studied random matrix model. In particular, its expectation is known to be (k-(r+1))^{-1}I. Thus, we obtain

    \[\mathbb{E} \left\|\Gamma^\dagger R\right\|_{\rm F}^2 = \tr \left(RR^* \mathbb{E}[(\Gamma\Gamma^*)^{-1}] \right) = \frac{\tr \left(RR^* \right)}{k-(r+1)} = \frac{\left\|R\right\|_{\rm F}^2}{k-(r+1)}.\]



Plugging into (10), we obtain the desired conclusion

    \[\mathbb{E} \left\|SG\Gamma^\dagger R\right\|_{\rm F}^2 =\left\|S\right\|_{\rm F}^2\cdot \mathbb{E} \left\|\Gamma^\dagger R\right\|_{\rm F}^2 = \frac{1}{k-(r+1)}\left\|S\right\|_{\rm F}^2\left\|R\right\|_{\rm F}^2.\]



This completes the proof.
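For readers who prefer to see numbers, here is a small Monte Carlo sanity check of the inverse-Wishart expectation used above; the dimensions and trial count are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    r, k, trials = 3, 8, 100_000
    acc = np.zeros((r, r))
    for _ in range(trials):
        Gamma = rng.standard_normal((r, k))
        acc += np.linalg.inv(Gamma @ Gamma.T)
    print(acc / trials)          # approximately I / (k - (r + 1)) = 0.25 * I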

Wrapping it All Up

We’re now within striking distance. All that is left is to assemble the pieces we have been working with to obtain a final bound for the randomized SVD.

Recall we have chosen the random matrix \Omega to be standard Gaussian. By the rotational invariance of the Gaussian distribution, \Omega_1 and \Omega_2 are independent and standard Gaussian as well. Plugging the matrix expectation bound into (9) then completes the analysis:

    \begin{align*}\mathbb{E} \left\|B - X\right\|_{\rm F}^2&\le \left\| B - \lowrank{B}_r \right\|_{\rm F}^2 + \mathbb{E}\norm{\Sigma_2\Omega_2\Omega_1^\dagger I_{r\times r}}^2_{\rm F} \\&= \left\| B - \lowrank{B}_r \right\|_{\rm F}^2 + \frac{\left\|\Sigma_2\right\|_{\rm F}^2\norm{I_{r\times r}}_{\rm F}^2}{k-(r+1)} \\&= \left(1 + \frac{r}{k-(r+1)}\right) \left\| B - \lowrank{B}_r \right\|_{\rm F}^2.\end{align*}

Low-Rank Approximation Toolbox: Randomized SVD

The randomized SVD and its relatives are workhorse algorithms for low-rank approximation. In this post, I hope to provide a fairly brief introduction to this useful method. In the following post, we’ll look into its analysis.

The randomized SVD is dead-simple to state: To approximate an m\times n matrix B by a rank-k matrix, perform the following steps:

  1. Collect information. Form Y = B\Omega where \Omega is an appropriately chosen n\times k random matrix.
  2. Orthogonalize. Compute an orthonormal basis Q for the column space of Y by, e.g., thin QR factorization.
  3. Project. Form C \coloneqq Q^*B. (Here, {}^* denotes the conjugate transpose of a complex matrix, which reduces to the ordinary transpose if the matrix is real.)

If all one cares about is a low-rank approximation to the matrix B, one can stop the randomized SVD here, having obtained the approximation

    \[B \approx X \coloneqq Q C.\]

As the name “randomized SVD” suggests, one can easily “upgrade” the approximation X = QC to a compact SVD format:

  1. SVD. Compute a compact SVD of the matrix C: C = \hat{U}\Sigma V^* where \hat{U} and \Sigma are k\times k and V is n\times k.
  2. Clean up. Set U \coloneqq Q\hat{U}.

We now have approximated B by a rank-k matrix X in compact SVD form:

    \[B \approx X = QC = U\Sigma V^*.\]

One can use the factors U, \Sigma, and V of the approximation X \approx B as estimates of the singular vectors and values of the matrix B.

What Can You Do with the Randomized SVD?

The randomized SVD has many uses. Pretty much anywhere you would ordinarily use an SVD is a candidate for the randomized SVD. Applications include model reduction, fast direct solvers, computational biology, uncertainty quantification, among numerous others.

To speak in broad generalities, there are two ways to use the randomized SVD:

  • As an approximation. First, we could use X as a compressed approximation to B. Using the format X = QC, X requires just (m+n)k numbers to store, whereas B requires a much larger mn numbers. As I detailed at length in my blog post on low-rank matrices, many operations are cheap for the low-rank matrix X. For instance, we can compute a matrix–vector product Xz in roughly (m+n)k operations rather than the mn operations needed to compute Bz (see the sketch after this list). For these use cases, we don’t need the “SVD” part of the randomized SVD.
  • As an approximate SVD. Second, we can actually use the U, \Sigma, and V produced by the randomized SVD as approximations to the singular vectors and values of B. In these applications, we should be careful. Just because X is a good approximation to B, it is not necessarily the case that U, \Sigma, and V are close to the singular vectors and values of B. To use the randomized SVD in this context, it is safest to use posterior diagnostics (such as the ones developed in this paper of mine) to ensure that the singular values/vectors of interest are computed to a high enough accuracy. A useful rule of thumb is that the smallest two to five singular values and vectors computed by the randomized SVD are suspect and should be used in applications only with extreme caution. When the appropriate care is taken and for certain problems, the randomized SVD can provide accurate singular vectors far faster than direct methods.
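The compressed-approximation bullet can be made concrete with a couple of lines of code; the sizes below are arbitrary, Q and C are stand-ins for the randomized SVD factors, and the point is simply that the factors are applied one at a time.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 2000, 1500, 20
    Q = np.linalg.qr(rng.standard_normal((m, k)))[0]   # stand-in for the factor Q
    C = rng.standard_normal((k, n))                    # stand-in for C = Q^* B
    z = rng.standard_normal(n)

    Xz = Q @ (C @ z)   # about (m + n) k operations; the full X = Q C is never formed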

Accuracy of the Randomized SVD

How accurate is the approximation X produced by the randomized SVD? No rank-k approximation can be more accurate than the best rank-k approximation \lowrank{B}_k to the matrix B, furnished by an exact SVD of B truncated to rank k. Thus, we’re interested in how much larger B - X is than B - \lowrank{B}_k.

Frobenius Norm Error

A representative result describing the error of the randomized SVD is the following:

(1)   \[\mathbb{E} \left\|B - X\right\|_{\rm F}^2 \le \min_{r \le k-2} \left( 1 + \frac{r}{k-(r+1)} \right) \left\|B - \lowrank{B}_r\right\|^2_{\rm F}. \]

This result states that the average squared Frobenius-norm error of the randomized SVD is comparable with the best rank-r approximation error for any r\le k-2. In particular, choosing k = r+2, we see that the randomized SVD error is at most (1+r) times the best rank-r approximation error

(2)   \[\mathbb{E} \left\|B - X\right\|_{\rm F}^2 \le (1+r) \left\|B - \lowrank{B}_r \right\|^2_{\rm F} \quad \text{for $k=r+2$}. \]

Choosing k = 2r+1, we see that the randomized SVD has at most twice the error of the best rank-r approximation

(3)   \[\mathbb{E} \left\|B - X\right\|_{\rm F}^2 \le 2 \left\|B - \lowrank{B}_r\right\|^2_{\rm F} \quad \text{for $k=2r+1$}. \]

In effect, these results tell us that if we want an approximation that is nearly as good as the best rank-r approximation, using an approximation rank of, say, k = r+2 or k = 2r for the randomized SVD suffices. These results hold even for worst-case matrices. For nice matrices with steadily decaying singular values, the randomized SVD can perform even better than equations (2)–(3) would suggest.

Spectral Norm Error

The spectral norm is often a better error metric for matrix computations than the Frobenius norm. Is the randomized SVD also comparable with the best rank-k approximation when we use the spectral norm? In this setting, a representative error bound is

(4)   \[\mathbb{E} \left\|B - X\right\|^2 \le \min_{r \le k-2} \left( 1 + \frac{2r}{k-(r+1)} \right) \left(\left\|B - \lowrank{B}_r \right\|^2 + \frac{\mathrm{e}^2}{k-r} \left\|B - \lowrank{B}_r\right\|^2_{\rm F} \right). \]

The spectral norm error of the randomized SVD depends on the Frobenius norm error of the best rank-r approximation for r \le k-2.

Recall that the spectral and Frobenius norm can be defined in terms of the singular values of the matrix B:

    \[\norm{B} = \sigma_1(B),\quad \left\|B\right\|_{\rm F}^2 = \sum_i \sigma_i(B)^2.\]

Rewriting (4) using these formulas, we get

    \[\mathbb{E} \left\|B - X\right\|^2 \le \min_{r \le k-2} \left( 1 + \frac{2r}{k-(r+1)} \right) \left(\sigma_{r+1}^2 + \frac{\mathrm{e}^2}{k-r} \sum_{i>r} \sigma_i^2 \right).\]

Despite being interested in the largest singular value of the error matrix B-X, this bound demonstrates that the randomized SVD incurs errors based on the entire tail of B’s singular values \sum_{i>r}\sigma_i^2. The randomized SVD is much worse than the best rank-k approximation for a matrix B with a long tail of slowly declining singular values.

Improving the Approximation: Subspace Iteration and Block Krylov Iteration

We saw that the randomized SVD produces nearly optimal low-rank approximations when we measure using the Frobenius norm. When we measure using the spectral norm, we have a split decision: If the singular values decay rapidly, the randomized SVD is near-optimal; if the singular values decay more slowly, the randomized SVD can suffer from large errors.

Fortunately, there are two fixes to improve the randomized SVD in the presence of slowly decaying singular values:

  • Subspace iteration: To improve the quality of the randomized SVD, we can combine it with the famous power method for eigenvalue computations. Instead of Y = B\Omega, we use the powered expression Y = B(B^*B)^q\Omega. Powering has the effect of amplifying the large singular values of B and diminishing the influence of the tail.
  • Block Krylov iteration: Subspace iteration is effective, but fairly wasteful. To compute B(B^*B)^q\Omega we compute the block Krylov sequence

        \[B\Omega, BB^*B\Omega, BB^*BB^*B\Omega,\ldots,B(B^*B)^q\Omega\]

    and throw away all but the last term. To make more economical use of resources, we can use the entire sequence as our information matrix Y:

        \[Y = \begin{bmatrix} B\Omega & BB^*B\Omega & BB^*BB^*B\Omega& \cdots & B(B^*B)^q\Omega\end{bmatrix}.\]

    Storing the entire block Krylov sequence is more memory-intensive but makes more efficient use of matrix products than subspace iteration.

Both subspace iteration and block Krylov iteration should be carefully implemented to produce accurate results.

Both subspace iteration and block Krylov iteration diminish the effect of a slowly decaying tail of singular values on the accuracy of the randomized SVD. The more slowly the tail decays, the larger one must take the number of iterations q to obtain accurate results.
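Here is a minimal sketch of randomized subspace iteration in this spirit. The function name and the re-orthonormalization after every product are my own choices (one common way to keep the computation stable); the block Krylov variant would instead accumulate all of the intermediate blocks into Y.

    import numpy as np

    def subspace_iteration_basis(B, k, q, rng=np.random.default_rng(0)):
        """Sketch: orthonormal basis for the range of B (B^* B)^q Omega."""
        M, N = B.shape
        Q, _ = np.linalg.qr(B @ rng.standard_normal((N, k)))
        for _ in range(q):
            Q, _ = np.linalg.qr(B.T @ Q)         # apply B^*, re-orthonormalize
            Q, _ = np.linalg.qr(B @ Q)           # apply B, re-orthonormalize
        return Q                                 # use in place of Q in the randomized SVD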

Low-Rank Approximation Toolbox: Nyström, Cholesky, and Schur

In the last post, we looked at the Nyström approximation to an N\times N positive semidefinite (psd) matrix A. A special case was the column Nyström approximation, defined to be1We use Matlab index notation to indicate submatrices of A.

(Nys)   \[\hat{A} = A(:,S) \, A(S,S)^{-1} \, A(S,:), \]

where S = \{s_1,\ldots,s_k\} \subseteq \{1,2,\ldots,N\} identifies a subset of k columns of A. Provided k\ll N, this allowed us to approximate all N^2 entries of the matrix A using only the kN entries in columns s_1,\ldots,s_k of A, a huge savings of computational effort!

Presented as a bare formula, the column Nyström approximation raises several questions:

  • Why this formula?
  • Where did it come from?
  • How do we pick the columns s_1,\ldots,s_k?
  • What is the residual A - \hat{A} of the approximation?

In this post, we will answer all of these questions by drawing a connection between Nyström low-rank approximation and the solution of linear systems of equations by Gaussian elimination. The connection between these two seemingly unrelated areas of matrix computations will pay dividends, leading to effective algorithms for computing Nyström approximations by (partial) Cholesky factorization of a positive (semi)definite matrix and to an elegant description of the residual of the Nyström approximation as a Schur complement.

Cholesky: Solving Linear Systems

Suppose we want to solve the system of linear equations Ax = b, where A is a real N\times N invertible matrix and b is a vector of length N. The standard way of doing this in modern practice (at least for matrices A that are not huge) is by means of Gaussian elimination/LU factorization. We factor the matrix A as a product A = LU of a lower triangular matrix L and an upper triangular matrix U.2To make this accurate, we usually have to reorder the rows of the matrix A as well. Thus, we actually compute a factorization PA = LU where P is a permutation matrix and L and U are triangular. The system Ax = b is solved by first solving Ly = b for y and then Ux = y for x; the triangularity of L and U makes solving the associated systems of linear equations easy.

For a real symmetric positive definite matrix A, a simplification is possible. In this case, one can compute an LU factorization in which the matrices L and U are transposes of each other, U = L^\top. This factorization A = LL^\top is known as a Cholesky factorization of the matrix A.

The Cholesky factorization can be easily generalized to allow the matrix A to be complex-valued. For a complex-valued positive definite matrix A, its Cholesky decomposition takes the form A = LL^*, where L is again a lower triangular matrix. All that has changed is that the transpose {}^\top has been replaced by the conjugate transpose {}^*. We shall work with the more general complex case going forward, though the reader is free to imagine all matrices as real and interpret the operation {}^* as ordinary transposition if they so choose.
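
As a concrete illustration, here is how a Cholesky factorization is used to solve Ax = b by forward and back substitution. This is a minimal NumPy/SciPy sketch; the matrix A and right-hand side b are randomly generated stand-ins.

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(0)
    N = 5
    G = rng.standard_normal((N, N))
    A = G @ G.T + N * np.eye(N)               # a random positive definite matrix
    b = rng.standard_normal(N)

    L = np.linalg.cholesky(A)                 # A = L L^T with L lower triangular
    y = solve_triangular(L, b, lower=True)    # forward substitution: solve L y = b
    x = solve_triangular(L.T, y, lower=False) # back substitution: solve L^T x = y
    print(np.allclose(A @ x, b))              # True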

Schur: Computing the Cholesky Factorization

Here’s one way of computing the Cholesky factorization using recursion. Write the matrix A in block form as

    \[A = \twobytwo{A_{11}}{A_{12}}{A_{21}}{A_{22}}. \]

Our first step will be to “block Cholesky factorize” the matrix A, factoring A as a product of matrices that are only block triangular. Then, we’ll “upgrade” this block factorization into a full Cholesky factorization.

The core idea of Gaussian elimination is to combine rows of a matrix to introduce zero entries. For our case, observe that multiplying the first block row of A by A_{21}A_{11}^{-1} and subtracting this from the second block row introduces a matrix of zeros into the bottom left block of A. (The matrix A_{11} is a principal submatrix of A and is therefore guaranteed to be positive definite and thus invertible.3To directly see A_{11} is positive definite, for instance, observe that since A is positive definite, x^* A_{11}x = \twobyone{x}{0}^* A\twobyone{x}{0} > 0 for every nonzero vector x.) In matrix language,

    \[\twobytwo{I}{0}{-A_{21}A_{11}^{-1}}{I}\twobytwo{A_{11}}{A_{12}}{A_{21}}{A_{22}} = \twobytwo{A_{11}}{A_{12}}{0}{A_{22} - A_{21}A_{11}^{-1}A_{12}}.\]

Isolating A on the left-hand side of this equation by multiplying by

    \[\twobytwo{I}{0}{-A_{21}A_{11}^{-1}}{I}^{-1} = \twobytwo{I}{0}{A_{21}A_{11}^{-1}}{I}\]

yields the block triangular factorization

    \[A = \twobytwo{A_{11}}{A_{12}}{A_{21}}{A_{22}} = \twobytwo{I}{0}{A_{21}A_{11}^{-1}}{I} \twobytwo{A_{11}}{A_{12}}{0}{A_{22} - A_{21}A_{11}^{-1}A_{12}}.\]

We’ve factored A into block triangular pieces, but these pieces are not (conjugate) transposes of each other. Thus, to make this equation more symmetrical, we can further decompose

(1)   \[A = \twobytwo{A_{11}}{A_{12}}{A_{21}}{A_{22}} = \twobytwo{I}{0}{A_{21}A_{11}^{-1}}{I} \twobytwo{A_{11}}{0}{0}{A_{22} - A_{21}A_{11}^{-1}A_{12}} \twobytwo{I}{0}{A_{21}A_{11}^{-1}}{I}^*. \]

This is a block version of the Cholesky decomposition of the matrix A taking the form A = LDL^*, where L is a block lower triangular matrix and D is a block diagonal matrix.

We’ve met the second of our main characters, the Schur complement

(Sch)   \[S = A_{22} - A_{21}A_{11}^{-1}A_{12}. \]

This seemingly innocuous combination of matrices has a tendency to show up in surprising places when one works with matrices.4See my post on the Schur complement for more examples. Its appearance in any one place is unremarkable, but the sheer ubiquity of A_{22} - A_{21}A_{11}^{-1}A_{12} in matrix theory makes it deserving of its special name, the Schur complement. To us for now, the Schur complement is just the matrix appearing in the bottom right corner of our block Cholesky factorization.

The Schur complement enjoys the following property:5This property is a consequence of equation (1) together with the conjugation rule for positive (semi)definiteness, which I discussed in this previous post.

Positivity of the Schur complement: If A=\twobytwo{A_{11}}{A_{12}}{A_{21}}{A_{22}} is positive (semi)definite, then the Schur complement S = A_{22} - A_{21}A_{11}^{-1}A_{12} is positive (semi)definite.

As a consequence of this property (together with the fact, noted above, that principal submatrices of a positive definite matrix are positive definite), we conclude that both A_{11} and S are positive definite.

With the positive definiteness of the Schur complement in hand, we now return to our Cholesky factorization algorithm. Continue by recursively6As always with recursion, one needs to specify the base case. For us, the base case is just that Cholesky decomposition of a 1\times 1 matrix A is A = A^{1/2} \cdot A^{1/2}. computing Cholesky factorizations of the diagonal blocks

    \[A_{11} = L_{11}^{\vphantom{*}}L_{11}^*, \quad S = L_{22}^{\vphantom{*}}L_{22}^*.\]

Inserting these into the block LDL^* factorization (1) and simplifying gives a Cholesky factorization, as desired:

    \[A = \twobytwo{L_{11}}{0}{A_{21}^{\vphantom{*}}(L_{11}^{*})^{-1}}{L_{22}}\twobytwo{L_{11}}{0}{A_{21}^{\vphantom{*}}(L_{11}^{*})^{-1}}{L_{22}}^* =: LL^*.\]

Voilà, we have obtained a Cholesky factorization of a positive definite matrix A!

By unwinding the recursions (and always choosing the top left block A_{11} to be of size 1\times 1), our recursive Cholesky algorithm becomes the following iterative algorithm: Initialize L to be the zero matrix. For j = 1,2,3,\ldots,N, perform the following steps:

  1. Update L. Set the jth column of L:

        \[L(j:N,j) \leftarrow A(j:N,j)/\sqrt{a_{jj}}.\]

  2. Update A. Update the bottom right portion of A to be the Schur complement:

        \[A(j+1:N,j+1:N)\leftarrow A(j+1:N,j+1:N) - \frac{A(j+1:N,j)A(j,j+1:N)}{a_{jj}}.\]

This iterative algorithm is how Cholesky factorization is typically presented in textbooks.
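
Here is a minimal NumPy rendering of this iterative procedure for the real case (the function name cholesky and the final correctness check are my own additions for illustration):

    import numpy as np

    def cholesky(A):
        # Unpivoted Cholesky factorization A = L L^T via the two update steps above.
        A = A.astype(float).copy()
        N = A.shape[0]
        L = np.zeros_like(A)
        for j in range(N):
            # 1. Update L: set the jth column
            L[j:, j] = A[j:, j] / np.sqrt(A[j, j])
            # 2. Update A: replace the trailing block by its Schur complement
            A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:]) / A[j, j]
        return L

    # quick check on a random positive definite matrix
    rng = np.random.default_rng(0)
    G = rng.standard_normal((6, 6))
    A = G @ G.T + 6 * np.eye(6)
    L = cholesky(A)
    print(np.allclose(L @ L.T, A), np.allclose(L, np.tril(L)))   # True True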

Nyström: Using Cholesky Factorization for Low-Rank Approximation

Our motivating interest in studying the Cholesky factorization was the solution of linear systems of equations Ax = b for a positive definite matrix A. We can also use the Cholesky factorization for a very different task, low-rank approximation.

Let’s first look at things through the lens of the recursive form of the Cholesky factorization. The first step of the factorization was to form the block Cholesky factorization

    \[A = \twobytwo{I}{0}{A_{21}A_{11}^{-1}}{I} \twobytwo{A_{11}}{0}{0}{A_{22} - A_{21}A_{11}^{-1}A_{12}} \twobytwo{I}{0}{A_{21}A_{11}^{-1}}{I}^*.\]

Suppose that we choose the top left A_{11} block to be of size k\times k, where k is much smaller than N. The most expensive part of the Cholesky factorization will be the recursive factorization of the Schur complement A_{22} - A_{21}A_{11}^{-1} A_{12}, which is a large matrix of size (N-k)\times (N-k).

To reduce computational cost, we ask the provocative question: What if we simply didn’t factorize the Schur complement? Observe that we can write the block Cholesky factorization as a sum of two terms

(2)   \[A = \twobyone{I}{A_{21}A_{11}^{-1}} A_{11} \twobyone{I}{A_{21}A_{11}^{-1}}^* + \twobytwo{0}{0}{0}{A_{22}-A_{21}A_{11}^{-1}A_{12}}. \]

We can use the first term of this sum as a rank-k approximation to the matrix A. The low-rank approximation, which we can write out more conveniently as

    \[\hat{A} = \twobyone{A_{11}}{A_{21}} A_{11}^{-1} \onebytwo{A_{11}}{A_{12}} = \twobytwo{A_{11}}{A_{12}}{A_{21}}{A_{21}A_{11}^{-1}A_{12}},\]

is the column Nyström approximation (Nys) to A associated with the index set S = \{1,2,3,\ldots,k\} and is the final of our three titular characters. The residual of the Nyström approximation is the second term in (2), which is none other than the Schur complement (Sch), padded by rows and columns of zeros:

    \[A - \hat{A} = \twobytwo{0}{0}{0}{A_{22}-A_{21}A_{11}^{-1}A_{12}}.\]

Observe that the approximation \hat{A} is obtained by stopping a Cholesky factorization partway through its execution, so we say that the Nyström approximation results from a partial Cholesky factorization of the matrix A.

Summing things up:

If we perform a partial Cholesky factorization on a positive (semi)definite matrix, we obtain a low-rank approximation known as the column Nyström approximation. The residual of this approximation is the Schur complement, padded by rows and columns of zeros.

The idea of obtaining a low-rank approximation from a partial matrix factorization is very common in matrix computations. Indeed, the optimal low-rank approximation to a real symmetric matrix is obtained by truncating its eigenvalue decomposition and the optimal low-rank approximation to a general matrix is obtained by truncating its singular value decomposition. While the column Nyström approximation is not the optimal rank-k approximation to A (though it does satisfy a weaker notion of optimality, as discussed in this previous post), it has a killer feature not possessed by the optimal approximation:

The column Nyström approximation is formed from only k columns from the matrix A. A column Nyström approximation approximates an N\times N matrix by only reading a fraction of its entries!

Unfortunately, there’s no free lunch here. The column Nyström approximation is only a good low-rank approximation if the Schur complement has small entries. In general, this need not be the case. Fortunately, we can improve our situation by means of pivoting.

Our iterative Cholesky algorithm first performs elimination using the entry in position (1,1), followed by position (2,2), and so on. There’s no need to insist on this exact ordering of elimination steps. Indeed, at each step of the Cholesky algorithm, we can choose whichever diagonal position (j,j) we want to perform elimination with. The entry we choose to perform elimination with is called a pivot.

Obtaining good column Nyström approximations requires identifying a good choice of pivots that reduces the size of the entries of the Schur complement at each step of the algorithm. With general pivot selection, an iterative algorithm for computing a column Nyström approximation by partial Cholesky factorization proceeds as follows: Initialize an N\times k matrix F to store the column Nyström approximation \hat{A} = FF^*, in factored form. For j = 1,2,\ldots,k, perform the following steps:

  1. Select pivot. Choose a pivot s_j.
  2. Update the approximation. F(:,j) \leftarrow A(:,s_j) / \sqrt{a_{s_js_j}}.
  3. Update the residual. A \leftarrow A - \frac{A(:,s_j)A(s_j,:)}{a_{s_js_j}}.

This procedure results in the Nyström approximation (Nys) with column set S = \{s_1,\ldots,s_k\}:

    \[\hat{A} = FF^* = A(:,S) \, A(S,S)^{-1} \, A(S,:).\]

The pivoted Cholesky steps 1–3 require updating the entire matrix A at every step. With a little more cleverness, we can optimize this procedure to only update the entries of the matrix A that we need to form the approximation \hat{A} = FF^*. See Algorithm 2 in this preprint by my coauthors and me for details.
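
As a concrete, unoptimized sketch of steps 1–3, here is a NumPy implementation that updates the full residual at every step. The function name partial_cholesky and its choose_pivot argument are notational conveniences of mine, not a library API; for the optimized variant that only touches the needed entries, see the preprint mentioned above.

    import numpy as np

    def partial_cholesky(A, k, choose_pivot):
        # Partial Cholesky factorization: returns F with A ~= F F^T (the column Nystrom
        # approximation) and the selected pivot set S. choose_pivot(residual) returns
        # the index of the next pivot given the current residual matrix.
        A = A.astype(float).copy()
        N = A.shape[0]
        F = np.zeros((N, k))
        S = []
        for j in range(k):
            s = choose_pivot(A)                          # 1. select pivot
            F[:, j] = A[:, s] / np.sqrt(A[s, s])         # 2. update the approximation
            A -= np.outer(A[:, s], A[s, :]) / A[s, s]    # 3. update the residual
            S.append(s)
        return F, S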

How should we choose the pivots? Two simple strategies immediately suggest themselves:

  • Uniformly random. At each step j, select s_j uniformly at random from among the un-selected pivot indices.
  • Greedy.7The greedy pivoting selection is sometimes known as diagonal pivoting or complete pivoting in the numerical analysis literature. At each step j, select s_j to be the index of the largest diagonal entry of the current residual A:

        \[s_j = \argmax_{1\le i\le N} a_{ii}.\]

The greedy strategy often (but not always) works well, and the uniformly random approach can work surprisingly well if the matrix A is “incoherent”, with all rows and columns of the matrix possessing “similar importance”.

Despite often working fairly well, both the uniform and greedy schemes can fail significantly, producing very low-quality approximations. My research (joint with Yifan Chen, Joel A. Tropp, and Robert J. Webber) has investigated a third strategy striking a balance between these two approaches:

  • Diagonally weighted random. At each step j, select s_j at random with probabilities proportional to the diagonal entries of the current residual matrix:

        \[\mathbb{P} \{ s_j = i \} = \frac{a_{ii}}{\operatorname{tr} A}.\]

Our paper provides theoretical analysis and empirical evidence showing that this diagonally weighted random pivot selection (which we call randomly pivoted Cholesky, aka RPCholesky) performs well at approximating all positive semidefinite matrices A, even those for which uniform random and greedy pivot selection fail. The success of this approach can be seen in the examples in Figure 1 of the paper, which show that RPCholesky can produce much smaller errors than the greedy and uniform methods. A small code sketch of the three pivot rules appears below.
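
Here is how the three pivot rules might be written so that they can be plugged into the partial_cholesky sketch given earlier (again, the function names are mine, and this is not the optimized RPCholesky implementation from the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def greedy_pivot(A):
        # index of the largest diagonal entry of the current residual
        return int(np.argmax(np.diag(A)))

    def uniform_pivot(A):
        # uniform over indices not yet eliminated (their residual diagonal is still positive)
        d = np.diag(A)
        return int(rng.choice(np.flatnonzero(d > 1e-12 * d.max())))

    def rpcholesky_pivot(A):
        # diagonally weighted sampling: P{ s_j = i } = a_ii / tr(A)
        d = np.clip(np.diag(A), 0.0, None)    # clip tiny negative values from roundoff
        return int(rng.choice(len(d), p=d / d.sum()))

    # e.g., F, S = partial_cholesky(A, k, rpcholesky_pivot)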

Conclusions

In this post, we’ve seen that a column Nyström approximation can be obtained from a partial Cholesky decomposition. The residual of the approximation is the Schur complement. I hope you agree that this is a very nice connection between these three ideas. But beyond its mathematical beauty, why do we care about this connection? Here are my reasons for caring:

  • Analysis. Cholesky factorization and the Schur complement are very well-studied in matrix theory. We can use known facts about Cholesky factorization and Schur complements to prove things about the Nyström approximation, as we did when we invoked the positivity of the Schur complement.
  • Algorithms. Cholesky factorization-based algorithms like randomly pivoted Cholesky are effective in practice at producing high-quality column Nyström approximations.

On a broader level, our tale of Nyström, Cholesky, and Schur demonstrates that there are rich connections between low-rank approximation and (partial versions of) classical matrix factorizations like LU (with partial pivoting), Cholesky, QR, eigendecomposition, and SVD for full-rank matrices. These connections can be vital to analyzing low-rank approximation algorithms and developing improvements.

Low-Rank Approximation Toolbox: Nyström Approximation

Welcome to a new series for this blog, Low-Rank Approximation Toolbox. As I discussed in a previous post, many matrices we encounter in applications are well-approximated by a matrix with a small rank. Efficiently computing low-rank approximations has been a major area of research, with applications in everything from classical problems in computational physics and signal processing to trendy topics like data science. In this series, I want to explore some broadly useful algorithms and theoretical techniques in the field of low-rank approximation.

I want to begin this series by talking about one of the fundamental types of low-rank approximation, the Nyström approximation of a N\times N (real symmetric or complex Hermitian) positive semidefinite (psd) matrix A. Given an arbitrary N\times k “test matrix” \Omega, the Nyström approximation is defined to be

(1)   \[A\langle \Omega\rangle := A\Omega \, (\Omega^*A\Omega)^{-1} \, \Omega^*A. \]

This formula is sensible whenever \Omega^*A\Omega is invertible; if \Omega^*A\Omega is not invertible, then the inverse {}^{-1} should be replaced by the Moore–Penrose pseudoinverse {}^\dagger. For simplicity, I will assume that \Omega^* A \Omega is invertible in this post, though everything we discuss will continue to work if this assumption is dropped. I use {}^* to denote the conjugate transpose of a matrix, which agrees with the ordinary transpose {}^\top for real matrices. I will use the word self-adjoint to refer to a matrix which satisfies A=A^*.

The Nyström approximation (1) answers the question

What is the “best” rank-k approximation to the psd matrix A provided only with the matrix–matrix product A\Omega, where \Omega is a known N\times k matrix (k\ll N)?

Indeed, if we let Y = A\Omega, we observe that the Nyström approximation can be written entirely using Y and \Omega:

    \[A\langle \Omega\rangle = Y \, (\Omega^* Y)^{-1}\, Y^*.\]

This is the central advantage of the Nyström approximation: to compute it, the only access to the matrix A I need is the ability to multiply the matrices A and \Omega. In particular, I only need a single pass over the entries of A to compute the Nyström approximation. This allows the Nyström approximation to be used in settings where other low-rank approximations wouldn’t work, such as when A is streamed to me as a sum of matrices that must be processed as they arrive and then discarded.
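
Here is a minimal NumPy sketch of this single-pass construction (the function name nystrom is mine; a numerically robust implementation would add shifting and other safeguards that I omit here):

    import numpy as np

    def nystrom(Y, Omega):
        # Nystrom approximation A<Omega> = Y (Omega^* Y)^{-1} Y^*, built from Y = A Omega
        # and Omega alone; the pseudoinverse covers the rank-deficient case.
        core = Omega.conj().T @ Y
        core = (core + core.conj().T) / 2      # symmetrize against roundoff
        return Y @ np.linalg.pinv(core) @ Y.conj().T

    # usage (with A psd and Omega an N x k test matrix):
    #   Y = A @ Omega                          # the only access to A that is needed
    #   A_hat = nystrom(Y, Omega)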

Choosing the Test Matrix

Every choice of N\times k test matrix \Omega defines a rank-k Nyström approximation1Actually, A\langle \Omega\rangle is only rank at most k. For this post, we will use rank-k to mean “rank at most k”. A\langle \Omega\rangle by (1). Unfortunately, the Nyström approximation won’t be a good low-rank approximation for every choice of \Omega. For an example of what can go wrong, if we pick \Omega to have columns selected from the eigenvectors of A with small eigenvalues, the approximation A\langle \Omega\rangle will be quite poor.

The very best choice of \Omega would be the k eigenvectors associated with the largest eigenvalues. Unfortunately, computing the eigenvectors to high accuracy is computationally costly. Fortunately, we can get decent low-rank approximations out of much simpler \Omega’s:

  1. Random: Perhaps surprisingly, we get a fairly good low-rank approximation out of just choosing \Omega to be a random matrix, say, populated with statistically independent standard normal random entries. Intuitively, a random matrix is likely to have columns with meaningful overlap with the large-eigenvalue eigenvectors of A (and indeed with any k fixed orthonormal vectors). One can also pick more exotic kinds of random matrices, which can have computational benefits.
  2. Random then improve: The more similar the test matrix \Omega is to the large-eigenvalue eigenvectors of A, the better the low-rank approximation will be. Therefore, it makes sense to use the power method (usually called subspace iteration in this context) to improve a random initial test matrix \Omega_{\rm init} to be closer to the eigenvectors of A.2Even better than subspace iteration is block Krylov iteration. See section 11.6 of the following survey for details.
  3. Column selection: If \Omega consists of columns i_1,i_2,\ldots,i_k of the identity matrix, then A\Omega just consists of columns i_1,\ldots,i_k of A: In MATLAB notation,

        \[A(:,\{i_1,\ldots,i_k\}) = A\Omega \quad \text{for}\quad \Omega = I(:,\{i_1,i_2,\ldots,i_k\}).\]

    This is highly appealing as it allows us to approximate the matrix A by only reading a small fraction of its entries (provided k\ll N)! (A tiny numerical check of this identity appears just after this list.) Producing a good low-rank approximation requires selecting the right column indices i_1,\ldots,i_k (usually under the constraint of reading a small number of entries from A). In my research with Yifan Chen, Joel A. Tropp, and Robert J. Webber, I’ve argued that the most well-rounded algorithm for this task is a randomly pivoted partial Cholesky decomposition.
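
Here is the promised tiny check that a test matrix made of identity columns simply extracts columns of A (a NumPy sketch with 0-based indexing and a randomly generated psd matrix standing in for A):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 6
    G = rng.standard_normal((N, N))
    A = G @ G.T                               # a random psd matrix

    idx = [0, 2, 5]                           # column indices i_1, ..., i_k (0-based)
    Omega = np.eye(N)[:, idx]                 # Omega = I(:, {i_1, ..., i_k})
    print(np.allclose(A @ Omega, A[:, idx]))  # A Omega is just those k columns of A -> True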

The Projection Formula

Now that we’ve discussed the choice of test matrix, we shall explore the quality of the Nyström approximation as measured by the size of the residual A - A\langle \Omega\rangle. As a first step, we shall show that the residual is psd. This means that A\langle \Omega\rangle is an underapproximation to A.

The positive semidefiniteness of the residual follows from the following projection formula for the Nyström approximation:

    \[A\langle \Omega \rangle = A^{1/2} P_{A^{1/2}\Omega} A^{1/2}.\]

Here, P_{A^{1/2}\Omega} denotes the orthogonal projection onto the column space of the matrix A^{1/2}\Omega. To deduce the projection formula, we break down A as A = A^{1/2}\cdot A^{1/2} in (1):

    \[A\langle \Omega\rangle = A^{1/2} \left( A^{1/2}\Omega \left[ (A^{1/2}\Omega)^* A^{1/2}\Omega \right]^{-1} (A^{1/2}\Omega)^* \right) A^{1/2}.\]

The fact that the parenthesized quantity is P_{A^{1/2}\Omega} can be verified in a variety of ways, such as by QR factorization.3Let A^{1/2} \Omega = QR, where Q has orthonormal columns and R is square and upper triangular. The orthogonal projection is P_{A^{1/2}\Omega} = QQ^*. The parenthesized expression is (QR)(R^*Q^*QR)^{-1}R^*Q^* = QRR^{-1}R^{-*}R^*Q^* = QQ^* = P_{A^{1/2}\Omega}.

With the projection formula in hand, we easily obtain the following expression for the residual:

    \[A - A\langle \Omega\rangle = A^{1/2} (I - P_{A^{1/2}\Omega}) A^{1/2}.\]

To show that this residual is psd, we make use of the conjugation rule.

Conjugation rule: For a matrix B and a self-adjoint matrix H, if H is psd then B^*HB is psd. If B is invertible, then the converse holds: if B^*HB is psd, then H is psd.

The matrix I - P_{A^{1/2}\Omega} is an orthogonal projection and therefore psd. Thus, by the conjugation rule, the residual of the Nyström approximation is psd:

    \[A - A\langle \Omega\rangle = \left(A^{1/2}\right)^* (I-P_{A^{1/2}\Omega})A^{1/2} \quad \text{is psd}.\]
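
For readers who like to see such identities numerically, here is a small NumPy check of the projection formula and of the positive semidefiniteness of the residual (the psd matrix A and test matrix \Omega are randomly generated stand-ins, and A^{1/2} is formed from an eigendecomposition purely for verification purposes):

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 50, 5
    G = rng.standard_normal((N, N))
    A = G @ G.T                                   # a random psd matrix
    Omega = rng.standard_normal((N, k))

    # Nystrom approximation straight from the definition (1)
    Y = A @ Omega
    A_hat = Y @ np.linalg.solve(Omega.T @ Y, Y.T)

    # Projection formula: A<Omega> = A^{1/2} P_{A^{1/2} Omega} A^{1/2}
    w, V = np.linalg.eigh(A)
    A_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
    Q, _ = np.linalg.qr(A_half @ Omega)
    print(np.allclose(A_hat, A_half @ (Q @ Q.T) @ A_half))   # True (up to roundoff)

    # The residual is psd: its eigenvalues are nonnegative (up to roundoff)
    print(np.linalg.eigvalsh(A - A_hat).min() >= -1e-8)      # True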

Optimality of the Nyström Approximation

There’s a question we’ve been putting off that can’t be deferred any longer:

Is the Nyström approximation actually a good low-rank approximation?

As we discussed earlier, the answer to this question depends on the test matrix \Omega. Different choices for \Omega give different approximation errors. See the following papers for Nyström approximation error bounds with different choices of \Omega. While the Nyström approximation can be better or worse depending on the choice of \Omega, what is true about Nyström approximation for every choice of \Omega is that:

The Nyström approximation A\langle \Omega\rangle is the best possible rank-k approximation to A given the information A\Omega.

In precise terms, I mean the following:

Theorem: Out of all self-adjoint matrices \hat{A} spanned by the columns of A\Omega with a psd residual A - \hat{A}, the Nyström approximation has the smallest error as measured by either the spectral or Frobenius norm (or indeed any unitarily invariant norm, see below).

Let’s break this statement down a bit. This result states that the Nyström approximation is the best approximation \hat{A} to A under three conditions:

  1. \hat{A} is self-adjoint.
  2. \hat{A} is spanned by the columns of A\Omega.

I find these first two requirements to be natural. Since A is self-adjoint, it makes sense to require our approximation \hat{A} to be as well. The stipulation that \hat{A} is spanned by the columns of A\Omega seems like a very natural requirement given we want to think of approximations which only use the information A\Omega. Additionally, requirement 2 ensures that \hat{A} has rank at most k, so we are really only considering low-rank approximations to A.

The last requirement is less natural:

  3. The residual A - \hat{A} is psd.

This is not an obvious requirement to impose on our approximation. Indeed, it was a nontrivial calculation using the projection formula to show that the Nyström approximation itself satisfies this requirement! Nevertheless, this third stipulation is required to make the theorem true. The Nyström approximation (1) is the best “underapproximation” to the matrix A in the span of A\Omega.

Intermezzo: Unitarily Invariant Norms and the Psd Order

To prove our theorem about the optimality of the Nyström approximation, we shall need two ideas from matrix theory: unitarily invariant norms and the psd order. We shall briefly describe each in turn.

A norm \left\|\cdot\right\|_{\rm UI} defined on the set of N\times N matrices is said to be unitarily invariant if the norm of a matrix does not change upon left- or right-multiplication by a unitary matrix:

    \[\left\|UBV\right\|_{\rm UI} = \left\|B\right\|_{\rm UI} \quad \text{for all unitary matrices $U$ and $V$.}\]

Recall that a unitary matrix U (called a real orthogonal matrix if U is real-valued) is one that obeys U^*U = UU^* = I. Unitary matrices preserve the Euclidean lengths of vectors, which makes the class of unitarily invariant norms highly natural. Important examples include the spectral, Frobenius, and nuclear matrix norms:

The unitarily invariant norm of a matrix B depends entirely on its singular values \sigma(B). For instance, the spectral, Frobenius, and nuclear norms take the forms

    \begin{align*}\left\|B\right\|_{\rm op} &= \sigma_1(B),& &\text{(spectral)} \\ \left\|B\right\|_{\rm F} &= \sqrt{\sum_{j=1}^N (\sigma_j(B))^2},& &\text{(Frobenius)} \\ \left\|B\right\|_{*} &= \sum_{j=1}^N \sigma_j(B).& &\text{(nuclear)}\end{align*}

In addition to being entirely determined by the singular values, unitarily invariant norms are non-decreasing functions of the singular values: If the jth singular value of B is larger than the jth singular value of C for 1\le j\le N, then \left\|B\right\|_{\rm UI}\ge \left\|C\right\|_{\rm UI} for every unitarily invariant norm \left\|\cdot\right\|_{\rm UI}. For more on unitarily invariant norms, see this short and information-packed blog post from Nick Higham.

Our second ingredient is the psd order (also known as the Loewner order). A self-adjoint matrix A is larger than a self-adjoint matrix H according to the psd order, written A\succeq H, if the difference A-H is psd. As a consequence, A\succeq 0 if and only if A is psd, where 0 here denotes the zero matrix of the same size as A. Using the psd order, the positive semidefiniteness of the Nyström residual can be written as A - A\langle \Omega\rangle \succeq 0.

If A and H are both psd matrices and A is larger than H in the psd order, A\succeq H\succeq 0, it seems natural to expect that A is larger than H in norm. Indeed, this intuitive statement is true, at least when one restricts oneself to unitarily invariant norms.

Psd order and norms. If A\succeq H\succeq 0, then \left\|A\right\|_{\rm UI} \ge \left\|H\right\|_{\rm UI} for every unitarily invariant norm \left\|\cdot\right\|_{\rm UI}.

This fact is a consequence of the following observations:

  • If A\succeq H, then the eigenvalues of A are larger than those of H, in the sense that the jth largest eigenvalue of A is at least the jth largest eigenvalue of H.
  • The singular values of a psd matrix are its eigenvalues.
  • Unitarily invariant norms are non-decreasing functions of the singular values.
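
As a quick numerical sanity check of this monotonicity fact, here is a NumPy sketch with randomly generated psd matrices that satisfy A \succeq H \succeq 0 by construction:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 30
    G = rng.standard_normal((N, N))
    H = G @ G.T                               # H is psd
    D = rng.standard_normal((N, N))
    A = H + D @ D.T                           # A = H + (psd), so A >= H >= 0

    for p in [2, "fro", "nuc"]:               # spectral, Frobenius, nuclear norms
        print(p, np.linalg.norm(A, p) >= np.linalg.norm(H, p))   # True for each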

Optimality of the Nyström Approximation: Proof

In this section, we’ll prove our theorem showing the Nyström approximation is the best low-rank approximation satisfying properties 1, 2, and 3. To this end, let \hat{A} be any matrix satisfying properties 1, 2, and 3. Because of properties 1 (self-adjointness) and 2 (spanned by columns of A\Omega), \hat{A} can be written in the form

    \[\hat{A} = A\Omega \, T \, (A\Omega)^* = A \Omega \, T \, \Omega^*A,\]

where T is a self-adjoint matrix. To make this more similar to the projection formula, we can factor A^{1/2} on both sides to obtain

    \[\hat{A} = A^{1/2} (A^{1/2}\Omega\, T\, \Omega^*A^{1/2}) A^{1/2}.\]

To make this more comparable to the projection formula, we can reparametrize by introducing a matrix Q with orthonormal columns with the same column space as A^{1/2}\Omega. Under this parametrization, \hat{A} takes the form

    \[\hat{A} = A^{1/2} \,QMQ^*\, A^{1/2} \quad \text{where} \quad M\text{ is self-adjoint}.\]

The residual of this approximation is

(2)   \[A - \hat{A} = A^{1/2} (I - QMQ^*)A^{1/2}. \]

We now make use of the conjugation rule again. To simplify things, we make the assumption that A is invertible. (As an exercise, see if you can adapt this argument to the case when this assumption doesn’t hold!) Since A - \hat{A}\succeq 0 is psd (property 3), the conjugation rule tells us that

    \[I - QMQ^*\succeq 0.\]

What does this observation tell us about M? We can apply the conjugation rule again to conclude

    \[Q^*(I - QMQ^*)Q = Q^*Q - (Q^*Q)M(Q^*Q) = I-M \succeq 0.\]

(Notice that Q^*Q = I since Q has orthonormal columns.)

We are now in a place to show that A - \hat{A}\succeq A - A\langle \Omega\rangle. Indeed,

    \begin{align*}A - \hat{A} - (A-A\langle \Omega\rangle) &= A\langle\Omega\rangle - \hat{A} \\&= A^{1/2}\underbrace{QQ^*}_{=P_{A^{1/2}\Omega}}A^{1/2} - A^{1/2}QMQ^*A^{1/2} \\&=A^{1/2}Q(I-M)Q^*A^{1/2}\\&\succeq 0.\end{align*}

The second line is the projection formula together with the observation that P_{A^{1/2}\Omega} = QQ^*, and the last line is the conjugation rule combined with the fact that I-M is psd. Thus, having shown

    \[A - \hat{A} \succeq A - A\langle\Omega\rangle \succeq 0,\]

we conclude

    \[\|A - \hat{A}\|_{\rm UI} \ge \left\|A - A\langle \Omega\rangle\right\|_{\rm UI} \quad \text{for every unitarily invariant norm $\left\|\cdot\right\|_{\rm UI}$}.\]