I am delighted to share that my paper Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations, joint with Yifan Chen, Joel Tropp, and Robert Webber, has been published online in Communications on Pure and Applied Mathematics. To celebrate this occasion, I want to share one of my favorite tricks in the design of low-rank approximation algorithms, which I will call the Gram correspondence.
Projection Approximation and Nyström Approximation
When we construct a low-rank approximation to a matrix, the type of approximation we use is typically dictated by the size and properties of the matrix. For a rectangular matrix $B \in \mathbb{R}^{m \times n}$, one of the standard techniques is the randomized SVD algorithm (a code sketch follows the steps below):
- Generate a Gaussian random matrix $\Omega \in \mathbb{R}^{n \times k}$.
- Form the product $B\Omega$.
- Compute an (economy-size) QR decomposition $B\Omega = QR$.
- Evaluate the SVD $Q^\top B = \hat{U} \Sigma V^\top$.
- Output the low-rank approximation $\hat{B} = U \Sigma V^\top$ for $U = Q\hat{U}$.
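To make these steps concrete, here is a minimal numpy sketch of the procedure above. The function name and interface are my own choices, and a practical implementation would add oversampling and other refinements.

```python
import numpy as np

def randomized_svd(B, k, rng=None):
    """Minimal sketch of the rank-k randomized SVD of a rectangular matrix B."""
    rng = np.random.default_rng() if rng is None else rng
    n = B.shape[1]
    Omega = rng.standard_normal((n, k))    # Gaussian random test matrix
    Y = B @ Omega                          # form the product B * Omega
    Q, _ = np.linalg.qr(Y)                 # economy-size QR decomposition of B * Omega
    U_hat, sigma, Vt = np.linalg.svd(Q.T @ B, full_matrices=False)
    U = Q @ U_hat                          # so that B_hat = U @ np.diag(sigma) @ Vt
    return U, sigma, Vt
```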
Succinctly, the output of the randomized SVD is given by the formula $\hat{B} = \Pi_{B\Omega} B$, where $\Pi_F$ denotes the orthogonal projector onto the column span of a matrix $F$. This motivates the general definition:
Definition (projection approximation): Given a test matrix $\Omega \in \mathbb{R}^{n \times k}$, the projection approximation to the matrix $B \in \mathbb{R}^{m \times n}$ is $\hat{B} = \Pi_{B\Omega} B$.
The class of projection approximations is much richer than merely the approximation constructed by the standard randomized SVD. Indeed, low-rank approximations computed by randomized subspace iteration, randomized block Krylov iteration, column-pivoted QR decompositions, etc. all fit under the umbrella of projection approximations.
Many matrices in applications have additional structure such as symmetry or sparsity, and it can be valuable to make use of low-rank approximations that take advantage of those properties. An especially important type of structure is positive semidefiniteness. For our purposes, a positive semidefinite matrix is one that is symmetric and possesses nonnegative eigenvalues, and we will abbreviate “positive semidefinite” as “psd”. Psd matrices arise in applications as covariance matrices, kernel matrices, and as discretizations of certain differential and integral operators. Further, any rectangular matrix $B$ gives rise to its psd Gram matrix $A = B^\top B$; we will have much more to say about Gram matrices below.
To compute a low-rank approximation to a psd matrix $A \in \mathbb{R}^{n \times n}$, the preferred format is usually the Nyström approximation:
Definition (Nyström approximation): Given a test matrix $\Omega \in \mathbb{R}^{n \times k}$, the Nyström approximation to $A$ is
$$\hat{A} = A\Omega \, (\Omega^\top A \Omega)^{-1} \, \Omega^\top A.$$
(Throughout this post, the inverse $(\Omega^\top A \Omega)^{-1}$ is interpreted as the Moore–Penrose pseudoinverse if its argument is not invertible.)
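In code, this definition can be applied directly. The sketch below is a naive implementation meant only for illustration; the pseudoinverse covers a singular core matrix, per the parenthetical above, and a numerically careful implementation (discussed later in the post) would proceed differently.

```python
import numpy as np

def nystrom(A, Omega):
    """Naive Nyström approximation: (A Omega) (Omega^T A Omega)^+ (A Omega)^T."""
    Y = A @ Omega                           # the product A * Omega
    core = Omega.T @ Y                      # small core matrix Omega^T * A * Omega
    return Y @ np.linalg.pinv(core) @ Y.T   # pseudoinverse in case the core is singular
```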
As discussed in a previous post, the general class of Nyström approximations includes many useful specializations depending on how the test matrix $\Omega$ is selected.
The Gram Correspondence
The Gram correspondence is a connection between projection approximation and Nyström approximation:
The Gram correspondence: Let $B \in \mathbb{R}^{m \times n}$ be any rectangular matrix and consider the Gram matrix $A = B^\top B$. Fix a test matrix $\Omega \in \mathbb{R}^{n \times k}$, and define the Nyström approximation $\hat{A}$ to $A$ and the projection approximation $\hat{B}$ to $B$. Then
$$\hat{B}^\top \hat{B} = \hat{A}.$$
That is, the Gram matrix $\hat{B}^\top \hat{B}$ of the projection approximation $\hat{B}$ is the Nyström approximation $\hat{A}$.
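Here is a quick numerical check of the correspondence, reusing the hypothetical nystrom helper sketched above: form a Gram matrix, compute both approximations with the same test matrix, and compare.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 20))       # any rectangular matrix
A = B.T @ B                             # its Gram matrix
Omega = rng.standard_normal((20, 5))    # shared test matrix

Q, _ = np.linalg.qr(B @ Omega)
B_hat = Q @ (Q.T @ B)                   # projection approximation of B
A_hat = nystrom(A, Omega)               # Nyström approximation of A (helper above)

print(np.allclose(B_hat.T @ B_hat, A_hat))  # True, up to roundoff
```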
As we will see, this correspondence has many implications:
- Special cases. The correspondence contains several important facts as special cases. Examples include the equivalence between the randomized SVD and single-pass Nyström approximation and the equivalence of (partial) column-pivoted QR factorization and (partial) pivoted Cholesky decomposition.
- Algorithm design. Since Nyström approximations and projection approximations are closely related, one can often convert algorithms for one type of approximation into algorithms for the other. These conversions can lead one to discover new algorithms. We will provide a historical example with the discovery of randomly pivoted Cholesky.
- Error bounds. The Gram correspondence is immensely helpful in the analysis of algorithms, as it makes error bounds for projection approximations and Nyström approximations easily derivable from each other.
For those interested, we give a short proof of the Gram correspondence below.
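For reference, here is a quick way to see why the correspondence holds, assuming $B\Omega$ has full column rank so that the projector has the explicit formula $\Pi_{B\Omega} = B\Omega\,(\Omega^\top B^\top B\Omega)^{-1}\,\Omega^\top B^\top$ (the general case follows with pseudoinverses):
$$\hat{B}^\top \hat{B} = (\Pi_{B\Omega} B)^\top (\Pi_{B\Omega} B) = B^\top \Pi_{B\Omega} B = B^\top B\Omega\,(\Omega^\top B^\top B\Omega)^{-1}\,\Omega^\top B^\top B = A\Omega\,(\Omega^\top A\Omega)^{-1}\,\Omega^\top A = \hat{A},$$
where we used that $\Pi_{B\Omega}$ is symmetric and idempotent and that $A = B^\top B$.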
Aside: Gram Square Roots
Before we go forward, let me highlight a point of potential confusion and introduce some helpful terminology. When thinking about algorithms for a psd matrix $A$, it will often be helpful to conjure a matrix $B$ for which $A = B^\top B$. Given a psd matrix $A$, there is always a matrix $B$ for which $A = B^\top B$, but this matrix is not unique. Indeed, if $A = B^\top B$, then we have the infinite family of decompositions $A = (QB)^\top (QB)$ generated by every matrix $Q$ with orthonormal columns. This motivates the following definition:
Definition (Gram square root): A matrix $B$ is a Gram square root for $A$ if $A = B^\top B$.
A Gram square root $B$ for $A$ need not be a square root in the traditional sense that $B^2 = A$. Indeed, a Gram square root can be rectangular, so $B^2$ need not even be defined. (The Gram square root should be contrasted with a different type of square root, the matrix square root written $A^{1/2}$. The matrix square root is square and psd, and satisfies $(A^{1/2})^2 = A$. Moreover, it is the unique matrix with these properties.) Using the Gram square root terminology, the Gram correspondence can be written more succinctly:
Gram correspondence (concise): If $B$ is any Gram square root of $A$, then the projection approximation $\hat{B} = \Pi_{B\Omega} B$ is a Gram square root of the Nyström approximation $\hat{A}$.
Examples of the Gram Correspondence
Before going further, let us see a couple of explicit examples of the Gram correspondence.
Randomized SVD and Single-Pass Nyström Approximation
The canonical example of the Gram correspondence is the equivalence of the randomized SVD algorithm and the single-pass Nyström approximation. Let $B$ be a Gram square root for a psd matrix $A$. The randomized SVD of $B$ is given by the following steps, which we saw in the introduction:
- Generate a Gaussian random matrix $\Omega \in \mathbb{R}^{n \times k}$.
- Form the product $B\Omega$.
- Compute a QR decomposition $B\Omega = QR$.
- Evaluate the SVD $Q^\top B = \hat{U}\Sigma V^\top$.
- Output the low-rank approximation $\hat{B} = U\Sigma V^\top$ for $U = Q\hat{U}$.
Now, imagine taking the same Gaussian matrix $\Omega$ from the randomized SVD and using it to compute the Nyström approximation $\hat{A}$. Notice that this Nyström approximation can be computed in a single pass over the matrix $A$. Namely, use a single pass over $A$ to compute $Y = A\Omega$ and use the following formula:
$$\hat{A} = Y \, (\Omega^\top Y)^{-1} \, Y^\top.$$
(In practice, one should be careful about the implementation of a single-pass Nyström approximation; see this paper for details.)
For this reason, we call $\hat{A}$ the single-pass Nyström approximation.
The Gram correspondence tells us that the randomized SVD and single-pass Nyström approximation are closely related, in the sense that $\hat{B}^\top \hat{B} = \hat{A}$. The randomized SVD approximation $\hat{B}$ is a Gram square root of the single-pass Nyström approximation $\hat{A}$.
Column-Pivoted QR Factorization and Pivoted Cholesky Decomposition
A more surprising consequence of the Gram correspondence is the connection between the low-rank approximations produced by a partial column-pivoted QR factorization of a rectangular matrix $B$ and a partial pivoted Cholesky decomposition of $A = B^\top B$. Let’s begin by describing these two approximation methods.
Let’s begin with the pivoted partial Cholesky decomposition. Let $A$ be a psd matrix, and initialize the zero approximation $\hat{A}^{(0)} = 0$ and the residual matrix $A^{(0)} = A$. For $i = 1, 2, \ldots, k$, perform the following steps:
- Choose a column index $s_i \in \{1, \ldots, n\}$. These indices are referred to as pivot indices or, more simply, pivots.
- Update the low-rank approximation: $\hat{A}^{(i)} = \hat{A}^{(i-1)} + \dfrac{A^{(i-1)}(:, s_i) \, A^{(i-1)}(s_i, :)}{A^{(i-1)}(s_i, s_i)}$.
- Update the residual matrix: $A^{(i)} = A^{(i-1)} - \dfrac{A^{(i-1)}(:, s_i) \, A^{(i-1)}(s_i, :)}{A^{(i-1)}(s_i, s_i)}$.
Here, we are using MATLAB notation to index the matrix $A^{(i-1)}$. The output of this procedure is
$$\hat{A} = \hat{A}^{(k)} = A(:, S) \, A(S, S)^{-1} \, A(S, :),$$
where $S = \{s_1, \ldots, s_k\}$. This type of low-rank approximation is known as a column Nyström approximation and is a special case of the general Nyström approximation with test matrix $\Omega = I(:, S)$ equal to a subset of columns of the identity matrix $I$. For an explanation of why this procedure is called a “pivoted partial Cholesky decomposition” and the relation to the usual notion of Cholesky decomposition, see this previous post of mine.
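A minimal numpy sketch of this procedure might look as follows. The function name and interface are mine, and I assume each selected pivot has a strictly positive residual diagonal entry.

```python
import numpy as np

def partial_cholesky(A, pivots):
    """Pivoted partial Cholesky: column Nyström approximation from a pivot sequence."""
    R = np.array(A, dtype=float)          # residual matrix, initialized to A
    A_hat = np.zeros_like(R)              # low-rank approximation, initialized to zero
    for s in pivots:
        step = np.outer(R[:, s], R[s, :]) / R[s, s]   # rank-one piece from pivot column s
        A_hat += step                     # grow the approximation
        R -= step                         # deflate the residual
    return A_hat                          # column Nyström approximation with pivot set S
```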
Given the Gram correspondence, we expect that this pivoted partial Cholesky procedure for computing a Nyström approximation to a psd matrix $A$ should have an analog for computing a projection approximation to a rectangular matrix $B$. This analog is given by the partial column-pivoted QR factorization, which produces a low-rank approximation as follows. Let $B$ be a rectangular matrix, and initialize the zero approximation $\hat{B}^{(0)} = 0$ and the residual matrix $B^{(0)} = B$. For $i = 1, 2, \ldots, k$, perform the following steps:
- Choose a pivot index $s_i \in \{1, \ldots, n\}$.
- Update the low-rank approximation by adding the projection of the residual onto the selected column: $\hat{B}^{(i)} = \hat{B}^{(i-1)} + \Pi_{B^{(i-1)}(:, s_i)} B^{(i-1)}$.
- Update the residual matrix by removing the projection onto the selected column: $B^{(i)} = B^{(i-1)} - \Pi_{B^{(i-1)}(:, s_i)} B^{(i-1)}$.
The output of this procedure is the column projection approximation $\hat{B} = \hat{B}^{(k)} = \Pi_{B(:, S)} B$, which is an example of the general projection approximation with $\Omega = I(:, S)$. (The column projection approximation is often presented in factorized form in one of two ways. First, $\hat{B}$ can be expressed as $\hat{B} = QR$, where $Q$ is a matrix with orthonormal columns and $R$ is an upper trapezoidal matrix, up to a permutation of the rows. This factorized form is easy to compute roughly following steps 1–3 above, which explains why we call that procedure a “QR factorization“. Second, $\hat{B}$ can be factorized as $\hat{B} = B(:, S) \, W$ for a weight matrix $W$. This type of factorization is known as an interpolative decomposition.)
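The analogous numpy sketch for the partial column-pivoted QR procedure is below; again the interface is my own, and I assume each selected residual column is nonzero.

```python
import numpy as np

def partial_pivoted_qr(B, pivots):
    """Partial column-pivoted QR: column projection approximation from a pivot sequence."""
    R = np.array(B, dtype=float)          # residual matrix, initialized to B
    B_hat = np.zeros_like(R)              # low-rank approximation, initialized to zero
    for s in pivots:
        q = R[:, s] / np.linalg.norm(R[:, s])   # unit vector along the selected column
        step = np.outer(q, q @ R)               # projection of the residual onto that column
        B_hat += step                           # add the projection to the approximation
        R -= step                               # remove it from the residual
    return B_hat                          # column projection approximation with pivot set S
```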
The pivoted partial Cholesky and QR factorizations are very traditional ways of computing a low-rank approximation in numerical linear algebra. The Gram correspondence tells us immediately that these two approaches are closely related:
Let $B$ be a Gram square root of $A$. Compute a column projection approximation $\hat{B}$ to $B$ using a partial column-pivoted QR factorization with pivots $s_1, \ldots, s_k$, and compute a column Nyström approximation $\hat{A}$ to $A$ using a partial pivoted Cholesky decomposition with the same pivots $s_1, \ldots, s_k$. Then $\hat{B}^\top \hat{B} = \hat{A}$.
I find this remarkable: the equivalence of the randomized SVD and single-pass randomized Nyström approximation and the equivalence of (partial pivoted) Cholesky and QR factorizations are both consequences of the same general principle, the Gram correspondence.
Using the Gram Correspondence to Discover Algorithms
The Gram correspondence is more than just an interesting way of connecting existing types of low-rank approximations: It can be used to discover new algorithms. We can illustrate this with an example, the randomly pivoted Cholesky algorithm.
Background: The Randomly Pivoted QR Algorithm
In 2006, Deshpande, Rademacher, Vempala, and Wang discovered an algorithm that they called adaptive sampling for computing a projection approximation to a rectangular matrix $B$. Beginning from the trivial initial approximation $\hat{B}^{(0)} = 0$ and the residual matrix $B^{(0)} = B$, the algorithm proceeds as follows for $i = 1, 2, \ldots, k$:
- Choose a column index $s_i$ randomly according to the squared column norm distribution: $$\mathbb{P}\{s_i = j\} = \frac{\|B^{(i-1)}(:, j)\|^2}{\|B^{(i-1)}\|_F^2} \quad \text{for } j = 1, \ldots, n.$$
- Update the low-rank approximation by adding the projection of the residual onto the selected column: $\hat{B}^{(i)} = \hat{B}^{(i-1)} + \Pi_{B^{(i-1)}(:, s_i)} B^{(i-1)}$.
- Update the residual matrix by removing the projection onto the selected column: $B^{(i)} = B^{(i-1)} - \Pi_{B^{(i-1)}(:, s_i)} B^{(i-1)}$.
Given our discussion above, we recognize this as an example of partial column-pivoted QR factorization with a particular randomized procedure for selecting the pivots $s_i$. (Interestingly, Deshpande and coauthors did not make the connection between their approach and pivoted QR algorithms in their work.) Therefore, it can also be convenient to call this algorithm randomly pivoted QR.
Randomly pivoted QR is a nice algorithm for rectangular low-rank approximation. Each step requires a full pass over the matrix $B$ and $O(mn)$ operations, so the full procedure requires $O(mnk)$ operations. This makes the cost of the algorithm similar to other methods for rectangular low-rank approximation such as the randomized SVD, but it has the advantage that it computes a column projection approximation. (There are other algorithms for computing a high-quality column projection approximation in $O(mnk)$ operations, such as sketchy pivoting. Variants of these approaches are compared in this recent paper.) However, randomly pivoted QR is not a particularly effective algorithm for computing a low-rank approximation to a psd matrix, since, as we shall see, there are faster procedures available.
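In code, randomly pivoted QR is just the partial pivoted QR sketch above with the pivot drawn from the squared column norms of the residual, as in this sketch (same caveats as before):

```python
import numpy as np

def randomly_pivoted_qr(B, k, rng=None):
    """Randomly pivoted QR (adaptive sampling): k steps with squared-column-norm pivoting."""
    rng = np.random.default_rng() if rng is None else rng
    R = np.array(B, dtype=float)                  # residual matrix
    B_hat = np.zeros_like(R)                      # low-rank approximation
    for _ in range(k):
        probs = np.sum(R**2, axis=0)              # squared column norms of the residual
        s = rng.choice(R.shape[1], p=probs / probs.sum())
        q = R[:, s] / np.linalg.norm(R[:, s])     # unit vector along the sampled column
        step = np.outer(q, q @ R)                 # project the residual onto that column
        B_hat += step
        R -= step
    return B_hat
```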
Following the Gram correspondence, we expect there should be an analog of the randomly pivoted QR algorithm for computing a low-rank approximation of a psd matrix. That algorithm, which we call randomly pivoted Cholesky, is derived from randomly pivoted QR in the following optional section:
Randomly Pivoted Cholesky
That derivation was a bit complicated, so let’s summarize. We began with the randomly pivoted QR algorithm for computing a projection approximation to a rectangular matrix $B$, and we used it to derive a randomly pivoted Cholesky algorithm for computing a column Nyström approximation to a psd matrix $A$. Removing the cruft of the derivation, this algorithm is very simple to state. Beginning from $\hat{A}^{(0)} = 0$ and the residual matrix $A^{(0)} = A$, for $i = 1, 2, \ldots, k$:
- Choose a column index $s_i$ randomly according to the diagonal distribution: $$\mathbb{P}\{s_i = j\} = \frac{A^{(i-1)}(j, j)}{\operatorname{tr} A^{(i-1)}} \quad \text{for } j = 1, \ldots, n.$$
- Update the low-rank approximation: $\hat{A}^{(i)} = \hat{A}^{(i-1)} + \dfrac{A^{(i-1)}(:, s_i) \, A^{(i-1)}(s_i, :)}{A^{(i-1)}(s_i, s_i)}$.
- Update the residual matrix: $A^{(i)} = A^{(i-1)} - \dfrac{A^{(i-1)}(:, s_i) \, A^{(i-1)}(s_i, :)}{A^{(i-1)}(s_i, s_i)}$.
I find this to be remarkably cool. We started with a neat algorithm (randomly pivoted QR) for approximating rectangular matrices, and we used the Gram correspondence to derive a different randomly pivoted Cholesky algorithm for psd matrix approximation!
And randomly pivoted Cholesky allows us to pull a cool trick that we couldn’t with randomly pivoted QR. Observe that the randomly pivoted Cholesky algorithm only ever interacts with the residual matrix $A^{(i-1)}$ through the selected pivot columns $A^{(i-1)}(:, s_i)$ and through the diagonal entries $A^{(i-1)}(j, j)$. Therefore, we can derive an optimized version of the randomly pivoted Cholesky algorithm that only reads $(k+1)n$ entries of the matrix $A$ ($k$ columns plus the diagonal) and requires only $O(nk^2)$ operations! See our paper for details.
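Here is one way this optimized version might be sketched in numpy. The interface (the diagonal as a vector, plus a function that returns a requested column of $A$) is my own choice for illustration; see the paper for a careful implementation.

```python
import numpy as np

def rp_cholesky(diag_A, get_column, k, rng=None):
    """Sketch of optimized randomly pivoted Cholesky.

    diag_A:      length-n array holding the diagonal of the psd matrix A.
    get_column:  get_column(s) returns column s of A as a length-n array.
    Returns F with F @ F.T equal to the computed column Nyström approximation.
    Only k columns plus the diagonal of A are read; roughly O(n k^2) operations.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(diag_A)
    F = np.zeros((n, k))
    d = np.array(diag_A, dtype=float)             # diagonal of the residual matrix
    for i in range(k):
        s = rng.choice(n, p=d / d.sum())          # sample pivot from the residual diagonal
        g = get_column(s) - F[:, :i] @ F[s, :i]   # pivot column of the residual
        F[:, i] = g / np.sqrt(g[s])               # Cholesky-style rank-one step
        d -= F[:, i] ** 2                         # update the residual diagonal
        d = np.clip(d, 0.0, None)                 # guard against negative roundoff
    return F
```

Compare with the randomly pivoted QR sketch above: the only access to the matrix is through $k$ columns and the diagonal, which is exactly where the speedup comes from.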
So we started with the randomly pivoted QR algorithm, which requires $O(mnk)$ operations, and we used it to derive the randomly pivoted Cholesky algorithm, which runs in $O(nk^2)$ operations. Let’s make this concrete with some specific numbers. Setting $m = n = 10^6$ and $k = 100$, randomly pivoted QR requires roughly $10^{14}$ (100 trillion) operations and randomly pivoted Cholesky requires roughly $10^{10}$ (10 billion) operations, a factor of 10,000 smaller operation count!
Randomly pivoted Cholesky has an interesting history. It seems to have first appeared in print in the work of Musco and Woodruff (2017), who used the Gram correspondence to derive the algorithm from randomly pivoted QR. It is remarkable that it took a full 11 years after Deshpande and coauthors’ original work on randomly pivoted QR for randomly pivoted Cholesky to be discovered. Even after Musco and Woodruff’s paper, the algorithm appears to have largely been overlooked for computation in practice, and I am unaware of any paper documenting computational experiments with randomly pivoted Cholesky before our paper was initially released in 2022. (A partial exception to this statement is the work of Poulson (2020), who used randomly pivoted Cholesky to sample from determinantal point processes. Poulson’s setting is quite different from ours: his input is always a rank-$k$ orthogonal projection matrix, he runs for exactly $k$ steps, and he uses the pivot set $S$ as output.) Our paper reexamined the randomly pivoted Cholesky procedure, providing computational experiments comparing it to alternatives, using it for scientific machine learning applications, and proving new error bounds.
Other Examples of Algorithm Design by the Gram Correspondence
The Gram correspondence gives a powerful tool for inventing new algorithms or showing equivalence between existing algorithms. There are many additional examples, such as block (and rejection sampling-accelerated) versions of randomly pivoted QR/Cholesky, Nyström versions of randomized block Krylov iteration, and column projection/Nyström approximations generated by determinantal point process sampling. Indeed, if you invent a new low-rank approximation algorithm, it is always worth checking: Using the Gram correspondence, can you get another algorithm for free?
Transference of Error Bounds
The Gram correspondence also allows us to transfer error bounds between projection approximations and Nyström approximations. Let’s begin with the simplest manifestation of this principle:
Transference of Error Bounds I: Let $B$ be a Gram square root of $A$, and let $\hat{B}$ and $\hat{A}$ be the corresponding projection and Nyström approximations generated using the same test matrix $\Omega$. Then
$$\|B - \hat{B}\|_F^2 = \operatorname{tr}(A - \hat{A}).$$
This shows that the squared Frobenius norm error of a projection approximation and the trace error of a Nyström approximation are the same. (Note that, for any Nyström approximation $\hat{A}$ to $A$, the residual $A - \hat{A}$ is psd. Therefore, the trace error $\operatorname{tr}(A - \hat{A})$ is nonnegative and satisfies $\operatorname{tr}(A - \hat{A}) = 0$ if and only if $\hat{A} = A$; these statements justify why $\operatorname{tr}(A - \hat{A})$ is an appropriate expression for measuring the error of the approximation $\hat{A}$.) Using this result, we can immediately transfer error bounds for projection approximations to Nyström approximations and vice versa. For instance, in this previous post, we proved the following error bound for the rank-$k$ randomized SVD (with a Gaussian test matrix $\Omega$):
$$\mathbb{E}\, \|B - \hat{B}\|_F^2 \le \left(1 + \frac{r}{k - r - 1}\right) \|B - \llbracket B \rrbracket_r\|_F^2 \quad \text{for every } r \le k - 2.$$
Here, $\llbracket B \rrbracket_r$ denotes the best rank-$r$ approximation to $B$. This result shows that the error of the rank-$k$ randomized SVD is comparable to the error of the best rank-$r$ approximation for any $r \le k - 2$. Using the transference principle, we immediately obtain a corresponding bound for rank-$k$ single-pass Nyström approximation (with a Gaussian test matrix $\Omega$):
$$\mathbb{E}\, \operatorname{tr}(A - \hat{A}) \le \left(1 + \frac{r}{k - r - 1}\right) \operatorname{tr}(A - \llbracket A \rrbracket_r) \quad \text{for every } r \le k - 2.$$
(Note that in order to transfer the bound from the randomized SVD to single-pass Nyström, we also needed to use the identity $\|B - \llbracket B \rrbracket_r\|_F^2 = \operatorname{tr}(A - \llbracket A \rrbracket_r)$. This identity, too, follows by the transference principle, since $\llbracket B \rrbracket_r$ and $\llbracket A \rrbracket_r$ are, themselves, projection and Nyström approximations corresponding to choosing $\Omega$ to be the $r$ dominant right singular vectors of $B$!)
With no additional work, we have converted an error bound for the randomized SVD to an error bound for single-pass randomized Nyström approximation!
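As a quick numerical sanity check of the transference identity (reusing the naive nystrom helper sketched earlier):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((200, 100))
A = B.T @ B
Omega = rng.standard_normal((100, 10))

Q, _ = np.linalg.qr(B @ Omega)
B_hat = Q @ (Q.T @ B)                     # projection approximation of B
A_hat = nystrom(A, Omega)                 # Nyström approximation of A

frob_sq = np.linalg.norm(B - B_hat, "fro") ** 2
trace_err = np.trace(A - A_hat)
print(np.isclose(frob_sq, trace_err))     # True, up to roundoff
```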
For those interested, one can extend the transference of error bounds to general unitarily invariant norms. See the following optional section for details:
Conclusion
This has been a long post about a simple idea. Many of the low-rank approximation algorithms in the literature output a special type of approximation, either a projection approximation or a Nyström approximation. As this post has shown, these two types of approximation are equivalent, with algorithms and error bounds for one type of approximation transferring immediately to the other format. This equivalence is a powerful tool for the algorithm designer, allowing us to discover new algorithms from old ones, like randomly pivoted Cholesky from randomly pivoted QR.