Low-Rank Approximation Toolbox: The Gram Correspondence

I am delighted to share that my paper Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations, joint with Yifan Chen, Joel Tropp, and Robert Webber, has been published online in Communications on Pure and Applied Mathematics. To celebrate this occasion, I want to share one of my favorite tricks in the design of low-rank approximation algorithms, which I will call the Gram correspondence.

Projection Approximation and Nyström Approximation

When we construct a low-rank approximation to a matrix, the type of approximation we use is typically dictated by the size and properties of the matrix. For a rectangular matrix B \in \real^{M\times N}, one of the standard techniques is the randomized SVD algorithm:

  1. Generate a Gaussian random matrix \Omega \in \real^{N\times k}.
  2. Form the product Y = B\Omega.
  3. Compute an (economy-size) QR decomposition Y = QR.
  4. Evaluate the SVD Q^\top B = \tilde{U} \Sigma V^\top.
  5. Output the low-rank approximation \hat{B} = U\Sigma V^\top for U = Q\tilde{U}.
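
For concreteness, here is a minimal NumPy sketch of the five steps above; the function name randomized_svd and the interface are my own choices for illustration, not a library routine.

    import numpy as np

    def randomized_svd(B, k):
        """Minimal sketch of steps 1-5; returns U, sigma, Vt with B_hat = U @ diag(sigma) @ Vt."""
        N = B.shape[1]
        Omega = np.random.default_rng().standard_normal((N, k))  # 1. Gaussian test matrix
        Y = B @ Omega                                            # 2. form the product Y = B Omega
        Q, _ = np.linalg.qr(Y)                                   # 3. economy-size QR decomposition
        U_tilde, sigma, Vt = np.linalg.svd(Q.T @ B, full_matrices=False)  # 4. SVD of Q^T B
        return Q @ U_tilde, sigma, Vt                            # 5. U = Q U_tilde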

Succinctly, the output of the randomized SVD is given by the formula \hat{B} = \Pi_{B\Omega} B, where \Pi_F denotes the orthogonal projector onto the column span of a matrix F. This motivates the general definition:

Definition (projection approximation): Given a test matrix \Omega \in \real^{N\times k}, the projection approximation to the matrix B is \hat{B} = \Pi_{B\Omega} B.

The class of projection approximations is much richer than merely the approximation constructed by the standard randomized SVD. Indeed, low-rank approximations computed by randomized subspace iteration, randomized block Krylov iteration, column-pivoted QR decompositions, etc. all fit under the umbrella of projection approximations.

Many matrices in applications have additional structure such as symmetry or sparsity, and it can be valuable to make use of low-rank approximations that take advantage of those properties. An especially important type of structure is positive semidefiniteness. For our purposes, a positive semidefinite matrix A \in \real^{N\times N} is one that is symmetric and possesses nonnegative eigenvalues, and we will abbreviate “positive semidefinite” as “psd”. Psd matrices arise in applications as covariance matrices, kernel matrices, and as discretizations of certain differential and integral operators. Further, any rectangular matrix B \in \real^{M\times N} gives rise to its psd Gram matrix A = B^\top B; we will have much more to say about Gram matrices below.

To compute a low-rank approximation to a psd matrix, the preferred format is usually the Nyström approximation:

Definition (Nyström approximation): Given a test matrix \Omega \in \real^{N\times k}, the Nyström approximation to the psd matrix A \in \real^{N\times N} is1Throughout this post, the inverse {}^{-1} is interpreted as the Moore–Penrose pseudoinverse if its argument is not invertible.

    \[\hat{A} \coloneqq (A\Omega) (\Omega^\top A \Omega)^{-1}(A\Omega)^\top.\]

As discussed in a previous post, the general class of Nyström approximations includes many useful specializations depending on how the matrix \Omega is selected.
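
In the same spirit as the randomized SVD sketch above, here is a minimal NumPy sketch that forms a Nyström approximation from a given test matrix \Omega. The function name is mine, and for simplicity it uses a pseudoinverse, ignoring the numerical care a practical implementation would take.

    import numpy as np

    def nystrom(A, Omega):
        """Minimal sketch: A_hat = (A Omega) (Omega^T A Omega)^+ (A Omega)^T."""
        Y = A @ Omega                                 # the only product involving A
        return Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T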

The Gram Correspondence

The Gram correspondence is a connection between projection approximation and Nyström approximation:

The Gram correspondence: Let B\in\real^{M\times N} be any rectangular matrix and consider the Gram matrix A = B^\top B. Fix a test matrix \Omega, and define the Nyström approximation \hat{A} = A\langle \Omega\rangle to A and the projection approximation \hat{B} = \Pi_{B\Omega}B to B. Then

    \[\hat{A} = \smash{\hat{B}}^\top \hat{B}.\]

That is, the Gram matrix \hat{B}^\top\hat{B} of the projection approximation \hat{B}\approx B is the Nyström approximation \hat{A} \approx A.

As we will see, this correspondence has many implications:

  1. Special cases. The correspondence contains several important facts as special cases. Examples include the equivalence between the randomized SVD and single-pass Nyström approximation and the equivalence of (partial) column-pivoted QR factorization and (partial) pivoted Cholesky decomposition.
  2. Algorithm design. Since projection approximations and Nyström approximations are closely related, one can often convert an algorithm for one type of approximation into an algorithm for the other. These conversions can lead one to discover new algorithms. We will see a historical example of this with the discovery of randomly pivoted Cholesky.
  3. Error bounds. The Gram correspondence is immensely helpful in the analysis of algorithms, as it makes error bounds for projection approximations and Nyström approximations easily derivable from each other.

For those interested, we give a short proof of the Gram correspondence below.

Proof of Gram correspondence
The Gram correspondence is straightforward to derive, so let’s do so before moving on. Because \Pi_{B\Omega} is a projection matrix, it satisfies \Pi_{B\Omega}^2 = \Pi_{B\Omega}. Thus,

    \[\hat{B}^\top \hat{B} = B^\top \Pi_{B\Omega}^2B = B^\top \Pi_{B\Omega}B.\]

The projector \Pi_F has the formula F (F^\top F)^{-1} F^\top. Using this formula, we obtain

    \[\hat{B}^\top \hat{B} = B^\top B\Omega (\Omega^\top B^\top B \Omega)^{-1} \Omega^\top B^\top B.\]

This is precisely the formula for the Nyström approximation (B^\top B)\langle \Omega\rangle, confirming the Gram correspondence.
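
If you would like to see the correspondence numerically, here is a small self-contained check (all sizes and names are choices of mine): it builds B, its Gram matrix A = B^\top B, and a shared Gaussian \Omega, forms both approximations directly from their formulas, and verifies \hat{B}^\top\hat{B} = \hat{A} up to roundoff.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((50, 30))              # arbitrary rectangular matrix
    A = B.T @ B                                    # its Gram matrix
    Omega = rng.standard_normal((30, 5))           # shared test matrix

    Q, _ = np.linalg.qr(B @ Omega)
    B_hat = Q @ (Q.T @ B)                          # projection approximation Pi_{B Omega} B
    Y = A @ Omega
    A_hat = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T  # Nystrom approximation

    print(np.allclose(B_hat.T @ B_hat, A_hat))     # True, up to roundoff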

Aside: Gram Square Roots

Before we go forward, let me highlight a point of potential confusion and introduce some helpful terminology. When thinking about algorithms for a psd matrix A, it will often be helpful to conjure a matrix B for which A = B^\top B. Given a psd matrix A, there is always a matrix B for which A = B^\top B, but this B is not unique. Indeed, if A = B^\top B, then we have the infinite family of decompositions A = (UB)^\top (UB) generated by every matrix U with orthonormal columns. This motivates the following definition:

Definition (Gram square root): A matrix B is a Gram square root for A if A = B^\top B.
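
For a concrete (if tiny) example, take A = 2I, where I is the 2 \times 2 identity matrix. Then each of

    \[B_1 = \begin{bmatrix} \sqrt{2} & 0 \\ 0 & \sqrt{2} \end{bmatrix}, \quad B_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad B_3 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & \sqrt{2} \end{bmatrix}\]

is a Gram square root of A, since B_i^\top B_i = 2I in every case.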

A Gram square root B for A need not be a square root in the traditional sense that B^2 = A. Indeed, a Gram square root B can be rectangular, so B^2 need not even be defined.2The Gram square root should be contrasted with a different type of square root, the matrix square root written A^{1/2}. The matrix square root A^{1/2} is square and psd, and satisfies A = (A^{1/2})^2. Moreover, it is the unique matrix with these properties. Using the Gram square root terminology, the Gram correspondence can be written more succinctly:

Gram correspondence (concise): If B is any Gram square root of A, then the projection approximation \hat{B} = \Pi_{B\Omega}B is a Gram square root of the Nyström approximation \hat{A} = A\langle \Omega \rangle.

Examples of the Gram Correspondence

Before going further, let us see a couple of explicit examples of the Gram correspondence.

Randomized SVD and Single-Pass Nyström Approximation

The canonical example of the Gram correspondence is the equivalence of the randomized SVD algorithm and the single-pass Nyström approximation. Let B be a Gram square root for a psd matrix A. The randomized SVD of B is given by the following steps, which we saw in the introduction:

  1. Generate a Gaussian random matrix \Omega \in \real^{N\times k}.
  2. Form the product Y = B\Omega.
  3. Compute a QR decomposition Y = QR.
  4. Evaluate the SVD Q^\top B = \tilde{U} \Sigma V^\top.
  5. Output the low-rank approximation \hat{B} = U\Sigma V^\top for U = Q\tilde{U}.

Now, imagine taking the same Gaussian matrix \Omega from the randomized SVD and using it to compute the Nyström approximation \hat{A} = A\langle \Omega\rangle. Notice that this Nyström approximation can be computed in a single pass over the matrix A. Namely, use a single pass over A to compute Y=A\Omega and use the following formula:3In practice, one should be careful about implementation for a single-pass Nyström approximation; see this paper for details.

    \[\hat{A} = Y (\Omega^\top Y)^{-1} Y^\top.\]

For this reason, we call \hat{A} the single-pass Nyström approximation.

The Gram correspondence tells us that the randomized SVD and single-pass Nyström approximation are closely related, in the sense that \hat{A} = \hat{B}^\top \hat{B}. The randomized SVD approximation \hat{B} is a Gram square root of the single-pass Nyström approximation \hat{A}.

Column-Pivoted QR Factorization and Pivoted Cholesky Decomposition

A more surprising consequence of the Gram correspondence is the connection between the low-rank approximations produced by a partial column-pivoted QR factorization of a rectangular matrix B and by a partial pivoted Cholesky decomposition of the Gram matrix A = B^\top B. Let’s begin by describing these two approximation methods.

First up is the pivoted partial Cholesky decomposition. Let A \in \real^{N\times N} be a psd matrix, and initialize the zero approximation \hat{A} \gets 0. For i=1,2,\ldots,k, perform the following steps:

  1. Choose a column index 1 \le s_i \le N. These indices are referred to as pivot indices or, more simply, pivots.
  2. Update the low-rank approximation

        \[\hat{A} \gets \hat{A} + \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

  3. Update the matrix

        \[A \gets A - \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

Here, we are using MATLAB notation to index the matrix A. The output of this procedure is

    \[\hat{A} = A(:,S) A(S,S)^{-1} A(S,:),\]

where S = \{s_1,\ldots,s_k\}. This type of low-rank approximation is known as a column Nyström approximation and is a special case of the general Nyström approximation with test matrix \Omega = I(:,S) equal to a subset of columns of the identity matrix I. For an explanation of why this procedure is called a “pivoted partial Cholesky decomposition” and the relation to the usual notion of Cholesky decomposition, see this previous post of mine.
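
Here is a minimal NumPy sketch of these three steps, with the pivot-selection rule left as a user-supplied function; the name partial_cholesky and this interface are mine, chosen only for illustration.

    import numpy as np

    def partial_cholesky(A, k, choose_pivot):
        """Minimal sketch of k steps of pivoted partial Cholesky.
        choose_pivot(residual) returns the next pivot index s_i."""
        residual = A.astype(float).copy()      # working copy of A, updated in step 3
        A_hat = np.zeros_like(residual)
        pivots = []
        for _ in range(k):
            s = choose_pivot(residual)
            pivots.append(s)
            update = np.outer(residual[:, s], residual[s, :]) / residual[s, s]
            A_hat += update                    # step 2: grow the approximation
            residual -= update                 # step 3: shrink the residual
        return A_hat, pivots

For instance, passing choose_pivot = lambda R: int(np.argmax(np.diag(R))) gives a greedy, diagonally pivoted partial Cholesky.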

Given the Gram correspondence, we expect that this pivoted partial Cholesky procedure for computing a Nyström approximation to a psd matrix A should have an analog for computing a projection approximation to a rectangular matrix B. This analog is given by the partial column-pivoted QR factorization, which produces a low-rank approximation as follows. Let B \in \real^{M\times N} be a rectangular matrix, and initialize the zero approximation \hat{B} \gets 0. For i=1,2,\ldots,k, perform the following steps:

  1. Choose a pivot index 1 \le s_i \le N.
  2. Update the low-rank approximation by adding the projection of B onto the selected column: \hat{B} \gets \hat{B} + B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.
  3. Update the matrix by removing the projection of B onto the selected column: B \gets B - B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.

The output of this procedure is the column projection approximation \Pi_{B(:,S)} B, which is an example of the general projection approximation with \Omega = I(:,S).4The column projection approximation is often presented in factorized form in one of two ways. First, \hat{B} can be expressed as \hat{B} = QR, where Q is a matrix with orthonormal columns and R is an upper trapezoidal matrix, up to a permutation of the rows. This factorized form is easy to compute by roughly following steps 1–3 above, which explains why we call that procedure a “QR factorization”. Second, \hat{B} can be factorized as \hat{B} = B(:,S) W for a weight matrix W \in \real^{k\times N}. This type of factorization is known as an interpolative decomposition.
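
Mirroring the Cholesky sketch above, here is a minimal NumPy sketch of this procedure, again with the pivot rule supplied by the caller (the name and interface are mine).

    import numpy as np

    def partial_pivoted_qr(B, k, choose_pivot):
        """Minimal sketch of k steps of partial column-pivoted QR (in projection form).
        choose_pivot(residual) returns the next pivot index s_i."""
        residual = B.astype(float).copy()      # working copy of B, updated in step 3
        B_hat = np.zeros_like(residual)
        pivots = []
        for _ in range(k):
            s = choose_pivot(residual)
            pivots.append(s)
            b = residual[:, s].copy()
            update = np.outer(b, b @ residual) / (b @ b)   # projection onto the pivot column
            B_hat += update                    # step 2
            residual -= update                 # step 3
        return B_hat, pivots

Running partial_cholesky on A = B^\top B and partial_pivoted_qr on B with the same pivot sequence, one can check numerically that \hat{B}^\top \hat{B} = \hat{A}, exactly as stated next.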

The pivoted partial Cholesky and QR factorizations are very traditional ways of computing a low-rank approximation in numerical linear algebra. The Gram correspondence tells us immediately that these two approaches are closely related:

Let B be a Gram square root of A. Compute a column projection approximation \hat{B} to B using a partially column-pivoted QR factorization with pivots s_1,\ldots,s_k, and compute a column Nyström approximation \hat{A} to A using a partial pivoted Cholesky decomposition with the same set of pivots s_1,\ldots,s_k. Then \hat{B}^\top \hat{B} = \hat{A}.

I find this remarkable: the equivalence of the randomized SVD and single-pass randomized Nyström approximation and the equivalence of (partial pivoted) Cholesky and QR factorizations are both consequences of the same general principle, the Gram correspondence.

Using the Gram Correspondence to Discover Algorithms

The Gram correspondence is more than just an interesting way of connecting existing types of low-rank approximations: It can be used to discover new algorithms. We can illustrate this with an example, the randomly pivoted Cholesky algorithm.

Background: The Randomly Pivoted QR Algorithm

In 2006, Deshpande, Rademacher, Vempala, and Wang discovered an algorithm that they called adaptive sampling for computing a projection approximation to a rectangular matrix B \in \real^{M\times N}. Beginning from the trivial initial approximation \hat{B} \gets 0, the algorithm proceeds as follows for i=1,2,\ldots,k:

  1. Choose a column index 1 \le s_i \le N randomly according to the squared column norm distribution:

        \[\prob\{s_i = j\} = \frac{\norm{B(:,j)}^2}{\norm{B}_{\rm F}^2} \quad \text{for } j=1,2,\ldots,N.\]

  2. Update the low-rank approximation by adding the projection of B onto the selected column: \hat{B} \gets \hat{B} + B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.
  3. Update the matrix by removing the projection of B onto the selected column: B \gets B - B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2.

Given our discussion above, we recognize this as an example of partial column-pivoted QR factorization with a particular randomized procedure for selecting the pivots s_i.5Interestingly, Deshpande and coauthors did not make the connection between their approach and pivoted QR algorithms in their work. Therefore, it can also be convenient to call this algorithm randomly pivoted QR.

Randomly pivoted QR is a nice algorithm for rectangular low-rank approximation. Each step requires a full pass over the matrix and \order(MN) operations, so the full procedure requires \order(MNk) operations. This makes the cost of the algorithm similar to other methods for rectangular low-rank approximation such as the randomized SVD, but it has the advantage that it computes a column projection approximation.6There are other algorithms for computing a high-quality column projection approximation in \order(MNk) operations, such as sketchy pivoting. Variants of these approaches are compared in this recent paper. However, randomly pivoted QR is not a particularly effective algorithm for computing a low-rank approximation to a psd matrix, since—as we shall see—there are faster procedures available.
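
In code, randomly pivoted QR is just the partial_pivoted_qr sketch from earlier paired with a squared-column-norm sampling rule; a minimal sketch (names mine):

    import numpy as np

    def squared_norm_pivot(residual, rng=np.random.default_rng()):
        """Sample a pivot with probability proportional to the squared column norms of the residual."""
        p = np.sum(residual**2, axis=0)
        return int(rng.choice(residual.shape[1], p=p / p.sum()))

    # Randomly pivoted QR = partial column-pivoted QR with this random pivot rule:
    # B_hat, pivots = partial_pivoted_qr(B, k, squared_norm_pivot)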

Following the Gram correspondence, we expect there should be an analog of the randomly pivoted QR algorithm for computing a low-rank approximation of a psd matrix. That algorithm, which we call randomly pivoted Cholesky, is derived from randomly pivoted QR in the following optional section:

Derivation of Randomly Pivoted Cholesky Algorithm
Let A \in \real^{N\times N} be a psd matrix. For conceptual purposes, let us also consider a Gram square root B\in \real^{M\times N} of A; this matrix B will only be a conceptual device that we will use to help us derive the appropriate algorithm, not something that will be required to run the algorithm itself. Let us now walk through the steps of the randomly pivoted QR algorithm on B, and see how they lead to a randomly pivoted Cholesky algorithm for computing a low-rank approximation to A.

Step 1 of randomly pivoted QR. Randomly pivoted QR begins by drawing a random pivot s_i according to the rule

    \[\prob\{s_i = j\} = \frac{\norm{B(:,j)}^2}{\norm{B}_{\rm F}^2} \quad \text{for } j=1,2,\ldots,N.\]

Since A = B^\top B, we may compute

    \[\norm{B(:,j)}^2 = B(:,j)^\top B(:,j) = (B^\top B)(j,j) = A(j,j).\]

The squared column norms of B are the diagonal entries of the matrix A. Similarly, \norm{B}_{\rm F}^2 = \tr(A). Therefore, we can write the probability distribution for the random pivot s_i using only the matrix A as

    \[\prob\{s_i = j\} = \frac{A(j,j)}{\tr A} \quad \text{for } j=1,2,\ldots,N.\]

Step 2 of randomly pivoted QR. The randomly pivoted QR update rule is \hat{B} \gets \hat{B} + B(:,s_i) (B(:,s_i)^\top B) / \norm{B(:,s_i)}^2. Therefore, the update rule for \hat{A} = \hat{B}^\top\hat{B} is 

    \[\hat{A} \gets \left(\hat{B} + \frac{B(:,s_i) (B(:,s_i)^\top B)}{\norm{B(:,s_i)}^2}\right)^\top \left(\hat{B} + \frac{B(:,s_i) (B(:,s_i)^\top B)}{\norm{B(:,s_i)}^2}\right).\]

After a short derivation,7Using the relations \hat{A} = \hat{B}^\top \hat{B}, A = B^\top B, and A(s_i,s_i) = \norm{B(:,s_i)}^2 = B(:,s_i)^\top B(:,s_i), this simplifies to \hat{A} \gets \hat{A} + \frac{B^\top B(:,s_i) B(:,s_i)^\top \hat{B}}{A(s_i,s_i)} + \frac{\hat{B}^\top B(:,s_i) B(:,s_i)^\top B}{A(s_i,s_i)} + \frac{B^\top B(:,s_i) B(:,s_i)^\top B}{A(s_i,s_i)}. The matrix \hat{B} is a projection approximation and B is the residual of that approximation. Therefore, the columns of these matrices are orthogonal to each other, \hat{B}^\top B = 0, leading the second and third terms to vanish. Finally, using the relation A = B^\top B again, we obtain the update rule \hat{A} \gets \hat{A} +\frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}., this simplifies to the update rule

    \[\hat{A} \gets \hat{A} +\frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

Two remarkable things have happened. First, we have obtained an update rule for \hat{A} that depends only on the psd matrices \hat{A} and A, and all occurrences of the Gram square roots \hat{B} and B have vanished. Second, this update rule is exactly the update rule for a partial Cholesky decomposition.

Step 3 of randomly pivoted QR. Using a similar derivation to step 2, we obtain an update rule for A:

    \[A \gets A -\frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

This completes the derivation.

Randomly Pivoted Cholesky

That derivation was a bit complicated, so let’s summarize. We began with the randomly pivoted QR algorithm for computing a projection approximation to a rectangular matrix B, and we used it to derive a randomly pivoted Cholesky algorithm for computing a column Nyström approximation to a psd matrix A. Removing the cruft of the derivation, this algorithm is very simple to state. Beginning from the initial approximation \hat{A} \gets 0, perform the following steps for i=1,2,\ldots,k:

  1. Choose a column index 1 \le s_i \le N randomly according to the diagonal distribution:

        \[\prob\{s_i = j\} = \frac{A(j,j)}{\tr(A)} \quad \text{for } j=1,2,\ldots,N.\]

  2. Update the low-rank approximation

        \[\hat{A} \gets \hat{A} + \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]

  3. Update the matrix

        \[A \gets A - \frac{A(:,s_i)A(s_i,:)}{A(s_i,s_i)}.\]
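
In code, a straightforward (unoptimized) version is the partial_cholesky sketch from earlier paired with a diagonal sampling rule; a minimal sketch (names mine):

    import numpy as np

    def diagonal_pivot(residual, rng=np.random.default_rng()):
        """Sample a pivot with probability proportional to the residual's diagonal entries."""
        d = np.clip(np.diag(residual), 0.0, None)   # clip tiny negative entries from roundoff
        return int(rng.choice(residual.shape[0], p=d / d.sum()))

    # Randomly pivoted Cholesky = pivoted partial Cholesky with this random pivot rule:
    # A_hat, pivots = partial_cholesky(A, k, diagonal_pivot)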

I find this to be remarkably cool. We started with a neat algorithm (randomly pivoted QR) for approximating rectangular matrices, and we used the Gram correspondence to derive a new algorithm, randomly pivoted Cholesky, for psd matrix approximation!

And randomly pivoted Cholesky allows us to pull a cool trick that we couldn’t with randomly pivoted QR. Observe that the randomly pivoted Cholesky algorithm only ever interacts with the residual matrix A through the selected pivot columns A(:,s_i) and through the diagonal entries A(j,j). Therefore, we can derive an optimized version of the randomly pivoted Cholesky algorithm that only reads (k+1)N entries of the matrix A (k columns plus the diagonal) and requires only \order(k^2N) operations! See our paper for details.
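
Here is a minimal sketch of such an optimized implementation, which touches A only through its diagonal and the k selected columns and returns a factor F with \hat{A} = FF^\top. The interface (the functions diag_A and column_A) is my own choice for illustration, and the sketch omits the numerical safeguards a careful implementation would include; see the paper for the real thing.

    import numpy as np

    def rp_cholesky(diag_A, column_A, N, k, rng=np.random.default_rng()):
        """Randomly pivoted Cholesky sketch: diag_A() returns the length-N diagonal of A,
        column_A(s) returns column s of A. Output: F with A_hat = F @ F.T, plus the pivots."""
        F = np.zeros((N, k))
        d = np.array(diag_A(), dtype=float)        # diagonal of the current residual matrix
        pivots = []
        for i in range(k):
            s = int(rng.choice(N, p=d / d.sum()))  # sample pivot ~ residual diagonal
            pivots.append(s)
            col = column_A(s) - F[:, :i] @ F[s, :i]    # column s of the residual matrix
            F[:, i] = col / np.sqrt(col[s])
            d = np.maximum(d - F[:, i]**2, 0.0)    # update residual diagonal, clipping roundoff
        return F, pivots

    # Example usage on an explicit matrix A (only k columns and the diagonal are ever read):
    # F, pivots = rp_cholesky(lambda: np.diag(A), lambda s: A[:, s], A.shape[0], 100)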

So we started with the randomly pivoted QR algorithm, which requires \order(kMN) operations, and we used it to derive the randomly pivoted Cholesky algorithm that runs in \order(k^2N) operations. Let’s make this concrete with some specific numbers. Setting k = 100 and M = N = 10^6, randomly pivoted QR requires roughly 10^{14} (100 trillion) operations and randomly pivoted Cholesky requires roughly 10^{10} (10 billion) operations, a factor of 10,000 smaller operation count!

Randomly pivoted Cholesky has an interesting history. It seems to have first appeared in print in the work of Musco and Woodruff (2017), who used the Gram correspondence to derive the algorithm from randomly pivoted QR. It is remarkable that it took a full 11 years after Deshpande and coauthors’ original work on randomly pivoted QR for randomly pivoted Cholesky to be discovered. Even after Musco and Woodruff’s paper, the algorithm appears to have largely been overlooked for computation in practice, and I am unaware of any paper documenting computational experiments with randomly pivoted Cholesky before our paper was initially released in 2022.8A partial exception to this statement is the work of Poulson (2020), who used randomly pivoted Cholesky to sample from determinantal point processes. Poulson’s setting is quite different from ours: his input A is always a rank-k orthogonal projection matrix, he runs for exactly k steps, and he uses the pivot set S as output. Our paper reexamined the randomly pivoted Cholesky procedure, providing computational experiments comparing it to alternatives, using it for scientific machine learning applications, and proving new error bounds.

Other Examples of Algorithm Design by the Gram Correspondence

The Gram correspondence gives a powerful tool for inventing new algorithms or showing equivalence between existing algorithms. There are many additional examples, such as block (and rejection sampling-accelerated) versions of randomly pivoted QR/Cholesky, Nyström versions of randomized block Krylov iteration, and column projection/Nyström approximations generated by determinantal point process sampling. Indeed, if you invent a new low-rank approximation algorithm, it is always worth checking: Using the Gram correspondence, can you get another algorithm for free?

Transference of Error Bounds

The Gram correspondence also allows us to transfer error bounds between projection approximations and Nyström approximations. Let’s begin with the simplest manifestation of this principle:

Transference of Error Bounds I: Let B be a Gram square root of A, and let \hat{B} and \hat{A} be the corresponding projection and Nyström approximations generated using the same test matrix \Omega. Then

    \[\norm{B - \hat{B}}_{\rm F}^2 = \tr(A - \hat{A}).\]

This shows that the Frobenius norm error of a projection approximation and the trace error of a Nyström approximation are the same.9Note that, for any Nyström approximation \hat{A} to A, the residual A - \hat{A} is psd. Therefore, the trace error \tr(A - \hat{A}) is nonnegative and satisfies \tr(A - \hat{A}) = 0 if and only if A = \hat{A}; these statements justify why \tr(A-\hat{A}) is an appropriate expression for measuring the error of the approximation A \approx \hat{A}. Using this result, we can immediately transfer error bounds for projection approximations to Nyström approximations and vice versa. For instance, in this previous post, we proved the following error bound for the rank-k randomized SVD (with a Gaussian test matrix \Omega):

    \[\expect \norm{B - \hat{B}}_{\rm F}^2 \le \min_{r\le k-2} \left(1 + \frac{r}{k-(r+1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2,\]

Here, \lowrank{B}_r denotes the best rank-r approximation to B. This result shows that the error of the rank-k randomized SVD is comparable to the best rank-r approximation for any r \le k-2. Using the transference principle, we immediately obtain a corresponding bound for rank-k Nyström approximation (with a Gaussian test matrix \Omega):10Note that in order to transfer the bound from randomized SVD to single-pass Nyström, we also needed to use the identity \norm{B - \lowrank{B}_r}_{\rm F}^2 = \tr(A - \lowrank{A}_r). This identity, too, follows by the transference principle, since \lowrank{B}_r and \lowrank{A}_r are, themselves, projection approximations and Nyström approximations corresponding to choosing \Omega to be the dominant r right singular vectors of B!

    \[\expect \tr(A - \hat{A}) \le \min_{r\le k-2} \left(1 + \frac{r}{k-(r+1)} \right) \tr(A - \lowrank{A}_r).\]

With no additional work, we have converted an error bound for the randomized SVD to an error bound for single-pass randomized Nyström approximation!

For those interested, one can extend the transference of error bounds to general unitarily invariant norms. See the following optional section for details:

Transference of Error Bounds for General Unitarily Invariant Norms
We begin by recalling the notions of (quadratic) unitarily invariant norms; see this section of a previous post for a longer refresher. A norm \norm{\cdot}_{\rm UI} is unitarily invariant if

    \[\norm{UCV}_{\rm UI} = \norm{C}_{\rm UI}\]

for every matrix C and orthogonal matrices U and V.11As the name “unitarily invariant” suggests, if C is complex-valued we require this condition to hold for the larger class of all unitary matrices U and V. Examples of unitarily invariant norms include the nuclear norm \norm{C}_* = \sum_i \sigma_i(C), the Frobenius norm \norm{C}_{\rm F}^2 = \sum_i \sigma_i^2(C), and the spectral norm \norm{C} = \max_i \sigma_i(C). (It is not coincidental that all these unitarily invariant norms can be expressed only in terms of the singular values \sigma_i(C) of the matrix C.) Associated to every unitarily invariant norm is its associated quadratic unitarily invariant norm, defined as

    \[\norm{C}_{\rm Q}^2 = \norm{C^\top C}_{\rm UI} = \norm{CC^\top}_{\rm UI}.\]

For example,

  • The quadratic unitarily invariant norm associated to the nuclear norm \norm{\cdot}_{\rm UI} = \norm{\cdot}_* is the Frobenius norm \norm{\cdot}_{\rm Q} = \norm{\cdot}_{\rm F}.
  • The spectral norm is its own associated quadratic unitarily invariant norm \norm{\cdot}_{\rm UI} = \norm{\cdot} = \norm{\cdot}_{\rm Q}.

This leads us to the more general version of the transference principle:

Transference of Error Bounds II: Let B be a Gram square root of A, and let \hat{B} and \hat{A} be the corresponding projection and Nyström approximations generated using the same test matrix \Omega. Then

    \[\norm{B - \hat{B}}_{\rm Q}^2 = \norm{A - \hat{A}}_{\rm UI}\]

for every unitarily invariant norm \norm{\cdot}_{\rm UI} and its associated quadratic unitarily invariant norm \norm{\cdot}_{\rm Q}.

This version of the principle is indeed more general than the first: the nuclear norm of a psd matrix is its trace, whence

    \[\norm{B - \hat{B}}_{\rm F}^2 = \norm{A - \hat{A}}_* = \tr(A - \hat{A}).\]

We now also have more examples. For instance, for the spectral norm, we have

    \[\norm{B - \hat{B}}^2 = \norm{A - \hat{A}}.\]
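
As a quick, self-contained numerical illustration of the spectral-norm identity (the sizes and names below are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((40, 25))
    A = B.T @ B
    Omega = rng.standard_normal((25, 4))

    Q, _ = np.linalg.qr(B @ Omega)
    B_hat = Q @ (Q.T @ B)                          # projection approximation
    Y = A @ Omega
    A_hat = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T  # Nystrom approximation

    print(np.isclose(np.linalg.norm(B - B_hat, 2)**2, np.linalg.norm(A - A_hat, 2)))  # True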

Let us quickly prove the transference of error principle. Write the error B - \hat{B} = B - \Pi_{B\Omega}B = (I-\Pi_{B\Omega})B.  Thus,

    \[\norm{B - \hat{B}}_{\rm Q}^2 = \norm{(I-\Pi_{B\Omega})B}_{\rm Q}^2 = \norm{B^\top (I-\Pi_{B\Omega})^2B}_{\rm UI}.\]

The orthogonal projections \Pi_{B\Omega} and I-\Pi_{B\Omega} are equal to their own square, so

    \begin{align*}\norm{B - \hat{B}}_{\rm Q}^2 &= \norm{B^\top (I-\Pi_{B\Omega})B}_{\rm UI} = \norm{B^\top B - B^\top\Pi_{B\Omega}B}_{\rm UI} \\ &= \norm{B^\top B - (\Pi_{B\Omega}B)^\top(\Pi_{B\Omega}B)}_{\rm UI}.\end{align*}

Finally, write \hat{B} = \Pi_{B\Omega}B and use the Gram correspondence A = B^\top B, \hat{A} = \hat{B}^\top\hat{B} to conclude

    \[\norm{B - \hat{B}}_{\rm Q}^2 = \norm{A - \hat{A}}_{\rm UI},\]

as desired.

Conclusion

This has been a long post about a simple idea. Many of the low-rank approximation algorithms in the literature output a special type of approximation, either a projection approximation or a Nyström approximation. As this post has shown, these two types of approximation are equivalent, with algorithms and error bounds for one type of approximation transferring immediately to the other format. This equivalence is a powerful tool for the algorithm designer, allowing us to discover new algorithms from old ones, like randomly pivoted Cholesky from randomly pivoted QR.
