I’m excited to share that my paper Kernel quadrature with randomly pivoted Cholesky, joint with Elvira Moreno, has been accepted to NeurIPS 2023 as a spotlight.
Today, I want to share with you a little about the kernel quadrature problem. To avoid this post getting too long, I’m going to write this post assuming familiarity with the concepts of reproducing kernel Hilbert spaces and Gaussian processes.
Integration and Quadrature
Integration is one of the most widely used operations in mathematics and its applications. As such, it is a basic problem of wide interest to develop numerical methods for evaluating integrals.
In this post, we will consider a quite general integration problem. Let $\Omega$ be a domain and let $\mu$ be a (finite Borel) measure on $\Omega$. We consider the task of evaluating

$$I[f] = \int_\Omega f(x) \, \mathrm{d}\mu(x).$$

One can imagine that the domain $\Omega$ and the measure $\mu$ are fixed, but we may want to evaluate this same integral for multiple different functions $f$.

To evaluate the integral, we will design a quadrature approximation to $I[f]$:

$$\hat{I}[f] = \sum_{i=1}^n w_i f(s_i) \approx I[f].$$

Concretely, we wish to find real weights $w_1, \ldots, w_n$ and points $s_1, \ldots, s_n \in \Omega$ such that the approximation $\hat{I}[f] \approx I[f]$ is accurate.
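To make the setup concrete, here is a minimal Python sketch of evaluating a quadrature rule, assuming (purely for illustration) that $\Omega = [0,1]$, $\mu$ is the uniform (Lebesgue) measure, and the nodes and weights are those of the midpoint rule; these choices are mine, not part of the kernel quadrature method discussed below.

```python
import numpy as np

# A quadrature rule is just a list of weights w_i and nodes s_i; the
# approximation to the integral is the weighted sum of function values.
def quadrature(f, weights, nodes):
    return np.sum(weights * f(nodes))

# Example: n equally weighted midpoints on Omega = [0, 1] with mu the Lebesgue measure
n = 10
nodes = (np.arange(n) + 0.5) / n
weights = np.full(n, 1.0 / n)

print(quadrature(np.exp, weights, nodes))  # ≈ ∫_0^1 exp(x) dx = e - 1 ≈ 1.718
```

The whole design question is how to choose the weights and nodes well; that is the subject of the rest of this post.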
Smoothness and Reproducing Kernel Hilbert Spaces
As is frequently the case in computational mathematics, the accuracy we can expect for this integration problem depends on the smoothness of the integrand $f$. The smoother $f$ is, the more accurately we can expect to compute $I[f]$ for a given budget of computational effort.

In this post, we will measure smoothness using the reproducing kernel Hilbert space (RKHS) formalism. Let $\mathcal{H}$ be an RKHS with norm $\|\cdot\|_{\mathcal{H}}$. We can interpret the norm as assigning a roughness to each function $f \in \mathcal{H}$. If $\|f\|_{\mathcal{H}}$ is large, then $f$ is rough; if $\|f\|_{\mathcal{H}}$ is small, then $f$ is smooth.

Associated to the RKHS $\mathcal{H}$ is the titular reproducing kernel $k$. The kernel is a bivariate function $k : \Omega \times \Omega \to \mathbb{R}$. It is related to the RKHS inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ by the reproducing property

$$f(s) = \langle f, k(s, \cdot) \rangle_{\mathcal{H}} \quad \text{for every } f \in \mathcal{H} \text{ and } s \in \Omega.$$

Here, $k(s, \cdot)$ represents the univariate function obtained by setting the first input of $k$ to be $s$.
The Ideal Weights
To design a quadrature rule, we have to set the nodes $s_1, \ldots, s_n$ and the weights $w_1, \ldots, w_n$. Let’s first assume that the nodes are fixed, and talk about how to pick the weights.

There’s one choice of weights that we’ll call the ideal weights. There are (at least) five equivalent ways of characterizing the ideal weights. We’ll present all of them. As an exercise, you can try and convince yourself that these characterizations are equivalent, giving rise to the same weights.
Interpretation 1: Exactness
A standard way of designing quadrature rules is to make them exact (i.e., error-free) for some class of functions. For instance, many classical quadrature rules are exact for all polynomials up to a given degree.

For kernel quadrature, it makes sense to design the quadrature rule to be exact for the kernel functions $k(s_i, \cdot)$ associated with the selected nodes. That is, we require

$$\hat{I}[k(s_i, \cdot)] = I[k(s_i, \cdot)] \quad \text{for } i = 1, \ldots, n.$$

Enforcing exactness gives us $n$ linear equations for the $n$ unknowns $w_1, \ldots, w_n$:

$$\sum_{j=1}^n w_j \, k(s_i, s_j) = \int_\Omega k(s_i, x) \, \mathrm{d}\mu(x) \quad \text{for } i = 1, \ldots, n.$$

Under mild conditions, this system of linear equations is uniquely solvable, and its solution $\hat{w}_1, \ldots, \hat{w}_n$ is the ideal weights.
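Here is a minimal Python sketch of computing the ideal weights from this linear system. The Gaussian kernel, the choice $\Omega = [0,1]$ with $\mu$ the uniform probability measure, the randomly placed nodes, and the Monte Carlo estimate of the right-hand side integrals are all illustrative assumptions of mine, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, y):
    # Gaussian kernel (an illustrative choice of reproducing kernel)
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / 2)

nodes = rng.uniform(size=8)            # some fixed nodes s_1, ..., s_n in Omega = [0, 1]
x_mc = rng.uniform(size=100_000)       # Monte Carlo samples from mu (uniform on [0, 1])

K = k(nodes, nodes)                    # kernel matrix with entries k(s_i, s_j)
z = k(nodes, x_mc).mean(axis=1)        # z_i ≈ ∫ k(s_i, x) dmu(x), estimated by Monte Carlo
ideal_weights = np.linalg.solve(K, z)  # exactness conditions: K w = z
```

The later sketches in this post reuse `k`, `nodes`, `x_mc`, `K`, `z`, and `ideal_weights` from this snippet.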
Interpretation 2: Interpolate and Integrate
Here’s another very classical way of designing a quadrature rule. First, interpolate the function values $f(s_1), \ldots, f(s_n)$ at the nodes, obtaining an interpolant $\hat{f}$. Then, obtain an approximation to the integral by integrating the interpolant:

$$\hat{I}[f] = \int_\Omega \hat{f}(x) \, \mathrm{d}\mu(x).$$

In our context, the appropriate interpolation method is kernel interpolation. (Kernel interpolation is also called Gaussian process regression or kriging, though, confusingly, these terms can also refer to slightly different methods; it is the regularization-free limit of kernel ridge regression.) The kernel interpolant is defined to be the minimum-norm function that interpolates the data:

$$\hat{f} = \operatorname*{argmin} \left\{ \|g\|_{\mathcal{H}} : g \in \mathcal{H}, \; g(s_i) = f(s_i) \text{ for } i = 1, \ldots, n \right\}.$$

Remarkably, this infinite-dimensional problem has a tractably computable solution. In fact, $\hat{f}$ is the unique function of the form

$$\hat{f} = \sum_{i=1}^n \alpha_i \, k(s_i, \cdot)$$

that agrees with $f$ on the points $s_1, \ldots, s_n$. With a little algebra, you can show that the integral of $\hat{f}$ is

$$\int_\Omega \hat{f}(x) \, \mathrm{d}\mu(x) = \sum_{i=1}^n \hat{w}_i f(s_i),$$

where $\hat{w}_1, \ldots, \hat{w}_n$ are the ideal weights.
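Continuing the Python sketch above, one can check this equivalence numerically: solve for the interpolation coefficients, integrate the interpolant by Monte Carlo, and compare against the ideal-weight quadrature rule. The integrand `np.exp` is again just an illustrative choice.

```python
f = np.exp                               # an example integrand (illustrative choice)

alpha = np.linalg.solve(K, f(nodes))     # coefficients of the kernel interpolant f_hat
interp_at_x = k(x_mc, nodes) @ alpha     # f_hat evaluated at the Monte Carlo samples
print(interp_at_x.mean())                # ≈ ∫ f_hat dmu: integrate the interpolant
print(ideal_weights @ f(nodes))          # the ideal-weight quadrature rule gives the same value
```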
Interpretation 3: Minimizing the Worst-Case Error
Define the worst-case error of weights $w_1, \ldots, w_n$ and nodes $s_1, \ldots, s_n$ to be

$$\mathrm{Err}(w, s) = \sup_{\|f\|_{\mathcal{H}} \le 1} \left| \int_\Omega f(x) \, \mathrm{d}\mu(x) - \sum_{i=1}^n w_i f(s_i) \right|.$$

The quantity $\mathrm{Err}(w, s)$ is the highest possible quadrature error for a function $f \in \mathcal{H}$ of norm at most 1.

Having defined the worst-case error, the ideal weights are precisely the weights that minimize this quantity:

$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^n} \mathrm{Err}(w, s).$$
Interpretation 4: Minimizing the Mean-Square Error

The next two interpretations of the ideal weights will adopt a probabilistic framing. A Gaussian process is a random function $f$ such that $f$’s values at any collection of points are (jointly) Gaussian random variables. We write $f \sim \mathrm{GP}(0, k)$ for a mean-zero Gaussian process with covariance function $k$:

$$\operatorname{Cov}(f(x), f(y)) = k(x, y) \quad \text{for all } x, y \in \Omega.$$
Define the mean-square quadrature error of weights $w_1, \ldots, w_n$ and nodes $s_1, \ldots, s_n$ to be

$$\mathrm{MSE}(w, s) = \mathbb{E}_{f \sim \mathrm{GP}(0, k)} \left( \int_\Omega f(x) \, \mathrm{d}\mu(x) - \sum_{i=1}^n w_i f(s_i) \right)^2.$$

The mean-square error reports the expected squared quadrature error over functions $f$ drawn from a Gaussian process with covariance function $k$.
Pleasantly, the mean-square error is equal to the square of the worst-case error:

$$\mathrm{MSE}(w, s) = \mathrm{Err}(w, s)^2.$$

As such, the ideal weights also minimize the mean-square error:

$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^n} \mathrm{MSE}(w, s).$$
Interpretation 5: Conditional Expectation

For our last interpretation, again consider a Gaussian process $f \sim \mathrm{GP}(0, k)$. The integral $\int_\Omega f(x) \, \mathrm{d}\mu(x)$ of this random function is a random variable. To numerically integrate a function $g$, compute the expectation of this random variable conditional on $f$ agreeing with $g$ at the quadrature nodes:

$$\hat{I}[g] = \mathbb{E} \left[ \int_\Omega f(x) \, \mathrm{d}\mu(x) \,\middle|\, f(s_i) = g(s_i) \text{ for } i = 1, \ldots, n \right].$$
One can show that this procedure yields the quadrature scheme with the ideal weights.
Conclusion
We’ve just seen five sensible ways of defining the ideal weights for quadrature in a general reproducing kernel Hilbert space. Remarkably, all five lead to exactly the same choice of weights. To me, these five equivalent characterizations give confidence that the ideal weights really are the “right” or “natural” choice for kernel quadrature.
That said, there are other reasonable requirements that we might want to impose on the weights. For instance, if $\mu$ is a probability measure, it is reasonable to add the additional constraint that the weights lie in the probability simplex:

$$w_1, \ldots, w_n \ge 0 \quad \text{and} \quad \sum_{i=1}^n w_i = 1.$$

With this additional stipulation, a quadrature rule can be interpreted as integrating against a discrete probability measure $\hat{\mu} = \sum_{i=1}^n w_i \delta_{s_i}$; thus, in effect, quadrature amounts to approximating one probability measure $\mu$ by another $\hat{\mu}$. Additional constraints such as these can easily be imposed when using the optimization characterizations 3 and 4 of the ideal weights. See this paper for details.
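As a sketch of how such a constraint might be imposed in practice, here is one way to minimize the squared worst-case error over the probability simplex, continuing the Python sketch above (reusing `K`, `z`, and `nodes`). It relies on the standard expansion of $\mathrm{Err}(w, s)^2$ as the quadratic $w^\top k(s,s)\, w - 2 w^\top z$ plus a constant independent of $w$; both this expansion and the use of SciPy’s SLSQP solver are my own choices for illustration, not details from the paper.

```python
from scipy.optimize import minimize

# Squared worst-case error as a function of w, up to an additive constant
# (the constant does not affect the minimizer, so it is dropped here).
objective = lambda w: w @ K @ w - 2 * z @ w

m = len(nodes)
result = minimize(
    objective,
    x0=np.full(m, 1.0 / m),                                        # start from uniform weights
    method="SLSQP",
    bounds=[(0.0, None)] * m,                                      # w_i >= 0
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum_i w_i = 1
)
simplex_weights = result.x
```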
What About the Nodes?
We’ve spent a lot of time talking about how to pick the quadrature weights, but how should we pick the nodes $s_1, \ldots, s_n$? To pick the nodes, it seems sensible to try and minimize the worst-case error with the ideal weights, $\mathrm{Err}(\hat{w}, s)$. For this purpose, we can use the following formula:

$$\mathrm{Err}(\hat{w}, s)^2 = \int_\Omega \int_\Omega \left[ k(x, y) - \hat{k}_s(x, y) \right] \mathrm{d}\mu(x) \, \mathrm{d}\mu(y).$$

Here, $\hat{k}_s$ is the Nyström approximation to the kernel $k$ induced by the nodes $s_1, \ldots, s_n$, defined to be

$$\hat{k}_s(x, y) = k(x, s) \, k(s, s)^{-1} \, k(s, y).$$

We have written $k(s, s)$ for the $n \times n$ kernel matrix with $(i, j)$ entry $k(s_i, s_j)$, and $k(x, s)$ and $k(s, y)$ for the row and column vectors with $i$th entries $k(x, s_i)$ and $k(s_i, y)$.
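Continuing the Python sketch above (reusing `k`, `nodes`, `x_mc`, and `K`), here is one way to estimate this error formula numerically, with the double integral over $\Omega \times \Omega$ approximated by Monte Carlo; the sample sizes are arbitrary.

```python
x_sub = x_mc[:2000]             # reuse a subset of the earlier samples from mu
y_mc = rng.uniform(size=2000)   # a second, independent set of samples from mu

K_xy = k(x_sub, y_mc)                                          # k(x, y) at the sampled pairs
K_hat = k(x_sub, nodes) @ np.linalg.solve(K, k(nodes, y_mc))   # Nyström approximation k_hat_s(x, y)
err_squared = np.mean(K_xy - K_hat)     # ≈ ∫∫ (k - k_hat_s)(x, y) dmu(x) dmu(y)
print(np.sqrt(max(err_squared, 0.0)))   # worst-case error with the ideal weights
```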
I find the appearance of the Nyström approximation in this context to be surprising and delightful. Previously on this blog, we’ve seen (column) Nyström approximation in the context of matrix low-rank approximation. Now, a continuum analog of the matrix Nyström approximation has appeared in the error formula for numerical integration.
The appearance of the Nyström approximation in the kernel quadrature error also suggests a strategy for picking the nodes.
Node selection strategy. We should pick the nodes to make the Nyström approximation as accurate as possible.
The closer $\hat{k}_s$ is to $k$, the smaller the function $k - \hat{k}_s$ is and, thus, the smaller the error $\mathrm{Err}(\hat{w}, s)$.
Fortunately, we have randomized matrix algorithms for picking good nodes for matrix Nyström approximation such as randomly pivoted Cholesky, ridge leverage score sampling, and determinantal point process sampling; maybe these matrix tools can be ported to the continuous kernel setting?
Indeed, all three of these algorithms—randomly pivoted Cholesky, ridge leverage score sampling, and determinantal point process sampling—have been studied for kernel quadrature. The first of these algorithms, randomly pivoted Cholesky, is the subject of our paper. We show that this simple, adaptive sampling method produces excellent nodes for kernel quadrature. Intuitively, randomly pivoted Cholesky is effective because it is repulsive: after having picked nodes $s_1, \ldots, s_i$, it has a high probability of placing the next node $s_{i+1}$ far from the previously selected nodes.
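Here is a minimal sketch of randomly pivoted Cholesky in its matrix form, applied to a fine grid of candidate points with a Gaussian kernel; both the grid discretization and the kernel are illustrative assumptions, and the paper’s continuous sampling procedure (via rejection sampling) is not reflected here.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, y):
    # Gaussian kernel (an illustrative choice)
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / 2)

# Discretize the domain into a fine grid of candidate points. (The paper samples
# from the continuous domain directly; a grid keeps this sketch simple.)
candidates = np.linspace(0.0, 1.0, 1000)
n = 20                                   # number of quadrature nodes to select

d = np.ones(len(candidates))             # diagonal of the kernel matrix; k(x, x) = 1 here
F = np.zeros((len(candidates), n))       # Cholesky factor, built one column per selected node
chosen = []

for i in range(n):
    # Sample the next pivot with probability proportional to the residual diagonal,
    # so points poorly approximated by the previous picks are favored.
    s = rng.choice(len(candidates), p=d / d.sum())
    chosen.append(s)

    # One step of pivoted Cholesky: form the residual column at the pivot and update.
    col = k(candidates, candidates[s : s + 1])[:, 0] - F[:, :i] @ F[s, :i]
    F[:, i] = col / np.sqrt(col[s])
    d = np.maximum(d - F[:, i] ** 2, 0.0)

nodes = candidates[chosen]               # the selected quadrature nodes
```

Sampling proportional to the residual diagonal is what makes the method adaptive and repulsive: once a node is selected, the residual diagonal near it drops, so nearby candidates become unlikely picks.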
The following image shows 20 nodes selected by randomly pivoted Cholesky in a crescent-shaped region. The cyan–pink shading denotes the probability distribution for picking the next node. We see that the center of the crescent does not have any nodes, and thus is most likely to receive a node during the next round of sampling.
In our paper, we demonstrate—empirically and theoretically—that randomly pivoted Cholesky produces excellent nodes for quadrature. We also discuss efficient rejection sampling algorithms for sampling nodes with the randomly pivoted Cholesky distribution. Check out the paper for details!