Fast randomized iteration: diffusion Monte Carlo through the lens of numerical linear algebra

We review the basic outline of the highly successful diffusion Monte Carlo technique commonly used in contexts ranging from electronic structure calculations to rare event simulation and data assimilation, and propose a new class of randomized iterative algorithms based on similar principles to address a variety of common tasks in numerical linear algebra. From the point of view of numerical linear algebra, the main novelty of the Fast Randomized Iteration schemes described in this article is that they require only linear or constant cost per iteration (and in total, under appropriate conditions) and are rather versatile: we will show how they apply to the solution of linear systems, eigenvalue problems, and matrix exponentiation, in dimensions far beyond the present limits of numerical linear algebra. While traditional iterative methods in numerical linear algebra were created in part to deal with instances where a matrix (of size $\mathcal{O}(n^2)$) is too big to store, the algorithms that we propose are effective even in instances where the solution vector itself (of size $\mathcal{O}(n)$) may be too big to store or manipulate. In fact, our work is motivated by recent DMC-based quantum Monte Carlo schemes that have been applied to matrices as large as $10^{108} \times 10^{108}$. We provide basic convergence results, discuss the dependence of these results on the dimension of the system, and demonstrate dramatic cost savings on a range of test problems.


Introduction
Randomized algorithms have become immensely popular in numerical linear algebra over the last decade. Most of the developments to date have focused on low-rank approximations of large matrices (see e.g. [15,19,20,21,23,37,39,46,47,56]). For earlier uses of randomization in numerical linear algebra see, for example, [1] (in the context of matrix inversion) and [32] (for estimates of the trace of a matrix), and for an interesting description of the relationships between Markov chain Monte Carlo schemes and common iterative techniques in numerical linear algebra see [27]. The goal of this article is, after providing a brief introduction to the highly successful diffusion Monte Carlo (DMC) algorithm, to suggest a new class of algorithms inspired by DMC for problems in numerical linear algebra. DMC is used in applications including electronic structure calculations, rare event simulation, and data assimilation to efficiently approximate expectations of the type appearing in Feynman-Kac formulae, i.e., weighted expectations of Markov processes typically associated with parabolic partial differential equations (see e.g. [17]). While based on principles underlying DMC, the fast randomized iteration (FRI) schemes that we study in this article are designed to address arguably the most classical and common tasks in matrix computations (linear systems, eigenvector problems, and matrix exponentiation) for matrices $A$ that might not have any natural association with a Markov process. FRI schemes rely on basic principles similar to those at the core of other methods that have appeared recently in the numerical linear algebra literature, but they differ substantially in detail and in the problems they address.
These differences will be remarked on again later, but roughly, while most of the randomized linear algebra techniques in the references above rely on a single sampling step, FRI methods randomize repeatedly and, as a consequence, require more carefully constructed randomizations. FRI schemes are not, however, the first to employ randomization within iterative schemes. In fact, the strategy of replacing expensive integrals or sums (without immediate stochastic interpretation) appearing in iterative protocols by inexpensive random estimates has a long history. For example, it was used in schemes for the numerical solution of hyperbolic systems of partial differential equations in [12]. That strategy is represented today in applications ranging from density functional calculations in physics and chemistry (see e.g. [3]) to maximum likelihood estimation in statistics and machine learning (see e.g. [9]). Though related in that they rely on repeated randomization within an iterative scheme, these schemes differ from the methods we consider in that they do not use a stochastic representation of the solution vector itself. In contrast to these and other randomized methods that have been used in linear algebra applications, our focus is on problems for which the solution vector is extremely large, so that they can only be treated by linear or constant cost algorithms. In fact, the scheme that is our primary focus is ideally suited to problems so large that the solution vector itself is too large to store, so that no traditional iterative method (even for sparse matrices) is appropriate. This is possible because the scheme computes only low dimensional projections of the solution vector and not the solution vector itself. The full solution is replaced by a sequence of sparse random vectors whose expectations are close to the true solution and whose variances are small.
Diffusion Monte Carlo (see e.g. [2,8,11,24,29,30,33,35,38]) is a central component in the quantum Monte Carlo (QMC) approach to computing the electronic ground state energy of the Schrödinger Hamiltonian operator
$$Hv = -\tfrac{1}{2}\Delta v + U v. \qquad (2)$$
We are motivated in particular by the work in [8], in which the authors apply a version of the DMC procedure to a finite (but huge) dimensional projection of $H$ onto a discrete basis respecting an anti-symmetry property of the desired eigenfunction. The approach in [8] and subsequent papers has yielded remarkable results in situations where the projected Hamiltonian is an extremely large matrix (e.g. $10^{108} \times 10^{108}$, see [51]) and standard approaches to finite dimensional eigenproblems are far from reasonable (see [5,6,7,13,14,51]). The basic DMC approach is also at the core of schemes developed for a number of applications beyond electronic structure. In fact, early incarnations of DMC were used in the simulation of small probability events [31,48] for statistical physics models. Also in statistical physics, the transfer matrix Monte Carlo (TMMC) method was developed to compute the partition functions of certain lattice models by exploiting the observation that the partition function can be represented as the dominant eigenvalue of the so-called transfer matrix, a real, positive, and sparse matrix (see [41]). TMMC may be regarded as an application of DMC to discrete eigenproblems. DMC has also become a popular method for many data assimilation problems, and the notion of a "compression" operation introduced below is very closely related to the "resampling" strategies developed for those problems (see e.g. [28,34,16]).
One can view the basic DMC (or Markov chain Monte Carlo, for that matter) procedure as a combination of two steps: in one step an integral operator (a Green's function) is applied to an approximate solution consisting of a weighted finite sum of delta functions; in another step the resulting function (which is no longer a mixture of delta functions) is again approximated by a finite mixture of delta functions. The more delta functions allowed in the mixture, the higher the accuracy and cost of the method. A key to understanding the success of these methods is the observation that not all delta functions (i.e., at all positions in space) need appear in the mixture. A similar observation holds for the methods we introduce: FRI schemes need not access all entries in the matrix of interest to yield an accurate solution. In fact, we prove that the cost to achieve a fixed level of accuracy with our methods can be bounded independently of the size of the matrix, though in many applications one should expect some dependence on dimension. As with other randomized schemes, when an effective deterministic method is available it will very likely outperform the methods we propose; our focus is on problems for which satisfactory deterministic alternatives are not available (e.g. when the size of the intermediate iterates or of the final result is so large as to prohibit any conceivable deterministic method). Moreover, the schemes that we propose are a supplement to, and not a replacement for, traditional dimensional reduction strategies (e.g. intelligent choice of basis). Indeed, successful application of DMC within practical QMC applications relies heavily on a change of variables based on approximations extracted by other methods (see the discussion of importance sampling in [24]).
The theoretical results that we provide are somewhat atypical of results commonly presented in the numerical linear algebra literature. In the context of linear algebra applications, both DMC and FRI schemes are most naturally viewed as randomizations of standard iterative procedures, and their performance is largely determined by the structure of the particular deterministic iteration being randomized. For this reason, as well as to avoid obscuring the essential issues with the details of individual cases, we choose to frame our results in terms of the difference between the iterates $v_t$ produced by a general iterative scheme and the iterates $V_t$ generated by the corresponding randomized scheme (rather than considering the difference between $V_t$ and $\lim_{t\to\infty} v_t$). In ideal situations our bounds are of the form
$$\left\| V_t - v_t \right\| \le \frac{C}{\sqrt{m}},$$
where the constant $C$ is independent of the dimension $n$ of the problem and $m \le n$ controls the cost per iteration of the randomized scheme (one iteration of the randomized scheme is roughly a factor of $n/m$ less costly than its deterministic counterpart, and the two schemes are identical when $m = n$). The norm in the above display measures the root mean squared deviation in low dimensional projections of the iterates. This choice is important, as described in more detail in Sections 3 and 4. For more general applications, one can expect the constant $C$, which incorporates the stability properties of the randomized iteration, to depend on dimension. In the worst case scenario, the randomized scheme is no more efficient than its deterministic counterpart (in other words, reasonable performance may require $m \sim n$). Our numerical simulations, in which $n/m$ ranges roughly between $10^7$ and $10^{14}$, strongly suggest that this scenario may be rare.
We will begin our development in Section 2 with a description of the basic diffusion Monte Carlo procedure. Next, in Section 3, we describe how ideas borrowed from DMC can be applied to general iterative procedures in numerical linear algebra. As a prototypical example, we describe how randomization can be used to dramatically decrease the cost of finding the dominant eigenvalue and (projections of) the dominant eigenvector, first to $\mathcal{O}(n)$ in the dimension of the problem and then to $\mathcal{O}(1)$. Also in that section, we consider the specific case in which the iteration mapping is an $\varepsilon$-perturbation of the identity, relevant to a wide range of applications involving evolutionary differential equations. In this case a poorly chosen randomization scheme can result in an unstable algorithm, while a well chosen randomization can result in an error that decreases with $\varepsilon$ (over $\varepsilon^{-1}$ iterations). Next, in Sections 4 and 5, we establish several simple bounds regarding the stability and error of our schemes. Finally, in Section 6 we provide three computational examples to demonstrate the performance of our approach. A simple, educational implementation of Fast Randomized Iteration applied to our first computational example can be found at [55].
Remark 1. In several places we have included remarks that clarify or emphasize concepts that may otherwise be unclear to readers more familiar with classical, deterministic, numerical linear algebra methods. We anticipate that some of these remarks will be useful to this article's broader audience as well.

Diffusion Monte Carlo within quantum Monte Carlo
The ground state energy, $\lambda_*$, of a quantum mechanical system governed by the Hamiltonian in (2) is the smallest eigenvalue (with corresponding eigenfunction $v_*$) of the Hermitian operator $H$. The starting point for a DMC calculation is the imaginary-time Schrödinger equation (see footnotes 2 and 3)
$$\partial_t v = -Hv \qquad (3)$$
(for a review of QMC see [24]). One can, in principle, use power iteration to find $\lambda_*$: beginning from an initial guess $v_0$ (and assuming a gap between $\lambda_*$ and the rest of the spectrum of $H$), the iteration
$$v_t = \frac{e^{-\varepsilon H} v_{t-1}}{\left\| e^{-\varepsilon H} v_{t-1} \right\|_1} \qquad (4)$$
will converge to the pair $(\lambda_*, v_*)$, where $v_*$ is the eigenfunction corresponding to $\lambda_*$. Remark 2. For readers who are more familiar with the power method in numerical linear algebra, this may seem a bit odd, but a discrete analogue of (4) is just $v_t = A v_{t-1} / \|A v_{t-1}\|_1$ applied to a positive definite matrix $A = \exp(-\varepsilon \mathbf{H}) \in \mathbb{C}^{n\times n}$, where $\mathbf{H}$ is Hermitian (see footnote 4). The slight departure from the usual power method is only in (i) normalizing by a 1-norm (or rather, by the sum of entries $\mathbf{1}^{T} A v_{t-1}$, since both $v$ and $A$ are non-negative), and (ii) iterating on $\exp(-\varepsilon \mathbf{H})$ instead of on $\mathbf{H}$ directly (the goal in this context is to find the smallest eigenvalue of $\mathbf{H}$, not the magnitude-dominant eigenvalue of $\mathbf{H}$). The iteration on the eigenvalue is then $\lambda_t = -\varepsilon^{-1} \log \|A v_{t-1}\|_1$, since $\lambda_*(\mathbf{H}) = -\varepsilon^{-1} \log \lambda_*(A)$.
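As a concrete illustration of Remark 2, the 1-norm-normalized power iteration can be sketched in a few lines. This is our own illustrative code (the function name and interface are not from the article); `A` stands in for $\exp(-\varepsilon \mathbf{H})$, which here is assumed to be entrywise non-negative:

```python
import numpy as np

def normalized_power_iteration(A, v0, eps, iters):
    """Power iteration normalized by the sum of entries (the 1-norm,
    since A and v are entrywise non-negative), as in Remark 2.

    Returns the final iterate and the eigenvalue estimate
    lambda_t = -(1/eps) * log ||A v_{t-1}||_1, which approximates the
    smallest eigenvalue of H when A = exp(-eps * H)."""
    v = np.asarray(v0, dtype=float)
    v = v / v.sum()
    lam = None
    for _ in range(iters):
        w = A @ v
        s = w.sum()              # 1^T A v = ||A v||_1 since w >= 0
        lam = -np.log(s) / eps   # eigenvalue iterate for H
        v = w / s
    return v, lam
```

For a diagonal $\mathbf{H}$ with eigenvalues $1, 2, 3$ and $\varepsilon = 0.1$, the iteration converges to the eigenvector of the smallest eigenvalue and `lam` approaches $1$.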
The first step in any (deterministic or stochastic) practical implementation of (4) is discretization of the operator $e^{-\varepsilon H}$. Diffusion Monte Carlo often uses the second order time discretization
$$e^{-\varepsilon H} \approx e^{-\frac{\varepsilon}{2} U}\, e^{\frac{\varepsilon}{2}\Delta}\, e^{-\frac{\varepsilon}{2} U} =: K_\varepsilon.$$
A standard deterministic approach would then proceed by discretizing the operator $K_\varepsilon$ in space and replacing $e^{-\varepsilon H}$ in (4) with the space- and time-discretized approximate operator. The number of spatial discretization points required by such a scheme to achieve a fixed accuracy will, in general, grow exponentially in the dimension $d$ of the argument of $H$. Diffusion Monte Carlo uses two randomization steps to avoid this explosion in cost as $d$ increases.

Footnote 2. The reader familiar with quantum mechanics but unfamiliar with QMC may wonder why we begin with the imaginary-time Schrödinger equation and not the usual Schrödinger equation $i\partial_t v = -Hv$. The reason is that, while the solutions to both equations can be expanded in terms of the eigenfunctions of $H$, for the usual Schrödinger equation the contributions to the solution from eigenfunctions with larger eigenvalues do not decay relative to the ground state. By approximating the solution to the imaginary-time equation for large times we can approximate the ground state eigenfunction of $H$.

Footnote 3. In practical QMC applications one solves for $\rho = v_* \tilde{v}$, where $\tilde{v}$ is an approximate solution found in advance by other methods. The new function $\rho$ is the ground state eigenfunction of a Hamiltonian of the form $\tilde{H}v = -\frac{1}{2}\Delta v + \operatorname{div}(bv) + \tilde{U} v$. The implications for the discussion in this section are minor.

Footnote 4. Note that $\mathbf{H}$, which is a matrix, is not the same as $H$, which is an operator on a function space; $\mathbf{H}$ is only introduced for the purposes of relating the expression in (4) to the usual power method for matrices.
These randomizations have the effect of ensuring that the random approximations $V_t^m$ of the iterates $v_t$ are always of the form
$$V_t^m = \sum_{j=1}^{N_t} W_t^{(j)}\, \delta_{X_t^{(j)}},$$
where $\delta_y(x)$ is the Dirac delta function centered at $y \in \mathbb{R}^d$, the $W_t^{(j)}$ are real, non-negative numbers with $\sum_{j=1}^{N_t} W_t^{(j)} = 1$, and, for each $j \le N_t$, $X_t^{(j)} \in \mathbb{R}^d$. As will be made clear in a moment, the integer $m$ superscripted in our notation controls the number of delta functions, $N_t$, included in the above expression for $V_t^m$. The fact that the function $V_t^m$ is non-zero at only $N_t$ values is crucial to the efficiency of diffusion Monte Carlo. Starting with $N_0 = m$, one begins from an initial condition of the form
$$V_0^m = \frac{1}{m} \sum_{j=1}^{m} \delta_{X_0^{(j)}},$$
which can be assembled in $\mathcal{O}(m)$ operations. The first of the randomization steps used in DMC relies on the well known relationship
$$\left( e^{\frac{\varepsilon}{2}\Delta} f \right)(y) = \mathbf{E}_y\!\left[ f(B_\varepsilon) \right], \qquad (5)$$
where $f$ is a test function, $B_s$ is a standard Brownian motion evaluated at time $s \ge 0$, and the subscript on the expectation is meant to indicate that $B_0 = y$ (i.e., in the expectation in (5), $B_\varepsilon$ is a Gaussian random variable with mean $y$ and variance $\varepsilon$). In fact, this representation is a special case of the Feynman-Kac formula. Sampling one Gaussian step for each of the $m$ points yields the approximation
$$\tilde{V}_1^m = \sum_{j=1}^{m} \tilde{W}_1^{(j)}\, \delta_{X_1^{(j)}},$$
with weights $\tilde{W}_1^{(j)}$ proportional to $e^{-\frac{\varepsilon}{2}\left( U(X_0^{(j)}) + U(X_1^{(j)}) \right)}$ and normalized to sum to one, where, conditioned on $X_0^{(j)}$, the point $X_1^{(j)}$ is normally distributed with mean $X_0^{(j)}$ and covariance $\varepsilon I$ (here $I$ is the $d \times d$ identity matrix). This first randomization has allowed an approximation of $K_\varepsilon V_0^m$ by a distribution, $\tilde{V}_1^m$, that is again supported on only $m$ points in $\mathbb{R}^d$. One might, therefore, attempt to define a sequence $V_t^m$ iteratively by the recursion
$$V_{t+1}^m = \sum_{j=1}^{m} W_{t+1}^{(j)}\, \delta_{\xi_{t+1}^{(j)}},$$
where the $\xi_t^{(j)}$ are generated by repeated independent Gaussian steps (with $\xi_0^{(j)} = X_0^{(j)}$) and where we have recursively defined the weights
$$W_{t+1}^{(j)} = \frac{W_t^{(j)}\, e^{-\frac{\varepsilon}{2}\left( U(\xi_t^{(j)}) + U(\xi_{t+1}^{(j)}) \right)}}{\sum_{l=1}^{m} W_t^{(l)}\, e^{-\frac{\varepsilon}{2}\left( U(\xi_t^{(l)}) + U(\xi_{t+1}^{(l)}) \right)}}.$$
The cost of this randomization procedure is $\mathcal{O}(dm)$, so that the total cost of a single iteration is $\mathcal{O}(dm)$.
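A bare-bones sketch of this first randomization step may be helpful. The code below is our own illustration (names and interface are not from the article): each of the $m$ walkers takes a Gaussian step of covariance $\varepsilon I$ and its weight picks up a factor involving the potential $U$ at both endpoints, consistent with a second-order splitting of $K_\varepsilon$:

```python
import numpy as np

def dmc_propagate(X, W, U, eps, rng):
    """One weighted DMC propagation step.

    X : (m, d) array of walker positions; W : (m,) array of weights
    summing to one; U : vectorized potential, U(X) has shape (m,).
    Each walker takes a Brownian increment of variance eps, its weight
    is multiplied by exp(-(eps/2) * (U(old) + U(new))), and the
    weights are renormalized to sum to one."""
    X_new = X + np.sqrt(eps) * rng.standard_normal(X.shape)
    W_new = W * np.exp(-0.5 * eps * (U(X) + U(X_new)))
    return X_new, W_new / W_new.sum()
```

Iterating this map alone reproduces the weighted recursion above, and so it also reproduces the weight degeneracy problem discussed next.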
At each step the weights in the expression for the iterates $V_t^m$ are multiplied by additional random factors. These factors are determined by the potential $U$ and the positions of the $\xi_t^{(j)}$. On the other hand, the $\xi_t^{(j)}$ evolve without reference to the potential function $U$ (they are discretely sampled points from $m$ independent Brownian motions). As a consequence, over many iterations one can expect extreme relative variations in the $W_t^{(j)}$ and, therefore, poor accuracy in $V_t^m$ as an approximation of the functions $v_t$ produced by (4).
The second randomization employed by the DMC algorithm is the key to controlling the growth in variance, and generalizations of the idea will be key to designing fast randomized iteration schemes in the next section. In order to control the variation in weights, at step $t$ DMC randomly removes points $\xi_t^{(j)}$ corresponding to small weights $W_t^{(j)}$ and duplicates points corresponding to large weights by "resampling" a new collection of $m$ points from the $\xi_t^{(j)}$. The resulting points are labeled $X_t^{(j)}$ and the new distribution $Y_t^m$ takes the form
$$Y_t^m = \frac{1}{m} \sum_{j} N_t^{(j)}\, \delta_{\xi_t^{(j)}}.$$
The next iterate $V_{t+1}^m$ is then built exactly as before, but with $V_t^m$ replaced by $Y_t^m$. All methods to select the $X_t^{(j)}$ generate, for each $j$, a non-negative integer $N_t^{(j)}$, the number of copies of $\xi_t^{(j)}$ retained. For example, one popular strategy in DMC generates the $N_t^{(j)}$ by setting
$$N_t^{(j)} = \left\lfloor m\, W_t^{(j)} + u^{(j)} \right\rfloor, \qquad (6)$$
where the $u^{(j)}$ are independent uniform random variables on $(0,1)$, so that $\mathbf{E}\big[ N_t^{(j)} \big] = m\, W_t^{(j)}$. The above steps define a randomized iterative algorithm to generate approximations $V_t^m$ of $v_t$. The second randomization procedure (generating $Y_t^m$ from $V_t^m$) will typically require $\mathcal{O}(m)$ operations, preserving the overall $\mathcal{O}(dm)$ per-iteration cost of DMC (as we have described it). The memory requirements of the scheme are also $\mathcal{O}(dm)$. The eigenvalue $\lambda_*$ can be approximated, for example, by
$$\lambda_* \approx -\frac{1}{\varepsilon T} \sum_{t=1}^{T} \log m_t,$$
where $m_t$ is the normalization constant (the total unnormalized weight) computed at iteration $t$, for $T$ large.
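One common unbiased rule for generating the copy numbers $N_t^{(j)}$, of the stochastic-rounding type in (6), can be sketched as follows (our own illustration; `weights` is assumed to sum to one and `m` is the target population size):

```python
import numpy as np

def resample_counts(weights, m, rng):
    """Unbiased stochastic rounding: walker j receives
    N_j = floor(m*W_j + u_j) copies, u_j ~ Uniform(0,1), so that
    E[N_j] = m*W_j.  Low-weight walkers are likely removed (N_j = 0)
    and high-weight walkers are duplicated."""
    counts = np.floor(m * np.asarray(weights) + rng.random(len(weights)))
    return counts.astype(int)
```

After resampling, all surviving copies carry the equal weight $1/m$, which controls the relative variation in weights at the price of the resampling noise.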
Before moving on to more general problems, notice that the scheme just outlined applies just as easily to space-time discretizations of $e^{-\varepsilon H}$. For example, if we set $h = (1 + 2d)\varepsilon$ and denote by $E_h^d \subset \mathbb{R}^d$ the uniform rectangular grid with resolution $h$ in each direction, the operator $e^{-\varepsilon H}$ can be discretized using
$$K_{\varepsilon,h} = e^{-\frac{\varepsilon}{2} U}\, e^{\frac{\varepsilon}{2} \Delta_h}\, e^{-\frac{\varepsilon}{2} U},$$
where, for any vector $g \in \mathbb{R}^{E_h^d}$,
$$\left( \Delta_h\, g \right)(x) = \frac{1}{h^2} \sum_{i=1}^{d} \left( g(x + h e_i) - 2\, g(x) + g(x - h e_i) \right),$$
with $e_i$ the $i$th coordinate direction (here we find it convenient to identify functions $g : E_h^d \to \mathbb{R}$ with vectors indexed by the points of the grid). The operator $e^{\frac{\varepsilon}{2}\Delta_h}$ again has a stochastic representation; now the representation is in terms of a jump Markov process with jumps from a point in $E_h^d$ to one of its nearest neighbors on the grid (for an interesting approach to discretizing stochastic differential equations taking advantage of a similar observation see [10]).
Remark 3. The reader should notice that not only will we be unable to store the matrix $K_{\varepsilon,h}$ (which is exponentially large in $d$) or afford to compute $K_{\varepsilon,h} v$ for a general vector $v$, but we will not even be able to store the iterates $v_t$ generated by the power method. Even the sparse matrix routines developed in numerical linear algebra to deal with large matrices are not reasonable for this problem.
In this discrete context, a direct application of the DMC approach (as in [41]) would represent the solution vector as a superposition of standard basis elements and replace calculation of $K_{\varepsilon,h} v$ by a random approximation whose cost is (for this particular problem) free of any direct dependence on the size of $K_{\varepsilon,h}$ (though its cost can depend on structural properties of $K_{\varepsilon,h}$ which may be related to its size), whose expectation is exactly $K_{\varepsilon,h} v$, and whose variance is small. The approach in [8] is also an application of these same basic DMC steps to a discrete problem, though in that case the desired eigenvector has entries of a priori unknown sign, requiring that the solution vector be represented by a superposition of signed standard basis elements.
In this article we take a different approach to adapting DMC to discrete problems. Instead of reproducing in the discrete setting exactly the steps comprising DMC, consider a slightly modified scheme that omits direct randomization of an approximation to $e^{\frac{\varepsilon}{2}\Delta_h}$ and instead relies solely on a general random mapping $\Phi_t^m$, very similar to the map from $V_t^m$ to $Y_t^m$, which takes a vector $V_t^m \in \mathbb{R}^{E_h^d}$ and produces a new random vector $Y_t^m$ with $m$, or nearly $m$, non-zero components satisfying $\mathbf{E}\left[ Y_t^m \mid V_t^m \right] = V_t^m$. Starting from a non-negative initial vector $V_0^m \in \mathbb{R}^{E_h^d}$ with $\|V_0^m\|_1 = 1$ and with at most $\mathcal{O}(md)$ non-zero entries, $V_{t+1}^m$ is generated from $V_t^m$ as follows:

Step 1. Generate $Y_t^m = \Phi_t^m(V_t^m)$, with $m$, or nearly $m$, non-zero entries.

Step 2. Set $V_{t+1}^m = K_{\varepsilon,h}\, Y_t^m \,/\, \left\| K_{\varepsilon,h}\, Y_t^m \right\|_1$.

For example, adapting the popular resampling choice mentioned above, we might choose the entries of $Y_t^m$ according to
$$\left( Y_t^m \right)_i = \frac{1}{m} \left\lfloor m \left( V_t^m \right)_i + u^{(i)} \right\rfloor, \qquad (7)$$
where the $u^{(i)}$ are independent uniform random variables on $(0,1)$. Note that the iterates $V_t^m$ generated by this scheme have at most $\mathcal{O}(dm)$ non-zero entries. The details of the mappings $\Phi_t^m$ (which we call compression mappings) will be described later in Section 5 where, for example, we will find that the cost of applying the mapping $\Phi_t^m$ will typically be $\mathcal{O}(dm)$ when its argument has $\mathcal{O}(dm)$ non-zero entries. And while the cost of applying $K_{\varepsilon,h}$ to an arbitrary vector in $\mathbb{R}^{E_h^d}$ is proportional to $|E_h^d|$, the cost of applying $K_{\varepsilon,h}$ to a vector with $m$ non-zero entries is only $\mathcal{O}(md)$. The total cost of the scheme is therefore $\mathcal{O}(md)$ in storage and operations per iteration. These cost requirements are dependent on the particular spatial discretization of $e^{\frac{\varepsilon}{2}\Delta}$; if we had chosen a discretization corresponding to a dense matrix $K_{\varepsilon,h}$ then the cost reduction would be much less extreme. Nonetheless, as we will see in Section 3, the scheme just described can be easily generalized and, as we will see in Sections 4 and 6, will often result in methods that are significantly faster than their deterministic counterparts.
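In code, one sweep of this two-step scheme might look like the following sketch. This is our own illustration: `compress` is a simple i.i.d.-sampling stand-in for the more carefully constructed compression mappings of Section 5, and `apply_K` is any routine that applies the (sparse) discretized operator:

```python
import numpy as np

def compress(v, m, rng):
    """Random compression with E[compress(v)] = v and at most m
    non-zero entries: sample m indices i.i.d. proportionally to |v_i|
    and reweight.  A simple, higher-variance stand-in for the
    compression mappings described in Section 5 of the article."""
    mass = np.abs(v).sum()
    idx = rng.choice(len(v), size=m, p=np.abs(v) / mass)
    out = np.zeros_like(v)
    np.add.at(out, idx, np.sign(v[idx]) * mass / m)
    return out

def step(apply_K, v, m, rng):
    """Step 1: compress to at most m non-zeros.
       Step 2: apply the operator and renormalize in the 1-norm."""
    y = apply_K(compress(v, m, rng))
    return y / np.abs(y).sum()
```

For non-negative `v` this compression preserves the total mass exactly and is unbiased entrywise, so the two-step iteration only ever applies `apply_K` to vectors with at most `m` non-zero entries.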
Even restricting ourselves to the quantum Monte Carlo context, there is ample motivation to generalize the DMC scheme. Often one wishes to approximate not the smallest eigenvalue of $H$ but instead the smallest eigenvalue corresponding to an antisymmetric (in exchange of particle positions) eigenfunction. DMC, as described in this section, cannot be applied directly to computing this value, a difficulty commonly referred to as the Fermion sign problem. Several authors have attempted to address this issue with various modifications of DMC. In particular, Alavi and coworkers recently developed a version of DMC for a particular spatial discretization of the Hamiltonian (in the configuration interaction basis) that exactly preserves antisymmetry (unlike the finite difference discretization we just described). Run to convergence, their method provides the same approximation as the so-called full CI method but can be applied with a much larger basis (e.g. in experiments by Alavi and coworkers reported in [51], up to $10^{108}$ total functions in the expansion of the solution). Though it differs substantially in its details from the scheme proposed in [8], the generalizations of DMC represented by the two-step procedure in the last paragraph and by the developments in the next section can also be applied to a wider range of discretizations of $H$. Finally we remark that, while we have considered DMC in the particular context of computing the ground state energy of a Hamiltonian, the method is used for a much wider variety of tasks with only minor modification to its basic structure. For example, particle filters (see e.g. [16]) are an application of DMC to on-line data assimilation, and the substantive differences are mostly in the interpretation of the operator to which DMC is applied (and the fact that one is typically interested in the solution after finitely many iterations).

A general framework
Consider the general iterative procedure
$$v_{t+1} = M(v_t) \qquad (8)$$
for $v_t \in \mathbb{C}^n$. Eigenproblems, linear systems, and matrix exponentiation can each be accomplished by versions of this iteration. In each of those settings the cost of evaluating $M(v)$ is dominated by a matrix-vector multiplication. We assume that the cost (in terms of floating point operations and storage) of performing the required matrix-vector multiplication makes it impossible to carry out recursion (8) to the desired precision. As in the steps described at the end of the last section, we will consider the error resulting from replacement of (8) by
$$V_{t+1}^m = M\!\left( \Phi_t^m(V_t^m) \right), \qquad (9)$$
where the compression maps $\Phi_t^m : \mathbb{C}^n \to \mathbb{C}^n$ are independent, inexpensive to evaluate, and enforce sparsity in the iterates $V_t^m$ (the number of non-zero entries in $V_t^m$ will be $\mathcal{O}(m)$) so that $M$ can be evaluated at much less expense. When $M$ is a perturbation of identity and an $\mathcal{O}(n)$ scheme is appropriate (see Sections 3.2 and 4 below) we will also consider the scheme
$$V_{t+1}^m = V_t^m + M\!\left( \Phi_t^m(V_t^m) \right) - \Phi_t^m(V_t^m). \qquad (10)$$
The compressions $\Phi_t^m$ will satisfy (or very nearly satisfy) the statistical consistency criterion
$$\mathbf{E}\!\left[ \Phi_t^m(v) \right] = v \qquad (11)$$
and will have to be carefully constructed to avoid instabilities and yield effective methods. For definiteness one can imagine that $\Phi_t^m$ is defined by a natural extension of (7) to accept arguments $v \in \mathbb{C}^n$ (whose entries are no longer non-negative real numbers). This choice has several drawbacks, not least of which is its cost, and we do not use it in our numerical experiments. Alternative compression schemes, including the one used in our numerical simulations, are considered in detail in Section 5. There we will learn that one can expect that, for any pair $f, v \in \mathbb{C}^n$,
$$\left( \mathbf{E}\!\left[ \left| f^{\mathrm{H}} \Phi_t^m(v) - f^{\mathrm{H}} v \right|^2 \right] \right)^{1/2} \le \frac{C\, \|f\|_\infty\, \|v\|_1}{\sqrt{m}} \qquad (13)$$
(the superscript H is used throughout this article to denote the conjugate transpose of a vector with complex entries). These errors are introduced at each iteration and need to be removed to obtain an accurate estimate.
Depending on the setting, we may rely on averaging over long trajectories, averaging over parallel simulations (replicas), or dynamical self-averaging (see Sections 3.2 and 4), to remove the noise introduced by our randomization procedure. Because the specific choice of M and the form of averaging used to remove noise can differ substantially by setting, we will describe the schemes within the context of specific (and common) iterative procedures.
3.1. The eigenproblem revisited

Consider, for example, a more general eigenproblem than the one we considered in Section 2. Given $K \in \mathbb{C}^{n \times n}$, the goal is to determine $\lambda_* \in \mathbb{C}$ and $v_* \in \mathbb{C}^n$ such that
$$K v_* = \lambda_* v_* \qquad (14)$$
and such that, for any other solution pair $(\mu, u)$, $|\mu| < |\lambda_*|$. In what follows in this section we will assume that this problem has a unique solution. The standard methods of approximate solution of (14) are variants of the power method, a simple version of which performs
$$v_{t+1} = \frac{K v_t}{\left\| K v_t \right\|}, \qquad \lambda_{t+1} = \left\| K v_t \right\|. \qquad (15)$$
Under generic conditions, these iterations converge to the correct $(\lambda_*, v_*)$ starting from an appropriate initial vector $v_0$ (see e.g. [18]). The scheme in (15) requires $\mathcal{O}(n^2)$ work per iteration and at least $\mathcal{O}(n)$ storage. In this article, we are interested in situations in which these cost and storage requirements are unreasonable.
From $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$

For the iteration in (15), the randomized scheme (9) (along with an approximation of $\lambda_*$) becomes
$$V_{t+1}^m = \frac{K\, \Phi_t^m(V_t^m)}{\left\| K\, \Phi_t^m(V_t^m) \right\|}, \qquad \Lambda_{t+1} = \left\| K\, \Phi_t^m(V_t^m) \right\|,$$
$$\overline{V}_{t+1} = (1 - \varepsilon_t)\, \overline{V}_t + \varepsilon_t\, V_{t+1}^m, \qquad \overline{\Lambda}_{t+1} = (1 - \varepsilon_t)\, \overline{\Lambda}_t + \varepsilon_t\, \Lambda_{t+1}, \qquad (16)$$
where $\varepsilon_t$ is a sequence of positive numbers with $\sum_t \varepsilon_t = \infty$ (when $\varepsilon_t = (t+1)^{-1}$, $\overline{V}_t$ and $\overline{\Lambda}_t$ are trajectory averages). In (16), the compressions $\Phi_t^m$ are independent of one another. Using the rules defining $\Phi_t^m$ in Section 5, construction of $\Phi_t^m(V_t^m)$ at each step will require $\mathcal{O}(n)$ operations. Since multiplication of the vector $\Phi_t^m(V_t^m)$ by a dense matrix $K$ requires $\mathcal{O}(nm)$ operations, this scheme has an $\mathcal{O}(n)$ storage and cost per iteration requirement. For exactly the same reasons, iterations based on (10) will also have cost and storage requirements of $\mathcal{O}(n)$. As we will see in Section 4, when the mapping $M$ is of the form $v + \varepsilon\, b(v)$ for some small parameter $\varepsilon$, methods of the form (10) can be expected to have smaller error than iteration (9).
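A self-contained sketch of a randomized power iteration of this kind, for an entrywise non-negative matrix, might look as follows. This is our own illustrative code: the compression used here is plain multinomial resampling (a simple stand-in for the schemes of Section 5), and the eigenvalue estimate is a running average after a burn-in period:

```python
import numpy as np

def fri_power_method(K, m, iters, burnin, seed=0):
    """Randomized power iteration for an entrywise non-negative K:
    compress the iterate to at most m non-zero entries (unbiasedly),
    multiply by K, normalize in the 1-norm, and average the
    per-iteration eigenvalue estimates after burn-in."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    v = np.full(n, 1.0 / n)
    lam_bar, count = 0.0, 0
    for t in range(iters):
        # Unbiased compression: m i.i.d. index draws from v, reweighted.
        idx = rng.choice(n, size=m, p=v)
        phi = np.zeros(n)
        np.add.at(phi, idx, 1.0 / m)   # E[phi] = v, at most m non-zeros
        w = K @ phi
        lam = w.sum()                  # ||K phi||_1, since w >= 0
        v = w / lam
        if t >= burnin:
            count += 1
            lam_bar += (lam - lam_bar) / count
    return lam_bar, v
```

Each iteration touches only the columns of `K` indexed by the non-zeros of `phi`, which is the source of the cost reduction when `K` is sparse or very large.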
From $\mathcal{O}(n)$ to $\mathcal{O}(1)$

For many problems even $\mathcal{O}(n)$ cost and storage requirements are unacceptable. But now suppose that $K$ is sparse with $q$ non-zero entries per column. In this case not only would the cost of computing the matrix-vector product $K\,\Phi_t^m(v)$ reduce to $\mathcal{O}(qm)$, but, according to (16), the vectors $V_t^m$ would now have at most $\mathcal{O}(qm)$ non-zero entries and the cost of computing $\Phi_t^m(V_t^m)$ would reduce to $\mathcal{O}(qm)$ operations as well. Thus when $K$ is sparse the total number of floating point operations required by (16) reduces to $\mathcal{O}(qm)$ per iteration. This observation does not hold for methods of the form (10), which will typically result in dense iterates $V_t^m$ and a cost of $\mathcal{O}(n)$ even when $K$ is sparse.
As we have mentioned (and as was true in Section 2), in many settings even storing the full solution vector is impossible. Overcoming this impediment requires a departure from the usual perspective of numerical linear algebra. Instead of trying to approximate all entries of $v_*$, our goal becomes to compute
$$f_* = f\, v_* \qquad (17)$$
for some matrix $f$ with $n$ columns and many fewer ($\mathcal{O}(1)$ in $n$) rows. This change in perspective is reflected in the form of our compression rule error estimate in (13) and in the form of our convergence results in Section 4, which measure error in terms of dot products with test vectors. Indeed, were we to estimate a more standard quantity such as $\left\| V_t^m - v_t \right\|$, we would find that the error decreased proportionally to $(n - m)/n$, requiring that $m$ increase with $n$ to achieve fixed accuracy. The consequence in terms of the algorithm is simply the removal in (16) of the equation defining $\overline{V}_t$ and insertion of
$$\overline{F}_{t+1} = (1 - \varepsilon_t)\, \overline{F}_t + \varepsilon_t\, f\, V_{t+1}^m.$$
While estimation of $f_*$ may seem an unusual goal in the context of classical iterative schemes, it is completely in line with the goals of any Markov chain Monte Carlo scheme, which typically seeks only to compute averages with respect to the invariant measure of a Markov chain and not to completely characterize that measure.
Schemes with $\mathcal{O}(1)$ storage and operations requirements per iteration can easily be designed for any general matrix. Accomplishing this for a dense matrix requires an additional randomization in which columns of $K$ (or of some factor of $K$) are randomly set to zero independently at each iteration. For example, again in the context of power iteration, assuming that $V_{t-1}^m$ has at most $\mathcal{O}(m)$ non-zero entries, one can use in place of (16)
$$V_{t+1}^m \propto \sum_{j \,:\, \left( \Phi_t^m(V_t^m) \right)_j \neq 0} \Phi_t^{m_t^j}\!\left( K_j \right) \left( \Phi_t^m(V_t^m) \right)_j, \qquad (18)$$
where here $K_j$ is used to denote the $j$th column of $K$ and each $\Phi_t^{m_t^j}$ is an independent compression mapping. The number of entries retained in each column is controlled by $m_t^j$, which, if the norms of the columns of $K$ are known in advance (otherwise one can set $m_t^j = 1$), can be set randomly at each iteration by applying, e.g., (6) or some other resampling scheme with $W_t^{(j)}$ replaced by the vector with $j$th entry equal to $\| K_j \|_1 \left| (V_t^m)_j \right|$. This alternative approach also leaves the next iterate, $V_{t+1}^m$, with at most $\mathcal{O}(m)$ non-zero entries. Use of (18) in place of (16) will result in a scheme whose cost per iteration is independent of $n$ if the compressions of the columns have cost independent of $n$. This may be possible without introducing significant error, for example, when the entries in the columns of $K$ can take only a small number of values. However, use of (18) does not reduce the number of entries of $K$ that must be accessed at any iteration and it will always increase error. In our numerical examples we do not apply the approach in (18) and we expect that in many applications effective constant cost per iteration schemes will only be possible if $K$ is sparse.

Perturbations of identity
We now consider the case that $M$ is a perturbation of the identity, i.e., that
$$M(v) = v + \varepsilon\, b(v), \qquad (19)$$
where $\varepsilon$ is a small positive parameter. This case is of particular importance because, when the goal is to solve a differential equation initial value problem
$$\frac{d}{dt}\, y = b(y), \qquad y(0) = y_0, \qquad (20)$$
discrete-in-time approximations take the form (8) with $M$ of the form in (19). As is the case in several of our numerical examples, the solution to (20) may represent, for example, a semidiscretization (a discretization in space) of a partial differential equation (PDE). Several common tasks in numerical linear algebra not necessarily directly related to ODEs or PDEs can also be addressed by considering (20). For example, suppose that we solve the ordinary differential equation (ODE) (20) with $b(y) = Ay - r$ for some $r \in \mathbb{C}^n$ and any $n \times n$ complex valued matrix $A$. The solution to (20) in this case is $y(t) = e^{tA} y_0 + A^{-1}\left( I - e^{tA} \right) r$.
Setting $r = 0$ in the last display we find that any method to approximate the solution of ODE (20) at $t = 1$ can be used to approximate the product of a given vector and the exponential of the matrix $A$. On the other hand, if $r \neq 0$ and all eigenvalues of $A$ have negative real part then, for very large $t$, the solution to (20) converges to $A^{-1} r$. In fact, in this case we obtain the continuous time variant of Jacobi iteration for the equation $Ax = r$. Like Jacobi iteration, it can be extended to somewhat more general matrices. Discretizing (20) in time with $b(y) = Ay - r$ and a small time step allows treatment of matrices with a wider range of eigenvalues than would be possible with standard Jacobi iteration. Some important eigenproblems are also solved using an $M$ satisfying (19). For example, this is the case when the goal is to compute the eigenvalue/eigenvector pair corresponding to the eigenvalue of largest real part (rather than of largest magnitude) of a differential operator, e.g., the Schrödinger operator discussed in Section 2. While the power method applied directly to a matrix $A$ converges to the eigenvector of $A$ corresponding to the eigenvalue of largest magnitude, the angle between the vector $\exp(tA) y_0$ and the eigenvector corresponding to the eigenvalue of $A$ with largest real part converges to 0 (assuming $y_0$ is not orthogonal to that eigenvector). If we discretize (20) in time with $b(v) = Av$ and renormalize the solution at each time step (to have unit norm) then the iteration will converge to the desired eigenvector (or an $\varepsilon$-dependent approximation of that eigenvector).
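Both uses of (20) can be checked deterministically on a small example. In the sketch below the $2 \times 2$ matrix, the vector $r$, and all parameters are invented for illustration; the eigenvalues of $A$ have negative real part, so the Euler discretization $y \leftarrow y + \varepsilon(Ay - r)$ converges to $A^{-1}r$, while $1/\varepsilon$ steps with $r = 0$ approximate $e^{A} y_0$.

```python
import numpy as np

# A small illustrative matrix with eigenvalues of negative real part.
A = np.array([[-2.0, 0.5],
              [0.3, -1.5]])
r = np.array([1.0, 2.0])
eps = 0.01

# Long-time limit: y <- y + eps*(A y - r) converges to A^{-1} r,
# the continuous-time analogue of Jacobi iteration for A x = r.
y = np.zeros(2)
for _ in range(5000):
    y = y + eps * (A @ y - r)

# Finite-time use with r = 0: 1/eps steps approximate exp(A) y0.
y0 = np.array([1.0, -1.0])
z = y0.copy()
for _ in range(int(1.0 / eps)):
    z = z + eps * (A @ z)
```

The first loop realizes the "continuous time Jacobi" limit; the second realizes matrix exponentiation, with an $\mathcal{O}(\varepsilon)$ Euler discretization error.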
As we will learn in the next section, designing effective fast randomized iteration schemes for these problems requires that the error in the stochastic representation $M \circ \Phi_t^m$ of $M$ decrease sufficiently rapidly with $\varepsilon$. In particular, in order for our schemes to accurately approximate solutions to (20) over intervals of $\mathcal{O}(1)$ units of time (i.e., over $\mathcal{O}(\varepsilon^{-1})$ iterations of the discrete scheme), we will need, and will verify in certain cases, a bound of the form (21). Obtaining a bound of this type will require that we use a carefully constructed random compression $\Phi_t^m$ such as those described in Section 5. In fact, when a scheme with $\mathcal{O}(n)$ cost per iteration is acceptable, iteration (9) can be replaced by (10), in which case we can expect errors over $\mathcal{O}(\varepsilon^{-1})$ iterations that vanish with $\varepsilon$ (rather than merely remaining stable). As we will see in more detail in the next section, the principle of dynamic self-averaging is essential to the convergence of either (9) or (10) when $M$ is a perturbation of identity. The same principle is invoked in the contexts of, for example, multi-scale simulation (see e.g. [45] and [22] and the many references therein) and stochastic approximation (see e.g. [36] and the many references therein).

Convergence
Many randomized schemes in numerical linear algebra (see e.g. [15,19,20,21,23,37,39,46,47,56]) rely at their core on approximation of a product such as $AB$, where, for example, $A$ and $B$ are $n \times n$ matrices, by a product of the form $A\Theta B$, where $\Theta$ is an $n \times n$ random matrix with $\mathbb{E}[\Theta] = I$ chosen so that $A\Theta B$ can be assembled at much less expense than $AB$. For example, one might choose $\Theta$ to be a diagonal matrix with only $m \ll n$ non-zero entries on the diagonal so that $\Theta B$ has only $m$ non-zero rows and $A\Theta B$ can be assembled in $\mathcal{O}(n^2 m)$ operations instead of $\mathcal{O}(n^3)$ operations. Alternatively one might choose $\Theta = \xi \xi^{\mathrm{T}}$ where $\xi$ is an $n \times m$ random matrix with independent entries, each having mean 0 and variance $1/m$. With this choice one can again construct $A\Theta B$ in $\mathcal{O}(n^2 m)$ operations. Typically, this randomization is carried out once in the course of the algorithm. The error made in such an approximation can be expected to be of size $\mathcal{O}(1/\sqrt{m})$, where the prefactor depends (very roughly) on the size of the matrices (and other structural properties) but does not depend directly on $n$.
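The diagonal-sampling choice of $\Theta$ can be sketched as follows; the dimensions and the uniform sampling probabilities below are illustrative, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 10
A = rng.standard_normal((n, n)) / np.sqrt(n)
B = rng.standard_normal((n, n)) / np.sqrt(n)

# Diagonal Theta with m of n nonzero entries, scaled so that E[Theta] = I:
# each index is kept with probability m/n and weighted by n/m.
idx = rng.choice(n, size=m, replace=False)
theta = np.zeros(n)
theta[idx] = n / m

# A Theta B touches only m rows of B: O(n^2 m) work instead of O(n^3).
approx = (A[:, idx] * theta[idx]) @ B[idx, :]
exact = A @ B
```

Here `approx` equals $A\,\mathrm{diag}(\theta)\,B$ assembled without ever forming the full product, and averaging over independent draws of `idx` recovers `exact` in expectation.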
In the schemes that we consider, we apply a similar randomization to speed up matrix-vector multiplication at each iteration of the algorithm (though our compression rules depend on the vector appearing in the product). As explored below, the consequence is that any stability property of the original, deterministic iteration responsible for its convergence will be weakened by the randomization, and that effect may introduce an additional $n$ dependence in the cost of the algorithm to achieve a fixed accuracy. The compression rule must therefore be carefully constructed to minimize error. This is discussed in detail in Section 5. In this section, we consider the error resulting from (9) and (10) for an unspecified compression rule satisfying the generic error properties established (with caveats) in Section 5. Both because it provides a dramatic illustration of the need to construct accurate compression rules and because of its importance in practical applications, we pay particular attention to the case in which $M$ is an $\varepsilon$-perturbation of the identity. Our results rely on classical techniques in the numerical analysis of deterministic and stochastic dynamical systems and, in particular, are typical of basic results concerning the convergence of stochastic approximation (see e.g. [36] for a general introduction and [40] for results in the context of machine learning) and interacting particle methods (see e.g. [17] for a general introduction and [49] for results in the context of QMC). They concern the mean squared size of the difference between the output, $V_t^m$, of the randomized scheme and the output, $v_t$, of its deterministic counterpart, and are chosen to efficiently highlight important issues such as the role of stability properties of the deterministic iteration (8), the dependence of the error on the size of the solution vector, $n$, and the role of dynamic self-averaging.
More sophisticated results (such as Central Limit Theorems and asymptotic and non-asymptotic exponential bounds on deviation probabilities) are possible following developments in, for example, [36] and [17,49].
Our notion of error will be important. It will not be possible to prove, for example, that $\mathbb{E}\big[\|\Phi_t^m(v) - v\|_1\big]$ is small when $n$ is large and $\|v\|_1 = 1$. Take, for example, the case that $v_i = 1/n$. In this case any scheme that sets $n - m$ entries to zero will result in an error $\|\Phi_t^m(v) - v\|_1 \ge (n - m)/n$. On the other hand, we need to choose a measure of error sufficiently stringent that our eventual error bounds imply that our methods accurately approximate observables of the form $f^{\mathrm{H}} v_t$. For example, analogues of all of the results below using a componentwise error metric such as $\max_j \mathbb{E}\big[|X_j|^2\big]^{1/2}$ could be established. However, error bounds of this form are not, by themselves, enough to imply bounds on the error in $f^{\mathrm{H}} v_t$ because they ignore correlations between the components of $V_t^m$. Remark 5. It is perhaps more typical in numerical linear algebra to state error bounds in terms of the quantity one ultimately hopes to approximate and not in terms of the distance to another approximation of that quantity. For example, one might wonder why our results aren't stated in terms of the total work required to achieve (say, with high probability) an error of a specified size in an approximation of the dominant eigenvalue of a matrix. Our choice to consider the difference between $V_t^m$ and $v_t$ is motivated by the fact that the essential characteristics contributing to errors due to randomization are most naturally described in terms of the map defining the deterministic iteration. More traditional characterizations of the accuracy of the schemes can be inferred from the bounds provided below and error bounds for the corresponding deterministic iterative schemes.
Motivated by our stated goal, as described in Section 3, of estimating quantities of the form $f^{\mathrm{H}} v_t$, we measure the size of the (random) errors produced by our scheme using the norm $|||X||| = \sup_{\|f\|_\infty \le 1} \mathbb{E}\big[|f^{\mathrm{H}} X|^2\big]^{1/2}$, where $X$ is a random variable with values in $\mathbb{C}^n$ (all random variables referenced are assumed to be functions on a single probability space, which will be left unspecified). This norm is the $(\infty, 2)$-norm of the square root of the second moment matrix of $X$.
It is not difficult to see that the particular square root chosen does not affect the value of the norm. It will become apparent that our choice of the norm in (22) is a natural one for our convergence results in this and the next section.
The following alternate characterization of |||·||| will be useful later.
Lemma 1. The norm in (22) may also be expressed as the supremum in (23), where (24) is the dual norm of the $\infty$-norm of $G \in \mathbb{C}^{n \times n}$. Proof. By the definition of $|||\cdot|||$ and Jensen's inequality we obtain one direction of the inequality. The other inequality follows by noting that, for any $f \in \mathbb{C}^n$ with $\|f\|_\infty \le 1$, the matrix $G$ with first row equal to $f^{\mathrm{H}}$ and all other rows zero satisfies the constraint (25). That (24) is the dual norm of the $\infty$-norm follows from straightforward verification, or see [25, Proposition 6.3].
Note that if the variable $X$ is not random then, choosing $f$ appropriately in (22), one finds that $|||X||| = \|X\|_1$. When $X$ is random we have the upper bound $|||X|||^2 \le \mathbb{E}\big[\|X\|_1^2\big]$. If, on the other hand, $X$ is random but has mean zero and independent components then $|||X|||^2 = \mathbb{E}\big[\|X\|_2^2\big]$. Concerning the relationship between these two norms more generally, we rely on the following lemma.
Lemma 2. Let $A$ be any $n \times n$ Hermitian matrix with entries in $\mathbb{C}$. Then $\sup_{\|f\|_\infty \le 1} f^{\mathrm{H}} A f \ge \operatorname{tr}(A)$. Proof. The result holds in $n = 1$ dimensions. Suppose that the result holds in $n - 1$ dimensions. We will show that it must also therefore hold in $n$ dimensions and conclude, by induction, that the result holds in any dimension.
Let $\tilde{A}$ be the $(n-1) \times (n-1)$ principal sub-matrix of an $n \times n$ matrix $A$. For any vector $f \in \mathbb{C}^n$ we can write $f^{\mathrm{H}} A f = \tilde{f}^{\mathrm{H}} \tilde{A} \tilde{f} + 2\,\mathrm{Re}\big(\bar{f}_n \sum_{j=1}^{n-1} A_{nj} f_j\big) + A_{nn} |f_n|^2$, where $\tilde{f} \in \mathbb{C}^{n-1}$ has entries equal to the first $n - 1$ entries of $f$. By the induction hypothesis, we can choose the first $n - 1$ entries of $f$ (i.e., $\tilde{f}$) so that the right hand side of the last display is not less than $\operatorname{tr}(\tilde{A}) + 2\,\mathrm{Re}\big(\bar{f}_n \sum_{j=1}^{n-1} A_{nj} f_j\big) + A_{nn} |f_n|^2$. If, for this choice of $\tilde{f}$, $\sum_{j=1}^{n-1} A_{nj} f_j$ is non-zero then choose $f_n$ as $\sum_{j=1}^{n-1} A_{nj} f_j \big/ \big|\sum_{j=1}^{n-1} A_{nj} f_j\big|$. Otherwise set $f_n = 1$. With the resulting choice of $f_n$ the cross term is non-negative, and we have therefore shown that $\sup_{\|f\|_\infty \le 1} f^{\mathrm{H}} A f \ge \operatorname{tr}(\tilde{A}) + A_{nn} = \operatorname{tr}(A)$. Lemma 2, applied to the second moment matrix of $X$, implies that $|||X|||^2 \ge \mathbb{E}\big[\|X\|_2^2\big]$. Summarizing these relationships we have $\mathbb{E}\big[\|X\|_2^2\big] \le |||X|||^2 \le \mathbb{E}\big[\|X\|_1^2\big]$. Consistent with results in the next section, we will assume that the typical error from our compression rule satisfies $|||\Phi_t^m(v) - v||| \le \frac{\gamma}{\sqrt{m}}\, \|v\|_1$ (27) for $v \in \mathbb{C}^n$, where $\gamma$ is a constant that is independent of $m$ and $n$. We will also assume that (28) holds for some constant $C$ independent of $m$ and $n$. For all of the compression methods discussed in the next section, the bias $\|\mathbb{E}[\Phi_t^m(v)] - v\|_1$ is negligible compared to the statistical error in (27). To slightly simplify expressions we will omit contributions to our bounds from compression bias (we do consider the bias of the overall iterative procedure). In other words, we will assume that (11) is satisfied exactly.
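The sandwich $\mathbb{E}\|X\|_2^2 \le |||X|||^2 \le \mathbb{E}\|X\|_1^2$ can be checked numerically in small dimension. The sketch below assumes, for illustration, that $X$ is real, in which case the supremum of the quadratic form $f^{\mathrm{T}} S f$ ($S$ the second moment matrix) over $\|f\|_\infty \le 1$ is attained at a sign vector and can be brute-forced; the dimensions and scalings are invented.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# A real random vector X with independent, differently scaled entries.
n, reps = 4, 20000
X = rng.standard_normal((reps, n)) * np.array([1.0, 0.5, 0.25, 0.1])

S = X.T @ X / reps                        # empirical second moment matrix
# |||X|||^2 = sup_{||f||_inf <= 1} f^T S f; for PSD S the supremum over the
# cube is attained at a vertex, so enumerate the 2^n sign vectors.
norm2 = max(np.array(f) @ S @ np.array(f)
            for f in product([-1.0, 1.0], repeat=n))
l2 = (X ** 2).sum(axis=1).mean()          # E ||X||_2^2 = trace(S)
l1 = (np.abs(X).sum(axis=1) ** 2).mean()  # E ||X||_1^2
```

By Lemma 2 applied to $S$, `norm2` is at least `l2`, and since $|f^{\mathrm{T}} x| \le \|x\|_1$ for $\|f\|_\infty \le 1$ it is at most `l1`.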
As a result of the appearance of $\|v\|_1$ in (27), in our eventual error bounds it will be impossible to avoid dependence on $\|V_t^m\|_1$. The growth of these quantities is controllable by increasing $m$, but the value of $m$ required will often depend on $n$. The next theorem concerns the size of $\mathbb{E}\big[\|V_t^m\|_1^2\big]$. After the statement and proof of the theorem we discuss how the various quantities appearing there can be expected to depend on $n$. In this theorem and in the rest of our results it will be convenient to recognize that, in many applications, the iterates $V_t^m$ and $Y_t^m = \Phi_t^m(V_t^m)$ are confined within some subset of $\mathbb{C}^n$. For example, the iterates may all have a fixed norm or may have all non-negative entries. We use the symbol $\mathcal{X}$ to identify this subset (which differs depending on the problem). Theorem 6. Suppose that, for some function $U$ and constants $\alpha \ge 0$ and $R$, $U(M(v)) \le \alpha\, U(v) + R$ for all $v \in \mathcal{X}$, and that, for some constant $\beta$, $\|v\|_1^2 \le \beta\, U(v)$ for all $v \in \mathcal{X}$. Assume further that there is a constant $C$ and some matrix $G \in \mathbb{C}^{n \times n}$ satisfying (25) controlling the second-order increments of $U$ for all $z \in \mathbb{C}^n$. Using the fact that $U$ is twice differentiable with bounded second derivative, the expansion of $\mathbb{E}\big[U(V_{t+1}^m)\big]$ is bounded above by a drift term plus a compression-error term. As a consequence, noting (27), we arrive at a recursive upper bound from which the stated bound on $\mathbb{E}\big[\|V_t^m\|_1^2\big]$ follows. First, the reader should notice that setting $\gamma = 0$ in Theorem 6 shows that the deterministic iteration (8) is stable whenever $\alpha < 1$. However, even for an $M$ corresponding to a stable iteration, the randomized iteration (9) may not be stable (and will, in general, be less stable). If the goal is to estimate, e.g., a fixed point of $M$, the user will first have to choose $m$ large enough that the randomized iteration is stable.
Though it is not explicit in the statement of Theorem 6, in general the requirements for stability will depend on $n$. Consider, for example, the case of a linear iteration, $M(v) = Kv$. This iteration is stable if the largest eigenvalue (in magnitude) of $K$ is less than 1. If we choose $U(v) = \|v\|_2^2$ then we can take $\alpha$ to be the largest eigenvalue of $K^{\mathrm{H}} K$ and $R = 0$ in the statement of Theorem 6. The bound $\|\cdot\|_1 \le \sqrt{n}\, \|\cdot\|_2$ (and the fact that it is sharp) suggests that we will have to take $\beta = n$ in Theorem 6 (note that we can take $C = 1$ in the eventual bound). This scaling suggests that, to guarantee stability, we need to choose $m \sim n$.
Fortunately this prediction is often (but not always) pessimistic. For example, if $K$ is a matrix with non-negative real entries and $V_0^m$ has non-negative real entries then the iterates $V_t^m$ will have real, non-negative entries (i.e., $v \in \mathcal{X}$ implies $v_i \ge 0$). We can therefore use $U(v) = \|v\|_1^2$ for $v \in \mathcal{X}$ and find that we can take $\alpha = \|K\|_1^2$, $R = 0$, and $\beta = 1$ in the statement of Theorem 6. With this choice of $U$ we can again choose $C = 1$, so that $n$ does not appear directly in the stability bound. We anticipate that most applications will fall somewhere between these two extremes; maintaining stability will require increasing $m$ as $n$ is increased, but not in proportion to the increase in $n$.
Having characterized the stability of our schemes we now move on to bounding their error. We have crafted the theorem below to address both situations in which one is interested in the error after a finite number of iterations and situations that require error bounds independent of the number of iterations. In general, achieving error bounds independent of the number of iterations requires that $M$ satisfy stronger stability properties than those implied by the conditions in Theorem 6. While the requirements in Theorem 6 could be modified to imply the appropriate properties for most applications, we opt instead for a more direct approach and modify our stability assumptions on $M$ to (29) and (30) below. In this theorem and below we will make use of the notation $M_s^t$ to denote $M$ composed with itself $t - s$ times. In our proof of the bound in Theorem 7 below we will divide the error into two terms, one of which is a sum of individual terms with vanishing conditional expectations. Those individual summands are martingale differences, and the size of their sum can be expected to grow less than linearly with the number of iterations. This general observation is called dynamic self-averaging and results in an improved error bound. The improvement is essential in the context of perturbations of the identity and we will mention it again below when we focus on that case.
Theorem 7. Suppose that the iterates $V_t^m$ of (9) remain in $\mathcal{X} \subset \mathbb{C}^n$. Fix a positive integer $T$. Assume that, for some $\alpha \ge 0$ and some constants $L_1$ and $L_2$: for every pair of integers $s \le r \le T$ and every vector $f \in \mathbb{C}^n$ with $\|f\|_\infty \le 1$ there is a matrix $G \in \mathbb{C}^{n \times n}$ satisfying (25) and (29); and for every $f \in \mathbb{C}^n$ with $\|f\|_\infty \le 1$ and some matrix $G \in \mathbb{C}^{n \times n}$ satisfying (25), for all $v \in \mathcal{X}$ there is a matrix $A$ (a measurable function of $v$) so that (30) holds for all $\tilde{v} \in \mathcal{X}$. Then the error at step $t \le T$ satisfies the stated bound. Proof. We begin with a standard expansion of the scheme's error.

and the last equation becomes a sum of terms that we now bound. The right hand side of the last equation is bounded above by two contributions, a martingale term and a compression-error term. Considering the first term in the last display, note that, for any fixed $f \in \mathbb{C}^n$, and letting $\mathcal{F}_r$ denote the $\sigma$-algebra generated by $\{V_s^m\}_{s=0}^{r}$ and $\{Y_s^m\}_{s=0}^{r-1}$, for $s < r$ we can write the corresponding conditional expectation explicitly. Because, conditioned on $V_r^m$, $Y_r^m$ is independent of $\mathcal{F}_r$, the expression above vanishes exactly. Taking the supremum over the choice of $f$, we have shown that these cross terms vanish. Expanding the term inside of the square root, and using the triangle inequality for the $\ell^2$ norm in $\mathbb{R}^t$, we obtain a bound on the martingale term. If $A_r \in \mathbb{C}^{n \times n}$ is any matrix-valued random variable, measurable with respect to the $\sigma$-algebra generated by $V_r^m$, then the corresponding term can be controlled. As a consequence, applying our assumptions (29) and (30), we obtain the desired upper bound. Bounding the error from the random compressions we arrive at the error bound in the statement of the theorem. Condition (29) is easily verified for linear maps $M = K$ when the magnitude of the dominant eigenvalue of $K$ is less than 1. The condition is more difficult to verify for the power iteration map $M(v) = Kv/\|Kv\|_1$. A condition close to (29) holds for all $v, \tilde{v}$ with $\|v\|_1 = \|\tilde{v}\|_1 = 1$, but the parameter $\alpha$ in general depends on the proximity of $v$ and $\tilde{v}$ to the space spanned by all of the non-dominant eigenvectors (see e.g. [54]). For large enough $m$ we can ensure that all iterates $V_t^m$ remain at least some fixed distance from the space spanned by the non-dominant eigenvectors, but this will often require that $m$ grow with $n$. Moreover, almost sure bounds separating the iterates from the non-dominant eigenspace are more than is required to prove convergence of the randomized schemes.
Again we can identify cases in which condition (29) holds for the power iteration map without direct dependence on $n$ in the bound. When the matrix $K$ is real and non-negative we can assume that the iterates $V_t^m$ are also real and non-negative, i.e., the iterates remain in the set of entrywise non-negative vectors. On the other hand, if $K$ is also irreducible, then the non-dominant eigenspace of $K$ consists of vectors with both positive and negative entries, and our iterates, by virtue of being non-negative, are bounded away from this space. For more general problems one can expect the speedup over the standard deterministic power method to be roughly between a factor of $n$ and no speedup at all (it is clear that the randomized scheme can be worse than its deterministic counterpart when that method is a reasonable alternative). A complete description of the speedup in the particular case that $M$ is the power iteration map would be very interesting but is not pursued here.
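The favorable non-negative case can be illustrated numerically. The sketch below runs a randomized power iteration of the form (9) with a simple multinomial compression standing in for the rules of Section 5; the matrix, the value of $m$, and the iteration counts are invented, and late iterates are averaged to exploit self-averaging.

```python
import numpy as np

rng = np.random.default_rng(3)

def compress(v, m, rng):
    """Unbiased multinomial compression of an entrywise nonnegative vector,
    preserving the 1-norm (illustrative stand-in for Phi)."""
    return rng.multinomial(m, v / v.sum()) * (v.sum() / m)

n, m = 100, 30
K = rng.random((n, n)) + 0.1     # nonnegative, irreducible, large Perron gap
V = np.ones(n) / n
avg = np.zeros(n)
for t in range(200):
    V = K @ compress(V, m, rng)
    V = V / V.sum()              # renormalize in the 1-norm
    if t >= 100:
        avg += V                 # average late iterates (self-averaging)
avg /= 100
```

Because the iterates stay non-negative, they remain bounded away from the non-dominant eigenspace, and the averaged iterate aligns closely with the Perron eigenvector despite $m \ll n$.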
Bias. Even for a very general iteration the effect of randomization is evident when one considers the size of the expected error (rather than the expected size of the error). When $M(v) = Kv$ and $V_t^m$ is generated by (9), one can easily check that $z_t = \mathbb{E}[V_t^m]$ satisfies the iteration $z_{t+1} = K z_t$, i.e., $z_t = v_t$. Even when the mapping $M$ is non-linear, the expected error, $\mathbb{E}[V_t^m] - v_t$, is often much smaller than the error, $V_t^m - v_t$, itself. The following is just one simple result in this direction and demonstrates that one can often expect the bias to be $\mathcal{O}(m^{-1})$ (which should be contrasted with an expected error of $\mathcal{O}(m^{-1/2})$). The proof is very similar to the proof of Theorem 7 and is omitted.
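For a linear map the absence of bias can be checked directly by Monte Carlo. In the sketch below the matrix, $m$, and the repetition count are invented for illustration, and a simple unbiased multinomial compression stands in for the rules of Section 5.

```python
import numpy as np

rng = np.random.default_rng(4)

def compress(v, m, rng):
    """Unbiased multinomial compression: E[compress(v)] = v."""
    s = np.abs(v).sum()
    counts = rng.multinomial(m, np.abs(v) / s)
    return np.sign(v) * counts * (s / m)

K = np.array([[0.5, 0.2],
              [0.1, 0.6]])
v0 = np.array([1.0, 1.0])
t, m, reps = 3, 8, 20000

# z_t = E[V_t^m] satisfies z_{t+1} = K z_t, so E[V_t^m] = K^t v0 exactly.
acc = np.zeros(2)
for _ in range(reps):
    V = v0.copy()
    for _ in range(t):
        V = K @ compress(V, m, rng)
    acc += V
mean_Vt = acc / reps
exact = np.linalg.matrix_power(K, t) @ v0
```

Individual trajectories fluctuate at size $\mathcal{O}(m^{-1/2})$, but their average matches $K^t v_0$ up to Monte Carlo error, consistent with the zero-bias claim for linear $M$.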
Theorem 8. Under the same assumptions as in Theorem 7, the bias at step $t \le T$ satisfies the stated bound.

Perturbations of identity. When the goal is to solve ordinary or partial differential equations, stronger assumptions on the structure of $M$ are appropriate. We now consider the case in which $M$ is a perturbation of the identity. More precisely, we will assume that $M(v) = v + \varepsilon\, b(v)$ (33). Though we will not write it explicitly, further dependence of $b$ on $\varepsilon$ is allowed as long as the assumptions on $b$ below hold uniformly in $\varepsilon$. In the differential equations setting $\varepsilon$ can be thought of as a time discretization parameter, as in Section 2. When $M$ is a perturbation of identity, it is reasonable to strengthen our assumptions on the error made at each compression step. The improvement stems from the fact that the mapping $M$ nearly preserves the sparsity of its argument. As we will explain in detail in the next section, if $v \in S_m$, where $S_m = \{z \in \mathbb{C}^n : \#\{j : z_j \neq 0\} \le m\}$, and $w \in \mathbb{C}^n$, then it is reasonable to assume that, for example, a bound of the form (32) holds for some constant $C$ independent of $m$ and $n$. The following lemma illustrates how such a bound on the compression rule can translate into small compression errors when $M$ is a perturbation of the identity.
Lemma 3. Suppose that (32) and (28) hold. If $M(v) = v + \varepsilon\, b(v)$ and, for some constant $L$, $\|b(v)\|_1 \le L\,(1 + \|v\|_1)$ for all $v \in \mathcal{X}$, then, for $V_t^m$ generated by (9), the compression error satisfies the stated bound for some constant $\tilde{\gamma}$. Proof. By (32), the compression error at each step is bounded in terms of $\varepsilon\, \|b(V_t^m)\|_1$ for some constant $C$. Our assumed bound on the growth of $b$, along with (28), then implies a bound in terms of $1 + \|V_t^m\|_1$ for some constant $C'$. From these bounds it follows that the stated estimate holds for some constant $\tilde{\gamma}$. We now provide versions of Theorems 6 and 7 appropriate when $M$ is a perturbation of identity. The proofs of both of these theorems are very similar to the proofs of Theorems 6 and 7 and are, at least in part, omitted. First we address stability in the perturbation of identity case.
Theorem 9. Suppose that the iterates $V_t^m$ of (9) remain in $\mathcal{X} \subset \mathbb{C}^n$ and that (33) holds. Suppose further that $U$ satisfies the conditions in the statement of Theorem 6, with the exception of the modified, $\varepsilon$-dependent drift condition displayed above. Then the stated bound holds, where the various constants appearing in this expression were defined in and before the statement of Theorem 6.
What is important about the statement of Theorem 9 is that the bound remains stable as $\varepsilon$ decreases, despite the fact that the set being supremized over grows. Under the assumptions in the theorem (which are only reasonable when $M$ is a perturbation of the identity) one can expect that the iterates can be bounded over $\mathcal{O}(\varepsilon^{-1})$ iterations uniformly in $\varepsilon$.
The following theorem interprets the result of Theorem 7 when $M$ is a perturbation of identity. One might expect that, over $\mathcal{O}(\varepsilon^{-1})$ iterations, $\mathcal{O}(\sqrt{\varepsilon})$ errors made during the compression step would accumulate and lead to an error of $\mathcal{O}(\varepsilon^{-1/2})$. Indeed, this is exactly what would happen if the errors made in the compression step were systematic (i.e., if the compression bias were $\mathcal{O}(\sqrt{\varepsilon})$). Fortunately, when the compression rule satisfies the consistency criterion (11) the errors self-average and their effect on the overall error of the scheme is reduced. As mentioned above, this phenomenon played a role in the structure of the result in Theorem 7 and its proof, but its role is more crucial in Theorem 10, which provides uniform-in-$\varepsilon$ bounds on the error of (9) over $\mathcal{O}(\varepsilon^{-1})$ iterations. Without the reduction in the growth of the error with $t$ provided by self-averaging it would not be possible to achieve an error bound over $\mathcal{O}(\varepsilon^{-1})$ iterations that is stable as $\varepsilon$ decreases.
Theorem 10. Suppose that the iterates $V_t^m$ of (9) remain in $\mathcal{X} \subset \mathbb{C}^n$ and that (33) holds. Fix a real number $T > 0$. Assume that, for some real number $\beta$ and some constants $L_1$ and $L_2$: for every pair of integers $s \le r \le T/\varepsilon$ and every vector $f \in \mathbb{C}^n$ with $\|f\|_\infty \le 1$ there is a matrix $G \in \mathbb{C}^{n \times n}$ satisfying (25) and the appropriate analogue of (29); and for every $f \in \mathbb{C}^n$ with $\|f\|_\infty \le 1$ and some matrix $G \in \mathbb{C}^{n \times n}$ satisfying (25), for all $v \in \mathcal{X}$ there is a matrix $A$ (a measurable function of $v$) so that the analogue of (30) holds for all $\tilde{v} \in \mathcal{X}$. Then the error at step $t \le T/\varepsilon$ satisfies the stated bound, where $M_T^2 = \sup_{r < T/\varepsilon} \mathbb{E}\big[\|V_r^m\|_1^2\big]$. Proof. By exactly the same arguments used in the proof of Theorem 7 we arrive at a bound on the martingale term. Bounding the error from the random compressions we arrive at the error bound in the statement of the theorem. Though the error established in the last claim is stable as $\varepsilon$ decreases, we have mentioned in Section 3.2 that when $M$ is a perturbation of identity, by using iteration (10) instead of (9), one might be able to obtain errors that vanish as $\varepsilon$ decreases (keeping $m$ fixed). This is the subject of Theorem 11 below which, like Theorem 10, relies crucially on self-averaging of the compression errors. Note that iteration (10) typically requires $\mathcal{O}(n)$ operations per iteration and storage of length-$n$ vectors. We have the following theorem demonstrating the decrease in error with $\varepsilon$ in this setting.
Theorem 11. Suppose that the iterates $V_t^m$ of (10) remain in $\mathcal{X} \subset \mathbb{C}^n$ and that (27) holds. Under the same assumptions on $M$ as in Theorem 10, the error at step $t \le T/\varepsilon$ satisfies the stated bound. Proof. By an argument very similar to that in the proof of Theorem 7 we arrive at a bound which, also as in that proof, can be controlled term by term. By (34) and Lemma 1 we find that $|||b(Y_r^m) - b(V_r^m)||| \le L_1\, |||Y_r^m - V_r^m|||$. The rest of the argument proceeds exactly as in the proof of Theorem 7.

Compression rules
In this section we give a detailed description of the compression rule used in our numerical simulations, as well as several others, and an analysis of the accuracy of those schemes. Our compression schemes have the general form in Algorithm 1 (Data: $v \in \mathbb{C}^n$ with all nonzero entries, $m \in \mathbb{N}$; Result: $V = \Phi^m(v) \in \mathbb{C}^n$ with at most $m$ nonzero entries; the counter $\ell$ of exactly preserved entries is initialized to $\ell = 0$). Algorithm 1: A simple compression rule. Programmed efficiently, and assuming that $v$ has
exactly $n$ nonzero entries, all of the schemes we discuss in this section will require at most $\mathcal{O}(n)$ floating point operations, including the generation of as few as one uniform random variate, and $\mathcal{O}(n + m \log n)$ floating point comparisons. It is likely that better compression schemes are possible, for example by incorporation of ideas from [30]. The reader should note that in this section $n$ represents the number of non-zero entries in the input vector $v$ of the compression rule and not the dimension associated with a particular problem (which may be much larger). What is left unspecified by the steps in Algorithm 1 is the choice of the joint distribution of the $N_j$. For many applications, the details of this choice are crucial. The simplest possibilities, such as choosing the $N_j$ to be distributed according to a multinomial distribution with probabilities $p_j = |v_j|/\|v\|_1$, can easily result in unstable schemes. Before detailing better choices, we remark on the deterministic procedure preceding the generation of the $N_j$. Let $\sigma$ be a permutation of $\{1, 2, \ldots, n\}$ so that the elements of $v^\sigma$ have decreasing magnitude, and let $\tau_v^m$ be the number of entries of $v$ that are exactly preserved (in $V$) during the while loop in Algorithm 1. Note that if the number of nonzero entries in $v$ exceeds $m$ then $\tau_v^m < m$.
Proof. Observe that if $\tau_v^m > 0$, then the condition in the while loop of Algorithm 1 holds for some $\ell \le \tau_v^m$. From the definition of $\tau_v^m$ and the fact that $\ell \le \tau_v^m$, we must also have the corresponding bound on the remaining entries. Combining the last two inequalities yields the result. As we will find below, this lemma implies that our compression rules will have errors that decrease as $\tau_v^m$ increases, justifying the presence of this additional deterministic step. In cases where the $|v^\sigma_j|$ decrease rapidly with $j$, the reduction in error can be dramatic.
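The deterministic exact-preservation step of Algorithm 1 can be sketched as follows. This is an illustration only: the preservation criterion mirrors the while loop described above (keep any entry whose magnitude is at least the remaining 1-norm divided by the remaining budget), but the residual budget is distributed multinomially rather than by the systematic or stratified rules analyzed below, and the function name is ours.

```python
import numpy as np

def compress_with_exact_preservation(v, m, rng):
    """Sketch of Algorithm 1: exactly preserve dominant entries, then
    distribute the remaining budget randomly (here multinomially)."""
    v = np.asarray(v, dtype=float)
    V = np.zeros_like(v)
    remaining = np.abs(v).copy()
    budget = m
    # Deterministic phase: preserve entries that dominate the remainder.
    while budget > 0:
        j = np.argmax(remaining)
        if remaining[j] == 0.0:
            break
        if budget * remaining[j] < remaining.sum():
            break
        V[j] = v[j]
        remaining[j] = 0.0
        budget -= 1
    # Random phase: unbiased, 1-norm-preserving allocation of the rest.
    s = remaining.sum()
    if s > 0 and budget > 0:
        counts = rng.multinomial(budget, remaining / s)
        V += np.sign(v) * counts * (s / budget)
    return V
```

The output has at most $m$ nonzero entries, preserves $\|v\|_1$ exactly, and is unbiased; entries large enough to trip the while-loop condition are reproduced without any statistical error, which is the source of the error reduction quantified via $\tau_v^m$.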
Returning now to a more detailed description of the joint distribution of the $N_j$, we begin with a particularly simple choice (and one that will have to be modified later) suggested by common practice in diffusion Monte Carlo applications. For each $j$ generate an independent uniform random variable $U^{(j)}$ in $(0, 1)$, and set $N_j = \lfloor x_j \rfloor + \mathbf{1}\{U^{(j)} < x_j - \lfloor x_j \rfloor\}$, where $x_j = (m - \ell)\,|v_j|/\|v\|_1$ (37). One can easily check that $\mathbb{E}[N_j] = (m - \ell)\,|v_j|/\|v\|_1$. The next lemma characterizes the error resulting from this choice. Because we have a particular interest in cases in which $M$ is a perturbation of identity, and because in those cases the vector being compressed is a small perturbation of a vector with at most $m$ non-zero entries, we will pay close attention to the effects on our error bounds of a perturbation $w \in \mathbb{C}^n$ of a vector $v$ with only $m$ nonzero entries, i.e., $v \in S_m$. The lemma includes an estimate of $\mathbb{E}\big[\|\Phi_t^m(v)\|_1^2\big]$ and of the probability of the event $\{\Phi_t(v) = 0\}$, which will both play a role in the discussion immediately following its proof.
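The floor-plus-Bernoulli rounding behind (37) can be sketched directly; below, $\ell = 0$ for simplicity and the helper name is ours. The expectation $\mathbb{E}[N_j] = m\,|v_j|/\|v\|_1$ holds exactly, consistent with the consistency requirement (11).

```python
import numpy as np

rng = np.random.default_rng(5)

def bernoulli_round_counts(v, m, rng):
    """Rule (37) with ell = 0: N_j = floor(x_j) + 1{U_j < frac(x_j)},
    x_j = m |v_j| / ||v||_1, using one independent uniform per entry,
    so that E[N_j] = x_j exactly."""
    x = m * np.abs(v) / np.abs(v).sum()
    base = np.floor(x)
    U = rng.random(v.shape[0])
    return (base + (U < x - base)).astype(int)
```

Note the drawback emphasized in the text: a fresh uniform is needed for every nonzero entry of the input, which motivates the single-variate systematic scheme below.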
Lemma 5. For $\Phi_t$ defined by Algorithm 1 with (37) and for any vectors $v \in S_m$ and $w \in \mathbb{C}^n$, the error bound (38) holds. The right hand side of (38) is itself bounded above by the expression in (39). Concerning the size of the resampled vector we have a corresponding bound. Proof. First we assume that, for all $j$, $|v_j + w_j| \le \|v + w\|_1/m$. We will remove this assumption later. With this assumption in place, $N_j \in \{0, 1\}$ and the while loop in Algorithm 1 is inactive. The random variables in the sum are independent, and since $N_j \in \{0, 1\}$, the expression for the variance of $N_j$ simplifies. Because this scheme does not depend on the ordering of the entries of $v + w$, we can assume that the entries have been ordered so that $v_j = 0$ for $j > m$. Expression (38) then follows from the definition of $\tau_{v+w}^m$ and the fact that the entries $v^\sigma_j$ for $j \le \tau_{v+w}^m$ are preserved exactly by $\Phi_t$. The upper bound in (39) follows after an application of Lemma 4.
In bounding the size of $\Phi_t^m(v + w)$ we will again assume that $\tau_{v+w}^m = 0$ and that the entries have been ordered so that $v_j = 0$ for $j > m$. The size of the resampled vector can be bounded by first noting that the $N_j$ are independent and take values in $\{0, 1\}$. Breaking up the resulting sum, it follows that the desired bound holds, at least when $\tau_{v+w}^m = 0$. Writing the corresponding formula for $\tau_{v+w}^m > 0$ and applying Lemma 4 gives the bound in the statement of the lemma.
Finally, we consider the probability of the event $\{\Phi_t^m(v + w) = 0\}$. If $\tau_{v+w}^m = 0$ then $N_j \in \{0, 1\}$ so that $\mathbb{P}[N_j = 0] = 1 - m\,|v_j + w_j|/\|v + w\|_1$, and, since the $N_j$ are independent, the probability factors into two products. The first product is easily seen to be bounded above by $e^{-m}$. The second product is maximized, subject to the constraint, when the terms in the product are all equal. In practice one would need to modify the choice in (37) to avoid several potential drawbacks. First, with any application of the resulting compression rule (assuming that $\tau_v^m = 0$) there is a non-zero probability that $\Phi_t^m(v) = 0$ (i.e., that all the $N_j = 0$). As Lemma 5 demonstrates, the probability of this event is extremely small. The issue can be avoided by simply resampling in the event that $\Phi_t^m(v) = 0$. The resulting bias and additional cost are negligible (as can be seen using the estimate of the probability of the event $\{\Phi_t^m(v) = 0\}$ given in Lemma 5). Another potential drawback is the possibility that $\Phi_t^m(v)$ has more than $m$ non-zero entries. This can be easily avoided at insignificant additional cost by randomly setting entries of $\Phi_t^m(v)$ to zero (and renormalizing the result) using the systematic compression procedure described below. The resulting compression of $v$ has the same $\ell^1$ norm as $\Phi_t^m(v)$ and exactly the same expectation. The resulting statistical error is also insignificant (a fact that can be established using the estimate on the size of $\Phi_t^m(v)$ in Lemma 5). A much more significant drawback associated with the choice (37) is the requirement that we generate $n$ (assuming $v$ has $n$ non-zero components) independent uniform random variates. Depending on the cost of evaluating $M(V_t^m)$, the generation of so many random variables could constitute a significant portion of the cost of the overall scheme.
An unbiased method that is observed in practice to behave similarly to the scheme in (37), but which requires only a single random variate, exactly preserves the $\ell^1$ norm of the input vector, and produces an output vector with exactly $m$ non-zero entries, can be defined by, for $k = 1, 2, \ldots, m - \ell$, setting the $N_j$ according to (41) and (42), where $U \in (0, 1)$ is a single uniformly chosen random variable. The random sampling step in the resulting compression scheme corresponds to the so-called systematic resampling scheme used frequently in the context of sequential Monte Carlo (see e.g. [16]). It is well known in that context that systematic sampling can fail to converge for certain sequences of input vectors (the input vectors need to change as one increases $m$ or the method may fail to converge). Nonetheless, systematic resampling is an extremely popular technique and in many applications results in lower error than competing schemes. Similarly, for certain choices of $v$, the compression scheme specified by (41) and (42) is guaranteed to converge only as $m/n$ approaches 1. Supporting the use of the scheme for problems in which $M$ is a perturbation of the identity, it is possible to prove that if, after exact preservation of all entries $v^\sigma_j + w^\sigma_j$ with $j \le \tau_{v+w}^m$, the remaining entries are re-ordered so that entries with $v_j = 0$ appear at the end of the list of remaining entries in $v + w$, then (assuming $\tau_{v+w}^m = 0$ for simplicity) an error bound analogous to (43) holds when $\|w\|_1 < \|v + w\|_1/m$. Though we have not used it in our numerical tests, the additional rearrangement step imposes no substantial additional cost and is essential to this bound. Slight modifications of (41) and (42) do result in provably convergent schemes. For example, another commonly used resampling scheme in sequential Monte Carlo, called stratified resampling (see e.g. [16]), corresponds to choosing each uniform random variable $U^{(k)} \in [(k-1)/(m-\ell),\, k/(m-\ell))$ in (40) independently. Again the resulting scheme is unbiased and preserves the $\ell^1$ norm of the input vector.
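The two resampling rules differ only in how the $m - \ell$ points in $[0, 1)$ are drawn; a sketch with $\ell = 0$ (the function name is ours, and the counts-from-slabs formulation is one common way to realize (41)–(42)) follows.

```python
import numpy as np

def resample_counts(v, m, rng, stratified=False):
    """N_j = number of sample points landing in entry j's probability slab.
    Systematic: points (k + U)/m share a single uniform U (one variate).
    Stratified: each point gets its own uniform in its stratum [k/m, (k+1)/m)."""
    p = np.abs(v) / np.abs(v).sum()
    edges = np.cumsum(p)
    edges[-1] = 1.0                      # guard against round-off
    if stratified:
        points = (np.arange(m) + rng.random(m)) / m
    else:
        points = (np.arange(m) + rng.random()) / m
    idx = np.searchsorted(edges, points, side='right')
    return np.bincount(idx, minlength=len(p))
```

Both variants return counts summing exactly to $m$ with $\mathbb{E}[N_j] = m\,|v_j|/\|v\|_1$; the systematic variant consumes a single uniform, which is its main practical attraction.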
The following lemma provides a bound on the scheme's error.
Lemma 6. Let $\sigma$ be a permutation of $\{1, 2, \ldots, n\}$ so that the elements of $v_\sigma + w_\sigma$ have decreasing magnitude, and let $\tau_{v+w}^m$ be defined as in (36) with $v$ replaced by $v + w$. For the compression rule defined in Algorithm 1 with the $U^{(k)}$ chosen independently in (40), the error bound (43) holds for any pair $v, w \in \mathbb{C}^n$ with $v \in S_m$. The right-hand side in (43) is itself bounded above by the expression in (44). Proof. Because the scheme exactly preserves the entries $v_{\sigma_j} + w_{\sigma_j}$ for $j \le \tau_{v+w}^m$, it suffices to consider the case in which, for all $j$, $|v_j + w_j| < \|v+w\|_1/m$. Expression (43) will follow by setting to 0 the entries with index $\sigma_j$ for $j \le \tau_{v+w}^m$ in both $v$ and $w$, and replacing $m$ in the eventual bound by $m - \tau_{v+w}^m$. Let $V = \Phi^m(v+w)$ and suppose $f \in \mathbb{C}^n$ with $\|f\|_\infty \le 1$. Our goal is to establish an upper bound for the variance of the random variable $f^H V$, where we use the easily established fact that $\mathbb{E}[V] = v + w$. Let $s_0 = 0$, define $s_j$ for $j = 1, 2, \ldots, n$ as the partial sums of the normalized magnitudes $|v_j + w_j|/\|v+w\|_1$, and define the function $I(u)$ for $u \in [0,1]$ accordingly. In terms of the resulting probabilities $p_{k,j}$, for $k = 1, 2, \ldots, m$ let $j_k = \arg\max_{j \le n} p_{k,j}$ and $r_k = p_{k,j_k}$.
We find a bound on $\mathrm{var}(f^H V)$ in terms of $\sum_{k=1}^m (1 - r_k)$; it remains to bound this sum in terms of $\|w\|_1/\|v+w\|_1$. Define $\alpha_0 = 0$ and, for $k = 1, 2, \ldots, m$, let $\alpha_k$ be the index of the $k$th entry of $v + w$ with $v_{\alpha_k} \neq 0$. Notice that, since $|v_j + w_j| < \|v+w\|_1/m$, the increments $s_{\alpha_k} - s_{\alpha_{k-1}}$ can be controlled, allowing us to conclude by induction the bound (45). These bounds, and the fact that $|v_j + w_j| < \|v+w\|_1/m$, also imply the bound (46). From the bounds (45) and (46) we conclude that within any interval $[(k-1)/m, k/m)$ we can find at most two points in the set $\{s_{\alpha_j}\}$, and those two points are $s_{\alpha_{k-1}}$ and $s_{\alpha_k}$. Likewise, we can find at most the two points $s_{\alpha_k - 1}$ and $s_{\alpha_{k+1} - 1}$ from the set $\{s_{\alpha_j - 1}\}$ in $[(k-1)/m, k/m)$.
If $s_{\alpha_k}$ is contained in the interval $[(k-1)/m, k/m)$ then so is the complete interval $(\max\{(k-1)/m, s_{\alpha_k - 1}\}, s_{\alpha_k})$. If the entire interval $(s_{\alpha_k - 1}, s_{\alpha_k})$ is contained in $[(k-1)/m, k/m)$ then $r_k \ge m|v_{\alpha_k} + w_{\alpha_k}|/\|v+w\|_1$. Since $|v_j + w_j| < \|v+w\|_1/m$, we obtain an upper bound on $1 - r_k$. If $s_{\alpha_k - 1} \le (k-1)/m$ then $r_k \ge m s_{\alpha_k} - (k-1)$ and we may appeal to (46). On the other hand, if $s_{\alpha_k} \ge k/m$ then $s_{\alpha_k - 1}$ is in the interval $[(k-1)/m, k/m)$ along with the complete interval $(s_{\alpha_k - 1}, k/m)$. Therefore $r_k \ge k - m s_{\alpha_k - 1}$ and we may appeal to (47), so that again the same bound on $1 - r_k$ holds. Since $r_k \ge 0$, the above bounds on $1 - r_k$ allow us to bound $\sum_{k=1}^m (1 - r_k)$, concluding the proof of (43). The upper bound (44) follows by an application of Lemma 4.
Though the bounds in Lemma 6 are somewhat crude, they are sharp in one respect: one can find sequences of vectors achieving an error scaling as $\|v+w\|_1^{1/2}\|w\|_1^{1/2}$ as $m$ is increased. As a consequence, when $M$ is an $\varepsilon$ perturbation of the identity, we can expect $|||\Phi_t(V_t^m) - V_t^m|||$ to be of order $\sqrt{\varepsilon}$ for this scheme, but we cannot guarantee that the error will remain controllable with $m$ (e.g. $\mathcal{O}(m^{-1/2})$) as $\varepsilon$ decreases. This point demonstrates that careful attention to the compression scheme is required when $M$ is a perturbation of the identity. In that context, even schemes inspired by well-regarded resampling strategies may not be effective.

Numerical tests
In this section we describe the application of the framework above to particular matrices arising in (i) the computation of the per-spin partition function of the 2D Ising model, (ii) the spectral gap of a diffusion process governing the evolution of a system of up to five 2-dimensional particles (i.e., up to ten spatial dimensions), and (iii) a free energy landscape for that process. The corresponding numerical linear algebra problems are, respectively, (i) computing the dominant eigenvalue/eigenvector of matrices of size up to $10^{15} \times 10^{15}$, (ii) computing the second largest eigenvalue/eigenvector of matrices of size up to $10^{20} \times 10^{20}$, and (iii) solving a linear system involving exponentiation of matrices of size up to $10^{20} \times 10^{20}$. Aside from sparsity, these matrices have no known readily exploitable structure for computations.
The reader may wonder why we consider random compressions instead of simple thresholding, i.e., a compression rule in which, if $\sigma_j$ is the index of the $j$th largest (in magnitude) entry of $v$, then $v_{\sigma_j}$ is simply set to zero for all $j > m$ (the resulting vector can be normalized to preserve the $\ell^1$ norm or not). In the rest of this paper we will refer to methods using such a compression rule as truncation-by-size (TbS) schemes. TbS schemes have been considered by many authors (see e.g. [26,50,42,43]) and are a natural approach. Note, however, that the error (if the compression is not normalized) for the thresholding compression can be as large as $\|f\|_\infty \|v\|_1 (1 - m/n)$, which vanishes only if $m$ is increased faster than $n$. In contrast, the random compressions above can have vanishing error even when $n$ is infinite. This observation is key to understanding the substantial reduction in error we find for our fast randomized scheme over TbS in the numerical results presented in this section.
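For reference, the deterministic TbS compression just described can be sketched in a few lines of Python (the helper name is ours):

```python
import numpy as np

def truncate_by_size(v, m, renormalize=True):
    """Keep only the m largest-magnitude entries of v, zeroing the rest.

    Deterministic 'truncation by size' (TbS) compression; optionally
    rescaled to preserve the l1 norm.  Illustrative sketch only.
    """
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-m:]       # indices of the m largest entries
    out[keep] = v[keep]
    if renormalize and np.abs(out).sum() > 0:
        out *= np.abs(v).sum() / np.abs(out).sum()
    return out
```

Unlike the random compressions above, this rule is biased: the discarded mass is simply lost (or smeared over the retained entries if renormalized), which is the source of the $\|f\|_\infty \|v\|_1 (1 - m/n)$ worst-case error.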
In order for the FRI approach to yield significant performance improvements, one must use an efficient implementation of matrix by sparse-vector multiplication. In the examples below we list the $(i,j)$ pairs for which the product $K_{ij} v_j$ is nonzero, then sort the products according to the $i$-index and, finally, add the products with common $i$. This is a simple and sub-optimal solution. In particular, if the entries of the matrix $K$ can be efficiently accessed column by column then the sorting step can be avoided. More details can be found in the example code available at [55].

We therefore also cannot hope to apply the power method (or its relatives) directly to $K$ when $\ell$ is large. In our experiments we set $T = 2.2$, $B = 0.01$, and $\ell = 50$, so that $n = 2^{50} > 10^{15}$. We apply both the FRI and TbS $\mathcal{O}(1)$ schemes to computing $\lambda_*$, as well as to computing the sum $f_*$ of all components of the corresponding eigenvector $v_*$ (normalized to have sum equal to 1) with index greater than or equal to $2^{49}$. Knowledge of the partition function $\lambda_*$ as a function of temperature $T$ and field strength $B$ allows one to determine useful quantities such as the average magnetization (sum of spin values) and to diagnose phase transitions [26]. Our choice to estimate $\lambda_*$ and $f_*$ is motivated in part by the fact that these quantities can be approximated accurately by the corresponding values for smaller Ising systems. We will compare our results to those for the 24-spin Ising model, which we can solve by standard power iteration. For an effective specialized method for this problem see [44]. A simple, educational implementation of Fast Randomized Iteration applied to this problem can be found at [55].

[Figure caption: Corresponding trajectory averages of the approximation $F_t^m$ of the total weight of all components of $v_*$ with index greater than or equal to $2^{49}$, with 95% confidence intervals, for the FRI method. The best (highest $m$) estimate for $f_*$ is 0.606, a difference of roughly $5\sigma$ from the value for the 24-spin model.]
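The sort-and-accumulate strategy for matrix by sparse-vector multiplication described above can be sketched as follows, with the matrix held as COO triplets (the function name and interface are ours; this is the deliberately un-optimized variant):

```python
import numpy as np

def sparse_matvec(rows, cols, vals, v):
    """Sparse-matrix (COO triplets) times sparse-vector product.

    Follows the simple strategy from the text: form the nonzero products
    K_ij * v_j, sort them by row index i, then add products sharing a
    common i.  An illustrative, deliberately un-optimized sketch.
    """
    prod = vals * v[cols]                   # all products K_ij * v_j
    keep = prod != 0                        # drop products that vanish
    i, p = rows[keep], prod[keep]
    out = np.zeros_like(v)
    if i.size == 0:
        return out
    order = np.argsort(i, kind="stable")    # sort products by row index
    i, p = i[order], p[order]
    first = np.ones(i.size, dtype=bool)
    first[1:] = i[1:] != i[:-1]             # start of each run of equal i
    out[i[first]] = np.add.reduceat(p, np.flatnonzero(first))
    return out
```

The cost is dominated by the sort over the nonzero products, which is what column-by-column access to $K$ would eliminate.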
In Figure 1 we report the trajectory averages of the approximations $\Lambda_t^m$ and $F_t^m$ generated by the FRI scheme (iteration (16) using Algorithm 1) with $m = 2^{20}, 2^{21}, 2^{22}, 2^{23}$, and $2^{24}$ and $10^5$ total iterations. The best (highest $m$) approximation of $\lambda_*$ is 2.584 and the best approximation of $f_*$ is 0.606. The results for the 24-spin Ising problem are $\lambda_* \approx 2.596$ and $f_* \approx 0.658$, differences of roughly $1\sigma$ and $5\sigma$ from the respective approximations generated by the FRI method. These plots strongly suggest that the iteration equilibrates rapidly (relative to the total number of iterations). Indeed, we estimate the integrated autocorrelation times of $\Lambda_t^m$ and $F_t^m$ to be 20.5 and 274 respectively. This in turn suggests that one could achieve a dramatic speedup by running many parallel and independent copies (replicas) of the simulation and averaging the resulting estimates of $\lambda_*$ and $f_*$, though we have not taken advantage of this here. Figure 3 reports the analogous trajectories (see Equation (49) below) of $\Lambda_t^m$ and $F_t^m$ as generated by iteration (16) with the TbS scheme (iteration (16) using truncation-by-size) and the same values of $m$. The best (highest $m$) TbS approximation of $\lambda_*$ is 2.545 and the best TbS approximation of $f_*$ is 0.014, differences of more than $5\sigma$ and $64\sigma$ respectively. In Figure 4 we plot the sums of the values of the approximation $V_t^m$ of the dominant eigenvector of the Ising transfer matrix at $t = 10^5$ over 256 intervals of equal size out of the $2^{50}$ total indices. The top plot represents $V_t^m$ as generated by the FRI method and the middle plot represents $V_t^m$ as generated by the TbS approach. The TbS iteration has converged to a vector with nearly all of its weight concentrated on very low indices. The bottom plot in Figure 4 represents the dominant eigenvector for the 24-spin Ising transfer matrix.
The qualitative agreement with the realization of $V_t^m$ generated by the FRI method is much stronger than with the result of the TbS method.
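Stripped of problem-specific details, the eigenvalue iteration used in these tests is power iteration with a compression applied after every matrix multiplication. A minimal Python sketch follows; the names are ours, `compress` stands in for any unbiased compression rule (such as Algorithm 1), and passing the identity recovers ordinary power iteration:

```python
import numpy as np

def fri_power_iteration(matvec, v0, m, n_iter, compress, rng=None):
    """Power iteration interleaved with random compression (schematic FRI).

    `compress(v, m, rng)` may be any unbiased compression rule keeping at
    most m nonzero entries.  Illustrative sketch for a matrix with
    nonnegative entries; not the paper's exact iteration (16).
    """
    v = v0 / np.abs(v0).sum()
    lam = np.empty(n_iter)
    for t in range(n_iter):
        w = matvec(v)                               # only step touching K
        lam[t] = np.abs(w).sum() / np.abs(v).sum()  # eigenvalue estimate
        v = compress(w / np.abs(w).sum(), m, rng)   # keep <= m nonzeros
    return lam
```

Because each iterate carries at most $m$ nonzero entries, the per-iteration cost is governed by $m$ rather than by the ambient dimension $n$.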
Remark 12. In this problem we compute the dominant eigenvalue $\lambda_*$ and a projection $f_*$ of the dominant eigenvector $v_*$ of the matrix $K$ defined in equation (48) using FRI in conjunction with power iteration.

Consider now a diffusion process $X(t)$ with generator $L$, and write $\mathbb{E}_x$ for expectations over its paths (the subscript on the expectation indicates that $X_0 = x$). Note that constant functions are in the kernel of $L$. The non-trivial eigenfunctions of $L$ all correspond to negative eigenvalues. The magnitude of the greatest of these negative eigenvalues is the spectral gap and characterizes the rate of convergence of expectations such as $\mathbb{E}_x f(X(t))$ to their equilibrium (large $t$) values.
In this subsection we consider estimation of the largest negative eigenvalue of $L$ for $b = -\nabla V$. The function $V$ is the potential energy for a periodic system of $\ell$ 2D-particles, each subject to both an external force as well as a nonlinear spring coupling the particles together. Equation (51) is a model of the dynamics of that system of particles (in a high-friction limit).
The equation $Lg_* = \lambda_* g_*$ is first projected onto a Fourier spectral basis, with modes indexed by $\alpha = (\alpha^{(1)}, \ldots, \alpha^{(\ell)})$; the symbol $\mathbb{Z}_N^2$ is used to indicate that both components $\alpha_1^{(j)}$ and $\alpha_2^{(j)}$ of each $\alpha^{(j)}$ are integers with magnitude less than $N$. Suppose that $\tilde{L}$ is the corresponding spectral projection of $L$ (which, in this case, is real). The matrix $\tilde{L}$ can be decomposed into a sum of a diagonal term (corresponding to the second-order term in $L$) and a non-diagonal term (corresponding to the first-order term in $L$), i.e., $\tilde{L} = A + D$.
In this problem we are trying to find not the eigenvalue of $\tilde{L}$ of largest magnitude but the largest (they are real) eigenvalue, and a transformation is required. As we mentioned in Section 3.2, this can be accomplished by applying power iteration to a matrix obtained from a discrete-in-time approximation of the ODE $\frac{d}{dt} y = \tilde{L} y$, i.e., by exponentiating the matrix $t\tilde{L}$ for very large $t$. For example, for sufficiently small $\varepsilon > 0$, the eigenvalue $\mu$ of largest magnitude of the matrix $K$ defined in (52) is, to within an error of order $\varepsilon^2$, equal to $1 + \varepsilon\lambda_*$, where $\lambda_*$ is the eigenvalue of $\tilde{L}$ of largest real part (in our case the eigenvalues are real and non-positive). We will apply our iteration schemes to $K$. By fixing $v_t(\mathbf{0}) = 0$ we can guarantee that the approximate solutions all have vanishing integral over $[-1,1)^{2\ell}$, ensuring that the iteration converges to an approximation of the desired eigenvector/value pair (instead of to $v(\mathbf{0}) = 1$, $v(\alpha) = 0$ for $\alpha \neq \mathbf{0}$).
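The transformation just described can be illustrated with the simplest such discretization, $K = I + \varepsilon A$ (not necessarily the exact form of (52)): power-iterate on $K$ and recover $\lambda_* \approx (\mu - 1)/\varepsilon$. The helper below is a hypothetical sketch under those assumptions:

```python
import numpy as np

def largest_real_part_eigenvalue(A, eps=1e-2, n_iter=3000, seed=0):
    """Estimate the eigenvalue of A with largest real part by power
    iteration on K = I + eps*A and the relation mu ~ 1 + eps*lambda.

    Assumes the eigenvalues of A are real and non-positive and that eps
    is small enough for 1 + eps*lambda to dominate; a sketch only.
    """
    n = A.shape[0]
    K = np.eye(n) + eps * A
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.abs(v).sum()
    mu = 1.0
    for _ in range(n_iter):
        w = K @ v
        mu = np.abs(w).sum() / np.abs(v).sum()  # dominant eigenvalue of K
        v = w / np.abs(w).sum()                 # renormalize each step
    return (mu - 1.0) / eps                     # recover lambda from mu
```

The $\mathcal{O}(\varepsilon^2)$ bias noted in the text appears here as the error committed in replacing $e^{\varepsilon A}$ by $I + \varepsilon A$.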
Remark 13. In this problem, rather than estimating the dominant eigenvector of $K$, our goal is to estimate the second largest (in magnitude) eigenvalue of $K$. Given that we know the largest eigenvalue of $K$ is 1, with an eigenvector that has value one in the component corresponding to $\alpha = \mathbf{0}$ and zeros in all other components, we can exactly orthogonalize the iterates $V_t^m$ against the dominant eigenvector at each iteration (by fixing $V_t^m(\mathbf{0}) = 0$). We may view this as using FRI in conjunction with a simple case of orthogonal iteration.
We compare the FRI and TbS approaches with $N = 51$ for the four- and five-particle systems ($\ell = 4, 5$). The corresponding total count of real numbers needed to represent the solution in the five-particle case is $101^{10} \approx 10^{20}$, so only the $\mathcal{O}(1)$ scheme is reasonable. For $h$ we choose a value of $10^{-3}$. Our potential $V$ is chosen so that the resulting matrix $\tilde{L}$ (and therefore also $K$) is sparse and its entries are computed by hand. For a more complicated $V$, the entries of $\tilde{L}$ might have to be computed numerically on-the-fly or might not be computable at all. Our ability to efficiently compute the entries of $\tilde{L}$ will be strongly affected by the choice of basis. For example, if we use a finite difference approximation of $L$ then the entries of $\tilde{L}$ can be computed easily. On the other hand, if the solution is reasonably smooth, the finite difference approximation will converge much more slowly than an approximation (like the spectral approximation) that more directly incorporates properties of the solution (regularity in this case). Figure 5 plots the trajectory averages over $10^5$ iterations of $\Lambda_t^m$ in the $\ell = 4$ case generated by the FRI method (iteration (16) using Algorithm 1), along with corresponding trajectories of $\Lambda_t^m$ as generated by the TbS approach (iteration (16) using truncation-by-size). We present results for both methods with $m = 1, 2, 3$, and $4 \times 10^4$. Observe that the results from the FRI method appear to have converged to one another, while the results generated by the TbS approach show no signs of convergence. The best (highest $m$) estimate of the eigenvalue generated by FRI is $-2.31$ and the best estimate generated by TbS is $-2.49$. Figure 6 plots the trajectory of $\Lambda_t^m$ in the five-particle ($\ell = 5$) case as generated by the FRI method with $m = 10^6$, along with its trajectory average (neglecting the first 500 iterations) of about $-1.3$. Note that $\Lambda_t^m$ appears to reach its statistical equilibrium rapidly relative to the $2 \times 10^3$ total iterations.
Again, the rapid equilibration suggests that statistical error could be reduced by averaging over many shorter trajectories evolved in parallel.

A PDE steady state problem
The adjoint $L^*$ of the operator defined in (50) with respect to the standard inner product is called the Fokker-Planck operator. The operator determines the evolution of the density of the process $X(t)$ defined in (51), in the sense that if $\mu$ is that density then $\partial_t \mu = L^* \mu$. An element in the kernel of $L^*$ (a steady state solution of the Fokker-Planck equation) is a density left invariant by $X(t)$.

Figure 5: Trajectory averages of the approximation $\Lambda_t^m$ of the largest negative eigenvalue of a backwards Kolmogorov operator for a four 2D-particle (8-dimensional) system, with 95% confidence intervals, for the FRI method with $m = 1, 2, 3$, and $4 \times 10^4$. The operator is discretized using a Fourier basis with 101 modes per dimension, for a total of more than $10^{16}$ basis elements (half that after taking advantage of the fact that the desired eigenvector is real). The step-size parameter $h$ is set to $10^{-3}$. Also on this graph: trajectories of $\Lambda_t^m$ for the TbS method for the same values of $m$.
For the choice of $b$ and $\sigma$ given in the previous subsection, the steady-state solutions are easily seen to be constant multiples of a single explicitly known function, $\mu_*$. In most applications, the goal is to compute averages of observables with respect to $\mu_*$. For example, one might hope to find (up to an additive constant) the effective potential (or free energy), $F_1$, experienced by particle 1. For that purpose, explicit knowledge of $\mu_*$ is of little value, since one cannot hope to compute integrals with respect to a function of so many variables (up to $101^8$ in our tests). One instead hopes to find a more digestible expression for $\mu_*$. Notice that a Fourier expansion $\mu_*(x) = \sum_\alpha v(\alpha)\,\varphi_\alpha(x)$, with $\varphi_\alpha$ the Fourier basis functions, provides such an expression, since low-dimensional projections such as $F_1$ depend on only a small subset of the coefficients $v(\alpha)$.

Figure 6: Trajectory of the approximation $\Lambda_t^m$ of the largest negative eigenvalue of a backwards Kolmogorov operator for the five-particle system, as computed by the FRI method with $m = 10^6$ over $2 \times 10^3$ iterations. The total dimension of the discretized system is more than $10^{20}$. The average value of $\Lambda_t^m$ (ignoring the first 500 iterations) is $-1.3$ and is shown by a solid black line.
As in the previous section we discretize the Fokker-Planck operator in a Fourier basis, resulting in a finite-dimensional linear root-finding problem $(K - I)v_* = 0$, where $K$ is now defined just as in (52) but with $A = \tilde{L}^H + D$. We choose to normalize the solution so that $v(\mathbf{0}) = 1$, which then results in a linear system $(\tilde{K} - I)\tilde{v}_* = r$, where $r(\alpha) = -K_{\alpha\mathbf{0}}$, $\tilde{K}$ has the row and column corresponding to the index $\mathbf{0}$ removed, and $\tilde{v}_*$ has the component corresponding to the index $\mathbf{0}$ removed.
Remark 14. Note that the linear system $(\tilde{K} - I)\tilde{v}_* = r$ is solved for $\tilde{v}_*$ here using FRI in conjunction with Jacobi iteration. With the normalization $v_t(\mathbf{0}) = 1$, this is equivalent to using power iteration to find the eigenvector corresponding to the largest eigenvalue of $K$ (which is 1). Recalling that here $K \approx I + \varepsilon(\tilde{L}^H + D)$, observe that we are (when $\varepsilon$ is small) approximately computing $\lim_{t\to\infty} \exp(At)v_0$ with $A = \tilde{L}^H + D$, which, since the largest eigenvalue of $A$ is 0, is the desired eigenvector. We repeat that, though we know the dominant eigenvalue of $K$ and have a formula for $\mu_*$, the dominant eigenvector of $L^*$, our goal is to compute projections of $\mu_*$ that cannot be computed by deterministic means.
In Figure 7 we present approximations of the function $F_1$ generated by the $\mathcal{O}(1)$ scheme $V_{t+1}^m = \Phi_t^m(K V_t^m)$, $\hat{F}_{t+1}^m(\alpha^{(1)}) = (K V_t^m)(\alpha^{(1)}, \mathbf{0}, \ldots, \mathbf{0})$, $F_{t+1}^m(\alpha^{(1)}) = (1 - \varepsilon_t) F_t^m(\alpha^{(1)}) + \varepsilon_t \hat{F}_{t+1}^m(\alpha^{(1)})$, for all $\alpha^{(1)} \in \mathbb{Z}_N^2$, with $\varepsilon_t = (t+1)^{-1}$ and where the independent mappings $\Phi_t^m$ are generated according to Algorithm 1. The single-particle free energy as generated by the FRI approach is plotted for $\ell = 2, 3, 4$, and 5, with $m = 10$, 200, $10^4$, and $10^6$ respectively. In the two-, three-, and four-particle simulations we use $10^5$ iterations. Again we choose $N = 51$ and $h = 10^{-3}$. The high cost per iteration in the five-particle case restricts our simulations to $2 \times 10^3$ iterations. In the four-particle case, for which we have validated the FRI solution by simulations with higher values of $m$ ($m = 4 \times 10^4$), the free energy profile produced by the TbS approach differs from the FRI result by as much as 100%. We take slight advantage of the particle exchange symmetry and, at each iteration, replace $(K V_t^m)(\alpha^{(1)}, \mathbf{0}, \ldots, \mathbf{0})$ in the above equation for $\hat{F}_{t+1}^m$ by the average of all components of the form $(K V_t^m)(\mathbf{0}, \ldots, \alpha^{(k)}, \ldots, \mathbf{0})$. Note that in the expansion of $\mu_*$, we know that $v(\alpha)$ is unchanged when we swap the indices $\alpha^{(j)}$ and $\alpha^{(k)}$ corresponding to any two particles. This fact could be leveraged to greatly reduce the number of basis functions required to accurately represent the solution. We have not exploited this possibility.
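The averaging step above, with $\varepsilon_t = (t+1)^{-1}$, is just an incremental form of the running mean of the instantaneous estimates. A minimal sketch (the helper name is ours):

```python
import numpy as np

def trajectory_average(samples):
    """Running update F <- (1 - eps_t) F + eps_t * F_hat with
    eps_t = 1/(t+1); algebraically identical to the running mean
    of the samples seen so far."""
    F = np.zeros_like(samples[0], dtype=float)
    for t, f_hat in enumerate(samples):
        eps = 1.0 / (t + 1)
        F = (1.0 - eps) * F + eps * np.asarray(f_hat, dtype=float)
    return F
```

The incremental form is what makes the averaging compatible with the $\mathcal{O}(1)$ storage budget: only the current average, not the history of iterates, is retained.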
Though it is not accurate, the TbS scheme is substantially more stable on this problem. We suspect that the relative stability of the TbS scheme is a manifestation of the fact that TbS does not actually represent the high-wavenumber modes that are responsible for stability constraints. Nonetheless, simulating higher-dimensional systems would require modifications in our approach. In particular, it might be necessary to identify a small number of components of the solution that should always be resolved (never set to zero). For example, for this problem one might choose to resolve some number of basis functions for each particle that are independent of the positions of the other particles.

Discussion
We have introduced a family of fast, randomized iteration schemes for eigenproblems, linear systems, and matrix exponentiation. Traditional iterative methods for numerical linear algebra were created in part to deal with instances where the coefficient matrix $A$ (of size $\mathcal{O}(n^2)$) is too big to store but where the operation $x \mapsto Ax$ can nonetheless be carried out. The $\mathcal{O}(1)$ iterative methods in this paper are intended for instances where even the solution vector $x$ (of size $\mathcal{O}(n)$) is too big to store, but where the ultimate goal is to compute $Fx$, where $F$ may be dense but has many fewer rows than columns. In addition to the contexts addressed explicitly above, such problems have become increasingly common in other areas as well; for example, in the analysis of large data sets and in the solution of large-scale Lyapunov and Sylvester equations [4,52,53] arising in control theory.
A completely deterministic approach related to the methods proposed in this article is simple truncation-by-size (TbS), in which, at each iteration, the smallest entries in the approximation are set to zero. An adaptive version of the TbS approach has recently been advocated for a wide range of applications (see [50,42,43]), including variants of those considered in this paper. Like TbS, our randomized schemes also rely on the enforcement of sparsity and also tend to set small entries in the approximate solution to zero. While TbS schemes can be effective on some problems with sparse solutions, their error, in general, will depend strongly on system size.
The core concept behind the schemes is that, by randomly setting entries in vectors to zero while maintaining a statistical consistency property, we can dramatically reduce the cost and storage of standard iterative schemes. One can view our FRI schemes as an attempt to blur the line separating Markov chain Monte Carlo (MCMC), which is effective in extremely high-dimensional settings but is limited to a relatively narrow class of problems and often does not allow the user to take full advantage of known properties of the solution (e.g. smoothness or symmetry properties as in [8]), and traditional deterministic schemes, which are effective on a very general set of relatively low-dimensional problems.

Figure 7: Free energy landscape experienced by a single particle for the (clockwise from top left) two-, three-, four-, and five-particle 2D systems. The surfaces were generated using the FRI method with $m = 10$, 200, $10^4$, and $10^6$ respectively and $h = 10^{-3}$. The two-, three-, and four-particle simulations were run for $10^5$ iterations. The five-particle simulation was substantially more expensive per iteration and was run for only $2 \times 10^3$ iterations. The number of Fourier modes used to represent the solution in all simulations is 101 per dimension, for a total of more than $10^8$, $10^{12}$, $10^{16}$, and $10^{20}$ basis functions (half that after taking advantage of the fact that the solution is real). As expected, the free energy basins deepen as the number of particles grows. In the four-particle case (for which we have high confidence in the estimate produced by FRI) the error from the TbS approach (the results of which we do not plot) is roughly 100% at peaks of the free energy landscape.
As with MCMC approaches, if one settles for computing low-dimensional projections of the full solution, then not every element of the state space need be visited, and effective FRI schemes with per-iteration cost and storage requirements independent of system size can be derived (as with MCMC, the validity of this statement depends on the particular sequence of problems considered). Also as with MCMC, we expect that, when deterministic alternatives are available, they will outperform our randomized schemes. For matrices of the size considered in all of our numerical tests, deterministic alternatives are not available.
Experience with diffusion Monte Carlo in the context of quantum Monte Carlo simulations suggests that our randomized schemes will be most useful if applied after considerable effort has been expended on finding changes of variables that either make the desired solution as sparse as possible (reducing both bias and variance) or reduce bias by some other means. In many cases this will mean applying our randomized schemes only after one has obtained an estimate of the solution by some deterministic method applied to a reduced-dimensional version of the target problem.