A comparison of limited-memory Krylov methods for Stieltjes functions of Hermitian matrices

Given a limited amount of memory and a target accuracy, we propose and compare several polynomial Krylov methods for the approximation of f(A)b, the action of a Stieltjes matrix function of a large Hermitian matrix on a vector. Using new error bounds and estimates, as well as existing results, we derive predictions of the practical performance of the methods, and rank them accordingly. As by-products, we derive new results on inexact Krylov iterations for matrix functions in order to allow for a fair comparison of rational Krylov methods with polynomial inner solves.


Introduction.
In recent years considerable progress has been made in the development of numerical methods for the efficient approximation of f (A)b, the action of a matrix function f (A) on a vector b. In applications, the matrix A ∈ C N ×N is typically large and sparse and the computation of the generally dense matrix f (A) is infeasible. One therefore seeks to approximate f (A)b directly by means of some iterative method. By far the most popular methods for this task are polynomial [15,38] or rational [16,24,25,45] Krylov methods.
In this work we investigate which Krylov methods are best suited when A is Hermitian and the only feasible operation involving this matrix are matrix-vector products. This might be the case, e.g., when direct solvers are inefficient due to A's sparsity structure or if A is only implicitly available through a routine that returns the result of a matrix-vector product. We further assume that memory is limited so that only a predefined number m max of vectors of size N can be stored. This situation arises in many applications, including lattice quantum chromodynamics [9,11,18] and statistical sampling [29,30,43].
The basis of polynomial Krylov methods for Hermitian matrices is the Lanczos method [34] (which corresponds to the Arnoldi method [2] in the non-Hermitian case). Applying m Lanczos iterations with A and b yields the Lanczos relation and e m denoting the mth canonical unit vector in R m . The Lanczos approximation f m ≈ f (A)b is obtained by projecting the original problem onto the Krylov space, where · denotes the Euclidean vector norm. The evaluation of (1.2) requires the storage of the full Lanczos basis V m , i.e., m vectors of length N . In the situation described above, the value m max which limits the number of vectors that can be stored therefore limits the maximum number of iterations that can be performed and thus also the attainable accuracy of the Lanczos approximation. This is different from the situation for Hermitian linear systems, where the short recurrence for the Krylov basis vectors translates into a short recurrence for the iterates, resulting in the famous conjugate gradient method [28].
There are several approaches for overcoming the memory problem, including • the two-pass Lanczos method, see, e.g., [10,23] which overcomes memory limitations but roughly doubles the computational effort, • the multi-shift CG method [18,22], which replaces f by a rational approximation and then simultaneously solves the resulting linear systems using the short recurrence conjugate gradient method, • restarted Krylov methods [1,17,20,21,30,41,44] which, similar to restarted methods for linear systems, construct a series of Krylov iterates in such a way that each "cycle" of the method only requires a fixed amount of storage. We also refer the reader to the recent survey [26] covering limited-memory polynomial methods for the f (A)b problem.
Another approach, which has not been considered in this context in the literature so far, is to use rational Krylov methods [8,24,25] combined with an iterative short recurrence solver for the associated linear systems. It appears to be generally thought that using a polynomial Krylov solver inside a rational Krylov method is not sensible, because the approximation computed is then polynomial and hence could also be computed with a polynomial Krylov method alone. While this is true in theory, an outer-inner rational-polynomial Krylov method may still be interesting in our setting: if the overall number of outer iterations is small, the number of vectors to be stored is small (as the inner iteration uses a short recurrence) and hence the method could be a viable alternative to restarting strategies. Thus, we pose the following question: Given a limited amount of memory (storage of at most m max vectors of length N ) and a target accuracy ε, what is an efficient way to extract an accurate approximation to f (A)b from a polynomial Krylov space?
Of course, "efficient" can have several meanings like, e.g., "small number of matrix-vector products and inner products" or "low overall computation time". The latter criterion is highly dependent on the specific implementation of each method and also the hardware environment (e.g., parallel/distributed computing), and hence difficult to asses given only information on f, A, b; see also [12] for a discussion of such difficulties. We take a more general, implementation-independent approach here by exploiting the considerable progress that has recently been made in the understanding of (restarted) Krylov methods for the f (A)b problem [20,21,41]. Together with the very well-understood convergence behaviour of the conjugate gradient method (see, e.g., [39]), this opens up the possibility to assess and compare the efficiency of different algorithmic variants using upper error bounds. We hope that this theoretical work will serve as a starting point for a more practice-oriented comparison of the different algorithms for a) investigating the potential for efficient parallelization and tuning, b) comparing our theoretical estimates of iteration numbers to the real numbers occurring when solving different real-world problems. Most of the available theoretical results mentioned above apply to the class of Stieltjes functions where µ(t) is monotonically increasing and nonnegative on [0, ∞) and in addition ∞ 0 (t + 1) −1 dµ(t) < ∞; see, e.g., [6,7,27]. Important examples of Stieltjes functions include f (z) = z −α with α ∈ (0, 1) and f (z) = log(1 + z)/z. Stieltjes matrix functions are closely related to shifted linear systems (see, e.g., [20]), which allows us to transfer many theoretical results well-known for linear systems, like the classical CG convergence bound. In particular, the following theorem is central to many developments in this paper. The A-norm used in the statement of the theorem is defined as x H Ax . Theorem 1.1 (see, e.g., [39]). Let A ∈ C N ×N be Hermitian positive definite and x 0 , b ∈ C N . Further, let x * denote the solution of the linear system Ax = b and let x m be the mth CG iterate with initial guess x 0 . Let λ min and λ max denote the smallest and largest eigenvalue of A, respectively, and let κ = λmax λmin denote the condition number of A. Define (where we set α m = 0 if κ = 1). Then the error in the CG method satisfies The remainder of this paper is structured as follows. In Section 2 we give a survey of the different established polynomial methods for approximating f (A)b, together with their (worst-case) convergence bounds. Sections 3 and 4 introduce new combinations of outer-inner rational-polynomial Krylov methods, namely an inexact shift-and-invert method [35,37,45] and an inexact extended Krylov method [16,33] with polynomial Krylov solvers. We provide a new convergence analysis for the shiftand-invert method and discuss ways to relax the inner iterations of the proposed methods. In Section 5 we use the obtained convergence results to estimate the total arithmetic cost of each of the considered methods, and discuss general advantages, disadvantages, and prerequisites of each of the methods. The theoretical estimates are compared to real iteration counts for some (artificial) test problems. Concluding remarks and topics for future research are given in Section 6.
2. Polynomial limited-memory Krylov methods. As the problem of approximating functions of very large, sparse matrices arises frequently in applications, several different strategies have been developed to overcome the problem of scarce memory. We briefly describe three established Krylov methods, together with theoretical results on their convergence behaviour.
2.1. Two-pass Lanczos. The two-pass Lanczos method [10,23] is a very simple approach that solves the scarce memory problem by applying the Lanczos process twice. Of course, this doubles the number of matrix-vector products and inner products that need to be computed.
In the first pass of the Lanczos method one computes the compressed matrix T m and discards the basis vectors in V m as soon as they are no longer needed to compute the next basis vector (i.e., only storing the last three basis vectors). Then, the coefficient vector y m := b f (T m )e 1 is computed. In the second pass, the Lanczos approximation is computed as which can be updated from one iteration to the next and thus also allows to discard old basis vectors. This approach produces the same iterates as the standard Lanczos process but requires twice the number of matrix-vector products. When f is a Stieltjes function, the convergence of the two-pass Lanczos method is thus characterized by the following theorem from [20]. It is a special case of a more general result for the restarted Lanczos method; cf. Theorem 2.5 below.
Theorem 2.1 (Corollary 4.4 in [20]). Let A ∈ C N ×N be Hermitian positive definite, b ∈ C N , f a Stieltjes function (1.3), and let f m be the approximation to f (A)b after m iterations of the Lanczos method. Further, let .
and let t 0 ≥ 0 be the left endpoint of the support of µ. Then We remark that the bound (2.2) is, up to the factor C, the same as the standard textbook CG convergence bound for the linear system (A+t 0 I)x = b. Also, note that if additional computational work is invested into the computation of error bounds or error estimates during the first pass of the Lanczos method, then this work can of course be avoided in the second pass, as the number of iterations required to reach the desired accuracy is already known.

2.2.
Multi-shift CG. The multi-shift CG method [18,22,23] for f (A)b is based on first approximating f by a suitable rational function r, r(A)b ≈ f (A)b, most commonly in partial fraction form where we assume that all poles ζ i lie on the negative real axis. Then The main trick for efficiently evaluating (2.5) is the shift-invariance of Krylov spaces, that is K m (A, b) = K m (A + ζI, b) for all ζ ∈ C. Therefore, when started with an initial guess x 0 (ζ) = 0 for all shifted systems, the Krylov spaces from which the conjugate gradient method extracts its approximants for the different systems all coincide. This can be exploited by simultaneously solving all shifted systems in the same iterative process, requiring the same number of matrix-vector products and inner products as the solution of a single system. The multi-shift CG implementation proposed in [22] results in the following computational and memory overhead compared to the standard Lanczos method (ignoring negligible scalar operations): (i) Two vectors of length N need to be stored for each pole of the rational approximation, i.e., 2p additional vectors overall. (ii) In each iteration, two vector additions and three vector scalings are performed (equalling to about two and a half inner products in cost) for each pole of the rational approximation, totalling to 2.5mp additional inner products. (iii) After performing the multi-shift CG method, the iterates of the individual systems need to be combined according to (2.5). This results in an additional number of p vector scalings and (p − 1) vector additions (and, in case of a rational approximation of the later form in (2.4), one additional matrix-vector product), which equals to about p additional inner products.
Remark 2.2. In [18] it is proposed to not store the iterates of the individual systems and then combine them at the end, but instead to directly combine them in each iteration to update the approximation to f (A)b. This way, only one additional vector needs to be stored per system, but the computational effort of roughly 2p vector additions and scalings is required in each iteration of the multi-shift CG method.
The (worst-case) speed of convergence of the multi-shift CG method is also described by the classical textbook CG convergence bound, with the worst-conditioned system (i.e., the one corresponding to the pole ζ i closest to zero) determining the overall necessary number of iterations. When one of the shifts is very close to zero, as it is typically the case for rational approximations of Stieltjes functions, the resulting convergence factor is thus approximately equal to that of the standard Lanczos method given in Theorem 2.1.
The overall computational cost and storage requirements of the multi-shift CG method are determined by the number p of poles of the rational approximation. These of course depend on the overall accuracy to which one wants to approximate , depends both on the accuracy of the rational approximation and the accuracy to which the shifted linear systems are solved. When aiming for an overall accuracy of ε, a (straightforward) approach is to construct a rational function r such that |f (z) − r(z)| ≤ ε/(2 b ) for z ∈ spec(A), the spectral interval of A, and then solve the shifted linear systems accurately enough for fulfilling such that overall Remark 2.3. Typically, systems associated with poles of large magnitude converge significantly faster than those corresponding to poles close to zero. Therefore one can employ a strategy for "removing" already converged systems from the iteration in order to not perform superfluous computations. Strategies for doing this without violating the condition (2.6) are discussed in [18,Section 5.3] in the context of approximating the action of the matrix sign function.

Restarted Lanczos.
The idea of the restarted Lanczos method is to first compute an approximation (1.2) obtained from m < m max iterations of the standard Lanczos process, which we now denote by f (1) m , where the superscript is used to distinguish quantities belonging to different restart cycles. The second cycle of the method then consists of an additive update f m obtained by m new Lanczos iterations. Repeatedly applying this approach yields a sequence of approximations In order to use the Lanczos method for approximating the error as m (z) and a new vector v (1) . First results in this direction were given in [17,30], characterizing the restart function e (1) m (z) as the mth order divided difference [13] of f (z) with respect to the Ritz values, i.e., the eigenvalues of T m . However, this error function representation turned out to be numerically unstable. We therefore cite a result from [20,21] which gives an integral representation for the error which is numerically stable and in addition useful for deriving theoretical results on the convergence of the restarted Lanczos method.
where v m+1 is the (m + 1)st Lanczos vector. Theorem 2.4 recursively also holds for the Lanczos approximation resulting after a restart cycle because e m (z) is itself (a scalar multiple of) a Stieltjes function; see [20,Proposition 2.2]. As a consequence, the error representation can be used to obtain a restarted Lanczos method with an arbitrary number of restart cycles. In [21,41] a version of this method was introduced which evaluates the error function e m (z) using adaptive numerical quadrature. The following theorem (a more general version of Theorem 2.1) gives an upper bound on the error of the restarted Lanczos method.
where C is as in (2.3) and 0 ≤ α m (t 0 ) < 1. In particular, the restarted Arnoldi method converges for all restart lengths m ≥ 1.

Rational Krylov methods for Stieltjes function.
In this section, we discuss two rational Krylov methods, namely the shift-and-invert Lanczos method and the extended Krylov method, for the approximation of Stietljes functions. The convergence of the shift-and-invert method is analysed in detail. We also briefly comment on iteratively solving the linear systems arising in each iteration.

Shift-and-invert Lanczos.
The shift-and-invert Lanczos method was introduced in [45] for "preconditioning" Lanczos iterations for the matrix exponential times a vector; see also [35] for related work. The main idea is to replace A by a matrix B with more favourable spectral properties, leading to faster convergence to f (A)b. In the following we give a short description of this method.
First, define B = (A − ξI) −1 , where ξ ∈ R − is an arbitrary, negative shift. How to best choose this parameter is discussed later in this subsection. Applying m steps of the Lanczos method to B with starting vector b, we obtain the relation We have f (A)b = g(B)b and define the standard shift-and-invert approximation as For reasons that will become apparent when deriving the convergence bounds later in this section, we propose to instead use the so-called "corrected" Lanczos approximation (first introduced in [38] for the approximation of ϕ-functions) When f is a Stieltjes function (1.3), then the function g defined in (3.2) clearly admits an integral representation Using (3.4), we find the representation for the error of the corrected shift-and-invert approximation (3.3), where γ m and w m are as in Theorem 2.4. In other words, where e m (t) is the error of the mth CG approximation to the linear system (I + (ξ + t)B)x (t) = b. This follows from the fact that is the residual of the mth CG approximation x m (t) for that linear system. In the following, we use the error function representation (3.6) to derive results on the speed of convergence of the shift-and-invert method and on how to choose the shift ξ. Similar results have been obtained in [35], but we provide a different proof here which yields explicit constants in the bounds that were previously unavailable.
Let λ min and λ max denote the smallest and largest eigenvalue of A, respectively, and define .
(3.8) Then the norm of the error ofĝ SI m is bounded by Proof. By using (3.5) and (3.7), we can write , where e m (t) denotes the error of the approximation x m (t) from m steps of CG for the shifted linear system (I + (ξ + t)B)x = b. This yields . We now apply Theorem 1.1 for the shifted matrices I + (ξ + t)B, which are positive definite for t ∈ [0, ∞). Note that κ ξ (t) is exactly the condition number of the shifted matrix I + (ξ + t)B. Applying the CG estimate for all t and using the fact that the initial guess is x 0 (t) = 0 for all t, we conclude that (3.11) Inserting (3.11) into (3.10), using the fact that v ≤ √ λ max − ξ v B for all v ∈ C N and splitting the integral at −ξ completes the proof.
The error estimate (3.9) shows that the asymptotic convergence factor of the corrected Lanczos approximation for g(B)b will be determined by the largest asymptotic CG convergence factor α ξ m (t) across all shifts t ∈ [0, ∞). According to (3.8), the values α ξ m (t) also depend on the shift ξ and we will therefore now determine the value of ξ for which the maximum of α ξ m (t) becomes smallest possible. For this, first note that c ξ increases monotonically as a function of κ ξ (t) and α ξ m increases monotonically as a function of c. Therefore, α ξ m (t) attains its largest value where κ ξ (t) attains its largest value. The function κ ξ (t) is monotonically decreasing on [0, −ξ] and monotonically increasing on [−ξ, ∞). Therefore, (3.12) It depends on the choice of the shift ξ which of the two values κ ξ (0) and κ ξ,∞ is larger. We have i.e., κ ξ,∞ is monotonically increasing in ξ and κ ξ (0) is monotonically decreasing in ξ. (3.14) Using the shift (3.14), we find the following error bound.
Proof. By the considerations above, we have that for Using the fact that √ λ min + t √ λ max + t ≤ √ λ min λ max + t after inserting (3.16) into (3.9) concludes the proof.

Extended Krylov.
The extended Krylov subspace method [16,32,33], like the shift-and-invert method, is a special case of a rational Krylov method. Here the approximations to f (A)b are extracted from an extended Krylov space One iteration of the method involves adding one basis vector from the "positive" and one vector from the "negative" sequence. A five-term recursion for the basis vectors was first derived in [42]. Recursion relations for more general extended Krylov sequences are treated in [31,32]. The method yields an extended Lanczos decomposition An approximation for f (A)b can then be obtained by projection onto the extended Krylov space in the usual way, i.e., Using the algorithmic approach from [42], one iteration of the extended Krylov subspace method requires one matrix-vector product with A, the solution of one linear system with A, and the computation of six inner products/vector norms in the orthonormalization process. The matrix T EK m can be cheaply computed from the orthonormalization coefficients without any further inner products [31].
The convergence of the extended Krylov methods for the approximation of Stieltjes matrix functions via (3.17) has been analyzed in [5,16,33]. Here we will use the following result.
for a constant C > 0 and with the term on the right asymptotically sharp for large κ = λ max /λ min . Remark 3.4. Theorem 3.3 states that the convergence factor of the extended Krylov method when expressed in terms of the subspace dimension dim EK m = 2m (instead of the order m), is the same as just like the shift-and-invert method discussed in Section 3.1.

Polynomial solves in the inner iteration.
Both the shift-and-invert method and the extended Krylov subspace method require the solution of a linear system in each iteration. They are therefore particularly attractive for matrices for which direct solution methods can be efficiently applied (e.g., not too large matrices with rather small bandwidth, for which it is feasible to compute a Cholesky decomposition). In particular, as the poles of the rational Krylov subspace stay the same across all iterations, it suffices to compute a Cholesky decomposition (of A or A − ξI, depending on the method) once and then use it in all subsequent iterations.
We are interested in the very large scale case, in which only a few vectors of length N can be stored at the same time, meaning particularly that computing a Cholesky decomposition of A is not an option. Therefore, the linear systems occurring in the shift-and-invert or extended Krylov method can only be solved approximately by an iterative method (the inner iteration). We deal here with the case in which the inner iteration is again a Krylov subspace method. As A is Hermitian positive definite, a natural choice is the conjugate gradient method.
In cases where a direct solver can be used, one iteration of the shift-and-invert Krylov method and one iteration of the extended Krylov method have approximately the same cost (as there is no difference in the cost of computing a Cholesky factorization of A or A − ξI). When using an iterative method, this dramatically changes, however. In the shift-and-invert method, with the optimal shift ξ = − √ λ min λ max it follows from (3.16) that κ(A − ξI) = κ(A). Thus, without preconditioning of the inner iteration, we can expect the linear systems within the extended Krylov method to be much more difficult to solve and require a significantly higher number of iterations.
An important question that arises in the context of using an inner iteration for the solution of the linear systems in a rational Krylov method is to which accuracy these systems need to be solved in order to not negatively influence the convergence of the outer iteration. We discuss this topic in detail in the next section.
4. Relaxing the tolerance in outer-inner rational Krylov methods. In this section, we only discuss the shift-and-invert method. For the extended Krylov method, very similar results can be formulated in a straight-forward manner. We have so far only considered the error of the Lanczos approximation for g(B)b. If we want to use an inexact version of B = (A − ξI) −1 by solving the involved linear systems iteratively, we need to modify (3.1) to where each column r j of R m is the residual incurred when solving for w = (A − ξI) −1 v j . Note that we have replaced the tridiagonal matrix T SI m by a generally dense upper-Hessenberg matric H m . Defining E m := −BR m V * m , a matrix of rank at most m, we can further rewrite (4.1) as Therefore, when inexact inner iterations are used, we are effectively computing an Arnoldi approximation g m : and so we need to control both terms on the right-hand side.
4.1. The first term g(B + E m )b − g m . The first term corresponds to the error of the (exact) Arnoldi approximation for g(B + E m )b and it is analysed in Appendix A. We show that for E m small enough, this Arnoldi approximation still converges at a rate very close to that of the unperturbed Krylov approximation.
By Theorem 3.2 we know that the exact shift-and-invert method with optimal shift ξ = − √ λ min λ max , as well as the standard Lanczos and extended Krylov method, converge geometrically as where C is some constant independent of m, and for standard Lanczos, for shift-and-invert Lanczos and extended Krylov.
|. This is true for the standard Lanczos method, which has been known since [14,19], but also for the extended Krylov [40] and even the shift-and-invert method. It is now easy to derive an upper bound on |γ (N ) j |, and thereby all |γ i.e., the Krylov coefficients γ (m) j used to form f m also decay geometrically at a rate α. Let us use the model that the unperturbed approximation f m has coefficients that satisfy a geometric series and therefore suggesting the estimate (4.5)

The second term
In order to analyse the second term on the right-hand side of (4.2), we first consider the simplified function g t (y) = (y −1 + ξ + t) −1 . Thanks to the Sherman-Morrison-Woodbury formula we have and therefore As a consequence, Since B = (A − ξI) −1 is Hermitian, it is easy to see that g t (B) = 1/(λ min + t) and where To estimate U (I + V g t (B)U ) −1 , we assume that R m 1 so that upon using a truncated Neumann series This gives the approximate error bound As we have g(z) = ∞ 0 g t (z) dµ(t), we obtain an approximate error bound for g(B + E m ) by integrating (4.6): To rewrite the right-hand side of (4.7), we use the same techniques as in the proof of Lemma 3.1 and Theorem 3.2. Let the functions f 1 , f 2 be defined as in (3.15), then Unfortunately, no closed form of the functions f 1 , f 2 or their derivatives is available. One can thus either evaluate the integrals numerically or use the trivial upper bound |f 1 (z)|, |f 2 (z)| ≤ f (z). Using the latter approach, we obtain the final estimate

Theoretical and practical comparison of the different methods.
We now devise recommendations which of the methods discussed in Section 2 and 3 are best suited for approximating f (A)b in a given situation (of available memory, conditioning of A, size of A, availability of spectral information) based purely on the theoretical results available for these methods. Additionally, we summarize several other features and (dis)advantages of the different methods in a concise manner and perform numerical experiments in order to gauge whether the predictions obtained from the theoretical results are trustworthy.
5.1. Advantages, disadvantages, and prerequisites. We briefly discuss general properties of the different methods, which go beyond the comparison of error bounds in Section 5.2. This comparison is compactly summarized in Table 5.1, together with the quantities determining the asymptotic speed of convergence.
Limited accuracy due to available memory. Without further countermeasures, the accuracy of some of the presented methods is still limited by the available memory. For the multi-shift CG method, the available memory dictates the maximum number of poles which can be used for the rational approximation, as one or two additional vectors of length N (depending on the specific implementation, cf. Remark 2.2) need to be stored. If the number of poles necessary for reaching the target accuracy exceeds the available memory, an alternative is to run the multi-shift CG method several times for subsets of the poles, which then increases the number of matrix-vector and inner products.
For the extended Krylov and shift-and-invert Lanczos method, the attainable accuracy is limited by the available memory because the outer iteration still requires the storage of the full orthonormal basis in order to construct the final approximation to f (A)b. If this becomes a limiting factor, restarting techniques could be employed for the outer iteration. However, the restarting of rational Krylov methods and the interaction between restarts and inexact inner solves are largely unexplored topics so far.
The two-pass Lanczos and restarted Lanczos method can reach any desired accuracy independent of the available memory (ignoring numerical effects like round-off error and assuming that the tridiagonal m×m matrix in the two-pass Lanczos method does not grow beyond memory).
Reliance on a-priori spectral information. In two of the discussed methods, spectral information on the matrix A is required. In the multi-shift CG method, when constructing a suitable rational approximation r ≈ f , one requires (bounds on) the largest and smallest eigenvalue of λ max and λ min of A. In the same way, computing the "optimal" shift ξ = − √ λ min λ max in the shift-and-invert method requires knowledge of these extremal eigenvalues. It is however possible to choose an arbitrary pole ξ independent of spectral information of A. For certain functions such alternative strategies have been shown to be successful, see, e.g. [36].
In contrast, the two-pass Lanczos, restarted Lanczos and extended Krylov method do not require any a priori spectral information.
Possibility for preconditioning. By applying a suitable preconditioner, the number of iterations necessary in the outer-inner methods, i.e., the extended Krylov and shift-and-invert Lanczos method, can potentially be greatly reduced. This can make these methods much more competitive than what one would expect from the inner convergence factors shown in Table 5.1. However, with preconditioning, the extracted approximations are no longer elements of a polynomial Krylov space K m (A, b), and hence a fair comparison is no longer possible.
Additional overhead. Besides matrix-vector products, there are also other arithmetic operations that add to the computational complexity of the considered methods. For the two-pass Lanczos method, f (T m ) needs to be evaluated and, when m gets large, this can be a challenge in its own right. While this is in principle also true for the extended Krylov and shift-and-invert method, the number of (outer) iterations in these methods will typically be significantly smaller. In the multi-shift CG method, additional vector operations for each pole beyond the first have to be performed. When the target accuracy is increased, this will also lead to an increase in the degree of the rational approximation and thus the number of poles. Depending on how the cost of a matrix-vector product compares to a vector operation, this additional work can become non-negligible; see also Experiment 5.3 below.

5.2.
Predicting the number of matrix-vector products. We now use the convergence results from above to estimate the number of matrix-vector products that are needed to achieve a certain relative error when approximating f (A)b, and thus obtain recommendations for which methods are most suitable under which circumstances. Let us stress that all the bounds presented so far are worst-case predictions that only take the extremal eigenvalues into account and therefore cannot, e.g., predict superlinear convergence effects due to spectral adaption [4]. Therefore, our predictions cannot be expected to be accurate for all matrices with spec(A) ⊆ [λ min , λ max ], but rather only for matrices whose spectra are close to a worst-case distribution. Put another way, our predictions can be expected to be good in regimes where m/N is small so that superlinear convergence has not set in. As most of the discussed methods are influenced by this in a similar manner, we still hope that the recommendations derived from this worst-case analysis are also valid in other cases. This is indeed confirmed by the numerical experiments reported in Section 5.3.
For obtaining the predictions in the non-restarted methods, we first estimate the number of (outer) iterations via the relation where α is given by (4.4) and C by (4.5). Here f m can be the mth iterate of the two-pass Lanczos, shift-and-invert Lanczos, or extended Krylov method, or the mth conjugate gradient iterate for the system (A − ζ 1 I)x 1 = b, where ζ 1 is the pole with smallest absolute value. From (5.1) we then obtain the prediction For the two-pass Lanczos method, the estimated number of matrix-vector products is 2m * , while for the multi-shift CG method it is m * (the value of m * differs here, as the conditioning of the matrix A − ζ i I is slightly better than that of A). For the outer-inner rational-polynomial methods, after computing the necessary number of outer iterations, we use the same approach for estimating the inner iterations. By summing over all inner iterations, we then obtain an estimate of the total number of matrix-vector products.
For the restarted Lanczos method, after k cycles with restart length m re we have with the convergence factor α mre (t 0 ) given in (2.1). We obtain the estimate k * = log cosh(mre log(α)) (cosh(m * log α )) , where α and m * correspond to the standard Lanczos method and where we have used the representation of the Lanczos and restarted Lanczos convergence factor in terms of the hyperbolic cosine. This is due to the fact that the estimate 1 cosh m re ln c ≈ 2c mre is rather rough for m re small. As the restart length is typically a small value, we thus obtain much better estimates using (5.2). The final estimate for the number of matrix-vector products is then given by m re · k * .

Experimental confirmation of the predictions.
We now perform numerical experiments to illustrate how reliable the predictions from the previous section are in practice. It turns out that choosing the target residual norms as suggested by (4.8) in the inner iterations of the rational methods is much stricter than necessary to reach the desired accuracy. Experimentally, we found the following strategy to yield sufficient accuracy in our experiments: when the overall target accuracy is ε, we solve the first linear system to a residual norm below and let the residual norm grow geometrically, ε j = ε1 α(0) j−1 , j = 2, 3, . . . . Apart from this deviation from our theoretical basis, all methods are executed as described before.
Experiment 5.1. In our first experiment, the matrix A ∈ C 1,000×1,000 is diagonal with Chebyshev eigenvalues in the interval [0.1, 200.1] and b is a normalized vector of all ones and we aim to approximate A −1/2 b with a relative accuracy of 10 −6 . In the multishift CG method, we use the optimal Zolotarev rational approximation [47] for the inverse square root, which requires 15 poles for the target accuracy. As one needs to store two additional vectors per pole in the multi-shift CG method we choose a restart length of m re = 30 in the restarted Lanczos method in order to compare methods with roughly the same memory consumption. For a matrix with Chebyshev eigenvalues, we expect our predictions to be rather accurate as no superlinear convergence takes place. This is confirmed by Table 5.2 which shows the predicted number of matrix-vector products according to the approach outlined in Section 5.2 as well as the actual number of matrix-vector products required by our implementations. We find that the two outer-inner rationalpolynomial methods are vastly outperformed by the polynomial Krylov methods. This is in particular true for the inexact extended Krylov method which needs by far the most matrix-vector products. This is not too surprising as already the solution of one linear system with A requires about the same number of matrix-vector products as the multi-shift CG method. Among the polynomial methods, the multi-shift CG method needs the fewest matrix-vector products (as expected), while the restarted Lanczos method needs about 13% fewer matrix-vector products than the two-pass Lanczos approach for this example.
Experiment 5.2. The matrix in Experiment 5.1 is deliberately chosen so that the actual convergence is very close to what is predicted by the worst-case bounds, informed only by the extremal eigenvalues of A. In reality, all eigenvalues of A have an influence on the convergence of (rational) Krylov methods. Thus, one might get the impression that our approach for theoretically comparing the different methods holds little value in practice. While it is true that the predicted number of matrix-vector products cannot be expected to be accurate, it actually turns out that the prediction of the ratio between the numbers of matrix-vector products of the different methods is quite accurate for very different eigenvalue distributions, in particular between the three polynomial methods. To illustrate this, we now consider a diagonal matrix with eigenvalues in [λ min , λ max ] given by with γ ∈ (0, 1) and where we choose λ min = 0.1, λ max = 200.1, and N = 1000 as before. The lower the parameter γ, the more the eigenvalues in (5.3) are clustered at one end of the spectrum, making the distribution more "favorable" for Krylov methods, as fast spectral adaptation and hence superlinear convergence can be expected. However, as the spectral interval for (5.3) stays the same, irrespective of the value of γ, the worst-case convergence bounds we used in our predictions will be less and less sharp when γ is decreased. Figure 5.1 depicts the results of applying the discussed methods to matrices with eigenvalue distributions (5.3) with γ ∈ [0.65, 0.99]. On the left-hand side, the absolute number of matrix-vector products is shown. As expected, this number decreases for decreasing γ and thus the distance to our prediction (shown by the dashed lines) grows larger and larger. On the right-hand side, we show the relative number of matrix-vector products of the methods, with the number needed by multi-shift CG as a baseline. For the polynomial methods, this results approximately in a horizontal line, revealing that the number of matrix-vector products these methods need in comparison to multi-shift CG stays almost constant for all the different eigenvalue distributions. Thus, although the prediction cannot be used to get a realistic estimate of the amount of work that is needed to solve a given problem, this experiment indicates that it gives a good idea of how different methods compare. Experiment 5.3. In terms of matrix-vector products, the multi-shift CG method always outperforms the other methods, which is to be expected. In this experiment, we use a very simple model of overall computational complexity to get a rough estimate of how multi-shift CG and restarted Lanczos compare in overall computation time for a given problem. To do so, we count vector operations in addition to matrix-vector products. We count the cost of one simple vector operation (addition or scaling) as one unit of work, written 1V. Thus, an inner product has a cost of ≈ 2V. The cost M of a matrix-vector product in these units of work depends on A. For example, for the discretization of a differential operator on a regular two-dimensional lattice, a matrix-vector product has a cost of M ≈ 10V, while for a discretization on a threedimensional lattice we have M ≈ 14V. We ignore the cost of all operations that are independent of N .
In the multi-shift CG method, the computational cost of one iteration for the "seed system" (i.e., the system with smallest shift) is 1M + 12V and each additional system requires an effort of 5V.
For restarted Lanczos, one iteration has a cost of 1M+9V and forming the iterate at the end of each restart cycle has a cost of ≈ 2m re V, where m re is the restart length (we ignore the fact that the matrix-vector product V mre y mre can typically be executed faster than 2m re − 1 individual vector operations). Using these formulas, we can estimate the overall number of work units required to compute f (A)b to a certain target accuracy by combining them with our estimates from Section 5.2. We again use the optimal Zolotarev rational approximation for the multi-shift CG method and choose the cycle length in restarted Lanczos as m re = 2p, where p is the required number of Zolotarev poles.
To obtain a realistic comparison, we do not assume that the additional work of 5V per pole is performed in all iterations for all poles, but instead use the strategy from [18] for removing already converged systems from the iteration in such a way that the overall error in the approximation for f (A)b is still guaranteed to be below the target accuracy. Figure 5.2 shows the resulting estimates for the same setting as in Experiment 5.1 for varying target accuracies. For high target accuracies, the estimate for the restarted Lanczos method is lower than that of the multi-shift CG method, as the number of poles necessary to construct an accurate enough rational approximation increases. For the 3d discretization, the break-even point comes earlier, as the cost of a matrix-vector product is higher compared to a vector operation.
6. Conclusions. Our theoretical results along with the practical comparisons in Section 5 indicate that, among the considered outer-inner polynomial Krylov methods for approximating Stieltjes matrix functions f (A)b, both the multi-shift CG and restarted Lanczos methods are the most favourable. We find that the inexact (with polynomial inner solves) versions of the shift-and-invert method as well as the extended Krylov method are generally not competitive.
The choice between multi-shift CG and restarted Lanczos should be informed by further considerations. In particular, the multi-shift CG method crucially requires inclusion intervals for the spectrum of A, and if these are not available or difficult to estimate (e.g. using a restarted Krylov method), then restarted Lanczos should be the method of choice. Note that restarted Lanczos can be implemented with deflation strategies that can further speed up the convergence. The multi-shift CG method, on the other hand, generally requires the smallest number of total matrix-vector products and might be more amendable to parallel implementation.
with the perturbation matrix E m := −BR m V * m . As E m and thus also B + E m is non-Hermitian, the results on the speed of convergence derived for Hermitian matrices are no longer applicable. We now explain why, as long as ε := E m is sufficiently small, we can still expect these results to hold in practice.
We denote by W(M ) the field of values (or numerical range) of a matrix M , by ∆(c, r) the closed disk with center c and radius r, by E(c, a, b) the horizontal, axisaligned ellipse with semi-axes a > b and center c and by S(c 1 , c 2 , r) the Bunimovich stadium with semicircle radius r and semicircle centers c 1 , c 2 ∈ R; see Figure A.1. As B = (A − ξI) −1 is Hermitian, we have W(B) = (λ max − ξ) −1 , (λ min − ξ) −1 and clearly W(E m ) ⊆ ∆(0, ε). Therefore, we find In order to derive a convergence rate for a matrix with field of values inside the Bunimovich stadium, a conformal mapping from C \ S 1 λmax−ξ , 1 λmin−ξ , ε onto C \ ∆(0, 1) is required. Unfortunately, no closed form for this mapping is known; see [46] for a treatment of this topic, where several numerical approximations for the conformal mapping of the Bunimovich stadium are proposed. We therefore embed S 1 λmax−ξ , 1 λmin−ξ , ε into an ellipse as illustrated in Figure A.1. Specifically, we use κ+1 . The conformal mapping from C \ E(a + ε, √ ε √ 2a + ε, c) onto C \ ∆(0, 1) is given by the scaled and shifted Zhukovsky map Further, let τ > 1 be such that ψ(τ ) < − 1 ξ . Note that such τ is guaranteed to exist as long as ε < 1 λmax−ξ , which we assume from here on. We can then write g(B + E m ) in terms of Faber polynomials Φ j as Using the quasi-optimality of the Arnoldi approximation, we can conclude that with the Crouzeix-Palencia constant 1 + √ 2. Bounding the Faber coefficients as |η j | ≤ 1 τ j max |z|=τ |g(ψ(z))| and noting that max |z|=τ |g(ψ(z))| = |g(ψ(τ ))|, we obtain the bound Thus, if E m < (λ max − ξ) −1 , so that we can find a suitable ellipse on which g is analytic, we can expect convergence with a rate τ −1 for the perturbed problem. Similar arguments can be made concerning the decay of the Arnoldi coefficients of the perturbed iteration. It remains to investigate how τ −1 compares to the rate of the unperturbed problem in dependence of ε. Solving ψ(τ ) = 0 yields which, after straightforward algebraic manipulations gives By noting that we can further rewrite this as which more clearly reveals how the convergence rate deteriorates with growing ε. For ε 1, the term involving √ ε dominates the perturbation given in (A.1). Thus, we can expect the convergence rate to deteriorate approximately like √ ε. In Figure A.2, we illustrate how |τ * | −1 evolves with growing ε for a diagonal matrix A with N = 1000 Chebyshev eigenvalues in the interval [0. 1, 200.1]. For this example, the smallest eigenvalue of B is (λ max − ξ) −1 ≈ 0.0049. As predicted by our arguments above, for small ε, the convergence rate is essentially the same as that of the unperturbed problem. With growing ε, the convergence rate deteriorates until it reaches the value 1 for ε = (λ max − ξ) −1 . In that case, we cannot guarantee convergence as any longer − 1 ξ ∈ E(a + ε, √ ε · √ 2a + ε, c), which is a singularity of g. Finally, we address how the tolerance of the inner iteration affects ε = E m . Fortunately, this is rather easy. From the definition of E m , we have so that we can control it via the residual norms of the inner iterations. In Figure A.2 (right), we apply the shift-and-invert method to our diagonal example matrix and also show the norm of the error matrix E m in each step. Comparing the norm ε of the error matrix in Figure A.2 (right) with the convergence rate given for these values of in ε in Figure A.2 (left), we would expect the convergence rate of the inexact method to deteriorate much earlier than it does in reality.
There are at different factors playing a role in the explanation of this effect: our results are valid for matrices with field of values in an ellipse which encloses the Bunimovich stadium, i.e., we have chosen a set containing the field of values of B +E m which is larger than necessary, as otherwise we have not been able to construct a conformal mapping. In addition, the convergence rate is only a worst-case estimate.
While in the unperturbed case we know that convergence for a matrix with Chebyshev eigenvalues will closely follow this worst-case bound, it is not clear how closely the perturbed matrix follows the worst-case bound for the perturbed case. Finally, we used the trivial inclusion W(E m ) ⊆ ∆(0, E m ), which often overestimates the actual diameter of W(E m ).