Parallelization of the rational Arnoldi algorithm

Rational Krylov methods are applicable to a wide range of scientific computing problems, and the rational Arnoldi algorithm is a commonly used procedure for computing an orthonormal basis of a rational Krylov space. Typically, the computationally most expensive component of this algorithm is the solution of a large linear system of equations at each iteration. We explore the option of solving several linear systems simultaneously, thus constructing the rational Krylov basis in parallel. If this is not done carefully, the basis being orthogonalized may become badly conditioned, leading to numerical instabilities in the orthogonalization process. We introduce the new concept of continuation pairs which gives rise to a near-optimal parallelization strategy that allows to control the growth of the condition number of this nonorthogonal basis. As a consequence we obtain a significantly more accurate and reliable parallel rational Arnoldi algorithm. The computational benefits are illustrated using several numerical examples from different application areas.

At the core of most rational Krylov applications is the rational Arnoldi algorithm, which is a Gram-Schmidt procedure for generating an orthonormal basis of a rational Krylov space.Given a matrix A ∈ C N,N , a vector b ∈ C N , and a polynomial q m of degree at most m and such that q m (A) is nonsingular, the rational Krylov space of order m is defined as Q m+1 (A, b, q m ) := q m (A) −1 span{b, Ab, . . ., A m b}.This is a linear vector space of rational functions r m (A)b = (p m /q m )(A)b all sharing the same denominator q m .The roots of q m are referred to as the poles of Q m+1 (A, b, q m ).Note that Q m+1 (A, b, q m ) reduces to a polynomial Krylov space K m+1 (A, b) := span{b, Ab, . . ., A m b} if q m ≡ 1.The smallest integer M such that K M (A, b) = K M +1 (A, b) is called the invariance index of A with respect to b.Throughout this work we assume that m < M , and hence all spaces Q m+1 (A, b, q m ) are of maximal dimension m + 1.
An interesting feature of the rational Arnoldi algorithm is that, under certain conditions, basis vectors can be computed in parallel.Take, for example, m distinct poles ξ 1 , ξ 2 , . . ., ξ m ∈ C \ Λ(A), where Λ(A) denotes the spectrum of A. Then Q m+1 (A, b) = span{b, (A − ξ 1 I) −1 b, (A − ξ 2 I) −1 b, . . ., (A − ξ m I) −1 b} is the rational Krylov space Q m+1 (A, b, q m ) with q m (z) = m j=1 (z−ξ j ), and clearly all basis vectors can be computed simultaneously from b.This is particularly attractive in the typical case where solving the linear systems (A − ξ j I)x j = b is the dominant computational cost in the rational Arnoldi algorithm.
Unfortunately, this naive parallelization approach may quickly lead to numerical instabilities.An instructive example is that of a diagonal matrix A = diag(λ i ) N i=1 , for which the basis vectors x j = (A − ξ j I) −1 b are the columns of a Cauchy-like matrix where e k denotes the kth canonical vector of appropriate dimension.Hence, X satisfies the Sylvester equation AX − XB = C with rank-1 right-hand side C = b 1 1 . . . 1 and B = diag(ξ j ) m j=1 .If the eigenvalues of A and B are well separated, e.g., by a straight line, the singular values of X decay exponentially as m increases (see [14]).Thus the matrix X will be exponentially ill-conditioned.Available rounding error analyses of the modified Gram-Schmidt procedure (with reorthogonalization) typically assume that the basis X to be orthogonalized is numerically nonsingular, i.e., g(m)εκ(X) < 1, where κ(X) is a condition number of X, ε is the machine precision, and g is a slowly growing function in m (see, e.g., [11]).Without this condition being satisfied, as in our case, there is no guarantee that the Gram-Schmidt procedure computes the exact QR factorization of a nearby matrix X + E, with E being of small norm relative to X.
The potential for exponential growth in the condition number of a rational Krylov basis seems to discourage any attempt to parallelize the rational Arnoldi method, and indeed only very few authors have considered this problem up to date.Most notably, in [28] Skoogh presented and compared, mostly from an algorithmic point of view, two (distributed memory) parallel variants.He noted that "generally the parallel rational Krylov programs get fewer converged eigenvalues than the corresponding sequential program" and that potential numerical instabilities may arise during the orthogonalization phases.Further remarks on parallel implementations in the model order reduction context are made in Grimme's thesis [15,Section 7].Some comparisons of different parallel variants, also referred to as continuation strategies, and graphical illustrations of the instabilities are contained in [17,Section 6.5].However, up to this date, a detailed theoretical analysis and practical recommendation on how to best parallelize the rational Arnoldi algorithm seems to be lacking.The main goal of our work is to fill this gap by analyzing the numerical instabilities that may occur, and to propose a new continuation strategy with superior stability properties.
The rest of the paper is organized as follows.In section 2 we review the standard (sequential) rational Arnoldi algorithm and introduce the notion of continuation pairs, which represent the free parameters to be chosen during the rational Arnoldi algorithm.In section 3 we propose and analyze a framework for constructing nearoptimal continuation pairs; near-optimal in the sense of minimizing the growth of the condition number of the nonorthogonal rational Krylov basis.Finally, sections 4-5 are devoted to parallel variants of the rational Arnoldi algorithm.Specifically, in section 4 we discuss a generic parallel variant of the rational Arnoldi algorithm, list some canonical choices for continuation pairs, and adapt the previously developed near-optimal strategy to the parallel case.In section 5 we provide a range of numerical experiments, comparing different continuation strategies, and different high-performance (parallel) implementations.Concluding remarks and possible future work are given in section 6.

7.
Set k j := ν j c j − ρ j t j and h j := µ j c j − η j t j , where t j = [t T j 0] T .
8. end for the free parameters in the sequential algorithm is central to our parallelization strategy, which is based on continuation pairs introduced and analyzed in sections 2.2-3.3.

2.1.
The rational Arnoldi algorithm and RADs.The rational Arnoldi algorithm constructs an orthonormal basis V m+1 for Q m+1 (A, b, q m ) in a Gram-Schmidt fashion as described in Algorithm 2.1.The notation C = C ∪ {∞} is used.From lines 4-7 we deduce AV j+1 (ν j c j − ρ j t j ) = V j+1 (µ j c j − η j t j ), and thus (2.1c) Concatenating (2.1d) for j = 1, . . ., m provides with the jth column of K m being [k , and analogously for H m .The matrices K m and H m are upper-Hessenberg, i.e., all the elements below the first subdiagonal are zero.Moreover, if the (η j /ρ j , t j ) are chosen correctly (see section 2.2), they also form an unreduced upper-Hessenberg pencil (H m , K m ), i.e., |h j+1,j | + |k j+1,j | = 0 for all j; see [4].We can now state the following definition [4].
form an unreduced upper-Hessenberg pencil, and the quotients {h j+1,j /k j+1,j } m j=1 , called poles of the decomposition, are outside Λ(A).In this sense we say that (2.2) is an RAD for Q m+1 (A, v 1 , q m ), where the nonzero polynomial q m ∈ P m has as formal roots the poles of (2.2).Furthermore, if V m+1 is orthonormal, we say that (2.2) is an orthonormal RAD.
It is convenient to refer to (2.2) as an RAD even if m = 0, in which case one can think of the pencil (H m , K m ) as being of size 1 × 0, and effectively we only have the matrix A and the normalized starting vector v 1 .This corresponds to the initial phase of Algorithm 2.1.
If (2.2) is an RAD for Q m+1 (A, b, q m ) and the matrix R m ∈ C m,m is uppertriangular and nonsingular, then . Such a right-multiplication of the decomposition by R m is equivalent to changing the parameters (η j /ρ j , t j ) during the rational Arnoldi algorithm.The two RADs for Q m+1 (A, b, q m ) are essentially equal ; see [4,Def. 3.1] for a precise definition.In fact, the rational implicit Q theorem [4,Thm. 3.2] asserts that the RADs related to Q m+1 (A, b, q m ) are essentially uniquely determined by A, b, and the ordering of the poles of the decomposition.Numerically, however, choosing (η j /ρ j , t j ) carefully may be rather beneficial.

Continuation pairs.
A rational Krylov space is defined by the matrix A, the starting vector b, and by the poles {µ j /ν j } m j=1 .These quantities are assumed to be given.We now discuss the roles of the "internal" parameters ρ j , η j , and t j , a problem that can be illustrated graphically as follows: To be precise, we study the influence of (η m /ρ m , t m ) ∈ C × C m for the extension of an order m − 1 RAD for Q m (A, b, q m−1 ), namely, ) continuation pair of order m.The value η m /ρ m is its continuation root, and t m its continuation vector.Further, (η m /ρ m , t m ) is called admissible for the pole µ m /ν m ∈ C\Λ(A) and the RAD (2.4), or, equivalently, for The notion of continuation vector has already been used in the literature, though not consistently.For instance, in [26] the author refers to V j t j as the continuation vector, while in [22] the term is used to denote (ρ j A − η j I)V j t j .The terminology of "continuation combinations" is adopted in [4,30,26] for the vectors t j .With the notion of continuation pair we want to stress that there are two parts, a root and a vector, both of which are equally important.
Let us now find an admissible continuation pair for (2.3).For any η/ρ = µ/ν, the RAD (2.4) can be transformed into (2.5) showing that a continuation pair, independently of the continuation root, is not ad- . This was first observed in [26] and led the author to suggest a nonzero left null vector Currently, the choices t m = e m and t m = q m appear to be dominant in the literature; see, e.g., [26,30].Note that t m = e m may (with probability zero) be not admissible, i.e., we would not be able to expand the space with the obtained w m+1 even though the space is not yet A-invariant.Such a situation is called unlucky breakdown.Nevertheless, these two choices do appear to work well in practice, but as we shall see, this is not always the case for the parallel variant.Moreover, continuation roots are frequently ignored.Typical choices are zero and infinity, without justification.An exception is [22], where a choice for (ϑ, t m ) is recommended in a way such that (ϑ, V m t m ) is a rough approximation to an eigenpair of A.
2.3.Optimal continuation pairs.We will now show that for the sequential rational Arnoldi algorithm, Algorithm 2.1, there exist continuation pairs which yield w m+1 such that w m+1 ⊥ R(V m ).We refer to such continuation pairs as optimal, as we are mainly concerned with the condition number of the basis being orthogonalised.
Definition 2.3.An admissible continuation pair Equivalently, if the two RADs appearing in (2.3) are orthonormal, the continu- The key observation is thus triggered by the rational implicit Q theorem [4, Theorem 3.2], which asserts that the new direction v m+1 we are interested in is predetermined by the parameters (A, b, q m ) defining Q m+1 (A, b, q m ).The following theorem, which we restate here for completeness, provides a useful representation of v m+1 .
It proves useful to derive a closed formula for the optimal continuation vector t m .To this end, let x m be a right generalized eigenvector of (H m , K m ) corresponding to the eigenvalue η m /ρ m ; i.e., (ρ m H m − η m K m )x m = 0. Right-multiplying the RAD of the form (2.5) with µ/ν ≡ µ m /ν m and η/ρ ≡ η m /ρ m , but of order m, by x m yields This gives the optimal continuation vector provided that γ m = (e T m x m )(ρ m h m+1,m − η m k m+1,m ) = 0, which holds true under the assumption that (2.2) is an RAD.Indeed, if γ m = 0, then V m (ν m H m − µ m K m )x m is an eigenvector of A with eigenvalue η m /ρ m , which implies the non-existence of an RAD of order m with starting vector v 1 , as Lemma 2.6 below shows.
To prove Lemma 2.6 we use the following result, which is given for the cases (α, β) being either (1, 0) or (0, 1) in [4,Sec. 2].Here we require the more general statement.
Proof.Consider auxiliary scalars α = 1 and any β ∈ C such that αh j+1,j − βk j+1,j = 0 for j = 1, 2, . . ., m. Multiplying the RAD (2.2) by α and subtracting βV m+1 K m from both sides gives (2.9) The choice of α and β is such that αH m − βK m is an unreduced upper-Hessenberg matrix, and as such of full column rank m.In particular, the right-hand side of (2.9) is of full column rank m.Thus, the left-hand side, and in particular K m , is of full column rank.This proves the statement for the case α = 0.For the case α = 0, consider α = α and β = β in (2.9).If αH m − βK m is unreduced, then it is of full column rank and the statement follows.If, however, αH m − βK m is not unreduced, then we have αh j+1,j − βk j+1,j = 0 for at least one index j ∈ {1, 2, . . ., m}.Equivalently, β/α = h j+1,j /k j+1,j ; that is, β/α equals the jth pole of (2.2) and hence αA − βI is nonsingular.Finally, since V m+1 and K m are of full columns rank, the left-hand side of (2.9) is of full column rank.It follows that αH m − βK m is of full column rank as well, and the proof is complete.Lemma 2.6.Let (2.4) be an orthonormal RAD.The space R(V m ) is A-invariant if and only if there exists an eigenpair of A of the form (ϑ, V m t).
Proof.Let us assume that AV m t = ϑV m t, with t = 0. We can extend (2.4) into . Let us, to the contrary, assume that t = K m−1 z , for some nonzero vector z ∈ C m−1 .As (ϑ, V m t) is an eigenpair of A, it follows from (2.4) that ϑK m−1 z = H m−1 z .This implies that H m−1 − ϑK m−1 is not of full column rank, which is in contradiction with Lemma 2.5.
3. Near-optimal continuation pairs.While Proposition 2.7 characterizes optimal continuation pairs precisely, it requires the last column of (H m , K m ), which is not available without computing v m+1 in the first place.Our idea in this section is to employ a rough approximation ( H m , K m ) ≈ (H m , K m ) to obtain a near-optimal continuation pair.We then quantify the approximation accuracy that is required in order to generate a well-conditioned rational Krylov basis.Some of the equations from section 2 are restated for ease of reference.
3.1.The framework.Assume we are given an RAD of order j − 1, namely, We seek a near-optimal continuation pair (η j /ρ j , t j ) for expanding (3.1) into using the pole ξ j = µ j /ν j .To this end we make use of an auxiliary continuation pair ( η j / ρ j , t j ), the only requirement on which is to be admissible.For example, it could be the one proposed by Ruhe [26].Let us consider the associated linear system The solution w could be used to expand the rational Krylov space we are constructing.However, to obtain a near-optimal continuation pair we instead suggest to approximate the solution w ≈ w j+1 .(The solution to (3.3) is labeled w , and not w j+1 , since w j+1 is reserved for (ν j A − µ j I)w j+1 = (ρ j A − η j I)V j t j .)To make the whole process computationally feasible, obtaining this approximation should be inexpensive; see Remark 3.5.The pencil ( H j , K j ) is then constructed as usual in the rational Arnoldi algorithm, pretending that w j+1 was the true solution.As a result we obtain perturbed Hessenberg matrices where Then a near-optimal continuation pair is given by Our goal in the next section is to evaluate the quality of such near-optimal continuation pairs within the rational Arnoldi algorithm.In particular, we provide an upper bound on the condition number of the basis being orthonormalized, based on the error v j+1 − v j+1 2 .

Inexact RADs. We continue by introducing the residual
By (3.1), (3.4), and (3.8) we have an inexact rational Arnoldi decomposition (IRAD) where ] is orthonormal and Multiplying (3.9) by ν j and then subtracting µ j V j+1 K j from both sides provides where Vj+1 = V j v j+1 + f j+1 , and (3.12) Eq. (3.11) holds since the last row of ν j H j − µ j K j is zero.We can also "add back" µ j Vj+1 K j to both sides of (3.11), and rescale by ν −1 j to get Finally, under the assumption that Vj+1 is of full rank, (3.13) is a non-orthonormal RAD, equivalent to the IRAD (3.9).Theorem 2.4 applied to (3.13) asserts that the eigenvalues of H j , K j are the roots of the rational function corresponding to the vector v j+1 + f j+1 .This discussion is summarized in the following theorem.Theorem 3.1.Let the orthonormal RAD (3.1) and the auxiliary continuation pair ( η j / ρ j , t j ) be given.Denote by w j+1 ∈ R(V j ) an approximate solution to (3.
in line 4 of Algorithm 2.1.
The vector w j+1 is not necessarily orthogonal to R(V j ), but if f j+1 2 is "small enough" it almost is, since the vector v j+1 is orthogonal to R(V j ).We make this more precise in the following corollary.
Corollary 3.2.Let the assumptions of Theorem 3.1 hold.If Proof.By Theorem 3.1 we have The statement (3.15) now follows from the reverse triangle inequality and the monotonicity of arctan, using the relation Note that Corollary 3.2 can be formulated even if f j+1 2 ≥ 1, but in this case would provide no useful information.Before continuing with the analysis of our nearoptimal continuation strategy, let us remark on the choice of ( η j / ρ j , t j ).
Remark 3.3 (auxiliary continuation pairs).The authors in [22] consider the rational Arnoldi algorithm with inexact solves, and suggest to use continuation pairs (η j /ρ j , t j ) such that (η j /ρ j , V j t j ) is an approximate eigenpair of A close to convergence.As inexact solves are used within our framework to get a near-optimal continuation pair, this observation also applies to the auxiliary continuation pair ( η j / ρ j , t j ).

Condition number of the Arnoldi basis. As Vj+1
Composing these bounds for all indices j we are able to provide an upper bound on the condition number κ(W m+1 ) of the basis which is constructed iteratively by Algorithm 2.1.The Gram-Schmidt orthogonalization process is mathematically equivalent to computing the thin QR factorisation where the first equality follows from (2.1).As already discussed in the introduction, numerical instabilities may occur if κ(W j+1 ) is too large.Theorem 3.4.Let the assumptions of Theorem 3.1 hold for j = 1, 2, . . ., m, and let the orthonormal RAD (2.2) be constructed with Algorithm 2.1 using near-optimal continuation pairs (η j /ρ j , t j ) given by (3.6)-(3.7).Let R 1 = I 1 , and R j+1 be as in (3.17).Assume that the scaled error f j+1 at iteration j satisfies f j+1 2 < 1.Then for all j = 1, 2, . . ., m we have (3.18) In particular, κ(W m+1 ) ≤ σ u m+1 /σ l m+1 .Proof.For any j = 1, 2, . . ., m we have with V * j v j+1 = 0, and v j+1 2 = 1.The proof goes by induction on j.For j = 1 the statement follows from (3.19), (3.16), and the fact that [R 1 e 2 ] = I 2 .
Let us assume that (3.18) holds for j = 1, 2, . . ., < m.For the induction step we consider the case j = + 1, and use the fact that, for any two conformable matrices X and Y of full rank, there holds σ max (XY ) ≤ σ max (X)σ max (Y ) and σ min (XY ) ≥ σ min (X)σ min (Y ).Hence, (3.18) for j = + 1 follows from (3.19), the bound (3.16) for [V +1 v +2 + f +2 ], the fact that the singular values of R +1 coincide with the singular values of W +1 , and the observation for three different values of f 2 .Theorem 3.4 asserts these to be upper bounds on κ(W m+1 ) provided that far all j there holds f j+1 2 ≤ f 2 .Right: We plot the function f j+1 2 → arctan , which provides a lower bound on ∠(w j+1 , V j ).3σ u j /σ l j .That is, the bound σ u j+1 /σ l j+1 on the condition number κ(W j+1 ) grows by at most a factor of 3 compared to σ u j /σ l j , which does not necessarily imply κ(W j+1 ) ≤ 3κ(W j ).It would imply that, if σ min (W j ) ≤ 1 ≤ σ max (W j ) holds true (this observation is clear from the proof of Theorem 3.4).In Figure 3.1(a) we illustrate the upper bounds given by Theorem 3.4 for some particular values of f j+1 2 .Figure 3.1(b) visualizes the lower bound, provided by Corollary 3.2, on the angle ∠(w j+1 , V j ).For example, for a rough approximation w j+1 that gives f j+1 2 = 0.5, we have ∠(w j+1 , V j ) ≥ π 4 .Remark 3.5.If the poles of the rational Krylov space are fairly well separated from the spectral region of A, a good approximate solution to (3.3) may be obtained with a few iterations of a cheap polynomial Krylov method, like unpreconditioned FOM or GMRES, or with a cycle of multigrid [27].Computational examples of this situation are given in sections 3.4, 5.1, and 5.2.When direct solvers are used within the rational Arnoldi algorithm, it may even be worth solving (3.3) to full accuracy, as the most costly computation is the analysis and factorization of each shifted linear system, which is done at most once per pole.An example of this situation is given in section 5.3.

Numerical illustration.
In Figure 3.2 we illustrate the effectiveness of our near-optimal continuation framework.The matrix A is of size N = 1000, and it is generated in MATLAB with A=-5*gallery('grcar',N,3). This is a nonnormal matrix and its eigenvalues are shown in Figure 3.2(a), together with the m = 16 poles used in this example.The poles are obtained using the RKFIT algorithm [4,5] and optimized for approximating exp(A)b, where the starting vector b has all its entries equal to 1. (A similar example is considered in [4, Sec.5.3].)Two experiments are performed with this data, and they differ in the way the approximants w j+1 , used to obtain the near-optimal continuation pairs, are computed.Since the poles are far away from Λ(A), we expect a few iterations of FOM to provide good approximants w j+1 .To obtain each w j+1 we hence use a fixed number k of FOM iterations;     this is referred to as FOM(k).In We observe that the bounds get sharper as the error f j+1 2 gets smaller, and also that the two bases W m+1 , computed using both the FOM(2) and FOM(3) near-optimal continuation strategy, get better conditioned as the approximations w j+1 get more accurate.
In both examples the nonorthogonal bases are in fact remarkably well-conditioned.
Note that computing or estimating f j+1 2 may be too costly in practice.However, the main message of Theorem 3.4 and this numerical illustration is that rather poor approximations w j+1 are sufficient to limit the growth of κ(W m+1 ) considerably.See also Figures 3.1-3.2.
4. Parallel rational Arnoldi algorithm.In this section we introduce a new parallel variant of the rational Arnoldi algorithm based on near-optimal continuation pairs.The parallelism we consider comes from generating more than one of the basis vectors concurrently.Another possibility is to parallelize the involved linear algebra operations, thought that might scale less favorably as the number of parallel processes increases.Combining both parallelization approaches is also viable, and our implementation supports this.Further comments about the implementation and numerical examples are given in section 5.

4.1.
High-level description of a parallel rational Arnoldi algorithm.The aim of the parallel rational Arnoldi algorithm, outlined in Algorithm 4.2 and the discussion below, is to construct an RAD for Q m+1 (A, b, q m ) using p > 1 parallel processes.The basis is constructed iteratively, but unlike the sequential version, at each iteration p > 1 vectors are computed simultaneously, one per parallel (with a possible exception of the last iteration if m is not a multiple of p).The poles assigned to distinct parallel processes have to be mutually distinct, as otherwise we would not obtain p linearly independent vectors to expand the rational Krylov space.
We assume a copy of the matrix A, the starting vector b, and the poles {ξ j } m j=1 to be available to each parallel process.After the orthogonal basis vectors have been constructed, they are broadcasted to the other parallel processes for use in the following parallel iterations.This means that a copy of the basis V j+1 is available to every parallel process.We now go through Algorithm 4.2 line by line.
In line 2 of Algorithm 4.2 the starting vector b is normalized (on every parallel process = 1, 2, . . ., p), providing the first basis vector v 1 ≡ v [ ] 1 .We use the superscript notation (•) [ ] to denote that the quantity (•) belongs to the parallel process .
If a quantity is sent to another parallel process ˆ = , a copy (•) [ ˆ ] = (•) [ ] becomes available to ˆ .The main part of the algorithm is the j-loop spanning lines 3-27, where the remaining m vectors of the orthonormal basis are generated by p parallel processes, which requires m p iterations.The variable s represents the order of the RAD AV s+1 K s = V s+1 H s constructed so far, and every parallel process has its own copy of it.The variable p ≤ p equals p for all iterations j, except perhaps the last one Algorithm 4.2 Parallel rational Arnoldi for distributed memory architectures.
Mark processor as inactive.
Applies to the case j = m p if p m. 8.
else if > k then 22.

Broadcast k [ ] s+k and h
[ ] s+k from parallel process k.

26.
end for (orthogonalization loop) 27.end for (main loop) where p = m − s represents the number of remaining basis vectors to be constructed.Parallel processes with labels greater than p do not perform the remaining part of the last iteration of the j-loop.
The selection of continuation pairs in line 9 is discussed in subsection 4.2.We shall only stress that the continuation pairs are of order s + 1 for all , and that we assume the choice to be such that unlucky breakdown is avoided.Once the continuation pairs have been computed, a new direction w [ ] s+ +1 is computed the same way as in the sequential rational Arnoldi algorithm; cf.line 10.The orthogonalization part, however, is more involved and consists of two parts.
The first part of the orthogonalization process corresponds to lines 11-12, where the newly computed vector w ).The second part of the orthogonalization corresponds to the loop in lines 13-26, and involves communication between the parallel processes.In this part the (partially) orthogonalized vectors w [ ] s+ +1 are gradually being orthonormalized against each other.As soon as w s+k+1 in line 15, it is broadcasted to the remaining active parallel processes in line 17.At this stage the parallel process k updates the RAD from order s + k − 1 to order s + k (lines 19-20), while the active parallel processes > k orthonormalize w [ ] s+ +1 against the just received v [ ] s+k+1 ; lines 22-23.The final part, line 25, is to broadcast the update for the reduced upper-Hessenberg pencil from parallel process k to the remaining active ones.The communication between the p parallel processes involves O(p 2 ) messages, which is not prohibitive in our case as p is typically moderate (not exceeding p = 8 in our experiments in section 5).
Alternative implementation options.Depending on possible memory constraints, one may consider distributing the basis, instead of having copies on every parallel process.In this case the p vectors s+ could be formed jointly by all the parallel processes, and once all are formed and distributed, the vectors w [ ] s+ +1 may be constructed independently.The Gram-Schmidt process can be adapted accordingly.
A shared memory implementation may follow the same guidelines of Algorithm 4.2, excluding the broadcast statements.Also, the second part of the orthogonalization may be performed jointly by assigning an (almost) equal amount of work to each thread.(The index notation adopted in Algorithm 4.2 guarantees different threads not to overwrite "each others" data.)4.2.Locally near-optimal continuation pairs.We now discuss the choice of continuation pairs for Algorithm 4.2.To this end we use the continuation matrix T m := [t 1 t 2 . . .t m ] ∈ C m,m , which collects the continuation vectors (padded with zeros, if needed) of order j = 1, 2, . . ., m as they are being used in the sequential rational Arnoldi algorithm.Consequently, T m is an upper triangular matrix.For the parallel rational Arnoldi algorithm, we order the continuation vectors t where the same vector v s+1 is used for all .These two choices are illustrated with the aid of the corresponding continuation matrices in Figures 4.1(b)-4.1(c)for the case m = 12 and two distinct choices for p.The choice (4.1) was used in [28] with infinity as the corresponding continuation root, while (4.2) has been introduced in [17, section 6.5].Another possibility for continuation vectors is to use Ruhe's strategy [26] locally on each parallel process; is a full QR factorization of s ∈ C s+1,s is upper triangular with last row being 0 T .The corresponding continuation matrix T m is shown in Figure 4.1(d) for the case when the poles on each parallel process are being used repeatedly, which generates this curious nonzero pattern.If the poles were not repeated cyclically, T m would generally be populated with nonzero elements in the allowed (shaded) region.These canonical choices for the continuation vectors may be supplemented with continuation roots being either zero or infinity, for example.
Let us now move on to discussing the admissibility and optimality conditions on continuation pairs in the parallel case.By assumption, the p active parallel processes at a given iteration j have mutually distinct poles {µ s+ /ν s+ } p =1 , where s = (j −1)p.It is easy to show that if for every = 1, 2, . . ., p the continuation pair (η , that is, if it is locally admissible for each parallel process, then no unlucky breakdown occurs during iteration j overall, assuming exact arithmetic and s + p < M .Hence, an example of admissible continuation pairs for Algorithm 4.2 is (η s+ provided by (4.3).Unfortunately, obtaining p > 1 optimal continuation pairs concurrently is almost always impossible.
Proof.The rational implicit Q theorem [4, Theorem 3.2] implies that the RADs Hence, the essentially unique basis vectors v [ ] s+ +1 which are mutually orthogonal to each other and to R(V [ ] s+1 ) are represented by rational functions p s+ q −1 s+ of type at most (s+ , s+ ).(The type of a rational function is the ordered pair of its numerator and denominator polynomial degrees.)For any (η are rational functions in A times the starting vector b of type at most (s + 1, s + 1) for all , which does not match the type (s + , s + ) when > 1.The only possibility to obtain, e.g., w [ ] for > 1 would be if, by chance, − 1 of the (formal) roots of p s+ canceled with − 1 poles of q s+ .By remarking that this may never happen, for instance, for Hermitian A with real-valued poles outside the convex hull of Λ(A), as the roots of p m+ are contained in the aforementioned convex hull (which is easy to show), we conclude the proof.
For the sequential version we have just enough degrees of freedom to be able to find an optimal continuation pair.For p > 1 there is a lack of degrees of freedom, which gets more pronounced as p increases.This can also be interpreted visually in Figure 4.1, where the shaded area decreases with increasing p.
Our proposal is thus to apply the near-optimal framework from section 3 locally on each parallel process , i.e., where ( ) with the pole µ s+ /ν s+ , and where ( ), thought nothing can be said a priori about their mutual angles.
With such an approach we expect the condition number of the basis W m+1 undergoing the Gram-Schmidt orthogonalization process to increase compared to the sequential case.However, as the growth of κ(W m+1 ) gets substantially suppressed with our near-optimal strategy (if the f j+1 2 are small enough), we hope that the growth due to the parallelization is not prohibitive.Our numerical experiments in section 5 confirm that our approach based on near-optimal continuation pairs is more robust than the canonical continuation strategies.We end this section with a few practical considerations.
Real-valued near-optimal continuation pairs.Recall that a near-optimal continuation pair is formed from an eigenpair of (H j , K j ); cf.(2.11).Even if A, b and the poles are real-valued, a near-optimal continuation pair may hence be complex, which may be undesirable.This problem can be resolved easily: in particular, if j is odd, there is at least one real-valued eigenpair of (H j , K j ), and it can be used to construct a real-valued continuation pair (η j /ρ j , t j ).Thus, for the parallel algorithm with p being even, we have that s+1 = (j −1)p+1 is odd and hence a real-valued nearoptimal continuation pair exists.In our implementation for real-valued data we hence construct near-optimal continuation pairs as in (2.11), but with (η/ρ, x ) replaced by ( (η/ρ), (x )), where (η/ρ, x ) is an eigenpair of (H j , K j ) such that (η/ρ) = 0 or otherwise (η/ρ)/ (η/ρ) → min.Therefore, for odd j or odd s + 1, we obtain ( (η/ρ), (x )) = (η/ρ, x ).One could also consider the constrained problem of finding a best real-valued (η j /ρ j , t j ), but we have not done this as the problem with complex continuation pairs disappears with common (even) values of p.
Matrix pencils and nonstandard inner products.The proposed near-optimal framework carries over to a rational Arnoldi variant working with a pencil (A, B) instead of a single matrix A. The only modification is to replace (ν j A − µ j I) −1 (ρ j A − η j I)V j t j by (ν j A − µ j B) −1 (ρ j A − η j B)V j t j , and to ensure that µ j /ν j ∈ Λ(A, B).Furthermore, the analysis from section 3 can be adjusted to allow for a nonstandard inner product, as the main results are based on the representation from Theorem 2.4, which only assumes V m+1 to be of full rank (not necessarily with columns orthonormal in the standard inner product).
Reordering poles.The ordering of the poles {µ j /ν j } m j=1 is likely to influence the condition number κ(W m+1 ) of the basis W m+1 being orthogonalized in the Gram-Schmidt process.By (approximately) maximizing the distance between any two distinct poles from {µ s+ /ν s+ } p =1 used simultaneously, one may obtain a better conditioned basis W m+1 .We have not yet analyzed the numerical influence of the pole ordering and leave this for future work.

Numerical experiments.
In this section we report on three numerical experiments from different application areas, each illustrating another aspect of our parallel algorithms.The algorithms are implemented in C++, using Intel MKL's LA-PACK and BLAS for dense linear algebra operations, SparseBLAS for sparse-matrix dense-vector multiplications, and PARDISO for the linear system solves (Intel MKL version 10.0.1).Sparse matrices are stored in the CSR format.All tests are run on an Intel Xeon CPU E56-2640, with 6 cores (12 threads), running at 2.5 GHz.We have 64 GiB of RAM at our disposal.The code is compiled with the Intel icpc compiler (version 12.0.5)using the -O3 flag.The implementation of the MPI standard is Open MPI (version 1.6).Both the sequential and the parallel rational Arnoldi algorithm are linked with either the sequential or multi-threaded version of Intel MKL, giving raise to the following four configurations: variant algorithm Intel MKL 1 × 1 Algorithm 2.1 sequential 1 × p Algorithm 2.1 multi-threaded p × 1 Algorithm 4.2 sequential p × p Algorithm 4.2 multi-threaded In addition to our (parallel) high-performance C++ implementation 1 , we provide a MATLAB implementation rat krylov of the rational Arnoldi algorithm in our Rational Krylov Toolbox on http://rktoolbox.org[3].The implementation is sequential, but it can simulate the parallel execution by using appropriate continuation vectors; cf.Given a computed rational Arnoldi decomposition AV m+1 K m = BV m+1 H m , where B may but does not have to be the identity matrix, we assess various continuation strategies using the following quantities.
orth Departure from orthonormality Here, the notation denotes the employed inner product and is application dependent.cond The condition number κ(W m+1 D) = κ 2 ( W m+1 D, W m+1 D ) of the rescaled basis W m+1 D, with respect to the inner product used.We have used MAT-LAB's fminsearch to determine a diagonal matrix D such that κ(W m+1 D) is (approximately) minimized.This is because the stability of the Gram-Schmidt procedure applied to W m+1 is unaffected by column scaling of W m+1 .space The space R(V m+1 ) is a (rational) Krylov space for B −1 A, where we assume B to be nonsingular, if and only if S = B −1 AV m+1 −V m+1 B −1 AV m+1 , V m+1 has rank at most one; see [29,Cor. 3.3].We therefore look at the ratio σ 2 /σ 1 of the second largest and the largest singular values of S. The smaller the ratio is, the least R(V m+1 ) deviates from a (rational) Krylov space.In all our experiments the relative backward error of the computed RAD was always close to the level of machine precision, and is hence not reported.When reporting the total CPU time, we provide a breakdown made up of four components.The component mv+orth measures the elapsed time for lines 2, 11-26, and the computation of (ρ s+ in line 10 of Algorithm 4.2.The component solve measures the solution phases consisting of backward and forward substitutions for the linear systems in line 10, while factorize measures the initial symbolic and numerical factorizations.Finally, continuation measures the time spent in line 9 of Algorithm 4.2.An analogous breakdown is given for Algorithm 2.1. 5.1.Exponential integration.Our first example relates to the modeling of a transient electromagnetic field in a geophysical application [6].We are given a symmetric positive semidefinite matrix A ∈ R N,N and a symmetric positive definite matrix B ∈ R N,N , and the task is to solve Be (t) + Ae(t) = 0, e(0) = b, for the electric field e(t).The time parameters of interest are t ∈ T = [10 −6 , 10 −3 ].
The approach suggested in [6] is to build a B-orthonormal rational Krylov basis , where B −1 A is never formed explicitly, and to extract Arnoldi approximants for all desired time parameters t ∈ T .Here b B = (b T Bb) We test various parallel continuation strategies for computing the basis V m+1 , namely, our near-optimal FOM(5) continuation strategy and the two canonical variants specified by (4.1) and (4.2).We also compare to the sequential approach using again the FOM(5) strategy for predicting the next basis vector.The numerical results are shown in Table 5.1 and Figure 5.1.We highlight that all these tests have been run using classical Gram-Schmidt without reorthogonalization.The reasoning behind this choice is that our FOM(5) continuation strategy tries to choose continuation pairs which lead to very well-conditioned basis vectors and hence the Gram-Schmidt procedure will work fine without reorthogonalization.This is justified by the condition number cond and the orthogonality measure orth in Table 5.1, which are both much better with the parallel FOM(5) strategy than they are with the canonical variants (4.1) and (4.2).
We observe that for variant (4.1) the space R(V m+1 ) deviates significantly from a rational Krylov space (column space).This instability in computing the rational  In Figures 5.1(c)-5.1(d)we report the CPU timings (averaged over 50 runs) for our C++ implementations.The first bar labeled [1 × 1] corresponds to the sequential algorithm run using the continuation strategy (4.2).We find that the computationally most costly parts are the four matrix factorizations (one factorization for each of the four distinct poles), and the solution phases consisting of the 4 × 9 backward and forward substitutions.Some speedup is achieved by using four threads to factorize and solve with each system one after the other; note the reduction in computation time when going from [1 × 1] to [1 × 4] in our notation.However, it is apparent that even more speedup is obtained by using a single core to factor and solve, but to do this with four matrices simultaneously (this corresponds to the 4 × 1 case), even though our near-optimal FOM(5) continuation strategy adds significant computational cost.
Further reduction in computation time is achieved by combing both levels of parallelism, i.e., factorizing and solving with all four matrices simultaneously using two threads in each case (the 4 × 2 case which we have only timed for the larger example).Let us also point out that the mv+orth portion is slightly bigger for the 4×1 case compared to the [1×4] case.This is mainly due to the added communication between the parallel MPI processes within the 4 × 1 variant.However, the difference is not large and indicates that communication costs are negligible here.

Model order reduction.
Our second example is the INLET problem from the Oberwolfach Model Reduction Benchmark Collection [1], an active control model of a supersonic engine inlet; see also [21].There are two nonsymmetric matrices {A, E} ⊂ R N ×N , a block of vectors B ∈ R N ×2 , and a row vector c T ∈ R 1×N in this problem, where N = 11730.In this test we only use the first column of B, say b, and consider the problem of approximating the transfer function H(s) = c T (sE − A) −1 b over a range of frequencies.To this end we choose a number of p poles being distributed equidistantly over the interval i[0, 40] and repeated cyclically, until we obtain a rational Krylov space of order m = 24.The gain |H(is)| of the full model as well as the gains of the reduced models are shown in Figure 5.2(a).(The deviations from the transfer function of the full system are not necessarily related to instabilities, but also due to the distribution of the poles.) In all cases we have used our FOM(5) near-optimal continuation strategy with classical Gram-Schmidt orthogonalization (without reorthogonalization).The table given in Figure 5.2(b) shows how the condition number of the basis being orthogonalized has the tendency to increase with the number of processors (albeit not monotonically, which is partly due to the fact that different poles are used for the different variants), and how the orthogonality measure deteriorates.This is a demonstration of how increasing the level of parallelism leads to operations on more ill-conditioned bases, but the level of ill-conditioning may still be acceptable for this application.

5.3.
A complex non-Hermitian eigenvalue problem.Our last example is a finite element discretization matrix of a three-dimensional waveguide (waveguide3D) from The University of Florida Sparse Matrix Collection [7].This non-Hermitian matrix A is of size N = 21036 and has complex entries.Our aim is to compute a few of the propagating wave modes associated with eigenvalues of A close to the interval [0, 6 × 10 −3 ].To this end we place p = 8 equidistant poles on this interval and repeat them cyclically for eight times, thus building a rational Krylov space of order m = 64, using modified Gram-Schmidt with reorthogonalization.From the computed rational Krylov decomposition AV m+1 K m = V m+1 H m we extract harmonic Ritz pairs (see e.g.[24]) with target σ = 3 × 10 −3 , i.e., we solve a generalized eigenvalue problem K * m (H m − σK m )y = (ϑ − σ)K * m K m y and use (ϑ, V m+1 K m y ) as approximations to some of the eigenpairs of A. See Figure 5.3(a) for a visualization of the eigenvalues, poles, and harmonic Ritz values.
As is typical for eigenvalue problems, the poles of the rational Krylov space are close to the eigenvalues of A, and hence a continuation prediction using FOM is likely to be unsuccessful.We therefore use the direct solver itself to predict the continuation vectors, which doubles the number of linear system solves but the factorizations are computed only once (per distinct pole).Figure 5.3(b) shows the harmonic Ritz residuals with various continuation strategies discussed in this paper for p = 8 processors, including our near-optimal continuation pair (4.4), which becomes optimal for p = 1 (i.e., the predicted basis vectors are already orthogonal to the previous vectors).The timings are reported in Figure 5.4.We notice that due to the rather expensive construction of near-optimal continuation pairs the [1 × p] version is faster than the p × 1 for p = 2, but already for p = 4 the situation is reversed as then p × 1 scales better.
We note that better results with all variants can be obtained by explicitly projecting the eigenvalue value problem with respect to the computed rational Krylov basis, instead of using the quantities from the RAD.Such an explicit projection is, however, undesirable because of the added computational cost.6. Summary and future work.In the first half of the paper, sections 2-3, we introduced and analyzed the new notion of continuation pairs (η j /ρ j , t j ) used within the sequential rational Arnoldi algorithm to extend the orthonormal basis V j of a rational Krylov space Q j (A, b, q j−1 ) by adding the new direction w j+1 = (ν j A − µ j I) −1 (ρ j A − η j I)V j t j orthonormalized against R(V j ) in a Gram-Schmidt fashion.This process is mathematically equivalent to computing a thin QR factorization of W m+1 := b w 2 w 3 . . .w m+1 = V m+1 R m+1 and, if W m+1 is ill-conditioned, numerical problems may occur in finite precision arithmetic.
By choosing continuation pairs carefully the growth of κ(W m+1 ) can be limited.In particular, in Proposition 2.7 we showed how to choose (η j /ρ j , t j ) so that W m+1 = V m+1 is already orthonormal.These continuation pairs are called optimal.However, the construction of such optimal pairs requires the knowledge of v j+1 , which is why we proposed in section 3 to consider near-optimal continuation pairs based on approximations v j+1 ≈ v j+1 .In Corollary 3.2 we provided a lower bound on the angle ∠(w j+1 , V j ) based on the norm f j+1 2 of the error f j+1 = v j+1 − v j+1 .Moreover, with Theorem 3.4 we provided an upper bound on κ(W m+1 ), again in terms of f j+1 2 .These results show that rather poor approximations (e.g., f j+1 2 = 10 −1 ) can already reduce κ(W m+1 ) significantly.It is worth noting that our analysis applies to the polynomial Arnoldi algorithm as well.To the best of our knowledge, this is the first time that an approach of this kind has been proposed.The second half of the paper, sections 4-5, was devoted to the parallel rational Arnoldi algorithm.In section 4.1 we described a generic parallel rational Arnoldi variant for distributed memory architectures, and in section 4.2 we considered the selection of continuation pairs.Unlike the sequential case, in the parallel case it is generally impossible to choose continuation pairs so that W m+1 is orthonormal; see Proposition 4.1.We thus suggested using the (sequential) near-optimal continuation strategy locally on each parallel process.Numerical experiments from three different application areas were reported in section 5.The results indicate that our near-optimal continuation strategy performs more robustly than other (canonical) strategies, and how this affects the quality of the approximations, e.g., to f (A)b or to some eigenpairs, extracted from these bases.The example from section 5.3 shows that near-optimal continuation pairs can be beneficial for the sequential algorithm as well; see Figure 5.3.In terms of computational efficiency, our experiments also showed that parallelizing the rational Arnoldi algorithm, as opposed to parallelizing only the involved linear algebra operations, leads to better scaling properties and ultimately to a faster rational Arnoldi algorithm.
We now list some open questions that may be considered for future work from both a theoretical and a practical point of view.The impact of the choice of auxiliary continuation pair ( η j / ρ j , t j ) may be further studied.The work presented in [22] may be relevant for making progress in this direction.The question concerning (constrained) real-valued near-optimal continuation pairs mentioned at the end of section 4.2 is also interesting, as well as the question of what is the best pole ordering.Another open question is that of formulating and solving (if possible) an optimization problem that would give near-optimal continuation pairs which actually minimize κ(W m+1 ) in the parallel case.For p > 1 we have shown that κ(W m+1 ) > 1 (strictly), but it is not clear how "close to orthonormal" can W m+1 be brought.

Fig. 3 . 1 :
Fig. 3.1: Evaluating the quality of the near-optimal continuation strategy.Left: We plot the Spectrum A and location of poles.
their indices s + , obtaining again an upper triangular matrix.In the parallel case there are, however, further restrictions on the nonzero pattern of T m , as can be observed in Figure4.1.There we display three canonical choices for t [ ] s+ .Perhaps the two most canonical choices for continuation vectors are t [ ] s+ = e max{1,s+1−p+ } , (4.1) where each parallel process applies the transformation (ν s+ A − µ s+ I) −1 (ρ [ ] s+ A − η [ ] s+ I) to either the rescaled starting vector v [ ] 1 (for s = 0) or to the vector v

Fig. 4 . 1 :
Fig. 4.1: Canonical continuation matrices T 12 for the sequential, 4.1(a), and parallel, 4.1(b)-4.1(d),rational Arnoldi algorithm.The shaded area in the upper triangles of the continuation matrices represents the allowed nonzero pattern, while the elements marked with × represent a particular choice of nonzeros.For instance, the sequential continuation strategy in 4.1(a) corresponds to t j = γ j e j = 0.Each of the three parallel variants corresponds to a canonical choice described in section 4.2, with a varying number of parallel processes p > 1.

Figure 4 . 1 .
Dedicated toolbox examples allow to reproduce the numerical results in this section (except the timings) in a straightforward manner.

Fig. 5 . 3 :Fig. 5 . 4 :
Fig. 5.3: Left: "Exact" eigenvalues of the waveguide problem and harmonic Ritz approximations with relative residual norms below 10 −8 extracted from a rational Krylov space of order m = 64 with eight cyclically repeated poles, computed using the near-optimal parallel strategy with p = 8 processors.Right: Residual norms of all m = 64 harmonic Ritz pairs computed using different (parallel) strategies to compute the rational Krylov basis.

Table 5 .
1: Numerical quantities associated with the transient electromagnetics problems from section 5.1 solved by various (parallel) rational Arnoldi variants.

Table 5 .
2: Numerical quantities for the 3D waveguide example from section 5.3.