A NEW PRECONDITIONER THAT EXPLOITS LOW-RANK APPROXIMATIONS TO FACTORIZATION ERROR

Abstract. We consider ill-conditioned linear systems Ax = b that are to be solved iteratively, and assume that a low accuracy LU factorization A ≈ L̂Û is available for use in a preconditioner. We have observed that for ill-conditioned matrices A arising in practice, A⁻¹ tends to be numerically low rank, that is, it has a small number of large singular values. Importantly, the error matrix E = Û⁻¹L̂⁻¹A − I tends to have the same property. To understand this phenomenon we give bounds for the distance from E to a low-rank matrix in terms of the corresponding distance for A⁻¹. We then design a novel preconditioner that exploits the low-rank property of the error to accelerate the convergence of iterative methods. We apply this new preconditioner in three different contexts fitting our general framework: low floating-point precision (e.g., half precision) LU factorization, incomplete LU factorization, and block low-rank LU factorization. In numerical experiments with GMRES-based iterative refinement we show that our preconditioner can achieve a significant reduction in the number of iterations required to solve a variety of real-life problems.

1. Introduction. We consider the iterative solution of a linear system Ax = b, where A ∈ R^{n×n} is nonsingular. A widely used approach is to compute a low accuracy LU factorization A = L̂Û + ∆A and use the approximate inverse factors Û⁻¹L̂⁻¹ as a preconditioner. However, the rate of convergence of the iterative method strongly depends on the matrix properties, such as the distribution of its singular values. If A is ill conditioned the preconditioned iteration may converge slowly or not at all. In such a situation, we may have no choice but to compute a more accurate LU factorization, which is likely to be too expensive for large-scale problems.
The objective of this article is to present a novel and yet general preconditioner that builds on a given approximate LU factorization and can be effective even for ill-conditioned systems. This preconditioner is based on the following key observation: ill-conditioned matrices that arise in practice often have a small number of small singular values. The inverse of such a matrix has a small number of large singular values and so is numerically low rank. This observation suggests that the error matrix E = Û⁻¹L̂⁻¹A − I is of interest, because we may expect E to retain the numerically low-rank property of A⁻¹. The main contributions of this article are to investigate theoretically and experimentally whether E is indeed numerically low rank and to describe how to exploit this property to accelerate the convergence of iterative methods by building a preconditioner based on a low-rank approximation to E.
We begin, in section 2, by describing the general framework for our analysis and providing three examples of algorithms that fit within the framework: low floating-point precision (for example, half precision) LU factorization, incomplete LU factorization, and block low-rank LU factorization. We also describe the experimental setting of the following sections. In section 3 we derive sufficient conditions for the error matrix to be numerically low rank. In section 4 we propose a new preconditioner based on a low-rank approximation to the error. In section 5 we analyze experimentally how this preconditioner can accelerate the solution of linear systems in the three contexts mentioned above. We use GMRES-based iterative refinement (GMRES-IR) as our iterative method. Concluding remarks are given in section 6.
Throughout this article, the unsubscripted norm ∥·∥ denotes the 2-norm.
2. General framework and three applications. We first describe the general framework to which the theory and algorithms developed in this article apply. We then provide three examples of widely used algorithms that fit within the framework: low floating-point precision LU factorization, incomplete LU factorization, and block low-rank LU factorization. We finally describe the experimental setting used in the following sections.
2.1. General framework description. We consider a linear system Ax = b, where A ∈ R^{n×n} is nonsingular. We assume an approximate LU factorization
(2.1) A = L̂Û + ∆A
can be computed, but we do not make any assumption on how it is computed, nor do we assume ∆A to have any special structure. We define the error matrix as E = Û⁻¹L̂⁻¹A − I. We will show theoretically and experimentally that E is likely to have low numerical rank when A is ill conditioned (that is, κ(A) = ∥A∥∥A⁻¹∥ ≫ 1). Note that A being ill conditioned is not a strict requirement of our framework, but the algorithms we design can cope with such matrices; therefore, in this article, we mostly consider ill-conditioned A.
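As a small illustration of this framework, the sketch below builds an ill-conditioned matrix with a few tiny singular values, simulates a low accuracy factorization by computing the LU decomposition in single precision (an assumption made for simplicity; the article also considers half precision and incomplete factorizations), and observes that both A⁻¹ and the error matrix E exhibit a sharp drop in their singular values:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular, svdvals

rng = np.random.default_rng(1)
n = 50

# An ill-conditioned A with 5 tiny singular values, so A^{-1} has 5 large ones.
Q1 = np.linalg.qr(rng.standard_normal((n, n)))[0]
Q2 = np.linalg.qr(rng.standard_normal((n, n)))[0]
d = np.ones(n)
d[-5:] = 1e-8
A = Q1 @ np.diag(d) @ Q2.T

# Low accuracy LU factorization, simulated by factorizing in single precision.
P, L, Uf = lu(A.astype(np.float32))

# Error matrix E = Uf^{-1} L^{-1} P^T A - I (the row permutation folded in).
E = solve_triangular(Uf, solve_triangular(L, P.T @ A, lower=True)) - np.eye(n)

# Both A^{-1} and E show a sharp drop after the first few singular values.
s_inv = svdvals(np.linalg.inv(A))
s_E = svdvals(E)
print("sigma_11/sigma_1 of A^{-1}: %.1e" % (s_inv[10] / s_inv[0]))
print("sigma_8/sigma_1 of E:      %.1e" % (s_E[7] / s_E[0]))
```

Here the dominant singular values of E come from the few large singular values of the approximate inverse acting on the factorization error ∆A, matching the observation above.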

2.2. Low floating-point precision LU factorization.
The use of half or single precision floating-point arithmetic in mixed-precision algorithms is becoming increasingly common. In particular, half-precision arithmetic is attracting growing interest now that it has started to become available in hardware [13], [14].
A natural way to exploit a low precision LU factorization is with iterative refinement. Carson and Higham [6] investigate iterative refinement in three precisions. They show that if the working precision u is double precision, the LU factors are computed in half precision, and residuals are computed in quadruple precision, then convergence of the refinement process is guaranteed if κ(A) ≤ 10⁴, and backward errors and forward errors of order u will be produced. They also show that, by using GMRES preconditioned with the LU factors to solve for the correction term, the limit on κ(A) can be relaxed to 10¹² or 10¹⁶ in order to ensure a forward error or backward error of order u, respectively. Algorithm 2.1 summarizes this GMRES-based iterative refinement (GMRES-IR), which was originally proposed in a form using two precisions [5].
While GMRES-IR requires only a small number of iterative refinement steps (outer iterations) when κ(A) satisfies the required bounds, the number of iterations in the GMRES solves (inner iterations) can be large. In this article, we will demonstrate how the new preconditioner proposed in section 4 can help to reduce the number of inner iterations and therefore widen the range of tractable problems.
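To make the structure of this scheme concrete, here is a minimal runnable sketch of LU-based iterative refinement in which, as simplifying assumptions, the inner preconditioned GMRES solve is replaced by a direct solve with the low precision factors, and half precision u_f is simulated by single precision:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 100
G = rng.standard_normal((n, n))
A = np.eye(n) + G / (4 * np.sqrt(n))             # a well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true

# Step 1: factorize at low precision u_f (float32 stands in for fp16 here).
lu_f = lu_factor(A.astype(np.float32))

# Step 2: initial solve with the low precision factors, stored at precision u.
x = lu_solve(lu_f, b.astype(np.float32)).astype(np.float64)

# Steps 3-7: refinement loop; the residual is computed at "high" precision,
# and a direct solve with the low precision factors replaces the inner
# preconditioned GMRES solve of GMRES-IR.
for _ in range(10):
    r = b - A @ x
    if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(b):
        break
    d = lu_solve(lu_f, r.astype(np.float32)).astype(np.float64)
    x = x + d

fwd_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print("relative forward error: %.1e" % fwd_err)
```

Despite the single precision factors, the refined solution reaches an accuracy at the level of the working precision, which is the behavior the three-precision analysis of [6] guarantees for sufficiently well-conditioned A.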
Algorithm 2.1 GMRES-based iterative refinement with precisions u_f, u, and u_r.
1: Compute LU factorization of A at precision u_f.
2: Solve Ax_1 = b at precision u_f using the LU factors and store x_1 at precision u.
3: while not converged do
4:   Compute r_i = b − Ax_i at precision u_r and round r_i to precision u.
5:   Solve Ad_i ≡ Û⁻¹L̂⁻¹Ad_i = Û⁻¹L̂⁻¹r_i at precision u using GMRES, with matrix-vector products with A computed at precision u_r.
6:   x_{i+1} = x_i + d_i at precision u.
7: end while

2.3. Incomplete LU factorization. The LU factors of a sparse matrix can be much less sparse than the matrix, because of fill-in, potentially making the factorization too expensive. This problem can be alleviated by using fill-reducing reorderings of the matrix, such as minimum degree, minimum fill, and nested dissection. However, the amount of fill-in can still be quite large in many practical applications.
A widely used alternative approach is to compute an incomplete LU (ILU) factorization [27, sec. 10.3], in which the sparsity of the LU factors is kept under a given threshold. For example, ILU(0) forces the factors L̂ and Û to have the same sparsity pattern as A. More generally, the sparsity of the computed factors can be controlled by a tolerance τ, where filled entries of magnitude less than τ (relative to the norm of A) are dropped. For large values of τ, ILU-based preconditioners may yield slow convergence of the iterative method. We will show how our new preconditioner, used in conjunction with an ILU preconditioner, can overcome this obstacle.
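The effect of the drop tolerance on the accuracy of the incomplete factors can be sketched with SciPy's `spilu` as a stand-in for the threshold-based ILU described above (its `drop_tol` parameter plays the role of τ; the 2-D Poisson matrix is only a convenient test problem, an assumption of this sketch):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu

# 2-D Poisson (5-point stencil) matrix, whose exact LU factors fill in.
m = 20
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsc()
n = A.shape[0]

err = {}
for drop_tol in [1e-1, 1e-3]:
    # Threshold-based incomplete LU; drop_tol plays the role of tau.
    ilu = spilu(A, drop_tol=drop_tol, fill_factor=20)
    # Quality of the resulting preconditioner: ||U^{-1} L^{-1} A - I||_F.
    E = ilu.solve(A.toarray()) - np.eye(n)
    err[drop_tol] = np.linalg.norm(E)
    print("drop_tol=%g: ||E||_F = %.2e" % (drop_tol, err[drop_tol]))
```

A larger drop tolerance gives sparser (cheaper) factors but a larger error matrix E, which is exactly the regime the new preconditioner targets.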

2.4. Block low-rank LU factorization.
In numerous scientific applications, such as the solution of partial differential equations, the matrices resulting from the discretization of the physical problem have been shown to possess a low-rank property [4]: suitably defined off-diagonal blocks of their Schur complements can be approximated by low-rank matrices. This property can be exploited to provide a substantial reduction of the complexity of matrix factorizations.
Several matrix representations, so-called low-rank formats, have been proposed in the literature. Most of them fit within our general framework, but we will focus on the block low-rank (BLR) format [1], [2], [3]. The BLR format is based on a flat 2D blocking of the matrix that is defined by conveniently clustering the associated unknowns. A BLR representation Ã of a dense matrix A has the form of a block matrix in which each off-diagonal block A_ij, of size m_i × n_j and numerical rank k^τ_ij, is approximated by a low-rank product Ã_ij = X_ij Y_ij^T. The approximation Ã_ij of each block can be computed in different ways. We have chosen to use a truncated QR factorization with column pivoting, that is, a QR factorization with column pivoting that is truncated as soon as a diagonal coefficient of the R factor falls below a prescribed threshold τ, referred to as the BLR threshold. The BLR threshold τ controls the accuracy of the factorization.
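The block compression step just described can be sketched as follows (a minimal sketch: `truncated_pivoted_qr` is a hypothetical helper, and the interpretation of τ as a threshold relative to the first diagonal entry of R is an assumption of this sketch):

```python
import numpy as np
from scipy.linalg import qr

def truncated_pivoted_qr(B, tau):
    """Compress a block B into a low-rank product X @ Y using QR with
    column pivoting, truncated as soon as a diagonal entry of R falls
    below tau * |R[0, 0]| (tau playing the role of the BLR threshold)."""
    Q, R, piv = qr(B, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    k = max(1, int(np.sum(diag > tau * diag[0])))   # truncation rank
    Y = np.zeros((k, B.shape[1]))
    Y[:, piv] = R[:k, :]                            # undo the column permutation
    return Q[:, :k], Y

# A block with rapidly decaying singular values compresses well.
rng = np.random.default_rng(0)
U0 = np.linalg.qr(rng.standard_normal((20, 20)))[0]
V0 = np.linalg.qr(rng.standard_normal((20, 20)))[0]
B = U0 @ np.diag(np.logspace(0, -10, 20)) @ V0.T
X, Y = truncated_pivoted_qr(B, tau=1e-4)
err = np.linalg.norm(B - X @ Y, 2) / np.linalg.norm(B, 2)
print("rank:", X.shape[1], "relative error: %.1e" % err)
```

The resulting rank adapts to the singular value decay of the block, and the approximation error is of the order of the threshold τ.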
In order to perform the LU factorization of a dense BLR matrix, the standard LU factorization has to be modified so that the low-rank blocks can be exploited to reduce the number of operations. Many such algorithms can be defined, depending on where the compression step (the introduction of the low-rank approximations) is performed. In this article, we will consider the CUFS variant, introduced in [2]. As described in [2], the CUFS variant achieves the lowest complexity of all BLR variants by performing the compression as early as possible.
Using the BLR factorization as an approximate LU factorization has been shown to yield an efficient preconditioner for iterative methods such as GMRES, outperforming both traditional iterative and direct methods for many applications of practical interest [22]. We will show that our low-rank approximation to the factorization error can improve the performance of the BLR preconditioner.
2.5. Experimental setting. The numerical results have been obtained in MATLAB R2017b on a laptop computer equipped with 8 GB of memory and a four-core Intel i5-6300U running at 2.40 GHz.
We will consider a large set of both random dense and real-life sparse matrices coming from a variety of applications. The randsvd matrices were generated with the MATLAB gallery function, using rng(1) to seed the random number generator. All the other matrices come from the SuiteSparse Matrix Collection (previously called the University of Florida Sparse Matrix Collection) [9]. The full list is provided in Table 2.1.
The matrices are of relatively small size. We are mainly interested in the theoretical and numerical behavior of the algorithms; their high performance implementation is not our focus here. We will, however, explain why the proposed algorithms are expected to perform well on large-scale problems and in parallel computing environments.
Throughout the rest of this article, we will specify between parentheses after each matrix name which type of approximate LU factorization was performed: half precision LU factorization (fp16), incomplete LU factorization (ILU), or block low-rank LU factorization (BLR). For ILU, we used the MATLAB ilu function with threshold partial pivoting, with setup.type = 'ilutp' and drop tolerance τ. Table 2.1 indicates which type of factorization was tested on which matrices. The fp16 factorization was tested on the full set; the ILU and BLR factorizations were tested on a subset of matrices with values of τ varying between 10⁻¹ and 10⁻⁵ for ILU and between 0.99 and 10⁻⁵ for BLR. This leads to a total of 163 tests on 40 matrices.
3. Bounds for the numerical rank of the error matrix. In this section, we investigate theoretically and experimentally the numerical rank of the error matrix E = M⁻¹A − I, where M⁻¹ is some preconditioner that approximates A⁻¹. In our LU-based preconditioner context, M = L̂Û, but the analysis presented in this section does not use the fact that the preconditioner M is based on an approximate LU factorization, so we write M to keep the analysis general.
We begin with some definitions.

Definition 3.1. For a nonzero A ∈ R^{n×n} and k < n, the relative distance of A from the set of rank-k matrices is
(3.1) ε_k(A) = min { ∥A − W∥/∥A∥ : rank(W) ≤ k }.
We call a matrix W_k of rank k an optimal rank-k approximation to A if W_k achieves the minimum in (3.1). The numerical rank of A at accuracy ε, denoted by k_ε(A), is the smallest k such that ε_k(A) ≤ ε. The matrix A is of low numerical rank if ε_k(A) ≪ 1 for some k ≪ n.
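Assuming the usual Eckart–Young characterization ε_k(A) = σ_{k+1}(A)/σ_1(A), the numerical rank at accuracy ε can be computed directly from the singular values; a minimal sketch (`numerical_rank` is a hypothetical helper name):

```python
import numpy as np

def numerical_rank(A, eps):
    """k_eps(A): number of singular values larger than eps * sigma_1, i.e.,
    the smallest k with eps_k(A) = sigma_{k+1}/sigma_1 <= eps."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

# A 30 x 30 matrix with geometrically decaying singular values.
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((30, 30)))[0]
V = np.linalg.qr(rng.standard_normal((30, 30)))[0]
A = U @ np.diag(np.logspace(0, -12, 30)) @ V.T
print(numerical_rank(A, 1e-6))   # -> 15
```

This is the quantity k_ε used later to choose the rank of the approximation E_k.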
Let XΣY^T denote the singular value decomposition (SVD) of A, and let σ_i(A) be the ith largest singular value. By the Eckart–Young–Mirsky theorem [11], [20, p. 468], [24],
(3.2) ε_k(A) = σ_{k+1}(A)/σ_1(A),
and the minimum in (3.1) is achieved by the truncated SVD of A.

As stated in the introduction, we have observed that ill-conditioned matrices often have a numerically low-rank inverse. We now assume that A⁻¹ is numerically low rank and seek conditions under which the error E = M⁻¹∆A retains this low-rank property. To that end, we have to answer two questions: is M⁻¹ numerically low rank, and, if so, is E numerically low rank? To answer the first question, we consider M as an additive perturbation of A. We need the following lemma.

Lemma 3.2. Let X ∈ R^{n×n} and X + ∆X ∈ R^{n×n} be nonsingular. Then, for all i ≤ n,
(3.3) σ_i(X + ∆X) ≤ σ_i(X)(1 + ∥X⁻¹∆X∥).

Proof. Apply inequality (3.3.26) from [19] to X and X(I + X⁻¹∆X).
We apply the lemma twice. First, recalling A = M + ∆A and taking X = M and ∆X = ∆A in (3.3) yields (for all i ≤ n)
(3.4) σ_i(A) ≤ σ_i(M)(1 + ∥M⁻¹∆A∥).
Second, taking X = A and ∆X = −∆A in (3.3) yields
(3.5) σ_i(M) ≤ σ_i(A)(1 + ∥A⁻¹∆A∥).
We can now answer the first question with the following theorem.

Theorem 3.3. Let A = M + ∆A, with A and M nonsingular, and let β_g = 1 + ∥A⁻¹∆A∥ and β_s = 1 + ∥M⁻¹∆A∥. Then, for all k < n,
(3.6) ε_k(M⁻¹) ≤ β_g β_s ε_k(A⁻¹).
Theorem 3.3 states that if ∥A⁻¹∆A∥ and ∥M⁻¹∆A∥ are not too large then M⁻¹ will retain the low-rank property of A⁻¹. The quantity β_g = 1 + ∥A⁻¹∆A∥ bounds how much the "perturbed" singular values of M can grow with respect to those of A by (3.5). Conversely, β_s = 1 + ∥M⁻¹∆A∥ bounds how much they can shrink by (3.4). The inequality (3.6) is sharp in a scenario where σ_n(A) = σ_1(A⁻¹)⁻¹ grows by a factor β_g and σ_{n−k}(A) = σ_{k+1}(A⁻¹)⁻¹ shrinks by a factor β_s. It is not clear whether this scenario is attainable. In any case the bound (3.6) is very pessimistic in practice. Numerical experiments are reported in Table 3.1 for a subset of the matrices, with the LU factors computed either in half precision or in double precision as an incomplete LU factorization or BLR factorization. For all these matrices, A⁻¹ has low numerical rank. We see that the singular values of M are usually of the same order of magnitude as those of A, so that M⁻¹ remains of low numerical rank. The bound (3.6) is usually weak by three orders of magnitude or more, but depending on the matrix, it can still imply a low numerical rank. A possible explanation for the bound (3.6) being weak is that β_g β_s is the same for all k, whereas the ratio it bounds varies with k and may only approach the bound for large k, while we are only interested in small k.

Table 3.1: Ratios and corresponding bounding quantities β_g β_s for a selection of matrices, where the LU factorization is computed in half precision or as an incomplete LU factorization or BLR factorization.
We now turn to our second question: assuming M⁻¹ is numerically low rank, when can we expect E = M⁻¹∆A also to be numerically low rank? We answer this question with the following theorem.

Theorem 3.4. Let E = M⁻¹∆A be nonzero and define µ = ∥M⁻¹∥∥∆A∥/∥E∥. Then, for all k < n,
(3.7) ε_k(E) ≤ µ ε_k(M⁻¹).

Proof. Let W_k be an optimal rank-k approximation to M⁻¹. Since W_k∆A has rank at most k,
ε_k(E) ≤ ∥E − W_k∆A∥/∥E∥ ≤ ∥M⁻¹ − W_k∥∥∆A∥/∥E∥ = ε_k(M⁻¹)∥M⁻¹∥∥∆A∥/∥E∥ = µ ε_k(M⁻¹),
as required.
Explaining the role of µ is less straightforward than for β_g and β_s. To clarify the role of µ, we compute an upper bound µ̄ = min_k µ̄_k whose mathematical meaning is easier to grasp. The following theorem states that unless ∆A is very special, µ should be of moderate size.
Theorem 3.5. The quantity µ_F = ∥M⁻¹∥∥∆A∥/∥E∥_F, which differs from µ only in that the Frobenius norm of E is taken rather than the 2-norm, is bounded by
µ_F ≤ µ̄ = min_{k≤n} µ̄_k, with µ̄_k = (∥∆A∥/∥P_k∆A∥_F)(σ_{n+1−k}(M)/σ_n(M)),
where P_k = X_kX_k^T is the orthogonal projector onto span(X_k), X_k = [x_{n+1−k}, …, x_n], and x_j is the left singular vector corresponding to the jth largest singular value of M.
Proof. The proof draws its inspiration from [7]. Let XΣY^T be an SVD of M, with σ_1 ≥ ⋯ ≥ σ_n. Then, with x_j and y_j denoting the jth columns of X and Y, respectively, M⁻¹ = YΣ⁻¹X^T = ∑_{j=1}^n σ_j⁻¹ y_j x_j^T. Thus, for all k ≤ n,
∥E∥_F = ∥M⁻¹∆A∥_F ≥ ∥∑_{j=n+1−k}^n σ_j⁻¹ y_j x_j^T ∆A∥_F ≥ σ_{n+1−k}(M)⁻¹ ∥P_k∆A∥_F.
We can therefore bound µ_F for all k ≤ n by
µ_F = ∥M⁻¹∥∥∆A∥/∥E∥_F ≤ (∥∆A∥/∥P_k∆A∥_F)(σ_{n+1−k}(M)/σ_n(M)) = µ̄_k,
as required.
Theorem 3.5 tells us that µ_F will be small when ∆A is a "typical" matrix: one having a significant component in the subspace span(X_k) for some k such that σ_{n+1−k}(M) ≈ σ_n(M). Note that the proof requires us to take the Frobenius norm of E, which is in general greater than its 2-norm (and thus µ_F ≤ µ). However, since E is expected to be numerically low rank, ∥E∥ ≈ ∥E∥_F should hold. Thus, we can also expect µ ≈ µ_F to be small. Figure 3.1 plots the values of ∥∆A∥/∥P_k∆A∥_F, σ_{n+1−k}(M)/σ_n(M), and µ̄_k as a function of k for four example matrices, where M = L̂Û is from an fp16 LU factorization. In the first two cases (matrices lund_a and steam1), ∆A = A − L̂Û is a typical matrix and thus µ is small. However, for some matrices (e.g., rajat14 and nos1 in Figure 3.1), ∆A turns out to be special and leads to a large µ. Nevertheless, for all the matrices studied, the bound (3.7) is never sharp when µ is large (see the second column of Table 3.2).
We now build a matrix for which the µ bound (3.7) is both large and sharp, to prove that it cannot be improved without further assumptions on the matrix. Generating a matrix A for which the bound is sharp is difficult, because we do not control the matrix ∆A of rounding errors. To build such a matrix, we adopt the approach of Higham [18] and use a direct search optimization procedure to maximize the ratio max_k ε_k(E)/ε_k(A⁻¹), where the L̂Û factors are computed with a half-precision LU factorization. We obtained the 5 × 5 matrix given in (3.10), for which A⁻¹ and Û⁻¹L̂⁻¹ are numerically low rank but the error E is not.
Corollary 3.6. Under the assumptions of Theorems 3.3 and 3.4, for all k < n,
(3.11) ε_k(E) ≤ µ β_g β_s ε_k(A⁻¹).

Corollary 3.6 tells us that the error E retains the low-rank property of A⁻¹ as long as µ is not too large, that is, ∥M⁻¹∥∥∆A∥ is not too much larger than ∥M⁻¹∆A∥, and β_g and β_s are not too large. Here again, inequalities (3.7) and thus (3.11) are pessimistic in practice: experiments reported in Table 3.2 show that E remains in most cases numerically low rank. These bounds should therefore not be used as a prediction of the numerical rank of E. But they do provide sufficient conditions for E to retain low numerical rank that in some cases (e.g., the last line in Table 3.2) are satisfied.
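Assuming the Theorem 3.4 bound takes the form ε_k(E) ≤ µ ε_k(M⁻¹) with µ = ∥M⁻¹∥∥∆A∥/∥E∥ (our reading of the surrounding discussion), it can be checked numerically for generic M and ∆A:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40

# A generic "preconditioner" M and perturbation dA; the error is E = M^{-1} dA.
M = rng.standard_normal((n, n))
dA = 1e-6 * rng.standard_normal((n, n))
Minv = np.linalg.inv(M)
E = Minv @ dA

def eps_k(X, k):
    # Relative distance to the nearest rank-k matrix (Eckart-Young).
    s = np.linalg.svd(X, compute_uv=False)
    return s[k] / s[0]

mu = np.linalg.norm(Minv, 2) * np.linalg.norm(dA, 2) / np.linalg.norm(E, 2)
ok = all(eps_k(E, k) <= mu * eps_k(Minv, k) * (1 + 1e-10) for k in range(1, n))
print("mu =", mu, "; bound holds for all k:", ok)
```

Note that µ ≥ 1 always, since ∥M⁻¹∆A∥ ≤ ∥M⁻¹∥∥∆A∥; the bound is informative precisely when µ stays moderate.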

4. A novel preconditioner based on the low-rank error. Now we consider the solution of the linear system Ax = b by means of an iterative method. A classical approach to accelerate the convergence of the iterative method is to use the preconditioner based on the computed LU factors,
(4.1) Π_LU = Û⁻¹L̂⁻¹,
to solve the preconditioned system Π_LU Ax = Π_LU b. However, when the LU factors have been computed at low accuracy, and when the matrix A is ill conditioned, convergence may still be slow. To overcome this obstacle, we propose a novel preconditioner
(4.2) Π_{E_k} = (I + E_k)⁻¹Û⁻¹L̂⁻¹,
which is based on a rank-k approximation E_k to the error E = Û⁻¹L̂⁻¹A − I. We expect the factor (I + E_k)⁻¹ to improve the quality of the preconditioner. In the extreme case where E_k = E, Π_E is exactly equal to A⁻¹ and thus yields a perfect preconditioner, but this is obviously too expensive. However, if k ≪ n then the solve with I + E_k can be cheaply done with the Sherman–Morrison–Woodbury formula [16]. Since we have shown that E is often numerically low rank, we may expect Π_{E_k}, with some suitable small k, to be almost as good a preconditioner as Π_E. We note that the idea behind the Π_{E_k} preconditioner shares some similarities with deflation-type preconditioners [28], though there are fundamental differences between the two types.

Table 3.2: The bounds from Theorem 3.4 and Corollary 3.6 compared with the quantities they are bounding, using the same matrices and factorizations as in Table 3.1.
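The cheap solve with I + E_k can be sketched as follows, assuming E_k is held in truncated SVD form U_k diag(s_k) V_k^T (`solve_I_plus_Ek` is a hypothetical helper name):

```python
import numpy as np

def solve_I_plus_Ek(Uk, sk, Vkt, x):
    """Apply (I + Uk @ diag(sk) @ Vkt)^{-1} to a vector x with the
    Sherman-Morrison-Woodbury formula:
    (I + U C V)^{-1} = I - U (C^{-1} + V U)^{-1} V,  C = diag(sk).
    Cost is O(nk) per solve (plus an O(k^3) factorization), not O(n^3)."""
    cap = np.diag(1.0 / sk) + Vkt @ Uk           # k x k "capacitance" matrix
    return x - Uk @ np.linalg.solve(cap, Vkt @ x)

# Check against a dense solve for a random rank-k term E_k = Uk diag(sk) Vkt.
rng = np.random.default_rng(0)
n, k = 60, 5
Uk = np.linalg.qr(rng.standard_normal((n, k)))[0]
Vkt = np.linalg.qr(rng.standard_normal((n, k)))[0].T
sk = np.logspace(-1, -3, k)
x = rng.standard_normal(n)
y = solve_I_plus_Ek(Uk, sk, Vkt, x)
y_ref = np.linalg.solve(np.eye(n) + Uk @ np.diag(sk) @ Vkt, x)
print("max deviation from dense solve: %.1e" % np.max(np.abs(y - y_ref)))
```

In practice the k × k capacitance matrix would be factorized once at setup time and reused at every application of the preconditioner.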
4.1. Computing E_k, a low-rank approximation to E. It would be too expensive to compute E_k explicitly, so we develop a matrix-free approach to its use, in which we only need to perform matrix-vector products with E_k.
Although other methods could be considered, we use the randomized sampling algorithm [15], [21], which has been shown to be efficient for computing low-rank approximations to dense matrices [23]. We build E_k as a truncated SVD of E. We consider two versions of the randomized SVD algorithm, described in Algorithms 4.1 and 4.2.
Both algorithms begin by sampling the columns of E with a random matrix Ω of size n × (k + p), forming the sample matrix S = EΩ, where p is a small integer parameter that provides oversampling. A small amount of oversampling is usually enough to ensure a good accuracy of the low-rank approximation [15]. We then build an orthonormal basis V of S; note that V captures the range of E: E ≈ VV^TE. In Algorithm 4.1, based on this observation, we compute a rank-k approximation of V^TE by means of a deterministic truncated SVD X_kΣ_kY_k^T, which then yields the truncated SVD of the original matrix E as E ≈ (VX_k)Σ_kY_k^T. This however requires us to form the product V^TE which, as analyzed in the next section, can be expensive. To overcome this issue, Algorithm 4.2 builds instead an interpolative decomposition (ID) [8] of V,
V = P (I  W)^T V_{(L,:)},
where I denotes the identity matrix of order ℓ = k + p, P is a permutation matrix, and V_{(L,:)} is a subset of ℓ rows of V. Such a decomposition can be computed by means of a pivoted QR factorization V^TP = QR and by defining W = R_{1:ℓ,1:ℓ}⁻¹R_{:,ℓ+1:n} and V_{(L,:)} = P_{:,1:ℓ}^TV [15]. We then have, defining Ẽ = VV^TE,
(4.3) Ẽ = P(I  W)^T Ẽ_{(L,:)} ≈ P(I  W)^T E_{(L,:)}.
Therefore, we can build the truncated SVD of E based on that of (I  W)^TE_{(L,:)}. The second approximation in (4.3) makes Algorithm 4.2 less accurate than Algorithm 4.1 by a factor up to 1 + √(1 + 4k(n − k)) [15, Lem. 5.1]. To maintain a unified presentation, we have formulated Algorithm 4.2 as working on the orthonormal basis V. However, as explained in [15], for this second algorithm it is not necessary to orthonormalize the sample, i.e., we can work on S rather than V. This is what we will do in practice.
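The pattern of Algorithm 4.1 (Gaussian sampling, orthonormalization, then a deterministic SVD of the small matrix V^TE) can be sketched as follows; this is a simplified sketch without the SRFT or row-extraction variants, and `randomized_svd` is a hypothetical helper name:

```python
import numpy as np

def randomized_svd(E, k, p=10, rng=None):
    """Truncated rank-k SVD of E following the pattern of Algorithm 4.1:
    sample S = E @ Omega, orthonormal basis V with E ~ V V^T E,
    then a deterministic SVD of the small matrix V^T E."""
    rng = rng or np.random.default_rng(0)
    n = E.shape[1]
    Omega = rng.standard_normal((n, k + p))      # Gaussian sampling matrix
    V, _ = np.linalg.qr(E @ Omega)               # orthonormal basis of the sample
    Xk, s, Ykt = np.linalg.svd(V.T @ E, full_matrices=False)
    return V @ Xk[:, :k], s[:k], Ykt[:k, :]      # E ~ (V Xk) Sigma_k Yk^T

# Accuracy check: the error should be close to sigma_{k+1}, the best possible.
rng = np.random.default_rng(1)
Q1 = np.linalg.qr(rng.standard_normal((80, 80)))[0]
Q2 = np.linalg.qr(rng.standard_normal((80, 80)))[0]
sigma = np.logspace(0, -8, 80)
E = Q1 @ np.diag(sigma) @ Q2.T
Uk, sk, Vkt = randomized_svd(E, k=10)
err = np.linalg.norm(E - (Uk * sk) @ Vkt, 2)
print("error: %.2e  (sigma_11 = %.2e)" % (err, sigma[10]))
```

In the preconditioner setting, the products E @ Omega and V.T @ E would be performed matrix-free using products with A and triangular solves with the LU factors, rather than with an explicit E.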

4.2. The four variants of the Π_{E_k} preconditioner. In the rest of this article, we will analyze four distinct variants of Π_{E_k}, which differ in how E_k is computed:
• Π_{E_k}^(1): E_k is computed by Algorithm 4.1 with a Gaussian random sampling matrix Ω;
• Π_{E_k}^(2): E_k is computed by Algorithm 4.1 with a subsampled random Fourier transform (SRFT) sampling matrix Ω built from the product FR, where F ∈ C^{n×n} is the discrete Fourier transform (DFT) matrix and R ∈ R^{n×ℓ} is a matrix that randomly selects ℓ distinct columns of F, i.e., it consists of ℓ random distinct columns of the n × n identity matrix;
• Π_{E_k}^(3): E_k is computed by Algorithm 4.2 with a Gaussian Ω;
• Π_{E_k}^(4): E_k is computed by Algorithm 4.2 with an SRFT Ω.
The cost of the Π_{E_k} preconditioner strongly depends on which variant is used. Indeed, Π_{E_k}^(1) requires us to perform the products EΩ and V^TE. Π_{E_k}^(2) reduces the cost of the EΩ product by using an SRFT matrix Ω instead. On the other hand, Π_{E_k}^(3) avoids forming V^TE explicitly, as explained in the previous section. Finally, Π_{E_k}^(4) combines both cost reductions.
The four variants of the Π_{E_k} preconditioner are thus decreasingly expensive. However, they are also decreasingly accurate. Indeed, the theoretical properties of SRFT sampling are less well understood (e.g., how to choose the amount of oversampling). Perhaps of more concern, the row extraction SVD (Algorithm 4.2) leads to an error that is up to a factor 1 + √(1 + 4k(n − k)) larger than that of Algorithm 4.1 [15, Lem. 5.1]. Since we are only building a preconditioner, this might not be too problematic if it does not significantly increase the number of iterations. In the following section, we will therefore compare all four variants of Π_{E_k}.
Regardless of the variant of Π_{E_k} used, the new preconditioner might in some cases perform more flops than the original Π_LU preconditioner if k is large or if the number of iterations is only reduced by a small amount. Nevertheless, we still expect it to perform better, for the following three reasons.
• The solve phase achieves in general a low execution rate because it uses BLAS 2 kernels (in the case of a single right-hand side). On the contrary, for the setup phase, the LU factorization is a BLAS 3 kernel, while computing E_k may also be achieved with BLAS 3 kernels (or "BLAS 2.5" if k is very small).
• The solve phase is performed at working precision, while the setup phase may be performed at lower precision. This includes the LU factorization but also the computation of E_k. The influence of u_{E_k}, the precision at which E_k is computed, will be analyzed in the next section.
• Several applications require the solution for multiple right-hand sides. In this case, the setup overhead cost of the Π_{E_k} preconditioner is amortized by the need to perform more solves.

5. Numerical experiments with GMRES-IR.
In this section, we analyze how our new Π_{E_k} preconditioner can improve the convergence of GMRES-IR (Algorithm 2.1) [5], which uses iterative refinement with the solves for the correction carried out by preconditioned GMRES. We use three precisions, as proposed in [6].
• The LU factorization of A is computed at precision u_f, which is half precision for a full factorization or double precision when ILU or BLR is used.

• The working precision is double precision for all experiments.
• The residual is computed in quadruple precision for all experiments. Computing the residuals in extended precision improves the forward error for ill-conditioned problems, though it has no effect on the convergence of iterative refinement [6].

We set the maximum number of iterative refinement steps to 10 and the maximum number of GMRES iterations per step of iterative refinement to 100 (hence a maximum of 1000 total GMRES iterations is permitted). The GMRES stopping criterion is set to a relative tolerance of 10⁻⁸.
For each matrix, the rank k controls the accuracy at which E_k approximates E. In our experiments, rather than setting k to a fixed value, we choose a given target accuracy ε and compute k as the numerical rank k_ε of E at accuracy ε (see Definition 3.1).

Table 5.1: Five matrices representative of typical scenarios. We used a half precision LU factorization for matrices randsvd(1e7,3) and lund_a, ILU factorization with τ = 10⁻¹ for matrices west0167 and rajat14, and BLR factorization with τ = 10⁻² for matrix utm300. The seventh and eighth columns of the table show the number of GMRES iterations, with the number of iterative refinement steps in parentheses.
To assess the effectiveness of each preconditioner we will measure both the number of iterations performed and the associated flops. Since our new Π_{E_k} preconditioner can and often does perform more flops than the traditional Π_LU preconditioner, we also estimate their run time. Since a high-performance implementation and analysis is not our focus, we use a simple model, assuming BLAS 3 computations are 10 times faster than their BLAS 2 counterparts. We also assume that computations in single and half precision are twice and four times faster than in double precision, respectively.

Note that we only seek to assess the relative performance of Π_{E_k} compared with the traditional LU-based preconditioner Π_LU. Comparing the Π_{E_k} preconditioner with other approaches, such as different preconditioners or direct methods, is outside the scope of this article.

5.1. Analysis of typical scenarios. In this section, we focus on the first variant Π_{E_k}^(1) and refer to it simply as Π_{E_k}. In Table 5.1, we consider five matrices that are representative of five typical scenarios. We report the condition numbers of A, Π_LU A, and Π_{E_k}A. For both preconditioners, we compare the total number of GMRES iterations (the number in parentheses corresponds to the number of iterative refinement steps), and the associated flops and estimated time. Note that for matrices lund_a, west0167, and utm300, κ(Π_LU A) is actually higher than κ(A); nevertheless, experiments (not shown here) show that even on these matrices, unpreconditioned GMRES converges much more slowly than its Π_LU- and Π_{E_k}-preconditioned counterparts. We also indicate the low-rank threshold ε that we used to compute E_k, and its corresponding rank k. In Figure 5.1, we plot the singular value distribution of A⁻¹, Û⁻¹L̂⁻¹, and E for each of these five matrices.
We recall that our preconditioner targets ill-conditioned matrices that have a small number of small singular values and therefore a numerically low-rank inverse. Clearly, A being ill conditioned is only a necessary condition for A⁻¹ to be numerically low rank, not a sufficient one. However, we have observed that all the ill-conditioned matrices that we have tested from the SuiteSparse Matrix Collection fulfill that requirement. The only matrices in our set that do not are the mode 1 and 3 randsvd matrices, which are artificially created problems. They constitute what we will call Scenario 1. In this scenario, A⁻¹ is not numerically low rank and therefore we can expect E to have high numerical rank and our preconditioner not to be effective. Mode 3 randsvd is analyzed in Figure 5.1a. Interestingly, the SVD of the inverse factors shows a slightly faster singular value decay and therefore, even though A⁻¹ has high numerical rank, some improvement is observed with Π_{E_k} over Π_LU. However, the rank k is about n/2, leading to a significant flop overhead and thus only a modest time gain.
We emphasize that we selected the matrices based on their condition number only; we did not specifically select matrices for which A⁻¹ is numerically low rank. While one can surely find an ill-conditioned matrix in the SuiteSparse collection that does not fulfill this requirement, we believe that the fact that such matrices are hard to come upon shows that our preconditioner's scope is very general.
While A⁻¹ is thus numerically low rank for nearly all matrices in the test set, the performance of our Π_{E_k} preconditioner depends heavily on the extent to which the error E is numerically low rank. In the following Scenarios 2, 3, and 4, E is numerically low rank and thus Π_{E_k} performs well. We distinguish three cases depending on the reason why E has low numerical rank. In Scenario 2, the SVD of E closely follows that of A⁻¹ (Figure 5.1b); in other cases, the SVD of E shows an even faster decay than that of A⁻¹, either because Û⁻¹L̂⁻¹ has lower numerical rank than A⁻¹ (Scenario 3, Figure 5.1c), or because E has lower numerical rank than Û⁻¹L̂⁻¹ (Scenario 4, Figure 5.1d). Scenario 3 generally happens when the approximate L̂Û factors are nearly singular, thus leading to a very ill-conditioned Π_LU A matrix. By using Π_{E_k}, we can reduce the ill conditioning of Π_LU A. We conjecture that Scenario 4 is due to a ∆A that possesses some kind of structure, and we have in fact observed it to be especially frequent for the test cases with BLR factorization.
Finally, Scenario 5 contains the unfortunate cases for which E loses the low-rank property of A⁻¹ (Figure 5.1e). In our set of matrices, this is always due to Û⁻¹L̂⁻¹ not being numerically low rank (i.e., the bound from Theorem 3.3 is sharp and β_g or β_s is large). We recall that in section 3 we built a matrix for which E loses the low-rank property due to a special ∆A (see (3.10)), but this did not occur for any of the matrices in our set.
The numerical rank k_ε of E at accuracy ε can be quite large for matrices falling into Scenarios 1 and 5. While the preconditioner Π_{E_k} is not designed with these matrices in mind, it is nevertheless desirable that, when it is used on such matrices, the overhead cost due to the use of Π_{E_k} remain limited. To do so, we limit k to be no larger than a given k_max. For example, for rajat14, using k_max = n/10, the flop overhead cost of Π_{E_k} is reduced from 280% to 170% of that of Π_LU, and its estimated time is reduced from 95% to 92% of that of Π_LU.
This diversity of scenarios shows that the optimal choice of the preconditioner parameters will be heavily matrix dependent. However, we would like to design a "black box" version of the preconditioner that has default settings for which it performs well on a wide range of problems. This is the aim of the next section.

5.2. Finding a black box setting.
Three main parameters influence the cost and accuracy of the preconditioner Π_{E_k}: the precision u_{E_k} at which E_k is computed, the low-rank threshold ε, and the amount of oversampling p. In this section, we analyze how to set these parameters to produce good performance on a wide range of problems. In order to do that we seek the best value for each parameter separately, using performance profiles [10], [17, sec. 26.4]. Each performance profile corresponds to a preconditioner, a selection of three or four parameter values, and a chosen performance measure for which smaller is better. Each curve on a performance profile shows, for a range of values of α ≥ 1, the proportion of problems p ∈ [0, 1] for which the performance measure for a particular parameter value was within a factor α of the smallest performance measure over all the parameter values. The performance measures are the number of iterations, the number of flops, and the time predicted by our model.
Note that if the iteration fails to converge on some problems for a given parameter value, then the corresponding curve in the performance profile never reaches p = 1; thus the value of p at which a curve levels off is a measure of robustness.
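To make this construction concrete, here is a minimal sketch (not the authors' code) of how such a profile can be computed from a table of performance measures; the toy data is invented for illustration, with a failure encoded as an infinite measure so that its curve never reaches p = 1.

```python
import numpy as np

def performance_profile(T, alphas):
    """Performance profile in the style of Dolan and More.  T[i, j] holds
    the performance measure (iterations, flops, or time; smaller is better)
    of parameter setting j on problem i, with np.inf marking a failure to
    converge.  Returns an array P with P[a, j] = proportion of problems on
    which setting j is within a factor alphas[a] of the best setting."""
    best = T.min(axis=1, keepdims=True)   # best measure on each problem
    ratios = T / best                     # performance ratios, all >= 1
    return np.array([(ratios <= a).mean(axis=0) for a in alphas])

# toy example: 3 problems, 2 parameter settings; setting 1 (column 1)
# fails to converge on one problem, so its curve levels off below p = 1
T = np.array([[1.0, 2.0],
              [3.0, np.inf],
              [2.0, 2.0]])
P = performance_profile(T, alphas=[1.0, 2.0, 100.0])
```

Because the failed run has an infinite ratio, the second setting's curve plateaus at 2/3 no matter how large α grows, which is exactly the robustness signal described above.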
Naturally, the parameters are interdependent: for example, a high oversampling parameter increases the weight of the sampling operation, which is performed at precision u_{E_k}, thus increasing the importance of the latter parameter. While the approach of studying each parameter independently is therefore possibly not optimal, it allows us to find a suitable setting without getting lost in the combinatorics of the parameters.
We first analyze the influence of the oversampling parameter p. From Figure 5.2, it is clear that a larger oversampling leads to a greater reduction of the number of iterations, but also to a greater flop overhead due to the larger subspace size ℓ = k + p. We must therefore find a compromise aiming at minimizing the time estimated by our model. Interestingly, the time performance profiles suggest that the value of p should be set differently depending on which Π_{E_k} variant is considered. Indeed, Π_{E_k}^{(1)} and Π_{E_k}^{(2)} require us to form the product V^T E, and their cost is thus very sensitive to the choice of p; setting it to a small value works best. Conversely, Π_{E_k}^{(3)} and Π_{E_k}^{(4)} avoid forming V^T E, and we can thus afford to take much higher values of p, since building the preconditioner is much cheaper. This is visible from the time plots (right column) in Figure 5.2, where the curves corresponding to small p tend to be above those corresponding to large p for Π_{E_k}^{(1)} and Π_{E_k}^{(2)} (top row), while the opposite is true for Π_{E_k}^{(3)} and Π_{E_k}^{(4)}; intermediate values of p (not shown in Figure 5.2) lie in the middle ground. In the following, we will therefore use p = 0 for Π_{E_k}^{(1)} and Π_{E_k}^{(2)} and a larger value of p for Π_{E_k}^{(3)} and Π_{E_k}^{(4)}.
We now turn to the low-rank threshold parameter ε, whose effect is plotted in Figure 5.3. The trend is again clear: a smaller value of ε makes the preconditioner more robust but more costly. The role of ε also depends strongly on which variant of Π_{E_k} is considered, for the same reasons as for the oversampling parameter: for small values of ε, Π_{E_k}^{(1)} and Π_{E_k}^{(2)} are significantly slower, while Π_{E_k}^{(3)} and Π_{E_k}^{(4)} remain affordable. In the following, we will use ε = 10^{-3} for Π_{E_k}^{(1)} and Π_{E_k}^{(2)} and ε = 10^{-5} for Π_{E_k}^{(3)} and Π_{E_k}^{(4)}.
Finally, in Figure 5.4 we study the role of the precision parameter u_{E_k} on the subset of tests performed with half precision LU factorization (fp16). Computing E_k in half precision leads to a preconditioner that is less accurate than when E_k is computed in higher precision: in particular, in about 8% of the cases the preconditioner fails when E_k is built in half precision, whereas it succeeds with a higher precision u_{E_k}. On the other hand, computing E_k in single or double precision makes little difference on this set of problems, and since single precision is twice as fast as double precision, the time performance profile shows that setting u_{E_k} to single is the best strategy overall, for all four variants of Π_{E_k}.
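The role of the oversampling parameter can be illustrated with a small sketch of a randomized, matrix-free low-rank approximation of E = U^{-1} L^{-1} A − I, in the spirit of the randomized SVD of Halko, Martinsson, and Tropp. This is not the paper's Algorithm 4.2 verbatim: it assumes dense triangular factors, a Gaussian test matrix Ω, and a target rank k with oversampling p, with E accessed only through triangular solves and products with A.

```python
import numpy as np

def sample_error_lowrank(A, L, U, k, p, rng):
    """Randomized rank-k truncated SVD of E = U^{-1} L^{-1} A - I,
    never forming E explicitly.  k is the target rank and p the
    oversampling parameter, so the sampled subspace has size k + p."""
    n = A.shape[0]
    Omega = rng.standard_normal((n, k + p))     # Gaussian test matrix
    # matrix-free sample: Y = E Omega = U^{-1} L^{-1} (A Omega) - Omega
    Y = np.linalg.solve(U, np.linalg.solve(L, A @ Omega)) - Omega
    Q, _ = np.linalg.qr(Y)                      # orthonormal range basis
    # project: B = Q^T E = (L^{-T} U^{-T} Q)^T A - Q^T
    Z = np.linalg.solve(L.T, np.linalg.solve(U.T, Q))
    B = Z.T @ A - Q.T
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]         # factors of E_k

# toy check: if A = L U + dA with dA of rank 3, then E = U^{-1} L^{-1} dA
# has rank exactly 3 and the sketch recovers it essentially exactly
rng = np.random.default_rng(0)
n, r = 60, 3
L = np.tril(0.1 * rng.standard_normal((n, n)), -1) + np.eye(n)
U = np.triu(0.1 * rng.standard_normal((n, n)), 1) + np.eye(n)
dA = 0.01 * rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
A = L @ U + dA
Uk, sk, Vtk = sample_error_lowrank(A, L, U, k=3, p=5, rng=rng)
```

A larger p enlarges Y and hence the cost of the QR and projection steps, which is the flop/iteration trade-off discussed above.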
5.3. Results on the full set of problems with the black box setting. In this section we report numerical experiments on the full set of problems using the black box settings chosen in the previous section, which are summarized in Table 5.2.
We emphasize that the results were obtained without tuning the preconditioner parameters on a case-by-case basis, thereby demonstrating the generality and versatility of the preconditioner. We first compare the four variants of the Π_{E_k} preconditioner. Figure 5.5 shows the time performance profile of each variant. Note that we do not provide the iteration and flop profiles, since comparing the four variants in terms of iterations or flops is not meaningful, because they are used with different values of ε and/or p. We must compare their time performance to assess which variant finds the best cost/accuracy compromise. The preconditioner Π_{E_k}^{(1)} ranks first on the largest number of problems (about 50% of them); it is, however, less robust than the other variants, failing to converge in three cases where the other variants converged. We therefore choose to reject it. Π_{E_k}^{(3)} achieves a good performance overall, very close to that of Π_{E_k}^{(4)}. Interestingly, it is also the most robust variant; recalling that it is less accurate than Π_{E_k}^{(1)} by a factor up to 1 + √(1 + 4k(n − k)), this means that it compensates for its lesser accuracy by being able to afford a much smaller threshold (ε = 10^{-5} instead of 10^{-3}). We conclude that Π_{E_k}^{(3)} is the choice that leads to the best performance overall on this set of problems.
We now compare Π_{E_k} with the classical Π_LU preconditioner. In Figure 5.6, we plot the relative performance of Π_{E_k} with respect to Π_LU. Each bar corresponds to a different test case, its color indicating which type of approximate factorization is considered (fp16, ILU, or BLR). The colors are evenly distributed, which means that the numerical behavior of Π_{E_k} is comparable for all three types of factorization. The preconditioner Π_{E_k} leads to a lower number of iterations than Π_LU in about 80% of the test cases. Moreover, this reduction in the number of iterations is often significant: Π_LU performs more than 50% more iterations on 30% of the test cases. Interestingly, in about 5% of the cases Π_LU fails to converge whereas Π_{E_k} successfully solves the problem (indicated by the white gap on the left side of the plots). Therefore, even though Π_{E_k} incurs a flop overhead compared with Π_LU in about 90% of the cases, that overhead is often limited (less than 50% in half the cases), and using our simple performance model, the estimated time results suggest that significant gains can be expected.

5.4. Results on larger matrices.
In this section we complement the previous results with some experiments on larger matrices. These experiments include only the ILU and BLR preconditioners, because the fp16 arithmetic we are using is too slow on matrices of size larger than 1000. An important question is whether these larger matrices still possess an inverse that is numerically low rank, that is, whether the numerical rank k_ε of A^{-1} remains small or instead increases with n.
Table 5.3 shows results for some matrices of size around 5000 from the SuiteSparse collection. We see that the Π_{E_k} preconditioner significantly reduces the number of iterations for both the ILU and BLR preconditioners. Importantly, this improvement is achieved with very small ranks: k_ε remains smaller than 100 for all matrices, compared with matrix sizes n of around 5000.
In Figure 5.7 we compare k_ε for different matrix sizes for two families of matrices (also from the SuiteSparse Matrix Collection) that include the matrices cz308 and utm300 from Table 2.1. The plots show that the numerical rank remains almost constant with respect to n when the required ε is not too small. While there may of course be other matrices for which this property does not hold, the figure suggests that, at least for some problem classes, our preconditioner should perform as well, or even better, on large-scale problems.
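The quantity k_ε can be made concrete with a short sketch. Here we take one common convention, counting the singular values larger than ε times the largest one (the paper may normalize differently); the synthetic matrix below mimics the situation described in the introduction, where an ill-conditioned A has a few small singular values, so that A^{-1} has a few large ones and is numerically low rank.

```python
import numpy as np

def numerical_rank(M, eps):
    """Numerical rank k_eps of M at accuracy eps: the number of singular
    values larger than eps * sigma_max(M) (one common convention)."""
    s = np.linalg.svd(M, compute_uv=False)   # singular values, descending
    return int(np.sum(s > eps * s[0]))

# synthetic ill-conditioned matrix: 5 tiny singular values out of 100
rng = np.random.default_rng(1)
n = 100
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.ones(n)
sigma[-5:] = 1e-8                  # a small number of small singular values
A = Q1 @ (sigma[:, None] * Q2.T)   # A = Q1 diag(sigma) Q2^T
```

A itself is numerically full rank apart from its 5 tiny singular values, while A^{-1} has only 5 large singular values and is therefore numerically rank 5 at this accuracy.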

6. Conclusion.
We have presented a new and very general preconditioner for iterative methods for solving ill-conditioned linear systems Ax = b.
The key idea is to exploit the low numerical rank structure that is typically present in the error arising in approximate matrix factorizations. We have defined a general framework in which a low accuracy LU factorization A = L U + ∆A is computed. This framework allows for many different types of approximate LU factorizations, among which in our experiments we have used half precision LU, incomplete LU, and block low-rank LU.
We have used theoretical results from singular value perturbation analysis to bound the distance from E = U^{-1} L^{-1} A − I = U^{-1} L^{-1} ∆A to a numerically low-rank matrix by a multiple of the distance from A^{-1} to a numerically low-rank matrix. These bounds give sufficient conditions for the error matrix to be numerically low rank. The bounds are generally pessimistic, and in practice we have found E to be almost always numerically low rank when A is ill conditioned.
Our novel preconditioner improves on the traditional preconditioner U^{-1} L^{-1} based on the approximate LU factors by premultiplying it by a correction term (I + E_k)^{-1}, exploiting the numerically low-rank structure of E. Because building E explicitly is too expensive, our algorithm uses a matrix-free approach based on randomized sampling to compute a rank-k matrix E_k as a truncated SVD of E. We have compared four variants of the algorithm theoretically, by performing a computational cost analysis, and experimentally.
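Note that the correction term (I + E_k)^{-1} never needs to be formed or inverted densely. As a sketch (assuming E_k is available through its truncated SVD factors, as produced by the randomized sampling step), it can be applied to a vector with the Woodbury identity (I + U S V^T)^{-1} = I − U (S^{-1} + V^T U)^{-1} V^T at O(nk) cost per application:

```python
import numpy as np

def apply_correction(Uk, sk, Vtk, y):
    """Apply (I + E_k)^{-1} to a vector y via the Woodbury identity,
    where E_k = Uk @ diag(sk) @ Vtk is a rank-k truncated SVD of E.
    Only k x k dense algebra and O(nk) products are needed."""
    core = np.diag(1.0 / sk) + Vtk @ Uk          # k x k matrix S^{-1} + V^T U
    return y - Uk @ np.linalg.solve(core, Vtk @ y)

# toy check against a dense solve with I + E_k
rng = np.random.default_rng(2)
n, k = 30, 3
Uk, _ = np.linalg.qr(rng.standard_normal((n, k)))
Vk, _ = np.linalg.qr(rng.standard_normal((n, k)))
sk = np.array([0.5, 0.3, 0.1])                   # singular values of E_k
y = rng.standard_normal(n)
x = apply_correction(Uk, sk, Vk.T, y)
```

In a production setting the k × k core matrix would be factorized once when the preconditioner is built and reused at every iteration.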
After experimenting with the internal parameters of the preconditioner in order to better understand its practical behavior, we chose a set of parameters that we applied in a black box manner to a large set of real-life problems coming from a variety of applications. Our numerical results show the capacity of the new preconditioner to accelerate the solution of a wide range of ill-conditioned problems, thereby demonstrating its generality and versatility.
We conclude by mentioning some possible directions for future work. Our preconditioner could be coupled with iterative methods other than GMRES-IR, such as GMRES. The LU framework that we have described could also be naturally adapted to symmetric problems. We believe our work could even be extended to preconditioners that are not based on matrix factorizations, such as Jacobi, Gauss–Seidel, approximate inverse, or multigrid approaches.
Most importantly, while out of the scope of this article, a high-performance implementation of the proposed preconditioner will be of interest, both to assess the performance gains that can be achieved and to study its numerical behavior on large-scale problems.

Fig. 3.1: Quantities µ (defined in (3.8)) and µ_k (defined in (3.9)), and the factors in upper bound (3.9), for M = L U and ∆A = A − L U. µ is small if ∆A is typical (top two matrices) but can be large for special ∆A (bottom two matrices). Matrices include lund_a (fp16).

Fig. 3.2: Singular value distributions of A^{-1}, U^{-1} L^{-1}, and E for the example matrix A defined in (3.10), for which the bound of Theorem 3.4 is almost sharp. A^{-1} and U^{-1} L^{-1} are numerically low rank but the error E is not.


Fig. 5.2: Performance profiles of the Π_{E_k} preconditioner for different oversampling parameters p. The other parameters were set to ε = 10^{-5} and u_{E_k} = single.


Fig. 5.3: Performance profiles of the Π_{E_k} preconditioner for different low-rank threshold parameters ε. The other parameters were fixed to p = 10 and u_{E_k} = single.

Fig. 5.6: Performance comparison between the Π_LU preconditioner and the best variant of the Π_{E_k} preconditioner. Each bar corresponds to one of the 163 test cases, its color indicating which type of approximate factorization is considered (fp16, ILU, or BLR). The y-axis corresponds to the normalized performance of Π_{E_k} with respect to that of Π_LU: thus, Π_{E_k} performs better than Π_LU when the bar is under the black line. These results were obtained with the black box settings described in Table 5.2. The white gap on the left side of the plots corresponds to the test cases for which Π_LU did not converge whereas Π_{E_k} did.

Fig. 5.7: Numerical rank k_ε of A^{-1} at accuracy ε for different matrix sizes. The matrices are from the SuiteSparse Matrix Collection and the digits in the name denote the matrix size.

Table 2.1: The test matrices, indicating for each one which of ILU and BLR was tested and the corresponding values of the drop tolerance or the BLR threshold τ.
Π_{E_k}: compute E_k with Algorithm 4.2 and with Ω an SRFT matrix. An SRFT matrix is a subsampled random Fourier transform matrix, defined as Ω = √(n/ℓ) D F R, where D is an n × n diagonal matrix of independent random unit entries, F is the n × n unitary discrete Fourier transform matrix, and R is an n × ℓ matrix that samples ℓ columns of the identity uniformly at random.
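As an illustration, here is one standard construction of an SRFT test matrix, following Halko, Martinsson, and Tropp; we use random signs for D (random unit complex entries are also common). This dense version is for exposition only: in practice Ω is applied with an FFT rather than formed explicitly.

```python
import numpy as np

def srft(n, l, rng):
    """n x l SRFT matrix Omega = sqrt(n/l) * D F R: D applies random sign
    flips to the rows, F is the unitary DFT matrix, and R keeps l columns
    chosen uniformly at random.  Omega is complex valued."""
    d = rng.choice([-1.0, 1.0], size=n)            # D: random sign flips
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)         # F: unitary DFT matrix
    cols = rng.choice(n, size=l, replace=False)    # R: column subsample
    return np.sqrt(n / l) * (d[:, None] * F[:, cols])

Omega = srft(8, 4, np.random.default_rng(3))
```

A useful sanity check on the construction is that Ω^* Ω = (n/ℓ) I, since D is unitary, F is unitary, and R has orthonormal columns.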
Fig. 5.5: Time performance profile of the four variants of the Π_{E_k} preconditioner, obtained with the black box settings described in Table 5.2.

Table 5.3: Results on larger matrices. The last two columns of the table show the number of GMRES iterations, with the number of iterative refinement steps in parentheses.