Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions

Abstract. We propose a general algorithm for solving an n × n nonsingular linear system Ax = b based on iterative refinement with three precisions. The working precision is combined with possibly different precisions for solving for the correction term and for computing the residuals. Via rounding error analysis of the algorithm we derive sufficient conditions for convergence and bounds for the attainable normwise forward error and normwise and componentwise backward errors. Our results generalize and unify many existing rounding error analyses for iterative refinement. With single precision as the working precision, we show that by using LU factorization in IEEE half precision as the solver and calculating the residuals in double precision it is possible to solve Ax = b to full single precision accuracy for condition numbers κ_2(A) ≤ 10^4, with the O(n^3) part of the computations carried out entirely in half precision. We show further that by solving the correction equations by GMRES preconditioned by the LU factors the restriction on the condition number can be weakened to κ_2(A) ≤ 10^8, although in general there is no guarantee that GMRES will converge quickly. Taking for comparison a standard Ax = b solver that uses LU factorization in single precision, these results suggest that on architectures for which half precision is efficiently implemented it will be possible to solve certain linear systems Ax = b up to twice as fast and to greater accuracy. Analogous results are given with double precision as the working precision.


1. Introduction. Iterative refinement is a method for improving an approximate solution y to a linear system Ax = b by computing the residual r = b − Ay, solving the correction equation Ad = r, forming the update y ← y + d, and repeating these steps as necessary. We consider a general iterative refinement algorithm that includes a variety of existing ones as special cases. The algorithm contains three precisions:
• u is the precision at which the data A, b and the solution x are stored (the working precision),
• u_f is the precision at which the factorization of A is computed,
• u_r is the precision at which residuals are computed.
The precisions are assumed to satisfy

    u_r ≤ u ≤ u_f.    (1.1)
The algorithm also contains a fourth precision:
• u_s is the precision at which the correction equation is (effectively) solved, with u ≤ u_s ≤ u_f.
Whereas u, u_f, and u_r are intended to be possibly different precisions supported by the computing environment (ideally the hardware), u_s is essentially a parameter that describes how accurately the correction equation is solved; it will take the value u_f or u in the cases of interest. The precise meaning of u_s is explained below.
Algorithm 1.1. Let the nonsingular matrix A ∈ R^{n×n} and b ∈ R^n be given in precision u. This algorithm uses iterative refinement to generate a sequence of approximations x_i, all stored in precision u, to the solution of Ax = b.
1  Solve Ax_0 = b in precision u_f and store x_0 at precision u.
2  for i = 0 : ∞
3      Compute r_i = b − Ax_i at precision u_r and round r_i to precision u_s.
4      Solve Ad_i = r_i at precision u_s and store d_i at precision u.
5      x_{i+1} = x_i + d_i at precision u.
6  end

Note that a different solver can be used on step 1 than on step 4. In practice these solvers will be related, so although the precision u_f does not appear in the loop, information computed in step 1 at precision u_f will be used in step 4.
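As a concrete illustration, here is a minimal NumPy/SciPy sketch of Algorithm 1.1 with (u_f, u, u_r) = (single, double, double), one of the precision combinations discussed below. The function name, stopping test, and iteration cap are our own choices, and since LAPACK provides no half precision solves, float32 stands in for the low precision u_f = u_s.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir_three_precisions(A, b, max_iter=10, tol=None):
    """Sketch of Algorithm 1.1 with (u_f, u, u_r) = (single, double, double).

    A, b hold the working-precision (float64) data.  The O(n^3) LU
    factorization is done in float32; residuals are computed in float64
    (so u_r = u here; a quad-precision residual would need an
    extended-precision library)."""
    u = np.finfo(np.float64).eps
    tol = tol or 10 * u
    # Step 1: factorize and solve at precision u_f (float32).
    lu, piv = lu_factor(A.astype(np.float32))
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for i in range(max_iter):
        # Step 3: residual at precision u_r (float64), rounded to u_s (float32).
        r = b - A @ x
        # Step 4: correction solve reusing the float32 LU factors.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        # Step 5: update at the working precision u.
        x = x + d
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf):
            break
    return x
```

The O(n^3) work all happens inside `lu_factor` at the low precision; the loop costs only O(n^2) per iteration.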
Algorithm 1.1 includes as special cases both old and more recent forms of iterative refinement, as we now explain.
In traditional iterative refinement the solver is LU factorization and residuals are computed at twice the working precision, which corresponds to u_f = u_s = u and u_r = u^2. This form of iterative refinement was programmed by Wilkinson in 1948 [35] and used at that time by Wilkinson and his colleagues on desk calculating machines and Hollerith punched card machines [13], [37]. It was in common use up until the 1970s, owing to the fact that inner products could be cheaply accumulated at twice the working precision on many computers of that era. Early error analyses are given by Wilkinson for fixed point arithmetic [36] and by Moler [26] for floating point arithmetic.
In fixed precision refinement all computations are at the same precision: u_f = u_s = u = u_r. This form of refinement started to be considered in the 1970s and was analyzed by Jankowski and Woźniakowski [21] for a general solver and by Skeel [31] for LU factorization. Higham [16] extended Skeel's analysis to a general solver and in [17] gave a further extension to allow for residuals computed in extra precision. LAPACK [3] implements fixed precision iterative refinement.
In the 2000s iterative refinement attracted renewed interest because on modern computer architectures single precision arithmetic is usually at least twice as fast as double precision arithmetic. With the working precision u equal to double precision, u_f and u_s equal to single precision, u_r = u, and a solver based on LU factorization, the most expensive part of the computation is done entirely in single precision. This usage was proposed and analyzed by Langou et al. [23] and has been exploited extensively by Dongarra and his coauthors; see [1, sec. 9] for a recent overview and further references.
Error analyses of the methods above all require A to be safely bounded away from singularity relative to the working precision. Carson and Higham [9] give a new forward error analysis of iterative refinement that identifies a mechanism allowing accurate solutions to be obtained for systems in which A has condition number as large as the order of the reciprocal of the unit roundoff. Their analysis requires the correction equation in step 4 of Algorithm 1.1 to be solved with relative error less than 1, and they achieve this by using GMRES preconditioned with the LU factors.
Table 1.1 summarizes the main existing floating point error analyses. In the table we categorize a forward error bound as componentwise if it employs the condition numbers cond(A) or cond(A, x) defined in (2.2) below and normwise if it only employs κ(A) defined in (2.1). A backward error analysis is classed as componentwise or normwise according as it bounds the absolute value or the norm of the residual vector (we do not regard a componentwise analysis as implying a normwise one, as simply taking norms in a componentwise bound does not necessarily yield the strongest normwise result).

Table 1.1: Summary of existing rounding error analyses for iterative refinement in floating point arithmetic, indicating (a) whether the analyses apply to LU factorization only or to an arbitrary solver, (b) whether the backward or forward error analyses are componentwise ("comp") or normwise ("norm"), and (c) the assumptions on the precisions u_f, u_s, u, u_r in Algorithm 1.1 (u_f = u and u_s = u_f unless otherwise stated).

                          Year   Solver   Forward error   Backward error   Precisions
    Moler [26]            1967   LU       norm            -                u ≥ u_r
    Stewart [33]          1973   LU       norm            -                u ≥ u_r
    Jankowski et al. [21] 1977   arb.     norm            norm             u = u_r
    Skeel [31]            1980   LU       comp            comp             u ≥ u_r
    Higham [16]           1991   arb.     comp            comp             u = u_r
    Higham [17], [18]     1997   arb.     comp            comp             u ≥ u_r
Half precision (16 bit) floating point arithmetic, defined as a storage format in the 2008 revision of the IEEE standard [20], is now starting to become available in hardware, for example in the NVIDIA P100 and V100 GPUs and the AMD Radeon Instinct MI25 GPU, on which it runs at twice the speed of single precision arithmetic with a proportional saving in energy consumption. It is therefore now of interest to consider iterative refinement with u_f corresponding to half precision. Table 1.2 summarizes key parameters for four IEEE arithmetic precisions. Table 1.3 presents another view of iterative refinement, from the point of view of the different ways in which it can be implemented in hardware-based IEEE arithmetic. We note that another context where low precision arithmetic is of interest is the hybrid iterative refinement method of Douglas, Mandel, and Miranker [11], which solves the correction equation using low precision analog circuitry, computes residuals using higher precision digital circuitry, and effectively has u_r = u and u_f > u.
The goal of this work is to develop new iterative refinement algorithms based on three precisions. We show that by using LU factorization in IEEE half precision as the solver, single precision as the working precision, and double precision for the computation of the residuals, it is possible to solve Ax = b to full single precision accuracy for condition numbers κ_2(A) ≤ 10^4, with the O(n^3) part of the computations carried out entirely in half precision. We show further that by using GMRES preconditioned by the LU factors as the solver on step 4, the restriction on the condition number can be weakened to κ_2(A) ≤ 10^8. These results provide the ability to solve Ax = b at up to twice the speed and the same accuracy as traditional iterative refinement with LU factorization in single precision and double precision residuals.
In order to understand the behavior of the new algorithms we provide a thorough rounding error analysis of Algorithm 1.1 in its full generality. In doing so, we
• provide rigorous componentwise forward error bounds and both componentwise and normwise backward error bounds,
• make minimal assumptions about the solvers used in steps 1 and 4, so that the analysis is applicable to all the situations mentioned above, as well as to others that can be envisaged,
• treat the precisions u_f, u_s, u, and u_r as independent parameters.
Our results generalize and unify most existing analyses, including the recent forward error analysis of [9]. We make one omission: we do not try to prove a "one step of iterative refinement in fixed precision implies componentwise backward stability" result [16], [31], which is of lesser practical importance. However, such a result can be obtained by extending our analysis, under further assumptions on the solver.
Iterative refinement is often used as a way to restore stability when a factorization has been computed in a way that preserves structure and reduces cost at the possible expense of stability, as for example with sparse matrices [7], [38] or symmetric quasidefinite matrices [14]. Our analysis can be applied to these situations, since we make very general assumptions on the solver.
Our attention is focused exclusively on iterative refinement as described in Algorithm 1.1. We do not consider here recursive (also known as binary cascade, or k-fold) iterative refinement, in which each solve on step 4 of the algorithm is carried out by a recursive application of iterative refinement, possibly with increasing precision [22], [32]. We also do not consider hybrid schemes, such as that in [8], that combine iterative refinement with some other iterative method in such a way that the basic structure of the algorithm is changed.

2. Preliminaries.
We now summarize our notation and our assumptions on the solver. We will use the standard model of floating point arithmetic [18, sec. 2.2]. Given an integer k, we define

    γ_k = ku / (1 − ku).

A superscript on γ denotes that u carries that superscript as a subscript; thus, for example, γ_k^r = ku_r / (1 − ku_r).
For a nonsingular matrix A and a vector x, we need the normwise condition number

    κ_p(A) = ‖A‖_p ‖A^{−1}‖_p    (2.1)

and the componentwise condition numbers

    cond(A) = ‖ |A^{−1}||A| ‖_∞,    cond(A, x) = ‖ |A^{−1}||A||x| ‖_∞ / ‖x‖_∞,    (2.2)

where |A| = (|a_ij|). These condition numbers measure the sensitivity of the solution of Ax = b to small normwise and componentwise perturbations, respectively [18, chap. 7].
Inequalities between vectors or matrices are interpreted componentwise. We denote by fl_r(·) the evaluation of the argument of fl_r in precision u_r.
We assume that the solver used in step 4 of Algorithm 1.1 produces a computed solution d̂_i to Ad_i = r_i satisfying three conditions:

    d̂_i = (I + u_s E_i) d_i,    u_s ‖E_i‖_∞ < 1,    (2.3)

    ‖r_i − A d̂_i‖_∞ ≤ u_s (c_1 ‖A‖_∞ ‖d̂_i‖_∞ + c_2 ‖r_i‖_∞),    (2.4)

    |r_i − A d̂_i| ≤ u_s G_i |d̂_i|,    (2.5)

where E_i, c_1, c_2, and G_i are functions of n, A, r_i, and u_s. The first assumption simply says that the normwise relative error ‖d_i − d̂_i‖_∞ / ‖d_i‖_∞ is bounded by a multiple of u_s and is less than 1. The second assumption says that the normwise relative backward error η(d̂_i) is of order at most max(c_1, c_2)u_s, where for an approximate solution y to Ax = b, η(y) is given by [18, Thm. 7.1], [30]

    η(y) := min{ ε : (A + ΔA)y = b + Δb,  ‖ΔA‖_∞ ≤ ε‖A‖_∞,  ‖Δb‖_∞ ≤ ε‖b‖_∞ }.

The third condition will be needed in order to analyze the componentwise relative backward error, which for an approximate solution y to Ax = b is given by [18, Thm. 7.3], [29]

    ω(y) := min{ ε : (A + ΔA)y = b + Δb,  |ΔA| ≤ ε|A|,  |Δb| ≤ ε|b| } = max_i |b − Ay|_i / (|A||y| + |b|)_i,

where ξ/0 is interpreted as zero if ξ = 0 and infinity otherwise. Table 2.1 summarizes the sizes of c_1, c_2, E_i, and G_i in (2.3)-(2.5) for the solvers that will be considered in sections 7 and 8.
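For concreteness, both backward error measures have well-known closed-form expressions, the Rigal-Gaches formula for η and the Oettli-Prager formula for ω, as in [18, chap. 7]; a sketch in NumPy (the function names are our own) is:

```python
import numpy as np

def eta(A, b, y):
    """Normwise relative backward error of y for Ax = b
    (Rigal-Gaches formula, cf. [18, Thm. 7.1])."""
    r = b - A @ y
    return np.linalg.norm(r, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(y, np.inf)
        + np.linalg.norm(b, np.inf))

def omega(A, b, y):
    """Componentwise relative backward error of y for Ax = b
    (Oettli-Prager formula, cf. [18, Thm. 7.3])."""
    r = np.abs(b - A @ y)
    denom = np.abs(A) @ np.abs(y) + np.abs(b)
    with np.errstate(divide="ignore", invalid="ignore"):
        # xi/0 is zero when xi = 0 and infinity otherwise
        q = np.where(r == 0, 0.0, r / denom)
    return np.max(q)
```

A solution computed by a backward stable solver should give values of order the unit roundoff for both measures.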
We present the rounding error analysis of Algorithm 1.1 in the next three sections, which provide forward error bounds, normwise backward error bounds, and componentwise bounds, respectively. The importance of scaling to avoid underflow and overflow when half precision is used is explained in section 6. In section 7 we specialize the results to the case where the solver is LU factorization and explain the numerical properties of iterative refinement in three precisions: half, single, and double. In section 8 we use GMRES preconditioned by the LU factors as the solver and show that the resulting algorithm is able to solve accurately a wider range of problems than algorithms whose solver is based on LU factorization. In section 9 we compare some of our new forms of iterative refinement with the single precision-double precision form proposed by Langou et al. [23]. Numerical experiments that test the predictions of the analysis are given in section 10. Conclusions are given in section 11.
3. Forward error analysis. For our forward error analysis of Algorithm 1.1 we will need the following lemma, which we state for a general p-norm (1 ≤ p ≤ ∞).
Lemma 3.1. Let w, z ∈ R^n with z ≠ 0. Then w = Cz for some matrix C ∈ R^{n×n} with ‖C‖_p = ‖w‖_p / ‖z‖_p.
Proof. Let g be the vector dual to z with respect to the p-norm, so that g^T z = 1 and ‖g‖_q ‖z‖_p = 1, where 1/p + 1/q = 1 [18, sec. 6.1]. Setting C = wg^T, we have Cz = w and ‖C‖_p = ‖w‖_p ‖g‖_q = ‖w‖_p / ‖z‖_p.
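As a quick illustration of Lemma 3.1 in the case p = ∞, the dual-vector construction from the proof can be carried out explicitly; the helper below is our own illustration, not part of the paper's algorithms.

```python
import numpy as np

def lemma_matrix_inf(w, z):
    """Construct C with Cz = w and ||C||_inf = ||w||_inf / ||z||_inf.

    For the inf-norm the dual vector of z is g = sign(z_k) e_k / ||z||_inf,
    where k attains max_j |z_j|: then g^T z = 1 and ||g||_1 ||z||_inf = 1,
    and C = w g^T is the matrix from the proof of Lemma 3.1."""
    k = np.argmax(np.abs(z))
    g = np.zeros_like(z)
    g[k] = np.sign(z[k]) / np.linalg.norm(z, np.inf)
    return np.outer(w, g)
```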
In the analysis we will need to bound the vector b − A x̂_i = A(x − x̂_i) in terms of x − x̂_i. We can write

    ‖b − A x̂_i‖_∞ = μ_i ‖A‖_∞ ‖x − x̂_i‖_∞,    (3.1)

where κ_∞(A)^{−1} ≤ μ_i ≤ 1. We also have the componentwise bound |b − A x̂_i| ≤ |A||x − x̂_i|. We can capture both inequalities by writing

    b − A x̂_i = C_i (x − x̂_i),    where C_i = A or ‖C_i‖_∞ = μ_i ‖A‖_∞,    (3.2)

the equalities following from Lemma 3.1 and (3.1). By using the matrices C_i we can derive a general result that can be expressed in terms of either componentwise or normwise condition numbers.
Consider first the computation of r_i. There are two stages. First, b − A x̂_i is evaluated at precision u_r:

    r̂_i = fl_r(b − A x̂_i) = b − A x̂_i + Δr_i,    |Δr_i| ≤ γ_p^r (|b| + |A||x̂_i|),    (3.3)

where p is the maximum number of nonzeros in any row of [A  b] [18, sec. 3.5]; thus p = n + 1 for a dense matrix A and vector b. Second, the residual is rounded to precision u_s, so the right-hand side actually passed to the solver is r̃_i = r̂_i + Δr̂_i with |Δr̂_i| ≤ u_s |r̂_i|. Using b = Ax we can rewrite the bound (3.3) for Δr_i as

    ‖Δr_i‖_∞ ≤ γ_p^r ( ‖C_i‖_∞ ‖x − x̂_i‖_∞ + 2‖A‖_∞ ‖x̂_i‖_∞ ),    (3.4)

where C_i satisfies (3.2).
For step 4 of Algorithm 1.1 we have, by (2.3),

    d̂_i = (I + u_s E_i) d_i,    where A d_i = r̃_i,    (3.5)

so that, using (3.4), the error in the computed correction is governed by u_s‖E_i‖_∞ together with the error Δr_i in the residual. For step 5, using the variant fl(x + y) = (x + y)/(1 + δ), |δ| ≤ u, of the rounding error model [18, eq. (2.5)], we have

    x̂_{i+1} = x̂_i + d̂_i + f_i,    |f_i| ≤ u |x̂_{i+1}|,    (3.6)

and hence

    x − x̂_{i+1} = x − x̂_i − d̂_i − f_i.    (3.7)

Combining (3.4)-(3.7) gives a bound on the error, which we summarize in a theorem.

Theorem 3.2. Let Algorithm 1.1 be applied to a linear system Ax = b, where A ∈ R^{n×n} is nonsingular, and assume the solver used on step 4 satisfies (2.3). For i ≥ 0 the computed iterate x̂_{i+1} satisfies

    ‖x − x̂_{i+1}‖_∞ ≲ ( 2u_s min(cond(A), κ_∞(A)μ_i) + u_s ‖E_i‖_∞ ) ‖x − x̂_i‖_∞ + 4p u_r cond(A, x) ‖x‖_∞ + u ‖x̂_{i+1}‖_∞,    (3.8)

where μ_i is defined in (3.1).

We note that both terms are needed in the expression min(cond(A), κ_∞(A)μ_i) in (3.8). When A is diagonal, for example, the first term in the min is the smaller, since cond(A) = 1, whereas the second term is the smaller when cond(A) ≈ κ_∞(A) ≫ 1 and μ_i ≪ 1. We can now state a result about the convergence and attainable accuracy of Algorithm 1.1.
Corollary 3.3. Under the conditions of Theorem 3.2, as long as

    φ_i = 2u_s min(cond(A), κ_∞(A)μ_i) + u_s ‖E_i‖_∞    (3.9)

is sufficiently less than 1, the forward error is reduced on the ith iteration by a factor approximately φ_i until an iterate x̂ is produced for which

    ‖x − x̂‖_∞ / ‖x‖_∞ ≈ 4p u_r cond(A, x) + u.    (3.10)

When u_s = u this result can be compared with two earlier results for general solvers. It is stronger than [17, Thms. 3.1, 3.2], [18, Thms. 12.1, 12.2] because of the presence of the term κ_∞(A)μ_i, whose significance we will explain in section 8. It is also stronger than [9, Thm. 2.1], which does not have the cond(A) term in (3.9).
Note that φ_i in (3.9) depends only on u_s. This means that the rate of convergence of iterative refinement depends only on the effective precision of the solves and is not affected by the precision at which the data are stored or the residuals are computed. By contrast, the limiting precision (3.10) depends on u_r and u, but not on u_s.
Note also that the precision u_f used for the initial solve in step 1 does not appear in Theorem 3.2 or Corollary 3.3. However, u_f does affect the accuracy of x_0 and hence the number of iterations required.

4. Normwise backward error analysis. Now we turn our attention to the behavior of the normwise backward error.
Multiplying (3.7) by A gives

    b − A x̂_{i+1} = b − A x̂_i − A d̂_i − A f_i.    (4.1)

Hence, from (4.1), using (3.6), (2.4), and two invocations of Lemma 3.1, we can bound the new residual in terms of the old one. We summarize our findings in the next two results.

Theorem 4.1. Let Algorithm 1.1 be applied to a linear system Ax = b with a nonsingular matrix A ∈ R^{n×n} satisfying c_1 κ_∞(A)u_s < 1, and assume the solver used on step 4 satisfies (2.4). Then for i ≥ 0 the computed iterate x̂_{i+1} satisfies

    ‖b − A x̂_{i+1}‖_∞ ≲ α_i ‖b − A x̂_i‖_∞ + pu ( ‖b‖_∞ + ‖A‖_∞ ‖x̂_{i+1}‖_∞ ),

where α_i ≈ (c_1 κ_∞(A) + c_2) u_s.

Corollary 4.2. Under the conditions of Theorem 4.1, as long as (c_1 κ_∞(A) + c_2) u_s is sufficiently less than 1 the residual is reduced on each iteration until, ultimately,

    ‖b − A x̂_i‖_∞ ≲ pu ( ‖b‖_∞ + ‖A‖_∞ ‖x̂_i‖_∞ ).    (4.2)

Under the conditions of the corollary, making the reasonable assumption that ‖x̂_{i−1}‖_∞ ≈ ‖x̂_i‖_∞ and using u_r ≤ u, we have, ultimately, η(x̂_i) ≲ pu; that is, x̂_i is a backward stable solution to the working precision.
Early error analyses of iterative refinement did not consider the residual, because when the solver is LU factorization with partial pivoting the residual of the original computed solution is already small, assuming there is no large element growth in the factorization. Starting with the work of Jankowski and Woźniakowski [21], it was appreciated that iterative refinement could cure instability in the solver, even in fixed precision. Our analysis shows this clearly: instability in the solver is captured by large values of c_1 and c_2 in (2.4), but as long as (c_1 κ_∞(A) + c_2)u_s is sufficiently less than 1, Corollary 4.2 guarantees that iterative refinement will yield a small backward error.
Note that there is little or no benefit to the normwise backward error of computing residuals at extra precision, since α_i in Theorem 4.1 is independent of u_r and the limiting residual is no smaller when u_r < u.

5. Componentwise backward error analysis. We now determine conditions under which Algorithm 1.1 achieves a small componentwise relative backward error.
To interpret the theorem, note first that M_1 = I + O(u_s), and so ‖W_i‖_∞ < 1 in view of (5.6). Note that we can expect G_i ≥ |A| in practice, so the dominant term in this bound will be u_s ‖G_i |A^{−1}|‖_∞. We conclude that ‖W_i‖_∞ ≪ 1 if the solver is not too unstable and A is not too ill conditioned relative to precision u_s.
In the limit the y_i term dominates, but it is not a scalar multiple of |A||x̂_{i+1}| + |b|. This is not a problem if we wish to take norms and use this analysis to obtain a bound for the normwise backward error that exploits the more descriptive bound (2.5). In order to bound ω(x̂_{i+1}) we need a simple lemma [17, Lem. 1.2].
Lemma 5.2. For A ∈ R^{n×n} and x ∈ R^n we have the bound below, where ξ(x) = max_j |x_j| / min_j |x_j|, with x_j denoting the jth component of x.
Using the lemma we obtain a bound, (5.8), showing that, ultimately, the componentwise relative backward error of x̂_{i+1} is of order u as long as the solver is not too unstable, A is not too ill conditioned, and ξ(|b| + |A||x̂_{i+1}|) is not too large. The latter condition essentially requires the vector |b| + |A||x| to be not too badly scaled, which is a natural requirement, because when |b| + |A||x| has a zero component the problem of computing a solution with a small componentwise relative backward error is ill posed; see [5], [18, p. 241].

Table 5.1 summarizes the sufficient conditions for convergence and the bounds on the limiting forward error and backward errors derived in this section and the previous two sections.

6. Scaling. Our analysis makes no assumptions on the floating point arithmetic other than that the three precisions obey the standard model and satisfy (1.1). As is usual in rounding error analyses we have ignored the possibility of underflow and overflow. In general this is reasonable, but if we take u_f to be a low precision, specifically if u_f is IEEE half precision, then underflow or overflow is quite likely in step 4 of Algorithm 1.1, since the range of half precision numbers is only about 10^{±5}.

Table 5.1: Summary of the results of sections 3-5: conditions for convergence and the limiting size of the forward error, normwise backward error, and componentwise backward error.

    Error                   Convergence condition                              Bound on limiting value
    Forward                 2u_s min(cond(A), κ_∞(A)μ_i) + u_s‖E_i‖_∞ ≪ 1      4p u_r cond(A, x) + u
    Normwise backward       (c_1 κ_∞(A) + c_2) u_s ≪ 1                         pu
    Componentwise backward  u_s ‖G_i |A^{−1}|‖_∞ ≪ 1                           pu ξ(|b| + |A||x̂|)

In this case it is important to incorporate scaling in step 4. When the solver is LU factorization we can use any scheme for avoiding overflow in solving triangular systems. One such scheme is implemented in the LAPACK subroutine xLATRS, which performs scaling based on a coarse bound on the solution size to reliably avoid overflow [4], [10]. Here we propose a simple scaling that can be used with any solver, though it may not be optimal for any given solver. We replace steps 4 and 5 of Algorithm 1.1 by

4  θ = ‖r_i‖_∞, r_i ← r_i/θ. Solve Ad_i = r_i at precision u_s and store d_i at precision u.
5  x_{i+1} = x_i + θ d_i at precision u.
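The replaced steps can be sketched as follows, with float32 standing in for the low precision u_s (LAPACK solves are not available in half precision); the scaling logic, not the particular precisions, is the point of the sketch.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def scaled_correction(lu_piv, r):
    """Sketch of the scaled steps 4-5: normalize r before the low-precision
    solve, so that its entries cannot over/underflow when cast down, and
    undo the scaling when the correction is returned.

    `lu_piv` holds LU factors computed at the low precision (float32 here,
    standing in for u_s)."""
    theta = np.linalg.norm(r, np.inf)
    r_scaled = (r / theta).astype(np.float32)      # now ||r_i||_inf = 1
    d = lu_solve(lu_piv, r_scaled).astype(np.float64)
    return theta * d                               # correction theta*d_i used in step 5
```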
After the scaling, ‖r_i‖_∞ = 1 and hence ‖A‖_∞^{−1} ≤ ‖d_i‖_∞ ≤ ‖A^{−1}‖_∞, so this scaling avoids the largest element of d_i overflowing or underflowing as long as 1/‖A‖_∞ does not underflow and ‖A^{−1}‖_∞ does not overflow.

7. Iterative refinement with LU factorization. We now explore the consequences of the results of our error analysis for the standard form of iterative refinement based on LU factorization. We recover known results and, in subsection 7.3, obtain new results for iterative refinement in three precisions.
Suppose that the solve on step 1 of Algorithm 1.1 is carried out by LU factorization with an appropriate form of pivoting and that the solves on step 4 are done with the LU factors. Throughout this section we take u_s = u_f. For notational simplicity we assume that any necessary row or column interchanges have been applied to A before the factorization starts, so that the factorization is A = LU.
Standard backward error analysis shows that the solution ŷ to Ay = c computed at precision u_f via the LU factors satisfies

    (A + ΔA)ŷ = c,    |ΔA| ≤ γ_{3n}^f |L̂||Û|,    (7.1)

where L̂ and Û are the computed LU factors [18, Thm. 9.4]. Hence we can take ‖E_i‖_∞ ≤ 3n ‖ |A^{−1}||L̂||Û| ‖_∞ in (2.3); the E_i term in (3.9) will dominate, and so in Corollary 3.3 we can take

    φ = 3n ‖ |A^{−1}||L̂||Û| ‖_∞ u_f.    (7.2)

Using [18, Lem. 9.6] it is possible to obtain a bound of the form

    ‖ |A^{−1}||L̂||Û| ‖_∞ ≤ f(n) ρ_n κ_∞(A),

where f is a cubic polynomial and ρ_n is the growth factor for the LU factorization. In order to be sure that iterative refinement converges, and does so at a reasonable rate, we need φ ≪ 1, and this is assured if A is not too ill conditioned and the factorization is not too unstable with respect to the precision u_f at which the factorization and substitutions are carried out.
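The quantity φ in (7.2) can be estimated numerically. The following sketch is our own helper, not part of the paper's algorithms; it uses SciPy's LU with partial pivoting and an explicit inverse, so it is sensible only for small test matrices.

```python
import numpy as np
from scipy.linalg import lu

def phi_estimate(A, uf):
    """Estimate phi = 3n * || |A^{-1}||L||U| ||_inf * u_f from (7.2),
    where L, U are the partial-pivoting LU factors of the row-permuted A."""
    n = A.shape[0]
    P, L, U = lu(A)                 # A = P @ L @ U
    Ap = P.T @ A                    # the matrix actually factorized: Ap = L U
    M = np.abs(np.linalg.inv(Ap)) @ (np.abs(L) @ np.abs(U))
    return 3 * n * np.linalg.norm(M, np.inf) * uf
```

With `uf` set to the half precision unit roundoff 2^{-11}, a value of `phi_estimate` well below 1 indicates that LU-based refinement with a half precision factorization can be expected to converge.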
We now consider three different scenarios for the three precisions u_f, u, and u_r.

7.1. Traditional refinement with residuals in extra precision. Consider the case where u_f = u and u_r = u^2, which corresponds to traditional iterative refinement with residuals calculated at twice the working precision. Here, the limiting accuracy is, from (3.10),

    4p u^2 cond(A, x) + u ≤ 4pu cond(A) · u + u ≤ (4/3 φ + 1) u,    (7.3)

so as long as φ in (7.2) is sufficiently less than 1 we are assured of achieving a solution with normwise relative error of order u. We therefore recover a stronger version of the well-known result first obtained by Moler [26], with cond(A) in place of κ_∞(A) in the convergence condition.

7.2. Fixed precision refinement. With u_f = u and u_r = u we have fixed precision iterative refinement, and φ is unchanged from the previous case. The difference is that the limiting accuracy is now

    4pu cond(A, x) + u ≈ 4pu cond(A, x).    (7.4)
Normwise and componentwise backward stability are shown by (4.2) and (5.8), under the conditions stated there. As originally shown by Skeel [31], the benefit of fixed precision iterative refinement for the forward error is that it gives a limiting accuracy of order cond(A, x)u instead of order κ_∞(A)u for the original computed solution, and this is irrespective of any instability in the factorization as long as φ ≪ 1 continues to hold.

7.3. Mixed precision refinement with lower precision solves.
The third scenario of interest is where we compute the LU factorization and carry out the substitutions at less than the working precision. We consider four particular cases, all of which yield new results. We show the relevant choices of IEEE precisions in the form "(u_f, u, u_r)."

Case 1: u = u_r = u_f^2 (half, single, single) or (single, double, double). This form of refinement has been analyzed and exploited by Langou et al. [23], and is also used by Arioli and Duff [6] and, for symmetric systems, by Hogg and Scott [19]. From (7.2), convergence is assured if

    φ = 3n ‖ |A^{−1}||L̂||Û| ‖_∞ u^{1/2} ≪ 1,    (7.5)

and assuming this condition holds the limiting accuracy is, from (3.10) and (7.5), 4pu cond(A, x), which is stronger than the limiting accuracy proportional to κ_∞(A)u obtained in [23]. Compared with fixed precision refinement we have a more stringent convergence requirement and the same limiting accuracy, but now the O(n^3) flops part of the computation is done at precision u^{1/2}, which is a significant computational saving. The normwise and componentwise backward errors both reach order u under the assumptions that A is not too ill conditioned and the factorization not too unstable with respect to precision u^{1/2}, and also that |A||x| + |b| is not too badly scaled in the case of the componentwise backward error.
Case 2: u_r = u^2, u = u_f^2 (half, single, double) or (single, double, quad). Now we have three precisions in play, a case for which there is no existing analysis. Convergence is again assured if (7.5) holds, and if it does we now achieve a normwise relative error of order

    4p u^2 cond(A, x) + u = u^{3/2} · 4p cond(A, x) u^{1/2} + u ≲ (4/3) u^{3/2} + u ≈ u.

Now we achieve full accuracy at precision u, albeit still only for problems with κ_∞(A) no larger than about u^{−1/2}. Nevertheless, this is a significant gain over case 1 in return for a few residuals computed at precision u^2. The limiting backward errors are of order u, as in the previous case.

Case 3: u = u_r = u_f^4 (half, double, double). In this more extreme case the factorization is done at one quarter of the working precision. The convergence condition is, from (7.2),

    φ = 3n ‖ |A^{−1}||L̂||Û| ‖_∞ u^{1/4} < 1,

and the limiting accuracy is now, from (3.10), 4pu cond(A, x) + u ≈ 4pu cond(A, x). Again, the limiting backward errors are of order u.

Case 4: u = u_f^4, u_r = u^2 (half, double, quad). In this most extreme case the convergence condition is the same as in case 3, and the limiting accuracy is now 4p u^2 cond(A, x) + u ≈ u. Again, the limiting backward errors are of order u.
Take u_f to be half precision. Case 2 shows that for a sufficiently well conditioned linear system Ax = b with single precision data we can obtain the solution correct to single precision accuracy by doing only O(n^2) operations in single or double precision, with the dominant O(n^3) part of the work at half precision. Case 3 shows that for double precision data the convergence condition is the same but the limiting accuracy is of order cond(A, x)u, and the computational saving over working entirely at precision u is even greater. In case 4 the limiting accuracy improves to order u.
The statements about cost in this subsection assume that the number of required iterations is small and independent of n, which will be the case as long as φ in Corollary 3.3 is sufficiently less than 1.
We summarize the usages described in this section in Table 7.1.

8. Mixed precision refinement with preconditioned GMRES. Carson and Higham [9] introduce a new form of iterative refinement that corresponds to Algorithm 1.1 with u_f = u and u_r = u^2 and a special way of solving the correction equation on step 4. The algorithm is intended to handle the situation where A is extremely ill conditioned, possibly even singular to working precision, so that κ_∞(A) could exceed u^{−1}. It computes an LU factorization A ≈ L̂Û and then solves the equation Ad = r on step 4 by applying GMRES to the preconditioned system

    Ã d_i ≡ Û^{−1}L̂^{−1}A d_i = Û^{−1}L̂^{−1} r_i,    (8.1)

with all computations done at precision u except that the matrix-vector products with Ã needed by GMRES are evaluated at precision u^2. Carson and Higham give an error analysis similar to that in section 3 (but with u_f = u_s = u), making the key observation that μ_i in (3.1) is typically much less than 1 in the early stages of iterative refinement.
We now consider a more general GMRES-based algorithm involving three precisions rather than two. It is a special case of Algorithm 1.1, and we write it out in detail for clarity.
Algorithm 8.1 (GMRES-IR). Let the nonsingular matrix A ∈ R^{n×n} and b ∈ R^n be given in precision u. This algorithm uses GMRES-based iterative refinement with the LU factors as preconditioners to generate a sequence of approximations x_i, all stored in precision u, to the solution of Ax = b.
1  Compute an LU factorization A = L̂Û in precision u_f.
2  Solve Ax_0 = b in precision u_f using the LU factors and store x_0 at precision u.
3  for i = 0 : ∞
4      Compute r_i = b − Ax_i at precision u_r and round r_i to precision u.
5      Solve Ã d_i = Û^{−1}L̂^{−1} r_i by GMRES at precision u, with matrix-vector products with Ã = Û^{−1}L̂^{−1}A computed at precision u_r, and store d_i at precision u.
6      x_{i+1} = x_i + d_i at precision u.
7  end
The analysis in [9, sec. 3] shows that if u_r = u^2 we can take u_s = u in (2.3), with ‖E_i‖_∞ bounded in terms of a quadratic polynomial f of n and the condition number of the preconditioned matrix Ã, these bounds being pessimistic in practice.
The reason for including the fourth precision u_s in Algorithm 1.1 is now clear: even though the LU factors in step 1 of Algorithm 8.1 are computed at precision u_f, the solve in step 5 that uses these factors achieves an error of order u_s = u. That the LU factors were computed at precision u_f is irrelevant to the preconditioned system, as long as the preconditioner L̂Û remains nonsingular. All that matters is that the factors yield a preconditioned matrix Ã with condition number much smaller than that of A.
The convergence condition φ_i ≪ 1 from the forward error analysis, where φ_i is defined in (3.9), therefore holds if

    2u κ_∞(A) μ_i + 2u κ_∞(Ã) ≪ 1.    (8.4)

As mentioned above, and explained in detail in [9], μ_i is much less than 1 in the early iterations, so this condition is in practice dominated by the second term, for which we need 2u κ_∞(Ã) ≪ 1; since κ_∞(Ã) can be bounded by a multiple of (1 + κ_∞(A)u_f)^2, this certainly holds if κ_∞(A) < u^{−1/2} u_f^{−1}, so (8.4) can hold for κ_∞(A) greater than u_f^{−1}. Then the limiting accuracy is, from (3.10), 4p u_r cond(A, x) + u. With u_f = u and u_r = u^2 this reproduces the results of [9], giving a limiting accuracy of

    4p u^2 cond(A, x) + u ≲ 4pu,    (8.5)

provided cond(A, x)u ≤ 1.

Now we set u_f = u^{1/2} and u_r = u^2. Then (8.4) essentially requires, given the behavior of μ_i, that κ_∞(A) ≪ u^{−1}, in which case Algorithm 8.1 will converge and achieve a limiting accuracy of (8.5). To be specific, this means that by taking u_f to be IEEE half precision, u single precision, and u_r double precision, we can potentially solve systems with κ_∞(A) possibly as large as u^{−1} to single precision accuracy while performing the LU factorization at half precision, so that only O(n^2) of the flops are at single or double precision. We can go even further, by setting u_f = u^{1/4} and u_r = u^2. Now (8.4) implies the condition κ_∞(A) ≪ u^{−3/4} and gives a limiting accuracy of (8.5) again.
In order to achieve this potential we need the number of iterative refinement steps (outer iterations) and the number of iterations in the GMRES solves (inner iterations), each of which involves a matrix-vector product and two triangular solves in precision u_r, to be small. If GMRES takes O(n) iterations to converge, each solve will require O(n^3) operations in precision u_r, and so any potential savings from using a lower precision LU factorization will be lost. In the case of normal Ã, the theoretical convergence rate of GMRES is completely determined by the spectrum of Ã. While a small κ_∞(Ã) often corresponds to fast GMRES convergence, this is not always the case. For example, a cluster of eigenvalues close to the origin can cause stagnation of the GMRES residual until the nth iteration, regardless of the condition number of the matrix [24]. Since the GMRES convergence rate for normal Ã is well understood, this suggests potential strategies for improving the convergence rate in the event that a lower precision LU factorization causes slow GMRES convergence. We briefly discuss such strategies at the end of section 10.
For nonnormal matrices, however, the convergence rate of GMRES is still not well understood, and the spectrum of Ã is irrelevant to the rate of GMRES convergence [15]. Nevertheless, our numerical experiments in Section 10.2.2 show that GMRES-IR in three precisions can be efficient even for ill-conditioned nonnormal matrices in some cases.
We also need to check the behavior of the residual for Algorithm 8.1. It is shown in [9] that the preconditioned system Ã d_i = U^{-1}L^{-1} r_i is solved with backward error of order u, and it is easy to show that this implies that the same is true of the original correction equation A d_i = r_i, so that we can take c_1 and c_2 in (2.4) to be of order 1 and G_i to have norm of order ‖A‖∞ in (2.5). It follows from Corollary 4.2 that a normwise backward error of order u will be obtained if κ∞(A)u is sufficiently less than 1. Similarly, the analysis of section 5 shows that a componentwise backward error of order u will be obtained if cond(A)u is sufficiently less than 1, under the usual assumptions on the problem. In the case where u_f = u, both these conditions are much stricter than the condition (8.4) required for our forward error result, essentially because the backward error analysis is not able to exploit the behavior of the µ_i that is so favorable to the forward error analysis.
In Table 8.1 we summarize the practical usages of Algorithm 8.1 with IEEE arithmetic. The fourth line in the table corresponds to the algorithm proposed in [9]. The second line in the table summarizes our finding that the LU factorization can be computed in half precision instead of single precision and the algorithm will still obtain a result correct to single precision for κ∞(A) up to 10^8; in other words, we can obtain as good a result at potentially half the cost.
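To make the structure of Algorithm 8.1 concrete, here is a minimal pure-Python sketch of the GMRES-IR idea, with several simplifications of our own: precisions are simulated (half via the struct module's IEEE binary16 format; the working and residual precision are both plain double), there is no pivoting or scaling, and the inner Krylov solver is GMRES(1), i.e., minimal residual iteration, rather than the full GMRES the paper uses. Because the preconditioned matrix Ã = U^{-1}L^{-1}A is close to the identity, even this crude inner solver converges quickly; the sketch is illustrative only.

```python
import struct

def half(x):
    """Round a double to the nearest IEEE half-precision (binary16) value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def lu_half(A):
    """Doolittle LU with every operation rounded to half (no pivoting)."""
    n = len(A)
    LU = [[half(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] = half(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = half(LU[i][j] - half(LU[i][k] * LU[k][j]))
    return LU

def apply_inv(LU, b):
    """y = U^{-1} L^{-1} b via triangular solves in double (precision u_r here)."""
    n = len(b)
    y = list(b)
    for i in range(n):                      # forward substitution, unit diagonal L
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):            # back substitution with U
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y

def matvec(A, v):
    return [sum(a * w for a, w in zip(row, v)) for row in A]

def gmres_ir(A, b, outer=5, inner=10):
    LU = lu_half(A)                         # the O(n^3) work, at half precision
    Atilde = lambda v: apply_inv(LU, matvec(A, v))   # apply Atilde in double
    x = [0.0] * len(b)
    for _ in range(outer):
        r = [bi - ri for bi, ri in zip(b, matvec(A, x))]   # residual in u_r
        s = apply_inv(LU, r)                # preconditioned right-hand side
        d = [0.0] * len(b)
        for _ in range(inner):              # GMRES(1) on Atilde d = s
            rho = [si - ti for si, ti in zip(s, Atilde(d))]
            q = Atilde(rho)
            denom = sum(v * v for v in q)
            if denom == 0.0:
                break
            alpha = sum(p * v for p, v in zip(rho, q)) / denom
            d = [di + alpha * ri for di, ri in zip(d, rho)]
        x = [xi + di for xi, di in zip(x, d)]   # update in working precision
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
x_true = [1.0, -2.0, 3.0]
b = matvec(A, x_true)
x = gmres_ir(A, b)
print(max(abs(xi - ti) for xi, ti in zip(x, x_true)))  # far below half's u ~ 5e-4
```

Replacing the inner loop by triangular solves carried out in half precision recovers the standard iterative refinement of Algorithm 1.1.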
Finally, we note that when u_f is half precision we could encounter overflow in computing x_0 at step 2, even if we scale as described in section 6. In this case we can simply set x_0 = 0. All our analysis is still valid, and when iterative refinement is rapidly converging this may have little effect on the number of iterations. As we mention in Section 10, we encounter this situation for one problem in our numerical experiments, when (u_f, u, u_r) = (half, double, quad) and κ∞(A) > 10^12.

9. Comparison with single-double implementations. Dongarra and his co-authors have made extensive use over the last decade of iterative refinement with double precision as the working precision and LU factorization computed in single precision as the solver, motivated by the fact that single precision arithmetic runs twice as fast as double precision arithmetic on modern architectures [1, sec. 9], [23]. Code implementing this form of refinement is available in the PLASMA library as routine gesv [2]. Table 9.1 shows how our new forms of iterative refinement compare with this approach. We see that:
• By changing to computing residuals in quadruple precision we can guarantee a forward error of order u (that is, the cond(A, x) factor is removed).
• By reducing the precision of the LU factorization from single to half there is no loss in forward error or backward error, but the bound on κ∞(A) for convergence to be guaranteed drops from 10^8 to 10^4.
• By switching to GMRES-IR and using quadruple precision residuals we can solve a larger class of problems (κ∞(A) bounded by 10^16, or 10^12 for the forward error bounds to hold when u_f is half precision) and are guaranteed a forward error of order u.
10. Numerical experiments.We have implemented Algorithms 1.1 and 8.1, with the scaling of section 6, in MATLAB version R2015b, using the built-in single and double precision arithmetics along with the fp16 half-precision class written by Moler [25], [27].
In all the tests in this section we use dense matrices of order n = 100 generated by the built-in MATLAB function gallery('randsvd',kappa,mode) with specified 2-norm condition number kappa. Unless otherwise specified, we use the default mode 3, which generates a random matrix with geometrically distributed singular values. The right-hand sides b are generated as MATLAB randn vectors. For reproducibility, we issue the MATLAB function call rng(1) to set the random number generator seed before generating each problem A, b. We use the MATLAB lu function to compute the LU factorization with partial pivoting. For quadruple precision, we use the Advanpix Multiprecision Computing Toolbox [28] with the setting mp.Digits(34), which is compliant with the IEEE 754-2008 standard [20].
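For readers without MATLAB, the construction behind gallery('randsvd',kappa,3), a product Q1 · diag(σ) · Q2 of random orthogonal factors with geometrically spaced singular values, can be sketched in a few lines of pure Python. The 2 × 2 version below, with rotation matrices as the orthogonal factors, is our own illustrative analogue, not MATLAB's actual implementation.

```python
import math, random

def randsvd2(kappa, seed=1):
    """A = Q1 * diag(s) * Q2 with prescribed 2-norm condition number kappa."""
    random.seed(seed)                     # cf. the paper's rng(1) for reproducibility
    s = [1.0, 1.0 / kappa]                # geometric singular values, sigma_max = 1
    def rot(t):
        return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
    Q1 = rot(random.uniform(0.0, 2.0 * math.pi))
    Q2 = rot(random.uniform(0.0, 2.0 * math.pi))
    return [[sum(Q1[i][k] * s[k] * Q2[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = randsvd2(1.0e4)
# Recover kappa_2(A) from the eigenvalues of A^T A (a 2x2 quadratic):
G = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
tr = G[0][0] + G[1][1]
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
disc = math.sqrt(tr * tr - 4.0 * det)
cond2 = math.sqrt((tr + disc) / (tr - disc))
print(cond2)  # ~1e4 by construction (orthogonal factors preserve singular values)
```

Because the orthogonal factors leave singular values unchanged, the 2-norm condition number of A equals the ratio of the chosen singular values.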
In each figure in this section, plots on the left show the behavior of the forward error ferr (red), normwise relative backward error nbe (blue), and componentwise relative backward error cbe (green). The dotted black line shows the value of the working precision u. Corresponding plots on the right show bounds on the sizes of the quantities in φ_i in the condition (3.9) for convergence of the forward error. Here we plot 2u_s κ∞(A)µ_i (cyan), 2u_s cond(A) (orange), and u_s‖E_i‖∞ (magenta). The quantity φ_i (which is the minimum of the cyan and orange values plus the magenta value) is plotted in black. The dotted black line marks 1. The x-axes are chosen to enable easy comparison between plots for different choices of u_f, u, and u_r.
In describing the choice of precisions we will use notation of the form (u_f, u, u_r) = (half, double, quad), which means that u_f, u, and u_r take the values corresponding to IEEE half, double, and quadruple precisions, respectively.
We begin with an experiment that demonstrates the potential benefits of GMRES-IR (Algorithm 8.1). The working precision is double precision. We generate the matrix A using gallery('randsvd',1e9,2). For this matrix, κ∞(A) = 2.0e+10, and for the linear system Ax = b with the randomly generated b vector, cond(A, x) = 5.2e+09. In Figure 10.1 we show convergence results for iterative refinement with LU factorization using (u_f, u_r) = (single, double) on the first row, (single, quad) on the second row, and (double, quad) on the third row. The fourth row of plots shows results using GMRES-IR with (u_f, u_r) = (single, quad). Above the plots in the last row, we show 'GMRES its', which is a list in which the ith element is the number of iterations that GMRES took to converge in refinement step i (using the convergence criterion that the relative GMRES residual 2-norm is less than or equal to 10^{-4}). From the first row, we see that this system is too ill conditioned to be solved using iterative refinement with a single precision LU factorization; neither the forward error nor the backward errors converge. From the second row, we see that, as expected, computing the residuals more precisely has no effect on convergence. The only way to improve convergence using standard iterative refinement is to use higher precision in the LU factorization; we see from the third row that with a double precision LU factorization this results in fast convergence. The fourth row of plots shows the potential gains from using GMRES-IR. Here, even though the LU factors are computed in single precision, the forward errors and backward errors all reach the level of the working precision after two refinement steps, which incur only five GMRES iterations in total.
We now investigate the behavior of both iterative refinement with LU factorization and GMRES-IR (Algorithms 1.1 and 8.1, respectively) in more detail.
10.1. Iterative refinement with LU factorization. We begin by testing iterative refinement (Algorithm 1.1) with LU factorization as the solver, first with two precisions and then three precisions. For each test in this section we list κ_2(A) (which we specify as kappa when generating the randsvd matrix), κ∞(A), and cond(A, x) above the corresponding plots.
10.1.1. Iterative refinement in two precisions. When u_r = u and u_f = u^{1/2}, so that LU factorization is done at half the working precision, we expect backward errors to converge to level u and forward errors to converge to level cond(A, x)u for matrices with condition number up to 1/u_f; see Table 7.1.
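As an illustration of this two-precision setting, the following self-contained Python sketch (our own simulation, not the paper's MATLAB code) runs Algorithm 1.1 on a tiny well-conditioned system: u_f = half is simulated via the struct module's binary16 format, while the working and residual precision are both double. The residual is scaled before being rounded to half, in the spirit of the scaling of section 6, so that it neither overflows nor underflows half's narrow range.

```python
import struct

def half(x):
    """Round a double to the nearest IEEE half-precision (binary16) value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def lu_half(A):
    """Doolittle LU with every operation rounded to half (no pivoting;
    fine for this diagonally dominant example, real codes pivot)."""
    n = len(A)
    LU = [[half(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] = half(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = half(LU[i][j] - half(LU[i][k] * LU[k][j]))
    return LU

def solve_half(LU, b):
    """Triangular solves with every operation rounded to half (u_s = u_f)."""
    n = len(b)
    y = [half(v) for v in b]
    for i in range(n):                      # forward substitution, unit diagonal L
        for j in range(i):
            y[i] = half(y[i] - half(LU[i][j] * y[j]))
    for i in reversed(range(n)):            # back substitution with U
        for j in range(i + 1, n):
            y[i] = half(y[i] - half(LU[i][j] * y[j]))
        y[i] = half(y[i] / LU[i][i])
    return y

def refine(A, b, steps=10):
    LU = lu_half(A)
    x = solve_half(LU, b)                   # x0 from the half-precision solver
    for _ in range(steps):
        # residual in the working precision (u_r = u = double here)
        r = [bi - sum(a * v for a, v in zip(row, x)) for row, bi in zip(A, b)]
        scale = max(abs(v) for v in r) or 1.0   # keep the residual in half's range
        d = solve_half(LU, [v / scale for v in r])
        x = [xi + scale * di for xi, di in zip(x, d)]   # update in double
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
x_true = [1.0, -2.0, 3.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x = refine(A, b)
print(max(abs(xi - ti) for xi, ti in zip(x, x_true)))  # far below half's u ~ 5e-4
```

With u_r = u the attainable forward error is of order cond(A, x)u rather than u, matching the limiting values listed in Table 7.1.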
Results with (u_f, u, u_r) = (single, double, double) are shown in Figure 10.2 for a matrix with condition number well within the limit of 1/u_f (top row) and a matrix that is extremely ill conditioned with respect to u_f (bottom row). The observed behavior is consistent with the theory: the forward and backward errors all converge to the expected levels (note the effect of the cond(A, x) term in the forward error limit). In the second test (bottom row), we see that κ∞(A) for the generated matrix is already close to 1/u_f. This causes the convergence factor φ_i to be close to 1, and thus many refinement steps are required for convergence. Note from the plots on the right that φ_i is dominated by the u_s‖E_i‖∞ terms.

10.1.2. Iterative refinement in three precisions. We now demonstrate the potential benefits of iterative refinement in three precisions. In Figure 10.3 we take (u_f, u, u_r) = (single, double, quad) and use the same matrices as in Figure 10.2. Comparing Figure 10.3 with Figure 10.2 shows the benefit of computing the residuals at twice the working precision: the forward error converges to level u in both cases, without any dependence on cond(A, x). Also note that the use of extra precision in the residual computation has no effect on the rate of convergence (compare the values of φ_i in the right-hand plots).

10.2. GMRES-based iterative refinement. We now test GMRES-IR (Algorithm 8.1) with the combinations of precisions described in Table 8.1. For these tests, within the GMRES method we use the convergence criterion that the relative residual in the 2-norm is no greater than 10^{-2}, 10^{-4}, and 10^{-6} when u is half precision, single precision, and double precision, respectively. Above the plots for each test, we give κ_2(A), κ∞(A), and cond(A, x).

10.2.1. GMRES-IR in two precisions. We first test GMRES-IR when two different precisions are used: u_f = u and u_r = u^2. This is the special case that was investigated in [9]. Here we expect convergence of the forward and backward errors to level u for matrices with κ∞(A) up to 1/u; see Table 8.1.
In Figure 10.4 we use (u_f, u, u_r) = (half, half, single) and in Figure 10.5 we use (u_f, u, u_r) = (single, single, double). For each combination of precisions, we show results for a matrix with condition number well within the 1/u limit (top row) and a matrix that is on the edge of numerical singularity, i.e., κ∞(A) ≈ 1/u (bottom row). For the reasonably well-conditioned matrices, the results are as expected. For the case where κ∞(A) ≈ 1/u, the results are better than expected. Despite A being extremely ill conditioned, GMRES-IR succeeds in obtaining backward and forward errors at the level u, and does so requiring very few GMRES iterations in each refinement step.
Notice that with u_f = u, κ∞(Ã) can be substantially less than κ∞(A) even when κ∞(A) is of the order of 1/u; see, e.g., the bottom row in Figure 10.5.
We note also that the orange pluses in, e.g., Figure 10.5 are at or above 1. That φ_i is nevertheless substantially less than 1 is thanks to the min function in (3.9) and the ameliorating effect of µ_i, which was first pointed out in [9].

10.2.2. GMRES-IR in three precisions. We now show the benefit of using three precisions in GMRES-IR, with u_r < u < u_f. According to the theory, the forward and backward errors should converge to level u for matrices with κ∞(A) ≤ 1/u. In other words, in GMRES-IR we can compute the LU factorization in precision u^{1/2} and still attain the same bounds on the backward and forward errors as if it were computed in precision u.
Tests with (u_f, u, u_r) set to (half, single, double) and (single, double, quad) are shown in Figures 10.6 and 10.7, respectively. Again for each set of precisions, we show results for a matrix with condition number well within the 1/u limit (top rows) and for a matrix which is extremely ill conditioned with respect to precision u (bottom rows).
The results here are consistent with the theory: in all cases we have convergence of the backward and forward errors to level u. We note that the use of lower precision for the LU factorization can work very well for reasonably well-conditioned matrices. For example, in the top row of Figure 10.7, where κ∞(A) = 1.1e+04, only four total GMRES iterations across two refinement steps are required to obtain the desired forward and backward errors. We note that standard iterative refinement with LU factorization also performs well on this problem (see Figure 10.3).
It is important to point out that the number of GMRES iterations in each refinement step increases significantly with κ∞(A) for this class of problems. In the bottom rows in both figures, where A is extremely ill conditioned with respect to u, nearly n GMRES iterations are required for each solve. Since Ã is applied in precision u_r in each GMRES iteration, this will not be efficient compared with simply computing the LU factorization more accurately.
To show that this approach can indeed still be efficient for some problems, we now run analogous experiments for problems generated using randsvd mode 2, which generates matrices having only one small singular value. The results are shown in Figures 10.8 and 10.9 for GMRES-IR with (u_f, u, u_r) set to (half, single, double) and (single, double, quad), respectively. For mode 2 matrices, the number of GMRES iterations per refinement step grows more modestly with κ∞(A). For example, in the bottom row of Figure 10.9, convergence requires only 7 total GMRES iterations even though κ∞(A) > 1/u. Also note here that κ∞(Ã) is still very large compared with κ∞(A), which emphasizes the fact that the GMRES convergence rate cannot be connected with the condition number of the preconditioned matrix.
Finally, we consider the more extreme case of GMRES-IR using precisions (u_f, u, u_r) = (half, double, quad). The analysis summarized in Table 8.1 predicts that the forward and backward errors should converge to level u ≈ 10^{-16} for matrices with κ∞(A) up to 10^{12}. In other words, in GMRES-IR we can compute the LU factorization in a quarter of the working precision without increasing the forward or backward errors.
We show tests for randsvd mode 3 matrices in Figure 10.10 and randsvd mode 2 matrices in Figure 10.11. The story is largely the same as in the case of (u_f, u, u_r) = (single, double, quad) in Figures 10.7 and 10.9. For randsvd mode 3 matrices, although the errors reach the levels predicted by the theory, each solve may require too many GMRES iterations to be practical unless A is well conditioned (see Figure 10.10). However, Figure 10.11 shows that for randsvd mode 2 matrices the number of GMRES iterations is much more favorable for ill conditioned A. Note that in the bottom row in Figure 10.10, we encounter overflow in computing the initial solution x_0 and thus take x_0 = 0.

10.3. Discussion. The experiments show that the behaviors in practice of Algorithm 1.1 and Algorithm 8.1 (GMRES-IR) match well the predictions of the analysis, and even exceed them for GMRES-IR. An important difference between the two algorithms is that GMRES-IR converges quickly in all cases (at most three refinement steps in our experiments), whereas Algorithm 1.1 using LU factorization can be much slower. This is related to the fact, visible in the right-hand columns of the plots, that Algorithm 1.1 with LU factorization is "on the edge" as regards the convergence criteria (φ_i is close to 1), whereas GMRES-IR satisfies the criteria much more comfortably.
Our experiments confirm that the LU factorization can be computed at less than the working precision while still obtaining backward errors and forward errors at the working precision.
The overall efficiency of GMRES-IR depends on the number of GMRES iterations required. Using less than working precision in the LU factorization can in some cases diminish the effectiveness of the LU factors as a preconditioner in GMRES-IR, resulting in an undesirably high number of GMRES iterations. This can in turn reduce or outweigh any potential computational savings from computing a lower precision LU factorization.
For ease of comparison between approaches, our experiments use a consistent GMRES tolerance based on the working precision (10^{-2} when u is half, 10^{-4} when u is single, and 10^{-6} when u is double). In practice, however, the GMRES tolerance could be adjusted to minimize the total number of GMRES iterations performed across refinement steps. The analysis in [9, sec. 3] shows that the smaller κ∞(Ã), the larger we can set the GMRES convergence tolerance while still meeting the constraint (2.3); of course, if the tolerance parameter is made too large, this can increase the number of refinement steps required for convergence.
Of course, whether changing the GMRES tolerance will result in fewer GMRES iterations depends on the convergence trajectory of GMRES, which in turn depends heavily on properties of the linear system. For nonnormal matrices, κ∞(Ã) and even the full spectrum of Ã have no direct connection to the GMRES convergence rate, so we cannot draw any theoretical conclusions.
For normal Ã, however, we can connect spectral properties of Ã with the convergence rate of GMRES. Although further investigation is out of the scope of this work, we briefly mention some potential approaches for improving GMRES-IR performance in cases where the coefficient matrix is normal and GMRES convergence for the resulting preconditioned matrix Ã constructed using lower precision LU factors is slow. One possibility is to add an additional preconditioner in order to eliminate eigenvalues or clusters of eigenvalues that cause difficulties for GMRES. One could also incorporate a deflation-based technique to eliminate these parts of the spectrum. Another approach (for any Ã) is to try a different Krylov subspace iterative method; though it may not have the same guarantees on backward stability, it may provide a faster convergence rate in practice.

Corollary 4.2. Under the conditions of Theorem 4.1, if φ = (c_1 κ∞(A) + c_2)u_s is sufficiently less than 1 then the residual is reduced on each iteration by a factor approximately φ until b

Table 1.3
Different scenarios for iterative refinement in IEEE arithmetic. The columns represent different choices for u_f, u, and u_r, where in the notation of Algorithm 1.1 the data is stored at precision u, the solves in steps 1 and 4 are carried out in precision u_f = u_s, and residuals are computed at precision u_r. The last column indicates whether any existing backward or forward error analysis is applicable to this situation when LU factorization is used as the solver.

Table 2.1
Summary of the sizes of the quantities in assumptions (2.3)-(2.5) for solution of the correction equation with LU factorization (section 7) and GMRES-IR (section 8). Note that f

Table 7.1
Different choices of IEEE standard precision for Algorithm 1.1 with LU factorization (assumed numerically stable) as the solver (and with u_s = u_f). The fourth column shows a bound on κ∞(A) that must hold for the analysis to guarantee convergence (under suitable conditions described in the text) with limiting backward or forward errors of the orders shown in the final three columns.

Table 8.1
Different choices of IEEE precision for Algorithm 8.1. The middle column shows a bound on κ∞(A) that must hold for the analysis to guarantee convergence with the limiting backward or forward errors shown in the final three columns.

Table 9.1
Comparison of results for iterative refinement using the solvers given in the first column. The first row corresponds to the usage of Dongarra and his co-authors.