Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems

Erin Carson, Nicholas J. Higham, and Srikara Pranesh

The standard iterative refinement procedure for improving an approximate solution to the least squares problem $\min_x \|b - Ax\|_2$, where $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ has full rank, is based on solving the $(m + n) \times (m + n)$ augmented system with the aid of a QR factorization. In order to exploit multiprecision arithmetic, iterative refinement can be formulated to use three precisions, but the resulting algorithm converges only for a limited range of problems. We build an iterative refinement algorithm called GMRES-LSIR, analogous to the GMRES-IR algorithm developed for linear systems [SIAM J. Sci. Comput., 40 (2018), pp. A817-A847], that solves the augmented system using GMRES preconditioned by a matrix based on the computed QR factors. We explore two preconditioners: the first has full off-diagonal blocks and the second is block diagonal and can be applied in either left-sided or split form. We prove that for a wide range of problems the first preconditioner yields backward and forward errors for the augmented system of order the working precision under suitable assumptions on the precisions and the problem conditioning. Our proof does not extend to the block diagonal preconditioner, but our numerical experiments show that with this preconditioner the algorithm performs about as well in practice.

1. Introduction. We consider the linear least squares problem $\min_x \|b - Ax\|_2$, where $A \in \mathbb{R}^{m\times n}$ ($m \ge n$) has full rank. A common method of solution uses the QR factorization
$$A = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R,$$
where $Q = [Q_1, Q_2] \in \mathbb{R}^{m\times m}$ is an orthogonal matrix with $Q_1 \in \mathbb{R}^{m\times n}$ and $Q_2 \in \mathbb{R}^{m\times(m-n)}$, and $R \in \mathbb{R}^{n\times n}$ is upper triangular. The unique least squares solution is $x = R^{-1} Q_1^T b$. Least squares problems may be ill conditioned in practice, and so rounding errors may result in an insufficiently accurate computed solution. In this case iterative refinement may be used to improve the accuracy, and it also improves stability.
Two different approaches can be used for iterative refinement of least squares problems. When the overdetermined system is nearly compatible (i.e., there exists an $x$ such that $Ax = b$ holds exactly or nearly so), an approach analogous to iterative refinement for square linear systems can be employed. After computing the initial approximate solution $x_0$, the solution is refined via the process whose $(i+1)$st step is
1. Compute $r_i = b - Ax_i$.
2. Solve the least squares problem $\min_{d_i} \|Ad_i - r_i\|_2$.
3. Update $x_{i+1} = x_i + d_i$.
If a QR factorization was used to compute the initial approximate solution $x_0$, then the QR factors can be reused to solve for the correction $d_i$ in each step 2. This approach was used by Golub [10] and analyzed by Golub and Wilkinson [11].
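As an illustration, the three-step refinement process above can be sketched in a few lines of NumPy. This is a plain double precision sketch for a nearly compatible problem, not the paper's code, and the function name is ours.

```python
import numpy as np

def lsir_compatible(A, b, steps=3):
    """Iterative refinement for a nearly compatible least squares problem,
    reusing the (thin) QR factors of A to solve each correction problem."""
    Q, R = np.linalg.qr(A)               # thin QR: Q is m-by-n, R is n-by-n
    x = np.linalg.solve(R, Q.T @ b)      # initial solution x_0
    for _ in range(steps):
        r = b - A @ x                          # 1. residual
        d = np.linalg.solve(R, Q.T @ r)        # 2. min ||A d - r||_2 via QR
        x = x + d                              # 3. update
    return x
```

With exact arithmetic one correction step already reproduces the QR solution; in floating point the refinement damps the effect of rounding errors in the initial solve.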
A generalization of this approach that works even when $Ax = b$ is inconsistent was suggested by Björck [2]. Refinement is performed on the augmented system
$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}, \tag{1.1}$$
which is equivalent to the normal equations. Given $x_0$ and $r_0 = b - Ax_0$, the $(i+1)$st refinement step is as follows.
1. Compute the residual vector for the augmented system:
$$\begin{bmatrix} f_i \\ g_i \end{bmatrix} = \begin{bmatrix} b - r_i - Ax_i \\ -A^T r_i \end{bmatrix}. \tag{1.2}$$
2. Solve the system
$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} \delta r_i \\ \delta x_i \end{bmatrix} = \begin{bmatrix} f_i \\ g_i \end{bmatrix}. \tag{1.3}$$
3. Update the solution to the augmented system:
$$r_{i+1} = r_i + \delta r_i, \qquad x_{i+1} = x_i + \delta x_i. \tag{1.4}$$
In this way, the solution $x_i$ and residual $r_i$ for the least squares problem are simultaneously refined. Björck [2] shows that the linear system (1.3) can be solved by reusing the QR factors of $A$ via the process
$$h = R^{-T} g_i, \tag{1.5}$$
$$\begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = Q^T f_i, \qquad d_1 \in \mathbb{R}^n, \tag{1.6}$$
$$\delta x_i = R^{-1}(d_1 - h), \tag{1.7}$$
$$\delta r_i = Q \begin{bmatrix} h \\ d_2 \end{bmatrix}. \tag{1.8}$$
Existing analyses of the convergence and accuracy of this approach in finite precision assume that at most two precisions are used: the working precision $u$ is used to compute the QR factorization, solve the system (1.3), and compute the update (1.4), while a second precision $u_r \le u$ is used to compute the residuals in (1.2). Typically $u_r = u^2$, in which case it can be shown that as long as the condition number of the augmented system matrix is smaller than $u^{-1}$, the refinement process converges with a limiting forward error of order $u$; see [3] and [15, sect. 20.5] and the references therein.
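The solve (1.5)-(1.8) can be checked directly in NumPy. The sketch below assumes a full QR factorization ($Q$ of order $m$) in double precision; the function name and argument layout are ours.

```python
import numpy as np

def augmented_solve(Q, R, f, g):
    """Solve [I, A; A^T, 0] [dr; dx] = [f; g] via the process (1.5)-(1.8),
    where A = Q[:, :n] @ R with Q m-by-m orthogonal and R n-by-n."""
    n = R.shape[0]
    h = np.linalg.solve(R.T, g)          # (1.5) h = R^{-T} g
    d = Q.T @ f                          # (1.6) [d1; d2] = Q^T f
    dx = np.linalg.solve(R, d[:n] - h)   # (1.7) dx = R^{-1}(d1 - h)
    dr = Q @ np.concatenate([h, d[n:]])  # (1.8) dr = Q [h; d2]
    return dr, dx
```

One can verify that $A^T \delta r = R^T Q_1^T \delta r = R^T h = g$ and $\delta r + A\,\delta x = Q([h; d_2] + [d_1 - h; 0]) = QQ^T f = f$, so both block equations of (1.3) hold.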
Motivated by the emergence of multiprecision capabilities in hardware, Carson and Higham [5] have recently analyzed iterative refinement for (square) linear systems in the case where three different precisions are used: $u_f$ for the matrix factorization, $u$ for the working precision, and $u_r$ for the computation of residuals, where $u_f \ge u \ge u_r$. The analysis additionally uses a fourth "precision", denoted $u_s$, which is essentially a parameter that describes how accurately the correction equation is solved ($u_s$ takes the value $u$ or $u_f$ in the cases of interest).
As the factorization of the system matrix is often the most expensive part of the computation, it is desirable from a performance standpoint that low precision be used in the factorization, ideally without affecting convergence, numerical stability, or accuracy. The results in [5] show that this is possible under certain constraints on the condition number of the matrix, which depend on the particular method used to solve the correction equations. For example, with single precision as the working precision, the matrix factorization computed in half precision, residuals computed in double precision, and the correction equation solved by GMRES preconditioned with the computed LU factors, it is possible to solve the square linear system $Ax = b$ to full single precision accuracy for condition numbers $\kappa_\infty(A) = \|A^{-1}\|_\infty \|A\|_\infty \le 10^7$, with the $O(n^3)$ part of the computation carried out entirely in half precision. Significant speedups can therefore be obtained on hardware that supports half precision: in [12], [13], using the tensor cores on an NVIDIA V100 GPU, Haidar et al. obtain a speedup of 4 over the state of the art double precision solver.

In this work we wish to extend the results of [5] to least squares problems. In [17] a mixed-precision least squares solver based on the normal equations was proposed for the case where $A$ is well conditioned. The present work differs in that it is not based on the normal equations and is applicable to a wider range of problems. It also differs from [9], which focuses on the use of higher precision arithmetic and does not use a precision lower than the working precision. Define
$$\widetilde{A} = \begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix}, \qquad \widetilde{b} = \begin{bmatrix} b \\ 0 \end{bmatrix}, \qquad y = \begin{bmatrix} r \\ x \end{bmatrix}. \tag{1.9}$$
Three-precision iterative refinement based on the augmented system is stated as Algorithm 1.1, in an analogous way as for linear systems.

Algorithm 1.1 (least squares iterative refinement in three precisions).
1: Compute the QR factorization of $A$ at precision $u_f$.
2: Compute the initial solution $y_0$ from the QR factors at precision $u_f$; store $y_0$ at precision $u$.
for $i = 0, 1, 2, \ldots$ do
3: Compute $s_i = \widetilde{b} - \widetilde{A} y_i$ at precision $u_r$ and round $s_i$ to precision $u_s$.

4: Solve $\widetilde{A}\,\delta y_i = s_i$ at precision $u_s$ and store $\delta y_i$ at precision $u$.
5: $y_{i+1} = y_i + \delta y_i$ at precision $u$.
6: end for

The theorems developed in [5] on the forward error and the normwise and componentwise backward errors for iterative refinement of linear systems are thus applicable. The only thing that changes is the analysis of the method for solving the correction equation in line 4, since we now have a QR factorization of $A$, which can be used in various ways, including the procedure (1.5)-(1.8), which we will refer to as the "standard" method. For a given solver, in order to apply the analysis from [5] we need to show that
$$\widehat{\delta y}_i = (I + u_s E_i)\,\delta y_i, \qquad u_s \|E_i\|_\infty < 1, \tag{1.10}$$
$$\|\widehat{s}_i - \widetilde{A}\,\widehat{\delta y}_i\|_\infty \le u_s \bigl(c_1 \|\widetilde{A}\|_\infty \|\widehat{\delta y}_i\|_\infty + c_2 \|\widehat{s}_i\|_\infty\bigr), \tag{1.11}$$
$$|\widehat{s}_i - \widetilde{A}\,\widehat{\delta y}_i| \le u_s G_i |\widehat{\delta y}_i|, \tag{1.12}$$
where hats denote quantities computed in finite precision and $E_i$, $c_1$, $c_2$, and $G_i$ are functions of $m + n$, $\widetilde{A}$, $s_i$, and $u_s$, and have nonnegative entries. Here, and throughout, inequalities between vectors and matrices hold componentwise. Given these bounds for a particular solution method, the convergence conditions and bounds on limiting values of various errors proved in [5] for three-precision iterative refinement of linear systems yield the convergence conditions and bounds for Algorithm 1.1 shown in Table 1.1. In the table, $c$ is a small constant independent of $m$ and $n$, and $\gamma_k = ku/(1 - ku)$; a superscript on $\gamma$ denotes that $u$ carries that superscript as a subscript, so that, for example, $\gamma_k^r = ku_r/(1 - ku_r)$. The quantity $\mu_i \equiv \|\widetilde{A}(y - y_i)\|_\infty / (\|\widetilde{A}\|_\infty \|y - y_i\|_\infty)$ that appears in the forward error convergence criterion in Table 1.1 describes the tightness of the inequality $\|\widetilde{A}(y - y_i)\|_\infty \le \|\widetilde{A}\|_\infty \|y - y_i\|_\infty$. As explained in [4, sect. 2.1], we expect $\mu_i$ to be close to its lower bound of $\kappa_\infty(\widetilde{A})^{-1}$ near the beginning of the iterations and to grow close to its upper bound of 1 only once the forward error becomes small. We note that, depending on the particular solver used for the correction equation, either the term with $\mu_i$ or the term with $\|E_i\|_\infty$ can dominate.
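Algorithm 1.1 can be simulated with NumPy dtypes standing in for the three precisions: factors stored in fp16 ($u_f$), working arithmetic in single ($u$), residuals in double ($u_r$). This is only a crude simulation (the "fp16 factorization" is an exact QR rounded to half precision) and the function name is ours.

```python
import numpy as np

def lsir3(A, b, steps=10):
    """Algorithm 1.1 sketch: refine r and x for min ||b - Ax||_2 via the
    augmented system, with u_f = fp16 factors, u = fp32, u_r = fp64."""
    m, n = A.shape
    Qf, Rf = np.linalg.qr(A.astype(np.float32), mode='complete')
    Q = Qf.astype(np.float16).astype(np.float32)          # u_f-quality factors
    R = Rf[:n, :].astype(np.float16).astype(np.float32)
    x = np.linalg.solve(R, (Q.T @ b.astype(np.float32))[:n])   # initial x_0
    r = (b - A @ x.astype(np.float64)).astype(np.float32)      # initial r_0
    for _ in range(steps):
        # residual (1.2) of the augmented system, computed at precision u_r
        f = (b - r.astype(np.float64) - A @ x.astype(np.float64)).astype(np.float32)
        g = (-A.T @ r.astype(np.float64)).astype(np.float32)
        # correction (1.3) via (1.5)-(1.8) with the low precision factors
        h = np.linalg.solve(R.T, g)
        d = Q.T @ f
        dx = np.linalg.solve(R, d[:n] - h)
        dr = Q @ np.concatenate([h, d[n:]])
        x, r = x + dx, r + dr                                  # update (1.4)
    return x
```

Because the residuals use the exact $A$ in double precision, the limiting accuracy is set by $u$ and $u_r$, not by the fp16 factors, provided the iteration converges.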
Our contributions in this work are as follows.
• We show that (1.10)-(1.12) hold for Algorithm 1.1 and determine bounds for $\|E_i\|_\infty$, $c_1$, $c_2$, and $\|G_i\|_\infty$.
• We extend the GMRES-based refinement scheme of [4] to the least squares case and show that one can construct a left preconditioner from the existing QR factors of $A$ such that GMRES provably converges to a backward stable solution of the preconditioned augmented system.
• We show that an existing preconditioner developed for saddle point systems can also work well in the GMRES-based approach in practice, even though our error analysis does not apply to it.

In section 2 we analyze the normwise relative error and the normwise and componentwise relative backward errors for the QR-based correction solve for the augmented system, and we write these bounds in the forms given by (1.10)-(1.12). In section 3 we discuss a technique that uses the already computed QR factors in a preconditioned Krylov subspace method for solving the correction equations, which extends the range of matrix condition numbers for which lower precision factorizations can be used; this is analogous to the GMRES-IR technique presented in [4]. In section 4 we present numerical experiments, and we give our conclusions in section 5.
We first state a result showing that if the system (1.3) is solved via equations (1.5)-(1.8) in precision $u_f \ge u$, using a QR factorization computed in precision $u_f$, then the effective solve precision is $u_s = u_f$.
Proof. The proof follows closely the analysis in [2] and [14] and is omitted.
This theorem essentially says that the backward error of the correction solve using the standard approach is limited by the precision in which the QR factorization is computed. Following the argument in [15, sect. 20.5] and using the assumption that $u_f \ge u$, we can obtain an expression in which only the matrix is perturbed,
$$(\widetilde{A} + \Delta\widetilde{A})\,\widehat{\delta y}_i = \widehat{s}_i, \qquad \|\Delta\widetilde{A}\|_\infty \le c_{m,n}\, u_f \|\widetilde{A}\|_\infty,$$
where $c_{m,n}$ is a constant that depends on $m$ and $n$. Changing notation as described in (1.9), we can write the error in the correction solve in each step of least squares iterative refinement in the form (1.10) with $u_s = u_f$. Thus for the forward error the $u_s \|E_i\|_\infty$ term will dominate the criterion for convergence (see the convergence condition in Table 1.1), and we can say that as long as $\kappa_\infty(\widetilde{A}) \ll u_f^{-1}$, we expect the method to converge to a limiting relative forward error of order $p\, u_r \operatorname{cond}(\widetilde{A}, y) + u$, where $p$ is the maximum number of nonzeros per row of $[\widetilde{A}, y]$. It is also clear that $c_1$ and $c_2$ can be taken to be of order 1 and $\|G_i\|_\infty$ of order $\|\widetilde{A}\|_\infty$, so again we expect the normwise and componentwise backward errors for the augmented system to be of order $u$ as long as $\kappa_\infty(\widetilde{A}) \ll u_f^{-1}$. (The backward error for the least squares problem itself is not bounded by our analysis; see [15, secs. 20.5, 20.8] for details of this backward error.)

It is important to comment on the condition number of the augmented system matrix $\widetilde{A}$ and how it is related to the condition number of $A$. Björck [2] shows that the matrix
$$\widetilde{A}_\alpha = \begin{bmatrix} \alpha I & A \\ A^T & 0 \end{bmatrix},$$
which results from scaling the $(1,1)$ block by a parameter $\alpha > 0$, satisfies
$$\min_\alpha \kappa_2(\widetilde{A}_\alpha) \approx \sqrt{2}\, \kappa_2(A) \tag{2.5}$$
up to a modest constant, where the minimum over $\alpha$ is attained for the choice
$$\alpha = 2^{-1/2}\sigma_{\min}(A). \tag{2.6}$$
This scaling is equivalent to $b - Ax \leftarrow (b - Ax)/\alpha$, which does not change the solution of the problem, so we can assume in the analysis that $\kappa_2(\widetilde{A})$ is of the same order of magnitude as $\kappa_2(A)$. We can further make the simplifying assumption that $\alpha$ is a power of the machine base and thus does not affect the rounding errors.
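A quick numerical check of the scaling choice (2.6): each singular value $\sigma$ of $A$ contributes eigenvalues $\tfrac{1}{2}(\alpha \pm \sqrt{\alpha^2 + 4\sigma^2})$ to $\widetilde{A}_\alpha$, and $\alpha$ itself is an eigenvalue with multiplicity $m - n$; equating the two candidate smallest magnitudes gives $\alpha = \sigma_{\min}/\sqrt{2}$. The construction and variable names below are ours.

```python
import numpy as np

def augmented(A, alpha):
    """The scaled augmented matrix [alpha*I, A; A^T, 0]."""
    m, n = A.shape
    return np.block([[alpha * np.eye(m), A],
                     [A.T, np.zeros((n, n))]])

# Build A with known singular values so that kappa_2(A) = 1e3 exactly.
rng = np.random.default_rng(3)
m, n = 30, 5
U = np.linalg.qr(rng.standard_normal((m, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
sv = np.logspace(0, -3, n)
A = U @ np.diag(sv) @ V.T

alpha_opt = sv[-1] / np.sqrt(2)                   # (2.6): sigma_min(A)/sqrt(2)
k_A = np.linalg.cond(A)
k_opt = np.linalg.cond(augmented(A, alpha_opt))   # ~ sqrt(2) * kappa_2(A)
k_naive = np.linalg.cond(augmented(A, 1.0))       # roughly kappa_2(A)^2
```

The unscaled ($\alpha = 1$) augmented matrix is far worse conditioned than $A$ itself, which is what the scaling avoids.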
With this scaling, the correction equation to be solved in each step of iterative refinement becomes the corresponding scaled version of (1.3). In the remainder of the paper we let $\widetilde{A}$ denote the scaled matrix (i.e., $\widetilde{A} \leftarrow \widetilde{A}_\alpha$), with $\alpha$ as in (2.6), to simplify the notation. We summarize the implications of the analysis in Table 2.1, which gives the limiting forward error, normwise backward error, and componentwise backward error, and the maximum $\kappa_\infty(A)$ (with which we approximate $\kappa_\infty(\widetilde{A})$, for the reasons explained in the previous paragraph) for which these bounds apply, for various combinations of precisions $(u_f, u, u_r)$. For example, we expect "standard" iterative refinement for least squares problems using half precision for the QR factorization, single precision as the working precision, and double precision for the residual computation to obtain forward and backward errors of single precision accuracy for matrices $A$ with condition number less than $u_f^{-1}$. The values of the unit roundoff for various precisions can be found in Table 2.2.

3. GMRES-based least squares iterative refinement. The analysis in [4] (and in the subsequent work [5]) shows that iterative refinement for linear systems can converge even in cases where the condition number of the matrix exceeds $u_f^{-1}$, as long as the corrections are computed with some degree of relative accuracy. The standard approach based on LU factorization, in which the computed LU factors are reused to solve the correction equation, does not provide any relative accuracy for matrix condition numbers larger than $u_f^{-1}$. Thus for linear systems that are very ill conditioned relative to the factorization precision, a different solver is needed.
In [4], [5] it was shown that the GMRES method with the computed LU factors (potentially computed in lower precision) used as left preconditioners can provide the required relative accuracy in the computed solution of the correction equation. This idea was motivated by the observation that even for ill-conditioned matrices the computed LU factors contain useful information and thus can still be effective preconditioners in terms of reducing the condition number.
One option is to apply this GMRES-based iterative refinement, referred to as GMRES-IR, to the augmented system. However, computing the LU factorization of $\widetilde{A}$ incurs $O((m+n)^3)$ operations as opposed to $O(mn^2)$ operations for the QR factorization of $A$, with potentially greater interprocessor communication requirements as well. At scale this is an unattractive approach, especially if a QR factorization of $A$ is already available.
In the next section we show that it is instead possible to construct an effective left preconditioner $M$ for $\widetilde{A}$ from the already computed QR factors of $A$. As a side effect, our analysis demonstrates how the analysis of GMRES-IR in [4] and [5] can be extended to allow a general choice of preconditioner (i.e., one not necessarily based on an LU factorization).
3.1. A preconditioner from computed QR factors. We wish to construct a preconditioner for which the resulting preconditioned matrix has a sufficiently small condition number. This will allow us to prove that preconditioned GMRES will find a sufficiently accurate solution. We also require, for analysis purposes, that left preconditioning is used. This is because right-preconditioned (and thus split-preconditioned) GMRES is not backward stable, although we note that split preconditioning may nevertheless work well in practice. We elaborate on this in section 3.2.
For a given preconditioner $M$, write $E = \widetilde{A} - M$. We have $M^{-1}\widetilde{A} = I - M^{-1}E$ and therefore
$$\kappa_\infty(M^{-1}\widetilde{A}) \le \frac{1 + \|M^{-1}E\|_\infty}{1 - \|M^{-1}E\|_\infty} \quad \text{if } \|M^{-1}E\|_\infty < 1. \tag{3.1}$$
We now give results for the particular preconditioner
$$M = \begin{bmatrix} \alpha I & Q_1 \widehat{R} \\ \widehat{R}^T Q_1^T & 0 \end{bmatrix}, \tag{3.2}$$
which will meet our stated requirements. In order to simplify the analysis we have used $Q_1$ rather than the computed factor $\widehat{Q}_1$ in the definition above; that is, we assume that $Q_1$ has orthonormal columns. This is equivalent to ignoring terms in $u_f^2$ and does not affect the conclusions of the analysis. Bounding the individual blocks $M^{-1}E(i,j)$ of $M^{-1}E$, using $|\alpha| \approx \|A^+\|_2^{-1}$, and again ignoring terms of order $u_f^2$, we obtain from (3.1) a bound of the form
$$\kappa_\infty(M^{-1}\widetilde{A}) \lesssim c_{m,n}\,\bigl(1 + u_f\,\kappa_\infty(A)\bigr)^2, \tag{3.7}$$
and we note that this inequality is pessimistic. Thus even in cases where $\kappa_\infty(A) \gg u_f^{-1}$, we still expect the preconditioner $M$ composed from the QR factors computed in precision $u_f$ to reduce the condition number of $\widetilde{A}$. By the result proved in [4, sect. 3], if $u_r = u^2$ then GMRES run on $\widetilde{A}$ with left preconditioner $M$ solves the correction equation with relative error satisfying
$$u_s \|E_i\|_\infty \lesssim f(m+n)\, u\, \kappa_\infty(M^{-1}\widetilde{A}), \tag{3.8}$$
where $f(m+n)$ is a quadratic polynomial. Thus for the GMRES-based solver we achieve an error of order $u_s = u$ in (1.10). Combining (3.8) with (3.7), we expect to have $u_s \|E_i\|_\infty < 1$ (and thus convergence of the forward error in GMRES-based iterative refinement) when $\kappa_\infty(A) < u^{-1/2} u_f^{-1}$. The case $u_r = u$ is treated in [16, sect. 3.1]; there the condition for convergence of the backward error becomes $\kappa_\infty(A) < u^{-1/2}$ and the condition for convergence of the forward error becomes $\kappa_\infty(A) < u^{-1/3} u_f^{-2/3}$. We note, as is stated in [16], that these results are pessimistic. These results are summarized for various combinations of IEEE standard precisions in Table 3.1; note that the table exploits the fact that $\kappa_\infty(\widetilde{A})$ can be replaced by $\kappa_\infty(A)$ (up to a constant that includes the change of norm) in view of (2.5).
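The effect of the preconditioner (3.2) can be illustrated numerically. Here the half precision QR factorization is simulated by rounding exact factors to fp16, which gives a residual $A - \widehat{Q}_1\widehat{R}$ of the same order $u_f\|A\|$ as a true fp16 Householder QR; the construction and constants below are ours.

```python
import numpy as np

def aug(B, alpha):
    """[alpha*I, B; B^T, 0]."""
    m, n = B.shape
    return np.block([[alpha * np.eye(m), B],
                     [B.T, np.zeros((n, n))]])

rng = np.random.default_rng(4)
m, n = 60, 6
U = np.linalg.qr(rng.standard_normal((m, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
sv = np.logspace(0, -3, n)                  # kappa_2(A) = 1e3
A = U @ np.diag(sv) @ V.T
alpha = sv[-1] / np.sqrt(2)

# Stand-in for QR factors computed in half precision.
Qh, Rh = np.linalg.qr(A)
Qh = Qh.astype(np.float16).astype(np.float64)
Rh = Rh.astype(np.float16).astype(np.float64)

At = aug(A, alpha)                # the (scaled) augmented matrix
M = aug(Qh @ Rh, alpha)           # preconditioner (3.2) from the computed factors
precond = np.linalg.solve(M, At)  # M^{-1} times the augmented matrix
```

For this mildly ill-conditioned example the preconditioned matrix is much better conditioned than the augmented matrix itself, in line with (3.7).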
Note also that this table is very similar to the corresponding table for iterative refinement of square linear systems in [16].

In Figure 3.1 we plot the infinity-norm condition numbers of $\widetilde{A}$ and $M^{-1}\widetilde{A}$, for $M$ defined in (3.2) with the QR factorization performed in half precision and $\alpha$ set to the optimal value $2^{-1/2}\sigma_{\min}(A)$. We also plot an approximation of the bound (3.7) with the dimensional constants ignored, as well as the value $u^{-1}$ for single precision. By (3.8), when $\kappa_\infty(M^{-1}\widetilde{A})$ is approximately below this level we expect convergence of the forward error. The computations (and all computations in the remainder of the paper) were performed in MATLAB R2019a. To compute a QR factorization of $A$ in half precision we use a MATLAB implementation of Householder QR factorization that calls the chop function of Higham and Pranesh [18]. The preconditioned matrix $M^{-1}\widetilde{A}$ was computed in quadruple precision arithmetic using the Advanpix Multiprecision Computing Toolbox [23] and then rounded to the working precision. The matrices were generated as gallery('randsvd',[100,10],kappa(i),3), which constructs a random 100 × 10 matrix with specified 2-norm condition number kappa(i) and geometrically distributed singular values. We tested the kappa values $10^j$, $j = 0\colon 10$.
From Figure 3.1 we can see that the condition number of $M^{-1}\widetilde{A}$ indeed grows in line with the bound (3.7), and that the preconditioned matrix has a smaller condition number than $A$ until around $\kappa_\infty(A) \approx 10^7$. The horizontal line plotting the quantity $u^{-1}$ indicates the maximum condition number allowable such that $u_s\|E_i\|_\infty$ in (3.8) is less than 1, meaning that the iterative refinement process is guaranteed to converge. As mentioned, the bound (3.7), which intersects the horizontal line at $u^{-1/2}u_f^{-1}$, is a slight overestimate; the value of $u_s\|E_i\|_\infty$ can actually remain below 1 for even higher values of $\kappa_\infty(A)$.
Again, for the backward errors, the results in [4] show that when the matrix-vector products with the preconditioned matrix are computed at precision $u_r = u^2$, the original correction equation $\widetilde{A}\,\delta y_i = s_i$ is solved with backward error of order $u$, and thus $c_1$ and $c_2$ in (1.11) can be taken to be of order 1 and $G_i$ in (1.12) can be taken to have norm of order $\|\widetilde{A}\|_\infty$.

Table 2.2 displays the properties of some floating-point formats currently available in hardware, where IEEE formats are denoted "fp" [20]. Owing to the narrow range of the half precision fp16 format, underflow and overflow can readily arise in practice. To address this issue in the context of solving linear systems, an algorithm based on two-sided diagonal scaling was proposed in [19]. In our context of QR factorization we propose one-sided diagonal scaling on the right, which is summarized in Algorithm 3.1, in which $\mathrm{fl}_h$ denotes rounding to fp16. Note that the scaled matrix $AS$ has columns of unit 2-norm. The purpose of the scaling by $\mu$ is to make full use of the dynamic range while allowing some headroom for subsequent computations. This column scaling strategy is not required for bfloat16 because of its much larger range. We are not concerned here with the implementation details of QR factorization in fp16; see [25] for an implementation targeted at GPU tensor cores.

Algorithm 3.1 (fp16 QR factorization). This algorithm rounds $A \in \mathbb{R}^{m\times n}$, given in precision $u$, to the fp16 matrix $A^{(h)}$, scaling all elements to avoid overflow, and then performs the QR factorization in fp16. $\theta_{\max} \in (0, 1]$ is a parameter.
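The scaling step of Algorithm 3.1 can be sketched as follows. The helper name is ours, and the precise form of $\mu$ ($\theta_{\max}$ times the largest finite fp16 number, divided by the largest scaled entry) is our reading of the algorithm, not a verbatim transcription; columns of $A$ are assumed nonzero, and the fp16 QR factorization itself is not reproduced since NumPy has no native half precision QR.

```python
import numpy as np

def fp16_scale_round(A, theta_max=0.1):
    """Right-scale A to unit 2-norm columns, multiply by mu so the entries
    fill theta_max of the fp16 range, and round to fp16 (no overflow)."""
    xmax = np.finfo(np.float16).max               # 65504.0
    S = np.diag(1.0 / np.linalg.norm(A, axis=0))  # diagonal right scaling
    AS = A @ S                                    # unit 2-norm columns
    mu = theta_max * xmax / np.abs(AS).max()
    Ah = (mu * AS).astype(np.float16)             # A^(h) = fl_h(mu * A * S)
    return Ah, S, mu
```

The original $A$ is recovered, up to fp16 rounding, as $(1/\mu) A^{(h)} S^{-1}$.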

3.2. Practical considerations and other preconditioners.
There is a wealth of work on developing preconditioners for saddle point systems. Common preconditioners that can be constructed using the QR factors of $A$ include block diagonal preconditioners, block triangular preconditioners, and constraint-based preconditioners. For example, we can construct the block diagonal preconditioner
$$M_1 M_2 = \begin{bmatrix} \alpha I & 0 \\ 0 & \alpha^{-1}\widehat{R}^T\widehat{R} \end{bmatrix}, \qquad M_1 = \begin{bmatrix} \alpha^{1/2} I & 0 \\ 0 & \alpha^{-1/2}\widehat{R}^T \end{bmatrix}, \quad M_2 = \begin{bmatrix} \alpha^{1/2} I & 0 \\ 0 & \alpha^{-1/2}\widehat{R} \end{bmatrix}. \tag{3.9}$$
This block diagonal preconditioner may have an advantage over the preconditioner (3.2) in terms of the communication cost of applying it to a vector; this depends on the relative sizes of $m$ and $n$ and the particular parallel distribution of the QR factors. It follows from the results in [22], [24] that if the QR factorization is computed exactly then the left-preconditioned matrix is nonsymmetric and diagonalizable with three distinct nonzero eigenvalues, $\{1, \tfrac{1}{2}(1 \pm \sqrt{5})\}$. However, since the left-preconditioned matrix is nonsymmetric, this tells us nothing about its singular values. We have observed experimentally that the condition number of $M_2^{-1}M_1^{-1}\widetilde{A}$ can be very large when $A$ is ill conditioned (often larger than the condition number of $\widetilde{A}$ itself), making this preconditioner unsuitable for our purpose of proving that GMRES provides a backward stable solution to the left-preconditioned system.
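The eigenvalue claim can be verified numerically with exact-arithmetic factors. We take the block diagonal preconditioner to be $\operatorname{diag}(\alpha I,\ \alpha^{-1}R^T R)$, our reading of (3.9), and check that the left-preconditioned matrix has exactly the eigenvalues $\{1, \tfrac{1}{2}(1 \pm \sqrt{5})\}$: 1 with multiplicity $m - n$ and the other two with multiplicity $n$ each. This is the Murphy-Golub-Wathen result for block diagonal preconditioning with the exact Schur complement; it says nothing about the singular values of the nonsymmetric preconditioned matrix, which is the point made above.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 20, 4
A = rng.standard_normal((m, n))
alpha = np.linalg.svd(A, compute_uv=False)[-1] / np.sqrt(2)
R = np.linalg.qr(A)[1]              # exact-arithmetic stand-in for R hat

At = np.block([[alpha * np.eye(m), A],
               [A.T, np.zeros((n, n))]])
Mbd = np.block([[alpha * np.eye(m), np.zeros((m, n))],
                [np.zeros((n, m)), (R.T @ R) / alpha]])
eig = np.linalg.eigvals(np.linalg.solve(Mbd, At))   # left-preconditioned matrix
```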
The preconditioner in (3.9) can also be used as a two-sided preconditioner, with which we obtain the split-preconditioned matrix $M_1^{-1}\widetilde{A}M_2^{-1}$. In this case the preconditioned system is symmetric, so the absolute values of the eigenvalues (which, again, are $\{1, \tfrac{1}{2}(1 \pm \sqrt{5})\}$) coincide with the singular values, and in contrast to the left-preconditioned case we are guaranteed a well-conditioned system, at least in exact arithmetic. We confirm this behavior numerically in Figure 3.2: the preconditioned matrix has a much lower condition number for split preconditioning than for left preconditioning. In fact, the condition number of the left-preconditioned matrix quickly becomes much larger than $\kappa_\infty(\widetilde{A})$, so the left-preconditioned system is too ill conditioned to ensure that $u_s\|E_i\|_\infty < 1$ in (1.10). Despite the improved condition number of the split-preconditioned system, we cannot use it for the theoretical purpose of proving the backward stability of GMRES; in contrast to the left-preconditioned case (see [4]), right-preconditioned (and thus split-preconditioned) GMRES is not backward stable [1, p. 2035]. We note that our desire for a well-conditioned preconditioned system comes from meeting the constraint $u_s\|E_i\|_\infty < 1$ in (1.10); the condition number alone does not dictate the convergence behavior of Krylov subspace methods, even in the case of symmetric positive definite linear systems; see, e.g., [6].
We emphasize that the GMRES method and the preconditioner (3.2) were chosen so that we could prove that GMRES-based iterative refinement converges. In practical applications one might use any preconditioning technique or any other Krylov subspace method. As a demonstration, in section 4 we show convergence of GMRES-based iterative refinement with both the left preconditioner defined in (3.2) and the split block diagonal preconditioner (3.9). We note that since the matrix $\widetilde{A}$ is symmetric, it makes sense in practice to use a Krylov subspace method designed for symmetric linear systems (such as MINRES). Higham and Pranesh [17] show that such techniques can work for symmetric positive definite linear systems, although we cannot say that Krylov subspace methods for symmetric linear systems such as conjugate gradient and MINRES are backward stable in the traditional sense.
Based on these practical considerations, our GMRES-based least squares iterative refinement method is summarized in Algorithm 3.2.

Algorithm 3.2 (GMRES-LSIR). Let $A \in \mathbb{R}^{m\times n}$ ($m \ge n$) and $b \in \mathbb{R}^m$ be given in precision $u$. This algorithm solves the least squares problem $\min_x \|b - Ax\|_2$ using GMRES-based iterative refinement with precisions $u_f$, $u$, and $u_r$. The quantities $\widetilde{A}$, $\widetilde{b}$, $s_i$, and $y_i$ are defined in (1.9).
1: if $u_f$ is fp16 then
2: Compute the QR factors, diagonal matrix $S$, and scalar $\mu$ using Algorithm 3.1, such that $A \approx (1/\mu)\widehat{Q}_1\widehat{R}S^{-1}$.
3: else
4: Compute the QR factorization of $A$ at precision $u_f$.
5: end if
6: Compute the initial solution $y_0$ from the QR factors at precision $u_f$; store $y_0$ at precision $u$.
7: for $i = 0, 1, 2, \ldots$ do
8: Compute $s_i = \widetilde{b} - \widetilde{A}y_i$ at precision $u_r$ and round $s_i$ to precision $u$.
9: Solve $M^{-1}\widetilde{A}d_i = M^{-1}s_i$ by GMRES at precision $u$, with matrix-vector products with $M^{-1}\widetilde{A}$ computed at precision $u_r$, exploiting the block structure, and store $d_i$ at precision $u$; $M$ is given by either (3.2) or (3.9).
10: $y_{i+1} = y_i + d_i$ at precision $u$.
11: if converged then
12: return $y_{i+1}$, quit
13: end if
14: end for

4. Numerical experiments. We present numerical experiments to illustrate the results of our analysis for three-precision least squares iterative refinement using the combinations of precisions $(u_f, u, u_r)$ = (half, single, double), (half, double, quad), and (single, double, quad). All the experiments are performed in MATLAB R2019a on a Lenovo ThinkStation with an Intel Xeon processor.

Table 4.1. Two-norm condition numbers and number of iterative refinement steps for $(u_f, u, u_r)$ = (half, single, double), for LSIR, GMRES-LSIR, and GMRES-LSIR-BD. Numbers in parentheses denote the total number of GMRES iterations in Algorithm 3.2. Failure to converge is denoted by "-".
Our codes are available at https://github.com/SrikaraPranesh/LSIR3. We test both the standard least squares iterative refinement described in section 2 and our GMRES-based approach described in Algorithm 3.2. For $u_f$ = half, $\theta_{\max} = 0.1$ is used. In the GMRES-based approach we present results for the left and split preconditioners (3.2) and (3.9), respectively. All reported condition numbers are computed in high precision using the mp.Digits(64) option of the Advanpix toolbox. As before, in all tests we generate $A$ via gallery('randsvd',[m,n],kappa,3) with m = 100 and n = 10 and various specified 2-norm condition numbers kappa. The same right-hand side $b$ is used in all tests, generated as a MATLAB randn vector normalized to unit 2-norm. For reproducibility we seed the random number generator by calling the MATLAB function rng(1). For the GMRES-based approach, to minimize the condition number we scale the (1,1) block of the augmented matrix (1.9) by $\alpha$ in (2.6), which is estimated at the working precision by applying the svd command to the R factor of the QR factorization computed in precision $u_f$.
For the iterative refinement process, we use the convergence condition that the relative error in both the solution $x_i$ and the residual $r_i$ is less than $u$. We set the GMRES convergence tolerance to $10^{-6}$ for $u$ = single and $10^{-12}$ for $u$ = double.
First we consider $(u_f, u, u_r)$ = (half, single, double). In Table 4.1 we display the number of refinement steps for standard least squares iterative refinement by Algorithm 1.1 (LSIR), for GMRES-based iterative refinement with left preconditioner $M$ in Algorithm 3.2 (GMRES-LSIR), and for GMRES-based iterative refinement with split block diagonal preconditioners $M_1$ and $M_2$ (GMRES-LSIR-BD). For GMRES-LSIR and GMRES-LSIR-BD the numbers in parentheses denote the total number of GMRES iterations. Various infinity-norm condition numbers are displayed in Table 4.2. LSIR converges to single precision accuracy if $\kappa_\infty(\widetilde{A}) \lesssim u_f^{-1}$ and fails to converge for $\kappa_\infty(\widetilde{A}) \gtrsim u_f^{-1}$, as predicted by the analysis. From the analysis in section 3, we expect the GMRES-based approach to work for higher $\kappa_\infty(\widetilde{A})$ than the standard approach, namely for $\kappa_\infty(\widetilde{A}) \lesssim u^{-1/2}u_f^{-1}$. When $u$ is single precision and $u_f$ is half precision, this condition becomes roughly $\kappa_\infty(\widetilde{A}) \lesssim 10^7$. Indeed, GMRES-based iterative refinement (with both preconditioners) converges for condition numbers $\kappa_\infty(\widetilde{A})$ up to the order of $10^7$. Note that since $\alpha$ in (2.6) is estimated using the precision-$u_f$ R factor, $\kappa_\infty(\widetilde{A})$ is somewhat larger than might be expected in view of (2.5). The GMRES-based algorithm converges even for $\kappa_\infty(\widetilde{A}) > 10^9$, but the left preconditioner requires more GMRES iterations than the split preconditioner. Also, from columns 3 and 4 of Table 4.2 we observe that for ill-conditioned matrices split preconditioning gives a lower condition number. Comparing GMRES-LSIR and GMRES-LSIR-BD in Table 4.1, we see that GMRES-LSIR-BD takes slightly fewer GMRES iterations to converge than GMRES-LSIR.
In Tables 4.3 and 4.4 we display the results for LSIR, GMRES-LSIR, and GMRES-LSIR-BD for $(u_f, u, u_r)$ = (half, double, quad) and (single, double, quad), respectively. The relevant infinity-norm condition numbers are presented in Tables 4.5 and 4.6. The results are as expected from the theory: LSIR converges for $\kappa_\infty(\widetilde{A})$ up to around $u_f^{-1}$, although as $\kappa_\infty(\widetilde{A})$ approaches $u_f^{-1}$ a larger number of refinement steps is needed. The GMRES-LSIR approach is able to converge for $\kappa_\infty(\widetilde{A})$ up to $u^{-1/2}u_f^{-1}$ as predicted by the theory, where $u^{-1/2}u_f^{-1} \approx 10^{16}$ for (single, double, quad) and $u^{-1/2}u_f^{-1} \approx 10^{12}$ for (half, double, quad). However, the number of GMRES iterations required does grow large as $\kappa_\infty(\widetilde{A})$ approaches $u^{-1/2}u_f^{-1}$. The split preconditioner tends to perform better than the left preconditioner on the more ill-conditioned problems. From these experiments we conclude that even though the convergence of iterative refinement with two-sided preconditioning cannot be theoretically guaranteed, this approach works well in practice.
For the final experiment we consider matrices from the SuiteSparse Matrix Collection [8]. We consider all rectangular full rank matrices with $20 \le m \le 2000$, $n \le 400$, and $n < m$. Properties of the matrices are displayed in Table 4.7; we note that entries of matrices 5 and 6 overflow upon conversion to fp16. A random right-hand side vector is generated using randn, seeded for reproducibility, and both the (half, single, double) and (half, double, quad) precision combinations are used. Again both the left (GMRES-LSIR) and two-sided (GMRES-LSIR-BD) preconditioners are considered. In Table 4.8 we display the results for (half, single, double): as predicted by the analysis, LSIR fails to converge for $\kappa_\infty(\widetilde{A}) \gtrsim 10^4$. GMRES-LSIR and GMRES-LSIR-BD converge for all the matrices except 6 and 11, for which $\widetilde{A}$ and the preconditioned matrices are extremely ill conditioned, as can be seen from Table 4.9. Results for (half, double, quad) are displayed in Tables 4.10 and 4.11. For both preconditioners Algorithm 3.2 converges for all matrices, but GMRES-LSIR performs slightly better than GMRES-LSIR-BD. However, for the ill-conditioned matrices 6 and 11, many GMRES iterations are required.

5. Conclusion.
We have extended iterative refinement in three precisions for square linear systems, as studied in [5], to least squares problems. Our approach works with the augmented system and a QR factorization of $A$ computed at the lowest of the three precisions. Our GMRES-LSIR algorithm uses GMRES to solve the correction equation with a choice of two different preconditioners, (3.2) and (3.9).
Our rounding error analysis, combined with the results in [4], [5], shows that GMRES-LSIR with (3.2) yields a forward error, and a backward error for the augmented system, of order the working precision under reasonable assumptions. Our error analysis is not applicable to the preconditioner (3.9). Numerical experiments show that GMRES-LSIR behaves as predicted by the theory and that it can solve a much wider range of problems than three-precision iterative refinement without the use of GMRES; it also works about as well with the preconditioner (3.9). Future aspects to explore include the extension of the results to the minimum 2-norm solution of underdetermined systems, refining implementation details such as the choice of the GMRES convergence tolerance, and implementation and performance studies on available low-precision hardware.

Table 4.7. Rectangular test matrices chosen from the SuiteSparse Matrix Collection. Columns: index, matrix name, $(m, n)$ (the numbers of rows and columns), $\kappa_2(A)$, $\max_{i,j} |a_{ij}|$, and $\min_{i,j}\{|a_{ij}| : a_{ij} \ne 0\}$, the last two being the largest entry and the smallest nonzero entry in absolute value.

Table 4.8. For the matrices in Table 4.7, number of iterative refinement steps in standard least squares iterative refinement and Algorithm 3.2, for $(u_f, u, u_r)$ = (half, single, double). Numbers in parentheses denote the total number of GMRES iterations in Algorithm 3.2, for left (GMRES-LSIR) and two-sided (GMRES-LSIR-BD) preconditioning. Failure to converge is denoted by "-".

Table 4.10. For the matrices in Table 4.7, number of iterative refinement steps in standard least squares iterative refinement and Algorithm 3.2, for $(u_f, u, u_r)$ = (half, double, quad). Numbers in parentheses denote the total number of GMRES iterations in Algorithm 3.2, for left (GMRES-LSIR) and two-sided (GMRES-LSIR-BD) preconditioning. Failure to converge is denoted by "-".