Exploiting Lower Precision Arithmetic in Solving Symmetric Positive Deﬁnite Linear Systems and Least Squares Problems

Abstract. What is the fastest way to solve a linear system Ax = b in arithmetic of a given precision when A is symmetric positive definite and otherwise unstructured? The usual answer is by Cholesky factorization, assuming that A can be factorized. We develop an algorithm that can be faster, given an arithmetic of precision lower than the working precision as well as (optionally) one of higher precision. The arithmetics might, for example, be of precisions half, single, and double; half and double, possibly with quadruple; or single and double, possibly with quadruple. We compute a Cholesky factorization at the lower precision and use the factors as preconditioners in GMRES-based iterative refinement. To avoid breakdown of the factorization we shift the matrix by a small multiple of its diagonal. We explain why this is preferable to the common approach of shifting by a multiple of the identity matrix. We also incorporate scaling in order to avoid overflow and reduce the chance of underflow when working in IEEE half precision arithmetic. We extend the algorithm to solve a linear least squares problem with a well conditioned coefficient matrix by forming and solving the normal equations. In both algorithms most of the work is done at low precision provided that iterative refinement and the inner iterative solver converge quickly. We explain why replacing GMRES by the conjugate gradient method causes convergence guarantees to be lost, but show that this change has little effect on convergence in practice. Our numerical experiments confirm the potential of the new algorithms to provide faster solutions in environments that support multiple precisions of arithmetic.

1. Introduction. The standard way to solve a linear system Ax = b whose coefficient matrix is symmetric positive definite but otherwise not specially structured is via a Cholesky factorization, assuming that one can be computed and stored. Normally, all computations are carried out at the working precision, which is typically double precision (64 bits). If multiple precisions of floating-point arithmetic are available, can we solve the problem more quickly? This question is timely because of the increasing availability of half precision (16-bit) arithmetic, to add to the single precision (32-bit) arithmetic that has been available for many years.
The 2008 revision of the IEEE 754 standard for floating-point arithmetic [23] proposed a half precision format, which we denote by fp16. As shown in Table 1.1, it has an 11-bit significand and a 5-bit exponent. Fp16 arithmetic is increasingly supported by accelerators, including the NVIDIA P100 (2016) and V100 (2017) GPUs and the AMD Radeon Instinct family of accelerators. On these devices half precision is up to twice as fast as single precision, or up to 8 times faster than single precision on the NVIDIA V100 when its tensor cores are exploited. Of the 500 machines on the June 2019 TOP500 list, 133 systems employ accelerators, among which more than 73 percent support fp16. The A64FX Arm processor that will be used in Japan's Fugaku exascale machine will support fp16 [11], as will the Frontier exascale machine at Oak Ridge National Laboratory, which will use custom-built AMD Radeon Instinct GPUs with mixed precision capabilities [30].

Table 1.1: Parameters for four floating-point arithmetics, given to three significant figures: number of bits in significand (including implicit most significant bit) and exponent (signif., exp.), unit roundoff $u$, smallest positive (subnormal) number $x_{s\min}$, smallest normalized positive number $x_{\min}$, and largest finite number $x_{\max}$. In Intel's bfloat16 specification subnormal numbers are not supported [24].
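The fp16 parameters summarized in Table 1.1 are easy to verify programmatically; a minimal sketch using NumPy's float16 type as a stand-in for IEEE fp16 (NumPy has no native bfloat16):

```python
import numpy as np

fi = np.finfo(np.float16)                 # IEEE binary16 parameters
u = 2.0 ** -11                            # unit roundoff: 11-bit significand
x_smin = float(np.nextafter(np.float16(0), np.float16(1)))  # 2^-24

print("significand bits:", fi.nmant + 1)  # 11, including the implicit bit
print("unit roundoff u :", u)             # ~4.88e-04
print("x_min           :", float(fi.tiny))  # ~6.10e-05
print("x_max           :", float(fi.max))   # 65504.0
print("x_smin          :", x_smin)          # ~5.96e-08, smallest subnormal
```

The narrow exponent range visible in `x_max = 65504` is what motivates the scaling introduced later in the paper.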
Another half precision format is bfloat16, originally proposed by Google and formalized by Intel [24]. Bfloat16 has an 8-bit significand and an 8-bit exponent. The exponent width is the same as that of single precision and hence bfloat16 has a range similar to single precision. Currently Google's Tensor Processing Unit (TPU) is the only hardware that supports bfloat16, but Intel's Nervana Neural Network processor [26] and Cooper Lake processor [12] and the Armv8-A architecture [28] will also support it.
It has already been demonstrated that the fp16 arithmetic and the tensor cores of an NVIDIA V100 can be exploited to solve a general dense linear system Ax = b up to four times faster than by an optimized double precision solver, with a reduction in energy consumption of up to 80 percent [15], [16]. The key idea, proposed in [4], [5], is to factorize A in half precision arithmetic and use GMRES-based iterative refinement (GMRES-IR), with the computed LU factors as preconditioners, to compute a numerically stable solution at the working precision of single or double. Furthermore, the same GMRES-IR algorithm has demonstrated a performance up to three times higher than that of an optimized double precision solver at scale on the Summit machine that heads the June 2019 TOP500 list [3].
In principle, adapting GMRES-IR to exploit positive definite matrices might seem straightforward: replace LU factorization by Cholesky factorization and GMRES by the conjugate gradient (CG) method. However, Cholesky factorization can fail because the matrix can lose definiteness when it is rounded to lower precision. Furthermore, when CG is used, existing error analysis can only guarantee iterative refinement to converge for very well conditioned matrices.
In this work we develop a scaling and shifting algorithm to perform lower precision Cholesky factorization of a matrix given in a higher precision. We use the computed Cholesky factorization to develop a GMRES-IR algorithm for symmetric positive definite systems. We then extend the algorithm to solve a linear least squares problem $\min_x \|b - Ax\|_2$ via the normal equations, where A is assumed to be well conditioned.
To ensure that Cholesky factorization succeeds in lower precision we need to increase the diagonal, either before or after rounding to the lower precision. A natural way to do so is with the shift $A \leftarrow A + \mu I$ for some $\mu > 0$. We will show that it is better to transform $A \leftarrow A + \mu D^2$, where $D = \mathrm{diag}(a_{ii}^{1/2})$. Perturbing A by a multiple of $D^2$ produces a smaller perturbation than perturbing by a multiple of I, and it results in better performance of GMRES-IR.
In fact, we will work with the unit diagonal matrix $H = D^{-1}AD^{-1}$ and will round a shifted and scaled version of this matrix. The scalings are necessary for fp16, to allow a wider dynamic range of data to be supported, but they will usually not be required for bfloat16 or single precision.
Our algorithms involve three precisions: the working precision, with unit roundoff $u$; a lower precision, with unit roundoff $u_\ell > u$, at which the matrix factorization is done; and a potentially higher precision, with unit roundoff $u_r \le u$, which we will use in iterative refinement. While we are particularly interested in the case where $u_\ell$ represents half precision, our algorithms and analysis are quite general.
In the next section we explore how to perturb a positive definite matrix to ensure that Cholesky factorization succeeds in floating-point arithmetic. In section 3 we develop a scaling and shifting algorithm to perform low precision Cholesky factorization. In section 4 we use the low precision Cholesky factors as preconditioners in GMRES-IR and explain why the underlying convergence analysis does not apply when CG is used in place of GMRES. In section 5 we develop a normal equations-based solver built on GMRES-IR for well conditioned least squares problems. In section 6 we perform numerical experiments with real-life matrices to investigate the numerical behavior of GMRES-IR and the GMRES-IR-based least squares solver. Conclusions are given in section 7.

2. Shifting for Cholesky factorization.
A key task in this work is to round a positive definite matrix stored in single precision or double precision to a lower precision and then compute a Cholesky factorization. The difficulty is that Cholesky factorization may fail because the rounded matrix is indefinite or not sufficiently positive definite. Our solution is to increase the diagonal before factorizing, but what type of perturbation should we apply and how large should it be? In this section we address this question in a general form, working with just one precision $u$; we consider the effect of rounding to a lower precision before factorizing in the next section.
We take A to be a symmetric positive definite matrix whose smallest eigenvalue $\lambda_{\min}(A)$ is possibly as small as $u\lambda_{\max}(A)$, where $\lambda_{\max}(A)$ is the largest eigenvalue of A. Thus for floating-point computation, A is on the border between being positive definite and indefinite. In fact, our algorithms are equally valid when $\lambda_{\min}(A) < 0$ with $\lambda_{\min}(A) = O(u\lambda_{\max}(A))$, since backward error considerations show that the algorithms cannot distinguish this case from the positive definite case. In either case, we wish to increase the diagonal so that Cholesky factorization is guaranteed to succeed.
One approach is based on Wilkinson's result that Cholesky factorization of a positive definite matrix $A \in \mathbb{R}^{n\times n}$ succeeds if $d_n\kappa_2(A)u \le 1$, where $d_n = 20n^{3/2}$ [33] and $\kappa_2(A) = \lambda_{\max}(A)/\lambda_{\min}(A)$. Let $A(\epsilon) = A + \epsilon I$, where $\epsilon > 0$. Applying Wilkinson's result to $A(\epsilon)$, we have that Cholesky factorization of $A(\epsilon)$ will succeed if
$$\epsilon \ge \frac{d_n u\,\lambda_{\max}(A) - \lambda_{\min}(A)}{1 - d_n u}. \qquad (2.1)$$
We note that this is consistent with the analysis in [13], [37, Thm. 1].
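Rearranging Wilkinson's condition $d_n\kappa_2(A+\epsilon I)u \le 1$ for $\epsilon$ gives the smallest sufficient identity shift; a sketch (using exact eigenvalues, which in practice would have to be estimated):

```python
import numpy as np

def wilkinson_shift(A, u):
    """Smallest eps >= 0 with d_n * kappa_2(A + eps*I) * u <= 1.

    From d_n*u*(lmax + eps) <= lmin + eps we get
    eps >= (d_n*u*lmax - lmin) / (1 - d_n*u).
    """
    n = A.shape[0]
    d_n = 20.0 * n ** 1.5                  # Wilkinson's constant
    lam = np.linalg.eigvalsh(A)            # eigenvalues in ascending order
    lmin, lmax = lam[0], lam[-1]
    return max(0.0, (d_n * u * lmax - lmin) / (1.0 - d_n * u))

u = 2.0 ** -53                             # double precision unit roundoff
print(wilkinson_shift(np.eye(3), u))       # 0.0: no shift needed
print(wilkinson_shift(np.diag([1e-18, 1.0, 1.0]), u) > 0)  # nearly singular
```

The function name is ours; the formula is the rearrangement of the sufficient condition above with all lower-order effects ignored.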
Theorem 2.1. Let $A = DHD \in \mathbb{R}^{n\times n}$ be symmetric positive definite, where
$$D = \mathrm{diag}(a_{ii}^{1/2}), \qquad (2.2)$$
so that H has unit diagonal. If
$$\lambda_{\min}(H) > n(n+1)u, \qquad (2.3)$$
then Cholesky factorization applied to A succeeds (barring underflow and overflow).
If Cholesky factorization succeeds it produces a nonsingular $\widetilde R$ satisfying
$$\widetilde R^T\widetilde R = A + \Delta A, \qquad |\Delta A| \le \gamma_{n+1}\, dd^T, \qquad (2.4)$$
where $d_i = a_{ii}^{1/2}$. This result shows that to guarantee the success of the factorization we can shift H to satisfy (2.3) instead of shifting A subject to (2.1). The matrix $H(\epsilon) = H + \epsilon I$ satisfies $\lambda_{\min}(H(\epsilon)) = \lambda_{\min}(H) + \epsilon$. Hence by Theorem 2.1, Cholesky factorization on $H(\epsilon)$ succeeds if $\lambda_{\min}(H) + \epsilon > n(n+1)u$. Solving for $\epsilon$ gives
$$\epsilon \ge n(n+1)u - \lambda_{\min}(H), \qquad (2.5)$$
and the corresponding perturbation to A is $\epsilon D^2$, where $D^2 = \mathrm{diag}(a_{ii})$. Dropping the $\lambda_{\min}$ terms in (2.1) and (2.5), we have two different perturbations to A that allow Cholesky factorization to succeed:
$$\Delta A_1 = d_n u\,\lambda_{\max}(A)\, I, \qquad \Delta A_2 = n(n+1)u\,\mathrm{diag}(a_{ii}).$$
How do these perturbations compare? Ignoring the constants $d_n$ and $n(n+1)$, we have
$$1 \le \frac{\|\Delta A_1\|_2}{\|\Delta A_2\|_2} = \frac{\lambda_{\max}(A)}{\max_i a_{ii}} \le n,$$
where the lower bound is attained for A = I and the upper bound can be approached arbitrarily closely, since it is an equality for the matrix of ones (which is positive semidefinite). Hence, modulo the constants, $\Delta A_2$ generally has the smaller norm. However, we will not know $\lambda_{\max}(A)$ in practice so will have to estimate it; we will replace it by the lower bound $\max_i a_{ii}$. We now have two perturbations, $\Delta A_1 = (c_n \max_i a_{ii})I$ and $\Delta A_2 = \widetilde c_n\,\mathrm{diag}(a_{ii})$, where $c_n$ and $\widetilde c_n$ are suitable constants. We note that while $\Delta A_2$ makes the same relative perturbation to each diagonal element $a_{ii}$, $\Delta A_1$ makes a large relative perturbation to any $a_{ii}$ with $a_{ii} \ll \max_i a_{ii}$. This is important because Theorem 2.1 shows that the backward error matrix for Cholesky factorization is bounded relative to the diagonal. Consider the example [9]
$$A = \begin{bmatrix} 10^{40} & \times & \times \\ \times & 10^{40} & \times \\ \times & \times & 1 \end{bmatrix},$$
where the elements marked $\times$ can have any values that yield a positive definite matrix.
For double precision (dropping the constant $c_n$), $\Delta A_1 \approx 10^{40}u\,I \approx 10^{24}I$ and the (3, 3) element of $A + \Delta A_1$ is $\mathrm{fl}(1 + 10^{24}) = 10^{24}$, meaning that $a_{33}$ has been lost. This is a serious loss of information, because $\Delta A_1$ subjects $a_{33}$ to a relative perturbation of order 1, yet by Theorem 2.1 the backward error matrix for Cholesky factorization causes only a relative perturbation of order $nu$ in $a_{33}$.
Our conclusion is that in a situation where one wishes to increase the diagonal of a positive definite matrix A in order that Cholesky factorization succeeds, adding a multiple of $\mathrm{diag}(a_{ii})$ is better than adding a multiple of I if one wishes to make the smallest relative change to the elements and to minimize the loss of information. In all previous work that we are aware of where a diagonal perturbation is made before beginning Cholesky factorization, a multiple of I has been added; see, for example, [13], [37, Thm. 1].
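The loss of information can be reproduced in a few lines of double precision arithmetic; a sketch using a matrix in the spirit of the badly scaled example above, with off-diagonal entries chosen arbitrarily (our values, not the paper's) and constant c = 2:

```python
import numpy as np

u = 2.0 ** -53                              # double precision unit roundoff
A = np.array([[1e40, 1e19, 1e19],
              [1e19, 1e40, 1e19],
              [1e19, 1e19, 1.0]])           # badly scaled, positive definite

shift_I    = 2 * u * np.max(np.diag(A)) * np.eye(3)   # c*u*max(a_ii) * I
shift_diag = 2 * u * np.diag(np.diag(A))              # c*u*diag(a_ii)

a33_I    = (A + shift_I)[2, 2]     # fl(1 + ~2e24): a_33 is wiped out
a33_diag = (A + shift_diag)[2, 2]  # 1 + 2u: a tiny relative change
print(a33_I, a33_diag)
```

The identity shift replaces the (3, 3) entry entirely, while the diagonal shift perturbs every diagonal entry by the same small relative amount.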

3. Low precision Cholesky factorization.
We now assume that we are given a symmetric positive definite matrix $A \in \mathbb{R}^{n\times n}$ in floating-point arithmetic of precision $u$ and wish to compute a Cholesky factorization in floating-point arithmetic with precision $u_\ell > u$. The most practically important cases are where $(u_\ell, u)$ = (half, single), (half, double), or (single, double). Naively, we might first form $A^{(\ell)} = \mathrm{fl}_\ell(A)$, where $\mathrm{fl}_\ell$ denotes the operation of rounding to precision $u_\ell$, and then compute the Cholesky factorization of $A^{(\ell)}$ in precision $u_\ell$. For half precision this procedure can fail for two reasons. First, if fp16 is used then the limited range might cause overflow during the rounding. Second, for both bfloat16 and fp16, $A^{(\ell)}$ might not be (sufficiently) positive definite, because a matrix whose smallest eigenvalue is safely bounded away from zero with respect to single precision or double precision may become numerically indefinite under the perturbation induced by rounding to half precision. The second issue can also be encountered when a double precision matrix is rounded to single precision. To overcome these problems we will use scaling and shifting.

Scaling. The first step is to scale the matrix: $H = D^{-1}AD^{-1}$, where $D = \mathrm{diag}(a_{ii}^{1/2})$, as in (2.2). The matrix D will be kept at precision $u$. Because Cholesky factorization is essentially numerically invariant under two-sided diagonal scaling (as can be seen from the recurrence relations for the Cholesky factors), the sole reason for scaling is to reduce the dynamic range in order to avoid overflow and reduce the chance of underflow, for fp16. For bfloat16 or single precision it will not usually be necessary to scale, and we can work with A throughout. For the rest of the presentation we will always scale, for simplicity of notation.
Two-sided diagonal scaling before rounding to low precision was used in [21] in the context of LU factorization. The scaling used there equilibrates rows and columns; our scaling with D is the natural analogue of that for symmetric positive definite matrices. For Cholesky factorization we need an extra ingredient to ensure a successful factorization, which we consider next.
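The two-sided scaling is a cheap elementwise operation; a sketch (the function name is ours):

```python
import numpy as np

def scale_to_unit_diagonal(A):
    """Return H = D^{-1} A D^{-1} and d = diag(D) for SPD A.

    h_ij = a_ij / sqrt(a_ii * a_jj), so H has unit diagonal and,
    by the Cauchy-Schwarz inequality, |h_ij| <= 1: the dynamic range
    is compressed before rounding to fp16.
    """
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d), d

rng = np.random.default_rng(0)
B = rng.standard_normal((10, 10))
A = B @ B.T + 10 * np.eye(10)            # SPD test matrix
H, d = scale_to_unit_diagonal(A)
print(np.abs(H).max())                    # ~1.0, attained on the diagonal
```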

Shifting.
We now convert H to the lower precision $u_\ell$, incorporating a shift to ensure that the lower precision matrix is sufficiently positive definite for Cholesky factorization to succeed, as discussed in section 2. We will shift H by $c_n u_\ell I$, where $c_n$ is a positive integer constant, to obtain $G = H + c_n u_\ell I$. Since the diagonal of H is I, this shift incurs no rounding error and it produces the same result whether we shift in precision $u$ then round or round then shift in precision $u_\ell$.
For fp16, in view of the narrow range we will also multiply the shifted matrix by a scalar to bring it close to the overflow level $x_{\max}$, in order to minimize the chance of underflow and of subnormal numbers being produced. So our final precision-$u_\ell$ matrix is constructed as
$$A^{(\ell)} = \mathrm{fl}_\ell(\mu G), \qquad \mu = \frac{\theta x_{\max}}{\beta}, \qquad \beta = \max_{i,j}|g_{ij}|,$$
where $\theta \in (0, 1)$ is a parameter. Since $\beta = \max_{ij}|g_{ij}|$, the largest absolute value of any element of $A^{(\ell)}$ is $\theta x_{\max}$. Note also that since the growth factor for Cholesky factorization is 1 (see, e.g., [17, Prob. 10.4]), there is no danger of overflow during Cholesky factorization of $A^{(\ell)}$.
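The shift-and-scale construction can be sketched with NumPy's float16 standing in for fp16 (the helper name is our own):

```python
import numpy as np

def shift_and_round_fp16(H, c=2, theta=0.1):
    """Form fl_16(mu * (H + c*u16*I)) with mu = theta*x_max/beta.

    H is unit-diagonal in working precision; the returned mu is needed
    later when the Cholesky factors are used as a preconditioner.
    """
    u16 = 2.0 ** -11                          # fp16 unit roundoff
    xmax = float(np.finfo(np.float16).max)    # 65504
    G = H + c * u16 * np.eye(H.shape[0])
    beta = np.abs(G).max()                    # = 1 + c*u16 when |h_ij| <= 1
    mu = theta * xmax / beta
    return (mu * G).astype(np.float16), mu

A16, mu = shift_and_round_fp16(np.eye(4))
print(float(np.abs(A16).max()))   # ~theta*x_max = 6550.4, up to fp16 rounding
```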

Analysis.
We now give some analysis to guide the choice of $c_n$. For simplicity we will ignore the rounding errors in forming H at precision $u$.
We consider Cholesky factorization of the matrix $A^{(\ell)} = \widetilde H + \epsilon I$, where $\widetilde H = \mathrm{fl}_\ell(H)$ and $\epsilon = c_n u_\ell$ (since, as noted above, it does not matter whether we shift or round first, as the shift causes no rounding errors).
We now apply the analysis of section 2 to $A^{(\ell)}$. From (2.5) with $u$ replaced by $u_\ell$, we know that Cholesky factorization of $A^{(\ell)}$ in precision $u_\ell$ will succeed if
$$\epsilon \ge n(n+1)u_\ell - \lambda_{\min}(\widetilde H). \qquad (3.2)$$
How should we choose the constant $c_n$? Since $\lambda_{\min}(\widetilde H) \gtrsim -u_\ell$ can be expected, we will take $\epsilon = c_n u_\ell$ with a suitable integer constant $c_n$. The inequality (3.2) suggests the choice $c_n = n(n+1)$, but this choice is not practical, as then $c_n u_\ell > 1$ for fp16 when $n \ge 45$, so we would be making a perturbation of order 1 to a matrix with elements bounded by 1, and we could not expect to obtain a useful factorization. Clearly, the rounding error analysis that leads to the sufficient condition (2.3) for the success of Cholesky factorization is pessimistic, and based on the arguments in [19, Thm. 3.8] we can reasonably expect $c_n = n$ to be a more realistic constant. However, even this value is uncomfortably large for half precision. We therefore take the pragmatic approach of taking $c_n$ to be a small constant c. If Cholesky factorization fails we will increase c and try again. We will determine experimentally how large c should be for a range of problems of interest. We note that since $A^{(\ell)}$ has elements bounded by $1 + O(u_\ell)$ and its smallest eigenvalue will be of order $\max(\lambda_{\min}(H), u_\ell)$, we expect $\kappa_2(A^{(\ell)}) \lesssim n/\max(\lambda_{\min}(H), u_\ell)$. In practice, we have found this upper bound to be often within two orders of magnitude of $\kappa_2(A^{(\ell)})$.
Based on these arguments we propose Algorithm 3.1. For bfloat16 and single precision we do not need to scale by the matrix D and so we can simplify Algorithm 3.1: lines 1-4 can be deleted and line 5 can be replaced by $A^{(\ell)} = \mathrm{fl}_\ell(A + cu_\ell\,\mathrm{diag}(a_{ii}))$.
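Putting the pieces together, the structure of Algorithm 3.1 can be sketched as follows; NumPy's float16 simulates only the storage format (not per-operation fp16 arithmetic), and the doubling retry policy is our own assumption about how to "increase c and try again":

```python
import numpy as np

def low_precision_cholesky(A, theta=0.1, c=2, max_tries=5):
    """Shifted, scaled Cholesky factorization in simulated fp16.

    Returns (R, d, mu, c) with R'R ~ fl_16(mu * (H + c*u16*I)),
    where H = D^{-1} A D^{-1} and d = diag(D) = diag(a_ii^{1/2}).
    """
    u16 = 2.0 ** -11
    xmax = float(np.finfo(np.float16).max)
    n = A.shape[0]
    d = np.sqrt(np.diag(A))
    H = A / np.outer(d, d)                       # unit diagonal
    while max_tries > 0:
        G = H + c * u16 * np.eye(n)
        mu = theta * xmax / np.abs(G).max()
        G16 = (mu * G).astype(np.float16)        # round the data to fp16
        try:
            # NumPy has no fp16 Cholesky: factorize the fp16 data in double.
            R = np.linalg.cholesky(G16.astype(np.float64)).T
            return R, d, mu, c
        except np.linalg.LinAlgError:
            c, max_tries = 2 * c, max_tries - 1  # increase the shift, retry
    raise RuntimeError("shifted Cholesky factorization failed")

rng = np.random.default_rng(1)
B = rng.standard_normal((20, 20))
A = B @ B.T + 20 * np.eye(20)                    # well conditioned SPD matrix
R, d, mu, c = low_precision_cholesky(A)
```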
In Algorithm 3.1 the scalar $\theta \in (0, 1]$ and the positive integer c are parameters. An alternative to Algorithm 3.1 is (ignoring scaling) to compute a modified Cholesky factorization of $A^{(\ell)}$, or to do so if the initial Cholesky factorization fails. Given a symmetric and possibly indefinite A, modified Cholesky factorizations compute a Cholesky or block $LDL^T$ factorization of A + E for a perturbation E whose size is related to how close A is to being positive definite [10]. We will not pursue modified Cholesky factorization, for three reasons. First, our specific situation allows a detailed analysis of how to shift and does not require the general machinery of modified Cholesky factorization. Second, Cholesky factorization is faster, because modified Cholesky factorization algorithms either require pivoting (for block $LDL^T$ factorization) or are inherently level-1 BLAS-based and so will suffer a performance penalty. Third, modified Cholesky factorization is much less widely available in program libraries than Cholesky factorization, and is not, to our knowledge, available in libraries for accelerators, such as the MAGMA library [8], [29].
4. GMRES-based iterative refinement for Ax = b. We now use the lower precision Cholesky factorization computed by Algorithm 3.1 to solve a symmetric positive definite linear system Ax = b by iterative refinement. Algorithm 4.1 is an adaptation of GMRES-IR [4], [5] from general linear systems to symmetric positive definite systems, using the approximate Cholesky factors from Algorithm 3.1 as a preconditioner. This algorithm makes use of a third precision $u_r \le u$, which in practice will be either $u_r = u$ or $u_r = u^2$. For bfloat16 or single precision, with the simplifications to Algorithm 3.1 mentioned above, we can remove the $D^{-1}$ factors in Algorithm 4.1 and set $\mu = 1$.
The convergence properties of Algorithm 4.1 are essentially the same as those of standard GMRES-IR, as analyzed in [4], [5]. The perturbation on line 2 of Algorithm 3.1 is no larger than the backward error for Cholesky factorization at precision $u_\ell$, so it does not affect the analysis. Table 4.1 gives the largest values of $\kappa_\infty(A)$ for which the limiting backward errors and forward errors stated in the table are guaranteed to be achieved, for several different combinations of precisions; the lines with $u_\ell > u$ follow from [4], [5], while the lines with $u_\ell = u$ are from [18] and, for the backward error column, are pessimistic.
We note that we are using GMRES-IR instead of standard iterative refinement, in which the update equation is solved by substitution with the Cholesky factors, because it can solve a much wider range of problems; the corresponding bounds on $\kappa_\infty(A)$ for standard iterative refinement (see [5, Table 7]) are more restrictive.

Table 4.1: Bounds on $\kappa_\infty(A)$ such that Algorithm 4.1 using IEEE arithmetic precisions given in the first three columns is guaranteed to converge with the indicated limiting backward or forward errors, where "half", "single", and "double" denote quantities of the order of the unit roundoffs for half precision, single precision, and double precision, respectively. Here, $\mathrm{cond}(A, x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty$. In the $u_\ell$ column, half can be replaced by bfloat16, in which case the bound on $\kappa_\infty(A)$ must be reduced by a factor 8.

If the refinement process converges quickly and GMRES converges quickly within each refinement step then for large n the dominant cost in Algorithm 4.1 is the cost of computing the Cholesky factors at precision $u_\ell$: this is the only $O(n^3)$ flops part of the algorithm. Hence we can expect significant speedups over a solution with Cholesky factorization computed at precision $u$. The refinement loop of Algorithm 4.1 comprises the steps

5: Compute $r_i = b - Ax_i$ at precision $u_r$ and round $r_i$ to precision $u$.
6: Solve $MAd_i = Mr_i$ by GMRES at precision $u$, where $M = \mu D^{-1}\widetilde R^{-1}\widetilde R^{-T}D^{-1}$ and matrix-vector products with MA are computed at precision $u_r$, and store $d_i$ at precision $u$.
7: $x_{i+1} = x_i + d_i$ at precision $u$.

We now analyze the spectrum of the preconditioned matrix in Algorithm 4.1, under the simplifying assumption that there are no rounding errors, so that the only source of error is the perturbation $cu_\ell I$ in forming $G = H + cu_\ell I$ in Algorithm 3.1. Then we have
$$MA = D^{-1}(H + cu_\ell I)^{-1}HD = D^{-1}FD, \qquad F = (H + cu_\ell I)^{-1}H,$$
where F has eigenvalues $\lambda_i/(\lambda_i + cu_\ell)$ and the $\lambda_i$ are the eigenvalues of H. Hence the eigenvalues of MA satisfy
$$\lambda_i(MA) = \frac{\lambda_i}{\lambda_i + cu_\ell}.$$
From our assumption that $|\lambda_{\min}(A)| \gtrsim u\lambda_{\max}(A)$, it can be shown that $|\lambda_{\min}(H)| \gtrsim u$ and so for $u_\ell \ge u$ we are assured that $|\lambda_i(MA) - 1| \le 1$ and moreover that $|\lambda_i(MA) - 1| \approx cu_\ell$ for $\lambda_i$ of order 1. However, since MA is nonnormal no conclusions can be drawn about the speed of convergence of GMRES.
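Under the same no-rounding-error assumption, the eigenvalue clustering of the preconditioned matrix is easy to verify numerically; a sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, u16 = 50, 2, 2.0 ** -11
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)              # SPD, well away from singularity
d = np.sqrt(np.diag(A))
H = A / np.outer(d, d)                    # unit diagonal

lam = np.linalg.eigvalsh(H)               # eigenvalues lambda_i of H
mapped = lam / (lam + c * u16)            # eigenvalues of (H + c*u16*I)^{-1} H
print(np.abs(mapped - 1).max())           # small: the spectrum clusters at 1
```

For this well conditioned matrix every eigenvalue of the preconditioned operator lies in (0, 1) and close to 1, consistent with the analysis above.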
GMRES is a method for general linear systems, so the refinement phase of Algorithm 4.1 does not exploit the symmetry or definiteness of A. The CG method is a natural alternative to GMRES. However, the convergence theory in [4] exploits the backward stability of GMRES. Preconditioned CG is not guaranteed to be backward stable unless the matrix or the preconditioner is well conditioned [14, Eq. (34)], so the analysis no longer applies if we use CG in place of GMRES. The same is true for preconditioned MINRES, since the analysis of [14] is applicable to any iterative method that is based on three-term recurrence relations for the solution and residual. The essential problem is that preconditioning must be two-sided if it is to preserve symmetry, and this does not allow a favorable error analysis, whereas in GMRES-IR a left preconditioner is used. Nevertheless, in section 6 we will investigate empirically the behavior of Algorithm 4.1 with GMRES replaced by the CG method on line 6.

5. Cholesky-based GMRES-IR for the least squares problem. The ideas of the previous sections can be adapted to the linear least squares problem $\min_x \|Ax - b\|_2$, where $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ has full rank. The normal equations method solves the normal equations $A^TAx = A^Tb$ using Cholesky factorization of $A^TA$. In general, this method is deprecated by numerical analysts because it has a backward error bound of order $\kappa_2(A)u$ [17, sect. 20.4] and the Cholesky factorization can break down for $\kappa_2(A) > u^{-1/2}$, but it is used by statisticians with some justification [22]. Here, we assume that A is (sufficiently) well conditioned. We propose the GMRES-IR-based least squares solver given in Algorithm 5.1. We make some comments on the algorithm. Line 1 produces a matrix B with columns of unit 2-norm. The computation $C = B^{(h)T}B^{(h)}$ on line 4 produces a symmetric positive definite matrix with constant diagonal elements $\mu = \theta x_{\max}$, so overflow cannot occur for $\theta < 1$. The shift on line 5 is analogous to that in Algorithm 3.1, but here the matrix C is already well scaled and in precision $u_\ell$, so there is no need to scale C to have unit diagonal.
There are two reasons why explicitly forming $C = B^{(h)T}B^{(h)}$ in Algorithm 5.1 is reasonable from the numerical stability point of view. First, C is used to form a preconditioner, so the usual problems with forming a cross product matrix (loss of significance and condition squaring) are less of a concern. Second, if we are working in fp16 on an NVIDIA V100 we can exploit the tensor cores when forming C to accumulate block fused multiply-add operations in single precision; this leads to a more accurate C, as shown by the error analysis of Blanchard et al. [2]. The refinement loop of Algorithm 5.1 comprises the steps

12: Compute $r_i = A^T(b - Ax_i)$ at precision $u_r$ and round $r_i$ to precision $u$.
13: Solve $MA^TAd_i = Mr_i$ by GMRES at precision $u$, where $M = \mu S\widetilde R^{-1}\widetilde R^{-T}S$ and matrix-vector products with $A^TA$ are computed at precision $u_r$, and store $d_i$ at precision $u$.
14: $x_{i+1} = x_i + d_i$ at precision $u$.
For the computed $\widetilde R$ we have $\mu S\widetilde R^{-1}\widetilde R^{-T}S \approx (A^TA)^{-1}$, so we are preconditioning with an approximation to the inverse of $A^TA$. We apply the preconditioned operator $MA^TA$ to vectors at precision $u_r$. Computing $y = A^TAx$ costs $4mn$ flops and $S\widetilde R^{-1}\widetilde R^{-T}Sy$ costs another $2n^2 + n$ flops, making $4mn + 2n^2 + n$ flops in total. For $m \gg n$ and large n, computing $y = A^TAx$ costs a factor $n/4$ fewer flops than the $mn^2$ flops needed to form $A^TA$, while for $m \approx n$ the difference is a factor $n/6$. For large n, even allowing for the fact that the flops we are comparing are at different precisions, as long as GMRES converges quickly the cost of the refinement stage should be negligible compared with the cost of forming $A^TA$ and computing the Cholesky factorization.
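The point about never forming $A^TA$ during refinement is easy to illustrate; the matvec below costs $O(mn)$ per application versus the $O(mn^2)$ one-off cost of the explicit cross product:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 1000, 100
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

y_op  = A.T @ (A @ x)     # ~4mn flops per application; A^T A never formed
C     = A.T @ A           # ~mn^2 flops, paid once before factorization
y_mat = C @ x

print(np.allclose(y_op, y_mat))   # same result, very different cost profile
```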
What can be said about the convergence of Algorithm 5.1? It differs from Algorithm 4.1 mainly in that it works with $A^TA$ rather than A, so at worst we can expect convergence for $\kappa_\infty(A)$ bounded by the square root of the condition number bounds given in Table 4.1, and with limiting backward error and forward errors a factor $\kappa_\infty(A)$ larger than those in the table, since in translating a backward error from $A^TAx = A^Tb$ to $Ax \approx b$ we gain a factor as much as $\kappa_\infty(A)$ [22]. However, this is a pessimistic viewpoint, because it is known that iterative refinement for the normal equations (and the closely related seminormal equations) works better than a simple condition squaring viewpoint suggests [1, sects. 2.9.3, 6.6.5]. Moreover, if $u_r < u$ we expect to benefit from applying the operator $MA^TA$ in precision $u_r$. Since we are in any case targeting well conditioned matrices we will not attempt a detailed error analysis.
Related to our work is the Cholesky-QR algorithm for computing a QR factorization A = QR. It forms the cross-product matrix $A^TA$, computes the Cholesky factorization $A^TA = R^TR$, then obtains the orthogonal factor Q as $Q = AR^{-1}$, and this process can be iterated for better numerical stability; see, for example, [13], [34], [35], [36]. Our work differs in that we do not compute Q, we carry out the Cholesky factorization in lower precision than the working precision, and we solve a least squares problem using preconditioned iterative refinement.
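The basic Cholesky-QR step described above takes only a few lines; a sketch for a well conditioned A (one step, not the iterated variant):

```python
import numpy as np

def cholesky_qr(A):
    """One Cholesky-QR step: A = QR with R'R = A^T A.

    The cross product squares kappa_2(A), so a single step is reliable
    only for well conditioned A; the process may be iterated otherwise.
    """
    R = np.linalg.cholesky(A.T @ A).T        # upper triangular Cholesky factor
    Q = np.linalg.solve(R.T, A.T).T          # Q = A R^{-1} via triangular solve
    return Q, R

rng = np.random.default_rng(5)
A = rng.standard_normal((200, 30))
Q, R = cholesky_qr(A)
print(np.linalg.norm(Q.T @ Q - np.eye(30)))  # near machine precision
```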
Finally, we note that unless A is extremely badly scaled, for bfloat16 or single precision arithmetic we can set S = I and $\mu = 1$ in Algorithm 5.1, since there will usually be no danger of underflow or overflow.
6. Numerical experiments. In this section we perform numerical experiments to investigate how to choose c in Algorithms 3.1 and 5.1, to study the behavior of Algorithms 4.1 and 5.1, and to test variants of the algorithms that use the CG method in place of GMRES.
The chop function of Higham and Pranesh [20] is used to simulate low precision computation and the Advanpix Multiprecision Computing Toolbox [25] with mp.Digits(34) is used for quadruple precision computation. For half precision we consider only fp16 in our experiments. All the experiments are performed in MATLAB 2019a on a Lenovo ThinkStation with an Intel Xeon W-2123 CPU and 32 GB RAM. The test codes are available at https://github.com/SrikaraPranesh/fp16_Cholesky. For the precisions $(u_\ell, u, u_r)$ we take (half, single, double), (half, double, double), (half, double, quad), and (single, double, double).
Iterative refinement is terminated when the normwise backward error satisfies
$$\frac{\|b - A\widehat x\|_\infty}{\|A\|_\infty\|\widehat x\|_\infty + \|b\|_\infty} \le u, \qquad (6.1)$$
where $\widehat x$ is the computed solution. We tried both GMRES and CG as the iterative solver on line 6 of Algorithm 4.1, and for CG we used the standard symmetrized form of the preconditioned iteration [27, Alg. 9.2]; both iterations are terminated based on a backward error criterion for the preconditioned system, with tolerance $10^{-2}$ and $10^{-4}$ for working precisions of single and double, respectively. The right-hand side vectors are generated using randn(n,1), and the random number generator is seeded for reproducibility. In the scaling, we take θ = 0.1. We consider all the symmetric positive definite matrices of dimension 300 to 500 from the SuiteSparse Matrix Collection [6]; their properties are displayed in Table 6.1. Note that several of these matrices are badly scaled. The SuiteSparse matrices are given in double precision.

First we investigate the choice of c in Algorithm 3.1 and compare Algorithm 3.1 with an alternative that shifts A instead of H: in place of lines 1-5 it shifts A directly by a multiple of $\mathrm{diag}(a_{ii})$ before scaling and rounding. In Table 6.2 we display the minimum positive integer values of c, denoted by $c_H$ for Algorithm 3.1 (since it shifts H) and $c_A$ for the modified algorithm (since it shifts A), such that the Cholesky factorization in fp16 arithmetic succeeds. From the table we see that $c_A$ and $c_H$ are identical for all but two matrices and are always at most 2. The same value c = 2 was also sufficient to ensure the success of Cholesky factorization in single precision arithmetic. Based on these results we use c = 2 in the rest of the experiments in this section.
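The stopping test (6.1) is the standard normwise (Rigal-Gaches) backward error, cheap to evaluate at each refinement step; a sketch:

```python
import numpy as np

def normwise_backward_error(A, x, b):
    """eta(x) = ||b - Ax||_inf / (||A||_inf ||x||_inf + ||b||_inf).

    Refinement stops once eta is of order the working precision u:
    the computed x then solves a relatively nearby system.
    """
    r = b - A @ x
    return np.linalg.norm(r, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
        + np.linalg.norm(b, np.inf))

rng = np.random.default_rng(6)
n = 50
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)                 # backward stable direct solve
print(normwise_backward_error(A, x, b))   # of order the double precision u
```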
In Table 6.3 we display the iteration counts for the GMRES-based Algorithm 4.1 and its CG-based variant.Several points can be noted.
1. The iteration counts for the GMRES and CG variants are broadly similar, except that the CG variant fails to converge for matrix 4.
2. Both variants work extremely well for (half, single, double), requiring at most one step of refinement and very few iterations of the iterative solver, except for matrix 4. For (half, double, quad) the iteration counts are higher. Note that rounding to single precision and the shift both have a regularizing effect, so the matrix for working precision single is in general better conditioned than that for working precision double.
3. The results for GMRES-IR are consistent with Table 4.1; indeed, convergence is achieved for somewhat larger values of $\kappa_\infty(A)$ than the table predicts.
4. "0" iterations denotes that the initial solution computed using the fp16 Cholesky factors satisfied the normwise backward error criterion and no iterative refinement steps were necessary, which is because $\|A\|_\infty\|\widehat x\|_\infty$ is so large that (6.1) is satisfied by the initial solution itself; analogous behavior was observed in [21, sect. 4].
We look at some information that gives further insight into these results. Table 6.4 displays $\kappa_2(H)$ for the diagonally scaled matrix H in Algorithm 3.1. The values of $\kappa_2(H)$ are smaller than those of $\kappa_2(A)$, and a factor $10^8$ smaller for matrix 5.

Since quadruple precision is not supported by hardware, $u_r = u$ was used in GMRES-IR in [15], [16]. We show results for (half, double, double) and (single, double, double) in Table 6.6. We see that (half, double, double) in Table 6.6 generally results in more iterative refinement steps and GMRES or CG iterations than (half, double, quad) in Table 6.3, which is because the preconditioned operator is being applied at a lower precision in Table 6.6. However, the (single, double, double) column shows convergence in significantly fewer iterative refinement steps and GMRES or CG iterations than (half, double, quad) in Table 6.3; here, the higher quality preconditioner outweighs the lesser precision at which it is applied. The last column of Table 6.4 confirms the quality of the preconditioner.
From these experiments we can see that CG performs as well as GMRES in practice within Algorithm 4.1, but we recall that the existing error analysis guarantees convergence of the iterative refinement process only for GMRES, not for CG.

6.2. Least squares problems.
To study the behavior of Algorithm 5.1, we consider all full rank rectangular matrices A ∈ R^{m×n} from the SuiteSparse Matrix Collection [6] with m > n, 20 ≤ m ≤ 2000, n ≤ 400, and κ2(A) ≤ 10^5. The matrix names and properties are listed in Table 6.7. Iterative refinement is terminated when the Frobenius norm backward error (6.2) is less than ξnu, where ξ is a parameter that we set to 1. This backward error is computed from the formula in [17, sect. 20.7], [32], in which r = b − Ax̂. The right-hand side vector is generated using randn(m,1) and θ = 0.1 is used. In Table 6.8 we show the minimum values of c on line 5 of Algorithm 5.1 such that Cholesky factorization succeeds. On the basis of these results, we select c = 12 when the low precision is half precision; c = 2 is sufficient when the low precision is single precision.
We consider the solution of a least squares problem using Algorithm 5.1 and a variant that on line 13 uses CG instead of GMRES. Table 6.9 displays the numbers of GMRES and CG iterations, with the number of iterative refinement steps in parentheses. For some of these problems κ2(A) exceeds the square root of the condition number bound; the refinement nevertheless converges.

7. Conclusions.
As computer hardware increasingly supports multiple precisions of floating-point arithmetic, it is important to investigate whether scientific computing tasks can take advantage of these capabilities. Our focus here has been on exploiting a Cholesky factorization computed at low precision to deliver a solution of a linear system or least squares problem having accuracy and stability commensurate with a higher precision. An immediate obstacle is that rounding a symmetric positive definite matrix to lower precision can produce a matrix for which Cholesky factorization breaks down because the rounded matrix is indefinite or not sufficiently positive definite. We have shown that increasing the diagonal by a small relative amount proportional to the unit roundoff for the lower precision allows Cholesky factorization to succeed and yields computed factors good enough to act as preconditioners for GMRES-based iterative refinement. We have also shown that the standard approach of shifting by a multiple of the identity matrix is not as effective.
The proposed Cholesky-based GMRES-IR algorithm, Algorithm 4.1, is supported by rounding error analysis and can produce backward stable solutions for problems with condition numbers up to around the reciprocal of the unit roundoff of the working precision. The overall performance depends on the speed of convergence of GMRES, which is hard to predict, but in our tests on real-life matrices GMRES often converges quickly, especially for the (half, single, double) and (single, double, double) combinations of precisions.
We found experimentally that GMRES can be replaced by CG with little change in the number of iterative refinement steps and inner iterations, despite the fact that the supporting error analysis is not applicable.
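Because both A and the Cholesky-based preconditioner are symmetric positive definite, swapping the inner GMRES solve for CG is mechanical. The following sketch of one correction solve uses SciPy; the function name and the choice to run everything in double precision are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

def cg_correction(A, r, R, D_inv, mu):
    """One inner correction solve A d = r with CG in place of GMRES.
    Both A and the preconditioner M = mu * D^{-1} (R^T R)^{-1} D^{-1}
    are symmetric positive definite, which is what CG requires."""
    n = len(r)

    def apply_prec(v):
        w = solve_triangular(R, D_inv * v, trans='T')  # solve R^T z = D^{-1} v
        w = solve_triangular(R, w)                     # solve R y = z
        return mu * (D_inv * w)

    M = LinearOperator((n, n), matvec=apply_prec)
    d, info = cg(A, r, M=M)                            # preconditioned CG solve
    return d
```

With a good preconditioner the solve typically needs only a handful of iterations, mirroring the GMRES behavior reported above.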
Our proposed use of low precision Cholesky factorization for solving well conditioned least squares problems via the normal equations, Algorithm 5.1, also works well in our tests.
Both algorithms can benefit from the ability of the tensor cores in NVIDIA's Volta and Turing architectures to compute a fused multiply-add with 4 × 4 fp16 matrix arguments in one clock cycle while accumulating the result in single precision.

Table 6.9
Total number of GMRES and CG iterations in Algorithm 5.1 and its variant, for single and double working precisions. Numbers in parentheses denote the number of iterative refinement steps.

Algorithm 3.1 (Cholesky factorization in precision u_ℓ). Given a symmetric positive definite A ∈ R^{n×n} in precision u, this algorithm computes an approximate Cholesky factorization R^T R ≈ µD^{-1}AD^{-1} at precision u_ℓ, where D = diag(a_ii^{1/2}).
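The scaling and shifting steps of Algorithm 3.1 can be sketched in NumPy as follows. Half precision is simulated here by rounding to np.float16, and the factorization itself runs in single precision (LAPACK has no half precision kernels); the function name and default parameter values are illustrative, not taken from the paper's code.

```python
import numpy as np

def low_precision_cholesky(A, c=2, theta=0.1, xmax=65504.0):
    """Scale, shift, round, and factorize in the spirit of Algorithm 3.1.
    Returns an upper triangular R with R^T R ~ mu * D^{-1} A D^{-1}."""
    d = np.sqrt(np.diag(A))
    D_inv = 1.0 / d                                    # D = diag(a_ii^{1/2})
    H = D_inv[:, None] * A * D_inv[None, :]            # H = D^{-1} A D^{-1}, unit diagonal
    u_low = 2.0 ** -11                                 # unit roundoff of fp16
    mu = theta * xmax                                  # scaling to keep entries away from underflow
    H_shifted = mu * (H + c * u_low * np.eye(len(A)))  # diagonal shift (diag(H) = I)
    H16 = H_shifted.astype(np.float16)                 # round to fp16 storage
    R = np.linalg.cholesky(H16.astype(np.float32)).T   # factorize; R^T R ~ mu * H
    return R, D_inv, mu
```

Since H has unit diagonal and off-diagonal entries of modulus at most 1, multiplying by µ = θ x_max moves the entries well away from the subnormal range of fp16 without risking overflow.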

Algorithm 4.1 (Cholesky-based GMRES-IR for a positive definite system). Let a symmetric positive definite A ∈ R^{n×n} and b ∈ R^n be given in precision u. This algorithm solves Ax = b by GMRES-based iterative refinement, using the approximate Cholesky factorization computed by Algorithm 3.1. The scalar θ ∈ (0, 1] and the positive integer c are parameters.
1: Run Algorithm 3.1, obtaining the Cholesky factorization A^(ℓ) = R^T R in precision u_ℓ.
2: b^(ℓ) = fl_ℓ(D^{-1}b)
3: Solve A^(ℓ)y_0 = b^(ℓ) in precision u_ℓ using the Cholesky factors and form x_0 = µD^{-1}y_0 at precision u.
4: for i = 0 : i_max − 1 do
5:
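A minimal sketch of the refinement loop of Algorithm 4.1, using SciPy's GMRES with the low precision Cholesky factor as a preconditioner. Here all arithmetic runs in double precision (the distinctions between u_ℓ, u, and u_r are not simulated), and the tolerance, step limit, and function name are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, R, D_inv, mu, tol=1e-8, max_ref=10):
    """GMRES-based iterative refinement in the spirit of Algorithm 4.1.
    R is an upper triangular factor with R^T R ~ mu * D^{-1} A D^{-1},
    so M = mu * D^{-1} (R^T R)^{-1} D^{-1} approximates A^{-1}."""
    n = len(b)

    def apply_prec(v):
        w = D_inv * v                                  # D^{-1} v
        w = solve_triangular(R, w, trans='T')          # solve R^T z = w
        w = solve_triangular(R, w)                     # solve R y = z
        return mu * (D_inv * w)

    M = LinearOperator((n, n), matvec=apply_prec)
    x = apply_prec(b)                                  # x_0 from the factors
    for _ in range(max_ref):
        r = b - A @ x                                  # residual (precision u_r in the paper)
        eta = np.linalg.norm(r, np.inf) / (
            np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
            + np.linalg.norm(b, np.inf))
        if eta <= tol:                                 # normwise backward error test
            break
        d, _ = gmres(A, r, M=M)                        # preconditioned correction solve
        x = x + d
    return x
```

When the factor is accurate enough, the initial solve already satisfies the backward error test and the loop exits with zero refinement steps, matching the "0 iterations" behavior noted in the experiments.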

Algorithm 5.1 (Cholesky-based GMRES-IR for the least squares problem). Let a full rank A ∈ R^{m×n}, where m ≥ n, and b ∈ R^m be given in precision u. This algorithm solves the least squares problem min_x ‖b − Ax‖_2 using Cholesky-based GMRES-IR. The scalar θ ∈ (0, 1] and the positive integer c are parameters.
1: Compute B = AS, where S = diag(1/‖a_j‖_2), with a_j the jth column of A.
2: µ = θ x_max
3: B^(h) = fl_h(µ^{1/2}B)
4: Compute C = (B^(h))^T B^(h) in precision u_ℓ.
5: Compute the Cholesky factorization C + cu_ℓ diag(c_ii) = R^T R in precision u_ℓ.
6: if Cholesky factorization failed then
7: c ← 2c, goto line 5
8: end if
9: Form b^(ℓ) = fl_ℓ(SA^T b).
10: Solve R^T Ry_0 = b^(ℓ) in precision u_ℓ and form x_0 = µSy_0 at precision u.
11: for i = 0 : i_max − 1 do
12:
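The setup phase of Algorithm 5.1 (lines 1-10) can be sketched as follows, with half precision simulated by rounding to np.float16 and the subsequent refinement loop (lines 11 onward) omitted. The function name, the use of double precision for the solves, and the default parameters are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def ls_normal_equations(A, b, c=2, theta=0.1, xmax=65504.0):
    """Column-scale A, form the normal equations matrix at simulated half
    precision, shift its diagonal until Cholesky succeeds, and produce a
    first least squares solution (lines 1-10 of Algorithm 5.1)."""
    m, n = A.shape
    S = 1.0 / np.linalg.norm(A, axis=0)        # S = diag(1/||a_j||_2)
    B = A * S[None, :]                          # B = A S, unit-norm columns
    mu = theta * xmax
    Bh = (np.sqrt(mu) * B).astype(np.float16)   # B^(h) = fl_h(mu^{1/2} B)
    C = Bh.astype(np.float32).T @ Bh.astype(np.float32)  # C ~ mu * S A^T A S
    u_low = 2.0 ** -11                          # unit roundoff of fp16
    while True:
        try:                                    # lines 5-8: shift, retrying with 2c
            R = np.linalg.cholesky(C + c * u_low * np.diag(np.diag(C))).T
            break
        except np.linalg.LinAlgError:
            c *= 2
    rhs = S * (A.T @ b)                         # b^(l) = S A^T b
    y0 = np.linalg.solve(R, np.linalg.solve(R.T, rhs))
    return mu * (S * y0), R, S, mu              # x_0 = mu * S * y0
```

Since R^T R ≈ µ S A^T A S, the back-substituted y_0 satisfies A^T A (µ S y_0) ≈ A^T b, which is why x_0 = µ S y_0 approximates the normal equations solution.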

6.1. Linear systems. We first consider linear systems Ax = b with symmetric positive definite A. Iterative refinement in Algorithm 4.1 is terminated when the normwise backward error satisfies (6.1).
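Criterion (6.1) itself is not reproduced in this excerpt; assuming it is the standard normwise relative backward error in the infinity norm (consistent with the quantities ‖A‖∞ and ‖x‖∞ discussed in the comments on the results), it can be computed as in the following sketch.

```python
import numpy as np

def normwise_backward_error(A, x, b):
    """Normwise relative backward error of x as a solution of Ax = b,
    measured in the infinity norm (the Rigal-Gaches formula)."""
    r = b - A @ x
    return np.linalg.norm(r, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
        + np.linalg.norm(b, np.inf))
```

A value of order the working precision unit roundoff certifies that x solves a nearby system, which is the sense in which refinement is declared converged.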

Table 6.1
Symmetric positive definite test matrices chosen from the SuiteSparse Matrix Collection. Reported quantities include max_{i,j} |a_ij| and min_{i,j} { |a_ij| : a_ij ≠ 0 }.

Table 6.2
Minimum positive integer constants c_H and c_A in Algorithm 3.1 (which shifts H) and its modified version (which shifts A) to ensure the success of Cholesky factorization in fp16 arithmetic.

After conversion to fp16, and before shifting, λ_min(H) was negative for matrices 4 and 11, which is consistent with the κ2(H) values. Table 6.4 also displays the values of the condition numbers of M_H A and M_A A, the products (explicitly computed in double precision) of the preconditioners with A, where M_H is obtained from Algorithm 3.1 and M_A from the modified algorithm. The matrices in Table 6.3 with the larger GMRES and CG iteration counts are those with the larger values of κ2(M_H A). In Table 6.5 we display results for the same experiment but now using the modified version of Algorithm 3.1 in Algorithm 4.1, which shifts A instead of H. Comparing with Table 6.3, we see slightly larger numbers of iterative refinement steps but many more GMRES and CG iterations. This is consistent with Table 6.4, in which the values of κ2(M_A A) are mostly larger than the values of κ2(M_H A). Clearly, the Cholesky factors obtained by shifting A are not nearly as effective preconditioners as those obtained by shifting H.

Table 6.3
Total number of GMRES and CG iterations in Algorithm 4.1 and its CG variant, for single and double working precisions. Numbers in parentheses denote the number of iterative refinement steps. Failure to converge is denoted by "-".

Table 6.4
For the n × n matrices A in Table 6.1, quantities of interest in Algorithm 3.1. Recall that H = D^{-1}AD^{-1}, where D = diag(a_ii^{1/2}). Here, M_H is the preconditioner from Algorithm 3.1 and M_A is the preconditioner from the modified version of Algorithm 3.1 that shifts A instead of H.

Table 6.7
Matrices A ∈ R^{m×n} chosen from the SuiteSparse Matrix Collection. Reported quantities include κ2(A), max_{i,j} |a_ij|, and min_{i,j} { |a_ij| : a_ij ≠ 0 }.

Table 6.8
Minimum positive integer constant c in line 5 of Algorithm 5.1 to ensure the success of Cholesky factorization in fp16 arithmetic.