Squeezing a Matrix into Half Precision, with an Application to Solving Linear Systems

Abstract. Motivated by the demand in machine learning, modern computer hardware is increasingly supporting reduced precision floating-point arithmetic, which provides advantages in speed, energy, and memory usage over single and double precision. Given the availability of such hardware, mixed precision algorithms that work in single or double precision but carry out part of a computation in half precision are now of great interest for general scientific computing tasks. Because of the limited range of half precision arithmetic, in which positive numbers lie between 6 × 10^-8 and 7 × 10^4, a straightforward rounding of single or double precision data into half precision can lead to overflow, underflow, or subnormal numbers being generated, all of which are undesirable. We develop an algorithm for converting a matrix from single or double precision to half precision. It first applies two-sided diagonal scaling in order to equilibrate the matrix (that is, to ensure that every row and column has ∞-norm 1), then multiplies by a scalar to bring the largest element within a factor θ ≤ 1 of the overflow level, and finally rounds to half precision. The second step ensures that full use is made of the limited range of half precision arithmetic, and θ must be chosen to allow sufficient headroom for subsequent computations. We apply the new algorithm to GMRES-based iterative refinement (GMRES-IR), which solves a linear system Ax = b with single or double precision data by LU factorizing A in half precision and carrying out iterative refinement with the correction equations solved by GMRES preconditioned with the low precision LU factors. Previous implementations of this algorithm have used a crude conversion to half precision that our experiments show can cause slow convergence of GMRES-IR for badly scaled matrices or failure to converge at all. The new conversion algorithm computes ∞-norms of rows and columns of the matrix and its cost is negligible in the context of LU factorization. We show that it leads to faster convergence of GMRES-IR for badly scaled matrices and thereby allows a much wider class of problems to be solved.

1. Introduction. The landscape of scientific computing is changing because of the growing availability and usage of low precision floating-point arithmetic. The 2008 revision of IEEE standard 754 introduced a 16-bit floating-point format, known as half precision (fp16) [18]. Although defined only as a storage format, it has been widely adopted for computing, and is supported by the NVIDIA P100 and V100 GPUs and the AMD Radeon Instinct MI25 GPU. On such hardware, half precision operations run at least twice as fast as single precision ones, and up to 8 times faster on the NVIDIA V100 because of its tensor cores. The Summit machine at Oak Ridge National Laboratory, which heads the June 2018 and November 2018 TOP500 lists (www.top500.org), has 4608 nodes, with 6 NVIDIA V100 GPUs per node [26], [30]. To obtain the best performance from such hardware, it is clearly crucial to develop algorithms that can exploit half precision arithmetic. Indeed Summit has already achieved exaflop performance through careful use of the tensor cores [10].
As regards future machines, Fujitsu has announced that the A64FX Arm processor that will power Japan's first exascale computer supports half precision arithmetic running at twice the speed of single precision [9]. Another form of half precision arithmetic is the bfloat16 format used by Google in its tensor processing units. The bfloat16 format allocates more bits to the exponent than fp16 and fewer to the significand, so it has a larger range but less precision. Intel gives a detailed specification of the bfloat16 format, which will be supported in its forthcoming Nervana Neural Network Processor [28], in a recent white paper [19].
While machine learning is one of the main drivers for the development of half precision in hardware [12], [31], half precision is also being used in other applications such as weather and climate modelling [6], [27].
Here, we are concerned with the fundamentally important problem of solving a general linear system of equations Ax = b. We suppose that A ∈ R^{n×n} and b ∈ R^n are stored in a working precision of double precision (fp64) or single precision (fp32). The standard approach is to compute an LU factorization in the working precision and solve two triangular systems, obtaining a computed solution x_w, say. Our aim is to solve the problem substantially faster by exploiting half precision arithmetic, yet obtain a computed solution with the same level of backward and forward errors as x_w, with minimal restrictions on A.
Algorithms satisfying most of these requirements have already been developed, based on iterative refinement [3], [4], [13], [14], [15]. Such algorithms round the entries of A to half precision, compute the LU factors in half precision, compute a solution using the low accuracy LU factors, then use iterative refinement to generate a solution of working precision quality. Haidar et al. [13], [14] show that on NVIDIA GPUs, using the tensor core features, this approach leads to a speedup of up to 4 over highly optimized double precision solvers, with a reduction in power consumption of up to 80%. In this context, standard iterative refinement converges only for very well conditioned matrices (κ(A) ≲ 10^4, where κ(A) = ‖A‖ ‖A^{-1}‖), but the range of solvable problems is greatly enlarged by using GMRES [29] preconditioned with the low precision LU factors to solve the update equations [3], [4].
However, this usage of half precision has the drawback that the elements of the matrix A may overflow or underflow when rounded to half precision. To see why, consider Table 1.1, which shows key characteristics of half, single, and double precision arithmetic, as well as bfloat16. When rounded to fp16, any double precision number with magnitude in the interval [6.6 × 10^4, 1.8 × 10^308] will overflow and any double precision number with magnitude in the interval [2.2 × 10^-308, 5.9 × 10^-8] will underflow. Many matrices of practical interest have entries in these ranges.
Overflow is unrecoverable, because LU factorization cannot produce useful results for a matrix with infinities amongst its entries. Underflow during the rounding could cause a serious loss of information; moreover, the rounded matrix could have a zero row or column and hence be structurally singular. It is also desirable to avoid producing subnormal numbers, which lie between the underflow threshold and the smallest normalized floating-point number, as they have less precision than normalized numbers and they incur a performance penalty if handled in software. The limited range of fp16 may therefore create a fundamental problem in using it in this context, and indeed more generally.
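As a small illustration (ours, not from the paper), NumPy's float16 type reproduces all three effects when double precision values are rounded to fp16:

```python
import numpy as np

# Rounding double precision values to fp16 with NumPy (our illustration).
# x_max for fp16 is 65504; the smallest normalized number is about 6.1e-5
# and the smallest subnormal is about 5.96e-8.
vals = np.array([1.0e5, 1.0, 1.0e-6, 1.0e-9])
half = vals.astype(np.float16)

assert np.isinf(half[0])                        # overflow to infinity
assert half[1] == 1.0                           # representable exactly
assert 0 < half[2] < np.finfo(np.float16).tiny  # becomes subnormal
assert half[3] == 0.0                           # underflow to zero
```

The assertions confirm the behaviour described above: a matrix containing such entries would gain infinities, subnormals, and spurious zeros on naive conversion.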
In the work on iterative refinement in [13], [14], [15], elements that overflow during conversion to fp16 are mapped to the nearest finite number, ±x_max. As we will show, for badly scaled real-life matrices this approach can lead to slow convergence or nonconvergence, so a more sophisticated strategy is needed.
In this work we investigate how to convert a matrix A from single or double precision to half precision without overflow and with a reduced chance of underflow and of subnormal numbers being generated. Essentially, we squeeze the matrix into half precision by a two-stage process: first we apply a two-sided diagonal scaling to ensure that for every row and column the largest absolute value of the elements is 1, then we multiply by a scalar in order to expand the range of the matrix entries to occupy most of the fp16 range.
We focus on fp16 arithmetic in this work, but our algorithms and analysis apply equally well to bfloat16. The latter format is less prone to overflow and underflow, but because of its very low precision it is of less interest than fp16 for general purpose scientific computing.
In the next section we present algorithms for handling the conversion to fp16 without reference to any specific problem, as the approaches presented are widely applicable. In section 3 we focus on mixed precision iterative refinement for the Ax = b application. Then in section 4 we perform numerical experiments to test the effectiveness of the proposed scaling algorithms. In section 5 we discuss an alternative scaling algorithm that employs a rank-1 update and explain its pros and cons. We close with some concluding remarks in section 6.
2. Algorithms for converting a matrix to half precision. One response to overflow in converting a matrix to fp16 is simply to map any elements too large for fp16 to ±x_max. Algorithm 2.1 is a little more general: it maps any number outside the interval [-θx_max, θx_max] to the nearest point on that interval, where θ ∈ (0, 1] is a parameter. We will explain the role of θ in section 2.1. Here, fl_h denotes the operator that rounds to fp16 and sign is the function that maps positive real numbers to 1, negative real numbers to -1, and 0 to 0. Algorithm 2.1, with θ = 1, is the approach used in [13], [14], [15].

Algorithm 2.1 (round then replace infinities). This algorithm rounds A ∈ R^{n×n} to the fp16 matrix A^(h), mapping any elements of modulus larger than θx_max to ±θx_max.
1: A^(h) = fl_h(A)
2: For every i and j such that |a^(h)_ij| > θx_max, set a^(h)_ij = sign(a_ij) θx_max.

Another approach is to scale the matrix before rounding, as in Algorithm 2.2, which again ensures that the largest entry in magnitude is θx_max.

Algorithm 2.2 (scale by scalar then round). This algorithm rounds A ∈ R^{n×n} to the fp16 matrix A^(h), scaling all elements to avoid overflow. θ ∈ (0, 1] is a parameter.
1: β = max_{i,j} |a_ij|
2: µ = θx_max/β
3: A^(h) = fl_h(µA)
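The two algorithms can be sketched in NumPy as follows (our code, not the paper's; fl_h is taken to be conversion to float16):

```python
import numpy as np

XMAX = float(np.finfo(np.float16).max)  # x_max = 65504 for fp16

def round_then_replace(A, theta=1.0):
    """Algorithm 2.1: map entries outside [-theta*x_max, theta*x_max] to the
    nearest endpoint of that interval, then round to fp16."""
    return np.clip(A, -theta * XMAX, theta * XMAX).astype(np.float16)

def scale_then_round(A, theta=1.0):
    """Algorithm 2.2: scale so the largest magnitude becomes theta*x_max,
    then round to fp16; also return the scalar mu."""
    beta = np.abs(A).max()
    mu = theta * XMAX / beta
    return mu, (mu * A).astype(np.float16)
```

With θ = 1 these reproduce the behaviour described in the text: Algorithm 2.1 clamps huge entries to ±x_max, while Algorithm 2.2 shrinks every entry, possibly into the subnormal range.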
Algorithms 2.1 and 2.2 have the following drawbacks. Algorithm 2.1 makes a potentially large perturbation to every element that overflows, so it can make a large change to the matrix. When max_{i,j} |a_ij| > x_max, Algorithm 2.2 reduces every element in magnitude, so it increases the risk of underflow. We illustrate the algorithms with the matrix (2.1), adapted from [11, p. 45], taking θ = 1 in both algorithms. For α > x_max, Algorithm 2.1 simply maps the elements of magnitude α to ±x_max. The matrix produced by Algorithm 2.2 has first and second columns consisting entirely of subnormal numbers for sufficiently large α, and has zero first and second columns (hence is singular) if α ≥ 2 × 10^12.
A second weakness of Algorithms 2.1 and 2.2 is that they do nothing to avoid underflow of elements of A. Consider the matrix (2.2) and take θ = 1 in both algorithms. If δ ≤ 10^-8, Algorithm 2.1 produces A^(h) = A(0), which has zero second and third columns, while Algorithm 2.2 produces the same result if δ ≤ 9 × 10^-13.
To address these issues we now consider a more sophisticated class of algorithms that carry out two-sided diagonal scaling prior to converting to fp16: they replace A by RAS, where R and S are nonsingular diagonal matrices. Such scaling algorithms have been developed in the context of linear systems and linear programming problems. Despite the large literature on scaling such problems, no clear conclusions are available on when or how one should scale; see [8] for a recent experimental study. In any case, our use of these scalings is different from that in previous studies, where the aim of scaling has been to reduce a condition number or to speed up the convergence of an iterative method applied to the scaled matrix. We scale in order to help squeeze a single or double precision matrix into half precision, with a particular application to using the resulting half precision LU factors for iterative refinement.
Our usage of two-sided diagonal scaling is given in Algorithm 2.3.
Algorithm 2.3 (two-sided diagonal scaling then round). This algorithm rounds A ∈ R^{n×n} to the fp16 matrix A^(h), scaling all elements to avoid overflow. θ ∈ (0, 1] is a parameter.
1: Apply any two-sided diagonal scaling algorithm to A, to obtain diagonal matrices R, S.
2: Let β be the maximum magnitude of any entry of RAS.
3: µ = θx_max/β
4: A^(h) = fl_h(µRAS)
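Given diagonal scaling matrices R and S from any equilibration algorithm, the remaining steps are a one-liner in NumPy (our sketch; names are ours):

```python
import numpy as np

XMAX = float(np.finfo(np.float16).max)  # x_max for fp16

def two_sided_scale_then_round(A, R, S, theta=0.1):
    """Algorithm 2.3 given precomputed diagonal R and S (our sketch)."""
    B = R @ A @ S                # step 1 output: two-sided diagonal scaling
    beta = np.abs(B).max()       # step 2: largest magnitude of RAS
    mu = theta * XMAX / beta     # step 3: bring largest entry to theta*x_max
    return mu, (mu * B).astype(np.float16)
```

With R = S = I this reduces to Algorithm 2.2; the choice θ = 0.1 reflects the headroom discussion later in the paper.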
We now consider two different algorithms for determining R and S; both algorithms are carried out at the working precision. We first consider row and column equilibration, implemented in Algorithm 2.4. This scaling ensures that every row and column has maximum element in modulus equal to 1, that is, each row and column is equilibrated. The LAPACK routines xyyEQU carry out this form of scaling [2].
Algorithm 2.4 (row and column equilibration). Given A ∈ R^{n×n}, which is assumed to have no zero row or column, this algorithm computes nonsingular diagonal matrices R and S such that B = RAS has the property that max_k |b_ik| = max_k |b_ki| = 1 for all i.
1: for i = 1: n do
2:   r_i = 1/max_j |a_ij|
3: end for
4: R = diag(r_i)
5: A = RA % A is row equilibrated.
6: for j = 1: n do
7:   s_j = 1/max_i |a_ij|
8: end for
9: S = diag(s_j)
10: A = AS

We note that Algorithm 2.4 applies the row scaling before the column scaling. If the column scaling is applied first a different result may be obtained, and while the result has the same characteristic scaling property the conditioning may be very different: for the matrix (2.1), Algorithm 2.4 and the variant that scales columns then rows yield quite different scaled matrices. However, in general there is no reason to prefer scaling rows or columns first.
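The steps above can be sketched compactly with NumPy broadcasting (our code; the scaling is exact at the maximal entries, so the equilibration property holds to rounding error):

```python
import numpy as np

def equilibrate(A):
    """Algorithm 2.4 sketch: row equilibration, then column equilibration of
    the row-scaled matrix. Assumes A has no zero row or column."""
    r = 1.0 / np.abs(A).max(axis=1)   # reciprocals of row inf-norms
    B = r[:, None] * A                # B = R A: every row now has inf-norm 1
    s = 1.0 / np.abs(B).max(axis=0)   # reciprocals of column inf-norms of R A
    B = B * s[None, :]                # B = R A S
    return np.diag(r), np.diag(s), B
```

After the second stage every column has ∞-norm 1 by construction, and each row's maximal entry is unchanged (its column maximum was already 1), so rows remain equilibrated too.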
If the input matrix is symmetric then Algorithm 2.4 will generally destroy symmetry. A symmetry-preserving two-sided scaling is proposed by Knight, Ruiz, and Uçar [20]. The algorithm, given in Algorithm 2.5, is iterative and scales simultaneously on both sides rather than sequentially on one side then the other as in Algorithm 2.4. It has the following properties:
1. if A = A^T then R = S;
2. the algorithm is permutation invariant: if it produces the scaling RAS for A then it produces the scaling P_1 R P_1^T (P_1 A P_2) P_2^T S P_2 for P_1 A P_2;
3. it is linearly convergent, with asymptotic error constant 1/2.

Algorithm 2.5 (symmetry-preserving row and column equilibration). Given A ∈ R^{n×n}, which is assumed to have no zero row or column, this algorithm computes nonsingular diagonal matrices R and S such that B = RAS is approximately row and column equilibrated; tol is a convergence tolerance.
1: R = I, S = I
2: repeat
3:   for i = 1: n, r_i = (max_j |a_ij|)^{-1/2}; end for
4:   for j = 1: n, s_j = (max_i |a_ij|)^{-1/2}; end for
5:   A = diag(r_i) A diag(s_j), R = R diag(r_i), S = S diag(s_j)
6: until every row and column of A has ∞-norm within tol of 1

Applied to A in (2.1), Algorithm 2.5 converges after one iteration and gives a matrix that, in terms of conditioning and the size of the smallest nonzero entry, is intermediate between the matrices from Algorithm 2.4 and the variant that scales the columns first.
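A Ruiz-style iteration in the spirit of Algorithm 2.5 can be sketched as follows (our reconstruction, not the authors' code; rows and columns are repeatedly divided by the square roots of their ∞-norms, which keeps R = S for symmetric input):

```python
import numpy as np

def sym_equilibrate(A, tol=1e-4, maxiter=100):
    """Symmetry-preserving two-sided equilibration (Ruiz-type sketch)."""
    n = A.shape[0]
    Rv, Sv = np.ones(n), np.ones(n)
    B = np.array(A, dtype=float)
    for _ in range(maxiter):
        r = np.sqrt(np.abs(B).max(axis=1))   # sqrt of row inf-norms
        s = np.sqrt(np.abs(B).max(axis=0))   # sqrt of column inf-norms
        B = B / r[:, None] / s[None, :]      # scale both sides simultaneously
        Rv, Sv = Rv / r, Sv / s              # accumulate the scalings
        if (np.abs(1.0 - np.abs(B).max(axis=1)).max() <= tol and
                np.abs(1.0 - np.abs(B).max(axis=0)).max() <= tol):
            break
    return np.diag(Rv), np.diag(Sv), B
```

For symmetric A the row and column norms coincide at every step, so the accumulated R and S are identical, matching property 1 above; convergence is linear, consistent with property 3.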
We note that β in line 2 of Algorithm 2.3 is equal to 1 for Algorithms 2.4 and 2.5. This is immediate for Algorithm 2.4 and is true for Algorithm 2.5 as long as at least one iteration is performed (of course, row and column equilibration in Algorithm 2.5 is achieved only in the limit).
Other two-sided diagonal scaling algorithms exist, including Hungarian scaling [17] and other algorithms discussed by Larsson [21]. We have tried several of them and have not found any to have an advantage over Algorithms 2.4 and 2.5 in our linear system application, so we do not discuss them here.
2.1. Discussion. How do Algorithm 2.1, Algorithm 2.2, and Algorithm 2.3 with Algorithm 2.4 or Algorithm 2.5 compare for converting a matrix to fp16? Depending on the usage of the fp16 matrix A^(h), several possible criteria may be of interest, of which we mention three.
• As few elements as possible should underflow or become subnormal (nonzero but unnormalized).
• Key properties of the original matrix, such as singular values or condition number, should be preserved as much as possible.
• The inverses of the computed LU factors of A^(h) should form an effective preconditioner for A.

In section 4 we give numerical experiments that focus on the first and last of these criteria. We look at the percentage of nonzero elements of a matrix that underflow after scaling and rounding to fp16 as well as the performance of GMRES-based iterative refinement.

Now we discuss the parameter θ in Algorithms 2.1, 2.2, and 2.3. The best choice will be problem-dependent. The aim is to take θ close to 1 in order to maximize the use of the fp16 range and thereby reduce the chance of underflow (which in the worst case could make the matrix singular) and of producing subnormal numbers. However, θ needs to be sufficiently less than 1 to allow headroom for subsequent computations not to overflow. It is unusual in scientific computing to work close to the overflow level, but given the constant relative spacing of the floating-point number system there is no reason not to do so. This reasoning is analogous to the "expose to the right" approach in digital photography, whereby at the capture stage one maximizes the exposure without overexposing, thereby making full use of the dynamic range representable in a digital image.

Table 3.1: Bounds on κ∞(A) such that GMRES-IR with precisions given in the first three columns is guaranteed to converge with the limiting backward or forward errors shown in the final three columns, where "single" and "double" denote quantities of the order of the unit roundoffs for single precision and double precision, respectively.

3. Application to Ax = b. Now we employ the algorithms of the previous section within the solution of Ax = b using GMRES-based iterative refinement (GMRES-IR) in three precisions [3], [4]. GMRES-IR carries out iterative refinement at the working precision u using LU factors computed at a precision u_h ≥ u, and with residuals computed at a precision u_r ≤ u. It solves the update equation in iterative refinement, which has the residual as right-hand side, using GMRES preconditioned with the LU factors. Table 3.1 shows the limiting backward and forward errors that are guaranteed by the analysis of [3] and [4] for κ∞(A) satisfying the specified bounds.
We convert the matrix to fp16 by one of Algorithms 2.1-2.3 before it is factorized. The overall algorithm is Algorithm 3.1, and the choices of precisions of interest here are (u_h, u, u_r) = (half, double, quad) and (u_h, u, u_r) = (half, single, double). The preconditioned matrix MA is not formed explicitly; rather, its action on a vector is obtained by two triangular solves and a multiplication with A, all at precision u_r.
Algorithm 3.1 (GMRES-IR). Let A ∈ R^{n×n} and b ∈ R^n be given in precision u. This algorithm solves Ax = b using GMRES-based iterative refinement with the conversion to fp16 done by one of Algorithms 2.1, 2.2, and 2.3. θ ∈ (0, 1] is a parameter.
1: Obtain A^(h) from one of (a) Algorithm 2.1 (set µ = 1, R = I, S = I); (b) Algorithm 2.2 (set R = I, S = I); (c) Algorithm 2.3 together with Algorithm 2.4 or Algorithm 2.5.
2: b^(h) = fl_h(Rb)
3: Compute an LU factorization A^(h) ≈ L̂Û in precision u_h.
4: Solve A^(h) y_0 = b^(h) in precision u_h using the LU factors and form x_0 = µS y_0 at precision u.
5: for i = 0: i_max - 1 do
6:   Compute r_i = b - Ax_i at precision u_r and round r_i to precision u.
7:   Solve MAd_i = Mr_i by GMRES at precision u, where M = µS Û^{-1} L̂^{-1} R and matrix-vector products with MA are computed at precision u_r, and store d_i at precision u.
8:   x_{i+1} = x_i + d_i at precision u.
9: end for

In Algorithm 3.1, the refinement and the GMRES solves work with the original system, so backward errors and residuals are measured on the original data. It would be possible to work with the diagonally scaled system, but since norms are not invariant under diagonal scaling this would change the convergence test and hence the attained backward and forward errors, so it should be done only with knowledge of the problem.

Now we consider the effect of the scaling in Algorithm 3.1. A complication in analyzing LU factors of diagonally scaled matrices is that scaling might change the pivot sequence. Therefore to simplify the analysis we assume that the pivot sequence is unaffected by R and S. We consider a matrix whose entries are within the normalized fp16 range and assume that µ and the diagonal elements of R and S are powers of 2. Then fl_h(µRAS) = µR fl_h(A)S.
Let fl_h(A) have the exact LU factorization fl_h(A) = LU. Then µR fl_h(A)S = (RLR^{-1})(µRUS), where RLR^{-1} is unit lower triangular and µRUS is upper triangular. Since an LU factorization is unique, the exact LU factors L_1 and U_1 of fl_h(µRAS) = µR fl_h(A)S satisfy L_1 = RLR^{-1} and U_1 = µRUS. Hence the matrix M in Algorithm 3.1 satisfies (in exact arithmetic)

M = µS U_1^{-1} L_1^{-1} R = µS (µRUS)^{-1} (RLR^{-1})^{-1} R = U^{-1} L^{-1}.

The latter matrix corresponds to the unscaled problem, so the matrix MA to which GMRES-IR is applied is the same with or without scaling, and so the scaled and unscaled algorithms are equivalent. There is even a numerical equivalence, stemming from the fact that the rounding errors in the LU factorizations of RAS and A scale in the same way for a fixed pivot sequence, given our assumptions on R and S [16, sec. 9.8]. In practice, the pivot sequences for A and RAS may be different, but in this case it is hard to compare the matrices M obtained with and without scaling. Nevertheless, the purpose of scaling is to fit A into fp16, and this argument shows that if scaling accelerates iterative refinement it will do so indirectly, through allowing better LU factors to be computed.
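The identity fl_h(µRAS) = µR fl_h(A)S can be checked numerically (our illustration; the entries are constructed to stay safely inside the normalized fp16 range, and µ, R, S are powers of 2):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, (4, 4))                  # entries well inside fp16 range
mu = 2.0 ** 3                                      # power-of-2 scalar
R = np.diag(2.0 ** rng.integers(-2, 3, 4).astype(float))
S = np.diag(2.0 ** rng.integers(-2, 3, 4).astype(float))

# fl_h(mu*R*A*S) versus mu*R*fl_h(A)*S; both computed exactly in float64
lhs = (mu * (R @ A @ S)).astype(np.float16).astype(np.float64)
rhs = mu * (R @ A.astype(np.float16).astype(np.float64) @ S)
assert np.array_equal(lhs, rhs)                    # rounding commutes with scaling
```

Multiplying by a power of 2 within the normalized range changes only the exponent of a binary floating-point number, so it commutes exactly with round-to-nearest; this is what the assertion confirms.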
The initial linear system that we solve for x_0 in Algorithm 3.1, which reduces to triangular systems with L̂ and Û, may need scaling in order to avoid overflow and unnecessary underflow. How to scale triangular systems is well understood [1], [7], [22], so we do not discuss it here. In our numerical experiments we did not scale the triangular systems. We also note that any scaling necessary for b is best done at the triangular solve stage.
Finally, we consider how to choose the parameter θ in Algorithms 2.1, 2.2, and 2.3. We need to ensure that the elements of L and U do not overflow. Assume that we are using LU factorization with partial pivoting, PA = LU. The lower triangular matrix L has elements bounded in modulus by 1, and |u_ij| ≤ ρ_n max_{i,j} |a_ij| for all i and j, where ρ_n is the growth factor [16, secs. 9.3, 9.4]. Therefore we need θ ≤ ρ_n^{-1}. In practice, ρ_n is rarely large, so to avoid overflow in the LU factors we might take θ = 0.1 (say), for which the final fp16 matrix satisfies max_{i,j} |a_ij| = 0.1x_max.
As well as avoiding overflow we also wish to avoid zero pivots. A pivot u_kk can underflow to zero only if κ∞(A) exceeds a bound that for half precision is 1.09 × 10^13 with θ = 0.1, and in practice we will usually have κ∞(A) ≪ 10^13 after applying diagonal scaling. Therefore we can conclude that it is unlikely that a pivot will underflow.

4. Numerical experiments. In this section we perform numerical experiments to evaluate the different approaches to dealing with the conversion from double or single precision to half precision within GMRES-IR. The convergence test that we use in line 9 of Algorithm 3.1 is

(4.1)    ‖b - Ax_i‖_∞ / (‖A‖_∞ ‖x_i‖_∞ + ‖b‖_∞) ≤ u,

in which the quantity on the left is the normwise backward error [16, Thm. 7.1]. We first demonstrate the ineffectiveness of Algorithms 2.1 and 2.2 and then we study the performance of Algorithm 2.3. Throughout this section performance is measured in terms of the total number of GMRES iterations required in Algorithm 3.1. We use the MATLAB codes for GMRES-based iterative refinement from https://github.com/eccarson/ir3, with minor modifications. The inner GMRES iterations are terminated based on a backward error criterion for the preconditioned system, with tolerance 10^-2 and 10^-4 for working precisions of single and double, respectively, and a maximum of 10 iterative refinement steps is performed. Finally, we take tol = 10^-4 in Algorithm 2.5.
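The left-hand side of the convergence test (4.1) is straightforward to compute; a direct transcription (our code, not the paper's MATLAB):

```python
import numpy as np

def normwise_backward_error(A, x, b):
    """Normwise backward error of x for Ax = b in the inf-norm [16, Thm 7.1]."""
    r = b - A @ x
    return np.linalg.norm(r, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
        + np.linalg.norm(b, np.inf))
```

GMRES-IR accepts an iterate once this quantity is small enough, of the order of the working precision unit roundoff.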
The numerical experiments were performed on a Mac laptop with an Intel Core i5 and 8 GB of RAM. We use 13 matrices from the SuiteSparse Matrix Collection [5], selected as representative of all 36 n × n matrices in the collection with n ≤ 400 and maximum absolute value of a matrix entry greater than x_max for fp16. We restrict to n ≤ 400 because the time required to LU factorize in half precision is very high for larger n, because of the overheads of the fp16 class. Table 4.1 lists the matrices along with their properties and the underlying application. The right-hand side of each system is generated as randn(n,1), and the random number generator is seeded for reproducibility. Our test codes are available at https://github.com/SrikaraPranesh/fp16Scaling.

Table 4.1: Selected matrices from the SuiteSparse Matrix Collection. '*' denotes that the matrix is symmetric. The categories that the matrices come from are 1-2: chemical process simulation, 3-6: CFD, 7: materials problem, 8: optimal control, 9-11: structural problems, and 12-13: 2D/3D problem sequence.

We first consider the use of Algorithms 2.1 and 2.2 in GMRES-IR. The results are displayed in Table 4.2. We see that Algorithm 2.1 often requires a large number of GMRES iterations for iterative refinement to converge. Algorithm 2.2 requires fewer GMRES iterations than Algorithm 2.1, but iterative refinement fails to converge for matrices 12 and 13. Failure to converge was in each case because underflow causes the matrices to be singular in fp16. From these numerical experiments we conclude that Algorithms 2.1 and 2.2 are not reliable ways to convert a matrix to fp16.
Next we consider the performance of Algorithm 3.1 using Algorithms 2.4 and 2.5 for the scaling. The results, in Table 4.3, show that both diagonal scalings perform very well. We make two observations. First, in the (half, single, double) case, no iterations are required for many of the matrices. This is because one or both of ‖A‖_∞ and ‖x‖_∞ are so large that (4.1) is satisfied for the initial solution. Second, we note that convergence is achieved in every case, and with generally fewer iterative refinement steps and GMRES iterations than for Algorithms 2.1 and 2.2.
To explore these results further, we note that the analysis in [3], [4] actually shows that convergence of iterative refinement will be achieved provided that κ∞(MA)u is sufficiently less than 1, where MA, with M = µS Û^{-1} L̂^{-1} R, is the preconditioned matrix in Algorithm 3.1. (The bounds in Table 3.1 are a weakening of this condition.) Table 4.4 shows the values of κ∞(MA) for a working precision of double; for single precision the quantities are broadly similar. We see that κ∞(MA)u ≪ 1 for matrices 1-11. For matrices 12 and 13 the preconditioned matrix MA is extremely ill conditioned, like A, but Algorithms 2.4 and 2.5 nevertheless converge, even though κ∞(MA) ≫ 1. The scaling matrices could be rounded so that their diagonal entries are powers of 2, in order to avoid rounding errors in applying them, as is done in the LAPACK xyyEQUB routines, for example. We have not rounded the scalings in the results reported here. We did try such rounding, but it had no effect on the number of GMRES iterations, which we attribute to rounding errors in applying the scaling (which is done at the working precision) being negligible compared with those in the rounding to fp16.
Based on these numerical experiments with GMRES-IR we conclude that Algorithms 2.4 and 2.5 are both very effective as scaling algorithms within Algorithm 2.3 and are significantly better than Algorithms 2.1 and 2.2. If a symmetry-preserving LU-type factorization is to be computed then Algorithm 2.5, which preserves symmetry, should be preferred.
At the end of section 2 we explained that the purpose of θ in Algorithm 2.3 is to scale up the matrix entries in order to avoid subnormal numbers and underflow. To check the effect of θ we compared the results for θ = 0.1, for which the largest entry in absolute value of the scaled matrix is 0.1x_max, and θ = 1/x_max, for which the entries lie in the interval [-1, 1]. We found that setting θ = 1/x_max does not lead to any significant change in the number of GMRES iterations. However, as shown in Table 4.5, for some matrices, notably matrices 7, 12, and 13, the percentage of nonzero elements that underflow drops significantly for θ = 0.1 compared with θ = 1/x_max. The percentage of entries that become subnormal is very small (always less than 1 percent), but again θ = 0.1 leads to a smaller percentage than θ = 1/x_max.
5. Another scaling strategy. One can consider other transformations, in addition to Algorithm 2.3, in an attempt to exploit the fp16 range fully. Let A denote the matrix after the initial diagonal scaling using either Algorithm 2.4 or Algorithm 2.5 and define a_min = min_{i,j} a_ij and a_max = max_{i,j} a_ij. In this section we explore the idea of carrying out a linear transformation of the elements of A to an interval [z_1, z_2] that exceeds [a_min, a_max] but fits within the fp16 range:

(5.1)    C = αA + βee^T,    α = (z_2 - z_1)/(a_max - a_min),    β = (z_1 a_max - z_2 a_min)/(a_max - a_min),

where e = [1, 1, ..., 1]^T, so that c_ij = α a_ij + β maps a_min to z_1 and a_max to z_2. The transformation (5.1) preserves information in the rank-1 term βee^T and can greatly improve the condition number of the matrix. However, a drawback is that it fills in any zero elements, so it destroys sparsity. We also note that when z_1 = -z_2 and a_min = -a_max we have β = 0, so that C = αA and the condition number does not change.
Another possibility is [z_1, z_2] = [µx_min, θx_max], where µ ∈ {0, θ}. This transformation maps the a_ij onto the nonnegative interval [µx_min, θx_max]. We could alternatively use this choice of [z_1, z_2] with a_min and a_max redefined over the nonnegative entries of A only. Here, we can set µ = 0 to preserve sparsity. Now the transformation maps the nonnegative a_ij onto [µx_min, θx_max], but the nonpositive elements are not constrained and can overflow.
In the context of a system of linear equations Ax = b, the transformation (5.1) yields (C - βee^T)x = αb. This equation can be solved using the Sherman-Morrison formula: with Cw = αb and Cv = e,

x = w + (βe^T w / (1 - βe^T v)) v,

which requires the solution of two linear systems with C; these can be done by GMRES-IR with the need for just one LU factorization.
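A dense sketch of the Sherman-Morrison solve (our code; in the GMRES-IR setting the two solves with C would reuse a single LU factorization rather than call a dense solver twice):

```python
import numpy as np

def solve_rank1_modified(C, b, alpha, beta):
    """Solve (C - beta*e*e^T) x = alpha*b via the Sherman-Morrison formula,
    using two linear solves with C."""
    e = np.ones(C.shape[0])
    w = np.linalg.solve(C, alpha * b)   # C w = alpha*b
    v = np.linalg.solve(C, e)           # C v = e
    return w + beta * (e @ w) / (1.0 - beta * (e @ v)) * v
```

The formula is valid provided 1 - βe^T C^{-1}e ≠ 0, i.e. provided C - βee^T is nonsingular.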
Our experience in the context of GMRES-IR is that this transformation generally does not provide any benefits over Algorithm 2.3 on its own, so we do not pursue it here. In other contexts the transformation could prove useful. For example, it can potentially reduce the condition number, as demonstrated in (5.2), and if the matrix has positive entries then using (5.1) with [z_1, z_2] = [θx_min, θx_max] prevents entries from overflowing, underflowing, or becoming subnormal in the conversion to fp16.

6. Conclusions.
Converting a floating-point matrix to lower precision is not a trivial task when the lower precision format has a much narrower range than the original one, especially when the target is the fp16 arithmetic that is increasingly available in hardware. Overflow and underflow in the conversion are undesirable, as is the generation of subnormal numbers.
We have derived a new conversion algorithm that employs two-sided diagonal scaling along with a further scalar multiplication that moves the elements of largest magnitude close to the overflow threshold.Our particular interest is in GMRES-IR (Algorithm 3.1), which solves a linear system Ax = b, where A and b are given in double precision, using an LU factorization of an fp16 representation of A. Previous work has shown significant benefits in speed and energy usage over solving entirely in double precision [13], [14], [15], but in that work the conversion to fp16 was done by rounding and then mapping any infinities back to the nearest floating-point number.Our new conversion algorithm produces more reliable and faster convergence of GMRES-IR on badly scaled matrices, requiring fewer total GMRES iterations and usually fewer iterative refinement steps.It thereby allows a wider class of problems to be solved and so, given its negligible cost, is recommended for use with GMRES-IR.
In future work we will look at other uses of the new conversion algorithm in mixed precision algorithms that exploit fp16.

Because norms are not invariant under diagonal scaling, working with the diagonally scaled system would change the convergence test and hence the attained backward and forward errors, so it should be done only with knowledge of the problem. This is why LAPACK, the MATLAB backslash, and other standard solvers do not automatically scale linear systems. For further discussion of the role of scaling in solving Ax = b see [16, sec. 9.8].
The bound for [4] is from the forward error analysis in [4]; the backward error analysis requires only κ∞(A) ≤ 10^16.

Table 4.2: Total number of GMRES iterations required by GMRES-IR (Algorithm 3.1) using Algorithm 2.1 and Algorithm 2.2, for single and double as working precision. Failure to converge is denoted by "-". Numbers in parentheses indicate the number of iterative refinement steps.

Table 4.4: Condition number of MA in Algorithm 3.1 for a working precision of double, where M = µS Û^{-1} L̂^{-1} R and L̂ and Û are the LU factors of A^(h) computed in fp16.