Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores

Abstract. Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, for which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.


1. Introduction.
A new development in high performance computing is the emergence of hardware supporting low precision floating-point formats such as the 16-bit IEEE half precision format (fp16) and the 16-bit bfloat16 format [23]. Examples of such hardware include the NVIDIA P100 and V100 GPUs, the AMD Radeon Instinct MI25 GPU, Google's Tensor Processing Units (TPUs), the ARM NEON architecture [3], and the Fujitsu A64FX ARM processor [10]. Expected to join them in the near future are IBM's next generation AI chips [11] (supporting an 8-bit floating-point format in addition to fp16), and Intel's upcoming Xeon Cooper Lake [31] and Nervana Neural Network processors [28].
These new computing units execute low precision arithmetic faster than single precision (fp32), typically by a factor of 2. On the NVIDIA V100 GPU, however, thanks to special computing units called tensor cores, fp16 arithmetic executes up to 8 times faster than fp32 arithmetic.
This faster low precision arithmetic can be exploited in numerical algorithms. In [12], [13], [14], [15] it is shown how, on an NVIDIA V100, fp16 arithmetic can be used with mixed precision iterative refinement to solve a linear system Ax = b up to 4 times faster and with 80 percent less energy usage than an optimized double precision solver, with no loss of accuracy or stability. Similar improvements over a single precision solver for complex systems are obtained in [1]. Moreover, the same iterative refinement approach running on Summit [27], [32], the machine with 4608 nodes and six NVIDIA V100 GPUs per node that leads the November 2019 TOP500 list, has achieved a performance of 550 petaflops [13].

The tensor cores in the NVIDIA Volta and Turing architectures are able to carry out the operation D = C + AB, where all matrices are 4 × 4, in only one clock cycle and in precision fp32 [2]. Moreover, while they require the matrices A and B to be in the fp16 format, C and the result can be in fp16 or fp32. Pictorially, we have

  D (fp16 or fp32) = C (fp16 or fp32) + A (fp16) × B (fp16).

Tensor cores can therefore be seen as a generalization of a fused multiply-add (FMA) unit to 4 × 4 matrices, and they are an instance of what we call a "block FMA."

Multiprecision computing units called matrix units (MXUs) are available on Google TPUs [34]. They use bfloat16 rather than fp16 as the low precision format and they operate on square matrices of dimension 128 or 256. However, Google TPUs are not commercially available, and details of the computation in MXUs have not been made publicly available. Other vendors, including Intel and Arm, have announced hardware that will use block FMA units; see Table 1.1.
Block FMAs are inherently mixed precision units. Existing rounding error analyses will be pessimistic when applied to computations with block FMAs, as they do not reflect the mixed precision nature of the computation.
In this work we define a mixed precision block FMA that includes the forms provided by the hardware in Table 1.1 as special cases and should be general enough to include future units. We present algorithms for matrix multiplication and LU factorization with a block FMA and give detailed rounding error analyses of them. Our analysis provides more realistic error bounds than standard analysis and can be used to determine the optimal tradeoff between performance and accuracy. In particular, in the case of NVIDIA tensor cores our analysis and experiments show that storing C and D in the block FMA in fp32 rather than fp16 can significantly improve the accuracy and stability of the algorithms, while not hindering too much their performance.
We define a block FMA in section 2. In section 3 we show how to exploit it in matrix multiplication and give a rounding error analysis. We then test accuracy and performance on an NVIDIA V100 for several matrix multiplication variants. In section 4 we present an algorithm for LU factorization based on a block FMA and give a rounding error analysis for the factorization and the solution of Ax = b. We show that the analysis gives new insights into GMRES-based iterative refinement. Then we give numerical experiments on an NVIDIA V100 to illustrate the error analysis and test the performance of four LU factorization variants. Concluding remarks are given in section 5.
We will denote by fl_16 and fl_32 the operations of rounding to the fp16 and fp32 formats, and note that u_16 = 2^{-11} and u_32 = 2^{-24} are the respective unit roundoffs. With a standard abuse of notation we will write fl_p with an argument that is a matrix expression to denote that the expression is evaluated at some specified precision (possibly different from p) and then rounded to precision p. The absolute value of a matrix is defined componentwise: |A| = (|a_ij|).
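These unit roundoffs can be checked with NumPy, whose float16 and float32 types implement the IEEE formats (a quick sketch; NumPy's finfo reports the machine epsilon, which for round-to-nearest is twice the unit roundoff):

```python
import numpy as np

# machine epsilon is 2u for round-to-nearest, so u = eps/2
print(np.finfo(np.float16).eps / 2 == 2.0**-11)  # True: u_16
print(np.finfo(np.float32).eps / 2 == 2.0**-24)  # True: u_32
```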
2.1. General framework for mixed-precision block FMA. A scalar FMA has the form d = c + ab and is computed as the correctly rounded exact result, that is, with one rounding error rather than two [17, sect. 2.6]. Hardware implementing an FMA at the same speed as a single addition or multiplication first became available over twenty years ago.
We define a mixed precision block FMA to take as input matrices A ∈ R^{b_1×b}, B ∈ R^{b×b_2}, and C ∈ R^{b_1×b_2}, where A and B are provided in a given precision u_low and C is either in precision u_low or in a higher precision u_high. The block FMA computes D = C + AB and returns D in precision u_low or u_high. A key point is that the output D can be taken in precision u_high and used as the input C to a subsequent FMA. By chaining FMAs together in this way, larger matrix products can be computed, as we will show in the next section. There are two natural ways for the unit to evaluate D: as the correctly rounded exact result,

  D = fl(C + AB),  (2.1)

where the expression C + AB is computed exactly and rounded once to the output precision, or as

  D = fl(C + AB computed at a precision u equal to u_low or u_high).  (2.2)

In (2.1) every element of D is computed with just a single rounding error. In (2.2) there are multiple rounding errors per element of D. The evaluation (2.1) is ideal from an accuracy perspective, but in general it will be expensive. The NVIDIA Volta and Turing architectures use (2.2) with u = u_16 or u = u_32 [2].
The evaluation (2.1) with all inputs and the output at precision u low corresponds to the "long accumulator" proposed by Kulisch and Miranker [25], which has found use in interval arithmetic (see, for example, [24], [29]). Evaluating according to (2.1) is expensive, and although proposals have been made for implementing it in hardware (e.g., [5], [33]), it is not, to our knowledge, supported in commercial processors because of the hardware costs. However, manufacturers could implement something between (2.1) and (2.2) by using a little extra precision, perhaps the extended precisions defined in the IEEE standard [22] or the 80-bit registers on Intel processors. Indeed we note that NVIDIA's CUDA C++ Programming Guide [9] states that when C is in fp32 the computation is performed in at least single precision, allowing for the possibility of using extra precision in future implementations.
In order to capture the range of possible block FMA implementations in our analysis, we will use a model for the evaluation that includes a range of possibilities having (2.1) and (2.2) as the extreme cases. Denote the precision of the output of the FMA by u_FMA, which is either u_low or u_high. We assume that C + AB is computed at precision ū ≤ u_FMA and then rounded to precision u_FMA; if ū = u_FMA then the latter rounding is not needed. We will need the quantity

  γ_k = ku/(1 - ku).  (2.3)

Throughout the paper we assume that ku < 1 for the relevant integers k and precisions u. An accent on γ denotes that u carries that accent and a superscript on γ denotes that u also carries that superscript as a subscript; thus below we will use γ̄_k = kū/(1 - kū) and γ^FMA_k = ku_FMA/(1 - ku_FMA). By standard analysis for matrix multiplication [17, sect. 3.5], the matrix D̄ computed in precision ū satisfies

  |D̄ - D| ≤ γ̄_{b+1}(|C| + |A||B|).  (2.4)
Using (2.4), the final rounding to precision u_FMA gives D̂ = fl_FMA(D̄) satisfying

  |D̂ - D̄| ≤ u_FMA |D̂|.  (2.5)

Our model is therefore

  |D̂ - D| ≤ u_FMA |D̂| + γ̄_{b+1}(|C| + |A||B|).  (2.6)

Setting ū = 0 corresponds to (2.1) (to within O(u_FMA^2) because of the presence of D̂ instead of D on the right-hand side), while ū = u_low or u_high corresponds to (2.2). While the model (2.6) is useful, it is not so convenient when chaining FMAs, so in the next section we will directly analyze a chain of FMAs as it arises in Algorithm 3.1 for matrix multiplication.
For b = b_1 = b_2 we will refer to the block FMA as a b × b FMA.
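As an illustration, this block FMA model can be emulated in NumPy: A and B are rounded to fp16 (u_low), the expression C + AB is evaluated at an internal precision, and the result is rounded to the output precision u_FMA. This is only a simulation sketch (the function name and defaults are ours, and NumPy's float16/float32 stand in for the hardware formats):

```python
import numpy as np

def block_fma(C, A, B, out_dtype=np.float32, internal_dtype=np.float32):
    """Emulate D = C + A @ B for a mixed precision block FMA.

    A and B are rounded to fp16 (precision u_low); C + AB is evaluated
    at internal_dtype (the internal precision) and the result is rounded
    to out_dtype (precision u_FMA), as in evaluation (2.2).
    """
    A16 = A.astype(np.float16)  # inputs must be in precision u_low
    B16 = B.astype(np.float16)
    D = C.astype(internal_dtype) + A16.astype(internal_dtype) @ B16.astype(internal_dtype)
    return D.astype(out_dtype)  # final rounding to precision u_FMA
```

Taking out_dtype=np.float16 models an FMA that returns its result in u_low; with the default np.float32 the output can be fed back as the input C of a subsequent FMA without any intermediate precision loss.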

2.2. GPU tensor cores.
The tensor cores in the NVIDIA Volta and Turing architectures perform a b × b FMA for b = 4. As noted in section 1, while they require A and B to be fp16 matrices (that is, u_low = u_16), C and D can be either fp16 or fp32 (that is, u_high = u_32). We will consider two cases:

  TC16:  D^(16) = C^(16) + AB,    TC32:  D^(32) = C^(32) + AB,

where C^(16) and C^(32) denote an fp16 matrix and an fp32 matrix, respectively. We note that the NVIDIA CUDA C++ Programming Guide [9] states that "Element-wise multiplication of matrix A and B is performed with at least single precision. When .ctype or .dtype is .f32, accumulation of the intermediate values is performed with at least single precision. When both .ctype and .dtype are specified as .f16, the accumulation is performed with at least half precision." We therefore have ū = u_16 for TC16 and ū = u_32 for TC32. (For TC16, the multiplications are actually performed in precision u_32, but taking this into account makes the analysis more complicated and barely improves the resulting bounds.) The computed D̂ satisfies (2.6) with u_FMA = u_16 for TC16 and u_FMA = u_32 for TC32.
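The practical difference between TC16 and TC32 is easy to demonstrate in simulation: an fp16 accumulator stagnates once the running sum reaches 2048, the point at which the fp16 spacing exceeds 2, whereas an fp32 accumulator loses nothing in this example. A small NumPy sketch (our own construction, in which each 4 × 4 block product contributes exactly 1 to every entry):

```python
import numpy as np

Ak = np.ones((4, 4), dtype=np.float16)        # fp16 inputs, as the hardware requires
Bk = np.full((4, 1), 0.25, dtype=np.float16)  # each product entry is 4 * 0.25 = 1 exactly

def chain(out_dtype, q=2500):
    """Chain q block FMAs: products in fp32, accumulator rounded to out_dtype."""
    C = np.zeros((4, 1), dtype=out_dtype)
    for _ in range(q):
        C = (C.astype(np.float32) + Ak.astype(np.float32) @ Bk.astype(np.float32)).astype(out_dtype)
    return float(C[0, 0])

print(chain(np.float16))  # TC16-style: 2048.0 (the fp16 accumulator stagnates)
print(chain(np.float32))  # TC32-style: 2500.0 (exact)
```

Once the fp16 accumulator holds 2048, the fp32 intermediate 2049 rounds back to 2048 (round to nearest even), so the remaining contributions are lost entirely.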
3. Matrix multiplication with block FMA. In this section we describe an algorithm to exploit a block FMA in matrix multiplication. We perform the rounding error analysis of this algorithm and compare our error bounds with the results of numerical experiments using tensor cores on an NVIDIA V100.
3.1. Description of the algorithm. Consider matrices A ∈ R^{m×n} and B ∈ R^{n×t} partitioned into b_1 × b blocks A_ik and b × b_2 blocks B_kj, respectively, where for simplicity we assume that p = m/b_1, q = n/b, and r = t/b_2 are integers. We describe in Algorithm 3.1 how to multiply A and B in a way that can exploit a block FMA.

Algorithm 3.1. This algorithm performs the matrix multiplication C = AB using a block FMA, where A ∈ R^{m×n} and B ∈ R^{n×t} are partitioned as described above. The output C is in precision u_FMA, which is either u_high or u_low.

1: Convert A and B to precision u_low.
2: C = 0
3: for i = 1: p do
4:   for j = 1: r do
5:     for k = 1: q do
6:       Compute C_ij = C_ij + A_ik B_kj using a block FMA with output at precision u_FMA.
7:     end for
8:     If u > u_FMA then round C_ij to precision u.
9:   end for
10: end for

Note that the algorithm does not assume that A and B are given in precision u_low; in the context of mixed precision LU factorization [13], [14], for example, they may be given in u_high. We use the term "convert" on line 1 rather than "round" because in practice it might be necessary to do some kind of scaling to avoid overflow or underflow; see [21] for such considerations. However, in our analysis we will assume that underflow and overflow do not occur.
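Algorithm 3.1 can be sketched in NumPy as follows, emulating the block FMA with fp16 inputs and configurable internal and output precisions (function and parameter names are ours; the block sizes are assumed to divide the dimensions exactly):

```python
import numpy as np

def block_matmul(A, B, b1=4, b=4, b2=4, u_fma=np.float32, u_internal=np.float32):
    """Algorithm 3.1 sketch: C = A @ B via chained simulated block FMAs."""
    m, n = A.shape
    _, t = B.shape
    A16 = A.astype(np.float16)  # line 1: convert A and B to precision u_low
    B16 = B.astype(np.float16)
    C = np.zeros((m, t), dtype=u_fma)  # line 2
    for i in range(0, m, b1):
        for j in range(0, t, b2):
            for k in range(0, n, b):  # chain of q = n/b block FMAs
                Aik = A16[i:i+b1, k:k+b].astype(u_internal)
                Bkj = B16[k:k+b, j:j+b2].astype(u_internal)
                # one block FMA: C_ij + A_ik B_kj at the internal precision,
                # rounded to precision u_FMA
                C[i:i+b1, j:j+b2] = (C[i:i+b1, j:j+b2].astype(u_internal)
                                     + Aik @ Bkj).astype(u_fma)
    return C
```

With u_fma=np.float16 and u_internal=np.float16 this mimics the TC16 mode; the defaults mimic TC32.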

3.2. Rounding error analysis.
We now give a rounding error analysis for Algorithm 3.1. We recall from standard error analysis [17, chap. 3] that the computed value ŵ of an expression w = p_0 + p_1q_1 + · · · + p_nq_n evaluated in precision u from left to right can be expressed as

  ŵ = p_0(1 + θ_n) + Σ_{i=1}^n p_iq_i(1 + θ_{n+2-i}),  |θ_k| ≤ γ_k.

Let x denote a row of A and y a column of B, and write s_n = x^T y as

  s_n = Σ_{i=1}^q x_i^T y_i,  x_i, y_i ∈ R^b.

Algorithm 3.1 evaluates s_n by setting s_0 = 0 and forming

  s_i = s_{i-1} + x_i^T y_i,  i = 1: q,  (3.1)

where the right-hand side is evaluated in the block FMA at precision ū and we assume that the evaluation is from left to right, then rounding the result back to precision u_FMA, or to precision u if u > u_FMA. Since the block FMA produces output only at precision u_low or u_high, two roundings will be needed if u > u_FMA: first to precision u_FMA and then to precision u. We will include only a single rounding in our analysis, for simplicity; this has a negligible effect on the final error bound (see [30] for analysis of the effects of double rounding). The computed ŝ_i satisfies

  ŝ_i = (ŝ_{i-1}(1 + θ̄^{(i)}_b) + x_i^T y_i(1 + θ̄^{(i)}_{b+1}))(1 + δ_i),  |θ̄^{(i)}_k| ≤ γ̄_k,  (3.2)

where |δ_i| ≤ u if u > u_FMA and |δ_i| ≤ u_FMA otherwise. For i = 1, since s_0 = 0, (3.2) holds with b + 1 replaced by b in the θ term multiplying x_1^T y_1. Overall, we have

  ŝ_q = Σ_{i=1}^q x_i^T y_i(1 + α_i)(1 + β_i),  (3.3)

where, by [17, Lems. 3.1, 3.3] and using qb = n, |α_i| ≤ γ^FMA_q and |β_i| ≤ γ̄_n. Hence

  |ŝ_q - s_n| ≤ (γ^FMA_q + γ̄_n(1 + γ^FMA_q)) |x|^T |y|.  (3.4)

We note that if (3.1) is evaluated from right to left (which amounts to blocked summation [4]) then s_1 (for example) participates in q rather than n additions and so γ̄_n can be replaced by γ̄_{q+b-1} in (3.4). It is not hard to see that (3.4) is valid for all orders of evaluation of (3.1).
If the input matrices A and B are given in precision u_low then we can directly apply the above analysis to obtain the following bound for Algorithm 3.1.

Theorem 3.1. Let the product C = AB of A ∈ R^{m×n} and B ∈ R^{n×t}, given in precision u_low, be evaluated by Algorithm 3.1, where q = n/b. The computed Ĉ satisfies

  |Ĉ - C| ≤ (γ^FMA_q + γ̄_n(1 + γ^FMA_q)) |A||B|.  (3.5)

Table 3.1 summarizes the dominant term of the constant in (3.5) in several particular cases.

Table 3.1. Dominant term of the constant in the bound (3.5) in particular cases (q = n/b).

  u_FMA = u_low,  ū = u_low    (q + n)u_low ≈ nu_low
  u_FMA = u_low,  ū = u_high   qu_low + nu_high
  u_FMA = u_high, ū = u_high   (q + n)u_high
  u_FMA = u_low,  ū = 0        qu_low
  u_FMA = u_high, ū = 0        qu_high

Let us compare the constants in the table with the constant nu_low in the bound for a standard evaluation of C at precision u_low. The case u_FMA = ū = u_low has the same constant, so the block FMA offers no accuracy benefits in this case. But when u_FMA = ū = u_high the constant is a factor u_high/u_low smaller than for the standard evaluation, while for u_FMA = u_low and ū = u_high the constant is a factor b smaller.
If A and B are not given in precision u_low then we must account for the initial conversion to precision u_low. Assuming rounding with no underflow or overflow, we have

  fl_low(A) = A + ΔA, |ΔA| ≤ u_low|A|,    fl_low(B) = B + ΔB, |ΔB| ≤ u_low|B|.

Applying Theorem 3.1 to the product of the converted matrices and bounding the overall error yields the following result.
Theorem 3.2. Let the product C = AB of A ∈ R^{m×n} and B ∈ R^{n×t} (not necessarily given in precision u_low) be evaluated by Algorithm 3.1, where q = n/b and ū ≤ u_FMA. The computed Ĉ satisfies

  |Ĉ - C| ≤ ((1 + u_low)^2(1 + γ^FMA_q)(1 + γ̄_n) - 1) |A||B|,  (3.6)

whose constant is, to first order, 2u_low + γ^FMA_q + γ̄_n.

The bound (3.6) should be compared with [17, eq. (3.13)] for a standard matrix product that does not use a block FMA; with the initial conversion to precision u_low taken into account, the standard bound is, to first order,

  |Ĉ - C| ≤ (n + 2)u_low |A||B|.  (3.7)

We gather in Table 3.2 the dominant terms in the error bounds (3.6) for five block FMA implementations, with A and B given in precision u_high, and compare them with the bounds for standard multiplication in precision u_high or u_low (error bound (3.7)). For the sake of readability, in Table 3.2 we have assumed that u_high ≪ u_low.

Table 3.2. Dominant terms in the error bounds for matrix multiplication, with A and B given in precision u_high. Here, q = n/b and ū ≤ u_FMA is assumed.

  Evaluation method                       Bound
  Standard in precision u_low             (n + 2)u_low
  Standard in precision u_high            nu_high
  Block FMA, u_FMA = ū = u_low            (n + 2)u_low
  Block FMA, u_FMA = u_low, ū = u_high    (q + 2)u_low + nu_high
  Block FMA, u_FMA = ū = u_high           2u_low + (q + n)u_high
  Block FMA, u_FMA = u_low, ū = 0         (q + 2)u_low
  Block FMA, u_FMA = u_high, ū = 0        2u_low + qu_high

Let us now compare these five block FMA variants, using Table 3.2. Compared with the standard multiplication in precision u_low, the mixed-precision block FMA multiplication achieves smaller error bounds in all cases except u_FMA = ū = u_low. Let us first consider ū = u_high. With u_FMA = u_low, the bound is reduced by a factor approximately b, while with u_FMA = u_high, the factor of improvement is even larger and equal to min(n/2, u_low/u_high). Indeed, the bound (3.6) does not grow with n to first order (that is, as long as n ≤ 2u_low/u_high) and, for larger n, the bound becomes equivalent to the bound for standard multiplication in precision u_high. With ū = 0, the bounds are even smaller: for u_FMA = u_low the improvement is negligible since it amounts to removal of the nu_high term, while for u_FMA = u_high and nu_high ≫ 2u_low the bound is reduced by a factor approximately b.
The bounds above are worst-case and so may be pessimistic, especially for large dimensions (see the numerical results in the next subsection). To derive more realistic bounds, we can use a probabilistic model of the rounding errors [18]. A probabilistic analogue of Theorem 3.2 follows directly from [18, Thm. 3.1]. It contains a modified version of (3.6) with γ^FMA_q and γ̄_n replaced by relaxed constants γ^FMA_q(λ) and γ̄_n(λ) proportional to λq^{1/2}u_FMA and λn^{1/2}ū, respectively. This relaxed bound holds with a probability that is very close to 1 for λ of order 10 or so. In particular, this means that the block FMA bound with u_FMA = ū = u_high may only start growing with n for much larger n than the worst-case bound suggests (perhaps for n larger than 4(u_low/u_high)^2).
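The bound of Theorem 3.2 can be checked numerically with a simulated TC32-style dot product (a sketch with names of our choosing; the computed error is typically far below the first-order bound 2u_16 + qu_32 + nu_32):

```python
import numpy as np

u16, u32 = 2.0**-11, 2.0**-24

def tc32_dot(x, y, b=4):
    """x^T y via a chain of b-wide FMAs: fp16 inputs, fp32 accumulation."""
    x16, y16 = x.astype(np.float16), y.astype(np.float16)
    s = np.float32(0.0)
    for k in range(0, len(x), b):
        s = np.float32(s + x16[k:k+b].astype(np.float32) @ y16[k:k+b].astype(np.float32))
    return float(s)

rng = np.random.default_rng(0)
n = 1024
x, y = rng.random(n), rng.random(n)
err = abs(tc32_dot(x, y) - x @ y) / (np.abs(x) @ np.abs(y))
bound = 2*u16 + (n // 4)*u32 + n*u32  # first-order constant in (3.6)
print(err <= bound)
```

In this experiment the error is dominated by the initial fp16 conversion, so it sits well below the worst-case bound, consistent with the probabilistic analysis.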
We now apply these bounds to NVIDIA tensor cores. This amounts to taking b = 4, u_high = u_32, and u_low = u_16 in Theorem 3.1 and Theorem 3.2 (or Table 3.2). We gather the resulting bounds in Table 3.3, where we consider standard half and single precision multiplication (fp16 and fp32), and tensor core multiplication in either TC16 (u_FMA = ū = u_16) or TC32 (u_FMA = ū = u_32) mode.

Table 3.3. Dominant terms in the error bounds (from Table 3.2) for NVIDIA tensor cores, for which b = 4, u_high = u_32, and u_low = u_16.

  Precision of A and B   Standard fp16   Tensor core, TC16     Tensor core, TC32     Standard fp32
                                         (u_FMA = ū = u_16)    (u_FMA = ū = u_32)
  u_16                   nu_16           nu_16                 nu_32                 nu_32
  u_32                   (n + 2)u_16     (n + 2)u_16           2u_16 + nu_32         nu_32

We see clearly that the TC32 mode achieves a much smaller bound than the TC16 mode, by removing the constant n multiplying the u_16 term and relegating it to the u_32 term.
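For a concrete feel of the gap, evaluate the dominant terms of Table 3.3 at n = 4096 (simple arithmetic of our own, not a vendor figure): the TC16-style bound nu_16 is useless at this size, while the TC32-style bound remains small.

```python
u16, u32 = 2.0**-11, 2.0**-24
n = 4096
tc16_bound = n * u16           # = 2.0: no correct digits guaranteed
tc32_bound = 2*u16 + n * u32   # ~= 1.2e-3
print(tc16_bound, tc32_bound)
```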

3.3. Numerical experiments with tensor core matrix multiplication. We now present numerical experiments to investigate whether the error bounds correctly predict the relative accuracy of the different methods and to assess the performance of the methods. We use the implementation of the four matrix multiplication variants provided in the cuBLAS library v10.1. The matrices A and B are random, with entries sampled uniformly from [0, 10^{-3}] or [-1, 1]. We set the upper bound to 10^{-3} in the former case to avoid overflow in the multiplication for large sizes. Matrices are generated in single precision in order to compare the accuracy of the fp16, TC16, and TC32 computations with that of fp32. As a result, for the first three methods the matrices A and B are converted to half precision prior to multiplication. To assess the sharpness of the bounds we consider A ∈ R^{m×n} and B ∈ R^{n×t} with n varying from 1024 to approximately 2 × 10^6, and in order that the matrices fit on a single GPU we take m = t = 8. Figures 3.1 and 3.2 plot the componentwise error max_{i,j} |Ĉ - C|_{ij}/(|A||B|)_{ij} for increasing values of n, where the exact product C is approximated by a standard double precision multiplication computed with cuBLAS. This error measure is not the true relative forward error but is what the analysis bounds. Bounds associated with each variant are represented by dashed lines. Once a bound reaches the value 1, we set it to 1, as it then provides no useful information.
As expected, almost all the errors are relatively far from their worst-case bounds. Moreover, the errors in Figure 3.2 are generally much smaller than their counterparts in Figure 3.1. For matrices with positive entries (Figure 3.1) the errors for fp16 and TC16 range from 10^{-3} to 1 and exceed 0.1 for n ≈ 10^6. Interestingly, TC16 achieves a noticeably smaller error than fp16, even though they both have the same error bound. We suspect this might be because the matrix multiplication algorithm of cuBLAS implements blocked summation which, as mentioned previously, achieves a bound a factor b = 4 smaller. More importantly, the errors for TC32 and fp32 are much smaller than those for fp16 and TC16: between 10^{-5} and 10^{-2} times as large for TC32 and between 10^{-7} and 10^{-5} times as large for fp32. We observe that the error bound for TC32 is rather loose (a roughly constant gap of two orders of magnitude) but still insightful, as it captures the growth of the error with n. Finally, we mention that the fact that the error decreases with n in Figure 3.2 is related to the entries of the matrices being distributed uniformly in [-1, 1] and thus having zero mean, as explained in the recent probabilistic analysis for random data [19].
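The error measure used in these experiments can be computed as follows (a small helper of our own; the exact product is approximated in double precision, as in the experiments):

```python
import numpy as np

def compwise_error(C_hat, A, B):
    """max_ij |C_hat - C|_ij / (|A||B|)_ij, with C = A @ B in double precision."""
    C = A.astype(np.float64) @ B.astype(np.float64)
    denom = np.abs(A).astype(np.float64) @ np.abs(B).astype(np.float64)
    return float(np.max(np.abs(C_hat.astype(np.float64) - C) / denom))
```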
In order to give a baseline for the performance of each matrix multiplication method we run additional simulations for square matrices (with uniform entries). Figure 3.3 shows the maximum flop rate out of five runs (in TFlops) for each multiplication method and each matrix size. We see that although the TC16 variant performs slightly better than the TC32 variant for matrix sizes up to n = 8000, the two variants have similar asymptotic performance for n > 8000. In that range of sizes the flop rates associated with tensor core-enabled multiplication are about 3.5× larger than fp16 multiplication (3.3 to 3.6) and about 7× larger than single precision multiplication (6.8 to 7.3), both executed on CUDA cores. The flop rate of TC16 multiplication reaches a maximum of 101.2 TFlops (about 90% of the theoretical performance, namely 112.7 TFlops) for n = 8000. Our performance results are in good agreement with other existing benchmarks, e.g., [26].
4. Solution of linear systems with block FMA. Now we consider the solution of linear systems Ax = b by LU factorization, where A is a dense n × n matrix. Since LU factorization can be formulated to exploit matrix multiplication, it can benefit from using a block FMA. Algorithm 4.1 computes an LU factorization using a block FMA. The algorithm employs three precisions: the working precision u and the precisions u_low and u_high used by the block FMA within the call to Algorithm 3.1 on line 9. We assume that A is given in precision u and that precision u is used on lines 2 and 4. Other versions of the algorithm can be defined by varying these precisions.
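A minimal NumPy sketch of Algorithm 4.1 (unpivoted, with the Schur complement update performed through simulated block FMAs; all names are ours and the unblocked factorization is a plain Doolittle elimination):

```python
import numpy as np

def unblocked_lu(Akk):
    """LU factorization of a small block by Gaussian elimination, no pivoting."""
    b = Akk.shape[0]
    L = np.eye(b, dtype=np.float32)
    U = Akk.astype(np.float32).copy()
    for i in range(b):
        for r in range(i + 1, b):
            L[r, i] = U[r, i] / U[i, i]
            U[r, :] -= L[r, i] * U[i, :]
    return L, np.triu(U)

def block_lu(A, b=4):
    """Algorithm 4.1 sketch: block LU with the update done via simulated
    block FMAs (L and U blocks rounded to fp16, accumulation in fp32)."""
    n = A.shape[0]
    q = n // b                       # b is assumed to divide n
    A = A.astype(np.float32).copy()  # working precision u = fp32
    L = np.zeros((n, n), dtype=np.float32)
    U = np.zeros((n, n), dtype=np.float32)
    for k in range(q):
        K = slice(k*b, (k+1)*b)
        L[K, K], U[K, K] = unblocked_lu(A[K, K])       # line 2
        for i in range(k + 1, q):
            I = slice(i*b, (i+1)*b)
            # line 4: solve L_ik U_kk = A_ik and L_kk U_ki = A_ki
            L[I, K] = np.linalg.solve(U[K, K].T, A[I, K].T).T
            U[K, I] = np.linalg.solve(L[K, K], A[K, I])
        for i in range(k + 1, q):
            I = slice(i*b, (i+1)*b)
            for j in range(k + 1, q):
                J = slice(j*b, (j+1)*b)
                Lik = L[I, K].astype(np.float16).astype(np.float32)  # line 8: fl_low
                Ukj = U[K, J].astype(np.float16).astype(np.float32)
                A[I, J] = A[I, J] - Lik @ Ukj                        # line 9
    return L, U
```

The fp16 rounding on line 8 is what distinguishes this from an ordinary fp32 block LU; on well-conditioned matrices the residual it introduces is of order u_16 rather than u_32.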

4.1. Rounding error analysis.
We now perform a rounding error analysis of Algorithm 4.1 and of its use to solve linear systems Ax = b. We begin with a simple application of Theorem 3.1.

Algorithm 4.1. This algorithm computes a block LU factorization A = LU of A ∈ R^{n×n}, partitioned into b × b blocks with q = n/b, using a block FMA for the updates.

1: for k = 1: q do
2:   Factorize L_kk U_kk = A_kk.
3:   for i = k + 1: q do
4:     Solve L_ik U_kk = A_ik and L_kk U_ki = A_ki for L_ik and U_ki.
5:   end for
6:   for i = k + 1: q do
7:     for j = k + 1: q do
8:       L_ik ← fl_low(L_ik) and U_kj ← fl_low(U_kj).
9:       A_ij ← A_ij - L_ik U_kj using Algorithm 3.1.
10:    end for
11:  end for
12: end for

Corollary 4.1. Let D = A - XY, where X ∈ R^{b_1×n} and Y ∈ R^{n×b_2} are given in precision u_low and A ∈ R^{b_1×b_2}, be evaluated by Algorithm 3.1 with the accumulator initialized to A, where q = n/b. The computed D̂ satisfies

  |D̂ - D| ≤ (γ^FMA_{q+1} + γ̄_{n+1}(1 + γ^FMA_{q+1}))(|A| + |X||Y|).  (4.1)

Proof. Write X = [X_1, . . . , X_q] and Y = [Y_1^T , . . . , Y_q^T]^T and apply Theorem 3.1, where the +1 subscript on γ comes from the initial subtraction with A.
We also need the following lemma for the next theorem.

Lemma 4.2. If the LU factorization of A ∈ R^{b×b} is computed by Gaussian elimination in precision u then the computed factors satisfy

  L̂Û = A + ΔA,  |ΔA| ≤ γ_b |L̂||Û|.  (4.2)

Moreover, the computed solution X̂ to the multiple right-hand side system TX = B, where T ∈ R^{b×b} is nonsingular and triangular, and B ∈ R^{b×c}, satisfies

  |TX̂ - B| ≤ γ_b |T||X̂|.  (4.3)

Theorem 4.3. Let A ∈ R^{n×n} and let L̂ and Û be the LU factors computed by Algorithm 4.1. Then

  L̂Û = A + ΔA,  |ΔA| ≤ c(n, b, q)(|A| + |L̂||Û|),  (4.6)

where, to first order, c(n, b, q) = 2u_low + γ_b + γ^FMA_{q+1} + γ̄_{n+1}.

Proof. The (i, k) block of the L factor, for i > k, is computed by solving

  L_ik Û_kk = R_ik,  R_ik = A_ik - Σ_{j=1}^{k-1} L̃_ij Ũ_jk,

where L̃ and Ũ denote the computed factors that have been converted to precision u_low (line 8 of Algorithm 4.1) and satisfy L̃_ij = L̂_ij + E_ij and Ũ_jk = Û_jk + F_jk, where |E_ij| ≤ u_low|L̂_ij| and |F_jk| ≤ u_low|Û_jk|. By Corollary 4.1, the computed R̂_ik satisfies, since k ≤ q,

  |R̂_ik - R_ik| ≤ (γ^FMA_{q+1} + γ̄_{n+1}(1 + γ^FMA_{q+1}))(|A_ik| + Σ_{j=1}^{k-1} |L̃_ij||Ũ_jk|),

and by Lemma 4.2 the computed L̂_ik satisfies

  |L̂_ik Û_kk - R̂_ik| ≤ γ_b |L̂_ik||Û_kk|.

Combining these two inequalities gives

  |L̂_ik Û_kk + Σ_{j=1}^{k-1} L̃_ij Ũ_jk - A_ik| ≤ (γ_b + γ^FMA_{q+1} + γ̄_{n+1}(1 + γ^FMA_{q+1}))(|A_ik| + Σ_{j=1}^{k-1} |L̃_ij||Ũ_jk| + |L̂_ik||Û_kk|).  (4.4)

Replacing L̃_ij by L̂_ij + E_ij and Ũ_jk by Û_jk + F_jk, we obtain, to first order,

  |Σ_{j=1}^{k} L̂_ij Û_jk - A_ik| ≤ (2u_low + γ_b + γ^FMA_{q+1} + γ̄_{n+1})(|A_ik| + Σ_{j=1}^{k} |L̂_ij||Û_jk|).  (4.5)

We conclude that (4.5) holds for i > k. For i = k, L_kk is determined with U_kk on line 2 of Algorithm 4.1, and by (4.2) we have |L̂_kk Û_kk - R̂_kk| ≤ γ_b |L̂_kk||Û_kk|. Therefore (4.5) holds for i = k, too, and hence so does (4.6). In a similar way, the inequality (4.6) can be shown to hold for i < k.
Theorem 4.4. Let the system Ax = b, with A ∈ R^{n×n}, be solved by Algorithm 4.1 followed by substitution with the computed LU factors in precision u. The computed x̂ satisfies

  (A + ΔA)x̂ = b,  |ΔA| ≤ c'(n, b, q)(|A| + |L̂||Û|),

where, to first order, c'(n, b, q) = 2u_low + γ_b + γ^FMA_{q+1} + γ̄_{n+1} + 2γ_n.

Proof. The result is obtained by combining Theorem 4.3 with the error analysis for the solution of triangular systems [17, Thm. 8.5] and is analogous to the proof of [17, Thm. 9.4].

We gather in Table 4.1 the dominant terms in the error bound for LU factorization and the solution of linear systems using NVIDIA tensor cores, for which b = 4, u = u_high = u_32, and u_low = u_16. We distinguish the same four variants of matrix multiplication as in Table 3.3. In the fp16 and fp32 cases, we naturally take the working precision to be u = u_16 and u = u_32, respectively, and the bounds are the standard ones [17, Thms. 9.3, 9.4]. In the TC16 case, both u = u_16 and u = u_32 are possible, but since the FMA uses u_FMA = ū = u_16, we might as well take the working precision to be u = u_16. Finally, in the TC32 case, in order to preserve the accuracy benefit of using an FMA and avoid the error growing with n to first order, we must take u = u_FMA = ū = u_32. In both tensor core cases u = u_FMA, so no additional rounding of the output of Algorithm 3.1 to the working precision is needed.
Overall, these bounds lead to the same conclusions as in the matrix multiplication case: the TC16 bound is almost identical to the fp16 one and exhibits linear growth with n, while the TC32 variant leads to a much smaller bound, which only starts growing with n when n ≳ 2u_16/u_32 = 16384 (LU factorization) or n ≳ (2/3)u_16/u_32 ≈ 5461 (linear system), at which point it becomes almost equivalent to the fp32 bound.
Our analysis is applicable to the work in Haidar et al. [14], in which an implementation of Algorithm 4.1 on an NVIDIA V100 was used with single precision as the working precision and fp16 or TC32 for the matrix multiplications. The resulting LU factorization was used as a preconditioner in GMRES-based iterative refinement [6], [7]. In the experiments reported in [14], the total number of GMRES iterations (a good measure of the cost of refinement) for TC32 was at most half that for fp16 (with a significant increase in performance too). This is what would be expected from Table 4.1, where the LU factorization error constant for TC32 is a factor ranging from approximately 900 to 5500 smaller than that for fp16 for the matrix sizes n ∈ [2000, 34000] used in those experiments.

4.2. Numerical experiments with tensor core LU factorization. We now present experiments testing the accuracy and performance of the LU factorization computed by Algorithm 4.1 for solving Ax = b on an NVIDIA V100 GPU. Our implementation does not use pivoting in the LU factorization and it performs all the operations (factor, solve, and update) solely on the GPU. We use our own CUDA kernels for the factor and solve operations and the cublasGemmEx routine from the cuBLAS library (the same routine tested in section 3 for matrix multiplication) for the update operation. In the following experiments, we use fp32 as the working precision and compare the four variants fp16, TC16, TC32, and fp32 listed in Table 4.1. The test matrices are randomly generated with specified singular values σ_i = 10^{-c(i-1)/(n-1)}, i = 1: n. The resulting matrix has singular values lying between 1 and 10^{-c} and thus a condition number equal to 10^c. In our experiments we set c = 3.
In Figure 4.1 we show the performance of the fp16, TC16, TC32, and fp32 variants for square matrices with n ranging between 1000 and 45000. Note that we do not include the times for the forward and backward substitutions in these results because the cost of the factorization largely dominates the total cost of solving the linear system. In this figure we see that the fp32 variant asymptotically reaches 10 TFlops and that, as expected, the fp16 variant achieves twice the performance of that variant, at around 20 TFlops. Note that the cuSOLVER library, as part of the CUDA toolkit, provides a single precision LU factorization routine called cusolverDnSgetrf corresponding to our fp32 variant. For the sake of clarity we do not include experimental results for the cusolverDnSgetrf routine, but we observed that our implementation achieves similar performance. The TC16 and TC32 variants achieve much higher asymptotic performance, around 36 and 32 TFlops respectively, due to the use of the tensor cores. Although the TC16 and TC32 variants show similar performance behavior, the TC16 variant is slightly less efficient on the smaller matrices. On the largest matrix, though, the TC16 variant offers slightly better performance than TC32, which is consistent with the performance results for the matrix multiply operation shown in Figure 3.3. We also measure the backward error of the computed solution for each variant. As in the matrix multiplication experiments, the TC16 variant gives a smaller backward error than the fp16 one, even though their bounds are identical: this might again be explained by the use of blocked summation within the cuBLAS implementation of matrix multiplication. The TC32 variant gives a backward error between one and two orders of magnitude smaller than the fp16 and TC16 variants, as could be expected from the bounds shown in Table 4.1. The fp32 variant gives the smallest backward error but is up to 3 times slower than the TC32 variant, as shown by Figure 4.1.
We conclude from these results that the TC32 variant offers the best performance versus accuracy tradeoff: it exploits the performance capabilities of the tensor cores and has similar performance to the TC16 variant, while giving much smaller backward errors than the fp16 and TC16 variants and backward errors only 1 to 1.5 orders of magnitude larger than those for fp32.

5. Conclusion.
We have considered a general mixed precision block FMA unit that carries out a fused multiply-add operation D ← C + AB on b × b matrices. This block FMA generalizes the usual scalar FMA in two ways. First, it works on matrices (for b > 1) instead of scalars. Second, it takes A and B stored in precision u_low and C stored in precision u_low or u_high (where u_high < u_low), and returns D in precision u_FMA (equal to u_high or u_low), carrying out the internal computation at precision ū.
We have proposed matrix multiplication and LU factorization algorithms that exploit such units and given detailed rounding error analyses of the algorithms, distinguishing several variants depending on the choice of each precision parameter.
If u_FMA = u_low and ū = u_high, then a b × b block FMA leads to error bounds a factor b smaller than those for conventional algorithms in precision u_low. More significantly, by storing C and D in precision u_high (that is, u_FMA = u_high), the error bounds are reduced from O(nu_low) to cu_low + O(nu_high), where c is independent of the problem size n. Assuming u_high ≪ u_low, we obtain bounds with a much weaker dependence on n, which suggests we can obtain more accurate results than with algorithms that use only one precision, u_low.
We applied our analysis to the tensor core units available in the NVIDIA Volta and Turing GPUs, which are specific block FMA units with b = 4 and with fp16 and fp32 precisions. We compared two variants, TC16 (u_FMA = ū = u_16) and TC32 (u_FMA = ū = u_32), which differ in the precision used for the output of the tensor cores. Our analysis predicts the TC32 variant to be much more accurate than the TC16 one. Our numerical experiments confirm this prediction and show that the accuracy boost is achieved with almost no performance loss. Our analysis can be applied to other matrix factorizations that are able to exploit a block FMA, such as mixed precision QR factorization (used in [8]) and mixed precision Cholesky factorization (used in [20]). The analysis is sufficiently general that it should be applicable to future block FMAs, including those in Table 1.1.