A CLASS OF FAST AND ACCURATE SUMMATION ALGORITHMS

. The need to sum ﬂoating-point numbers is ubiquitous in scientiﬁc computing. Standard recursive summation of n summands, often implemented in a blocked form, has a backward error bound proportional to nu , where u is the unit roundoﬀ. With the growing interest in low precision ﬂoating-point arithmetic and ever increasing n in applications, computed sums are more likely to have insuﬃcient accuracy. We propose a class of summation algorithms called FABsum (for “fast and accurate block summation”) that applies a fast summation algorithm (such as recursive summation) blockwise and then sums the partial sums using an accurate summation algorithm (such as compensated summation, or recursive summation in higher precision). We give a rounding error analysis to show that FABsum with a ﬁxed block size b has a backward error bound ( b +1) u + O ( u 2 ), which is independent of n to ﬁrst order. Our computational experiments show that with a suitable choice of b (independent of n ) FABsum can deliver substantially more accurate results than blocked recursive summation, with only a modest drop in performance. FABsum is especially attractive for low precisions, where it can provide correct digits for much larger n than recursive summation.

1. Introduction. Summation is a key computational task at the heart of many numerical algorithms, most notably numerical linear algebra kernels involving inner products, such as matrix-vector or matrix-matrix multiplications, matrix factorizations, and the solution of linear systems. In floating-point arithmetic, summation incurs rounding errors. Backward error bounds for the basic summation algorithms are proportional to nu, where n is the number of summands and u the unit roundoff, and thus they grow linearly with n.
While compensated summation algorithms achieving backward error bounds that do not grow (or grow slowy) with n are known [10, sec. 4.3], they are computationally expensive. Indeed running a basic summation algorithm in higher precision (e.g., using double precision rather than single precision) may provide a better performance/accuracy tradeoff. Compensated summation algorithms are thus only used when the highest precision available is not enough for the application at hand, and they are generally not available in optimized libraries such as the Intel Math Kernel Library or the NAG Library.
The rise of large-scale, low-precision computations presents new challenges. On the one hand, modern supercomputing allows larger and larger problems to be solved, and we must routinely evaluate sums of 10 8 or more terms. On the other hand, the use of low floating-point precisions (such as IEEE half precision, for which u ≈ 4.88×10 −4 ) is increasingly common [2], [8], [12]. For such large sizes and low precisions, error bounds of order nu exceed 1, and so do not provide any meaningful information. This creates a dilemma, as we need to choose between benefiting from the speed, lower energy requirements, and reduced data movement of low precision arithmetic and being able to accurately solve large problems.
In this work we tackle this dilemma by introducing a new class of summation algorithms that excels in both performance and accuracy. These algorithms achieve backward error bounds that do not grow with n (that is, the bounds are of the form cu + O(u 2 ), where c is a moderate constant independent of n) and, at the same time, they can deliver a similar performance to optimized, standard algorithms by performing an arbitrarily low number of extra flops (e.g., less than a 1% overhead) and by allowing efficient implementation on modern computers.
We first review existing summation algorithms in section 2. Then, in section 3, we present the new class of algorithms, called FABsum, perform a rounding error analysis for algorithms in this class, and analyze their cost in a general framework. In section 4 we apply FABsum to several key numerical linear algebra algorithms. In section 5, we assess the practical performance and accuracy of FABsum through numerical experiments. We provide our concluding remarks in section 6.
2. Existing summation algorithms. We begin by briefly reviewing existing algorithms to compute the sum s = n i=1 x i . We denote by s the computed value of s in floating-point arithmetic. We use the standard model of floating-point arithmetic [10, sec. 2.2] Note that this model does not account for the possibility of underflow (or overflow). However, an addition or subtraction incurring underflow is necessarily exact [9], [10,Prob. 2.19] and therefore our results for summation hold even in presence of underflow. This does not extend, though, to the more general algorithms based on summation in section 4. We note that sharper error bounds can be obtained in some cases by making the stronger assumption that the floating-point arithmetic uses round to nearest [16], [19], [24]. However, we prefer to restrict to the model (2.1) in order that our results have the widest possible applicability, for example to directed or stochastic rounding.
We mention that summation algorithms have been proposed that achieve reproducibility (see, e.g., [4], [5]). Reproducibility and accuracy are two distinct objectives, since even very accurate algorithms can yield different results on different runs. In this article, we target accuracy rather than reproducibility.
2.1. Recursive summation. The most basic algorithm to compute s is recursive summation, which starts with s = x 1 and computes The computed sum satisfies [10, sec. 4.2] where γ n = nu/(1 − nu) for nu < 1. This backward error bound grows linearly with n and is almost attainable [10, Prob. 4.2].

Blocked summation.
In order to allow a sum to be computed in parallel, high-performance libraries divide the summands into n/b blocks of size b. From now on, n is assumed to be a multiple of b for simplicity. First, the local sums are computed with recursive summation, and then s = n/b i=1 s i is also obtained with recursive summation.
By applying (2.2) to s i and s we find that the computed s from blocked summation satisfies and so the backward error bound depends on the block size b. The constant b+n/b−2 in the bound is minimized for b = √ n, for which it equals 2 √ n − 2. However, for performance reasons b is often chosen to be a moderate constant (such as 128 or 256), in which case the error bound for blocked summation still grows linearly with n.

Pairwise summation.
Pairwise summation (also known as tree summation or binary cascade summation) generalizes blocked summation by recursively computing the local sums (2.3) with blocked summation, continuing until sums of two terms are obtained. There are thus at most log 2 n levels of recursion, where · denotes the ceiling function, which leads to the backward error result [10, sec. 4.2] in which the bound grows with n at a much slower rate than for blocked summation. However, pairwise summation is not well suited for an efficient implementation on parallel computers, and it usually delivers poor performance compared with blocked summation [3]. In other words, the error in a floating-point addition is itself a floating-point number. Compensated summation algorithms approximate the error term e at each step of recursive summation in order to correct the computed sum. Many such algorithms have been designed and analyzed over the years. Here we focus on Algorithm 2.1 [17], [22], and we note that the augmented arithmetic operations in the 2019 revision of the IEEE 754 standard for floating-point arithmetic [15] can be exploited in the algorithm. The backward error result for this algorithm is [7,Thm. 8]  Algorithm 2.1 requires 3 extra flops per loop iteration, which is typically more expensive than simply switching to a higher precision. For this reason, compensated summation is usually worth considering only when computations are already taking place at the highest precision available on the target hardware.
3. Fast and accurate blocked summation. Assume that we have at our disposal two summation algorithms: one that is fast, referred to as the FastSum algorithm, and one that is accurate, referred to as the AccurateSum algorithm. In Algorithm 3.1 we describe FABsum, a new blocked summation algorithm that exploits FastSum and AccurateSum. The key idea is to use FastSum to compute the local sums s i of b numbers and AccurateSum to sum the local sums. This idea is motivated by the observations that the bulk of the computation is in computing the local sums if the block size is chosen appropriately and that rounding error growth can be attenuated by accumulating the local sums more accurately. We can therefore expect that the new algorithm will be almost as fast as FastSum and almost as accurate as AccurateSum. The rest of this section is devoted to proving this expectation by analyzing the rounding errors and the cost of Algorithm 3.1 and illustrating these analyses with some practical choices of FastSum and AccurateSum.
We note the special cases b = 1, which is entirely AccurateSum, and b = n, which is entirely FastSum. We are interested in values of b between these two extremes: b should be large enough to give speed benefits but small enough to give error bounds independent of n to first order. In the next two sections we give error and cost analyses that guide the choice of b.
Our general framework includes some previously proposed algorithms. For example, taking AccurateSum to be pairwise summation leads to the superblock algorithm from [3]. Similarly, [1] presents a "selective compensation" algorithm that amounts to using compensated summation as AccurateSum. Our analysis below generalizes the analyses of these papers, considers other possible AccurateSum choices, and extends the analysis to numerical linear algebra kernels in section 4.

Rounding error analysis.
To carry out a rounding error analysis of Algorithm 3.1 we need to make some assumptions on the error bounds associated with algorithms FastSum and AccurateSum. For any sum of the form s = n i=1 x i , we assume that the computed s from FastSum satisfies and the computed s from AccurateSum satisfies where ε f (n) and ε a (n) are O(u) and depend on n and u. With these assumptions, we have the following backward error result.
Proof. Each of the local sums s i computed at line 2 of Algorithm 3.1 satisfies i=1 s i be the sum of the computed local sums. Then the computed t, which is the overall computed sum s, satisfies

Combining (3.3) and (3.4) and defining µ
, the result follows. We can interpret Theorem 3.1 as follows. If ε a is of order u 2 (such as when AccurateSum uses doubled precision, that is, with a unit roundoff u 2 corresponding to a significand twice as large), then ε(n, b) = ε f (b) + O(u 2 ) is independent of n to first order and thus does not grow with n. In fact, even if ε a is of order u, as long as it does not grow with n to first order (such as when AccurateSum uses compensation), ε(n, b) does not grow with n to first order either.
To explore this point more precisely we provide an expression for ε(n, b) in the following cases. Assume FastSum corresponds to recursive summation, so that • If AccurateSum uses recursive summation in doubled precision with a final rounding back to the working precision, then we have ε a (n/b) = u + O(u 2 ) and an overall error bound • If AccurateSum uses compensated summation, we have ε a (n/b) = 2u + O(u 2 ) by (2.7) and an overall bound • If AccurateSum uses pairwise summation, we have ε a (n/b) = log 2 (n/b) u + O(u 2 ) by (2.5) and an overall bound In all cases we thus obtain an overall error bound that does not grow (or, in the case of pairwise summation, grows very slowly) with n.
3.1.1. Accuracy versus stability. Accurate summation means achieving a small forward error, but our error analysis bounds the backward error. We will show that these concepts are closely related for summation. The backward error of an A formula for the backward error can be obtained as a special case of the Oettli-Prager theorem [10, Thm. 7.3], [23]: The numerator is the absolute forward error in the computed sum. The relative forward error is a factor x i | larger than η( s)-a factor that depends on the data but not the algorithm. This factor is precisely the condition number 3.1.2. On second order contributions. Up to now we have considered expansions of the error bounds up to first order in u. However, when low precision arithmetic is being used it is likely that the O(u 2 ) terms will become significant even for moderate n.
If the arithmetic uses round to nearest we can drop these second order terms for recursive summation, as proved in [16], [19], [24]. However, as noted in section 2, we wish to keep the analysis more general. Using the general model (2.1) we now obtain the second order terms explicitly in order to assess their potential influence on the error bounds of FABsum.
Let us start with the error bounds given in section 2. In the case of recursive summation, if (n − 1)u < 1 a second order bound is found by using the expansion of γ n−1 at u = 0 in (2.2): Here, the second order term becomes significant when (n − 1)u approaches 1, that is, when the bound becomes invalid and hence of no use. Similarly, for recursive summation in doubled precision (with unit roundoff u 2 ) with a final rounding to working precision u, we have Finally, an explicit second order backward error bound for compensated summation can be found in [7, Thm. 8]: By Theorem 3.1, the second order terms of FABsum can be obtained by adding the second order terms of ε f (b) and ε a (n/b) to the product of their first order terms. We assume that FastSum uses recursive summation at the working precision, so that • If AccurateSum uses recursive summation in doubled precision, we have and an overall error bound • If AccurateSum uses compensated summation, we have ε a (n/b) = 2u + 2(2n/b + 1)u 2 + O(u 3 ) and an overall bound We will not derive a higher-order bound for pairwise summation as the bound (3.7) depends on n to first order already. For large n, n/b is in practice much larger than b and so we have, roughly, for doubled precision and for compensated summation. The order u 2 terms become significant when n exceeds the critical values b 2 /u for doubled precision and b 2 /(4u) for compensated summation. For instance, in IEEE half precision arithmetic (u = 2 −11 ≈ 4.88 × 10 −4 ) and for a block size b = 128, the second order terms in the bounds are significant for n larger than 3.3 × 10 7 with doubled precision and 8.3 × 10 6 with compensation.

Cost analysis.
We now analyze the cost C(n, b) in flops of Algorithm 3.1. Let C f and C a denote the costs of algorithms FastSum and AccurateSum, respectively. Then In particular, if the costs C f and C a are linear functions of the number of summands, as is often the case, C(n, b) simplifies to Therefore the cost of Algorithm 3.1 can be made arbitrarily close to that of FastSum by increasing the block size b. For example, assume FastSum is recursive summation, whose cost is C f (n) = n−1, and AccurateSum is compensated summation, whose cost is C a (n) = 4n − 2. Then C(n, b) is equal to n − 1 + (4n − 2)/b, that is, it requires only (4n − 2)/b extra flops compared with recursive summation, which is a very small overhead for typical choices of block sizes. For instance, for b = 400, this represents an overhead of only 1%. Similar results hold when AccurateSum uses recursive summation with doubled precision (the use of doubled precision increases the cost of recursive summation by a constant factor, usually 2) or pairwise summation (which does not change the number of required flops, but may decrease the speed due to poor efficiency on parallel computers).
As long as the parameter b is both reasonably large (to keep the overhead cost limited) and constant (to avoid error growth), we therefore have the freedom to tune it in order to achieve the highest possible performance on a given target architecture, taking into account the cache size, for example.
4. Application to numerical linear algebra. Now we apply our new FABsum algorithm (Algorithm 3.1) within some important summation-based kernels in numerical linear algebra, namely inner products, matrix-vector and matrix-matrix products, matrix factorizations, and the solution of linear systems. The core computation is an inner product, and to be precise we specify in Algorithm 4.1 how FABsum is used.

Algorithm 4.1 Inner product via FABsum.
This algorithm takes as input vectors x, y ∈ R n and a block size b and returns the inner product x T y. We now derive rounding error bounds for the resulting algorithms. Recall that ε f (n) and ε a (n) are the error constants in (3.1) and (3.2) for FastSum and AccurateSum, respectively.
Proof. The initial products z i = x i y i each incur one rounding error. The rest of the proof follows from the application of Theorem 3.1 to the computation of n i=1 z i . In the next result we consider matrix-vector and matrix-matrix products computed by the usual inner product formulas. Here, |A| denotes the matrix (|a ij |) and inequalities involving matrices or vectors are to be understood componentwise.
Theorem 4.2 (matrix-vector and matrix-matrix products). Let A ∈ R m×n , B ∈ R n×p , and x ∈ R n . If y = Ax is computed via inner products using Algorithm 4.1 then If C = AB is computed via inner products using Algorithm 4.1 then Proof. The proof follows directly from Theorem 4.1.
In order to analyze the solution of linear systems we need the next lemma. For convenience in stating the bounds we will assume that ε f (n) ≥ u and ε a (n) ≥ u, which is true for all the methods under consideration here.
Proof. By Theorem 3.1 and (2.1) we have We can now derive results for the solution of linear systems by LU factorization. The next result is immediate from Lemma 4.3.
For the next two results we assume that the Doolittle form of LU factorization is used (see, e.g., [10, Alg. 9.2]) and that Algorithm 4.1 is used in the inner products therein.
Theorem 4.5. If LU factorization applied to A ∈ R n×n runs to completion then the computed LU factors L and U satisfy Proof. The result is obtained by applying Lemma 4.3 to the equations that determine L and U , and is analogous to the proof of [10, Thm. 9.3].
Proof. The result is obtained by combining Theorems 4.4 and 4.5, and is analogous to the proof of [10,Thm. 9.4].
When recursive summation is used (which corresponds to b = n in Algorithm 3.1, with FastSum equal to recursive summation), we have ε a = 0 and ε f (b) = ε f (n) = (n − 1)u + O(u 2 ), so the error constants in Theorems 4.1-4.2 and 4.4-4.6 have first order terms proportional to n. But as shown in section 3.1, if FABsum is used with FastSum equal to recursive summation and AccurateSum equal to either compensated summation or recursive summation in doubled precision then the first order terms in Theorems 4.1-4.2 and 4.4-4.6 are of order bu and hence independent of n. FABsum therefore offers error constants roughly n/b times smaller than those for recursive summation.

Numerical experiments.
We have carried out an extensive set of numerical experiments to assess the performance and accuracy of our new FABsum algorithm (Algorithm 3.1) with different AccurateSum choices and to compare it with existing algorithms.
In section 5.1 we experiment with summation using MATLAB R2018b. In section 5.2 we present performance and accuracy results with PLASMA [6], a state-ofthe-art numerical linear algebra library that we have modified by integrating FABsum.
In all the experiments, the working precision is single (u = 2 −24 ≈ 5.96 × 10 −8 ) and the "exact" quantities appearing in the error bound formulas for inner products and matrix-matrix products are computed in double precision.

Summation.
We first experiment with summation with MATLAB. We compute s = n i=1 x i , where x is a randomly generated vector. While the behavior of the error as a function of n does not strongly depend on the type of distribution (such as random uniform or normal), it does depend on the mean of the entries x i [13]. For this reason, we compare the two distributions • random uniform [0, 1]: x = rand(m,n), • random uniform [−1, 1]: x = (rand(m,n)-0.5)*2. In order to make the experiments reproducible, we seed the random number generator at the beginning of the scripts generating each figure of this section. We have made these scripts available online 1 . For each problem size n, we run the same experiment N test = 10 times and plot the value of the maximum backward error (3.8).
Results using single precision and a block size b = 128 are displayed in Figure 5.1, where we compare blocked summation (one of the fastest summation algorithms), compensated summation (Algorithm 2.1, one of the most accurate summation algorithms), and FABsum (Algorithm 3.1), which in this case uses recursive summation for FastSum and compensated summation for AccurateSum. Clearly, FABsum is much more stable than blocked summation, by up to three orders of magnitude for large n. It yields a backward error that does not grow with n and is almost ideally small for [0, 1] vectors, matching the compensated summation error. For [−1, 1] vectors, compensated summation remains more accurate, but only by a constant factor always less than 10.
The backward error bound for FABsum is (b + 1)u + O(u 2 ) (see (3.6)) versus 2u + O(u 2 ) for compensated summation, but we see a ratio in the actual backward errors roughly √ b/2 rather than b/2. This is not surprising because the probabilistic error analysis in [11] shows that if the rounding errors are independent random variables of zero mean then an error bound of order the square root of the worst-case bound holds with high probability.
The next experiment uses IEEE half precision (fp16, u = 2 −11 ≈ 4.88 × 10 −4 ),   bfloat16 (u = 2 −8 ≈ 3.91 × 10 −3 ), and quarter precision (u = 2 −4 ≈ 6.25 × 10 −2 )-the latter is not standard, but this choice of precision for 8-bit words is suggested in [21]. These lower precisions are simulated with the rounding function chop.m from [14]. Since our objective is to assess the effect of rounding errors, in order to avoid overflow interfering with the results we only simulate the significand of these low precision formats; for the exponent, we keep the same number of bits as double precision. Figure 5.2 compares blocked summation with FABsum with recursive summation for FastSum and two choices for AccurateSum: compensated summation and recursive summation in extended precision u e = 2 −24 (that is, single precision). As can be seen in the figure, for such low precisions blocked summation leads to a backward error that quickly reaches 1, at which point the computed sum is meaningless from a backward error point of view. Since in this experiment x is a nonnegative vector, one important difficulty is that the sum s keeps increasing as it is computed. For large n, the sum may become so large that the following increments do not increase its value in floating-point arithmetic. This phenomenon, which we call stagnation, leads to a dramatic increase of the error. The reason is that the rounding errors incurred during stagnation are all negative, which makes the worst-case error bound close to being sharp [11,Sec. 4.2.1] and also invalidates the model upon which the probabilistic analysis in [11] is based. Since the entries of x are bounded by 1, we know that s (in line 4 of Algorithm 3.1) must necessarily stagnate after its value exceeds b/u; note that stagnation can, however, occur for smaller values of s when the increments x i are small. For random uniform summands in [0, 1], the value of s should be about n/2. This explains the dramatic increase of the error for blocked summation observed in Figure 5.2 when n becomes larger than 2b/u, that is, 131072, 16384, and 1024 in fp16, bfloat16, and fp8 arithmetic, respectively.
We now discuss the behavior of FABsum with respect to stagnation for different choices of AccurateSum. For fp8 arithmetic, stagnation occurs for FABsum with compensated summation, eventually leading to a sharp increase of the error. However, when extended (single) precision is used, s does not stagnate because it never exceeds    2b/u e ≈ 1.1 × 10 9 . For larger n, one could use an even higher extended precision, such as double precision. In fact, provided that stagnation does not occur within local sums, FABsum with extended precision summation is able to avoid stagnation as long as standard blocked summation in precision u e , at a fraction of the cost. For the fp16 and bfloat16 precisions, even though the range of sizes n are too small to show the full effect of stagnation, the variant with extended precision is again noticeably and consistently more accurate than the variant with compensated summation. In contrast to the fp8 case, it is not clear, though, whether this is due to some mild beginning of stagnation or to the effect of the second-order terms analyzed in section 3.1.2.
We conclude with a classical example affected by stagnation: the evaluation of the harmonic series s = n i=1 1/i. In exact arithmetic, s diverges as n → ∞, whereas in floating-point arithmetic, s is known to "converge" due to stagnation [14], [20]. In Figure 5.3, we compare the value to which s converges for different precisions and    summation algorithms. These experiments demonstrate again that FABsum with extended precision summation (in this case, double precision) is able to delay stagnation much longer than the other algorithms.

Matrix multiplication.
We now experiment with the state-of-the-art parallel numerical linear algebra library PLASMA [6]. PLASMA includes much of the functionality of the BLAS and LAPACK, but partitions matrices into t × t blocks called tiles. Independent operations on different tiles are performed concurrently using OpenMP task-based parallelism in order to achieve high performance on sharedmemory multicore machines. The tile size is tuned for performance: it must be large enough to allow for computations of high granularity but small enough to expose enough parallelism to feed all the available cores.
By design, PLASMA therefore lends itself naturally to using blocked summation. Note that it is not easy to determine what specific summation algorithm PLASMA uses, because it relies on third-party libraries (in our experiments, Intel MKL) to compute the local operations on tiles. Nevertheless, we expect the default implemen-   tation of the library to suffer from growth of the error with the matrix size. In order to assess how FABsum can overcome this issue and at what performance cost, we have implemented FABsum in some of the PLASMA routines. 2 We experiment with single precison matrix-matrix multiplications C = AB, where A ∈ R m×n and B ∈ R n×p are randomly generated matrices with uniform entries in [0, 1]. We use two 14-core Intel Broadwell processors on the Kebnekaise supercomputer (Umeå, Sweden). We perform N test = 10 consecutive runs and take the maximum performance and the maximum error, measured as where the norm is the Frobenius norm and the "exact" C is computed by a double precision multiplication. We wish to study the behavior of the error for large n. Since the matrices would not fit in memory if taken to be square, we fix m = p = 4096 and vary n from 10 3 to 2 × 10 5 . In Figure 5.4, we compare the performance and accuracy of PLASMA using FABsum with the default implementation of the PLASMA library. In this first experiment, we have taken the summation block size b to be equal to the default tile size t = 256 because this makes for the most convenient implementation of FABsum. Figure 5.4a shows that for the largest matrices, the error is reduced by more than an order of magnitude. In theory, the performance loss should follow the flop overhead, which is about 1% since b = 256. However, Figure 5.4b reports a much higher performance loss: while the default implementation achieves a performance peak of over 1500 Gflops for the largest matrices, FABsum only reaches about 1200 Gflops, which   represents a 20% loss. This is probably explained by the fact that the extra work required by FABsum cannot achieve a high speed, being of BLAS-2 type rather than BLAS-3. Therefore, even though it is convenient to take the summation block size b and the tile size t to be the same, doing so does not allow for an entirely satisfying performance/accuracy tradeoff. While dissociating b and t slightly complicates the implementation, it allows us to find the best possible compromise between performance and accuracy, as shown in Figure 5.5. Figure 5.5a shows that increasing b from t = 256 to 16t = 4096 barely impacts the error, even though the bound is 16 times larger (this is partly explained by the probabilistic arguments discussed in the previous section, which suggest the error should increase by only a factor around √ 16-but even this prediction turns out to be pessimistic for large n). In turn, taking a larger b significantly improves the performance of FABsum: Figure 5.5b shows that with b = 16t the performance loss reduces significantly to about 3%.
6. Conclusion. As problem sizes n in scientific computing continue to grow, rounding error bounds proportional to nu, where u is the unit roundoff, become less satisfactory-especially in the context of the low precision arithmetics that are increasingly available in hardware. Error bounds for (blocked recursive) summation have the form nu + O(u 2 ) and so can readily exceed 1 in practice. We have proposed a class of blocked summation algorithms, FABsum, that uses a fast algorithm to sum blocks of b numbers and an accurate algorithm to sum the resulting local sums. When the fast algorithm is recursive summation and the accurate algorithm is either compensated summation or recursive summation in extended precision, FABsum has a backward error bound (b + 1)u + O(u 2 ); this bound is about a factor n/b smaller than that for standard summation and, for constant b, does not grow with n to first order.
When FABsum is used in the core linear algebra operations of matrix multiplication and solution of linear systems by LU factorization, error bounds with constant first order term are again obtained, as shown in section 4. Our numerical experiments show that FABsum does indeed produce substantial reductions to the actual backward errors in practice, and can do so with only a small reduction in performance in the context of the high performance state-of-the-art linear algebra library PLASMA.
FABsum provides an attractive way to obtain more accurate sums and more accurate linear algebra kernel evaluations, especially for low precisions.