Stochastic Rounding and its Probabilistic Backward Error Analysis

Abstract. Stochastic rounding rounds a real number to the next larger or smaller floating-point number with probabilities 1 minus the relative distances to those numbers. It is gaining attention in deep learning because it can improve the accuracy of the computations. We compare basic properties of stochastic rounding with those for round to nearest, finding properties in common as well as significant differences. We prove that for stochastic rounding the rounding errors are mean independent random variables with zero mean. We derive a new version of our probabilistic error analysis theorem from [SIAM J. Sci. Comput., 41 (2019), pp. A2815-A2835], weakening the assumption of independence of the random variables to mean independence. These results imply that for a wide range of linear algebra computations the backward error for stochastic rounding is unconditionally bounded by a multiple of √n u to first order, with a certain probability, where n is the problem size and u is the unit roundoff. This is the first scenario where the rule of thumb that one can replace nu by √n u in a rounding error bound has been shown to hold without any additional assumptions on the rounding errors. We also explain how stochastic rounding avoids the phenomenon of stagnation in sums, whereby small addends are obliterated by round to nearest when they are too small relative to the sum.

1. Introduction. The results of most elementary floating-point operations cannot themselves be represented as floating-point numbers. This simple fact leads to one of the defining features of floating-point arithmetic: rounding error. To define a floating-point arithmetic we must prescribe how to round the result of an operation to a nearby floating-point number. The IEEE standard 754 for binary floating-point arithmetic [19] defines four rounding modes.
• Round to nearest. The default, where we round towards even (least significant bit 0) to break ties.
• Round towards 0.
• Round towards +∞.
• Round towards −∞.
Here, we consider two stochastic rounding modes. Let F ⊆ R denote the floating-point number system. In the first mode, we round x ∈ R with x ∉ F up or down with equal probability to the respective nearest floating-point number. In the second mode, we round with a probability that is 1 minus the relative distance of x to each of the nearest floating-point numbers. For x ∈ R, define

⌊x⌋ = max{ y ∈ F : y ≤ x },  ⌈x⌉ = min{ y ∈ F : y ≥ x },

so that ⌊x⌋ ≤ x ≤ ⌈x⌉, with equality throughout if x ∈ F. For x ∉ F, ⌊x⌋ and ⌈x⌉ are adjacent floating-point numbers. For x ∈ R with x ∉ F the two stochastic rounding modes are

mode 1: fl(x) = ⌈x⌉ with probability 1/2, ⌊x⌋ with probability 1/2,  (1.1)

mode 2: fl(x) = ⌈x⌉ with probability p = (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋), ⌊x⌋ with probability 1 − p.  (1.2)

Stochastic rounding is an old idea, proposed in the 1950s and 1960s by Forsythe [7], [8] and Hull and Swenson [18]. It is attracting renewed interest in deep learning because it can improve the accuracy of the underlying computations.
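To make the two modes concrete, here is a minimal Python sketch that rounds a binary64 value x into binary32, which plays the role of F; the function sr_round, its mode argument, and the NumPy-based implementation are illustrative choices of ours, not constructs from the paper. Later sketches reuse this function.

    import numpy as np

    def sr_round(x, mode=2, rng=np.random.default_rng()):
        """Stochastically round the real number x (a binary64 float) to
        binary32: mode 1 implements (1.1), mode 2 implements (1.2)."""
        r = np.float32(x)                 # round-to-nearest binary32 value
        if float(r) == x:
            return r                      # x is in F: rounding leaves it unchanged
        if float(r) < x:                  # identify the neighbours floor(x), ceil(x)
            lo, hi = r, np.nextafter(r, np.float32(np.inf))
        else:
            lo, hi = np.nextafter(r, np.float32(-np.inf)), r
        if mode == 1:
            p = 0.5                       # (1.1): round up or down equiprobably
        else:
            p = (x - float(lo)) / (float(hi) - float(lo))   # (1.2)
        return hi if rng.random() < p else lo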
It is shown in [12], in the context of neural network training, that using a 16-bit fixed-point representation with mode 2 stochastic rounding can be as effective as using 32-bit floating-point numbers with round to nearest. The issue that stochastic rounding solves is the obliteration of small parameter updates in the neural network, which is an instance of what we call stagnation. If a parameter φ is updated by a quantity h that is less than half the spacing of the floating-point numbers (or fixed-point numbers) around φ, then fl(φ + h) = φ with round to nearest, so the information in h is lost. Stochastic rounding helps to preserve this information. Much recent work applies stochastic rounding in neural network training and inference; see, for example, [4], [6], [24], [27], [28], [35], [38], [39], [43], and the references therein.
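A small experiment with the sr_round sketch above illustrates stagnation (the numbers are our own): in binary32 the spacing around φ = 1 is 2^(−23), so an update h = 2^(−26) is below half the spacing and is lost by round to nearest, while mode 2 stochastic rounding preserves it in expectation.

    import numpy as np

    rng = np.random.default_rng(0)
    phi, h = 1.0, 2.0**-26               # h is below half the binary32 spacing at phi
    print(np.float32(phi + h) == np.float32(phi))    # True: the update is obliterated

    # the average update over many stochastic roundings of phi + h is close to h
    updates = [float(sr_round(phi + h, rng=rng)) - phi for _ in range(100_000)]
    print(sum(updates) / len(updates), "vs h =", h)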
Another application where mode 2 stochastic rounding has been shown to improve accuracy with fixed-point arithmetic is the numerical solution of neural ODEs [17].
Much work on stochastic rounding with floating-point arithmetic has focused on using it to validate numerical methods through an empirical approach. The CESTAC method [3], [36] and its implementation CADNA [23], [32] use mode 1 stochastic rounding, termed "stochastic arithmetic", to detect instabilities in numerical routines and to provide estimates of the accuracy of the computed results. Further references on this topic include [10], [11], [37].
Parker's Monte Carlo arithmetic [29], [30] is more general than stochastic rounding, not least because, in addition to rounding randomly, it can randomly perturb the input and output of a floating-point operation.
We are not aware of any rounding error analysis for stochastic rounding. The purpose of this paper is to fill this gap in the literature. We make the following contributions.
• We analyze the properties of stochastic rounding in floating-point arithmetic vis-à-vis the properties of round to nearest, finding both common properties and significant differences.
• We show that the recent probabilistic backward error analysis of Higham and Mary [14], which assumes that rounding errors are independent random variables with zero mean, holds with the weaker assumption of mean independence. We also show that stochastic rounding produces rounding errors that are mean independent random variables with zero mean. We conclude that the long-standing rule of thumb that one can replace a worst-case error bound nu by a more realistic (probabilistic) error bound √n u [40, p. 318], [41, p. 26] holds unconditionally for stochastic rounding.
• We show that the expected value of a computed result from stochastic rounding is the true value for summation, inner products, matrix-vector and matrix-matrix products, and the solution of triangular systems, and we explain why this property does not extend to matrix factorizations.
• We prove that stochastic rounding avoids stagnation in summation and thereby can lead to more accurate results than round to nearest.

We make reference throughout to various floating-point systems, the parameters of which are shown in Table 1.1. All those beginning with "fp" are from the IEEE standard. Bfloat16 is a half precision format originally proposed by Google and supported by the Google Tensor Processing Unit (TPU); Intel [20] and others have announced support for it in future hardware.
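The IEEE parameters in Table 1.1 can be read off programmatically; a small NumPy sketch of ours (bfloat16 is omitted, as it is not a native NumPy type):

    import numpy as np

    # t counts the implicit bit, so eps = 2**(1 - t) and the unit roundoff is u = eps/2
    for name, dt in [("fp16", np.float16), ("fp32", np.float32), ("fp64", np.float64)]:
        fi = np.finfo(dt)
        print(name, "t =", fi.nmant + 1, "u =", fi.eps / 2,
              "x_min =", fi.tiny, "x_max =", fi.max)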
We begin, in section 2, by recalling some basic properties of floating-point arithmetic. In section 3 we investigate properties of stochastic rounding and compare them with key properties of round to nearest. Section 4 contains the main contribution of this article: we prove that the recent probabilistic backward error bounds from [14] hold unconditionally for stochastic rounding. We illustrate this key result in section 5 with some numerical experiments for inner products. Finally, we give concluding remarks in section 6.
2. Floating-point arithmetic. We recall some basic properties of floating-point arithmetic. For more details, see [9], [13, Chap. 2], [26]. A number y in the floating-point number system F has the form

y = ±m × β^(e−t),  (2.1)

where
• β is the base, which is 2 throughout this paper,
• t is the precision, and
• e is the exponent, which satisfies e_min ≤ e ≤ e_max.

The significand m is an integer satisfying 0 ≤ m ≤ β^t − 1. Normalized numbers are those for which m ≥ β^(t−1). The machine epsilon ε is the distance from 1 to the next larger floating-point number and is given by ε = β^(1−t); the unit roundoff is u = ε/2. The spacing of the floating-point numbers increases by a factor β at each power of β; for β = 2 the spacing in the interval [1, 2] is ε. For round to nearest we have the standard model

fl(x op y) = (x op y)(1 + δ),  |δ| ≤ u,  op ∈ {+, −, ×, /},  (2.3)

and for any rounding mode that delivers one of the two adjacent floating-point numbers, in particular stochastic rounding, we have

fl(x) = x(1 + δ),  |δ| < 2u,  (2.4a)
fl(x op y) = (x op y)(1 + δ),  |δ| < 2u,  op ∈ {+, −, ×, /}.  (2.4b)

3. Properties of stochastic rounding. In this section, stochastic rounding refers to either mode 1 or mode 2, defined by (1.1) and (1.2), and all the results are valid for both.
3.1. Properties that continue to hold. We begin by identifying properties of round to nearest that continue to hold under stochastic rounding. First, we note that fl(fl(x)) = fl(x) with stochastic rounding; that is, rounding a floating-point number leaves it unchanged.
Sterbenz's lemma [13, Thm. 2.5], [34] is a property of floating-point numbers that is independent of the rounding mode, so it certainly holds for stochastic rounding. Under round to nearest we have (in base 2, but not for all bases [13, Probs. 2.7, 2.8]) that for floating-point numbers x and y with x ≤ y,

x ≤ fl((x + y)/2) ≤ y.
These inequalities are an immediate consequence of the monotonicity of round to nearest. We show that they remain true for stochastic rounding, even though it is not monotonic (as shown in the next section). Since division by 2 is exact in base 2 arithmetic, it suffices to show that 2x ≤ fl(x + y) ≤ 2y. For the case x = y the inequalities trivially hold, so we consider x < y. Let y = x + δ, where δ > 0. Then x + y = 2y − δ < 2y, and since 2y ∈ F it follows that fl(x + y) ≤ 2y. Furthermore, x + y = 2y − δ ≥ 2y − 2δ = 2(y − δ) = 2x, and since 2x ∈ F we have fl(x + y) ≥ 2x.
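A Monte Carlo check of these inequalities, reusing the sr_round sketch (the sampling choices are ours):

    import numpy as np

    rng = np.random.default_rng(1)
    for _ in range(10_000):
        x = float(np.float32(rng.uniform(0, 1)))     # x, y in F with x <= y
        y = float(np.float32(x + rng.uniform(0, 1)))
        mid = float(sr_round((x + y) / 2, rng=rng))  # (x + y)/2 is exact in binary64
        assert x <= mid <= y                         # never fails
    print("x <= fl((x+y)/2) <= y held in all trials")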

3.2. Properties that no longer hold. Some properties that are trivial under round to nearest do not hold under stochastic rounding. Since rounding is probabilistic, two different evaluations of fl(x) can give different results. Similarly, in general we have fl(−x) ≠ −fl(x), fl(x − y) ≠ −fl(y − x), and fl(2^p x) ≠ 2^p fl(x), but in each case the two possible values of the left-hand side are equal to the two possible values of the right-hand side (in the third case this follows from 2^p ⌈x⌉ = ⌈2^p x⌉ and 2^p ⌊x⌋ = ⌊2^p x⌋).
Monotonicity of rounding is the property that for x ∈ R and y ∈ R, the inequality x ≤ y implies fl(x) ≤ fl(y). Round to nearest is monotonic but stochastic rounding is not: if a < b are adjacent floating-point numbers, then for a < x ≤ y < b it is possible that fl(x) = b > a = fl(y) under stochastic rounding.
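For instance (a sketch of ours reusing sr_round), take x < y strictly between the adjacent binary32 numbers 1 and 1 + ε:

    import numpy as np

    rng = np.random.default_rng(2)
    eps32 = 2.0**-23                     # spacing of the binary32 numbers in [1, 2]
    x, y = 1.0 + 0.6 * eps32, 1.0 + 0.7 * eps32      # 1 < x < y < 1 + eps32
    hits = sum(float(sr_round(x, rng=rng)) > float(sr_round(y, rng=rng))
               for _ in range(10_000))
    print(hits)   # roughly 18% of trials: fl(x) = 1 + eps32 while fl(y) = 1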
In [5], [13, Prob. 2.12] it is shown that for a floating-point number x satisfying 1 ≤ x < 2, fl(x * (1/x)) is either 1 or 1 − ε/2 with round to nearest, where ε is the machine epsilon. Under stochastic rounding we have two more possibilities for the result.

Theorem 3.2. Let x be a floating-point number with 1 ≤ x < 2. Then fl(x * (1/x)) ∈ {1 − ε, 1 − ε/2, 1, 1 + ε} under stochastic rounding, and each of these values is achievable for some x.

Proof. The spacing of the floating-point numbers in the interval (1/2, 1] is ε/2. This means that under stochastic rounding we have fl(1/x) = 1/x + δ with |δ| < ε/2, so that x fl(1/x) = 1 + xδ lies in (1 − ε, 1 + ε), and stochastically rounding this quantity can produce 1 − ε, 1 − ε/2, 1, or 1 + ε. Each of these values is achievable, as is easily verified by numerical experiment.

Consider the computation of fl(n * fl(m/n)), where m and n are integers. If m/n is a floating-point number then fl(n * fl(m/n)) = fl(n * (m/n)) = fl(m) = m for any rounding scheme, as no rounding takes place. For round to nearest, Kahan proved that the same identity holds for many other choices of m and n [9, Thm. 7]. Recall that a floating-point number has precision t and that we are assuming base 2.
Theorem 3.3 (Kahan). Let m and n be integers such that |m| < 2^(t−1) and n = 2^i + 2^j for some i and j. Then fl(n * fl(m/n)) = m with round to nearest.
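A quick numerical check of Theorem 3.3 (our illustration, in binary64, where t = 53 and Python's arithmetic rounds to nearest):

    # each n below has the form 2^i + 2^j, e.g. 3 = 1 + 2, 12 = 4 + 8, 20 = 4 + 16
    ns = [3, 5, 6, 9, 10, 12, 17, 18, 20]
    print(all(float(n * (m / n)) == m for n in ns for m in range(1, 10_000)))  # True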
The sequence of allowable n begins 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 16, 17, 18, 20, so Kahan's theorem covers many common cases. As an example of where the result is useful, if we partition [0, 1] into n intervals of length h = 1/n, we may want, for consistency in a computation, that fl(nh) = 1. Kahan's result shows that n does not need to be a power of 2 for this condition to hold. Theorem 3.3 does not hold for stochastic rounding, because there are three possibilities for the computed result, as the next result shows.
Theorem 3.4. Let m be an integer such that |m| < 2^(t−1) and let n = 2^i + 2^j for some i and j. Under stochastic rounding, fl(n * fl(m/n)) is either m, the next smaller floating-point number, or the next larger floating-point number.
Proof. The proof is a modification of the proof of [9, Thm. 7]. Without loss of generality we can assume that m > 0. It is harmless to scale n and m by powers of 2, since doing so changes only the exponents. Scale n so that 2^(t−1) ≤ n < 2^t and scale m so that 1/2 ≤ q = m/n < 1. We then have 2^(t−2) ≤ m < 2^t. Since the original m has been reduced by at most a factor 2, m now has at most 1 bit to the right of the binary point. We will show that q̂ = fl(m/n) satisfies

|nq̂ − m| ≤ 1/4.  (3.3)

Since m has at most 1 bit to the right of the binary point, if (3.3) is satisfied then under stochastic rounding fl(nq̂) will equal either m or one of the two adjacent floating-point numbers. (It would, in fact, be enough to prove (3.3) with 1/2 on the right-hand side.)

We now seek to bound |nq̂ − m|. Write q = .q_1 q_2 . . . and let q̄ = .q_1 q_2 . . . q_t 1. From the proof of [9, Thm. 7] we have |q̄ − q| ≥ 1/(n × 2^(t+1−r)), where n must have the form n = 2^(t−1) + 2^r and r ≤ t − 2. Assume q̄ < q; the proof for q̄ > q is similar. We now have two cases. Write q_d = fl(m/n) = fl(q).

Case 1: for q̄ < q, with round to nearest we would necessarily round down and so q_d = q̄ − 2^(−t−1). This is one possibility with stochastic rounding. In this case we have nq_d < nq = m and so |m − nq_d| ≤ 1/4.

Case 2: with stochastic rounding we have another possibility. As q_d < q, the other value we can compute must be q_u = q_d + 2^(−t). We then have q_u = q̄ + 2^(−t−1). Following a similar procedure as before we can show that |m − nq_u| ≤ 1/4, concluding the proof.
It is readily verified that there are m and n for which all three results given in the theorem can occur.
With round to nearest (and specifically for base 2), we have that fl(√(x²)) = |x| for x a floating-point number [13, Prob. 2.20]. We show that this identity can fail under stochastic rounding, and fl(√(x²)) can be one of three values.

Theorem 3.5. Let x be a floating-point number. Under stochastic rounding, fl(√(x²)) is either |x|, the next smaller floating-point number, or the next larger floating-point number.
Theorem 3.5 implies that the inequality fl(x/√(x² + y²)) ≤ 1 (which always holds under round to nearest [13, Prob. 2.21]) can fail under stochastic rounding. This means that the formula acos(x/√(x² + y²)) for one of the angles in a right-angled triangle with sides of length x and y can fail. Indeed, take y to be zero, or so small that fl(x² + y²) = fl(x²) holds with high probability. For x > 0, Theorem 3.5 shows that fl(√(x²)) can be smaller than x, and it follows that under stochastic rounding the result can exceed 1.

Stochastic rounding has two drawbacks in common with a fused multiply-add operation [13, sect. 2.6]. First, if we compute the modulus squared of a complex number from the formula (x + iy) * (x − iy) = x² + y² + i(xy − yx), the result may be non-real, since fl(xy) ≠ fl(yx) is possible. Second, in evaluating a discriminant b² − ac, even if b² ≥ ac the discriminant can evaluate as negative because of the non-monotonicity of stochastic rounding, which is problematic if √(b² − ac) must be computed.
Under round to nearest (in base 2) we have for floating-point numbers x and y that err(x, y) = x + y − fl(x + y) satisfies |err(x, y)| ≤ min(|x|, |y|) [13, Prob. 4.6], [33]. We show this to be false under stochastic rounding by counterexample. For x = 4 and y = ε we have fl(x + y) ∈ {4, 4 + 4ε}, as the spacing of the floating-point numbers in the interval [4, 8] is 4ε; if we round up then err(x, y) = −3ε and |err(x, y)| = 3ε > ε = min(|x|, |y|).

Vital to compensated summation algorithms is the fact that for floating-point numbers a and b, if s = fl(a + b) with round to nearest then t = a + b − s is a floating-point number, which can be computed by the following algorithm.

Algorithm 3.1 (FastTwoSum). Given floating-point numbers a, b such that |a| ≥ |b|, compute (with round to nearest) s and t such that s = fl(a + b) and s + t = a + b exactly.
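A transcription of the classic Dekker formulation of this algorithm (a sketch of ours in binary64; only the specification above is taken from the paper):

    def fast_two_sum(a: float, b: float) -> tuple[float, float]:
        """Given |a| >= |b|, return s = fl(a + b) and t with s + t = a + b
        exactly; exactness requires round to nearest."""
        s = a + b     # s = fl(a + b)
        z = s - a     # computed exactly when |a| >= |b|
        t = b - z     # t = a + b - s, the rounding error of the addition
        return s, t

    s, t = fast_two_sum(1.0, 2.0**-60)
    print(s, t)       # 1.0 8.673617379884035e-19: the information in b survives in t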
Under stochastic rounding, the computed t from Algorithm 3.1 is not exact, but we can bound the error: the bound of [10, Prop. 4.3] holds if each arithmetic operation is performed with a directed rounding mode, and hence it also holds for stochastic rounding. Based on this argument, error bounds are provided in [10], [11] for compensated summation algorithms under directed rounding schemes, and these bounds therefore hold under stochastic rounding. We note that while the computation of t is no longer exact, compensated summation algorithms remain accurate under stochastic rounding.

While the collection of properties analyzed above is by no means exhaustive, it demonstrates that it would be dangerous simply to replace round to nearest by stochastic rounding in a given computation. One should carefully consider whether the computation depends on properties of round to nearest beyond the model (2.3) and, if it does, check whether those properties remain true for stochastic rounding.

4. Probabilistic backward error analysis. We now consider how the properties of stochastic rounding can be exploited in backward error analysis. Standard backward error analysis based on the model (2.3) remains valid with u ← 2u by (2.4b), but we wish to exploit the statistical properties of stochastic rounding.
It is standard practice to express backward error results in terms of the constant γn = nu/(1 − nu). This constant arises when rounding error terms 1 + δi with |δi| ≤ u are collected in a product and the distance of the product from 1 is bounded using the following lemma [13, Lem. 3.1].

Lemma 4.1. If |δi| ≤ u and ρi = ±1 for i = 1: n, and nu < 1, then

∏_{i=1}^n (1 + δi)^{ρi} = 1 + θn,  |θn| ≤ nu/(1 − nu) = γn.
The inequality |θn| ≤ γn is a worst-case bound that is often pessimistic in practice, and so it can fail to provide a good indication of the size of the error of a typical computation. This weakness is especially relevant in the context of large scale and/or low precision computations, since for large values of n or u, γn can exceed 1, in which case the worst-case bound cannot guarantee even a single correct digit. For example, with the half precision arithmetics fp16 and bfloat16, nu > 1 for n > 2048 and n > 256, respectively.
These observations have generated a renewed interest in analyzing rounding errors from a probabilistic point of view. In particular, a systematic backward error analysis based on a probabilistic model that assumes rounding errors to be independent random variables of mean zero has recently been developed by Higham and Mary [14].
We state the following result, which is a minor rewriting of [14, Thm. 2.4] with the change of variable λ ← λ/(1 − u). Define

γn(λ) = exp( (λ√n u + nu²)/(1 − u) ) − 1 = λ√n u + O(u²).  (4.2)

Lemma 4.2. Let δ1, δ2, . . ., δn be independent random variables of mean zero such that |δi| ≤ u for all i, and let ρi = ±1, i = 1: n. Then for any constant λ > 0,

∏_{i=1}^n (1 + δi)^{ρi} = 1 + θn,  |θn| ≤ γn(λ),

holds with probability at least 1 − 2 exp(−λ²/2).

The significance of the lemma is that it shows that if the rounding errors are assumed to be independent random variables of mean zero then γn = nu + O(u²) can be replaced by the relaxed constant γn(λ) = λ√n u + O(u²) with a probability that is high even for modest λ. It justifies the long-standing rule of thumb that one can take the square root of an error constant because of statistical effects in rounding error propagation.
As an example of what can be proved using Lemma 4.2 we state the following result for inner products from [14, Thm. 3.1]. We define

Q(λ, n) = 1 − 2n exp(−λ²/2).  (4.3)

Theorem 4.3 (inner products). Let y = aᵀb, where a, b ∈ Rⁿ, be evaluated in floating-point arithmetic. If the rounding errors are independent random variables of mean zero then, no matter what the order of evaluation, the computed ŷ satisfies

ŷ = (a + Δa)ᵀb = aᵀ(b + Δb),  |Δa| ≤ γn(λ)|a|,  |Δb| ≤ γn(λ)|b|,

with probability at least Q(λ, n).

Lemma 4.2 and Theorem 4.3 rely, however, on the two key assumptions that the rounding errors are independent and have zero mean. With deterministic rounding modes these assumptions do not always hold, and indeed examples where the probabilistic bound is violated are provided in [14] and are used in our experiments in section 5. We now show that stochastic rounding enforces some weaker assumptions that are sufficient for the probabilistic bounds to hold unconditionally.
4.1. Stochastic rounding enforces mean independent, zero mean rounding errors. We begin by proving that stochastic rounding forces the rounding errors to be random variables with zero mean. From this point on, stochastic rounding means mode 2 stochastic rounding.

Lemma 4.4. Let fl(x) be produced by mode 2 stochastic rounding, that is,

fl(x) = ⌈x⌉ with probability p = (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋), ⌊x⌋ with probability 1 − p.

Then E(fl(x)) = x and the rounding error δ = (fl(x) − x)/x satisfies E(δ) = 0.
Proof. Clearly, fl(x) and δ = (fl(x) − x)/x are random variables. We have

E(fl(x)) = p⌈x⌉ + (1 − p)⌊x⌋ = ⌊x⌋ + p(⌈x⌉ − ⌊x⌋) = ⌊x⌋ + (x − ⌊x⌋) = x.

Since fl(x) = x(1 + δ), we also have E(δ) = (E(fl(x)) − x)/x = 0.

We note that mode 1 stochastic rounding does not produce a zero mean: (1.1) implies that E(fl(x)) = (⌈x⌉ + ⌊x⌋)/2, which is in general not equal to x.
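A Monte Carlo illustration of Lemma 4.4 and of the mode 1 bias, reusing sr_round (x = 0.1 is our arbitrary choice of a number not in binary32):

    import numpy as np

    rng = np.random.default_rng(3)
    x = 0.1                              # not representable in binary32
    for mode in (1, 2):
        m = np.mean([float(sr_round(x, mode=mode, rng=rng)) for _ in range(100_000)])
        print("mode", mode, "E(fl(x)) - x ~", m - x)
    # mode 2 gives a value near 0; mode 1 gives the offset of the
    # neighbours' midpoint from x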
Bounding the error in a sequence of operations is achieved in [14] by using Hoeffding's concentration inequality, which requires independence of the rounding errors. The question therefore arises: does stochastic rounding enforce independence of rounding errors? The answer is negative. Indeed, successive rounding errors are still dependent on each other, since they affect the computed values. Consider for example the computation of (a + b) + c. We have

fl(a + b) = (a + b)(1 + δ1),
fl(fl(a + b) + c) = ((a + b)(1 + δ1) + c)(1 + δ2).

Clearly, δ2 depends on the addends (a + b)(1 + δ1) and c, and hence on δ1. This simple example shows that independence of rounding errors is not enforced by stochastic rounding and so Hoeffding's inequality is not applicable.
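Both the dependence and the mean independence established next can be seen in a simulation reusing sr_round (the operands are ours): conditioning on each value of δ1 changes the distribution of δ2 but leaves its conditional mean at zero.

    import numpy as np

    rng = np.random.default_rng(4)
    a, b, c = 0.1, 0.2, 0.3              # none exactly representable in binary32
    d1s, d2s = [], []
    for _ in range(50_000):
        s1 = float(sr_round(a + b, rng=rng))
        d1s.append((s1 - (a + b)) / (a + b))
        s2 = float(sr_round(s1 + c, rng=rng))
        d2s.append((s2 - (s1 + c)) / (s1 + c))
    d1s, d2s = np.array(d1s), np.array(d2s)
    for v in np.unique(d1s):             # delta_1 takes two values
        print("delta1 =", v, "E(delta2 | delta1) ~", d2s[d1s == v].mean())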
We now prove that stochastic rounding enforces a weaker property called mean independence. A random variable X is said to be mean independent of another random variable Y if its conditional expectation given Y is equal to its unconditional expectation, that is, E(X | Y) = E(X). In our context we can prove the following property of the rounding errors.

Lemma 4.5. The rounding errors δ1, δ2, . . . produced by mode 2 stochastic rounding are mean independent random variables of mean zero, that is, E(δk | δ1, . . ., δk−1) = E(δk) = 0 for all k.

Proof. Let c denote the exact quantity whose rounding produces the error δk = (fl(c) − c)/c, so that δk equals (⌈c⌉ − c)/c with probability p = (c − ⌊c⌋)/(⌈c⌉ − ⌊c⌋) and (⌊c⌋ − c)/c otherwise. Moreover, (⌈c⌉ − c)/c and (⌊c⌋ − c)/c are themselves random variables that are entirely determined by δ1, . . ., δk−1, and so the conditional expectation of each given δ1, . . ., δk−1 is itself. Therefore we obtain

E(δk | δ1, . . ., δk−1) = p(⌈c⌉ − c)/c + (1 − p)(⌊c⌋ − c)/c = 0 = E(δk),

as in the proof of Lemma 4.4.

Lemma 4.5 is of crucial importance because it has been shown in [15], [21] that the assumption of independence of rounding errors in probabilistic error analyses for inner product-based computations can be relaxed to mean independence. It is possible to do so for general linear algebra operations by extending Lemma 4.2, which we do in the next section.
4.2. Generalizing backward error bounds to mean independence. We now derive a version of Lemma 4.2 that requires only mean independence and hence is unconditionally applicable to stochastic rounding. We will need the concept of a martingale.

Definition 4.6. A sequence of random variables E0, E1, . . ., En is a martingale with respect to the sequence δ1, δ2, . . ., δn if, for all k, Ek is a function of δ1, . . ., δk, E(|Ek|) < ∞, and E(Ek | δ1, . . ., δk−1) = Ek−1.
We also need the following inequality [25, Thm. 13.4], which generalizes Hoeffding's inequality to possibly dependent variables.

Lemma 4.7 (Azuma-Hoeffding inequality). Let E0, . . ., En be a martingale such that |Ek − Ek−1| ≤ ck for k = 1: n. Then for any λ > 0,

Pr( |En − E0| ≥ λ (∑_{k=1}^n ck²)^{1/2} ) ≤ 2 exp(−λ²/2).

We are ready for the main result, which is a version of Lemma 4.2 with the independence assumption replaced by the weaker assumption of mean independence.

Theorem 4.8. Let δ1, δ2, . . ., δn be random variables of mean zero with E(δk | δ1, . . ., δk−1) = E(δk) = 0 such that |δi| ≤ u for all i, and let ρi = ±1, i = 1: n. Then for any constant λ > 0,

∏_{i=1}^n (1 + δi)^{ρi} = 1 + θn,  |θn| ≤ γn(λ),  (4.6)

holds with probability at least 1 − 2 exp(−λ²/2).

Proof. The sequence E0 = 0, Ek = ∑_{i=1}^k ρi δi, k = 1: n, is a martingale with respect to δ1, . . ., δn, since E(Ek | δ1, . . ., δk−1) = Ek−1 + ρk E(δk | δ1, . . ., δk−1) = Ek−1. Moreover, |Ek − Ek−1| = |δk| ≤ u, so Lemma 4.7 with ck = u shows that

| ∑_{i=1}^n ρi δi | ≤ λ√n u  (4.7)

holds with probability at least 1 − 2 exp(−λ²/2). By a Taylor expansion it can be shown that [14, (2.3)]

ρi log(1 + δi) = ρi δi + ηi,  |ηi| ≤ u²/(1 − u).

Hence, for ρi = ±1,

log ∏_{i=1}^n (1 + δi)^{ρi} = ∑_{i=1}^n ρi log(1 + δi) = ∑_{i=1}^n (ρi δi + ηi).

Summing gives

| log ∏_{i=1}^n (1 + δi)^{ρi} | ≤ | ∑_{i=1}^n ρi δi | + nu²/(1 − u),

which by (4.7) can be weakened to

| log ∏_{i=1}^n (1 + δi)^{ρi} | ≤ λ√n u + nu²/(1 − u).

We slightly weaken this bound further by dividing the λ√n u terms by 1 − u on each side, and then we exponentiate to obtain

exp( −(λ√n u + nu²)/(1 − u) ) ≤ ∏_{i=1}^n (1 + δi)^{ρi} ≤ exp( (λ√n u + nu²)/(1 − u) ).

From the definition of θn, we therefore have

|θn| ≤ exp( (λ√n u + nu²)/(1 − u) ) − 1 = γn(λ).

Since we have proven in Lemmas 4.4 and 4.5 that the assumptions required by Theorem 4.8 are enforced by stochastic rounding, we conclude that if the δi are generated by stochastic rounding then the probabilistic bound (4.6) holds rigorously (with u ← 2u in view of (2.4)), without exception. Hence for stochastic rounding the rule of thumb that one can replace nu in a worst-case error bound by √n u to obtain a more realistic (probabilistic) error bound is unconditionally true.
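An empirical look at the key bound (4.7) with errors generated by stochastic rounding, reusing sr_round (the parameter values are ours, and we take |δi| ≤ 2u since rounding is into binary32):

    import numpy as np

    rng = np.random.default_rng(5)
    u_sr, n, trials, lam = 2 * 2.0**-24, 200, 500, 1.5   # u_sr = 2u for binary32
    fails = 0
    for _ in range(trials):
        s, acc = np.float32(0), 0.0
        for _ in range(n):
            exact = float(s) + rng.uniform(0, 1)
            s = sr_round(exact, rng=rng)
            acc += (float(s) - exact) / exact            # rho_i = 1 for all i
        fails += abs(acc) > lam * np.sqrt(n) * u_sr
    print("failure rate", fails / trials, "<= bound", 2 * np.exp(-lam**2 / 2))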
Theorem 4.8 can now be used to derive analogues of the probabilistic error results from [14] for inner products, matrix-vector and matrix-matrix products, LU factorization, Cholesky factorization, solution of triangular systems, and solution of linear systems by LU factorization or Cholesky factorization. In all cases the assumption that the rounding errors are independent random variables can be weakened to an assumption of mean independence. These analogues therefore hold unconditionally for stochastic rounding.
As an example, we write down the result for inner products; the results for matrix-matrix products and the solution of linear systems (see Theorem 4.11 below) are obtained in the same way. Recall that Q(λ, n) is defined in (4.3). We define

γ̂n(λ) = exp( (2λ√n u + 4nu²)/(1 − 2u) ) − 1,

which is γn(λ) in (4.2) with u replaced by 2u.

Theorem 4.9 (inner products). Let y = aᵀb, where a, b ∈ Rⁿ, be evaluated in floating-point arithmetic with stochastic rounding. Then, no matter what the order of evaluation, the computed ŷ satisfies

ŷ = (a + Δa)ᵀb = aᵀ(b + Δb),  |Δa| ≤ γ̂n(λ)|a|,  |Δb| ≤ γ̂n(λ)|b|,

with probability at least Q(λ, n).

Proof. The proof is almost identical to that of [14, Thm. 3.1], the difference being that we invoke Theorem 4.8, which is applicable by Lemma 4.5, instead of Lemma 4.2.
4.3. The mean of the error. In the previous section we derived probabilistic error bounds. We now ask what the expected value of the computed result is under stochastic rounding. For a single floating-point operation we know that the expected value is the true value by Lemma 4.4, because E(1 + δ) = 1 for a single rounding error δ. In the next result we show that a product of rounding error terms also has expected value 1. The key property needed is mean independence.

Lemma 4.12. Let δ1, δ2, . . ., δn be random variables of mean zero such that E(δk+1 | δ1, . . ., δk) = E(δk+1) = 0 for k = 1: n − 1. Then

E( ∏_{i=1}^n (1 + δi) ) = 1.

Proof. Define Pn = ∏_{i=1}^n (1 + δi). We prove E(Pn) = 1 by induction. The result clearly holds for P1, since E(1 + δ1) = 1. Assume it holds for Pn−1. Using the law of total expectation (or tower property) we have

E(Pn) = E( E(Pn | δ1, . . ., δn−1) ) = E( Pn−1 E(1 + δn | δ1, . . ., δn−1) ) = E(Pn−1) = 1,

and the result follows by induction.
We note that Lemma 4.12 does not generalize to the product ∏_{i=1}^n (1 + δi)^{ρi} with ρi = ±1, since E((1 + δi)^{−1}) ≠ 1 in general. We apply the lemma to inner products.

Theorem 4.13 (inner products). Let y = aᵀb, where a, b ∈ Rⁿ, be evaluated in floating-point arithmetic. Under stochastic rounding, no matter what the order of evaluation, the computed ŷ satisfies E(ŷ) = y.

Proof. Standard backward error analysis [13, sect. 3.1] shows that ŷ can be written as

ŷ = ∑_{i=1}^n ai bi ∏_{k=1}^n (1 + δki),

where the δki satisfy (2.4b). (Some of the δki will be zero, depending on the order in which the inner product is evaluated.) Taking the mean and using Lemma 4.12, along with the fact that the rounding errors from stochastic rounding are mean independent with zero mean by Lemma 4.5, we obtain E(ŷ) = ∑_{i=1}^n ai bi = y.

As a special case of Theorem 4.13 we have that the expected value of a sum is the exact sum under stochastic rounding. Since matrix-vector products and matrix-matrix products are built on inner products, it follows from Theorem 4.13 that for them, also, the mean of the computed product is the exact product. These results do not, of course, hold for round to nearest, because it is deterministic.
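Theorem 4.13 specialized to summation can be checked by simulation, reusing sr_round (the data are ours): the average of repeated stochastically rounded sums approaches the exact sum, whereas round to nearest incurs a fixed bias.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 1, 1000)
    exact = float(np.sum(x))                  # binary64 reference

    def sr_sum(x, rng):
        s = np.float32(0)
        for xi in x:
            s = sr_round(float(s) + float(xi), rng=rng)
        return float(s)

    trials = [sr_sum(x, rng) for _ in range(200)]
    print("SR mean error:", np.mean(trials) - exact)     # shrinks as trials grow
    s32 = np.float32(0)
    for xi in x:                              # recursive round-to-nearest summation
        s32 = s32 + np.float32(xi)
    print("RN error:", float(s32) - exact)    # a fixed, generally nonzero bias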
This argument extends to the solution of triangular systems, as we now show. We need an extension of Lemma 4.12.
Lemma 4.14. Let δ−m, . . ., δ0, δ1, δ2, . . ., δn be random variables of mean zero such that E(δk | δ−m, . . ., δk−1) = E(δk) = 0 for k = 1: n.

Proof. We prove by induction that pn = 1. We have p1 = 1, by the mean independence of δ1 given δ−m, . . ., δ0. Assume that pn−1 = 1. Using the general form of the law of total expectation, conditioning on δ−m, . . ., δn−1 gives pn = pn−1 = 1, so the result follows by induction.
Theorem 4.15. Let the triangular system Tx = b, where T ∈ Rⁿˣⁿ is nonsingular, be solved by substitution with stochastic rounding. The computed solution x̂ satisfies E(x̂) = x.

5. Numerical experiments. While round to nearest leads to rounding errors violating the assumptions required for the probabilistic bound (4.6) to hold, stochastic rounding still enforces these assumptions. Stochastic rounding can therefore produce significantly more accurate results than round to nearest by reducing the error from nu to √n u. In particular, this explains the improvements from using stochastic rounding reported in deep learning applications.
The two examples above are bad cases for round to nearest. Figure 5.3 shows the results of an experiment with inner products of vectors x and y with elements sampled from the uniform distribution on [−1, 1]. In this case the errors for stochastic rounding and round to nearest do not grow with n and so are both much smaller than the probabilistic error bound. The reason the errors do not grow is that the elements of x and y have mean zero [15, Thm. 3.2]. Overall, round to nearest provides slightly more accurate results than stochastic rounding in this example, as might be expected in view of (2.3) and (2.4b).
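A minimal version of such an experiment, reusing sr_round (our sketch, with data on [0, 1] to reproduce a bad case for round to nearest rather than the mean zero case of Figure 5.3):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100_000
    a, b = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
    y = float(np.dot(a, b))                              # binary64 reference

    s_rn, s_sr = np.float32(0), np.float32(0)
    for ai, bi in zip(a, b):
        p = np.float32(ai) * np.float32(bi)              # product rounded to binary32
        s_rn = s_rn + p                                  # round-to-nearest accumulation
        s_sr = sr_round(float(s_sr) + float(p), rng=rng) # stochastic accumulation

    denom = float(np.dot(np.abs(a), np.abs(b)))
    print("RN backward error:", abs(float(s_rn) - y) / denom)
    print("SR backward error:", abs(float(s_sr) - y) / denom)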

6. Conclusions. Stochastic rounding is an old idea that is drawing renewed interest, notably in the context of deep learning. We have presented rounding error analyses applicable to a wide range of numerical linear algebra algorithms using floating-point arithmetic with stochastic rounding, and we expect our conclusions to extend to fixed-point arithmetic.
Stochastic rounding satisfies the basic model of floating-point arithmetic (2.3) provided that the unit roundoff u is replaced by 2u; see (2.4b). However, we have identified several properties of round to nearest that no longer hold with stochastic rounding. Before replacing round to nearest by stochastic rounding in a computation one should therefore check whether these properties are needed.
Stochastic rounding has some attractive features compared with round to nearest, especially for large problems and low precisions. We have shown that the rounding errors produced by stochastic rounding are mean independent random variables of mean zero. We have also generalized the probabilistic error analysis result of [14] (Lemma 4.2 here) by weakening the independence assumption to mean independence (Theorem 4.8). An important consequence of these results is that for stochastic rounding a worst-case error bound nu can be replaced by the more realistic probabilistic error bound √n u; that is, the long-standing rule of thumb is actually a rule for stochastic rounding.
Stochastic rounding can yield significantly more accurate results than round to nearest in situations where the latter violates the probabilistic bounds, notably in certain sums and inner products. In particular, we have proved that stochastic rounding avoids stagnation in summation and that the computed sum has expected value equal to the exact sum. These findings are particularly important for deep learning applications, where stagnation can hamper parameter updates in neural networks.


Theorem 4.11 (linear systems). Let A ∈ Rⁿˣⁿ and suppose that with stochastic rounding LU factorization and substitution produce computed factors L̂ and Û and a computed solution x̂ to Ax = b. Then


Fig. 5.3: Computed backward errors of inner products for data sampled uniformly from [−1, 1] in fp32 (a) and fp16 (b). For each value of n we perform the computation 10 times and plot the maximum backward error for each rounding mode.

Table 1.1: Parameters of floating-point systems. (sig., exp.) denotes the number of bits in the significand (including the implicit most significant bit) and exponent, u is the unit roundoff, x_min is the smallest normalized positive number, and x_max is the largest finite number.