Abstract

The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than 60 years old, it is still a serious candidate for extreme-scale computations on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices affect dramatically their sensitivity to natural radiation and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset which consists in a bit-flip in a memory cell producing unexpected results at the application level. Consequently, future extreme-scale computing facilities will be more prone to errors of any kind, including bit-flips, during their calculations. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such soft errors and assess their robustness through extensive numerical experiments.

Keywords

  1. conjugate gradient
  2. soft errors
  3. bit-flip
  4. numerical detection

MSC codes

  1. 15A06
  2. 65F10
  3. 65N12

Get full access to this article

View all available purchase options and get full access to this article.

References

1.
E. Agullo, S. Cools, E. Fatih-Yetkin, L. Giraud, N. Schenkel, and W. Vanroose, On Soft Errors in the Conjugate Gradient Method: Sensitivity and Robust Numerical Detection---Revised, Research Report RR-9330, Inria Bordeaux Sud-Ouest, 2020.
2.
E. Agullo, S. Cools, E. Fatih-Yetkin, L. Giraud, N. Schenkels, and W. Vanroose, A Complementary Note on Soft Errors in the Conjugate Gradient Method: The Persistent Error Case, Research Report RR-9360, Inria Bordeaux Sud-Ouest, August 2020.
3.
E. Agullo, L. Giraud, and L. Poirel, Robust preconditioners via generalized eigenproblems for hybrid sparse linear solvers, SIAM J. Matrix Anal. Appl., 40 (2019), pp. 417--439.
4.
W. Bland, A. Bouteiller, T. Hérault, G. Bosilca, and J. J. Dongarra, Post-failure recovery of MPI communication capability: Design and rationale, IJHPCA, 27 (2013), pp. 244--254.
5.
A. Bouras and V. Frayssé, Inexact matrix-vector products in Krylov methods for solving linear systems: A relaxation strategy, SIAM J. Matrix Anal. Appl., 26 (2005), pp. 660--678.
6.
A. Bouras, V. Frayssé, and L. Giraud, A Relaxation Strategy for Inner-Outer Linear Solvers in Domain Decomposition Methods, Technical Report TR/PA/00/17, CERFACS, Toulouse, France, 2000.
7.
W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial, SIAM, Philadelphia, 2000.
8.
J. Elliott, F. Mueller, M. Stoyanov, and C. Webster, Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic, Technical Report 2013-2, 2013.
9.
J. Elliott, M. Hoemmen, and F. Mueller, Evaluating the impact of SDC on the GMRES iterative solver, in proceedings of the 28th International, Parallel and Distributed Processing Symposium, IEEE, 2014, pp. 1193--1202.
10.
J. Elliott, M. Hoemmen, and F. Mueller, Exploiting data representation for fault tolerance, J. Comput. Sci., 14 (2016), pp. 51--60.
11.
MPI: A Message Passing Interface Standard Version 3.1, The MPI Forum, 2015.
12.
M. L. Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, Mixed-precision in-memory computing, Nature Electronics, 1 (2018), pp. 246--253.
13.
A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam, PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, 1994.
14.
P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned conjugate gradient algorithm, Parallel Comput., 40 (2014), pp. 224--238.
15.
A. Greenbaum, Estimating the attainable accuracy of recursively computed residual methods, SIAM J. Matrix Anal. Appl., 18 (1997), pp. 535--551.
16.
M. H. Gutknecht and Z. Strakoš, Accuracy of two three-term and three two-term recurrences for Krylov space solvers, SIAM J. Matrix Anal. Appl., 22 (2000), pp. 213--229.
17.
N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), pp. 217--288.
18.
T. Herault and Y. Robert, eds., Fault-Tolerance Techniques for High-Performance Computing, Springer, New York, 2015.
19.
M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. National Bureau of Standards, 46 (1952), pp. 409--436.
20.
N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, Philadelphia, 2002.
21.
M. Hoemmen and M. A. Heroux, Fault-tolerant iterative methods via selective reliability, in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage, and Analysis, 2011.
22.
K.-H. Huang and J. a Abraham, Algorithm-based fault tolerance for matnx operations, IEEE Trans. Computers, C-33 (1984), pp. 518--528.
23.
J. Liesen and Z. Strakoš, Krylov Subspace Methods, Numer. Math. Sci. Comput., Oxford University Press, New York, 2013.
24.
G. Meurant and Z. Strakoš, The Lanczos and conjugate gradient algorithms in finite precision arithmetic, Acta Numer., 15 (2006), pp. 471--542.
25.
B. O. Mutlu, G. Kestor, J. Manzano, O. Unsal, S. Chatterjee, and S. Krishnamoorthy, Characterization of the impact of soft errors on iterative methods, in Proceedings of the 25th International Conference on High Performance Computing, IEEE, 2018, pp. 203--214.
26.
L. N. Olson and J. B. Schroder, PyAMG: Algebraic Multigrid Solvers in Python v4.0, Release 4.0, 2018.
27.
B. Parhami, Defect, fault, error,..., or failure?, IEEE Trans. Reliability, 46 (1997), pp. 450--451.
28.
Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, Philadelphia, 2003.
29.
P. Sao and R. Vuduc, Self-stabilizing iterative solvers, in Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2013, pp. 1--8.
30.
M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Characterizing the impact of soft errors on iterative methods in scientific computing, in Proceedings of the International Conference on Supercomputing, 2011, p. 152.
31.
V. Simoncini and D. B. Szyld, Theory of inexact Krylov subspace methods and applications to scientific computing, SIAM J. Sci. Comput., 25 (2003), pp. 454--477.
32.
Z. Strakoš and P. Tichy, Error estimation in preconditioned conjugate gradients, BIT, 45 (2005), pp. 789--817.
33.
H. A. Van der Vorst, Iterative Krylov Methods for Large Linear Systems. Cambridge Monogr. Appl. Comput. Math. 13, Cambridge University Press, Cambridge, UK, 2003.
34.
H. A. van der Vorst and Q. Ye, Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals, SIAM J. Sci. Comput., 22 (2000), pp. 835--852.
35.
R. Velazco, P. Fouillat, and R. Reis, Radiation Effects on Embedded Systems. Springer, New York, 2007.

Information & Authors

Information

Published In

cover image SIAM Journal on Scientific Computing
SIAM Journal on Scientific Computing
Pages: C335 - C358
ISSN (online): 1095-7197

History

Submitted: 26 November 2018
Accepted: 28 September 2020
Published online: 18 November 2020

Keywords

  1. conjugate gradient
  2. soft errors
  3. bit-flip
  4. numerical detection

MSC codes

  1. 15A06
  2. 65F10
  3. 65N12

Authors

Affiliations

Funding Information

Fonds Wetenschappelijk Onderzoek https://doi.org/10.13039/501100003130 : 12H4617N
Institut national de recherche en informatique et en automatique https://doi.org/10.13039/100012950

Metrics & Citations

Metrics

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited By

View Options

View options

PDF

View PDF

Media

Figures

Other

Tables

Share

Share

Copy the content Link

Share with email

Email a colleague

Share on social media