Abstract

CUDA, OpenCL, and OpenMP are popular programming models for the multicore architectures of CPUs and many-core architectures of GPUs or Xeon Phis. At the same time, computational scientists face the question of which programming model to use to obtain their scientific results. We present the linear algebra library ViennaCL, which is built on top of all three programming models, thus enabling computational scientists to interface to a single library, yet obtain high performance for all three hardware types. Since the respective compute back end can be selected at runtime, one can seamlessly switch between different hardware types without the need for error-prone and time-consuming recompilation steps. We present new benchmark results for sparse linear algebra operations in ViennaCL, complementing results for the dense linear algebra operations in ViennaCL reported in earlier work. Comparisons with vendor libraries show that ViennaCL provides better overall performance for sparse matrix-vector and sparse matrix-matrix products. Additional benchmark results for pipelined iterative solvers with kernel fusion and preconditioners identify the respective sweet spots for CPUs, Xeon Phis, and GPUs.

Keywords

  1. ViennaCL
  2. iterative solvers
  3. CUDA
  4. OpenCL
  5. OpenMP
  6. CPU
  7. GPU
  8. Xeon Phi

MSC codes

  1. 65F10
  2. 65F50
  3. 65Y05
  4. 65Y10



Published In

SIAM Journal on Scientific Computing
Pages: S412 - S439
ISSN (online): 1095-7197

History

Submitted: 16 June 2015
Accepted: 1 July 2016
Published online: 27 October 2016

Funding Information

Austrian Science Fund http://dx.doi.org/10.13039/501100002428 : P23296, P23598
European Research Council http://dx.doi.org/10.13039/501100000781 : 247056
