Software and High-Performance Computing

Avoiding Communication in Primal and Dual Block Coordinate Descent Methods

Primal and dual block coordinate descent methods are iterative methods for solving regularized and unregularized optimization problems. Distributed-memory parallel implementations of these methods have become popular in analyzing large machine learning datasets. However, existing implementations communicate at every iteration, and on modern data center and supercomputing architectures this communication often dominates the cost of floating-point computation. Recent results on communication-avoiding Krylov subspace methods suggest that large speedups are possible by reorganizing iterative algorithms to avoid communication. We show how applying similar algorithmic transformations yields primal and dual block coordinate descent methods for the regularized least-squares problem that communicate only every $s$ iterations, where $s$ is a tuning parameter, instead of every iteration. We show that the communication-avoiding variants reduce the number of synchronizations by a factor of $s$ on distributed-memory parallel machines without altering the convergence rate, and that they attain strong-scaling speedups of up to $6.1\times$ over the “standard algorithm” on a Cray XC30 supercomputer.
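
The reorganization can be made concrete with a small sketch. The following is a minimal serial NumPy illustration, not the paper's implementation: it contrasts a standard block coordinate descent loop for the regularized least-squares (ridge) problem, which would need one allreduce per iteration when $A$ is partitioned by rows, with an $s$-step rearrangement that computes the Gram blocks for $s$ chosen coordinate blocks up front and then replays the updates locally. The block-selection rule, problem sizes, and all variable names are illustrative assumptions.

```python
# A minimal, serial NumPy sketch contrasting standard block coordinate
# descent (one synchronization per iteration) with an s-step variant
# (one synchronization per s iterations) on ridge regression,
#     min_x 0.5*||Ax - b||^2 + 0.5*lam*||x||^2.
# The block rule, sizes, and names are illustrative assumptions, not the
# paper's implementation; "allreduce" comments mark where a row-partitioned
# distributed run would synchronize.
import numpy as np

rng = np.random.default_rng(0)
m, n, bs, lam, s = 200, 40, 4, 1e-2, 5      # rows, cols, block size, reg., s
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
blocks = [rng.choice(n, size=bs, replace=False) for _ in range(s)]

# --- standard variant: one allreduce per iteration --------------------------
x1, r1 = np.zeros(n), b.copy()               # maintain residual r = b - A x
for I in blocks:
    AI = A[:, I]
    G = AI.T @ AI + lam * np.eye(bs)         # allreduce: Gram block + A_I^T r
    dx = np.linalg.solve(G, AI.T @ r1 - lam * x1[I])   # exact block solve
    x1[I] += dx
    r1 -= AI @ dx

# --- communication-avoiding variant: one allreduce per s iterations ---------
Y = np.concatenate([A[:, I] for I in blocks], axis=1)  # m x (s*bs)
G = Y.T @ Y                                  # single allreduce computes all
g = Y.T @ b                                  # s Gram blocks and right sides
x2, dxs = np.zeros(n), []
for k, I in enumerate(blocks):
    rows = slice(k * bs, (k + 1) * bs)
    rhs = g[rows] - lam * x2[I]
    for j, dx in enumerate(dxs):             # replay earlier updates locally:
        rhs -= G[rows, j * bs:(j + 1) * bs] @ dx       # A_I^T r via G blocks
    dx = np.linalg.solve(G[rows, rows] + lam * np.eye(bs), rhs)
    x2[I] += dx
    dxs.append(dx)

print(np.allclose(x1, x2))                   # True: identical iterates
```

In MPI terms, the first loop would call an allreduce $s$ times while the second calls it once, which is the factor-of-$s$ reduction in synchronizations described above; the price, as in the Krylov setting, is forming and storing the larger $s\,\mathrm{bs} \times s\,\mathrm{bs}$ Gram matrix.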

  • 1.  G. Ballard, Avoiding Communication in Dense Linear Algebra, Ph.D. thesis, EECS Department, University of California, Berkeley, Berkeley, CA, 2013.

  • 2.  G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, Communication lower bounds and optimal algorithms for numerical linear algebra, Acta Numer., 23 (2014), pp. 1--155, https://doi.org/10.1017/S0962492914000038.

  • 3.  Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996, https://doi.org/10.1137/1.9781611971484.

  • 4.  L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of Computational Statistics, Springer, New York, 2010, pp. 177--186, https://doi.org/10.1007/978-3-7908-2604-3_16.

  • 5.  E. Carson, Communication-Avoiding Krylov Subspace Methods in Theory and Practice, Ph.D. thesis, EECS Department, University of California, Berkeley, Berkeley, CA, 2015.

  • 6.  E. Carson and J. Demmel, A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods, SIAM J. Matrix Anal. Appl., 35 (2014), pp. 22--43, https://doi.org/10.1137/120893057.

  • 7.  E. Carson and J. W. Demmel, Accuracy of the s-step Lanczos method for the symmetric eigenproblem in finite precision, SIAM J. Matrix Anal. Appl., 36 (2015), pp. 793--819, https://doi.org/10.1137/140990735.

  • 8.  E. Carson, N. Knight, and J. Demmel, Avoiding communication in nonsymmetric Lanczos-based Krylov subspace methods, SIAM J. Sci. Comput., 35 (2013), pp. S42--S61, https://doi.org/10.1137/120881191.

  • 9.  E. Carson, N. Knight, and J. Demmel, An efficient deflation technique for the communication-avoiding conjugate gradient method, Electron. Trans. Numer. Anal., 43 (2014), pp. 125--141.

  • 10.  C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intelligent Syst. Tech., 2 (2011), pp. 1--27, https://doi.org/10.1145/1961189.1961199.

  • 11.  A. Chronopoulos and C. Gear, On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy, Parallel Comput., 11 (1989), pp. 37--53, https://doi.org/10.1016/0167-8191(89)90062-8.

  • 12.  A. Chronopoulos and C. Gear, s-step iterative methods for symmetric linear systems, J. Comput. Appl. Math., 25 (1989), pp. 153--168, https://doi.org/10.1016/0377-0427(89)90045-9.

  • 13.  T. A. Davis, S. Rajamanickam, and W. M. Sid-Lakhdar, A survey of direct methods for sparse linear systems, Acta Numer., 25 (2016), pp. 383--566, https://doi.org/10.1017/S0962492916000076.

  • 14.  J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SIAM J. Sci. Comput., 34 (2012), pp. A206--A239, https://doi.org/10.1137/080731992.

  • 15.  J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick, Avoiding Communication in Computing Krylov Subspaces, Tech. Report UCB/EECS-2007-123, EECS Department, University of California, Berkeley, Berkeley, CA, 2007.

  • 16.  Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, University of Tennessee, Knoxville, TN, 1994.

  • 17.  G. H. Gonnet, Expected length of the longest probe sequence in hash code searching, J. ACM, 28 (1981), pp. 289--304, https://doi.org/10.1145/322248.322254.

  • 18.  M. Hoemmen, Communication-Avoiding Krylov Subspace Methods, Ph.D. thesis, University of California, Berkeley, Berkeley, CA, 2010.

  • 19.  M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, Communication-efficient distributed dual coordinate ascent, in Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, 2014, pp. 3068--3076.

  • 20.  S. Kim and A. Chronopoulos, An efficient nonsymmetric Lanczos method on parallel vector computers, J. Comput. Appl. Math., 42 (1992), pp. 357--374, https://doi.org/10.1016/0377-0427(92)90085-C.

  • 21.  K. Lang, NewsWeeder: Learning to filter netnews, in Proceedings of the 12th International Machine Learning Conference, Tahoe City, CA, 1995, https://doi.org/10.1016/B978-1-55860-377-6.50048-7.

  • 22.  M. Lichman, UCI Machine Learning Repository, 2013, http://archive.ics.uci.edu/ml.

  • 23.  A. McCallum, SRAA: Simulated/Real/Aviation/Auto UseNet Data, https://people.cs.umass.edu/~mccallum/data.html.

  • 24.  M. D. Mitzenmacher, The Power of Two Choices in Randomized Load Balancing, Ph.D. thesis, EECS Department, University of California, Berkeley, Berkeley, CA, 1996.

  • 25.  M. Mohiyuddin, Tuning Hardware and Software for Multiprocessors, Ph.D. thesis, EECS Department, University of California, Berkeley, Berkeley, CA, 2012.

  • 26.  M. Mohiyuddin, M. Hoemmen, J. Demmel, and K. Yelick, Minimizing communication in sparse matrix solvers, in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM, New York, 2009, Article 36, https://doi.org/10.1145/1654059.1654096.

  • 27.  NERSC, NERSC Edison Configuration, http://www.nersc.gov/users/computational-systems/edison/configuration/.

  • 28.  Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J. Optim., 22 (2012), pp. 341--362, https://doi.org/10.1137/100802001.

  • 29.  Z. Qu, P. Richtárik, M. Takáč, and O. Fercoq, SDNA: Stochastic dual Newton ascent for empirical risk minimization, in Proceedings of the 33rd International Conference on Machine Learning, Volume 48, New York, NY, 2016, pp. 1823--1832, http://dl.acm.org/citation.cfm?id=3045390.3045583.

  • 30.  M. Raab and A. Steger, “Balls into Bins”---A Simple and Tight Analysis, in Randomization and Approximation Techniques in Computer Science, Springer, New York, 1998, pp. 159--170, https://doi.org/10.1007/3-540-49543-6_13.

  • 31.  B. Recht, C. Ré, S. Wright, and F. Niu, Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems, Granada, Spain, 2011, pp. 693--701.

  • 32.  P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Math. Program., 144 (2014), pp. 1--38, https://doi.org/10.1007/s10107-012-0614-z.

  • 33.  Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, 2003, https://doi.org/10.1137/1.9780898718003.

  • 34.  S. Shalev-Shwartz and T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization, J. Mach. Learn. Res., 14 (2013), pp. 567--599.

  • 35.  E. Solomonik, Provably Efficient Algorithms for Numerical Tensor Algebra, Ph.D. thesis, EECS Department, University of California, Berkeley, Berkeley, CA, 2014.

  • 36.  M. Takáč, P. Richtárik, and N. Srebro, Distributed Mini-Batch SDCA, preprint, https://arxiv.org/abs/1507.08322, 2015.

  • 37.  The MathWorks, Box Plots, https://www.mathworks.com/help/stats/box-plots.html.

  • 38.  J. Van Rosendale, Minimizing Inner Product Data Dependencies in Conjugate Gradient Iteration, IEEE Computer Society, Los Alamitos, CA, 1983.

  • 39.  H. F. Walker, Implementation of the GMRES method using Householder transformations, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 152--163, https://doi.org/10.1137/0909010.

  • 40.  S. Williams, M. Lijewski, A. Almgren, B. Van Straalen, E. Carson, N. Knight, and J. Demmel, s-step Krylov subspace methods as bottom solvers for geometric multigrid, in Proceedings of the International Parallel and Distributed Processing Symposium, Phoenix, AZ, 2014, pp. 1149--1158, https://doi.org/10.1109/IPDPS.2014.119.

  • 41.  S. J. Wright, Coordinate descent algorithms, Math. Program., 151 (2015), pp. 3--34, https://doi.org/10.1007/s10107-015-0892-3.