Majorization-minimization algorithms consist of successively minimizing a sequence of upper bounds of the objective function. These upper bounds are tight at the current estimate, and each iteration monotonically drives the objective function downhill. Such a simple principle is widely applicable and has been very popular in various scientific fields, especially in signal processing and statistics. We propose an incremental majorization-minimization scheme for minimizing a large sum of continuous functions, a problem of utmost importance in machine learning. We present convergence guarantees for nonconvex and convex optimization when the upper bounds approximate the objective up to a smooth error; we call such upper bounds “first-order surrogate functions.” More precisely, we study asymptotic stationary point guarantees for nonconvex problems, and for convex ones, we provide convergence rates for the expected objective function value. We apply our scheme to composite optimization and obtain a new incremental proximal gradient algorithm with linear convergence rate for strongly convex functions. Our experiments show that our method is competitive with the state of the art for solving machine learning problems such as logistic regression when the number of training samples is large enough, and we demonstrate its usefulness for sparse estimation with nonconvex penalties.
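The incremental principle described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes scalar least-squares terms $f_i(\theta) = \tfrac{1}{2}(a_i\theta - b_i)^2$ and quadratic upper bounds $g_i(\theta) = f_i(\kappa) + f_i'(\kappa)(\theta - \kappa) + \tfrac{L}{2}(\theta - \kappa)^2$ with $L = \max_i a_i^2$, which are tight at the anchor point $\kappa$. The function name `incremental_mm`, the data, and the uniform random sampling rule are all assumptions made for illustration.

```python
import random


def incremental_mm(a, b, n_iters=2000, seed=0):
    """Incremental MM sketch for f(theta) = (1/n) * sum_i 0.5*(a[i]*theta - b[i])**2.

    Hypothetical helper for illustration only.
    """
    n = len(a)
    # L bounds the curvature of every f_i, so each quadratic g_i majorizes f_i.
    L = max(ai * ai for ai in a)
    rng = random.Random(seed)

    def grad(i, theta):
        # Derivative of f_i at theta.
        return a[i] * (a[i] * theta - b[i])

    theta = 0.0
    # z[i] is the minimizer of the i-th surrogate, anchored at the initial point.
    z = [theta - grad(i, theta) / L for i in range(n)]
    # The average of quadratics with equal curvature is minimized at the mean of z.
    theta = sum(z) / n

    for _ in range(n_iters):
        i = rng.randrange(n)               # pick a single term
        z[i] = theta - grad(i, theta) / L  # refresh only its surrogate at theta
        theta = sum(z) / n                 # minimize the averaged surrogates
    return theta


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [2.0, 3.0, 7.0]
    closed_form = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
    print(incremental_mm(a, b), closed_form)
```

Only one surrogate is refreshed per iteration, so the per-iteration cost does not require a pass over all terms (the sum over `z` could be maintained as a running total). For this strongly convex toy problem the iterate approaches the closed-form least-squares solution.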


Keywords

  1. nonconvex optimization
  2. convex optimization
  3. majorization-minimization

MSC codes

  1. 90C06
  2. 90C26
  3. 90C25





Published In

SIAM Journal on Optimization
Pages: 829 - 855
ISSN (online): 1095-7189


Submitted: 18 February 2014
Accepted: 27 January 2015
Published online: 14 April 2015

