Abstract

This paper develops a unified framework, based on iterated random operator theory, for analyzing the convergence of constant stepsize recursive stochastic algorithms (RSAs). RSAs use randomization to efficiently compute expectations, so their iterates form a stochastic process. The key idea of our analysis is to lift the RSA into an appropriate higher-dimensional space and express it as an equivalent Markov chain. Since this Markov chain may not converge under a constant stepsize, we instead study the convergence of its distribution. To this end, we define a new notion of Wasserstein divergence. We show that if the distribution of the iterates of the Markov chain satisfies a contraction property with respect to the Wasserstein divergence, then the Markov chain admits an invariant distribution. The convergence of a large family of constant stepsize RSAs can be understood within this framework, and we provide several detailed examples.
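To make the setup concrete, the following minimal sketch (not taken from the paper) simulates a constant stepsize stochastic gradient descent recursion, viewed as an iterated random map, on a scalar quadratic objective and tracks the 1-Wasserstein distance between the empirical laws of the iterates at successive steps. The objective, stepsize, noise model, and the helper empirical_w1 are illustrative assumptions, not quantities defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_w1(a, b):
    """1-Wasserstein distance between two 1-D empirical measures of equal size."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

alpha = 0.1                       # constant stepsize
n_chains, n_steps = 5000, 200     # parallel copies of the chain, run length
x = rng.normal(size=n_chains)     # initial iterates x_0 ~ N(0, 1)

prev = x.copy()
for k in range(1, n_steps + 1):
    # One step of the iterated random map: x_k = x_{k-1} - alpha * (grad f(x_{k-1}) + noise),
    # with f(x) = 0.5 * (x - 1)^2, so grad f(x) = x - 1.
    noise = rng.normal(scale=0.5, size=n_chains)
    x = x - alpha * ((x - 1.0) + noise)
    if k % 50 == 0:
        print(f"step {k:3d}: empirical W1 between laws at steps {k-1} and {k} "
              f"= {empirical_w1(prev, x):.4f}")
    prev = x.copy()
```

In this toy run the individual iterates keep fluctuating because the stepsize never shrinks, but the distance between the laws at successive steps decays, which is the kind of distributional stabilization, toward an invariant distribution, that the paper's framework formalizes.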

Keywords

  1. iterative random maps
  2. Wasserstein divergence
  3. stochastic gradient descent

MSC codes

  1. 93E35
  2. 60J20
  3. 68Q32

Supplementary Material


PLEASE NOTE: These supplementary files have not been peer-reviewed.


Index of Supplementary Materials

Title of paper: Convergence of Recursive Stochastic Algorithms using Wasserstein Divergence

Authors: A. Gupta and W. Haskell

File: Supplement.pdf

Type: PDF

Contents: In this supplementary material, we derive the contraction coefficient for several of the algorithms covered in the paper.


Information & Authors

Information

Published In

SIAM Journal on Mathematics of Data Science
Pages: 1141 - 1167
ISSN (online): 2577-0187

History

Submitted: 4 January 2021
Accepted: 1 July 2021
Published online: 21 October 2021

Authors

A. Gupta and W. Haskell

Funding Information

Advanced Research Projects Agency - Energy, https://doi.org/10.13039/100006133
National Science Foundation, https://doi.org/10.13039/100000001: ECCS 1610615
