Abstract.

Kakade’s natural policy gradient method has been studied extensively in recent years, showing linear convergence with and without regularization. We study another natural gradient method based on the Fisher information matrix of the state-action distributions, which has received little attention from the theoretical side. Here, the state-action distributions follow the Fisher–Rao gradient flow inside the state-action polytope with respect to a linear potential. Therefore, we study Fisher–Rao gradient flows of linear programs more generally and show linear convergence with a rate that depends on the geometry of the linear program. Equivalently, this yields an estimate on the error induced by entropic regularization of the linear program, which improves existing results. We extend these results and show sublinear convergence for perturbed Fisher–Rao gradient flows and natural gradient flows up to an approximation error. In particular, these general results cover the case of state-action natural policy gradients.
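
To make the central object concrete, the following sketch (not taken from the paper; the cost vector c, step size dt, and horizon T are arbitrary choices for illustration) integrates the simplest instance of a Fisher–Rao gradient flow of a linear program: maximizing a linear potential <c, mu> over the probability simplex. There the flow reduces to the replicator equation d/dt mu_i = mu_i (c_i - <c, mu>), whose solution mu(t) is proportional to mu(0) * exp(t*c) and therefore coincides with the solution of the entropy-regularized problem max <c, mu> - (1/t) KL(mu || mu(0)); the objective gap decays at a linear (exponential) rate governed by the gap between the largest and second-largest entry of c.

    import numpy as np

    # Linear objective to maximize over the probability simplex; the LP optimum
    # is the vertex e_0, since c[0] is the unique largest entry.
    c = np.array([1.0, 0.7, 0.2])
    mu0 = np.ones_like(c) / len(c)      # uniform initial distribution (interior point)

    # Explicit Euler discretization of the replicator ODE
    #   d/dt mu_i = mu_i * (c_i - <c, mu>),
    # i.e. the Fisher-Rao gradient flow of mu -> <c, mu> on the simplex.
    dt, T = 1e-3, 20.0                  # arbitrary step size and time horizon
    mu = mu0.copy()
    for _ in range(int(T / dt)):
        mu += dt * mu * (c - c @ mu)
        mu /= mu.sum()                  # counteract discretization drift off the simplex

    # Closed-form solution mu(T) ~ mu0 * exp(T*c), which also solves the
    # entropy-regularized LP  max <c, mu> - (1/T) * KL(mu || mu0).
    mu_exact = mu0 * np.exp(T * (c - c.max()))   # shift by max(c) for numerical stability
    mu_exact /= mu_exact.sum()

    print("Euler flow   :", np.round(mu, 4))
    print("closed form  :", np.round(mu_exact, 4))
    print("objective gap:", c.max() - c @ mu,
          "(decays like exp(-Delta*T) with Delta =", c[0] - c[1], ")")

In this toy setting, reading the flow at time T gives exactly the entropy-regularized solution at temperature 1/T, which mirrors the equivalence between convergence rates and entropic regularization error described in the abstract; the paper's results concern the general linear-program and state-action settings rather than this single-simplex example.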

Keywords

  1. Fisher–Rao metric
  2. linear program
  3. entropic regularization
  4. multiplayer game
  5. Markov decision process
  6. natural policy gradient

MSC codes

  1. 65K05
  2. 90C05
  3. 90C08
  4. 90C40
  5. 90C53

Acknowledgments.

The project originated when the first author was a Ph.D. student at the Max Planck Institute for Mathematics in the Sciences (MPI MiS) with additional support from the Evangelisches Studienwerk Villigst e.V.

Information & Authors

Published In

SIAM Journal on Optimization
Pages: 1060 - 1088
ISSN (online): 1095-7189

History

Submitted: 10 April 2024
Accepted: 22 January 2025
Published online: 15 May 2025

Authors

Affiliations

Johannes Müller
Corresponding author. Chair of Mathematics of Information Processing, RWTH Aachen University, 52070 Aachen, Germany.
Semih Çayci
Chair of Mathematics of Information Processing, RWTH Aachen University, 52070 Aachen, Germany.
Guido Montúfar
Departments of Mathematics and Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA 90095, USA, and Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany.

Funding Information

Funding: The first and second authors acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under project 442047500 through the Collaborative Research Center Sparsity and Singular Structures (SFB 1481). The second author acknowledges support by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the German State of North Rhine-Westphalia (MKW) under the Excellence Strategy of the Federal Government and the Länder. The third author has been supported in part by NSF CAREER 2145630, NSF 2212520, DFG SPP 2298 project 464109215, ERC 757983, and BMBF in DAAD project 57616814.
