Abstract.

We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularizers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. In contrast to discrete-time problems, the cost is noncoercive in the policy, and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures–Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a priori bound and to converge globally to the optimal policy at a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis and achieves robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
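
To make the flavor of the updates concrete, the sketch below shows one geometry-aware step of the kind described above for a Gaussian policy with mean Kx and state-independent covariance Sigma at a single time point: a Fisher-preconditioned (natural-gradient) step for the gain K and a Bures–Wasserstein retraction step for Sigma. This is a minimal illustrative sketch, not the paper's exact update rules; the function names, the gradient oracles grad_K and grad_Sigma, and the form of the Fisher preconditioner are assumptions introduced here, and in the finite-horizon problem both K and Sigma would be functions of time.

```python
import numpy as np

def fisher_step_mean(K, grad_K, fisher, eta):
    """One Fisher-preconditioned (natural-gradient) step on the linear-in-state gain K.

    grad_K : Euclidean gradient of the cost with respect to K (same shape as K).
    fisher : positive definite matrix used as an approximate Fisher preconditioner
             for the mean parameters (its exact form is an assumption here).
    """
    return K - eta * grad_K @ np.linalg.inv(fisher)

def bures_wasserstein_step_cov(Sigma, grad_Sigma, eta):
    """One Bures-Wasserstein gradient step on the policy covariance Sigma.

    With symmetric Euclidean gradient G, the retraction
    Sigma_+ = (I - eta*G) Sigma (I - eta*G) keeps Sigma positive definite
    for sufficiently small eta.
    """
    d = Sigma.shape[0]
    G = 0.5 * (grad_Sigma + grad_Sigma.T)  # symmetrize for safety
    M = np.eye(d) - eta * G
    return M @ Sigma @ M.T

# Toy usage with hypothetical shapes (2 states, 1 action) and placeholder gradients.
K = np.zeros((1, 2))
Sigma = np.eye(1)
K = fisher_step_mean(K, grad_K=np.ones((1, 2)), fisher=np.eye(2), eta=0.1)
Sigma = bures_wasserstein_step_cov(Sigma, grad_Sigma=np.array([[0.5]]), eta=0.1)
```

One appeal of the Bures–Wasserstein retraction is that it keeps the covariance iterates positive definite by construction for small step sizes, which is consistent with the paper's emphasis on updates whose iterates remain well behaved even though the cost is noncoercive in the policy.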

Keywords

  1. continuous-time linear-quadratic control
  2. policy optimization
  3. relative entropy
  4. geometry-aware gradient
  5. global linear convergence
  6. mesh-independent convergence

MSC codes

  1. 68Q25
  2. 93E20

Information & Authors

Published In

SIAM Journal on Control and Optimization
Pages: 1060 - 1092
ISSN (online): 1095-7138

History

Submitted: 8 November 2022
Accepted: 5 January 2024
Published online: 22 March 2024

Authors

Affiliations

Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK.
Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK.
Department of Mathematics, Imperial College London, London, SW7 2AZ, UK.
