Abstract.

\(Q\)-learning with function approximation is among the most empirically successful yet theoretically mysterious reinforcement learning (RL) algorithms, and it was identified in [R. S. Sutton, in European Conference on Computational Learning Theory, Springer, Berlin, 1999, pp. 11–17] as one of the most important theoretical open problems in the RL community. Even in the basic setting with linear function approximation, there are well-known divergent examples. In this work, we propose a stable online variant of \(Q\)-learning with linear function approximation that uses a target network and truncation and is driven by a single trajectory of Markovian samples. We establish finite-sample guarantees for the algorithm, which imply a sample complexity of \(\tilde{\mathcal{O}}(\epsilon^{-2})\) up to a function approximation error. Importantly, the results are established under minimal assumptions, without modifying the problem parameters to achieve stability.
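
The two algorithmic ingredients named in the abstract, a target network and truncation, can be made concrete with a short sketch. The Python snippet below is a minimal illustration of one way to combine them with linear-function-approximation \(Q\)-learning along a single trajectory; it is not the paper's exact algorithm, and the environment interface (env.reset, env.step), the feature map phi, and all hyperparameter values are assumptions made only for this example.

```python
# Minimal sketch: online Q-learning with linear function approximation,
# a periodically synced target network, and truncation of the bootstrapped
# value estimate. Hypothetical environment interface and hyperparameters.
import numpy as np

def q_learning_target_truncation(env, phi, num_actions, dim,
                                 gamma=0.99, alpha=0.05,
                                 sync_period=500, num_steps=50_000,
                                 epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    r_max = 1.0                      # assumed bound on rewards (illustrative)
    q_max = r_max / (1.0 - gamma)    # truncation level for value estimates
    theta = np.zeros(dim)            # online weights
    theta_target = np.zeros(dim)     # target-network weights (frozen copy)

    s = env.reset()
    for t in range(num_steps):
        # Behavior policy: epsilon-greedy w.r.t. the online estimate,
        # applied along a single trajectory of Markovian samples.
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax([phi(s, b) @ theta for b in range(num_actions)]))

        s_next, r, done = env.step(a)

        # Bootstrapped target uses the *target* weights, with the value
        # estimate truncated to [0, q_max] (assuming rewards in [0, r_max]).
        next_vals = [phi(s_next, b) @ theta_target for b in range(num_actions)]
        truncated_max = float(np.clip(max(next_vals), 0.0, q_max))
        td_target = r + gamma * (0.0 if done else truncated_max)

        # Semi-gradient update of the online weights.
        feat = phi(s, a)
        theta += alpha * (td_target - feat @ theta) * feat

        # Periodically copy the online weights into the target network.
        if (t + 1) % sync_period == 0:
            theta_target = theta.copy()

        s = env.reset() if done else s_next

    return theta
```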

Keywords

  1. reinforcement learning
  2. \(Q\)-learning
  3. linear function approximation
  4. finite-sample analysis

MSC codes

  1. 60J20
  2. 93E20
  3. 90C40
  4. 62L20

Supplementary Materials

PLEASE NOTE: These supplementary files have not been peer-reviewed.
Index of Supplementary Materials
Title of paper: Target Network and Truncation Overcome the Deadly Triad in Q-Learning
Authors: Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri
File: supplement.pdf
Type: PDF
Contents: The supplement contains additional discussions.

References

1.
N. Agarwal, S. Chaudhuri, P. Jain, D. M. Nagaraj, and P. Netrapalli, Online target \(Q\)-learning with reverse experience replay: Efficiently finding the optimal policy for linear MDPs, in Proceedings of the International Conference on Learning Representations, 2021.
2.
L. Baird, Residual algorithms: Reinforcement learning with function approximation, in Machine Learning Proceedings 1995, Elsevier, Amsterdam, 1995, pp. 30–37, https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=DA12DC12763657BCF6093FA405C24F33?doi=10.1.1.50.7784&rep=rep1&type=pdf.
3.
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Nashua, NH, 1996.
4.
J. Bhandari, D. Russo, and R. Singal, A finite-time analysis of temporal difference learning with linear function approximation, in Proceedings of the Conference on Learning Theory, 2018, pp. 1691–1692.
5.
V. S. Borkar, A concentration bound for contractive stochastic approximation, Systems Control Lett., 153 (2021), 104947.
6.
V. S. Borkar and S. P. Meyn, The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM J. Control Optim., 38 (2000), pp. 447–469.
7.
Q. Cai, Z. Yang, J. D. Lee, and Z. Wang, Neural temporal difference and \(Q\)-learning provably converge to global optima, Math. Oper. Res., (2023), https://doi.org/10.1287/moor.2023.1370.
8.
D. Carvalho, F. S. Melo, and P. Santos, A new convergent variant of \(Q\)-learning with linear function approximation, Adv. Neural Inf. Process. Syst., 33 (2020).
9.
Z. Chen, J. P. Clarke, and S. T. Maguluri, Target Network and Truncation Overcome the Deadly Triad in \(Q\)-Learning, preprint, arXiv:2203.02628, 2022.
10.
Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, Finite-sample analysis of off-policy TD-learning via generalized Bellman operators, Adv. Neural Inf. Process. Syst., 34 (2021), pp. 21440–21452.
11.
Z. Chen, S. Zhang, T. T. Doan, J.-P. Clarke, and S. T. Maguluri, Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning, Automatica, 146 (2022), 110623.
12.
A. M. Devraj and S. Meyn, Zap \(Q\)-learning, in Advances in Neural Information Processing Systems, 2017, pp. 2235–2244.
13.
S. S. Du, J. D. Lee, G. Mahajan, and R. Wang, Agnostic \(Q\)-learning with function approximation in deterministic systems: Near-optimal bounds on approximation error and sample complexity, in Advances in Neural Information Processing Systems, 2020.
14.
S. S. Du, Y. Luo, R. Wang, and H. Zhang, Provably efficient \(Q\)-learning with function approximation via distribution shift error checking oracle, in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 8060–8070.
15.
Y. Duan, Z. Jia, and M. Wang, Minimax-optimal off-policy evaluation with linear function approximation, in International Conference on Machine Learning, PMLR, 2020, pp. 2701–2709.
16.
D. Ernst, P. Geurts, and L. Wehenkel, Tree-based batch mode reinforcement learning, J. Mach. Learn. Res., 6 (2005), pp. 503–556.
17.
J. Fan, Z. Wang, Y. Xie, and Z. Yang, A theoretical analysis of deep \(Q\)-learning, in Learning for Dynamics and Control, PMLR, 2020, pp. 486–489.
18.
Z. Gao, Q. Ma, T. Başar, and J. R. Birge, Finite-Sample Analysis of Decentralized \(Q\)-Learning for Stochastic Games, preprint, arXiv:2112.07859, 2021.
19.
L. Györfi, M. Kohler, A. Krzyzak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, Berlin, 2002.
20.
H. Hasselt, Double \(Q\)-learning, Adv. Neural Inf. Process. Syst., 23 (2010), pp. 2613–2621.
21.
T. Jaakkola, M. I. Jordan, and S. P. Singh, Convergence of stochastic iterative dynamic programming algorithms, in Advances in Neural Information Processing Systems, 1994, pp. 703–710.
22.
C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, Is \(Q\)-learning provably efficient?, in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 4868–4878.
23.
C. Jin, Z. Yang, Z. Wang, and M. I. Jordan, Provably efficient reinforcement learning with linear function approximation, in Proceedings of the Conference on Learning Theory, PMLR, 2020, pp. 2137–2143.
24.
S. Khodadadian, Z. Chen, and S. T. Maguluri, Finite-sample analysis of off-policy natural actor-critic algorithm, in Proceedings of the International Conference on Machine Learning, PMLR, 2021, pp. 5420–5431.
25.
D. Lee and N. He, A unified switching system perspective and convergence analysis of \(Q\)-learning algorithms, Adv. Neural Inf. Process. Syst., 33 (2020), pp. 15556–15567.
26.
D. A. Levin and Y. Peres, Markov Chains and Mixing Times, AMS, Providence, RI, 2017.
27.
G. Li, Y. Chen, Y. Chi, Y. Gu, and Y. Wei, Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting, Adv. Neural Inf. Process. Syst., 34 (2021).
28.
G. Li, L. Shi, Y. Chen, Y. Gu, and Y. Chi, Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, Adv. Neural Inf. Process. Syst., 34 (2021).
29.
G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, Sample complexity of asynchronous \(Q\)-learning: Sharper analysis and variance reduction, Adv. Neural Inf. Process. Syst., 33 (2020), pp. 7031–7043.
30.
S. Ma, Z. Chen, Y. Zhou, and S. Zou, Greedy-GQ with variance reduction: Finite-time analysis and improved complexity, in Proceedings of the International Conference on Learning Representations, 2021.
31.
H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, Toward off-policy learning control with function approximation, in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 719–726.
32.
F. S. Melo, S. P. Meyn, and M. I. Ribeiro, An analysis of reinforcement learning with function approximation, in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 664–671.
33.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature, 518 (2015), pp. 529–533.
34.
R. Munos and C. Szepesvári, Finite-time bounds for fitted value iteration, J. Mach. Learn. Res., 9 (2008).
35.
G. Qu and A. Wierman, Finite-time analysis of asynchronous stochastic approximation and \(Q\)-learning, in Proceedings of the Conference on Learning Theory, PMLR, 2020, pp. 3185–3205.
36.
H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400–407.
37.
S. P. Singh and R. C. Yee, An upper bound on the loss from approximate optimal-value functions, Mach. Learn., 16 (1994), pp. 227–233.
38.
R. Srikant and L. Ying, Finite-time error bounds for linear stochastic approximation and TD-learning, in Proceedings of the Conference on Learning Theory, 2019, pp. 2803–2830.
39.
R. S. Sutton, Open theoretical questions in reinforcement learning, in European Conference on Computational Learning Theory, Springer, Berlin, 1999, pp. 11–17.
40.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 2018.
41.
C. Szepesvári and R. Munos, Finite time bounds for sampling-based fitted value iteration, in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 880–887.
42.
J. N. Tsitsiklis, Asynchronous stochastic approximation and \(Q\)-learning, Mach. Learn., 16 (1994), pp. 185–202.
43.
J. N. Tsitsiklis and B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Trans. Automat. Control, 42 (1997), pp. 674–690.
44.
R. Wang, D. Foster, and S. M. Kakade, What are the statistical limits of offline RL with linear function approximation?, in Proceedings of the International Conference on Learning Representations, 2020.
45.
Y. Wang and S. Zou, Finite-sample analysis of Greedy-GQ with linear function approximation under Markovian noise, in Proceedings of the Conference on Uncertainty in Artificial Intelligence, PMLR, 2020, pp. 11–20.
46.
C. J. Watkins and P. Dayan, \(Q\)-learning, Mach. Learn., 8 (1992), pp. 279–292.
47.
T. Xie and N. Jiang, Batch value-function approximation with only realizability, in Proceedings of the International Conference on Machine Learning, PMLR, 2021, pp. 11404–11413.
48.
P. Xu and Q. Gu, A finite-time analysis of \(Q\)-learning with neural network function approximation, in Proceedings of the International Conference on Machine Learning, PMLR, 2020, pp. 10555–10565.
49.
T. Xu and Y. Liang, Sample complexity bounds for two timescale value-based reinforcement learning algorithms, in Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 811–819.
50.
L. Yang and M. Wang, Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound, in Proceedings of the International Conference on Machine Learning, PMLR, 2020, pp. 10746–10756.
51.
A. Zanette, Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL, in Proceedings of the International Conference on Machine Learning, PMLR, 2021, pp. 12287–12297.
52.
A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill, Learning near optimal policies with low inherent Bellman error, in Proceedings of the International Conference on Machine Learning, PMLR, 2020, pp. 10978–10989.
53.
A. Zanette and M. Wainwright, Stabilizing \(Q\)-learning with linear architectures for provable efficient learning, in Proceedings of the International Conference on Machine Learning, PMLR, 2022, pp. 25920–25954.
54.
S. Zhang, H. Yao, and S. Whiteson, Breaking the deadly triad with a target network, in Proceedings of the 38th International Conference on Machine Learning, PMLR, 2021, pp. 12621–12631.
55.
S. Zou, T. Xu, and Y. Liang, Finite-sample analysis for SARSA with linear function approximation, in Advances in Neural Information Processing Systems, 2019, pp. 8668–8678.

Information & Authors

Published In

SIAM Journal on Mathematics of Data Science
Pages: 1078 - 1101
ISSN (online): 2577-0187

History

Submitted: 3 June 2022
Accepted: 5 September 2023
Published online: 7 December 2023

Authors

Affiliations

Zaiwei Chen
Computing + Mathematical Sciences, California Institute of Technology, Pasadena, CA 91106 USA.
John-Paul Clarke
Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, TX 78712 USA.
Siva Theja Maguluri
Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA.

Funding Information

Funding: This work was partially supported by National Science Foundation grants EPCN-2144316, CPS-2240982, and CMMI-2112533, and by RTX.
