Abstract.

In this work we establish an algorithm- and distribution-independent nonasymptotic trade-off between the model size, excess test loss, and training loss of linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either “classical” (training loss close to the noise level) or “modern” (many more parameters than the minimum needed to fit the training data exactly). We also provide a more precise asymptotic analysis for the case when the limiting spectral distribution of the whitened features is Marchenko–Pastur. Remarkably, while the Marchenko–Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, it coincides exactly with the distribution-independent bound as the level of overparameterization increases.
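As a rough illustration of the regimes described above, the following NumPy sketch (not taken from the paper; the Gaussian feature model, dimensions, and noise level are illustrative assumptions) fits minimum-norm least-squares predictors with a varying number of features d on n = 200 samples. In the classical regime d < n the training loss stays near the noise level sigma^2 = 0.25; the test loss spikes at the interpolation threshold d ≈ n; and for d well above n the training loss is exactly zero while the test loss falls back from the peak.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_star, sigma = 200, 20, 0.5                      # samples, signal dimension, noise std (illustrative)
    beta = rng.normal(size=d_star) / np.sqrt(d_star)     # true coefficients, ||beta|| roughly 1

    def train_test_loss(d, n_test=2000):
        """Squared train/test loss of the minimum-norm least-squares fit on the first d features."""
        width = max(d, d_star)
        X = rng.normal(size=(n, width))
        y = X[:, :d_star] @ beta + sigma * rng.normal(size=n)
        w = np.linalg.pinv(X[:, :d]) @ y                 # minimum-norm solution (ordinary least squares when d <= n)
        X_test = rng.normal(size=(n_test, width))
        y_test = X_test[:, :d_star] @ beta + sigma * rng.normal(size=n_test)
        return np.mean((X[:, :d] @ w - y) ** 2), np.mean((X_test[:, :d] @ w - y_test) ** 2)

    for d in [20, 100, 180, 200, 220, 400, 1000, 4000]:
        train_loss, test_loss = train_test_loss(d)
        print(f"d = {d:5d}   train = {train_loss:8.3f}   test = {test_loss:8.3f}")

With these isotropic features the heavily overparameterized fits do not beat the classical optimum; the sketch only exhibits the interpolation peak and the two training-loss regimes (near the noise level versus zero) that the trade-off distinguishes.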

Keywords

  1. statistical learning theory
  2. overfitting
  3. linear regression
  4. overparametrization

MSC codes

  1. 62J05
  2. 62F12

Acknowledgments.

The authors would like to thank Amirhesam Abedsoltan for finding an error in a previous version of the proof of Theorem 2.1. Correcting the proof led to an improved lower bound which is now tight. We also thank the anonymous reviewers for insightful comments. We are grateful for support from the National Science Foundation (NSF) and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning.

Information & Authors

Information

Published In

SIAM Journal on Mathematics of Data Science
Pages: 977 - 1004
ISSN (online): 2577-0187

History

Submitted: 9 December 2022
Accepted: 31 July 2023
Published online: 9 November 2023

Authors

Affiliations

Nikhil Ghosh
Statistics Department, University of California, Berkeley, CA 94720 USA.
Mikhail Belkin
Halıcıoğlu Data Science Institute, University of California, San Diego, CA 92093 USA.

Funding Information

National Science Foundation (NSF): DMS-2031883, 1745640, IIS-1815697, CCF-2112665
Simons Foundation: 814639 (Collaboration on the Theoretical Foundations of Deep Learning)
Funding: This work was funded by NSF award DMS-2031883 and Simons Foundation award 814639, with additional support from NSF grants 1745640, IIS-1815697, and CCF-2112665.

Metrics & Citations

Metrics

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited By

There are no citations for this item

View Options

View options

PDF

View PDF

Full Text

View Full Text

Media

Figures

Other

Tables

Share

Share

Copy the content Link

Share with email

Email a colleague

Share on social media