Abstract

In this article, we show the existence of minimizers in the loss landscape for residual artificial neural networks (ANNs) with a multidimensional input layer and one hidden layer with ReLU activation. Our work contrasts with earlier results in [D. Gallon, A. Jentzen, and F. Lindner, preprint, arXiv:2211.15641, 2022] and [P. Petersen, M. Raslan, and F. Voigtlaender, Found. Comput. Math., 21 (2021), pp. 375–444], which showed that in many situations minimizers do not exist for common smooth activation functions, even when the target functions are polynomials. The proof of the existence property relies on a closure of the search space that contains all functions generated by ANNs together with additional discontinuous generalized responses. As we show, the additional generalized responses in this larger space are suboptimal, so that the minimum is attained in the original function class.
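
For orientation, the setting can be sketched as follows; the notation below is our own illustrative choice (the network width N, input dimension d, target f, domain D, and finite measure μ are assumptions made for this sketch and need not match the article's exact definitions). A shallow residual ReLU network with one-dimensional output realizes

  \mathcal{R}_\theta(x) \;=\; c + \langle v, x \rangle + \sum_{j=1}^{N} w_j \max\{\langle a_j, x \rangle + b_j,\, 0\},
  \qquad \theta = \bigl((a_j, b_j, w_j)_{j=1}^{N},\, v,\, c\bigr) \in (\mathbb{R}^{d} \times \mathbb{R} \times \mathbb{R})^{N} \times \mathbb{R}^{d} \times \mathbb{R},

and the associated loss is

  \mathcal{L}(\theta) \;=\; \int_{D} \bigl(\mathcal{R}_\theta(x) - f(x)\bigr)^{2} \, \mu(\mathrm{d}x).

In this notation, the existence statement asserts that \inf_{\theta} \mathcal{L}(\theta) is attained, i.e., there exists \theta^{*} with \mathcal{L}(\theta^{*}) = \min_{\theta} \mathcal{L}(\theta). The proof strategy described above minimizes over a closure of the set of responses \{\mathcal{R}_\theta\} enlarged by certain discontinuous generalized responses and then verifies that these generalized responses are never strictly better than genuine network responses.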

Keywords

  1. neural networks
  2. shallow networks
  3. best approximation
  4. ReLU activation
  5. approximatively compact

MSC codes

  1. 68T07
  2. 68T05
  3. 41A50

References

AB09.
H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Math. Program., 116 (2009), pp. 5–16, https://doi.org/10.1007/s10107-007-0133-5.
AMA05.
P.-A. Absil, R. Mahony, and B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM J. Optim., 16 (2005), pp. 531–547, https://doi.org/10.1137/040605266.
BDL07.
J. Bolte, A. Daniilidis, and A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM J. Optim., 17 (2007), pp. 1205–1223, https://doi.org/10.1137/050644641.
CB18.
L. Chizat and F. Bach, On the global convergence of gradient descent for over-parameterized models using optimal transport, in Neural Information Processing Systems, Vol. 31, Curran Associates, Red Hook, NY, 2018, pp. 3040–3050.
CB20.
L. Chizat and F. Bach, Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss, Proc. Mach. Learn. Res. (PMLR), 125 (2020), pp. 1305–1338.
CJR22.
P. Cheridito, A. Jentzen, and F. Rossmannek, Landscape analysis for shallow neural networks: Complete classification of critical points for affine target functions, J. Nonlinear Sci., 32 (2022), 64, https://doi.org/10.1007/s00332-022-09823-8.
CK23.
C. Christof and J. Kowalczyk, On the omnipresence of spurious local minima in certain neural network training problems, Constr. Approx., 60 (2024), pp. 197–224, https://doi.org/10.1007/s00365-023-09658-w.
Coo21.
Y. Cooper, Global minima of overparameterized neural networks, SIAM J. Math. Data Sci., 3 (2021), pp. 676–691, https://doi.org/10.1137/19M1308943.
DDKL20.
D. Davis, D. Drusvyatskiy, S. Kakade, and J. D. Lee, Stochastic subgradient method converges on tame functions, Found. Comput. Math., 20 (2020), pp. 119–154, https://doi.org/10.1007/s10208-018-09409-5.
DK21.
S. Dereich and S. Kassing, Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes, J. Mach. Learn., 3 (2024), pp. 245–281.
DK22a.
S. Dereich and S. Kassing, Cooling down stochastic differential equations: Almost sure convergence, Stochastic Process. Appl., 152 (2022), pp. 289–311, https://doi.org/10.1016/j.spa.2022.06.020.
DK22b.
S. Dereich and S. Kassing, On minimal representations of shallow ReLU networks, Neural Netw., 148 (2022), pp. 121–128, https://doi.org/10.1016/j.neunet.2022.01.006.
DK23.
S. Dereich and S. Kassing, Central limit theorems for stochastic gradient descent with averaging for stable manifolds, Electron. J. Probab., 28 (2023), pp. 1–48, https://doi.org/10.1214/23-EJP947.
DLL+19.
S. S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, Gradient descent finds global minima of deep neural networks, Proc. Mach. Learn. Res. (PMLR), 97 (2019), pp. 1675–1685.
DZPS19.
S. S. Du, X. Zhai, B. Poczos, and A. Singh, Gradient descent provably optimizes over-parameterized neural networks, in International Conference on Learning Representations (ICLR), 2019.
EJRW23.
S. Eberle, A. Jentzen, A. Riekert, and G. S. Weiss, Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation, Electron. Res. Arch., 31 (2023), pp. 2519–2554, https://doi.org/10.3934/era.2023128.
EMWW20.
W. E, C. Ma, L. Wu, and S. Wojtowytsch, Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t, CSIAM Trans. Appl. Math., 1 (2020), pp. 561–615, https://doi.org/10.4208/csiam-am.SO-2020-0002.
Fou22.
S. Foucart, Mathematical Pictures at a Data Science Exhibition, Cambridge University Press, Cambridge, 2022.
GJL22.
D. Gallon, A. Jentzen, and F. Lindner, Blow up Phenomena for Gradient Descent Optimization Methods in the Training of Artificial Neural Networks, preprint, arXiv:2211.15641, 2022.
GK23.
B. Gess and S. Kassing, Convergence Rates for Momentum Stochastic Gradient Descent with Noise of Machine Learning Type, preprint, arXiv:2302.03550, 2023.
GW22.
R. Gentile and G. Welper, Approximation Results for Gradient Descent Trained Shallow Neural Networks in 1d, preprint, arXiv:2209.08399, 2022.
IJR22.
S. Ibragimov, A. Jentzen, and A. Riekert, Convergence to Good Non-Optimal Critical Points in the Training of Neural Networks: Gradient Descent Optimization with One Random Initialization Overcomes All Bad Non-Global Local Minima with High Probability, preprint, arXiv:2212.13111, 2022.
JR22.
A. Jentzen and A. Riekert, On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks, J. Mach. Learn., 1 (2022), pp. 141–246, https://doi.org/10.4208/jml.220114a.
KKV03.
P. C. Kainen, V. Kurková, and A. Vogt, Best approximation by linear combinations of characteristic functions of half-spaces, J. Approx. Theory, 122 (2003), pp. 151–159, https://doi.org/10.1016/S0021-9045(03)00072-8.
Liu21.
B. Liu, Understanding the loss landscape of one-hidden-layer ReLU networks, Knowledge-Based Syst., 220 (2021), 106923, https://doi.org/10.1016/j.knosys.2021.106923.
LMQ22.
L.-H. Lim, M. Michałek, and Y. Qi, Best k-layer neural network approximations, Constr. Approx., 55 (2022), pp. 583–604, https://doi.org/10.1007/s00365-021-09545-2.
Łoj63.
S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, in Les Équations aux Dérivées Partielles, Colloq. Int. CNRS 117, Paris, 1963, pp. 87–89.
Łoj65.
S. Łojasiewicz, Ensembles Semi-Analytiques, Lecture Notes, IHES, Bures-sur-Yvette, 1965.
Łoj84.
S. Łojasiewicz, Sur les trajectoires du gradient d’une fonction analytique, in Seminari di Geometria, 1983, Università di Bologna, Bologna, 1984, pp. 115–117.
LRG24.
Q.-T. Le, E. Riccietti, and R. Gribonval, Does a sparse ReLU network training problem always admit an optimum?, in Neural Information Processing Systems, Vol. 36, Curran Associates, Red Hook, NY, 2024, 2848.
LXT+18.
H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, Visualizing the loss landscape of neural nets, in Neural Information Processing Systems, Vol. 31, Curran Associates, Red Hook, NY, 2018, pp. 6391–6401.
MHKC20.
P. Mertikopoulos, N. Hallak, A. Kavis, and V. Cevher, On the almost sure convergence of stochastic gradient descent in non-convex problems, in Neural Information Processing Systems, Vol. 33, Curran Associates, Red Hook, NY, 2020, 95.
PRV21.
P. Petersen, M. Raslan, and F. Voigtlaender, Topological properties of the set of functions generated by neural networks of fixed size, Found. Comput. Math., 21 (2021), pp. 375–444, https://doi.org/10.1007/s10208-020-09461-0.
Roc70.
R. T. Rockafellar, Convex Analysis, Vol. 28, Princeton University Press, Princeton, NJ, 1970.
SCP16.
G. Swirszcz, W. M. Czarnecki, and R. Pascanu, Local Minima in Training of Neural Networks, preprint, arXiv:1611.06310, 2016.
Sin70.
I. Singer, Best Approximation in Normed Linear Spaces by Elements of Linear Subspaces, Springer, Berlin, 1970.
SS18.
I. Safran and O. Shamir, Spurious local minima are common in two-layer ReLU neural networks, Proc. Mach. Learn. Res. (PMLR), 80 (2018), pp. 4433–4441.
Tad15.
V. B. Tadić, Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema, Stochastic Process. Appl., 125 (2015), pp. 1715–1755, https://doi.org/10.1016/j.spa.2014.11.001.
VBB19.
L. Venturi, A. S. Bandeira, and J. Bruna, Spurious valleys in one-hidden-layer neural network optimization landscapes, J. Mach. Learn. Res., 20 (2019), pp. 1–34.
Woj20.
S. Wojtowytsch, On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime, preprint, arXiv:2005.13530, 2020.
Woj23.
S. Wojtowytsch, Stochastic gradient descent with noise of machine learning type part I: Discrete time analysis, J. Nonlinear Sci., 33 (2023), 45, https://doi.org/10.1007/s00332-023-09903-3.

Information & Authors

Published In

SIAM Journal on Numerical Analysis
Pages: 2640 - 2666
ISSN (online): 1095-7170

History

Submitted: 28 February 2023
Accepted: 4 June 2024
Published online: 26 November 2024

Authors and Affiliations

Institute for Mathematical Stochastics, Faculty of Mathematics and Computer Science, University of Münster, Münster, Germany.

Arnulf Jentzen
School of Data Science and Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China; Applied Mathematics: Institute for Analysis and Numerics, Faculty of Mathematics and Computer Science, University of Münster, Münster, Germany.

Faculty of Mathematics, University of Bielefeld, Bielefeld, Germany.

Funding Information

Deutsche Forschungsgemeinschaft (DFG): EXC 2044-390685587, SFB 1283/2 2021-317210226
Funding: This work has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy EXC 2044-390685587, Mathematics Münster: Dynamics–Geometry–Structure. Moreover, this work has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under SFB 1283/2 2021-317210226. Furthermore, this work has been partially funded by the European Union (ERC, MONTECARLO, 101045811). The views and opinions expressed in this work are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council (ERC). Neither the European Union nor the granting authority can be held responsible for them.
