Optimal Approximation with Sparsely Connected Deep Neural Networks

We derive fundamental lower bounds on the connectivity and the memory requirements of deep neural networks guaranteeing uniform approximation rates for arbitrary function classes in $L^2(\mathbb{R}^d)$. In other words, we establish a connection between the complexity of a function class and the complexity of deep neural networks approximating functions from this class to within a prescribed accuracy. Additionally, we prove that our lower bounds are achievable for a broad family of function classes. Specifically, all function classes that are optimally approximated by a general class of representation systems---so-called \emph{affine systems}---can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets, ridgelets, curvelets, shearlets, $\alpha$-shearlets, and more generally $\alpha$-molecules. This result elucidates a remarkable universality property of neural networks and shows that they achieve the optimum approximation properties of all affine systems combined. As a specific example, we consider the class of $1/\alpha$-cartoon-like functions, which is approximated optimally by $\alpha$-shearlets. We also explain how our results can be extended to the case of functions on low-dimensional immersed manifolds. Finally, we present numerical experiments demonstrating that the standard stochastic gradient descent algorithm generates deep neural networks providing close-to-optimal approximation rates at minimal connectivity. Moreover, these results show that stochastic gradient descent actually learns approximations that are sparse in the representation systems optimally sparsifying the function class the network is trained on.


Introduction
Neural networks arose from the seminal work by McCulloch and Pitts [34] in 1943 which, inspired by the functionality of the human brain, introduced an algorithmic approach to learning with the aim of building a theory of artificial intelligence. Roughly speaking, a neural network consists of neurons arranged in layers and connected by weighted edges; in mathematical terms this boils down to a concatenation of (potentially learned) affine linear functions and relatively simple non-linearities.
Despite significant theoretical progress in the 1990s [28,7], the area has seen practical advances only during the past decade, triggered by drastic improvements in computing power and, in particular, the availability of vast amounts of training data. Specifically, deep neural networks, i.e., networks with large numbers of layers, are now state-of-the-art technology for a wide variety of real-world applications, such as image classification [30], speech recognition [27], or game intelligence [8], to name a few. For an in-depth overview, we refer to the survey paper by LeCun, Bengio, and Hinton [33] and the recent book [19].
A neural network effectively implements a non-linear mapping and can be used either to perform classification directly or to extract features that are then fed into a classifier, such as a support vector machine [44]. In the former case, the primary goal is to approximate an unknown classification function based on a given set of corresponding input-output value pairs. This is typically accomplished by learning the network's weights through, e.g., the standard gradient descent (via backpropagation) algorithm [41]. In a classification task with, say, two classes, the function to be learned would take only two values, whereas in the case of, e.g., the prediction of the temperature in a certain environment, it would be real-valued. It is therefore clear that characterizing the extent to which deep neural networks are capable of approximating general functions is a question of significant practical relevance.
Deep neural networks employed in practice often consist of hundreds of layers and may depend on billions of parameters; see, for example, the work [26] on image classification. Training and operation of networks of this scale entail formidable computational challenges which often still present a bottleneck. As a case in point, we mention speech recognition on a smartphone such as, e.g., Apple's SIRI system, which operates in the cloud. Android's speech recognition system, meanwhile, has been released in an offline version that is based on a neural network with sparse connectivity, meaning that the number of edges with non-zero weights is small.
The desire to reduce the complexity of network training and operation naturally leads to the question of function approximation through neural networks with sparse connectivity. In addition, the network's memory requirements in terms of the number of bits needed to store its topology and quantized weights are of concern in practice.
The purpose of this paper is to characterize the approximation-theoretic properties of deep neural networks under connectivity and memory constraints. Specifically, defining the complexity of a signal class C as the number of bits needed to describe any element in C to within a prescribed accuracy, we shall ask the following question: Given a signal class C, how does the complexity of a neural network that approximates every function in C to within a prescribed accuracy depend on the complexity of C?
Interpreting the network as an encoder in Donoho's min-max rate distortion theory [14], we establish fundamental lower bounds on connectivity and memory requirements for a network to guarantee uniform approximation rates for a given signal class C. Moreover, we demonstrate that these bounds are saturated by a broad family of signal classes, namely those that can be optimally approximated by a general class of representation systems, so-called affine systems. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets, ridgelets, curvelets, shearlets, α-shearlets, and, more generally, α-molecules. Our result therefore reveals an interesting universality property of neural networks; they achieve the optimum approximation properties of all affine systems combined. The technique we develop to derive this result is interesting in its own right, as it constitutes a more general framework for transferring results on function approximation through representation systems to results on approximation by neural networks.

Deep Neural Networks
While various network architectures exist in the literature, we focus on the following setup.
Definition 1.1. Let $L, d, N_1, \ldots, N_L \in \mathbb{N}$ and $\rho : \mathbb{R} \to \mathbb{R}$. A map $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$ of the form $\Phi(x) = W_L\,\rho(W_{L-1}\,\rho(\cdots \rho(W_1(x))))$, $x \in \mathbb{R}^d$, (1) is called a neural network. It is composed of affine linear maps $W_\ell : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $1 \leq \ell \leq L$ (with $N_0 := d$), and non-linear functions, often referred to as rectifiers, $\rho$ acting component-wise. Here, $d$ is the dimension of the input layer, $L$ denotes the number of layers, $N_1, \ldots, N_{L-1}$ stand for the dimensions of the $L-1$ hidden layers, and $N_L$ is the dimension of the output layer.
The term "network" arises from the interpretation of the mapping $\Phi$ as a weighted acyclic directed graph with nodes arranged in $L$ hierarchical layers and edges only between adjacent layers. In fact, the affine linear map $W_\ell$ is defined by a matrix $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and an affine part $b_\ell \in \mathbb{R}^{N_\ell}$ via $W_\ell(x) = A_\ell x + b_\ell$. Here, $(A_\ell)_{i,j}$ is the weight associated with the edge between the $j$-th node in the $(\ell-1)$-th layer and the $i$-th node in the $\ell$-th layer, while $(b_\ell)_i$ is the weight associated with the $i$-th neuron in the $\ell$-th layer. This assignment is depicted in Figure 1. We refer to the nodes of the graph as neurons and note that the total number of neurons is given by $N := d + \sum_{j=1}^{L} N_j$. The real numbers $(A_\ell)_{i,j}$ and $(b_\ell)_i$ are said to be the network's weights, and the total number of non-zero edge weights, denoted by $M$, is the network's connectivity. If $M$ is small relative to the number of connections possible (i.e., the number of edges in the graph that is fully connected between adjacent layers), we say that the network is sparsely connected.
Figure 1: Assignment of the weights $(A_\ell)_{i,j}$ and $(b_\ell)_i$ to the neurons and edges. The network has sparse connectivity, i.e., most of the weights $(A_\ell)_{i,j}$ equal zero.
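To make Definition 1.1 concrete, the following sketch (Python with NumPy; all function and variable names are ours, not the paper's) evaluates a network of the above form and counts its connectivity $M$ as the number of non-zero entries of the matrices $A_\ell$.

```python
import numpy as np

def relu(x):
    # an example rectifier rho, applied component-wise
    return np.maximum(x, 0.0)

def network(weights, biases, rho, x):
    """Evaluate Phi(x) = W_L rho(W_{L-1} ... rho(W_1(x))); the rectifier
    acts on the hidden layers only, not on the output of the last layer."""
    a = np.asarray(x, dtype=float)
    for l, (A, b) in enumerate(zip(weights, biases), start=1):
        a = A @ a + b              # affine map W_l(x) = A_l x + b_l
        if l < len(weights):
            a = rho(a)
    return a

def connectivity(weights):
    # M = total number of non-zero edge weights
    return sum(int(np.count_nonzero(A)) for A in weights)
```

For instance, a two-layer network with $A_1 = \begin{pmatrix}1&0\\0&-1\end{pmatrix}$ and $A_2 = (1\ 1)$ has connectivity $M = 4$.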
Throughout the paper, we consider the case $\Phi : \mathbb{R}^d \to \mathbb{R}$, i.e., $N_L = 1$, which includes situations such as the classification problem described above. Nonetheless, the results of Sections 2 and 3 as well as Theorem 4.12 can readily be generalized to $N_L > 1$.
We denote the class of networks $\Phi : \mathbb{R}^d \to \mathbb{R}$ with no more than $L$ layers, no more than $M$ non-zero edge weights, and rectifier $\rho$ by $\mathcal{NN}_{L,M,d,\rho}$. Moreover, we let $\mathcal{NN}_{\infty,M,d,\rho} := \bigcup_{L \in \mathbb{N}} \mathcal{NN}_{L,M,d,\rho}$ and $\mathcal{NN}_{\infty,\infty,d,\rho} := \bigcup_{M \in \mathbb{N}} \mathcal{NN}_{\infty,M,d,\rho}$. Now, given a function $f : \mathbb{R}^d \to \mathbb{R}$, we can ask how well a neural network $\Phi \in \mathcal{NN}_{L,M,d,\rho}$ can approximate $f$. Clearly, this depends on the algorithm chosen to learn the network's weights. But one can also take the following vantage point: The best possible approximation of $f$ by elements of $\mathcal{NN}_{L,M,d,\rho}$ provides a fundamental lower bound on the approximation error, independently of the learning algorithm.
We shall be interested in the dependence of this lower bound on the connectivity $M$ and on the number of bits available to encode the network topology and the quantized weights. Clearly, smaller $M$ entails lower computational complexity in terms of evaluating (1), and a smaller number of bits translates to smaller memory requirements.

Quantifying Approximation Quality
We next briefly review a widely used approach for characterizing the approximation quality of functions under restrictions on the approximant; in our case, the approximant is a neural network with the restrictions imposed by its connectivity and memory requirements.
Fix $\Omega \subset \mathbb{R}^d$. A common approach to function approximation is to consider a class of functions $\mathcal{C} \subseteq L^2(\Omega)$, termed signal class, and a corresponding complete system $\mathcal{D} := (\varphi_i)_{i \in I} \subseteq L^2(\Omega)$, termed representation system, or dictionary, with the restriction on the approximant imposed by a limit on the number of elements in $\mathcal{D}$ allowed to participate in the approximation. One then studies the error of best $M$-term approximation of $f \in \mathcal{C}$. Specifically, given a signal class $\mathcal{C}$ and a representation system $\mathcal{D}$, we define, for $f \in \mathcal{C}$ and $M \in \mathbb{N}$, $\Gamma_M^{\mathcal{D}}(f) := \inf_{I_M \subseteq I,\, \#I_M = M,\, (c_i)_{i \in I_M}} \big\| f - \sum_{i \in I_M} c_i \varphi_i \big\|_{L^2(\Omega)}$. We call $\Gamma_M^{\mathcal{D}}(f)$ the best $M$-term approximation error of $f$ with respect to $\mathcal{D}$. Every $f_M = \sum_{i \in I_M} c_i \varphi_i$ attaining the infimum is referred to as a best $M$-term approximation of $f$ in the system $\mathcal{D}$. The supremum of all $\gamma > 0$ such that there exists $C > 0$ with $\sup_{f \in \mathcal{C}} \Gamma_M^{\mathcal{D}}(f) \leq C\, M^{-\gamma}$ for all $M \in \mathbb{N}$ determines the optimal $M$-term approximation rate of $\mathcal{C}$ in the system $\mathcal{D}$, and will henceforth be referred to as $\gamma^*(\mathcal{C}, \mathcal{D})$.
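For an orthonormal system, the best $M$-term approximation is obtained by retaining the $M$ largest coefficients in modulus; the following sketch (our own illustration, relying on Parseval's identity) computes $\Gamma_M^{\mathcal{D}}(f)$ from the coefficient sequence and estimates the decay exponent $\gamma$ empirically.

```python
import numpy as np

def best_m_term_error(coeffs, M):
    """Best M-term approximation error in an orthonormal system: keep the M
    largest-magnitude coefficients; by Parseval, the L2 error is the l2 norm
    of the discarded coefficients."""
    c = np.sort(np.abs(np.asarray(coeffs, dtype=float)))[::-1]
    return float(np.sqrt(np.sum(c[M:] ** 2)))

def empirical_rate(coeffs, M1, M2):
    """Estimate the exponent gamma from the errors at two values of M."""
    e1 = best_m_term_error(coeffs, M1)
    e2 = best_m_term_error(coeffs, M2)
    return float(np.log(e1 / e2) / np.log(M2 / M1))
```

For example, coefficients decaying like $c_i = i^{-1}$ yield an $M$-term error of order $M^{-1/2}$, i.e., $\gamma = 1/2$.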
For α-shearlet dictionaries $\mathcal{D}$, optimal sparse approximation properties have been completely characterized for the signal class $\mathcal{C}$ of so-called β-cartoon-like functions $\mathcal{E}^{\beta}([0,1]^2)$; these are piecewise $C^{\beta}(\mathbb{R}^2)$ functions on the unit square with a $C^{\beta}$ discontinuity curve and $\beta \in [1,2]$. Under weak technical conditions on the α-shearlet system, it was shown in [23,39] that α-shearlets with $\alpha = 1/\beta$ achieve the optimal $M$-term approximation rate for this class.

Approximation by Deep Neural Networks
We now substitute the concept of $M$-term approximation with representation systems by approximation through neural networks with $M$ edges. In other words, sparsity in terms of the number of participating elements of a representation system is replaced by sparsity in terms of connectivity.
More formally, we consider the following setup.
For $f \in \mathcal{C}$ and $M \in \mathbb{N}$, we define $\Gamma_M^{\mathcal{NN}}(f) := \inf_{\Phi \in \mathcal{NN}_{\infty,M,d,\rho}} \|f - \Phi\|_{L^2(\Omega)}$ (3) and call $\Gamma_M^{\mathcal{NN}}(f)$ the best $M$-edge approximation error of $f$. The supremal $\gamma > 0$ such that there exists a $C > 0$ with $\sup_{f \in \mathcal{C}} \Gamma_M^{\mathcal{NN}}(f) \leq C\, M^{-\gamma}$ for all $M \in \mathbb{N}$ determines the optimal $M$-edge approximation rate of $\mathcal{C}$ achievable by neural networks in $\mathcal{NN}_{\infty,\infty,d,\rho}$, and will henceforth be referred to as $\gamma^*_{\mathcal{NN}}(\mathcal{C})$.
We emphasize that the infimum in (3) is taken over all networks with no more than $M$ edges of non-zero weight. In particular, this means that the optimum is taken over all possible edge positions and associated edge weights. Knowledge of the optimal $M$-edge approximation rate hence provides a bound on the approximation rate of a sparsely connected deep neural network. This bound is fundamental, as it must be met by all learning algorithms. While we do not evaluate specific learning algorithms, our framework provides a means for assessing the quality of a given learning algorithm in the sense of measuring how close the rate induced by the algorithm comes to the optimal $M$-edge approximation rate.

Previous Work
The best-known results on approximation by neural networks are the universal approximation theorems by Hornik [28] and Cybenko [7], stating that every measurable function $f$ can be approximated arbitrarily well by a single-hidden-layer ($L = 2$ in our terminology) neural network. The literature on approximation-theoretic properties of shallow networks continuing this work is abundant. Without any claim of comprehensiveness, we mention work on approximation error bounds for functions with bounded first moments [1], bounds in terms of the number of nodes and in terms of the size of the training set [2], the non-existence of localized approximations [5], a fundamental lower bound on approximation rates [12,3], lower bounds on the network's sparsity in terms of the number of neurons [42], and the approximation of specific function classes [36,35].
Approximation-theoretic results for networks with multiple hidden layers were obtained in [29] for general functions, in [18] for special function classes such as continuous functions, and for functions together with their derivatives in [38].
In [5] it was shown that, although localized approximation is impossible for shallow networks, it can be accomplished using deeper networks. We also highlight two very recent papers which investigate the benefit, from an approximation-theoretic perspective, of multiple hidden layers. Specifically, in [16] it was shown that there exists a function which, although expressible through a small 3-layer network, can only be accurately represented through a very large 2-layer network; here, size is measured in terms of the number of nodes. In the setting of deep convolutional neural networks, first results of a nature similar to those in [16] were derived in [37]. For survey articles on approximation-theoretic aspects of neural networks, we refer the interested reader to [17,40].
Relative to the above-mentioned contributions, the aim of this paper is to derive a universal optimality result concerning approximation properties of deep neural networks in terms of connectivity and memory requirements.
Most closely related to our work is that by Shaham, Cloninger, and Coifman [43], which shows that for functions that are sparse in wavelet frames the best M -edge approximation rate of a 3-layer neural network is at least as good as the best M -term approximation rate by piecewise linear wavelet frames.

Contributions
Our contributions can be grouped into four threads.
• Fundamental lower bound on connectivity. Let $d \in \mathbb{N}$ be the dimension of the input layer, $\Omega \subset \mathbb{R}^d$, $\rho : \mathbb{R} \to \mathbb{R}$ an arbitrary rectifier, and $\mathcal{C}$ an arbitrary subset of $L^2(\Omega)$. Consider a learning algorithm Learn which maps each accuracy $\epsilon \in (0, 1/2)$ and each function $f \in \mathcal{C}$ to a neural network $\mathrm{Learn}(\epsilon, f)$ satisfying $\|f - \mathrm{Learn}(\epsilon, f)\|_{L^2(\Omega)} \leq \epsilon$, (4) where the weights of $\mathrm{Learn}(\epsilon, f)$ can each be encoded (i.e., represented) with $O(\log_2(1/\epsilon))$ bits. We demonstrate in Theorem 2.8 that for all $\gamma < \gamma^*(\mathcal{C})$, where $\gamma^*(\mathcal{C})$ is the optimal exponent with respect to the minimax code length of $\mathcal{C}$ as defined in [14,21] (see Definition 2.1 below), all $C > 0$, and every sequence $(\epsilon_n)_{n \in \mathbb{N}}$ converging to 0 such that $\epsilon_n/\epsilon_{n+1}$ is bounded, there exists a subsequence $(\epsilon_{n_k})_{k \in \mathbb{N}}$ with $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon_{n_k}, f)) \geq C\, \epsilon_{n_k}^{-\gamma}$, (5) where $\mathcal{M}(\mathrm{Learn}(\epsilon, f))$ denotes the number of edges of the network $\mathrm{Learn}(\epsilon, f)$. This result quantifies the minimum network connectivity needed to allow approximation of all elements in $\mathcal{C}$ to within a prescribed error.
In particular, it says that the minimum (complexity) exponent at which the number of edges in the network has to scale is given by $\gamma^*(\mathcal{C})$. On a conceptual level, this statement establishes a universal link between the connectivity of the approximating network and the complexity of the function class that it is to approximate.
• Optimal M-edge approximation based on representation systems. We develop an abstract framework for transferring optimal $M$-term approximation results with representation systems to optimal $M$-edge approximation results for neural networks. These transfer results hold for dictionaries that are representable by neural networks in the following sense: There exists $R \in \mathbb{N}$ such that for every dictionary element and every $\epsilon > 0$, there is a neural network with no more than $R$ edges approximating this element with error at most $\epsilon$. Additionally, we show that a wide class of representation systems, coined affine systems, and including as special cases wavelets, ridgelets, curvelets, shearlets, α-shearlets, and, more generally, α-molecules, as well as tensor products thereof, are representable by neural networks. This result suggests an explanation for the "unreasonable effectiveness" of neural networks: They effectively combine the optimum approximation properties of all affine systems taken together.
For concreteness and to illustrate our general results, we consider the class $\mathcal{C}$ of 1/α-cartoon-like functions $\mathcal{E}^{1/\alpha}([0,1]^2)$ with $\alpha \in [1/2, 1]$, for which $\gamma^*(\mathcal{C}) = 2\alpha$, and, for each $\epsilon > 0$ and $f \in \mathcal{C}$, we construct a neural network $\mathrm{Learn}(\epsilon, f)$ satisfying (4). This construction is based on α-shearlets, which are known to provide the optimal $M$-term approximation rate for $\mathcal{C}$. For two classes of rectifiers, namely sigmoidal functions of arbitrary order (Definition 4.1) and smooth approximations of rectified linear units (Definition 4.2), we show that for all $\gamma > \gamma^*(\mathcal{C})$ there exists $C > 0$ such that $\mathcal{M}(\mathrm{Learn}(\epsilon, f)) \leq C\, \epsilon^{-\gamma}$, i.e., the number of edges in the constructed approximating neural network grows at most at rate $O(\epsilon^{-\gamma})$ as $\epsilon \to 0$.
Finally, in Section 6 we show that our techniques and results can be extended to the approximation of functions on immersed submanifolds of R d .
• Quantization and memory requirements. The lower bound (5) holds under the assumption that the weights in the network can each be encoded with $O(\log_2(1/\epsilon))$ bits. That this is, indeed, possible is certainly not clear a priori.
To clarify this matter we proceed as follows. We first demonstrate that, if a signal class C is effectively representable (Definition 2.2) with respect to a dictionary D and the dictionary is effectively representable by neural networks (Definition 3.1), a certain best M-term approximation rate in D implies the same M-edge approximation rate by a neural network with the weights of the network bounded by p(M), where p is a polynomial. Then, we demonstrate that the so-bounded weights can be quantized to $O(\log_2(1/\epsilon))$ bits without incurring a considerable overall approximation error. This implies achievability of the bound (5) and demonstrates that the network at the same time has minimal memory requirements.
• Realizability of optimal approximation rates. An important practical question is how networks trained by gradient descent (via backpropagation) [41] perform relative to the fundamental bounds established in the paper. Surprisingly, our numerical experiments indicate that, given a fixed network topology with sparse connectivity, the stochastic gradient descent algorithm yields close-to-optimal approximation rates. Moreover, for the approximation of cartoon-like functions f we observe that stochastic gradient descent generates neural networks which mimic the classical M-term approximation of f in a dictionary of α-molecules.
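As a toy version of such an experiment (our own minimal setup in Python/NumPy, not the configuration of Section 7), one can train a one-hidden-layer ReLU network by mini-batch stochastic gradient descent to approximate the piecewise-smooth target $f(x) = |x - 1/2|$ in the empirical $L^2$ sense:

```python
import numpy as np

rng = np.random.default_rng(0)

# samples of a simple piecewise-smooth target function
x = rng.uniform(0.0, 1.0, size=(256, 1))
y = np.abs(x - 0.5)

# one hidden layer of width 16; weights initialized at random
W1 = rng.normal(size=(16, 1)); b1 = np.zeros((16, 1))
W2 = rng.normal(scale=0.1, size=(1, 16)); b2 = np.zeros((1, 1))

def forward(X):
    Z = W1 @ X.T + b1                # pre-activations, shape (16, n)
    A = np.maximum(Z, 0.0)           # ReLU rectifier
    return Z, A, (W2 @ A + b2).T     # network output, shape (n, 1)

def mse(X, Y):
    return float(np.mean((forward(X)[2] - Y) ** 2))

loss_before = mse(x, y)
lr = 0.1
for step in range(2000):
    idx = rng.integers(0, 256, size=32)        # mini-batch
    X, Y = x[idx], y[idx]
    Z, A, out = forward(X)
    g = 2.0 * (out - Y).T / len(idx)           # dLoss/d(output), shape (1, n)
    gW2 = g @ A.T; gb2 = g.sum(axis=1, keepdims=True)
    dZ = (W2.T @ g) * (Z > 0)                  # backprop through the ReLU
    gW1 = dZ @ X; gb1 = dZ.sum(axis=1, keepdims=True)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
loss_after = mse(x, y)
```

The training loss decreases markedly; varying the hidden width (and hence the connectivity) then traces out an empirical error-versus-connectivity curve of the kind examined in Section 7.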

Outline of the Paper
The fundamental lower bound on connectivity will be developed in Section 2. Section 3 introduces a general framework for transferring sparse approximation results in representation systems to approximation results by neural networks. In Section 4, we apply these techniques first to the broad class of affine systems and then, in Section 5, to the class of 1/α-cartoon-like functions. Section 6 describes the extension of our main findings to the approximation of functions defined on manifolds. Numerical results assessing the performance of stochastic gradient descent (via backpropagation) relative to our lower bound on connectivity are reported in Section 7.

A Fundamental Lower Bound on Connectivity
We start by establishing a lower bound on the worst-case connectivity of neural networks approximating elements from a given signal class $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ to within a prescribed accuracy $\epsilon > 0$.
Let us first recall concepts and definitions from min-max rate distortion theory as introduced by Donoho in [14]. Specifically, we review fundamental bounds on the code length of lossy compression by encoder-decoder pairs. We then recall in Subsection 2.2 how dictionaries can be interpreted as encoders, and in Subsection 2.3 we analyze neural networks from an encoder point of view.

Min-Max Rate Distortion Theory
Min-max rate distortion theory provides a theoretical foundation for deterministic lossy data compression.We recall the following notions and concepts from [14,21].
Let $d \in \mathbb{N}$, $\Omega \subseteq \mathbb{R}^d$, and consider the function class $\mathcal{C} \subset L^2(\Omega)$. Then, for each $\ell \in \mathbb{N}$, we denote by $\mathfrak{E}^\ell := \{E : \mathcal{C} \to \{0,1\}^\ell\}$ the set of binary encoders mapping elements of $\mathcal{C}$ to bit strings of length $\ell$, and we let $\mathfrak{D}^\ell := \{D : \{0,1\}^\ell \to L^2(\Omega)\}$ be the set of binary decoders mapping bit strings of length $\ell$ to elements of $L^2(\Omega)$. An encoder-decoder pair $(E, D) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell$ is said to achieve distortion $\epsilon > 0$ over $\mathcal{C}$ if $\sup_{f \in \mathcal{C}} \|D(E(f)) - f\|_{L^2(\Omega)} \leq \epsilon$. This means that the worst-case error incurred by applying the encoder-decoder pair $(E, D) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell$ to an element of $\mathcal{C}$ is upper-bounded by $\epsilon$, often also expressed as the uniform error over $\mathcal{C}$ being bounded by $\epsilon$. A quantity of central interest is the minimal length $\ell \in \mathbb{N}$, denoted $L(\epsilon, \mathcal{C})$, for which there exists an encoder-decoder pair $(E, D) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell$ that achieves distortion $\epsilon > 0$ over the function class $\mathcal{C}$, and its asymptotic behavior as $\epsilon \to 0$, made precise in the following definition.
Definition 2.1. The optimal exponent $\gamma^*(\mathcal{C})$ is defined by $\gamma^*(\mathcal{C}) := \inf\{\gamma \in \mathbb{R} : L(\epsilon, \mathcal{C}) \in O(\epsilon^{-\gamma}),\ \epsilon \to 0\}$. The optimal exponent $\gamma^*(\mathcal{C})$ describes how fast $L(\epsilon, \mathcal{C})$ tends to infinity as $\epsilon$ decreases. For function classes $\mathcal{C}_1$ and $\mathcal{C}_2$, $\gamma^*(\mathcal{C}_1) < \gamma^*(\mathcal{C}_2)$ says that, asymptotically, i.e., for $\epsilon \to 0$, the length of the encoding bit string for $\mathcal{C}_2$ is larger than that for $\mathcal{C}_1$. In other words, a smaller exponent indicates smaller description complexity. The optimal exponent $\gamma^*(\mathcal{C})$ therefore determines the minimal memory requirements for storing signals $f \in \mathcal{C}$ such that reconstruction with a uniformly bounded error is possible.
We mention that sometimes in the literature the reciprocal of $\gamma^*(\mathcal{C})$ is termed the optimal exponent. The optimal exponent is known for several function classes, such as subsets of Besov spaces $B^s_{p,q}(\mathbb{R}^d)$ with $1 \leq p, q < \infty$, $s > 0$, and $q > (s + 1/2)^{-1}$, namely all functions in $B^s_{p,q}(\mathbb{R}^d)$ whose norm is bounded by a constant $C > 0$ [6]. For this class we have $\gamma^*(\mathcal{C}) = \frac{d}{s}$. We will be particularly interested in so-called β-cartoon-like functions, for which the optimal exponent is given by $\frac{2}{\beta}$, see Theorem 5.3.

Representation Systems as Encoders
For later use we recall well-known relations between the optimal $M$-term approximation rate $\gamma^*(\mathcal{C}, \mathcal{D})$ for a signal class $\mathcal{C}$ with respect to a dictionary $\mathcal{D} = (\varphi_i)_{i \in I}$ and the optimal exponent $\gamma^*(\mathcal{C})$ per Definition 2.1. In particular, we treat the question of how to build efficient encoders based on best $M$-term approximations in a given dictionary. This is formalized by the following definition, which is a variation of the concept of "best $M$-term approximation subject to polynomial-depth search" introduced in [14].
Definition 2.2. The signal class $\mathcal{C}$ is said to be effectively representable in $\mathcal{D}$ if there exist a univariate polynomial $\pi$ and a constant $D > 0$ such that for all $\gamma < \gamma^*(\mathcal{C}, \mathcal{D})$ there is a constant $C > 0$ so that for all $M \in \mathbb{N}$ and $f \in \mathcal{C}$ there exist an index set $I_M \subseteq \{1, \ldots, \pi(M)\}$ with $\#I_M = M$ and coefficients $(c_i)_{i \in I_M}$ with $\max_{i \in I_M} |c_i| \leq D$ satisfying $\|f - \sum_{i \in I_M} c_i \varphi_i\|_{L^2(\Omega)} \leq C\, M^{-\gamma}$.
Remark 2.3. We note that the notion of effective representability depends on the ordering of the representation system $\mathcal{D}$.
The condition that the index set $I_M$ of a best $M$-term approximation be contained in the set of the first $\pi(M)$ indices implies the following fundamental relationship between the best $M$-term approximation rate in representation systems and the optimal exponent $\gamma^*(\mathcal{C})$.
In light of this result, the following definition is natural (see also [21]).
Definition 2.5. Let $\Omega \subset \mathbb{R}^d$ and assume that the signal class $\mathcal{C} \subset L^2(\Omega)$ is effectively representable in the system $\mathcal{D} = (\varphi_i)_{i \in \mathbb{N}} \subset L^2(\Omega)$ with $\gamma^*(\mathcal{C}, \mathcal{D}) = 1/\gamma^*(\mathcal{C})$.
Then, the representation system $\mathcal{D}$ is said to be optimal for the signal class $\mathcal{C}$.

Neural Networks as Encoders
The present section examines to which extent efficient encoders can be constructed from neural networks with limited connectivity. Interpreting neural networks as encoders in the sense of Donoho's min-max rate distortion theory leads us to the first main result of this paper, namely a fundamental lower bound on the connectivity of a neural network approximating the elements of a given function class $\mathcal{C}$ to within accuracy $\epsilon > 0$. We say that this lower bound is fundamental as it has to be met by all learning algorithms.
Remark 2.6. Throughout the paper, we are concerned with network connectivity in terms of the number of edges of non-zero weight. Looking at the number of nodes instead would not be conceptually different, as nodes that are not part of any edge do not influence the underlying mapping $\Phi$, and we can bound the number of nodes by twice the number of edges.
The main result of this section is as follows.
Theorem 2.7. Let $d \in \mathbb{N}$, $\Omega \subset \mathbb{R}^d$, $\rho : \mathbb{R} \to \mathbb{R}$, $c > 0$, and let $\mathcal{C} \subset L^2(\Omega)$. Let $\mathrm{Learn} : (0, \tfrac{1}{2}) \times \mathcal{C} \to \mathcal{NN}_{\infty,\infty,d,\rho}$ be a map such that, for each pair $(\epsilon, f) \in (0, \tfrac{1}{2}) \times \mathcal{C}$, every weight of the neural network $\mathrm{Learn}(\epsilon, f)$ can be encoded with no more than $\lceil -c \log_2(\epsilon) \rceil$ bits while guaranteeing that $\sup_{f \in \mathcal{C}} \|f - \mathrm{Learn}(\epsilon, f)\|_{L^2(\Omega)} \leq \epsilon$. (7) Then, for all $\gamma < \gamma^*(\mathcal{C})$, $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon, f)) \notin O(\epsilon^{-\gamma})$, $\epsilon \to 0$. (8)
Proof. Let $\gamma < \gamma^*(\mathcal{C})$ and assume to the contrary that $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon, f)) \in O(\epsilon^{-\gamma})$, $\epsilon \to 0$. (9) In the course of the proof we show that the networks $\mathrm{Learn}(\epsilon, f)$ lead to encoder-decoder pairs $(E, D) \in \mathfrak{E}^{\ell(\epsilon)} \times \mathfrak{D}^{\ell(\epsilon)}$ achieving distortion $\epsilon$ over $\mathcal{C}$ with $\ell(\epsilon) = C_0 \sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon, f)) \log_2(1/\epsilon)$, where $C_0$ is a constant. A contradiction to (9) is then produced through the following argument: Take $\delta > 0$ with $\gamma < \gamma + \delta < \gamma^*(\mathcal{C})$. By assumption there exists $C > 0$ such that $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon, f)) \leq C\, \epsilon^{-\gamma}$ for all $\epsilon \in (0, \tfrac{1}{2})$. This would, however, imply that $L(\epsilon, \mathcal{C}) \leq \ell(\epsilon) = O(\epsilon^{-(\gamma+\delta)})$, which constitutes a contradiction to the optimality of $\gamma^*(\mathcal{C})$ and consequently yields (8).
The remainder of the proof is now devoted to establishing that the networks $\mathrm{Learn}(\epsilon, f)$ lead to encoder-decoder pairs $(E, D) \in \mathfrak{E}^{\ell(\epsilon)} \times \mathfrak{D}^{\ell(\epsilon)}$ with the aforementioned properties. We start by analyzing the code length of a neural-network-based encoder operating on $\mathcal{C}$. More precisely, for given $f \in \mathcal{C}$ and $\epsilon > 0$, we aim at encoding the neural network $\mathrm{Learn}(\epsilon, f)$, more specifically its topology and weights, by a bit string of length $O(\mathcal{M}(\mathrm{Learn}(\epsilon, f)) \log_2(1/\epsilon))$. To this end, we will frequently invoke the observation that the number of nodes of $\mathrm{Learn}(\epsilon, f)$ is upper-bounded by $2\mathcal{M}(\mathrm{Learn}(\epsilon, f))$, see Remark 2.6. Also, trivially, the number of layers is upper-bounded by $\mathcal{M}(\mathrm{Learn}(\epsilon, f))$.
We fix $f \in \mathcal{C}$ and enumerate the neurons and edges of $\mathrm{Learn}(\epsilon, f)$ by assigning unique numbers, henceforth called indices of the nodes and edges. This numbering can, for instance, simply assign increasing indices from left to right in every layer, as indicated in Figure 2. For the sake of notational simplicity, we also set $\Phi := \mathrm{Learn}(\epsilon, f)$ and $M := \mathcal{M}(\Phi)$.
Without loss of generality we assume that $M$ is a power of 2. We recall that the number of layers of $\Phi$ is denoted by $L$, the numbers of neurons in these layers are $N_1, \ldots, N_L$ (see Definition 1.1), and $d$ is the dimension of the input layer. By Remark 2.6 we have $N := d + \sum_{\ell=1}^{L} N_\ell \leq 2M$, (10) which bounds the total number of neurons, and $L \leq M$, (11) which bounds the total number of layers.
We then construct a bit string representing Φ according to the following steps.
Step 1: We encode the number of edges, $M$, by starting the overall bit string with $M$ 1's followed by a single 0.
Step 2: We continue by encoding the number of layers in the network. Thanks to (11), this requires no more than $\log_2(M)$ bits. We thus reserve the next $\log_2(M)$ bits for the binary representation of $L$.
Step 3: For each layer $\ell$, we store the number $N_\ell$ of neurons. Moreover, we store the dimension $d$ of the input layer. As, by (10), $N_\ell \leq 2M$ for all $\ell$, we can (generously) encode each $N_\ell$ as well as $d$ using $\log_2(M) + 1$ bits.
In total, Step 3 requires a bit string of length at most $(L + 1)(\log_2(M) + 1)$. In combination with Steps 1 and 2 this yields an overall bit string of length at most $(M + 1) + \log_2(M) + (L + 1)(\log_2(M) + 1)$. (12)
Step 4: We encode the topology of the graph associated with $\Phi$. Recall that we assigned a unique index $i$ to each neuron. These indices range from 1 to $N := d + \sum_{\ell=1}^{L} N_\ell$ and, by (10), can each be encoded by a bit string of length $\log_2(M) + 1$. We denote the bit string corresponding to index $i$ by $b(i) \in \{0,1\}^{\log_2(M)+1}$ and let $n(i)$ be the number of children of neuron $i$. For each neuron $i = 1, \ldots, N$, we form a bit string of length $(n(i) + 2)(\log_2(M) + 1)$ by concatenating the representations $b(j)$ of the indices of all its children, followed by an all-zeros bit string of length $2(\log_2(M) + 1)$ to signal the transition to the neuron with index $i + 1$. Overall, this yields a bit string of length $\sum_{i=1}^{N} (n(i) + 2)(\log_2(M) + 1) = (M + 2N)(\log_2(M) + 1) \leq 5M(\log_2(M) + 1)$, (13) where again we used (10).
Combining (12) and (13), it follows that we have encoded the full topology of the neural network $\Phi$ using at most $(M + 1) + \log_2(M) + (L + 1)(\log_2(M) + 1) + 5M(\log_2(M) + 1)$ (14) bits.
Step 5: We encode the weights of $\Phi$. By assumption, each weight can be encoded by a bit string of length $\lceil -c \log_2(\epsilon) \rceil$. For each node $i = 1, \ldots, N$, we reserve exactly $\lceil -c \log_2(\epsilon) \rceil$ bits to encode its weight and, for each of its children, a bit string of length $\lceil -c \log_2(\epsilon) \rceil$ to encode the weight corresponding to the edge between that child and its parent node. Concatenating these into a bit string of length $(n(i) + 1)\lceil -c \log_2(\epsilon) \rceil$ for each node $i$, we get in total a bit string of length at most $\sum_{i=1}^{N} (n(i) + 1)\lceil -c \log_2(\epsilon) \rceil = (M + N)\lceil -c \log_2(\epsilon) \rceil \leq 3M\lceil -c \log_2(\epsilon) \rceil$ (15) representing the weights of the graph associated with the network $\Phi$.
Summing up (14), for the number of bits needed to encode the topology of $\Phi$, and (15), for the number of bits needed to encode the corresponding weights, we conclude that a bit string of length at most $(M + 1) + \log_2(M) + (L + 1)(\log_2(M) + 1) + 5M(\log_2(M) + 1) + 3M\lceil -c \log_2(\epsilon) \rceil$ (16) is needed to encode $\Phi$. The network can be recovered uniquely from this bit string by simply reversing the steps above. As (16) can be upper-bounded by $\ell(\epsilon) := C_0\, \mathcal{M}(\mathrm{Learn}(\epsilon, f)) \log_2(1/\epsilon)$ (17) for a constant $C_0 > 0$ depending on $c$ only, we have constructed an encoder-decoder pair $(E, D) \in \mathfrak{E}^{\ell(\epsilon)} \times \mathfrak{D}^{\ell(\epsilon)}$ with distortion uniformly bounded (for the class $\mathcal{C}$) by $\epsilon$. This concludes the proof.
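The five encoding steps can be sketched as follows (Python; the data layout, the fixed-width index fields standing in for the strings $b(i)$, and all names are our own illustration, not the paper's exact scheme):

```python
from math import ceil, log2

def encode_network(d, layer_sizes, children, weight_bits):
    """Produce the bit string of Steps 1-5 for a toy network.

    children[i]    -- list of child indices of neuron i (the topology)
    weight_bits[i] -- pre-quantized bit string for the weights attached to i
    """
    M = sum(len(c) for c in children.values())     # number of edges
    w = ceil(log2(max(M, 2))) + 1                  # index width: N <= 2M neurons
    bits = "1" * M + "0"                           # Step 1: M in unary
    bits += format(len(layer_sizes), f"0{w}b")     # Step 2: number of layers L
    bits += format(d, f"0{w}b")                    # Step 3: input dimension d ...
    for n in layer_sizes:                          # ... and the widths N_l
        bits += format(n, f"0{w}b")
    for i in sorted(children):                     # Step 4: topology
        for j in children[i]:
            bits += format(j, f"0{w}b")            # indices b(j) of the children
        bits += "0" * (2 * w)                      # all-zeros separator
    for i in sorted(weight_bits):                  # Step 5: quantized weights
        bits += weight_bits[i]
    return bits
```

Decoding reverses the steps: read 1's until the first 0 to recover $M$, then parse the fixed-width fields.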
We also offer a quantitative version of the above theorem which yields an explicit lower bound on the connectivity induced by every learning procedure exhibiting a given uniform approximation rate for a given signal class.
Theorem 2.8. Let $d \in \mathbb{N}$, $\Omega \subset \mathbb{R}^d$, $\rho : \mathbb{R} \to \mathbb{R}$, $c > 0$, and let $\mathcal{C} \subset L^2(\Omega)$. Let $\mathrm{Learn} : (0, \tfrac{1}{2}) \times \mathcal{C} \to \mathcal{NN}_{\infty,\infty,d,\rho}$ be a map such that, for each pair $(\epsilon, f) \in (0, \tfrac{1}{2}) \times \mathcal{C}$, every weight of the neural network $\mathrm{Learn}(\epsilon, f)$ can be represented with no more than $\lceil -c \log_2(\epsilon) \rceil$ bits while guaranteeing (7).
Let $\gamma < \gamma^*(\mathcal{C})$ and $C > 0$, and consider a zero-sequence $(\epsilon_n)_{n \in \mathbb{N}}$ with $(\epsilon_n/\epsilon_{n+1})_{n \in \mathbb{N}} \in \ell^\infty$. Then there exists a subsequence $(\epsilon_{n_k})_{k \in \mathbb{N}}$ such that $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon_{n_k}, f)) \geq C\, \epsilon_{n_k}^{-\gamma}$ for all $k \in \mathbb{N}$. (18)
Proof. Let $\gamma < \gamma^*(\mathcal{C})$ and take $(\epsilon_n)_{n \in \mathbb{N}}$ to be a zero-sequence such that $(\epsilon_n/\epsilon_{n+1})_{n \in \mathbb{N}} \in \ell^\infty$. We will prove that there does not exist $C > 0$ so that $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon_n, f)) \leq C\, \epsilon_n^{-\gamma}$ for all $n \in \mathbb{N}$. (19) It is not hard to see that the non-existence of such a $C$ yields (18). Assume, towards a contradiction, that there exists $C > 0$ such that (19) holds. We can now construct a second learning algorithm $\mathrm{Learn}'(\epsilon, f) := \mathrm{Learn}(\epsilon_{n(\epsilon)}, f)$, where $n(\epsilon)$ is the largest $n$ with $\epsilon_n \leq \epsilon < \epsilon_{n-1}$. We get $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}'(\epsilon, f)) \leq C'\, \epsilon^{-\gamma}$ for a suitable constant $C' > 0$ and all $\epsilon \in (0, \tfrac{1}{2})$. Additionally, we have $\|f - \mathrm{Learn}'(\epsilon, f)\|_{L^2(\Omega)} \leq \epsilon_{n(\epsilon)} \leq \epsilon$ for every $\epsilon \in (0, \tfrac{1}{2})$, and every weight of $\mathrm{Learn}'(\epsilon, f)$ can be represented with $\lceil -c \log_2(\epsilon_{n(\epsilon)}) \rceil \leq \lceil -c' \log_2(\epsilon) \rceil$ bits for some $c' > 0$. Applying Theorem 2.7 to the mapping $\mathrm{Learn}'$ yields the desired contradiction.
We note that the assumption on the encoding complexity of the weights of $\mathrm{Learn}(\epsilon, f)$ in Theorem 2.8 is necessary in order to be able to store the weights of the network $\mathrm{Learn}(\epsilon, f)$ on a computer. We conclude that Theorem 2.8 exhibits a fundamental lower bound on the storage complexity of any neural network which uniformly approximates $\mathcal{C}$ to within a given accuracy.
By Theorem 2.8, a mapping Learn as in Theorem 2.8 which additionally satisfies, for all $\gamma > \gamma^*(\mathcal{C})$, the existence of a $C > 0$ with $\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\epsilon, f)) \leq C\, \epsilon^{-\gamma}$ for all $\epsilon \in (0, \tfrac{1}{2})$ (22) yields neural networks $\mathrm{Learn}(\epsilon, f)$ whose number of edges scales asymptotically optimally in terms of the approximation accuracy $\epsilon$. We cast this property into a definition.
Definition 2.9. We call a signal class $\mathcal{C} \subset L^2(\Omega)$ optimally representable by neural networks with rectifier function $\rho$ if a mapping Learn as in Theorem 2.8 exists that additionally satisfies: For all $\gamma > \gamma^*(\mathcal{C})$ there exists $C > 0$ such that (22) holds.
We would like to emphasize that optimal representability as just defined is a much stronger statement than simply providing a bound for $\gamma^*_{\mathcal{NN}}(\mathcal{C})$ as per Definition 1.3. If a signal class $\mathcal{C}$ is optimally representable, Theorem 2.8 states that the memory requirements for storing a neural network that approximates any target function $f \in \mathcal{C}$ to within a prescribed accuracy are minimal. We hasten to add that these minimal memory requirements are realized by representing each of the weights in the network by the same number of bits, of order $\log_2(1/\epsilon)$, a standing assumption throughout the paper.

Transitioning from Representation Systems to Neural Networks
Given the benchmark result Theorem 2.8, we now ask the following question: which signal classes are optimally representable by (deep) neural networks?
The remainder of the paper is devoted to demonstrating that a large family of signal classes is optimally representable by (deep) neural networks. This will be accomplished by establishing an upper bound on connectivity that matches the lower bound in Theorem 2.8.
The mathematical technique we develop in the process is interesting in its own right, as it constitutes a general framework for transferring results on function approximation through representation systems to results on approximation by neural networks. In particular, we prove that for a given signal class and associated representation system, under certain technical conditions, there exists a neural network with no more than O(M) non-zero weights that achieves the same approximation error as a best M-term approximation in that representation system. This is formalized in Theorem 3.4. By imposing slightly more stringent technical conditions, this result can be sharpened to additionally guarantee that the network's weights can be represented with no more than −c log_2(ε) bits. The proof is constructive in the sense of making the quantizer used to encode the weights explicit. Specifically, we quantize by rounding to ε^m Z ∩ [−ε^{−k}, ε^{−k}], for some m, k ∈ N. These weights can clearly be represented with no more than −c log_2(ε) bits, with a constant c > 0 depending on m and k only.
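The quantization step can be illustrated by a short sketch (our own illustration, not code from the paper): rounding each weight to the grid ε^m Z ∩ [−ε^{−k}, ε^{−k}] leaves at most on the order of ε^{−(m+k)} grid points, so each weight can be indexed with O(log_2(1/ε)) bits.

```python
# Illustrative sketch (not the paper's code): quantize a network weight by
# rounding to the grid eps^m * Z intersected with [-eps^(-k), eps^(-k)].
# Each quantized weight is an integer multiple q * eps^m with
# |q| <= eps^(-(m+k)), so it can be stored with O(log2(1/eps)) bits.

import math

def quantize_weight(w, eps, m=2, k=1):
    """Round w to the nearest point of eps^m * Z, clipped to [-eps^-k, eps^-k]."""
    step = eps ** m
    bound = eps ** (-k)
    w_clipped = max(-bound, min(bound, w))
    return step * round(w_clipped / step)

def bits_per_weight(eps, m=2, k=1):
    """Bits needed to index the grid: about 2 * eps^-(m+k) + 1 points."""
    n_points = 2 * math.floor(eps ** (-(m + k))) + 1
    return math.ceil(math.log2(n_points))

eps = 0.01
w = 0.123456789
wq = quantize_weight(w, eps)
assert abs(w - wq) <= eps ** 2 / 2 + 1e-12   # rounding error: half a grid step
```

Note that `bits_per_weight(eps, m, k)` grows like (m + k) log_2(1/ε), matching the −c log_2(ε) bound with c depending on m and k only.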
We start by stating the technical conditions on the representation system needed for the aforementioned transference results to hold. For X ⊂ R^d we say that A : X → R is polynomially bounded in x if there exists a d-dimensional polynomial p such that |A(x)| ≤ |p(x)| for all x ∈ X. We call a collection of functions A_i, i ∈ I, uniformly polynomially bounded if all A_i are polynomially bounded with the same bounding polynomial.

Definition 3.1. Let Ω ⊂ R^d, ρ : R → R, and let D = (ϕ_i)_{i∈N} ⊂ L^2(Ω) be a representation system. Then, D is said to be representable by neural networks (with rectifier ρ) if there exists R ∈ N such that for all η > 0 and every i ∈ N there is a neural network Φ_{i,η} ∈ NN_{∞,R,d,ρ} with ‖ϕ_i − Φ_{i,η}‖_{L^2(Ω)} ≤ η. If, in addition, the neural networks Φ_{i,η} ∈ NN_{∞,R,d,ρ} have weights that are uniformly polynomially bounded in (i, η^{−1}), and if ρ is either globally Lipschitz continuous or differentiable with polynomially bounded first derivative, we call the representation system (ϕ_i)_{i∈N} effectively representable by neural networks (with rectifier ρ).

Remark 3.2. We remark that, just like effective representability in a representation system (Definition 2.2), effective representability by neural networks depends on the ordering of the elements in the representation system.
The following lemma demonstrates that, for a network with weights and number of non-zero weights both polynomially bounded in η^{−1}, it is always possible to replace the network by one whose weights can be represented with no more than −c log_2(η) bits per weight without incurring an overall approximation error larger than η.

Proof. We present the proof for Lipschitz-continuous ρ only. The other case follows similarly but is slightly more technical. Let m ∈ N, to be specified later, and, for a given network Φ, let Φ̃ denote the network that results from replacing all weights of Φ by a closest element in η^m Z ∩ [−η^{−k}, η^{−k}]. Set C_max := η^{−k} and denote the total number of non-zero weights in the network by C_W. Note that C_W ≤ 3M ≤ 3η^{−k}, where the latter inequality holds by assumption. For ℓ = 1, ..., L, define Φ_ℓ : Ω → R^{N_ℓ} accordingly and let e_ℓ denote the associated error. Set C_0 = max{1, sup{|x| : x ∈ Ω}}, denote the Lipschitz constant of ρ by C_ρ, and define C_ℓ correspondingly. Then, it is not difficult to see that the errors e_ℓ obey the recursive bound (23). We now bound the quantity C_ℓ for ℓ = 1, ..., L. A simple computation, exploiting the Lipschitz continuity of ρ, yields the corresponding estimate, and since ρ(0) < ∞ there exists C > 0 bounding the remaining terms. As C_W and C_max are both polynomially bounded in η^{−1}, it follows that C_ℓ is polynomially bounded in η^{−1}. We can therefore find n ∈ N such that C_{ℓ−1} C_ρ C_W C_max ≤ η^{−n}. Invoking (23), we conclude that there exists a constant C > 0 so that e_ℓ ≤ C η^{m−ℓn}, for ℓ = 1, ..., L. Finally, choosing m large enough yields e_L ≤ η, which was to be established.
The next theorem relates M-term approximation properties of representation systems to M-edge approximation properties of neural networks.
where ε ∈ (0, 1/2). Then, there exists a corresponding neural network achieving the same error up to ε. In particular, for all signal classes C ⊂ L^2(Ω), it holds that γ*_{NN}(C) ≥ γ*(C, D).

Proof. By the triangle inequality, it suffices to show that there exists a neural network that approximates f_M to within ε. By representability according to Definition 3.1, there is R ∈ N such that for each i ∈ I_M and for η := ε/M there exists a neural network Φ_{i,η} ∈ NN_{∞,R,d,ρ} with ‖ϕ_i − Φ_{i,η}‖_{L^2(Ω)} ≤ η. Let then Φ(f, M) be a neural network consisting of the networks (Φ_{i,η})_{i∈I_M} operating in parallel, all with the same input, and summing their one-dimensional outputs with weights (c_i)_{i∈I_M} according to Φ(f, M) = Σ_{i∈I_M} c_i Φ_{i,η}. Then, Φ(f, M) ∈ NN_{R,M′,d,ρ}, where M′ = RM, and, using (24), we can conclude the desired error bound, which completes the proof.
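The parallel construction in the proof above can be sketched in a few lines (a toy illustration under our own conventions, not the paper's code): M subnetworks, each with at most R non-zero weights, share the input, and their scalar outputs are summed with the M-term coefficients c_i, giving on the order of RM edges in total.

```python
# Hypothetical sketch of the construction in the proof: subnetworks
# Phi_{i,eta}, each with at most R nonzero weights, run in parallel on the
# same input; their scalar outputs are summed with coefficients c_i.

def parallel_sum_network(subnets, coeffs):
    """Return the function x -> sum_i c_i * Phi_i(x)."""
    def Phi(x):
        return sum(c * phi(x) for c, phi in zip(coeffs, subnets))
    return Phi

def edge_count(subnet_edge_counts, coeffs):
    # Each nonzero output coefficient c_i contributes one additional edge.
    return sum(subnet_edge_counts) + sum(1 for c in coeffs if c != 0)

# Toy example: each subnet is a one-neuron ReLU network.
relu = lambda t: max(0.0, t)
subnets = [lambda x, s=s: relu(x - s) for s in (0.0, 1.0, 2.0)]
Phi = parallel_sum_network(subnets, [1.0, -2.0, 1.0])  # a hat-like function
```

Here `edge_count([2, 2, 2], [1.0, -2.0, 1.0])` returns 9: six edges inside the subnetworks plus three summation weights, mirroring the M′ = RM bookkeeping (up to the output edges).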
This result shows that we can restrict ourselves to the approximation of the individual elements of a representation system by neural networks with the only constraint being that the number of edges in the individual networks must admit a uniform upper bound.
The previous result does, however, not guarantee that the weights of the network Φ(f, M) can be represented with −c log(ε) bits for a constant c > 0. To understand whether the lower bound on the connectivity in Theorem 2.8 is sharp, we thus need to extend Theorem 3.4, which will, again, be accomplished through a transfer argument.

Theorem 3.5. Let Ω ⊂ R^d be bounded and suppose that C ⊂ L^2(Ω) is effectively representable in the dictionary D = (ϕ_i)_{i∈N} ⊂ L^2(Ω) according to Definition 2.2. Suppose that D is effectively representable by neural networks according to Definition 3.1. Then, for all γ < γ*(C, D) there exist constants c, C > 0 and a map Learn such that for every f ∈ C the following statements hold: (i) there exist m, k ∈ N such that each weight of the neural network Learn(ε, f) can be represented with at most −c log_2(ε) bits, (ii) the estimate ‖f − Learn(ε, f)‖_{L^2(Ω)} ≤ ε holds, and (iii) the neural network Learn(ε, f) has at most C · ε^{−1/γ} edges with non-zero weights.
Specifically, if D is effectively representable by neural networks, then optimality of D for the signal class C in the sense of Definition 2.5 implies that C is optimally representable by neural networks in the sense of Definition 2.9.
In Section 4 we provide a large class of dictionaries that are effectively representable by neural networks and yield effective representations for relevant function classes.
Proof of Theorem 3.5. Let f ∈ C, M ∈ N, and γ < γ*(C, D). Pick a subset I_M ⊂ {1, ..., π(M)} and coefficients (c_i)_{i∈I_M} according to Definition 2.2 such that (6) holds and, in the notation of (6), set ε_M/2 := C M^{−γ}. Let Φ(f, M) be constructed as in the proof of Theorem 3.4. Note that Φ(f, M) ∈ NN_{R,RM,d,ρ} for some R ∈ N independent of f. Next, note that by effective representability of C in D, the weights (c_i)_{i∈I_M} in the last layer of Φ(f, M) are uniformly bounded, independently of f ∈ C. By the assumption of D being effectively representable by neural networks, the weights in the approximating networks Φ_{i,η}(x) in (25) are uniformly polynomially bounded in (i, M/ε_M), and since i ≤ π(M), they are uniformly polynomially bounded in (M, ε_M^{−1}). Since M = (ε_M/(2C))^{−1/γ}, it follows that the weights in the approximating networks Φ_{i,η}(x) in (25) are uniformly polynomially bounded in ε_M^{−1}. Invoking Lemma 3.3 yields the existence of m, k ∈ N such that there is Φ̃(f, M) satisfying the same error bound and whose weights all lie in the quantization grid. We set Learn(ε_M, f) := Φ̃(f, M). For general ε ∈ (0, 1/2), we choose M(ε) accordingly, and Learn(ε, f) has RM(ε) ≤ R((ε/(2C))^{−1/γ} + 1) edges, which yields the desired result.

All Affine Representation Systems are Effectively Representable by Neural Networks
This section shows that a large class of dictionaries, namely affine systems as defined below, is effectively representable by neural networks. Affine systems include as special cases wavelets, ridgelets, curvelets, shearlets, α-shearlets, and more generally α-molecules. Combined with Theorem 3.5, the results in this section establish the central conceptual fact that any signal class that is effectively representable in an arbitrary affine system is optimally representable by neural networks in the sense of Definition 2.9.
Clearly, such strong statements are possible only under restrictions on the choice of the rectifier function for the approximating neural network.

Choice of Rectifier Function
We consider two classes of rectifier functions, namely sigmoidal functions [7] and smooth approximations of rectified linear units, both defined formally next.
Definition 4.1. The function ρ is called a strongly sigmoidal function of order k if there exist constants a, b, K > 0 such that the corresponding growth and decay conditions hold.

One of the most commonly used rectifier functions is the so-called rectified linear unit (ReLU) given by x → max{0, x}. The second class of rectifiers we consider are smooth versions of the ReLU.

Definition 4.2. Let ρ ∈ C^∞(R) coincide with the ReLU outside a compact interval and satisfy the corresponding derivative bounds for some constant K > 0. Then, we call ρ an admissible smooth rectifier.
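An admissible smooth rectifier can be realized by the standard smooth-cutoff construction; the following is our own sketch (not a construction from the paper) of a C^∞ function that agrees with the ReLU outside the compact interval [−1, 1].

```python
# A sketch (our construction, not from the paper) of an admissible smooth
# rectifier: a C^infinity function that agrees with the ReLU x -> max(0, x)
# outside the compact interval [-1, 1].

import math

def _h(t):
    # Standard smooth cutoff: h(t) = exp(-1/t) for t > 0, else 0.
    return math.exp(-1.0 / t) if t > 0 else 0.0

def smooth_step(t):
    # Smooth transition from 0 (t <= -1) to 1 (t >= 1).
    return _h(t + 1) / (_h(t + 1) + _h(1 - t))

def rho(x):
    return x * smooth_step(x)

assert rho(-2.0) == 0.0 and rho(3.0) == 3.0   # equals the ReLU outside [-1, 1]
```

Smoothness of `rho` follows from the standard fact that `_h` is C^∞ with all derivatives vanishing at 0; on [−1, 1] the function interpolates smoothly between 0 and the identity.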
The reason for considering these two specific classes of rectifiers resides in the fact that neural networks based thereon allow effective representations of multivariate bump functions. We demonstrate in the sequel that the effective representation of multivariate bump functions by neural networks leads to the effective representation of all affine systems by neural networks. Approximation of multivariate bump functions using sparsely connected neural networks is a classical topic in neural network theory [32]. What is new here is the aspect of quantized weights.
In [5] it was shown that B-splines can be well approximated by neural networks with sigmoidal rectifiers. It is instructive to recall this result. To this end, for m ∈ N, we denote the univariate cardinal B-spline of order m by N_m, i.e., N_1 = χ_{[0,1]} and N_{m+1} = N_m * N_1 for all m ≥ 1. This result alone does, however, not establish that dictionaries based on B-splines are effectively representable by neural networks. For this to hold, we need the weights in the approximating neural network Φ_{D,ε} to be polynomially bounded in 1/ε. We next show that this is, indeed, possible for strongly sigmoidal rectifiers; moreover, the weights of the network Φ_{D,ε} are then polynomially bounded in (D, ε^{−1}).
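The cardinal B-splines just defined can be evaluated with the classical Cox–de Boor recursion; the following short sketch (a standard construction, not code from [5]) evaluates N_m and its tensor-product multivariate version.

```python
# Illustration (standard construction, not code from [5]): the univariate
# cardinal B-spline N_m via the Cox-de Boor recursion, N_1 = indicator of [0,1).

def cardinal_bspline(m, x):
    if m == 1:
        return 1.0 if 0.0 <= x < 1.0 else 0.0
    return (x * cardinal_bspline(m - 1, x)
            + (m - x) * cardinal_bspline(m - 1, x - 1.0)) / (m - 1)

def bspline_nd(m, point):
    """d-dimensional B-spline of order m as a tensor product."""
    prod = 1.0
    for xi in point:
        prod *= cardinal_bspline(m, xi)
    return prod

# N_2 is the hat function on [0, 2] with peak value 1 at x = 1.
assert cardinal_bspline(2, 1.0) == 1.0
```

The recursion reproduces the convolution definition: N_2 is the hat function, N_3 the quadratic B-spline supported on [0, 3], and so on.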
Proof. The neural network Φ_{D,ε} in Theorem 4.3 is explicitly constructed in [5]. Carefully following the steps in that construction and making explicit use of the strong sigmoidality of ρ, as opposed to plain sigmoidality as in [5], yields the desired result.
Likewise, taking ρ to be an admissible smooth rectifier, we can ensure that a neural network with only three layers yields a smooth bump function. The corresponding result is as follows.
Theorem 4.5. Let ρ be an admissible smooth rectifier. Then, for all d ∈ N, there exist M ∈ N and a neural network Φ_ρ ∈ NN_{3,M,d,ρ} with the desired bump-function properties.

Proof. For simplicity of exposition, we consider the case d = 2 only; the general case requires minor modifications. We start by constructing an auxiliary function as follows. For 0 < p_1 ≤ p_2 ≤ p_3 such that p_1 + p_2 = p_3, define t : R → R as in (26). Then, t is smooth and compactly supported. Letting q = ‖t‖_{L^∞(R)}, we choose g : R^2 → R according to (27). By construction, g is smooth and compactly supported. Moreover, the two-step design of g per (26) and (27) allows the realization of g by a three-layer neural network. Since g ≥ 0 and g is not identically zero, it follows that |ĝ(0)| > 0. Hence, there exists δ > 0 such that |ĝ(ξ)| > 0 for all ξ ∈ B_δ(0). We now set ϕ accordingly, which can then also be realized by a neural network Φ_ρ ∈ NN_{3,M,2,ρ} for some M ∈ N. Since, by construction, |ϕ̂(ξ)| > 0 for all ξ ∈ B_3(0), the function ϕ and hence Φ_ρ indeed satisfy the desired assumptions.
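The two-step design can be made concrete numerically. The exact formulas below are our reading of (26) and (27), not a quote from the paper, and we use the plain ReLU for readability; with an admissible smooth rectifier, the same formulas yield a smooth bump.

```python
# Hedged sketch of the two-step design: with p1 + p2 = p3, the function
# t(x) = rho(x) - rho(x - p1) - rho(x - p2) + rho(x - p3) is a compactly
# supported trapezoid; g(x1, x2) = rho(t(x1) + t(x2) - q), with q = max |t|,
# is then a compactly supported 2-d bump. The formulas for t and g are our
# reading of (26)-(27); the ReLU stands in for the smooth rectifier.

def rho(x):                      # ReLU stand-in for the rectifier
    return max(0.0, x)

def t(x, p1=1.0, p2=1.0, p3=2.0):
    assert abs((p1 + p2) - p3) < 1e-12
    return rho(x) - rho(x - p1) - rho(x - p2) + rho(x - p3)

q = 1.0                          # sup-norm of t for p1 = p2 = 1, p3 = 2

def g(x1, x2):
    return rho(t(x1) + t(x2) - q)

assert g(1.0, 1.0) == 1.0        # peak at the center of the support
assert g(3.0, 1.0) == 0.0        # vanishes away from the support
```

With p_1 = p_2 = 1 and p_3 = 2, t is the hat function on [0, 2], which is exactly the choice used for the trained topology in the numerical experiments of Section 7.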

Invariance to Affine Transformations
We next leverage the approximation results for bump functions to demonstrate that a wide class of representation systems is effectively representable by neural networks. As a first step towards establishing this general result, we show in this section that representability of a function f by neural networks in the sense of Definition 3.1 is invariant under affine transformations and finite linear combinations of shifts of f. Moreover, we will see that if f can be approximated by a neural network to within an error of ε with the network's weights polynomially bounded in ε^{−1}, the same holds true for the approximating networks corresponding to affine transformations and finite linear combinations of f.

Proposition 4.6. Let d ∈ N and ρ : R → R. Assume that there exist constants M, L ∈ N such that for all D, ε > 0 there is Φ_{D,ε} ∈ NN_{L,M,d,ρ} with the stated approximation property. Let A ∈ R^{d×d} be an expanding matrix and b ∈ R^d. Then, there exists M′ depending on M and d only such that for all E, η > 0 there is Ψ_{E,η} ∈ NN_{L,M′,d,ρ} with the corresponding property. If the weights of Φ_{D,ε} are polynomially bounded in ε^{−1}, then the weights of Ψ_{E,η} are polynomially bounded in (|det(A)|, E, |b|, η^{−1}).
Proof. The proof proceeds by noting that, for every Φ ∈ NN_{L,M,d,ρ}, the composition of Φ with the affine map x → Ax + b, rescaled by the appropriate power of |det(A)|, is again a network in NN_{L,M′,d,ρ} for some M′ depending on M and d only. Therefore, the network obtained in this way exhibits the desired properties.

Next, we show that effective representability by neural networks is preserved under finite linear combinations of translates.
Proposition 4.7. Let d ∈ N and ρ : R → R. Assume that there exist constants M, L ∈ N such that for all D, ε > 0 there is Φ_{D,ε} ∈ NN_{L,M,d,ρ} satisfying (30). Then, there exists M′ depending on M, d, and r only such that for all E, η > 0 there is Ψ_{E,η} ∈ NN_{L,M′,d,ρ} with the corresponding property. If the weights of Φ_{D,ε} are polynomially bounded in ε^{−1}, then the weights of Ψ_{E,η} are polynomially bounded in the corresponding parameters.

Proof. We first note that for Φ ∈ NN_{L,M,d,ρ} satisfying (30), a finite linear combination of translates of Φ is in NN_{L,M′,d,ρ} with M′ depending on d, r, and M only. Moreover, by the triangle inequality, we have for all D, ε > 0 that the resulting network exhibits the desired properties.
The operations introduced in Propositions 4.6 and 4.7 can be used to construct neural networks that approximate functions with a given number of vanishing moments. This property will turn out to be crucial in demonstrating representability of affine systems.
The next lemma establishes that an arbitrary number of vanishing moments in a given coordinate direction can be generated by suitable linear combinations of translates of a given function of compact support.

Lemma 4.9. Let R, d ∈ N, B > 0, k ∈ {1, ..., d}, and f ∈ C(R^d) with compact support. Then, the function g defined in (32) has R vanishing moments in the k-th coordinate direction.

Proof. For simplicity of exposition, we consider the case B = 1 only. Taking the Fourier transform of (32), we see that ĝ factors as in (34), which implies that ĝ vanishes to the required order at the origin. But looking at Definition 4.8, this property precisely says that g possesses the desired vanishing moments. Statement (33) follows by inspection of (34).
Remark 4.10. Lemma 4.9 and Definition 4.8 consider vanishing moments in one coordinate direction only. It is immediate that, along the same lines, vanishing moments in multiple coordinate directions can be generated.
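The mechanism behind Lemma 4.9 can be checked numerically. Our reading of the linear combination in (32) is an R-fold finite difference of translates, g(x) = Σ_{r=0}^R (−1)^r binom(R, r) f(x − rB), whose Fourier transform picks up the factor (1 − e^{−2πiBξ})^R and hence vanishes to order R at the origin; the sketch below verifies the resulting vanishing moments for a 1-d hat function.

```python
# Numerical sanity check of the mechanism behind Lemma 4.9 (our reading:
# the linear combination is an R-fold finite difference of translates):
# g(x) = sum_{r=0}^R (-1)^r binom(R, r) f(x - r*B) has R vanishing moments,
# since its Fourier transform carries the factor (1 - e^{-2 pi i B xi})^R.

from math import comb

def f(x):                                  # compactly supported hat function
    return max(0.0, 1.0 - abs(x - 1.0))

def g(x, R=3, B=1.0):
    return sum((-1) ** r * comb(R, r) * f(x - r * B) for r in range(R + 1))

def moment(func, n, lo=-1.0, hi=7.0, steps=8000):
    # Simple midpoint-rule approximation of the n-th moment.
    h = (hi - lo) / steps
    return sum(((lo + (i + 0.5) * h) ** n) * func(lo + (i + 0.5) * h)
               for i in range(steps)) * h

for n in range(3):                         # moments 0, 1, 2 vanish for R = 3
    assert abs(moment(g, n)) < 1e-3
assert abs(moment(g, 3)) > 1.0             # the third moment does not vanish
```

For R = 3, the moments of orders 0, 1, 2 vanish exactly (any polynomial of degree less than R is annihilated by the R-fold difference), while the third moment equals −3! times the mass of f.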

Affine Representation Systems
In this section, we introduce a family of representation systems, coined affine systems, that includes all representation systems based on affine scaling and translations of a given "mother function".Special cases of affine systems are wavelets, ridgelets, curvelets, shearlets, α-shearlets, and more generally α-molecules, as well as tensor products thereof.
The formal definition of affine systems is as follows. For a given affine system according to Definition 4.11, we can establish a natural order of its elements by concatenating the sub-systems as follows. First, we note that, as a consequence of the compact support of the functions g_s, all sub-systems D_{s,j} have their cardinality bounded as in (36), where the implicit constant depends on δ and Ω only. Each sub-system D_{s,j} may be ordered arbitrarily. The overall representation system D is organized according to (37). This ordering is assumed in the remainder of the paper and will sometimes be referred to as the canonical ordering.
The next result establishes that all affine systems are (effectively) representable by neural networks in the sense of Definition 3.1.

Theorem 4.12. Suppose that Ω ⊂ R^d is bounded and D = (ϕ_i)_{i∈N} ⊂ L^2(Ω) is an affine system according to Definition 4.11. Suppose further that for the rectifier function ρ : R → R there exists a constant M such that for all D, ε > 0 there is Φ_{D,ε} ∈ NN_{L,M,d,ρ} with (38), where f is as in Definition 4.11. Then, D is representable by neural networks with rectifier function ρ in the sense of Definition 3.1.
If, in addition, the weights of Φ_{D,ε} are polynomially bounded in (D, ε^{−1}), and if there exist a > 0 and c > 0 such that (39) holds for all j ∈ N, then D is effectively representable by neural networks with rectifier function ρ.

To establish representability, we need to show the existence of a constant C > 0 such that for all i ∈ N and all η > 0 there exist networks Φ_{i,η} satisfying (40). The elements of D are constructed through dilations and translations of f according to (41), for some s ∈ {1, ..., S}, j ∈ N, and b ∈ R^d. One can now see that for all ϕ_i ∈ D we have |b| = O(|det(A_j)|) in (41). Thus (40) follows directly from Propositions 4.6 and 4.7.
It remains to show that a polynomial bound in (D, ε^{−1}) on the weights of Φ_{D,ε} in (38) implies that D is effectively representable by neural networks with rectifier function ρ. By Definition 3.1, this means that the weights of Φ_{i,η} grow at most polynomially in (i, η^{−1}). Consider ϕ_i as in (41). It remains to show that the quantities |det(A_j)| are polynomially bounded in i. To see this, consider the canonical ordering (37) and observe that ϕ_i according to (41) satisfies ϕ_i ∈ D_{s,j} for some s ∈ {1, ..., S}, where D_{s,j} was defined in (35). Together with the bound (36), it follows that, thanks to the canonical ordering, there exists a constant c > 0 such that (42) holds. We finally appeal to (39) to conclude that |det(A_j)| is polynomially bounded in i which, together with (42), establishes the desired result.
Note that condition (39) is very weak; we are not aware of an affine representation system that would violate it. We now present the main technical result of this paper, namely that neural networks provide optimal approximations for all signal classes that are optimally approximated by any affine system!

Theorem 4.13. Assume that Ω ⊂ R^d is bounded and D = (ϕ_i)_{i∈N} ⊂ L^2(Ω) is an affine system according to Definition 4.11. Assume further that for a rectifier function ρ : R → R there exists a constant M such that for all D, ε > 0 there is Φ_{D,ε} ∈ NN_{L,M,d,ρ} satisfying (38). Then, for all signal classes C ⊂ L^2(Ω), we have γ*_{NN}(C) ≥ γ*(C, D). If, in addition, the weights of Φ_{D,ε} in (38) are polynomially bounded in (D, ε^{−1}), if there exist a > 0 and c > 0 such that (39) holds for all j ∈ N, and if C ⊂ L^2(Ω) is effectively representable in D (according to Definition 2.2), then for all γ < γ*(C, D) there exist c, C > 0 and a mapping Learn as in Theorem 3.5.

Proof. The proof follows directly by combining Theorem 4.12 with Theorems 3.4 and 3.5.
Theorem 4.13 reveals a remarkable universality property of neural networks. In particular, it provides an answer to the question posed at the beginning of Section 3: as soon as a signal class can be optimally represented by an affine system, the bound in Theorem 2.8 is tight!

α-Shearlets and Cartoon-Like Functions
We next present a pair (C, D) of signal class and dictionary such that γ*_{NN}(C) = γ*(C, D). Specifically, we consider α-shearlets as representation systems in L^2(R^2) and (1/α)-cartoon-like functions as signal class. We point out that the definition of α-shearlets in this paper differs slightly from that in [22]: our definition replaces 1/α by α so that α-shearlets are α-molecules, whereas in [22] α-shearlets are a special case of 1/α-molecules. We will need the scaling and shearing matrices defined below. This leads us to the following definition, which is a slightly modified version of the corresponding definition in [39]. We mention that α-shearlets fall into the framework of α-molecules [22]. Our interest in α-shearlets stems from the fact that they yield optimal M-term approximation rates for β-cartoon-like functions, defined as follows.
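The scaling and shearing matrices can be written out explicitly; the sketch below is our transcription of the standard definitions, with anisotropic scaling A_{α,j} = diag(2^j, 2^{αj}) and shear S_k = [[1, k], [0, 1]].

```python
# Sketch of the anisotropic scaling and shearing matrices used for
# alpha-shearlets (our transcription of the standard definitions:
# A_{alpha, j} = diag(2^j, 2^{alpha * j}) and shear S_k = [[1, k], [0, 1]]).

def scaling_matrix(alpha, j):
    return ((2.0 ** j, 0.0), (0.0, 2.0 ** (alpha * j)))

def shear_matrix(k):
    return ((1.0, float(k)), (0.0, 1.0))

def matmul2(A, B):
    return tuple(tuple(sum(A[i][t] * B[t][j] for t in range(2))
                       for j in range(2)) for i in range(2))

# alpha = 1/2 recovers classical parabolic shearlet scaling, det = 2^{3j/2}.
A = scaling_matrix(0.5, 2)
assert A == ((4.0, 0.0), (0.0, 2.0))
SA = matmul2(shear_matrix(1), A)            # shear composed with scaling
```

The parameter α interpolates between isotropic scaling (α = 1, wavelet-like) and pure ridge-type scaling (α = 0, ridgelet-like), with α = 1/2 giving classical shearlets.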
This function class was originally introduced in [15] as a model class for functions governed by curvilinear discontinuities of prescribed regularity.In this sense, β-cartoon-like functions provide a convenient model for images governed by edges or for the solutions of transport equations which are often governed by singularity curves.
The optimal exponent γ*(E^β(R^2; ν)) was determined in [15, 23]: for β ∈ [1, 2) and ν > 0, we have γ*(E^β(R^2; ν)) = 2/β.

Proof. The proof of [15, Thm. 2] demonstrates that C ⊂ L^2(R^2) has optimal exponent γ*(C) = 2p/(2−p) if C contains a copy of ℓ^p_0. The result now follows since, by [23], the function class E^β(R^2; ν) does, indeed, contain a copy of ℓ^p_0 for p = 2/(β+1).

Using Theorem 2.8, this result allows us to conclude that, independently of the learning algorithm employed, and asymptotically in ε, there always exists a function f ∈ E^β(R^2; ν) such that the learned network approximating f with an accuracy of ε < M^{−β/2} must have more than O(M) edges. It will turn out that a neural network exists which admits precisely this optimal number of edges.
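The exponent arithmetic in the proof can be verified symbolically: substituting p = 2/(β+1) into 2p/(2−p) simplifies to 2/β, which is consistent with the accuracy threshold ε < M^{−β/2} (i.e., roughly ε^{−2/β} edges are required).

```python
# Worked check of the exponent arithmetic: with p = 2/(beta + 1), the optimal
# exponent gamma*(C) = 2p/(2 - p) simplifies to 2/beta; equivalently, accuracy
# eps ~ M^{-beta/2} forces roughly eps^{-2/beta} edges.

from fractions import Fraction

for beta in (Fraction(1), Fraction(3, 2), Fraction(9, 5)):
    p = 2 / (beta + 1)
    gamma_star = 2 * p / (2 - p)
    assert gamma_star == 2 / beta
```

For instance, β = 2 (the curvelet/shearlet benchmark limit) gives p = 2/3 and γ* = 1, matching the classical M^{−1} cartoon approximation rate.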
The following theorem now shows that α-shearlets yield best M-term approximation rates for 1/α-cartoon-like functions.
Then, there exists a constant δ* > 0 such that, for all δ < δ*, the system SH_α(f, g, δ) is optimal for the signal class E^{1/α}(R^2; ν) in the sense of Definition 2.5.
Remark 5.5. The assumptions on the smoothness and vanishing moments of f and g in Theorem 5.4 follow from [39, Equation 4.9] with s_1 = 3/2, s_0 = 0, p_0 = q_0 = 2/3, and |β| ≤ 4. While these particular choices allow the statement of the theorem to be independent of α, it is possible to weaken the assumptions if a fixed α is considered. For example, for α = 1/2 the smoothness assumptions on f and g reduce to f ∈ C^{11}, g ∈ C^{28}.
We proceed to the main statement of this section.

Theorem 5.6. Suppose that ρ : R → R is either a strongly sigmoidal function of order k ≥ 2 (see Definition 4.1) or an admissible smooth rectifier (see Definition 4.2). Then, for every β ∈ [1, 2), the signal class E^β(R^2; ν) is optimally representable by neural networks with rectifier function ρ.

Combining the approximations of the individual terms that appear in the definition of f, we have constructed a neural network Φ which approximates f on Γ. We thus observe the following.
Whenever a signal class C is invariant w.r.t. diffeomorphisms (in our construction the functions Ξ_i) and multiplication by smooth functions (in our construction the functions h_i), approximation results on R^m can be leveraged to approximation results on m-dimensional submanifolds Γ ⊂ R^d.
Such invariances are, in particular, satisfied for all signal classes characterized by a particular smoothness behavior, for example the class of cartoon-like functions studied in Section 5. Somewhat informally, we thus summarize that neural networks are optimal for the approximation of piecewise smooth functions on manifolds.

Numerical Results
Our theoretical results in Section 2 show that, for every function class C ⊆ L^2(R^d), the number of edges of a neural network stemming from a fixed learning algorithm must obey a fundamental lower bound. One of the most widely used learning algorithms is stochastic gradient descent with the gradient computed via backpropagation [41]. The purpose of this section is to investigate how the stochastic gradient descent algorithm fares relative to our lower bound. Surprisingly, our numerical experiments indicate that, for a fixed sparsely connected network topology inspired by the construction of bump functions in Equations (26) and (27) and with the ReLU as rectifier, the stochastic gradient descent algorithm generates neural networks that achieve high M-edge approximation rates. In particular, we identify two classes where these rates are optimal.
The topology that we prescribe is depicted in Figure 3. Let us briefly explain why we chose this network topology. As mentioned before, admissible smooth rectifiers consist of smooth functions which equal a ReLU outside a compact interval. For this class of rectifiers, the appropriate shearlet generators were constructed from a function g as specified in (27). Choosing p_1 = p_2 = 1 and p_3 = 2 in (26) yields hat functions t. This construction implies that we only require six nodes in the first layer of each subnetwork. Additionally, the derivative of the resulting function t does not vanish on the support of t. Both properties are beneficial in terms of the learning speed of the stochastic gradient descent algorithm. In Figure 3, we see four network realizations of g in parallel. The input layer consists of four inputs: the first two represent x_1, x_2, whereas the third and fourth are constant. The first four neurons in the first layer correspond to the construction (26); the first neuron in the second layer corresponds to (27). In the output layer, all incoming signals are added to yield a sum of the subnetworks.
We now train the network using the backpropagation algorithm. Following (27), the weights associated with the edges between the first and the second layer remain fixed; only the weights of the edges between the input and the first layer, as well as of the edges between the second layer and the output layer, are trained. Training is performed for two different functions: one with a line singularity (Figure 4(a)) and one cartoon-like function (Figure 5(a)). We train the network by drawing samples (x_1, x_2) ∈ [−1, 1]^2 from an equispaced grid in [−1, 1]^2 to which we then apply the network. The error determined by comparing to the original function is backpropagated through the network. We repeat this procedure for different network sizes, i.e., for different numbers of subnetworks. We start by discussing the results for the function with a line singularity in Figure 4(a). The approximation error corresponding to the trained neural network is shown in Figure 4(b). The super-linear decay of the error on the semilogarithmic scale indicates super-exponential decay of the error with respect to the number of edges.
Also, interestingly, the trained subnetworks yield α-molecules for α = 0 (see Figures 4(c)-(e)).These functions are constant along one direction and vary along another, hence can be considered part of a ridgelet system, which is, in fact, an optimally sparsifying system for line singularities.Moreover, the orientation of the three learned ridge functions matches that of the original function.
In the second experiment, we draw samples from the function depicted in Figure 5(a) below, which exhibits a curvilinear singularity. The weights in the last layer are then optimized using the Lasso [45] to obtain a sparse weight vector c*. Considering networks built from a combination of the subnetworks associated with the N largest coefficients of c* yields a sequence of networks with corresponding errors according to Figure 5(b). Finally, we investigate whether the approximation delivered by this procedure is similar to what would be obtained by best N-term approximation with standard shearlet systems, where it is known [31, 25] that shearlet elements on high scales tend to cluster around singularities. We analyze this behavior in Figures 5(g)-(i). In Figure 5(g), we depict the weighted sum of those subnetworks that have the largest support; in Figure 5(h), we show weighted sums of subnetworks with medium-sized support; and in Figure 5(i), we sum up only the subnetworks with the smallest supports. We observe that, indeed, subnetworks of large support approximate the smooth part of the underlying function, whereas the small-sized subnetworks resolve the jump singularity.
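The last-layer sparsification step can be sketched in a few lines. The paper only specifies that the Lasso [45] is applied to the last-layer weights; the iterative soft-thresholding (ISTA) solver below is our own choice of method, and the tiny data matrix is purely illustrative.

```python
# Minimal sketch of the last-layer sparsification step: fix the subnetwork
# outputs, collect them as columns of a feature matrix A, and solve the Lasso
#   min_c  (1/2) * ||A c - y||_2^2 + lam * ||c||_1
# by iterative soft-thresholding (ISTA). ISTA is our choice of solver; the
# paper only states that the Lasso [45] is used.

def ista_lasso(A, y, lam, step, iters=3000):
    n_cols = len(A[0])
    c = [0.0] * n_cols
    for _ in range(iters):
        # residual r = A c - y, computed once per iteration
        r = [sum(A[i][j] * c[j] for j in range(n_cols)) - y[i]
             for i in range(len(y))]
        for j in range(n_cols):
            grad_j = sum(A[i][j] * r[i] for i in range(len(y)))
            z = c[j] - step * grad_j
            thr = step * lam          # soft-threshold at step * lam
            c[j] = (z - thr) if z > thr else (z + thr) if z < -thr else 0.0
    return c

# Toy problem: y equals the first of three features exactly.
A = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
y = [1.0, 0.0, 1.0, 2.0]
c = ista_lasso(A, y, lam=0.1, step=0.1)
```

On this toy problem, the Lasso correctly selects only the first feature (with the usual slight shrinkage of its coefficient), which is exactly the behavior exploited in the experiment: subnetworks with negligible coefficients are dropped, and the N largest surviving coefficients define the sparse network.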

Figure 2: Example of a natural numbering of neurons and edges.

N_{m+1} = N_m * N_1, for all m ≥ 1. Multivariate B-splines are simply tensor products of univariate B-splines. Specifically, we define, for d ∈ N, the d-dimensional B-spline of order m by N^d_m.

Theorem 4.3 ([5], Thm. 4.2). Let d, m, k ∈ N, and take ρ to be a sigmoidal function of order k ≥ 2. Further, let L := ⌈log(md − d)/log(k)⌉ + 1. Then, there is M ∈ N, possibly dependent on d, m, k, such that for all D, ε > 0, there exists a network Φ_{D,ε} ∈ NN_{L,M,d,ρ} with the stated approximation property.

Proof. Without loss of generality, we assume that Ω = [−D, D]^d for some fixed D > 0; otherwise, we pick D such that Ω ⊂ [−D, D]^d. We first show that (38) implies representability of D by neural networks with rectifier function ρ.
Learn : (0, 1/2) × C → NN_{∞,∞,d,ρ}, such that for every f ∈ C: (i) each weight of the neural network Learn(ε, f) can be represented with at most −c log_2(ε) bits, (ii) the estimate ‖f − Learn(ε, f)‖_{L^2(Ω)} ≤ ε holds, and (iii) the neural network Learn(ε, f) has at most C · ε^{−1/γ} edges. Specifically, optimality of D for the signal class C implies optimal representability of C by neural networks.

Figure 3: Left: Topology of the neural network trained using stochastic gradient descent.The network consists of a weighted sum of four subnetworks.The four inputs are x 1 , x 2 ∈ [−1, 1] plus two inputs set to 1. Right: A single subnetwork.

Figure 4: (a): Function with a line singularity.(b): Approximation error as a function of the number of edges.(c)-(e): Depiction of the functions obtained by restricting to the subnetworks with the largest weights in modulus in the final layer.
Figures 5(c)-(e) show that the corresponding trained subnetworks resemble anisotropic molecules at different scales and with different orientations. Concerning the approximation error, we report that the expected decay rate of O(N^{−1}) was not visible when simply training with different network sizes. However, with a slight adaptation one obtains the result of Figure 5(b), which demonstrates a decay of roughly N^{−1}. The specifics of this adaptation are as follows: we first train a large network with ∼10000 edges, again by stochastic gradient descent.

Figure 5: (a): Function with curvilinear singularity to be approximated by the network.(b): Approximation error in terms of edges.(c)-(f): Shearlet-like subnetworks.(g): Reconstruction using only the 10 subnetworks with the largest supports.(h): Reconstruction using only subnetworks with medium-sized support.(i): Reconstruction using only subnetworks with very small support.