Implicit Deep Learning

We define a new class of"implicit"deep learning prediction rules that generalize the recursive rules of feedforward neural networks. These models are based on the solution of a fixed-point equation involving a single a vector of hidden features, which is thus only implicitly defined. The new framework greatly simplifies the notation of deep learning, and opens up new possibilities, in terms of novel architectures and algorithms, robustness analysis and design, interpretability, sparsity, and network architecture optimization.


Introduction
In this paper we introduce a new class of deep learning models that are based on implicit prediction rules. Such rules are not obtained via a recursive procedure through several layers, as in current neural networks. Instead, they are based on solving a fixed-point equation in some single "state" vector x ∈ R n that contains the hidden features: for a given input vector u the predicted vector iŝ with φ is the activation function, and matrices A, B, C, D contain model parameters. Since x cannot be in general solved in closed-form, the model above provides x only implicitly.
Perhaps surprisingly, the implicit framework includes current neural network architectures as special cases. Implicit models are a much wider class, as they have a lot more capacity, as measured by the number of parameters for a given dimension of the hidden features. Implicit rules open up the possibility of using novel architectures and prediction rules for deep learning, which are not based on any notion of "network" or "layers", as is classically understood. They also enable novel algorithms for solving the training problem, notably allowing for constrained optimization. In addition, they allow one to consider rigorous approaches to challenging problems in deep learning, ranging from robustness, sparsity and interpretability, and feature selection.
Related work. Recent works have considered versions of implicit models, and demonstrated their potential in deep learning. Additionally, recent work by Kolter and and collaborators [4,13] demonstrated success of an entirely implicit framework, which they call Deep Equilibrium Models, for the task of sequence modeling. Paper [7] uses implicit methods to solve and construct a general class of models known as neural ordinary differential equations, while [8] uses implicit models to construct a differentiable physics engine that enables gradient-based learning and high sample efficiency. Furthermore, many papers explore the concept of integrating implicit models with modern deep learning methods in a variety of ways. For example, [20] show promise in integrating logical structures into deep learning by incorporating a semidefinite programming (SDP) layer into a network in order to solve a (relaxed) MAXSAT problem. In [16], the authors propose to integrate a differentiable game solver in deep network architectures and in [1] the authors propose to include a model predictive control as a differentiable policy class for deep reinforcement learning, both of which can be seen as novel implicit architectures. In [2] the authors introduced implicit layers where the activation is the solution of some quadratic programming problem; in [9], the authors incorporate stochastic optimization formulation for end-to-end learning task, in which the model is trained by differentiating the solution of a stochastic programming problem.
In implicit learning, there is usually no way to express the state variable in closed-form, which makes the task of computing gradients with respect to model parameters challenging. Thus, a natural idea in implicit learning is to keep the state vector as a variable in the training problem, resulting in a higher-dimensional (or, "lifted") expression of the training problem. The idea of lifting the dimension of the training problem in (non-implicit) deep learning by introducing "state" variables has been studied in a variety of works; a non-extensive list includes [17], [3], [10], [22], [23], [6] and [15]. Lifted models are trained using block coordinate descent methods, Alternating Direction Method of Multipliers (ADMM) or iterative, non-gradient based methods. In this work, we introduce a novel aspect of lifted models, namely the possibility of defining a prediction rule implicitly.
Contributions and paper outline. Our contributions in this paper are: • We establish rigorous and numerically tractable conditions for implicit rules to be well-posed. Such constraints are then used in the training problem, guaranteeing the well-posedness of the learned prediction rule. • We also discuss the corresponding training problem; following the work of [10] and [15], we represent activation functions using so-called Fenchel divergences, in order to relax the training problem into a more tractable form. • We outline the potential relevance of the new framework, specifically exploring robustness, sparsity and interpretability, and architecture optimization.
Our focus here is on the ReLU activation function: φ(·) = max(0, ·), applied component-wise to a vector argument. We may easily extend our model and results to other activation functions, such as sigmoids, leaky ReLUs, or tanh. We may also consider maps that do not operate in component-wise fashion, but rather on the whole vector argument, such as normalization, max-pooling, softmax, normalizations, etc. It is also possible to consider different activations for different (blocks of) features.
In this preliminary work, our focus is on theoretical and algorithmic underpinnings, and not on empirical validation. In particular, we do not aim at empirically proving the superiority of the new class, over current state-of-the-art deep learning models, as applied to real-world data sets. Our few numerical experiments are simply aimed at validating the proposed training algorithm, in terms of achieving a low training set loss, or recovering a model with a sparse model matrices.
Our paper is organized as follows. We define the implicit model in Section 2, expose the important notion of well-posedness in Section 3, and discuss the training problem in Section 4. Section 5 explores the use the implicit framework towards robustness against input uncertainty; Section 6 discusses issues of interpretability, sparsity and architecture optimization. Section 7 provides a very limited, preliminary experiment on synthetic data.
2 Implicit Models

Well-posed rules
We consider the prediction rule (1) with input point u ∈ R p and predicted output vectorŷ(u) ∈ R q . The parameters of our model are contained in the matrices A ∈ R n×n , B ∈ R n×p , C ∈ R q×n , D ∈ R q×p . We can think of the vector x ∈ R n as a "state" corresponding to n "hidden" features that are extracted from the inputs. For notational simplicity only, our rule does not contain any bias terms; we can easily account for that by considering the vector (u, 1) instead of u, and increasing the column dimension of B by one.
The equation in (1) does not necessarily have a well-defined, unique solution x. In order to guarantee this, we assume that the n × n matrix A satisfies the following property.
Well-Posedness Property: A square, n × n matrix A is said to be well-posed for φ (in short, A ∈ WP(φ)) if, for any n-vector b, the equation x = φ(Ax + b) has a unique solution.
There are many classes of matrices that satisfy the well-posedness property, such as strictly upper (or, lower) triangular (SUT) matrices. In such a case, the state vector x can be obtained via backward (or, forward) substitution. Within the class of SUT matrices, a related important example comes from feedforward neural networks, as detailed in the next section. We provide other cases in section 3.
Note that the model is, by definition, "lifted" in the sense that, in general, the x-variable cannot be easily eliminated, precluding the use of unconstrained optimization, such as gradient descent, in the training problem. As shown next, feedforward neural networks are a special case, in which this explicit elimination can be done.

A special case: feedforward neural networks
Standard feedforward neural network prediction rules are a special case of our model, with (A, B) strictly upper block diagonal, where the number of blocks is equal to that of hidden layers. For example, consider the following prediction rule, with L > 1 layers: Here W l and φ l , l = 1, . . . , L, are given weight matrices and activation functions, respectively. We can express the above rule as (1), with x = (x L , . . . , x 1 ), and and with an appropriately defined activation function φ, defined as operating on a vector x = (x L , . . . , is easily solved via backward substitution, which corresponds to a simple forward pass through the network. Imposing further structure of the weighting matrices, such as Toeplitz (constant along diagonals) or Kronecker, and with an appropriate definition of the state vector, allows one to model multidimensional convolutional layers, pooling operations, etc. Recurrent neural networks are likewise covered by the proposed framework, by adding states corresponding to each recurrent element. It appears that the implicit model covers most of the known architectures. It also contains new, truly implicit architectures, as seen in Section 3.1.

Composing implicit models
Thanks to their compact representation, implicit models can be easily composed via matrix algebra. For example, given two models with matrix parameters (A i , B i , C i , D i ) and activation functions φ i , i = 1, 2, we can consider a "cascaded" prediction rule: The above rule can be represented as (1), with φ((z 1 , z 2 )) = (φ 1 (z 1 ), φ 2 (z 2 )) and As seen in the next section, the cascaded rule is well-posed if and only if each rule is.

Well-Posedness Property
We now focus in more detail on the the Well-Posedness Property, which enables the implicit rule to be well-defined.

Tractable sufficient conditions
Our first goal is to understand how we can constrain A to have the Well-Posedness Property, in a numerically tractable way.

Condition based on contraction mapping theorem and Picard iterations.
A sufficient condition is based on the contraction mapping theorem. We observe that the ReLU is non-expansive, that is: The above generalizes to vector inputs: and · α is the l α -norm (other norms, such as diagonally weighted l α -norms, are also possible). This means that for any pair x, x and vector b: We obtain that when A α,α < 1, the map x → φ(Ax + b) is a strict contraction with respect to the l α norm. Thus, Banach's contraction mapping theorem [12,Ch.3] applies, showing that the equation has a unique solution, which can be computed via the Picard iteration . . . The above converges exponentially at a linear rate, and each iteration is a matrix-vector product, hence the complexity is almost quadratic.
For α ∈ {1, 2, ∞}, the corresponding induced norms are easy to compute: where σ max refers to the largest singular value. The norm conditions are only sufficient, as seen next.
The above results can be extended to other activation functions, provided they are non-expansive. This covers sigmoid, leaky ReLU, tanh, max-pooling, normalization, etc.
Conditions involving the structure of A. As mentioned above, a sufficient condition for A to have the Well-Posedness Property is that it is strictly upper (or, strictly lower) triangular.
In the case when A = diag(a) is diagonal, with a ∈ R n , we can show that A is well-posed if and only if a < 1, in which case with denoting component-wise product. (This shows that the norm conditions seen in the previous section are only sufficient.) We can extend the above results and combine them with a triangular structure. In fact, as seen in the next section, if A is upper-triangular, then A ∈ WP(φ) if and only if diag(A) < 1. In that case, we can compute the solution to x = φ(Ax + b) by backwards elimination. Each variable requires us to solve a scalar problem, which can be done in closed form, as evidenced by the diagonal case seen above. The backward recursion writes Summary. Moving forward, we have found two kinds of tractable sufficient conditions for the Well-Posedness Property to hold: one is based on some triangular structure of A, and the other on norm bounds on A. Both of these two kinds result in convex constraints on A.

Well-posed matrices
In this section we examine some general properties of well-posed matrices.
Invariance. The well-posedness property is invariant under permutation and diagonal scaling: if A is well-posed then for any permutation matrix P , P AP is well-posed, and so is DAD −1 is, for any diagonal positive matrix D. We can use this property to refine the above sufficient conditions, and make them less conservative. Using the largest singular value norm condition for example, we obtain that A is well posed if there exist a diagonal positive-definite matrix S such that S − A SA is positive-definite.
Well-posed rank-one matrices. If A is rank-one: A = pq , with p, q ∈ R n , then for any diagonal S 0, the condition SAS −1 ∞,∞ < 1 reads Sp ∞ S −1 q 1 < 1. After some manipulations, this can be expressed as |p| |q| < 1, which is more accurate that the initial norm condition (here |x| denotes elementwise absolute-value of a vector x).
Composition. To some degree, the well-posedness property can be "composed".
Now assume that A 11 and A 22 are well-posed. Since A 11 is well-posed, the first equation has a unique solution x * 1 ; plugging x 1 = x * 1 into the second equation, and using the well-posedness of A 22 , we see that the second equation has a unique solution in x 2 , hence A is well-posed.
To prove the converse direction, assume that A is well-posed. The first equation above must have a unique solution x * 1 , irrespective to the choice of b 1 , hence A 11 must be well-posed. To prove that A 22 must be well-posed too, set b 1 = 0, b 2 arbitrary, leading to the system Since A 11 is well-posed, we must have x 1 = 0; the second equation then reads A similar result holds with the matrix where A 12 ∈ R n1×n2 is arbitrary. This result proves the fact stated above, that an upper-triangular matrix A ∈ WP(φ) if and only if diag(A) < 1.

Setup
We are now given an input data matrix U = [u 1 , . . . , u m ] ∈ R p×m and response matrix Y = [y 1 , . . . , y m ] ∈ R q×m , and seek to fit a model of the form (1), with A satisfying the Well-Posedness Property. We note that the rule (1), when applied to a collection of inputs (u i ) 1≤i≤m , can be written in compact form, asŶ We consider a training problem of the form In the above, L is a loss function which we assume is convex in its second argument, and P is a convex penalty function, which can be used to enforce a given (linear) structure (such as, A strictly upper block triangular) on the parameters, and/or encourage their sparsity. Our training problem involve two kinds of variables: the model variables (A, B, C, D); and the "state" variable X.
Examples of loss functions. A possible loss function is the squared Euclidean loss: for Y,Ŷ ∈ R q×m , Consider now the loss is a combination of negative cross-entropy with the soft-max, which is useful for multi-class classification: for two q-vectors y,ŷ, with y ≥ 0, y 1 = 1, we define We can extend the definition to matrices, by summing the contribution to all columns, each corresponding to a data point: for Y, Z ∈ R q×m , where both the log and the exponential functions apply component-wise.
Examples of penalty functions. Via an appropriate definition of P, we can make sure that A satisfies the Well-Posedness Property, either imposing an upper triangular structure for A, or via a norm constraint. To illustrate this point, consider the following penalty: Here, κ A is a given positive parameter, with κ A < 1 so that the norm constraint on A ensures it satisfies the Well-Posedness Property, as seen in section 3. Another choice encourages sparsity in a feedforward neural network where κ B , κ C are given positive parameters. Now well-posedness is ensured via a hard-coded structure constraint.
In the sequel we assume that the penalty function includes a constraint that enforces the wellposedness of the matrix A.
Fenchel divergence formulation. The above problem can be equivalently written where f φ is the so-called Fenchel divergence adapted to φ [10].
In the case of the ReLU activation, for two given matrices X, Z of the same size, we have and the component-wise multiplication. For the tanh activation function, we have where denotes component-wise division, and | · |, log, cosh are understood component-wise.
By construction, f φ is bi-convex, that is, convex with respect to anyone of its two matrix arguments when the other is fixed. Reference [10] lists a large number of popular activation functions that can be represented via Fenchel divergence functions.
Biconvex relaxation. We may relax the training problem into an unconstrained problem: where λ > 0 is a relaxation parameter. The objective function is convex in the model parameters (A, B, C, D) for fixed X, but not vice-versa. We can introduce another "proxy" state variable to obtain a bi-convex relaxation: with µ > 0 an additional parameter. This approach is closely related to the Moreau-Yosida regularization of the divergence function f φ , which is smooth [14].
The above model involves a single "dual" variable λ associated with the activation constraint. It may make sense to ascribe one different variable for each hidden state component. This leads to the "scaled" model where λ ∈ R n ++ is a positive vector. In our (limited) experiments on feedforward neural networks written as implicit models (as described previously in 2.2), we found for instance that setting λ = [ 1 n2 ; 2 1 n3 ; . . . , L 1 n L+1 ], where n l+1 is the number of rows of W l and with 1 a hyperparameter, was useful.

Bi-convex optimization approach
We can solve the relaxed training problem (10) by in block-coordinate descent (BCD) fashion, alternatively optimizing over state variables X, P and model parameters (A, B, C, D). Due to the bi-convexity of the training problem, each update corresponds to a convex problem. Note that the problem decomposes across the features (rows of (A, B)), provided the penalty P and loss L functions do. This is the case of the penalty involving the · ∞,∞ norm discussed in Section 3.1.
Conditional gradient methods [11] apply well here. The (A, B)-step requires solving The above is decomposable across rows of (A, B), and is again amenable to conditional gradient methods.
Updating state variables X and P . The X-step involves the optimization over X with P, A, B, C, D fixed: min The above problem is strictly convex, and has a unique solution.
The P -step involves the optimization over P with X, A, B, C, D fixed: The above problem is again strictly convex, and has a unique solution. In the case of the ReLU, with f φ given by (7), we can write the P -step as In the case of an Euclidean loss, the P -step can be solved in closed form.
Convergence result. The following theorem, taken from [21], states that the BCD algorithm converges globally to a stationary point of the objective function of (10). Theorem 2. If the loss function L(Y, .) is bounded below and differentiable, the penalty P function is closed and convex, and the Fenchel divergence f φ is differentiable then the BCD algorithm will converge globally to a stationary point of The proof of this theorem can be found in [21] (Theorem 2.8). The conditions of theorem 2 are met for instance for the ReLu activation function and losses such as the squared Euclidean or cross-entropy losses, and using penalty functions P such as those given in Section 4.1.
Updates via fixed point iterations. In some cases it may be preferable to not fully optimize the different variables. To this end, we may consider taking (projected) gradient steps, instead of running the optimization to optimum. Alternatively, we may use a limited number of fixed point iterations for updating the model and state variables.
Consider for example the case of the X-update. Necessary and sufficient optimality conditions for problem (14) can be written as a fixed-point equation: It can be shown that the above fixed-point equation can be solved via Picard iterations (see Section 3.1), provided A ∞,∞ ≤ µ/λ. Therefore, if we are using a penalty of the form (5), and if µ < λκ 2 A , then the Picard iteration corresponding to the above fixed-point equation can be safely used.

Input uncertainty model
We now assume that the input matrix is uncertain, and only known to belong to a given set U ⊆ R p×m . We further assume that each data point is affected independently of the others, so that U is the product of m sets of p-vectors. Specifically, by way of example, we consider the case when inputs are only known up to intervals: Here, the p-vector σ > 0 is a measure of component-wise uncertainty affecting each data point, and U corresponds to "nominal" inputs.

LP relaxation
Our starting point is the following "LP" relaxation to the training problem (3): where, from now on, we assume D = 0 for simplicity.

Robust counterpart
We first consider the robust counterpart to the LP relaxation model (17), with no penalty for simplicity: The above can be processed using the techniques of robust optimization [5].
Let us detail what happens in the simple case of interval uncertainty (16). For a given z ∈ R n , the condition ∀ δ, |δ| ≤ σ : z ≥ Bδ is equivalent to z ≥ |B|σ, with |B| the matrix containing the absolute values of those of B. Thus, the condition: X ≥ AX + BÛ + |B|σ1 . Note that the above condition is convex in (A, B) with fixed X, as before. The robust counterpart (18) to the relaxed training problem is thus

Affine recourse
In the previous approach, we require that the same state matrix X works for all the possible configurations of the input, which may be conservative. Clearly, in the implicit rule, the state depends (in a very complicated fashion) on the input, so it would make sense to optimize not over a fixed matrix X, but over a class of maps U → X(U ).
We do this by allowing the state matrix X to be an affine function of the uncertainty, which is referred to as "affine recourse" in the robust optimization literature [5]. Precisely, we set X(U ) = X + RU , with the "recourse matrix" R ∈ R n×p now a part of the model parameters. The robust counterpart with affine recourse writes ∀ U ∈ U, X + RU ≥ 0, X + RU ≥ A(X + RU ) + BU.
We recover the previous robust counterpart upon imposing R = 0 in the above.
In the simple case of interval uncertainty (16), the condition: . We can process the other condition similarly, leading to Processing the loss function can be done in similar fashion, but may be more complicated. Consider for example the case of the cross-entropy loss function (4). Focusing on one generic data point with input u, state x, and output y, the worst-case loss is where z := Cx + Dû, H := CR = [h 1 , . . . , h q ] ∈ R q×p . The above may be hard to compute, but we can work with a bound, based on evaluating the maximum above for each of the two terms independently. We obtain max u : |u−û|≤σ L(y, z + Hu) ≤ log( q i=1 e zi+σ |hi| ) − y z + σ |H y|.
Summing over data points we obtain the expression for our bound on the worst-case loss: Note that the loss encourages the matrix CR, which encodes what the output "sees" from the recourse, to be sparse.
Our robust training problem becomes min In order to solve the problem we alternate over model parameters (A, B, C) and state parameters (X, R). Each step is convex. Once we found (A, B, C), we simply apply the implicit prediction rule (1).

Sparsity and Architecture Optimization
In this section, we examine the role of sparsity in implicit deep learning, specifically in the model parameter matrix In our discussion, we will use the fact that the prediction rule (1) is invariant under permutation of the state vector, in the sense that, for any n×n permutation matrix, the matrix diag(P, I)M diag(P T , I) represents the same prediction rule as M given above.
Various kinds of sparsity of M can be encouraged in the training problem, with appropriate penalties. For example, we can use penalties that encourage many elements in M to be zero; the advantage of such "element-wise" sparsity is, of course, computational, since sparsity in matrices A, B, C, D will allow for computational speedups at test time. Another interesting kind of sparsity is rank sparsity, which refers to the case when model matrices are low-rank.
Next, we examine the benefits of row-(or, column-) sparsity, which refers to the fact that entire rows (or, columns) of a matrix are zero. Note that column sparsity in a matrix N can be encouraged with a penalty in the training problem, of the form where α > 1. Row sparsity can be handled via P(N T ).

Deep feature selection
We may use the implicit model to select features. Any zero column in the matrix (B T , D T ) T means that the corresponding element in an input vector does not play any role in the prediction rule. We may thus use a column-norm penalty in the training problem, in order to encourage such a sparsity pattern: with α > 1.

Dimension reduction via row-and column-sparsity
Row sparsity. Assume that the matrix A is row-sparse. Without loss of generality, using permutation invariance, we can assume that M writes where A 11 is square of order n 1 < n. We can then decompose x accordingly, as x = (x 1 , x 2 ) with x 1 ∈ R n1 , and the above implies x 2 = φ(B 2 u). The prediction rule for an input u ∈ R p then writeŝ The rule only involves x 1 as a true hidden feature vector. In fact, the row sparsity of A allows for a computational speedup, as we simply need to solve a fixed-point equation for the vector with reduced dimensions, x 1 .
Further assume that (A, B) is row-sparse. Again without loss of generality we may put M in the above form, with B 2 = 0. Then the prediction rule can be written This means that the dimension of the state variable can be fully reduced, to n 1 < n. Thus, row sparsity of (A, B) allows for a reduction in the dimension of the prediction rule.
Column-sparsity. Assume that the matrix A is column-sparse. Without loss of generality, using permutation invariance, we can assume that M writes M = where A 11 is square of order n 1 < n. We can then decompose x accordingly, as x = (x 1 , x 2 ) with x 1 ∈ R n1 . The above implies that the prediction rule for an input u ∈ R p writeŝ Thus, column-sparsity allows for a computational speedup, since x 2 can be directly expressed as closed-form function of x 1 . Now assume that (A T , C T ) T is column-sparse. Again without loss of generality we may put M in the above form, with C 2 = 0. We obtain that the prediction rule does not need x 2 at all, so that the computation of the latter vector can be entirely avoided. This means that the dimension of the state variable can be fully reduced, to n 1 < n. Thus, column sparsity of (A T , C T ) T allows for a reduction in the dimension of the prediction rule.
Summary. To summarize, row or column sparsity of A allows for a computational speedup; if the corresponding rows of B (resp. columns of C) are zero, then the prediction rule involves only a vector of reduced dimensions.

Rank sparsity
Assume that the matrix A is rank k n, and that a corresponding factorization is known: A = LR T , with L, R ∈ R n×k . In this case, for any n-vector b, the implicit equation x = φ(Ax + b) can be written as x = φ(Lz + b), where z = R T x. Hence, we can obtain a prediction for a given input u via the solution of a low-dimensional fixed-point equation in z ∈ R k : Once a solution z is found, we simply set the prediction to be y = Cφ(Lz + Bu) + Du.
At test time, if we use Picard iterations to obtain our predictions, then the computational savings brought about by the low-rank representation of A can be substantial, with a per-iteration cost going from O(n 2 ), to O(kn) if we use the above.
Encouraging rank sparsity can be done using an explicit low-rank representation of A, as A = LR T . In this case, the sub-problem of updating A (as in (13) is not jointly convex in L, R, but may be addressed by alternating over the factors L, R, similar to what is done in power iteration or generalized low-rank modelling schemes [19].

Architecture optimization
In this section, we consider the problem of designing the architecture of the network from scratch. The basic idea is to allow for a very large number of parameters, simply requiring that A be well-posed, then optimize in such a way that the resulting model parameter matrices are sparse, leading to a sparse network of connections. As noted above, such sparse architectures are very relevant in practice, as they allow for speedups in the prediction rule at test time.
Note that the sufficient well-posedness condition A ∞,∞ < 1, with the latter norm defined in (2), will naturally tend to encourage sparsity in the rows of matrix A. Our framework allows for many other types of sparsity-inducing penalties or constraints.
A similar approach can be made in terms of compressing the model parameters, precisely encouraging a low rank in A. As seen above, this in turn is useful for speeding up the prediction rule at test time. Our purpose in this experiment is to compare the proposed training BCD algorithm to backpropagation, in the context of a standard neural network. We thus consider a two-layer feedforward neural network that takes an input u ∈ R p and outputs the prediction where φ is the tanh activation function. For a given simulation, we start by drawing at random sparse weights W 3 , W 2 , W 1 . To do so we draw the components of the weights independently, for where s = 0.7 (that is, approximately 70% of the weights components are zero). We then create a synthetic training data set, by drawing at random m inputs u from a zero-mean normal distribution, with a covariance matrix set to Σ = M M − λI p , where M ∼ Uniform[−0.5, 0.5] p , and λ is the minimum eigenvalue of M M , and computing the corresponding outputŷ(u). We construct m = 10 3 datapoints for our training set. We do something similar to construct a test dataset, for which we draw an independent covariance matrix Σ using the same method aforementioned. The size of the test set is taken to be 500.
From the training data set, we then learn the weights using two different methods: the well-known RMSProp algorithm, and the BCD method based on the Fenchel divergence formulation described in 4.1, via the bi-convex optimization model (4.2): min W3,W2,W1,X2,X1 where f φ is the divergence function corresponding to the tanh activation, as given in (8). In our experiment, problem sizes are as follows: p = 20, q = 5, and W 1 ∈ R 12×20 , W 2 ∈ R 6×12 and W 3 ∈ R 5×6 , so that n = 12 + 6 = 18 is the total dimension of hidden layers.
We apply BCD in the following order: we first update W 3 , then X 2 , W 2 , X 1 and finally W 1 . The W 3 -update corresponds to linear regression, for which we have a closed form solution, for the W 1and W 2 -updates, we use gradient descent. For the X 2 -and X 1 -updates, we use Newton's method. We choose as a hyperparameter ε = 10 −3 , and run 5 iterations of the BCD.
We use Matlab for our simulation and we use the Deep Learning Toolbox TM to fit the weights of the neural network using RMSProp, for which we tuned the learning rate and fix the number of iterations to a maximum of 150 iterations (which corresponds to convergence). We run 50 trials using this method, each time we draw random weights and construct a corresponding dataset. This numerical experiment is made available on GitHub [18].
From figure 1,the BCD method displays similar performances to that of RMSProp. For the training set, the average RMSE gap between RMSProp and our method is 1.1 × 10 −2 with a standard deviation of 2.7 × 10 −2 . For the test set, the average RMSE gap between RMSProp and our method is −3 × 10 −3 with a standard deviation of 2.9 × 10 −2 . Moreover we can see that after only one iteration our method falls already very close to the performance of RMSProp. Therefore, from this synthetic experiment, it appears that the BCD method is competitive with respect to classical backpropagation algorithms. More experiments validating the BCD algorithm applied to (non-implicit) lifted models are given in [10] with experiments on MNIST and CIFAR-10.

Model recovery
We now illustrate some model recovery properties of the implicit framework. We generate a synthetic data set using a (truly implicit) network with ReLU activation, involving a random n × n matrix A that satisfies A ∞,∞ ≤ 0.5, and a n × p matrix B that is column-sparse. We set n = 20, p = 100, q = 1, and the number of data points to m = 400. Next we solved the training problem with n = 10, reflecting the fact that we may not know the hidden dimension of the true model.
We have run the BCD method with an Euclidean loss, using the fixed-point iterations for the updates of matrices X and P , and used a penalty such as (19) with a penalty parameter of 5, and α = 2.
As shown in Figure 2, after training, the algorithm recovers the same column sparsity as the "generative" model, even though the hidden fature vector dimension used in the training model is far off the true value.