Abstract

We study a deep linear network endowed with the following structure: a matrix $X$ is obtained by multiplying $K$ matrices, called factors, each corresponding to the action of a layer. The action of each layer (i.e., each factor) is obtained by applying a fixed linear operator to a vector of parameters satisfying a constraint; the number of layers is arbitrary. Assuming that $X$ is given and the factors have been estimated, the error between the product of the estimated factors and $X$ (i.e., the reconstruction error) is either the statistical or the empirical risk. We provide necessary and sufficient conditions on the network topology under which a stability property holds: the error on the parameters defining the near-optimal factors scales linearly with the reconstruction error (i.e., the risk). Under these conditions on the network topology, any successful learning task therefore leads to stably defined features that can be interpreted. To establish this, we first evaluate how the Segre embedding and its inverse distort distances. We then show that any deep structured linear network can be cast as a generic multilinear problem that uses the Segre embedding; this is the tensorial lifting. Using the tensorial lifting, we give a necessary and sufficient condition for the identifiability of the factors, up to a scale rearrangement. We finally give a necessary and sufficient condition, called the deep-Null Space Property (by analogy with the usual Null Space Property of the compressed sensing framework), which guarantees that the stability property holds. We illustrate the theory with a practical example in which the deep structured linear network is a convolutional linear network. We obtain a condition on the scattering of the supports which is strong but not vacuous, and a simple test on the network topology can be implemented to check whether it holds.
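
For concreteness, the following is a minimal numerical sketch (in Python/NumPy, not taken from the paper) of the structure described above: each factor is the image of a parameter vector under a fixed linear operator, the reconstruction error is measured on the product of the factors, and the tensorial lifting replaces the tuple of parameter vectors by their Segre (outer-product) embedding. All names, dimensions, and the random choice of operators are illustrative assumptions.

    # Minimal illustrative sketch (not the authors' code) of the structure in the
    # abstract: X is the product of K factors, each factor is the image of a small
    # parameter vector under a fixed linear operator, and the tensorial lifting
    # replaces the tuple of parameter vectors by their Segre (outer-product) embedding.
    # All names, sizes, and the random operators are assumptions made for illustration.
    import numpy as np

    rng = np.random.default_rng(0)

    K = 3   # number of layers (factors); the theory places no limit on K
    n = 4   # each factor is an n x n matrix
    S = 5   # number of parameters per layer

    # Fixed linear operators A_k : R^S -> R^(n*n). In the paper's convolutional
    # example they would instead place the entries of h_k on prescribed supports.
    operators = [rng.standard_normal((n * n, S)) for _ in range(K)]

    def factor(A, h):
        """Factor of one layer: apply the fixed operator A to the parameters h."""
        return (A @ h).reshape(n, n)

    # Ground-truth parameters and the matrix X they generate.
    h_true = [rng.standard_normal(S) for _ in range(K)]
    X = np.linalg.multi_dot([factor(A, h) for A, h in zip(operators, h_true)])

    # Reconstruction error (the risk) for perturbed, "near-optimal" parameters; the
    # stability property asks that the parameter error scale linearly with it.
    h_est = [h + 1e-3 * rng.standard_normal(S) for h in h_true]
    X_est = np.linalg.multi_dot([factor(A, h) for A, h in zip(operators, h_est)])
    reconstruction_error = np.linalg.norm(X_est - X, "fro")

    def segre(hs):
        """Segre embedding: (h_1, ..., h_K) -> h_1 (x) ... (x) h_K, a rank-one tensor."""
        t = hs[0]
        for h in hs[1:]:
            t = np.tensordot(t, h, axes=0)  # outer product
        return t

    # X depends linearly on this lifted rank-one tensor, which is how the
    # factorization is cast as a generic multilinear problem.
    lifted = segre(h_true)  # shape (S, S, S)
    print(reconstruction_error, lifted.shape)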

Keywords

  1. interpretable learning
  2. stable recovery
  3. matrix factorization
  4. deep linear networks
  5. convolutional networks

MSC codes

  1. 68T05
  2. 90C99
  3. 15-02

Supplementary Material

PLEASE NOTE: These supplementary files have not been peer-reviewed.

Index of Supplementary Materials

Title of paper: Multilinear compressive sensing and an application to convolutional linear networks

Authors: Francois Malgouyres and Joseph Landsberg

File: M119834_01.pdf

Type: PDF

Contents: Additional Proofs

Information & Authors

Information

Published In

cover image SIAM Journal on Mathematics of Data Science
SIAM Journal on Mathematics of Data Science
Pages: 446 - 475
ISSN (online): 2577-0187

History

Submitted: 5 July 2018
Accepted: 16 May 2019
Published online: 21 August 2019

Funding Information

National Science Foundation (https://doi.org/10.13039/100000001): DMS-1405348, AF1814254

Metrics & Citations

Metrics

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited By

View Options

View options

PDF

View PDF

Figures

Tables

Media

Share

Share

Copy the content Link

Share with email

Email a colleague

Share on social media