Abstract
The paper reviews and extends an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, although weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.
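As a concrete illustration of the gap the theorems describe, the short Python sketch below tabulates the unit counts suggested by the review's approximation bounds for a function of n variables with constituent smoothness m: on the order of eps^(-n/m) units for a shallow network, versus on the order of (n-1)*eps^(-2/m) for a deep network matched to a binary-tree compositional structure. The sketch is not part of the paper; constants are omitted and the chosen values of n, m and eps are arbitrary, so the numbers convey only the scaling.

# Back-of-the-envelope comparison of the unit counts in the review's bounds.
# Constants are omitted; n = input dimension, m = smoothness of the
# constituent functions, eps = target approximation accuracy.

def shallow_units(n, m, eps):
    # generic n-variable function of smoothness m: ~ eps^(-n/m) units
    return eps ** (-n / m)

def deep_binary_tree_units(n, m, eps):
    # compositional function with binary-tree structure and bivariate
    # constituents of smoothness m: ~ (n - 1) * eps^(-2/m) units
    return (n - 1) * eps ** (-2 / m)

if __name__ == "__main__":
    m, eps = 2, 0.1
    for n in (4, 8, 16, 32):
        print(f"n={n:3d}  shallow ~ {shallow_units(n, m, eps):.2e}  "
              f"deep ~ {deep_binary_tree_units(n, m, eps):.2e}")

The exponential dependence on n appears only in the shallow count, which is the sense in which depth, rather than weight sharing, is the source of the advantage for compositional functions.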
References
F. Anselmi, L. Rosasco, C. Tan, T. Poggio. Deep Convolutional Networks are Hierarchical Kernel Machines, Center for Brains, Minds and Machines (CBMM) Memo No. 035, The Center for Brains, Minds and Machines, USA, 2015.
T. Poggio, L. Rosasco, A. Shashua, N. Cohen, F. Anselmi. Notes on Hierarchical Splines, DCLNs and i-theory, Center for Brains, Minds and Machines (CBMM) Memo No. 037, The Center for Brains, Minds and Machines, USA, 2015.
T. Poggio, F. Anselmi, L. Rosasco. I-theory on Depth vs Width: Hierarchical Function Composition, Center for Brains, Minds and Machines (CBMM) Memo No. 041, The Center for Brains, Minds and Machines, USA, 2015.
H. Mhaskar, Q. L. Liao, T. Poggio. Learning Real and Boolean Functions: When is Deep Better than Shallow, Center for Brains, Minds and Machines (CBMM) Memo No. 045, The Center for Brains, Minds and Machines, USA, 2016.
H. N. Mhaskar, T. Poggio. Deep Vs. Shallow Networks: An Approximation Theory Perspective, Center for Brains, Minds and Machines (CBMM) Memo No. 054, The Center for Brains, Minds and Machines, USA, 2016.
D. L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, pp. 178–183, 2000.
Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, vol. 521, no. 7553, pp. 436–444, 2015.
K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
M. Riesenhuber, T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, vol. 2, no. 11, pp. 1019–1025, 1999.
H. N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, vol. 1, no. 1, pp. 61–80, 1993.
C. K. Chui, X. Li, H. Mhaskar. Neural networks for localized approximation. Mathematics of Computation, vol. 63, no. 208, pp. 607–623, 1994.
C. K. Chui, X. Li, H. N. Mhaskar. Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics, vol. 5, no. 1, pp. 233–243, 1996.
A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, vol. 8, pp. 143–195, 1999.
T. Poggio, S. Smale. The mathematics of learning: Dealing with data. Notices of the American Mathematical Society, vol. 50, no. 5, pp. 537–544, 2003.
B. Moore, T. Poggio. Representation properties of multilayer feedforward networks. Neural Networks, vol. 1, no. S1, p. 203, 1988.
R. Livni, S. Shalev-Shwartz, O. Shamir. A provably efficient algorithm for training deep networks. arXiv:1304.7045, 2013.
O. Delalleau, Y. Bengio. Shallow vs. deep sum-product networks. In Proceedings of Advances in Neural Information Processing Systems 24, NIPS, Granada, Spain, pp. 666–674, 2011.
G. F. Montufar, R. Pascanu, K. Cho, Y. Bengio. On the number of linear regions of deep neural networks. In Proceedings of Advances in Neural Information Processing Systems 27, NIPS, Montreal, Canada, pp. 2924–2932, 2014.
H. N. Mhaskar. Neural networks for localized approximation of real functions. In Proceedings of the IEEE-SP Workshop on Neural Networks for Signal Processing III, pp. 190–196, IEEE, Linthicum Heights, USA, 1993.
N. Cohen, O. Sharir, A. Shashua. On the expressive power of deep learning: A tensor analysis. arXiv:1509.05009v1, 2015.
F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio. Unsupervised Learning of Invariant Representations With Low Sample Complexity: The Magic of Sensory Cortex or A New Framework for Machine Learning? Center for Brains, Minds and Machines (CBMM) Memo No. 001, The Center for Brains, Minds and Machines, USA, 2014.
F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio. Unsupervised learning of invariant representations. Theoretical Computer Science, vol. 633, pp. 112–121, 2016.
T. Poggio, L. Rosasco, A. Shashua, N. Cohen, F. Anselmi. Notes on Hierarchical Splines, DCLNs and i-theory, Center for Brains, Minds and Machines (CBMM) Memo No. 037, The Center for Brains, Minds and Machines, USA, 2015.
Q. L. Liao, T. Poggio. Bridging the Gaps between Residual Learning, Recurrent Neural Networks and Visual Cortex, Center for Brains, Minds and Machines (CBMM) Memo No. 047, The Center for Brains, Minds and Machines, 2016.
M. Telgarsky. Representation benefits of deep feedforward networks. arXiv:1509.08101v2, 2015.
I. Safran, O. Shamir. Depth separation in ReLU networks for approximating smooth non-linear functions. arXiv:1610.09887v1, 2016.
H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, vol. 8, no. 1, pp. 164–177, 1996.
E. Corominas, F. S. Balaguer. Conditions for an infinitely differentiable function to be a polynomial. Revista Matemática Hispanoamericana, vol. 14, no. 1–2, pp. 26–43, 1954. (in Spanish)
T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, Q. L. Liao. Why and when can deep–but not shallow–networks avoid the curse of dimensionality: A review. arXiv:1611.00740v3, 2016.
R. A. DeVore, R. Howard, C. A. Micchelli. Optimal nonlinear approximation. Manuscripta Mathematica, vol. 63, no. 4, pp. 469–478, 1989.
H. N. Mhaskar. On the tractability of multivariate integration and approximation by neural networks. Journal of Complexity, vol. 20, no. 4, pp. 561–590, 2004.
F. Bach. Breaking the curse of dimensionality with convex neural networks. arXiv:1412.8690, 2014.
D. Kingma, J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
J. Bergstra, Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. F. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Q. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Q. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
R. Eldan, O. Shamir. The power of depth for feedforward neural networks. arXiv:1512.03965v4, 2016.
H. W. Lin, M. Tegmark. Why does deep and cheap learning work so well? arXiv:1608.08225, 2016.
J. T. Håstad. Computational Limitations of Small-Depth Circuits, Cambridge, MA, USA: MIT Press, 1987.
N. Linial, Y. Mansour, N. Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM, vol. 40, no. 3, pp. 607–620, 1993.
Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, L. Bottou, O. Chapelle, D. DeCoste, J. Weston, Eds., Cambridge, MA, USA: MIT Press, 2007.
Y. Mansour. Learning Boolean functions via the Fourier transform. Theoretical Advances in Neural Computation and Learning, V. Roychowdhury, K. Y. Siu, A. Orlitsky, Eds., pp. 391–424, US: Springer, 1994.
M. Anthony, P. Bartlett. Neural Network Learning: Theoretical Foundations, Cambridge, UK: Cambridge University Press, 2002.
B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
A. Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, vol. 7, no. 1, pp. 117–139, 2006.
S. Soatto. Steps towards a theory of visual information: Active perception, signal-to-symbol conversion and the interplay between sensing and control. arXiv:1110.2053, 2011.
T. A. Poggio, F. Anselmi. Visual Cortex and Deep Networks: Learning Invariant Representations, Cambridge, MA, USA: MIT Press, 2016.
L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 4, pp. 2029–2054, 2010.
S. Shalev-Shwartz, S. Ben-David. Understanding Machine Learning: From Theory to Algorithms, Cambridge, UK: Cambridge University Press, 2014.
T. Poggio, W. Reichardt. On the representation of multi-input systems: Computational properties of polynomial algorithms. Biological Cybernetics, vol. 37, no. 3, pp. 167–186, 1980.
M. L. Minsky, S. A. Papert. Perceptrons: An Introduction to Computational Geometry, Cambridge, MA, USA: The MIT Press, 1972.
Acknowledgement
The authors thank O. Shamir for useful emails that prompted us to clarify our results in the context of lower bounds.
Author information
Additional information
This work was supported by the Center for Brains, Minds and Machines (CBMM), NSF STC award CCF (No. 1231216), and ARO (No. W911NF-15-1-0385).
Recommended by Associate Editor Hong Qiao
Tomaso Poggio received the Ph.D. degree in theoretical physics from the University of Genoa, Italy in 1971. He is the Eugene McDermott Professor in the Department of Brain and Cognitive Sciences, the director of the Center for Brains, Minds and Machines, and a member of the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), USA. Since 2000, he has been a member of the faculty of the McGovern Institute for Brain Research. He was a Wissenschaftlicher Assistent at the Max Planck Institut für Biologische Kybernetik, Tübingen, Germany from 1972 until 1981, when he became an associate professor at MIT. He is an honorary member of the Neuroscience Research Program, a member of the American Academy of Arts and Sciences and a Founding Fellow of AAAI. He has received several awards, including the Otto-Hahn-Medaille of the Max-Planck-Society, the Max Planck Research Award (with M. Fahle) from the Alexander von Humboldt Foundation, the MIT 50K Entrepreneurship Competition Award, the Laurea Honoris Causa from the University of Pavia in 2000 (Volta Bicentennial), the 2003 Gabor Award, the 2009 Okawa Prize, the American Association for the Advancement of Science (AAAS) Fellowship (2009) and the Swartz Prize for Theoretical and Computational Neuroscience in 2014. He is one of the most cited computational neuroscientists, with an h-index greater than 100 based on Google Scholar.
Hrushikesh Mhaskar did his undergraduate studies at the Institute of Science, Nagpur, and received his first M.Sc. degree in mathematics from the Indian Institute of Technology, India in 1976. He received the Ph.D. degree in mathematics and M.Sc. degree in computer science from the Ohio State University, USA in 1980. He then joined California State University, Los Angeles, where he was promoted to full professor in 1990. Since retiring in 2012, he has been a visiting associate at the California Institute of Technology and a research professor at Claremont Graduate University, and has occasionally served as a consultant for Qualcomm. He has published more than 135 refereed articles in the areas of approximation theory, potential theory, neural networks, wavelet analysis, and data processing. His book Weighted Polynomial Approximation was published in 1997 by World Scientific, and his book with Dr. D. V. Pai, Fundamentals of Approximation Theory, was published by Narosa Publishers, CRC, and Alpha Science in 2000. He serves on the editorial boards of Journal of Approximation Theory, Applied and Computational Harmonic Analysis, and Jaen Journal on Approximation. In addition, he was a co-editor of a special issue of Advances in Computational Mathematics on mathematical aspects of neural networks, two volumes of Journal of Approximation Theory dedicated to the memory of G. G. Lorentz, and two edited collections of research articles: Wavelet Analysis and Applications (Narosa Publishers, 2001) and Frontiers in Interpolation and Approximation (Chapman and Hall/CRC, 2006). He has held visiting positions and given several invited lectures throughout North America, Europe, and Asia. He was awarded the Humboldt Fellowship for research in Germany four times, and was John von Neumann Distinguished Professor at the Technical University of Munich in 2011. He is listed in Outstanding Young Men of America (1985) and Who's Who in America's Teachers (1994). His research has been supported by the National Science Foundation, the U.S. Army Research Office, the Air Force Office of Scientific Research, the National Security Agency, and the Research and Development Laboratories.
Lorenzo Rosasco received the Ph.D. degree from the University of Genova, Italy in 2006, where he worked under the supervision of Alessandro Verri and Ernesto De Vito in the Statistical Learning and Image Processing Research Unit (SLIPGURU). He is an assistant professor at the University of Genova, Italy. He is also affiliated with the Massachusetts Institute of Technology (MIT), USA, where he is a visiting professor, and with the Istituto Italiano di Tecnologia (IIT), Italy, where he is an external collaborator. He is leading the effort to establish the Laboratory for Computational and Statistical Learning (LCSL), born from a collaborative agreement between IIT and MIT. During his Ph.D. studies, he was a visiting student at the Toyota Technological Institute at Chicago, USA (working with Steve Smale) and at the Center for Biological and Computational Learning (CBCL) at MIT (working with Tomaso Poggio). Between 2006 and 2009, he was a postdoctoral fellow at CBCL working with Tomaso Poggio.
His research interests include theory and algorithms for machine learning. He has developed and analyzed methods to learn from small as well as large samples of high dimensional data, using analytical and probabilistic tools, within a multidisciplinary approach drawing concepts and techniques primarily from computer science but also from statistics, engineering and applied mathematics.
Brando Miranda received the B.Sc. degree in electrical engineering and computer science (EECS) and the M.Eng. degree in machine learning (supervised by Professor Tomaso Poggio) from the Massachusetts Institute of Technology (MIT), USA in 2014 and 2016, respectively.
His research interests include machine learning, statistics, neural networks, theories in deep learning and applied mathematics.
Qianli Liao is a Ph.D. candidate in electrical engineering and computer science (EECS) at the Massachusetts Institute of Technology (MIT), USA, supervised by Professor Tomaso Poggio.
His research interests include machine learning, optimization, neural networks, theoretical deep learning, computer vision, visual object/face recognition, biologically-plausible and brain-inspired learning.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Poggio, T., Mhaskar, H., Rosasco, L. et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Int. J. Autom. Comput. 14, 503–519 (2017). https://doi.org/10.1007/s11633-017-1054-2