Abstract
To advance deep learning methodologies in the next decade, a theoretical framework for reasoning about modern neural networks is needed. While efforts are increasing toward demystifying why deep learning is so effective, a comprehensive picture remains lacking, suggesting that a better theory is possible. We argue that a future deep learning theory should inherit three characteristics: a hierarchically structured network architecture, parameters iteratively optimized using stochastic gradient-based methods, and information from the data that evolves compressively. As an instantiation, we integrate these characteristics into a graphical model called neurashed. This model effectively explains some common empirical patterns in deep learning. In particular, neurashed enables insights into implicit regularization, the information bottleneck, and local elasticity. Finally, we discuss how neurashed can guide the development of deep learning theories.
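For concreteness, the stochastic gradient-based optimization referred to above can be written in its standard mini-batch form (the notation below is ours and illustrates the generic update rule, not a construct specific to neurashed):

$$\theta_{t+1} \;=\; \theta_t \;-\; \frac{\eta_t}{|B_t|} \sum_{i \in B_t} \nabla_{\theta}\, \ell\big(f_{\theta_t}(x_i),\, y_i\big),$$

where $\theta_t$ denotes the network parameters at iteration $t$, $\eta_t$ the learning rate, $B_t$ a randomly sampled mini-batch of training examples, and $\ell$ the per-example loss of the network $f_{\theta}$.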
Acknowledgements
This work was supported in part by an Alfred Sloan Research Fellowship and the Wharton Dean’s Research Fund. We would like to thank Patrick CHAO, Zhun DENG, Cong FANG, Hangfeng HE, Qingxuan JIANG, Konrad KORDING, Yi MA, and Jiayao ZHANG for their helpful discussions and comments. We are grateful to two anonymous reviewers for their constructive comments that helped improve the presentation of the paper.
Cite this article
Su, W.J. Envisioning future deep learning theories: some basic concepts and characteristics. Sci. China Inf. Sci. 67, 203101 (2024). https://doi.org/10.1007/s11432-023-4129-1