Abstract
To advance deep learning methodologies in the next decade, a theoretical framework for reasoning about modern neural networks is needed. While efforts are increasing toward demystifying why deep learning is so effective, a comprehensive picture remains lacking, suggesting that a better theory is possible. We argue that a future deep learning theory should inherit three characteristics: a hierarchically structured network architecture, parameters iteratively optimized using stochastic gradient-based methods, and information from the data that evolves compressively. As an instantiation, we integrate these characteristics into a graphical model called neurashed. This model effectively explains some common empirical patterns in deep learning. In particular, neurashed enables insights into implicit regularization, the information bottleneck, and local elasticity. Finally, we discuss how neurashed can guide the development of deep learning theories.
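For concreteness, the stochastic gradient-based optimization referred to above can be written in its standard mini-batch form (the notation below is ours and illustrates the generic update rule, not a construct specific to neurashed):

$$\theta_{t+1} \;=\; \theta_t \;-\; \frac{\eta_t}{|B_t|} \sum_{i \in B_t} \nabla_{\theta}\, \ell\big(f_{\theta_t}(x_i),\, y_i\big),$$

where $\theta_t$ denotes the network parameters at iteration $t$, $\eta_t$ the learning rate, $B_t$ a randomly sampled mini-batch of training examples, and $\ell$ the per-example loss of the network $f_{\theta}$.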
Acknowledgements
This work was supported in part by an Alfred Sloan Research Fellowship and the Wharton Dean’s Research Fund. We would like to thank Patrick CHAO, Zhun DENG, Cong FANG, Hangfeng HE, Qingxuan JIANG, Konrad KORDING, Yi MA, and Jiayao ZHANG for their helpful discussions and comments. We are grateful to two anonymous reviewers for their constructive comments that helped improve the presentation of the paper.
Cite this article
Su, W.J. Envisioning future deep learning theories: some basic concepts and characteristics. Sci. China Inf. Sci. 67, 203101 (2024). https://doi.org/10.1007/s11432-023-4129-1