Designing Universally-Approximating Deep Neural Networks: A First-Order Optimization Approach
Pages 6231–6246
Abstract
Universal approximation capability, also referred to as universality, is an important property of deep neural networks: it endows them with the capacity to accurately represent the underlying target function in learning tasks. In practice, the architecture of a deep neural network largely determines the performance of the model. However, most existing methodologies for designing neural architectures, such as heuristic manual design or neural architecture search, ignore the universal approximation property and thus forgo a potential safeguard on performance. In this paper, we propose a unified framework, based on first-order optimization algorithms, for designing deep neural network architectures with a universality guarantee, in which the forward pass is interpreted as the update steps of an optimization algorithm. The (explicit or implicit) network is designed by replacing each gradient term in the algorithm with a learnable module resembling a two-layer network, or a derivative thereof. Specifically, we study width-bounded neural networks, a common practical setting, and establish their universality. Moreover, adding normalization, downsampling, and upsampling operations does not hurt universality. To the best of our knowledge, this is the first work in which width-bounded networks with a universal approximation guarantee can be designed in a principled way. Our framework can inspire a variety of neural architectures, including renowned structures such as ResNet and DenseNet as well as novel designs. Experimental results on image classification problems demonstrate that the newly inspired networks are competitive and surpass the ResNet and DenseNet baselines as well as the advanced ConvNeXt and ViT, testifying to the effectiveness of our framework.
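To make the design recipe concrete, below is a minimal sketch of how one first-order update rule can be turned into a network, assuming PyTorch; the names LearnableGradient and GDInspiredNet and all hyperparameters are illustrative placeholders, not the paper's implementation. Replacing the gradient term of plain gradient descent, x_{k+1} = x_k - gamma * grad F(x_k), with a learnable two-layer module yields a residual (ResNet-style) block, and unrolling K updates gives a K-layer network.

    # Illustrative sketch only (assumed PyTorch); not the authors' code.
    import torch
    import torch.nn as nn

    class LearnableGradient(nn.Module):
        # Two-layer module that stands in for the gradient term grad F(x).
        def __init__(self, dim, hidden):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, x):
            return self.net(x)

    class GDInspiredNet(nn.Module):
        # Unrolls K gradient-descent updates x_{k+1} = x_k - step * T_k(x_k);
        # each update becomes one residual layer of the network.
        def __init__(self, dim, hidden, num_steps, step=1.0):
            super().__init__()
            self.blocks = nn.ModuleList(
                LearnableGradient(dim, hidden) for _ in range(num_steps)
            )
            self.step = step

        def forward(self, x):
            for block in self.blocks:
                x = x - self.step * block(x)  # one optimization update = one layer
            return x

    # Example: a width-bounded network acting on 64-dimensional features.
    net = GDInspiredNet(dim=64, hidden=64, num_steps=8)
    y = net(torch.randn(16, 64))

Other first-order schemes, for instance momentum methods or multi-step updates that reuse several past iterates, would analogously introduce additional skip connections, which is how the framework can recover ResNet- and DenseNet-like structures; implicit (fixed-point) algorithms lead to equilibrium-model counterparts.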