Designing Universally-Approximating Deep Neural Networks: A First-Order Optimization Approach

Published: 25 March 2024

Abstract

Universal approximation capability, also referred to as universality, is an important property of deep neural networks: it endows them with the capacity to accurately represent the underlying target function of a learning task. In practice, the architecture of a deep neural network largely determines the performance of the model. However, most existing methodologies for designing neural architectures, such as heuristic manual design or neural architecture search, ignore the universal approximation property and thus forgo a potential safeguard on performance. In this paper, we propose a unified framework, based on first-order optimization algorithms, for designing deep neural network architectures with a universality guarantee: the forward pass is interpreted as the update steps of an optimization algorithm, and the (explicit or implicit) network is obtained by replacing each gradient term in the algorithm with a learnable module resembling a two-layer network, or a variant thereof. In particular, we study width-bounded neural networks, a common practical setting, and show that they remain universal. Moreover, adding normalization, downsampling, and upsampling operations does not hurt universality. To the best of our knowledge, this is the first work in which width-bounded networks with a universal approximation guarantee can be designed in a principled way. Our framework recovers a variety of neural architectures, including renowned structures such as ResNet and DenseNet, and also inspires novel ones. Experimental results on image classification demonstrate that the newly inspired networks are competitive, surpassing the ResNet and DenseNet baselines as well as the more advanced ConvNeXt and ViT, which testifies to the effectiveness of our framework.
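As a rough illustration of the design recipe sketched in the abstract (not the authors' exact construction): unrolling plain gradient descent x_{k+1} = x_k - η∇f(x_k) and replacing the gradient term with a learnable two-layer module T_k recovers a ResNet-style residual update x_{k+1} = x_k - T_k(x_k); other base algorithms (e.g., heavy-ball momentum) would yield other skip-connection patterns. The minimal NumPy sketch below is purely hypothetical: all names and sizes are illustrative, and a practical instantiation would use trained (e.g., convolutional) modules rather than random weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_module(x, W1, b1, W2, b2):
    # Learnable stand-in for the gradient term of the base algorithm:
    # T(x) = W2 @ relu(W1 @ x + b1) + b2
    return W2 @ relu(W1 @ x + b1) + b2

def gd_inspired_forward(x, params):
    # Unrolled gradient descent x_{k+1} = x_k - T_k(x_k):
    # each iteration becomes a residual (ResNet-style) block.
    for W1, b1, W2, b2 in params:
        x = x - two_layer_module(x, W1, b1, W2, b2)
    return x

# Toy usage with hypothetical sizes: 4 unrolled steps, fixed width.
rng = np.random.default_rng(0)
d, hidden, depth = 8, 8, 4
params = [
    (rng.normal(scale=0.1, size=(hidden, d)), np.zeros(hidden),
     rng.normal(scale=0.1, size=(d, hidden)), np.zeros(d))
    for _ in range(depth)
]
out = gd_inspired_forward(rng.normal(size=d), params)
print(out.shape)  # (8,)
```

In this sketch the hidden width equals the input dimension, so every intermediate activation stays at a fixed width, loosely mirroring the width-bounded regime discussed in the abstract.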



Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 46, Issue 9, Sept. 2024, 621 pages

Publisher

IEEE Computer Society, United States


Qualifiers

  • Research-article
