
Newton Methods for Convolutional Neural Networks

Published: 25 January 2020

Abstract

Deep learning involves a difficult non-convex optimization problem, which is often solved by stochastic gradient (SG) methods. While SG is usually effective, it may not be robust in some situations. Recently, Newton methods have been investigated as an alternative optimization technique, but most existing studies consider only fully connected feedforward neural networks and do not cover more widely used architectures such as convolutional neural networks (CNNs). One reason is that Newton methods for CNNs involve complicated operations, and no prior work has investigated them thoroughly. In this work, we give details of all building blocks, including the evaluation of the function, gradient, Jacobian, and Gauss-Newton matrix-vector products. These basic components are important not only for practical implementation but also for developing variants of Newton methods for CNNs. We show that an efficient MATLAB implementation takes just several hundred lines of code. Preliminary experiments indicate that Newton methods are less sensitive to parameters than the stochastic gradient approach.
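A central operation the abstract mentions is the Gauss-Newton matrix-vector product Gv = J^T B J v, which is computed matrix-free so that G is never formed explicitly. The sketch below (in Python, not the paper's MATLAB code) illustrates the idea on a deliberately tiny linear layer z = Wx with squared loss, for which B = I; the model, dimensions, and finite-difference Jacobian are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Tiny model: z = W @ x, squared loss L = 0.5 * ||z - y||^2.
# Gauss-Newton matrix: G = J^T B J, where J = dz/dvec(W) and
# B = d^2 L / dz^2, which is the identity for squared loss.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def forward(wvec):
    # Network output as a function of the flattened parameters.
    return wvec.reshape(m, n) @ x

# Jacobian of the output w.r.t. the flattened parameters, built
# column by column with finite differences (fine at this toy scale).
eps = 1e-6
w0 = W.ravel()
z0 = forward(w0)
J = np.empty((m, m * n))
for j in range(m * n):
    e = np.zeros(m * n)
    e[j] = eps
    J[:, j] = (forward(w0 + e) - z0) / eps

# Matrix-free Gauss-Newton product: Gv = J^T (B (J v)), with B = I here.
v = rng.standard_normal(m * n)
Gv = J.T @ (J @ v)

# For z = W x with row-major vec(W), the exact Jacobian is
# J = kron(I_m, x^T), so G = kron(I_m, x x^T); check against it.
G_exact = np.kron(np.eye(m), np.outer(x, x))
print(np.allclose(Gv, G_exact @ v, atol=1e-4))  # True
```

In Newton-CG-style methods, such products are fed to an iterative solver like conjugate gradient, so the (often huge) Gauss-Newton matrix never needs to be stored; for a real CNN the Jacobian-vector and transposed-Jacobian-vector products would of course be computed by forward and backward passes rather than finite differences.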



Published In

ACM Transactions on Intelligent Systems and Technology, Volume 11, Issue 2 (Survey Paper and Regular Paper), April 2020, 274 pages.
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3379210

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 January 2020
Accepted: 01 October 2019
Revised: 01 September 2019
Received: 01 July 2019
Published in TIST Volume 11, Issue 2

Author Tags

  1. Convolutional neural networks
  2. large-scale classification
  3. Newton methods
  4. subsampled Hessian


Funding Sources

  • MOST of Taiwan

