Abstract
The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most “classical” second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
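The impracticality claim rests on a simple scaling argument: for a network with n weights, a full Newton step needs the n × n Hessian (O(n²) memory) and a linear solve (O(n³) time), while a plain gradient step costs only O(n). The sketch below is illustrative only and not taken from the chapter; the quadratic stand-in loss, the problem size, and the damping constant mu are all assumptions. A diagonal Hessian approximation, in the spirit of the cheap second-order methods the chapter advocates, keeps the per-step cost at O(n).

import numpy as np

# Illustrative sketch (assumptions, not the chapter's code): compare the cost
# of first-order, diagonal second-order, and full Newton updates on a
# quadratic stand-in loss L(w) = 0.5 * w^T H w, whose gradient is H w.
rng = np.random.default_rng(0)
n = 1000                          # number of weights; real nets have millions
w = rng.normal(size=n)
A = rng.normal(size=(n, n))
H = A @ A.T / n + np.eye(n)       # positive-definite stand-in Hessian
g = H @ w                         # gradient of the quadratic loss at w

eta, mu = 0.01, 0.1               # assumed learning rate and damping constant
w_sgd = w - eta * g                            # first-order step: O(n)
w_diag = w - (eta / (np.diag(H) + mu)) * g     # per-weight rates: still O(n)
w_newton = w - np.linalg.solve(H, g)           # Newton: O(n^2) memory, O(n^3) solve

print(np.linalg.norm(w_newton))   # ~0: Newton solves the quadratic in one step

For n in the millions, storing the Hessian alone would take terabytes of memory, which is why structured approximations such as the diagonal one are the practical route.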
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
Cite this chapter
LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R. (1998). Efficient BackProp. In: Orr, G.B., Müller, K.-R. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 1524. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49430-8_2
DOI: https://doi.org/10.1007/3-540-49430-8_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65311-0
Online ISBN: 978-3-540-49430-0