When does label smoothing help?

Published: 08 December 2019 · NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY)

Abstract

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident, and label smoothing has been used in many state-of-the-art models, including for image classification, language translation, and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that, in addition to improving generalization, label smoothing improves model calibration, which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in a loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation but does not hurt generalization or calibration of the model's predictions.
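As a concrete illustration of the soft targets described above, here is a minimal NumPy sketch of label smoothing for a K-class problem. The function names and the smoothing weight `alpha` are our own illustrative choices, not notation from the paper itself.

```python
import numpy as np

def smooth_labels(hard_targets: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Mix one-hot targets with the uniform distribution over K labels.

    Each row of `hard_targets` is a one-hot vector; the smoothed target is
    (1 - alpha) * one_hot + alpha / K, which still sums to 1 per row.
    """
    k = hard_targets.shape[-1]
    return (1.0 - alpha) * hard_targets + alpha / k

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Cross-entropy between (possibly smoothed) targets and softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.sum(targets * log_probs, axis=-1)

# Example: 3 classes, true class = 1, alpha = 0.1
hard = np.eye(3)[[1]]                  # [[0., 1., 0.]]
print(smooth_labels(hard, alpha=0.1))  # [[0.0333..., 0.9333..., 0.0333...]]
```

With `alpha = 0` this reduces to the usual hard-target cross-entropy; values around `alpha = 0.1` are commonly reported in the literature.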

