When does label smoothing help?

Published: 08 December 2019 · NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY)

Abstract

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident, and label smoothing has been used in many state-of-the-art models, including for image classification, language translation, and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that, in addition to improving generalization, label smoothing improves model calibration, which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in a loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation but does not hurt generalization or calibration of the model's predictions.
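As a concrete illustration of the soft targets described above, here is a minimal NumPy sketch of label smoothing for a K-class problem. The function names and the smoothing weight `alpha` are our own illustrative choices, not notation from the paper itself.

```python
import numpy as np

def smooth_labels(hard_targets: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Mix one-hot targets with the uniform distribution over K labels.

    Each row of `hard_targets` is a one-hot vector; the smoothed target is
    (1 - alpha) * one_hot + alpha / K, which still sums to 1 per row.
    """
    k = hard_targets.shape[-1]
    return (1.0 - alpha) * hard_targets + alpha / k

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Cross-entropy between (possibly smoothed) targets and softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.sum(targets * log_probs, axis=-1)

# Example: 3 classes, true class = 1, alpha = 0.1
hard = np.eye(3)[[1]]                  # [[0., 1., 0.]]
print(smooth_labels(hard, alpha=0.1))  # [[0.0333..., 0.9333..., 0.0333...]]
```

With `alpha = 0` this reduces to the usual hard-target cross-entropy; values around `alpha = 0.1` are commonly reported in the literature.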

