
Distilling Word Embeddings: An Encoding Approach

Published: 24 October 2016
DOI: 10.1145/2983323.2983888

Abstract

Distilling knowledge from a well-trained, cumbersome network into a small one has recently become a new research topic, as lightweight neural networks with high performance are in particular demand in resource-restricted systems. This paper addresses the problem of distilling word embeddings for NLP tasks. We propose an encoding approach to distill task-specific knowledge from a set of high-dimensional embeddings, so that model complexity is reduced by a large margin while high accuracy is retained, achieving a good compromise between efficiency and performance. Experiments show that distilling knowledge from cumbersome embeddings outperforms directly training neural networks with small embeddings.
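The abstract describes the encoding approach only at a high level, so the following is a minimal, hedged sketch of one plausible reading: a small encoding (projection) layer is trained on top of frozen high-dimensional embeddings together with a task objective, so the low-dimensional encoded vectors absorb task-specific knowledge. The PyTorch framing, the class and parameter names (DistillingEncoder, small_dim), and the mean-pooled toy classifier are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch only -- NOT the authors' released code. PyTorch setup and the
# toy bag-of-words task head are assumptions made for illustration.
import torch
import torch.nn as nn

class DistillingEncoder(nn.Module):
    """Learn a low-dimensional encoding of frozen high-dimensional embeddings,
    trained end-to-end with a task objective so the small vectors keep
    task-specific knowledge."""
    def __init__(self, big_embeddings: torch.Tensor, small_dim: int, num_classes: int):
        super().__init__()
        _, big_dim = big_embeddings.shape
        # Cumbersome (teacher) embeddings stay fixed during distillation.
        self.big = nn.Embedding.from_pretrained(big_embeddings, freeze=True)
        # Encoding layer: a learned projection down to the small dimension.
        self.encode = nn.Sequential(nn.Linear(big_dim, small_dim), nn.Tanh())
        # Toy task head (mean-pooled classifier), just to drive the training signal.
        self.classifier = nn.Linear(small_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        small_vecs = self.encode(self.big(token_ids))  # (batch, seq, small_dim)
        pooled = small_vecs.mean(dim=1)                # crude sentence representation
        return self.classifier(pooled)

# Toy usage: 300-d "cumbersome" embeddings distilled into 50-d ones.
big = torch.randn(10_000, 300)
model = DistillingEncoder(big, small_dim=50, num_classes=2)
logits = model(torch.randint(0, 10_000, (8, 20)))      # batch of 8 sequences, length 20
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()
# After training, model.encode(model.big.weight) yields a distilled small
# embedding table that could initialize a compact downstream model.
```

The point the abstract emphasizes is that the projection is trained with the task loss rather than as a generic dimensionality reduction, so what the small embeddings retain is task-specific knowledge.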

Index Terms

  1. Distilling Word Embeddings: An Encoding Approach

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
    October 2016
    2566 pages
    ISBN:9781450340731
    DOI:10.1145/2983323
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. model compression
    2. neural networks
    3. word embeddings

    Qualifiers

    • Short-paper

    Conference

    CIKM'16: ACM Conference on Information and Knowledge Management
    October 24-28, 2016
    Indianapolis, Indiana, USA

    Acceptance Rates

    CIKM '16 paper acceptance rate: 160 of 701 submissions (23%)
    Overall acceptance rate: 1,861 of 8,427 submissions (22%)

    Cited By

    • (2024) Exploring the Learning Difficulty of Data: Theory and Measure. ACM Transactions on Knowledge Discovery from Data 18(4): 1-37. DOI: 10.1145/3636512. Online publication date: 13-Feb-2024.
    • (2022) Morphologically-Aware Vocabulary Reduction of Word Embeddings. 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 56-63. DOI: 10.1109/WI-IAT55865.2022.00018. Online publication date: Nov-2022.
    • (2021) Knowledge Distillation: A Survey. International Journal of Computer Vision. DOI: 10.1007/s11263-021-01453-z. Online publication date: 22-Mar-2021.
    • (2020) Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System. Proceedings of the 13th International Conference on Web Search and Data Mining, 690-698. DOI: 10.1145/3336191.3371792. Online publication date: 20-Jan-2020.
    • (2020) Improving Low-Resource Neural Machine Translation With Teacher-Free Knowledge Distillation. IEEE Access 8: 206638-206645. DOI: 10.1109/ACCESS.2020.3037821. Online publication date: 2020.
    • (2019) The pupil has become the master. Proceedings of the 28th International Joint Conference on Artificial Intelligence, 3439-3445. DOI: 10.5555/3367471.3367519. Online publication date: 10-Aug-2019.
    • (2018) Adversarial Distillation for Efficient Recommendation with External Knowledge. ACM Transactions on Information Systems 37(1): 1-28. DOI: 10.1145/3281659. Online publication date: 13-Dec-2018.
