DOI: 10.1145/3269206.3269277
short-paper

Using Word Embeddings for Information Retrieval: How Collection and Term Normalization Choices Affect Performance

Published: 17 October 2018

Abstract

Neural word embedding approaches, owing to their ability to capture the semantic meanings of vocabulary terms, have recently gained the attention of the information retrieval (IR) community and have shown promising results in improving ad hoc retrieval performance. It has been observed that these approaches are sensitive to various choices made during the learning of word embeddings and their usage, often leading to poor reproducibility. We study the effect of varying the following two parameters on ad hoc retrieval performance with word2vec and fastText embeddings: (i) term normalization and (ii) the choice of training collection. We present quantitative estimates of the similarity of word vectors obtained under different settings, and use an embedding-based query expansion task to understand the effects of these parameters on IR effectiveness.

    Published In

    CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
    October 2018
    2362 pages
    ISBN: 9781450360142
    DOI: 10.1145/3269206

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. neural information retrieval
    2. sensitivity analysis
    3. term normalization
    4. word embeddings

    Qualifiers

    • Short-paper

    Conference

    CIKM '18

    Acceptance Rates

    CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 56
    • Downloads (last 6 weeks): 4
    Reflects downloads up to 09 Nov 2024.

    Cited By
    • (2024) Dimension Importance Estimation for Dense Information Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1318-1328. DOI: 10.1145/3626772.3657691. Online publication date: 10-Jul-2024
    • (2023) Bias Invariant Approaches for Improving Word Embedding Fairness. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1400-1410. DOI: 10.1145/3583780.3614792. Online publication date: 21-Oct-2023
    • (2023) ColBERT-PRF: Semantic Pseudo-Relevance Feedback for Dense Passage and Document Retrieval. ACM Transactions on the Web 17:1, 1-39. DOI: 10.1145/3572405. Online publication date: 16-Jan-2023
    • (2023) Deep Learning–Based Named Entity Recognition and Resolution of Referential Ambiguities for Enhanced Information Extraction from Construction Safety Regulations. Journal of Computing in Civil Engineering 37:5. DOI: 10.1061/(ASCE)CP.1943-5487.0001064. Online publication date: Sep-2023
    • (2023) Genetic data visualization using literature text-based neural networks. Neural Networks 165:C, 562-595. DOI: 10.1016/j.neunet.2023.05.015. Online publication date: 1-Aug-2023
    • (2023) A literature embedding model for cardiovascular disease prediction using risk factors, symptoms, and genotype information. Expert Systems with Applications: An International Journal 213:PA. DOI: 10.1016/j.eswa.2022.118930. Online publication date: 1-Mar-2023
    • (2022) Hierarchical Bayesian multi-kernel learning for integrated classification and summarization of app reviews. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 558-569. DOI: 10.1145/3540250.3549174. Online publication date: 7-Nov-2022
    • (2022) Local or Global? A Comparative Study on Applications of Embedding Models for Information Retrieval. Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), 115-119. DOI: 10.1145/3493700.3493701. Online publication date: 8-Jan-2022
    • (2022) Deep-QPP. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 201-209. DOI: 10.1145/3488560.3498491. Online publication date: 11-Feb-2022
    • (2022) A Brief Overview of Universal Sentence Representation Methods: A Linguistic View. ACM Computing Surveys 55:3, 1-42. DOI: 10.1145/3482853. Online publication date: 26-Mar-2022
