Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1183614.1183644acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

A document-centric approach to static index pruning in text retrieval systems

Published: 06 November 2006 Publication History

Abstract

We present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. The decision is made based on the term's contribution to the document's Kullback-Leibler divergence from the text collection's global language model. Our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. It thus allows us to make the index small enough to fit entirely into the main memory of a single PC, even for large text collections containing millions of documents. This results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the GOV2 document collection.

References

[1]
V. N. Anh and A. Moffat. Pruned Query Evaluation Using Precomputed Impacts. In Proceedings of the 29th ACM SIGIR Conf. on Research and Development in Information Retrieval, Seattle, USA, 2006.
[2]
L. Azzopardi and D. E. Losada. An Efficient Computation of the Multiple-Bernoulli Language Model. In Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006), pages 480--483, London, UK, April 2006.
[3]
D. Bahle, H. E. Williams, and J. Zobel. Efficient Phrase Querying with an Auxiliary Index. In Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 215--221, Tampere, Finland, 2002.
[4]
B. Billerbeck and J. Zobel. Techniques for Efficient Query Expansion. In Proceedings of the 11th Symposium on String Processing and Information Retrieval, pages 30--42, Padova, Italy, September 2004.
[5]
S. Büttcher and C. L. A. Clarke. Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval. In Proceedings of the 14th Text REtrieval Conference, Gaithersburg, USA, November 2005.
[6]
D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static Index Pruning for Information Retrieval Systems. In Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43--50, 2001.
[7]
C. Carpineto, R. de Mori, G. Romano, and B. Bigi. An Information-Theoretic Approach to Automatic Query Expansion. ACM Transactions on Information Systems, 19(1):1--27, 2001.
[8]
C. Clarke, F. Scholer, and I. Soboroff. Overview of the TREC Terabyte Track. In Proceedings of the 14th Text REtrieval Conference, Gaithersburg, USA, November 2005.
[9]
E. S. de Moura, C. F. dos Santos, D. R. Fernandes, A. S. Silva, P. Calado, and M. A. Nascimento. Improving Web Search Efficiency via a Locality Based Static Pruning Method. In Proceedings of the 14th International Conference on World Wide Web, pages 235--244, New York, USA, 2005.
[10]
A. Moffat and J. Zobel. Self-Indexing Inverted Files for Fast Text Retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
[11]
M. Persin, J. Zobel, and R. Sacks-Davis. Filtered Document Retrieval with Frequency-Sorted Indexes. Journal of the American Society for Information Science, 47(10):749--764, October 1996.
[12]
V. Plachouras, B. He, and I. Ounis. University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte Tracks with Terrier. In Proceedings of the 13th Text REtrieval Conference, Gaithersburg, USA, November 2004.
[13]
S. E. Robertson, S. Walker, and M. Hancock-Beaulieu. Okapi at TREC-7. In Proceedings of the Seventh Text REtrieval Conference, Gaithersburg, USA, November 1998.
[14]
F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of Inverted Indexes for Fast Query Evaluation. In Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 2002.
[15]
H. Turtle and J. Flood. Query Evaluation: Strategies and Optimization. Information Processing & Management, 31(1):831--850, November 1995.

Cited By

View all
  • (2024)An Analysis on Matching Mechanisms and Token Pruning for Late-interaction ModelsACM Transactions on Information Systems10.1145/363981842:5(1-28)Online publication date: 29-Apr-2024
  • (2024)Neural Passage Quality Estimation for Static PruningProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657765(174-185)Online publication date: 10-Jul-2024
  • (2024)Diversity-aware strategies for static index pruningInformation Processing & Management10.1016/j.ipm.2024.10379561:5(103795)Online publication date: Sep-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
November 2006
916 pages
ISBN:1595934332
DOI:10.1145/1183614
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2006

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. KL divergence
  2. index pruning
  3. information retrieval

Qualifiers

  • Article

Conference

CIKM06
CIKM06: Conference on Information and Knowledge Management
November 6 - 11, 2006
Virginia, Arlington, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)9
  • Downloads (Last 6 weeks)2
Reflects downloads up to 25 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)An Analysis on Matching Mechanisms and Token Pruning for Late-interaction ModelsACM Transactions on Information Systems10.1145/363981842:5(1-28)Online publication date: 29-Apr-2024
  • (2024)Neural Passage Quality Estimation for Static PruningProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657765(174-185)Online publication date: 10-Jul-2024
  • (2024)Diversity-aware strategies for static index pruningInformation Processing & Management10.1016/j.ipm.2024.10379561:5(103795)Online publication date: Sep-2024
  • (2024)Two-Step SPLADE: Simple, Efficient and Effective Approximation of SPLADEAdvances in Information Retrieval10.1007/978-3-031-56060-6_23(349-363)Online publication date: 16-Mar-2024
  • (2023)The Power of Selecting Key Blocks with Local Pre-ranking for Long Document Information RetrievalACM Transactions on Information Systems10.1145/356839441:3(1-35)Online publication date: 7-Feb-2023
  • (2023)Representation Sparsification with Hybrid Thresholding for Fast SPLADE-based Document RetrievalProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3592051(2329-2333)Online publication date: 19-Jul-2023
  • (2020)Pre-indexing Pruning StrategiesString Processing and Information Retrieval10.1007/978-3-030-59212-7_13(177-193)Online publication date: 17-Sep-2020
  • (2018)Exploring Size-Speed Trade-Offs in Static Index Pruning2018 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2018.8622177(1093-1100)Online publication date: Dec-2018
  • (2017)An Empirical Analysis of Pruning TechniquesProceedings of the 2017 ACM on Conference on Information and Knowledge Management10.1145/3132847.3133151(2023-2026)Online publication date: 6-Nov-2017
  • (2017)LSH-Based Probabilistic Pruning of Inverted Indices for Sets and Ranked ListsProceedings of the 20th International Workshop on the Web and Databases10.1145/3068839.3068845(23-28)Online publication date: 14-May-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media