research-article

Incremental cluster-based retrieval using compressed cluster-skipping inverted files

Authors:

Ismail Sengor Altingovde,

Özgür UlusoyAuthors Info & Claims

ACM Transactions on Information Systems (TOIS), Volume 26, Issue 3

Article No.: 15, Pages 1 - 36

https://doi.org/10.1145/1361684.1361688

Published: 20 June 2008 Publication History

Abstract

We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size.

References

[1]

Altingovde, I. S., Can, F., and Ulusoy, Ö. 2006. Algorithms for within-cluster searches using inverted files. In Proceedings of the International Symposium on Computer and Information Sciences (ISCIS). 707--716.

Digital Library

[2]

Altingovde, I. S., Özcan, R., Öcalan, H. C., Can, F., and Ulusoy, Ö. 2007. Large-Scale cluster-based retrieval experiments on Turkish texts. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 891--892.

Digital Library

[3]

Anh, N. V., Kretser, O. de, and Moffat, A. 2001. Vector-Space ranking with effective early termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 35--42

Digital Library

[4]

Anh, V. N. and Moffat, A. 2005a. Inverted index compression using word-aligned binary codes. Inf. Retr. 8, 1, 151--166.

Digital Library

[5]

Anh, V. N. and Moffat, A. 2005b. Simplified similarity scoring using term ranks. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 226--233.

Digital Library

[6]

Anh, V. N. and Moffat, A. 2006. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. Seattle, WA. ACM Press, 372--379.

Digital Library

[7]

Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. 2007. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 183--190.

Digital Library

[8]

Blandford, D. and Blelloch, G. 2002. Index compression through document reordering. In Proceedings of the Data Compression Conference. IEEE, 342--351.

Digital Library

[9]

Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. 30, 1--7, 107--117.

Digital Library

[10]

Brown, E. W. 1995. Fast evaluation of structured queries for information retrieval. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 30--38.

Digital Library

[11]

Buckley, C. and Lewit, A. F. 1985. Optimization of inverted vector searches. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 97--110.

Digital Library

[12]

Buckley C. and Voorhees E. M. 2000. Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 33--40.

Digital Library

[13]

Cacheda, F. and Baeza-Yates, R. 2004. An optimistic model for searching Web directories. In Proceedings of the 26th European Conference on IR Research (ECIR). 364--377.

[14]

Cacheda, F., Carneiro, V., Guerrero, C., and Viña, Á. 2003. Optimization of restricted searches in Web directories using hybrid data structures. In Proceedings of the 25th European Conference on IR Research (ECIR). 436--451.

Digital Library

[15]

Cambazoglu, B. B. 2006. Models and algorithms for parallel text retrieval. Ph.D. thesis, Bilkent University, Ankara, Turkey.

[16]

Cambazoglu, B. B. and Aykanat, C. 2006. Performance of query processing implementations in ranking-based text retrieval systems using inverted indices. Inf. Process. Manage. 42, 4, 875--898.

Digital Library

[17]

Can, F. 1993. Incremental clustering for dynamic information processing. ACM Trans. Inf. Syst. 11, 2, 143--164.

Digital Library

[18]

Can, F. 1994. On the efficiency of best-match cluster searches. Inf. Process. Manage. 30, 3, 343--361.

Digital Library

[19]

Can, F., Altingovde, I. S., and Demir, E. 2004. Efficiency and effectiveness of query processing in cluster-based retrieval. Inf. Syst. 29, 8, 697--717.

Digital Library

[20]

Can, F. and Ozkarahan, E. A. 1990. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans. Database Syst. 15, 4, 483--517.

Digital Library

[21]

Croft, W. B. 1980. A model of cluster searching based on classification. Inf. Syst. 5, 3, 189--195.

[22]

Harman, D. 1992. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, eds. Prentice Hall, Englewood Cliffs, NJ, (Chapter 14), 363--392.

Digital Library

[23]

Jain, A. K. and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.

Digital Library

[24]

Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Inf. Storage Retr. 7, 217--240.

[25]

Lester, N., Moffat, A., Webber, W., and Zobel, J. 2005. Space-Limited ranked query evaluation using adaptive pruning. In Proceedings of the 6th International Conference on Web Information Systems Engineering, New York, 470--477.

Digital Library

[26]

Liu, X. and Croft, W. B. 2004. Cluster-Based retrieval using language models. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 186--193.

Digital Library

[27]

Long, X. and Suel, T. 2003. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 129--140.

Digital Library

[28]

Moffat, A. and Zobel, J. 1996. Self-Indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst. 14, 4, 349--379.

Digital Library

[29]

Murray, D. M. 1972. Document retrieval based on clustered files. Ph.D. thesis, Cornell University. Also Rep. ISR-20. National Science Foundation, National Library of Medicine.

Digital Library

[30]

Nuray, R. and Can, F. 2006. Automatic ranking of information retrieval systems using data fusion. Inf. Process. Manage. 42, 3, 595--614.

Digital Library

[31]

Persin, M. 1994. Document filtering for fast ranking. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 339--348.

Digital Library

[32]

Persin, M., Zobel, J., and Sacks-Davis, R. 1996. Filtered document retrieval with frequency-sorted indexes. J. Amer. Soc. Inf. Sci. 47, 10, 749--764.

Digital Library

[33]

Salton, G. 1975. Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Digital Library

[34]

Salton, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.

Digital Library

[35]

Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5, 513-523.

Digital Library

[36]

Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill, New York.

Digital Library

[37]

Silvestri, F., Orlando, S., and Perego, R. 2004. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 305--312.

Digital Library

[38]

Silvestri, F. 2007. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR). 101--112.

Digital Library

[39]

Stanford University. 2007. Webbase project Web site. www-diglib.stanford.edu/~testbed/doc2/WebBase.

[40]

Strohman, T. and Croft, W. B. 2007. Efficient document retrieval in main memory. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 175--182.

Digital Library

[41]

Trec 2006. Trec. Web site. http://trec.nist.gov.

[42]

Tombros, A. 2002. The effectiveness of query-based hierarchic clustering of documents for information retrieval. Ph.D. thesis, University of Glasgow, Glasgow, UK.

[43]

Van Rijsbergen, C. J. 1979. Information Retrieval, 2nd ed. Butterworths, London.

Digital Library

[44]

Voorhees, E. M. 1985. Cluster hypothesis revisited. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 188--196.

Digital Library

[45]

Voorhees, E. M. 1986a. The effectiveness and efficiency of agglomerative hierarchical clustering in document retrieval. Ph.D. thesis, Cornell University, Ithaca, NY.

Digital Library

[46]

Voorhees, E. M. 1986b. The efficiency of inverted index and cluster searches. In Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York. 164--174.

Digital Library

[47]

Willett, P. 1988. Recent trends in hierarchical document clustering: A critical review. Inf. Process. Manage. 24, 5, 577--597.

Digital Library

[48]

Witten, I. H., Moffat, A., and Bell, T. C. 1994. Managing Gigabytes Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York.

Digital Library

[49]

Zettair 2007. The Zettair search engine. http://www.seg.rmit.edu.au/zettair/.

[50]

Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Surv. 38, 2, 1--56.

Digital Library

Cited By

Gorantla SMehrotra ADeshpande ALouis A(2023)Sampling Individually-Fair Rankings that are Always Group FairProceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society10.1145/3600211.3604671(205-216)Online publication date: 8-Aug-2023
https://dl.acm.org/doi/10.1145/3600211.3604671
Cinar EAltingovde IChen HDuh WHuang HKato MMothe JPoblete B(2023)Exploiting Cluster-Skipping Inverted Index for Semantic Place RetrievalProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591983(1981-1985)Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3539618.3591983
Li ZMackenzie JChen HDuh WHuang HKato MMothe JPoblete B(2023)Profiling and Visualizing Dynamic Pruning AlgorithmsProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591806(3125-3129)Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3539618.3591806
Show More Cited By

Index Terms

Incremental cluster-based retrieval using compressed cluster-skipping inverted files
1. Information systems

Recommendations

Compressing Inverted Files
Abstract
Research into inverted file compression has focused on compression ratio—how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read, the smaller the index, the faster the search.
The premise “...
Document identifier reassignment and run-length-compressed inverted indexes for improved search performance
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: the inverted indexes. Currently, inverted indexes can be represented very efficiently using index compression schemes. Recent ...
Performance Evaluation of Compressed Inverted Index in Lucene
ICRCCS '09: Proceedings of the 2009 International Conference on Research Challenges in Computer Science

Inverted index is the most popular index structure in search engine. Applying index compression can reduce storage space on inverted index, and improve the search performance. In this paper, we compare three typical index compression schemes in Lucene—...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Information Systems

ACM Transactions on Information Systems Volume 26, Issue 3

June 2008

236 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1361684

Issue’s Table of Contents

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 June 2008

Accepted: 01 December 2007

Revised: 01 November 2007

Received: 01 February 2006

Published in TOIS Volume 26, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Türkiye Bilimsel ve Teknolojik Arastirma Kurumu

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

25
Total Citations
View Citations
737
Total Downloads

Downloads (Last 12 months)7
Downloads (Last 6 weeks)0

Reflects downloads up to 23 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Gorantla SMehrotra ADeshpande ALouis A(2023)Sampling Individually-Fair Rankings that are Always Group FairProceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society10.1145/3600211.3604671(205-216)Online publication date: 8-Aug-2023
https://dl.acm.org/doi/10.1145/3600211.3604671
Cinar EAltingovde IChen HDuh WHuang HKato MMothe JPoblete B(2023)Exploiting Cluster-Skipping Inverted Index for Semantic Place RetrievalProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591983(1981-1985)Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3539618.3591983
Li ZMackenzie JChen HDuh WHuang HKato MMothe JPoblete B(2023)Profiling and Visualizing Dynamic Pruning AlgorithmsProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591806(3125-3129)Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3539618.3591806
Mackenzie JPetri MMoffat A(2021)Anytime Ranking on Document-Ordered IndexesACM Transactions on Information Systems10.1145/346789040:1(1-32)Online publication date: 8-Sep-2021
https://dl.acm.org/doi/10.1145/3467890
Bouadjenek MSanner SDu Y(2020)Relevance- and interface-driven clustering for visual information retrievalInformation Systems10.1016/j.is.2020.101592(101592)Online publication date: Jul-2020
https://doi.org/10.1016/j.is.2020.101592
Gaioso RGil‐Costa VGuardia HSenger H(2019)Performance evaluation of single vs. batch of queries on GPUsConcurrency and Computation: Practice and Experience10.1002/cpe.547432:20Online publication date: 8-Aug-2019
https://doi.org/10.1002/cpe.5474
Gaioso RGil-Costa VGuardia HSenger H(2018)A Parallel Implementation of WAND on GPUs2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)10.1109/PDP2018.2018.00011(10-17)Online publication date: Mar-2018
https://doi.org/10.1109/PDP2018.2018.00011
Kim YCallan JCulpepper JMoffat A(2017)Efficient distributed selective searchInformation Retrieval10.1007/s10791-016-9290-620:3(221-252)Online publication date: 1-Jun-2017
https://dl.acm.org/doi/10.1007/s10791-016-9290-6
Hafizoglu FKucukoglu EAltingovde I(2017)On the Efficiency of Selective SearchAdvances in Information Retrieval10.1007/978-3-319-56608-5_69(705-712)Online publication date: 8-Apr-2017
https://doi.org/10.1007/978-3-319-56608-5_69
Jin XAgun DYang TWu QShen YZhao SMukhopadhyay SZhai CBertino ECrestani FMostafa JTang JSi LZhou XChang YLi YSondhi P(2016)Hybrid Indexing for Versioned Document Search with Cluster-based RetrievalProceedings of the 25th ACM International on Conference on Information and Knowledge Management10.1145/2983323.2983733(377-386)Online publication date: 24-Oct-2016
https://dl.acm.org/doi/10.1145/2983323.2983733
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents