Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1060745.1060783acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Improving Web search efficiency via a locality based static pruning method

Published: 10 May 2005 Publication History

Abstract

The unarguably fast, and continuous, growth of the volume of indexed (and indexable) documents on the Web poses a great challenge for search engines. This is true regarding not only search effectiveness but also time and space efficiency. In this paper we present an index pruning technique targeted for search engines that addresses the latter issue without disconsidering the former. To this effect, we adopt a new pruning strategy capable of greatly reducing the size of search engine indices. Experiments using a real search engine show that our technique can reduce the indices' storage costs by up to 60% over traditional lossless compression methods, while keeping the loss in retrieval precision to a minimum. When compared to the indices size with no compression at all, the compression rate is higher than 88%, i.e., less than one eighth of the original size. More importantly, our results indicate that, due to the reduction in storage overhead, query processing time can be reduced to nearly 65% of the original time, with no loss in average precision. The new method yields significative improvements when compared against the best known static pruning method for search engine indices. In addition, since our technique is orthogonal to the underlying search algorithms, it can be adopted by virtually any search engine.

References

[1]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[2]
D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with and auxiliary index. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 215--221, 2002.
[3]
T. C. Bell, J. G. Cleary, and I. H. Witten. Text compression. Prentice Hall, 1990.
[4]
P. P. Calado, E. S. de~Moura, B. Ribeiro-Neto, I. Silva, and N. Ziviani. Local versus global link information in the web. ACM Transactions on Information Systems (TOIS), 2003.
[5]
D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43--50, New Orleans, Louisiana, USA, September 2001.
[6]
W. B. Croft, H. R. Turtle, and D. D. Lewis. The use of phrases and structured queries in information retrieval. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 32--45, 1991.
[7]
E. S. de~Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems(ACM TOIS), 18(2):113--139, April 2000.
[8]
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of American Society for Information Science, 41(6), 1990.
[9]
R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 28--36, 2003.
[10]
D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web page. In Proceedings of the 12th International World Wide Web Conference, pages 669--678, May 2003.
[11]
D. Hawking, N. Craswell, and P. B. Thistlewaite. Overview of TREC-7 very large collection track. In The Seventh Text REtrieval Conference (TREC-7), pages 91--104, Gaithersburg, Maryland, USA, November 1998.
[12]
D. Hawking, N. Craswell, P. B. Thistlewaite, and D. Harman. Results and challenges in web search evaluation. Computer Networks, 31(11--16):1321--1330, May 1999. Also in Proceedings of the 8th International World Wide Web Conference.
[13]
D. Hawking, E. Voorhees, P. Bailey, and N. Craswell. Overview of trec-8 web track. In Proc. of TREC-8, pages 131--150, Gaithersburg MD, November 1999.
[14]
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 668--677, San Francisco, California, USA, January 1998.
[15]
L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Agarwal. Dynamic maintenance of web indexes using landmarks. In Proceedings of the 12th International World Wide Web Conference, pages 102--111, May 2003.
[16]
G. Navarro, E. S. de~Moura, M. Neubert, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. Information Retrieval Journal, 3(1):49--77, 2000.
[17]
M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society of Information Science, 47(10):749--764, Oct. 1996.
[18]
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1st edition, 1983.
[19]
P. C. Saraiva, E. S. Moura, N. Ziviani, B. Ribeiro-Neto, W. Meira, and R. L. C. Fonseca. Rank-preserving two-level caching for scalable search engines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 51--58, New Orleans, September 2001.
[20]
I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Morgan Kaufmann Publishers, New York, second edition, 1999.

Cited By

View all
  • (2024)Neural Passage Quality Estimation for Static PruningProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657765(174-185)Online publication date: 10-Jul-2024
  • (2020)Pre-indexing Pruning StrategiesString Processing and Information Retrieval10.1007/978-3-030-59212-7_13(177-193)Online publication date: 17-Sep-2020
  • (2018)Exploring Size-Speed Trade-Offs in Static Index Pruning2018 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2018.8622177(1093-1100)Online publication date: Dec-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
WWW '05: Proceedings of the 14th international conference on World Wide Web
May 2005
781 pages
ISBN:1595930469
DOI:10.1145/1060745
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2005

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. indexing
  2. information retrieval
  3. pruning
  4. search engines
  5. web search

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)22
  • Downloads (Last 6 weeks)0
Reflects downloads up to 25 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Neural Passage Quality Estimation for Static PruningProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657765(174-185)Online publication date: 10-Jul-2024
  • (2020)Pre-indexing Pruning StrategiesString Processing and Information Retrieval10.1007/978-3-030-59212-7_13(177-193)Online publication date: 17-Sep-2020
  • (2018)Exploring Size-Speed Trade-Offs in Static Index Pruning2018 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2018.8622177(1093-1100)Online publication date: Dec-2018
  • (2018)Inverted FilesEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_1136(2037-2041)Online publication date: 7-Dec-2018
  • (2017)Inverted FilesEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_1136-2(1-5)Online publication date: 16-Oct-2017
  • (2016)Improved methods for static index pruning2016 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2016.7840661(686-695)Online publication date: Dec-2016
  • (2015)Scalability Challenges in Web Search EnginesSynthesis Lectures on Information Concepts, Retrieval, and Services10.2200/S00662ED1V01Y201508ICR0457:6(1-138)Online publication date: 29-Dec-2015
  • (2015)On Divergence Measures and Static Index PruningProceedings of the 2015 International Conference on The Theory of Information Retrieval10.1145/2808194.2809472(151-160)Online publication date: 27-Sep-2015
  • (2015)Selective SearchACM Transactions on Information Systems10.1145/273803533:4(1-33)Online publication date: 23-Apr-2015
  • (2015)Faster Exact Search Using Document ClusteringProceedings of the 22nd International Symposium on String Processing and Information Retrieval - Volume 930910.1007/978-3-319-23826-5_1(1-12)Online publication date: 1-Sep-2015
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media