Article

Three-level caching for efficient query processing in large Web search engines

Authors:

Torsten SuelAuthors Info & Claims

WWW '05: Proceedings of the 14th international conference on World Wide Web

Pages 257 - 266

https://doi.org/10.1145/1060745.1060785

Published: 10 May 2005 Publication History

Abstract

Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level.We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.

References

[1]

V. Anh, O. Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 35--42, Sept. 2001.

Digital Library

[2]

V. Anh and A. Moffat. Compressed inverted files with reduced decoding overheads. In Proc. 21st Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 290--297, 1998.

Digital Library

[3]

A. Arasu, J. Cho, H. Garcia-Molina, and S. Raghavan. Searching the web. ACM Transactions on Internet Technologies, 1(1), June 2001.

Digital Library

[4]

C. Badue, R. Baeza-Yates, B. Ribeiro-Neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In Proc. of the 9th String Processing and Information Retrieval Symposium (SPIRE), Sept. 2002.

[5]

R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addision Wesley, 1999.

Digital Library

[6]

D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In Proc. of the 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 215--221, 2002.

Digital Library

[7]

B. Bhattacharjee, S. Chawathe, V. Gopalakrishnan, P. Keleher, and B. Silaghi. Efficient peer-to-peer searches using result-caching. In Proc. of the 2nd Int. Workshop on Peer-to-Peer Systems, 2003.

[8]

E. Brewer. Lessons from giant scale services. IEEE Internet Computing, pages 46--55, August 2001.

Digital Library

[9]

S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh World Wide Web Conference, 1998.

Digital Library

[10]

A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, pages 21--29. IEEE Computer Society, 1997.

Digital Library

[11]

P. Cao and S. Irani. Cost-aware WWW proxy caching algorithms. In USENIX Symp. on Internet Technologies and Systems (USITS), 1997.

Digital Library

[12]

S. Chaudhuri and L. Gravano. Optimizing queries over multimedia repositories. Data Engineering Bulletin, 19(4):45--52, 1996.

[13]

E. Demaine, A. Lopez-Ortiz, and J. Munro. Adaptive set intersections, unions, and differences. In Proc. of the 11th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 743--752, 2000.

Digital Library

[14]

R. Fagin. Combining fuzzy information from multiple systems. In Proc. of ACM Symp. on Principles of Database Systems, 1996.

Digital Library

[15]

R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 43--50, Sept. 2001.

Digital Library

[16]

R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proc. of ACM Symp. on Principles of Database Systems, 2001.

Digital Library

[17]

M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP Completeness. WH Freeman and Company, 1979.

Digital Library

[18]

T. Haveliwala. Topic-sensitive pagerank. In Proc. of the 11th Int. World Wide Web Conference, May 2002.

Digital Library

[19]

B. T. Jonsson, M. J. Franklin, and D. Srivastava. Interaction of query evaluation and buffer management for information retrieval. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 118--129, June 1998.

Digital Library

[20]

M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS), 17(4):406--439, Oct. 1999.

Digital Library

[21]

R. Lempel and S. Moran. Optimizing result prefetching in web search engines with segmented indices. In Proc. of the 28th Int. Conf. on Very Large Data Bases, Aug. 2002.

Digital Library

[22]

R. Lempel and S. Moran. Predictive caching and prefetching of query results in search engines. In Proc. of the 12th Int. World-Wide Web Conference, 2003.

Digital Library

[23]

J. Li, B. Loo, J. Hellerstein, F. Kaashoek, D. Karger, and R. Morris. On the feasibility of peer-to-peer web indexing. In Proc. of the 2nd Int. Workshop on Peer-to-Peer Systems, 2003.

[24]

X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proc. of the 29th Int. Conf. on Very Large Data Bases, September 2003.

Digital Library

[25]

E. Markatos. On caching search engine query results. In 5th International Web Caching and Content Delivery Workshop, May 2000.

[26]

N. Megiddo and D. Modha. Outperforming LRU with an adaptive replacement cache. IEEE Computer, pages 58--65, April 2004.

Digital Library

[27]

S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. of the 10th Int. World Wide Web Conference, May 2000.

Digital Library

[28]

M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749--764, May 1996.

[29]

M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in pagerank. In Advances in Neural Information Processing Systems, 2002.

[30]

K. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In First Latin American Web Congress, pages 132--143, 2003.

Digital Library

[31]

K. Risvik and R. Michelsen. Search engines and web dynamics. Computer Networks, 39:289--302, 2002.

[32]

P. Saraiva, E. de~Moura, N. Ziviani, W. Meira, R. Fonseca, and B. Ribeiro-Neto. Rank-preserving two-level caching for scalable search engines. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 51--58, Sept. 2001.

Digital Library

[33]

F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. of the 25th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 222--229, 2002.

Digital Library

[34]

V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, 2002.

Digital Library

[35]

T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB), June 2003.

[36]

A. Tomasic and H. Garcia-Molina. Performance of inverted indices in distributed text document retrieval systems. In Proc. of the 2nd Int. Conf. on Parallel and Distributed Information Systems (PDIS), 1993.

Digital Library

[37]

I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.

Digital Library

[38]

Y. Xie and D. O'Hallaron. Locality in search engine queries and its implications for caching. In IEEE Infocom 2002, pages 1238--1247, 2002.

[39]

N. Young. On-line file caching. In Proc. of the 9th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 82--86, 1998.

Digital Library

Cited By

Ni C(2023)Top-k query optimization on the hierarchical memory structure2023 IEEE 6th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE)10.1109/AUTEEE60196.2023.10407189(1075-1080)Online publication date: 15-Dec-2023
https://doi.org/10.1109/AUTEEE60196.2023.10407189
Siedlaczek MMallia ASuel TSelcuk Candan KLiu HAkoglu LLuna Dong XTang J(2022)Using Conjunctions for Faster Disjunctive Top-k QueriesProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498489(917-927)Online publication date: 11-Feb-2022
https://dl.acm.org/doi/10.1145/3488560.3498489
Li XWang BQiu TYu GWang N(2020)Constructing Effective Caches of Shortest Path Queries on Road NetworksIEEE Access10.1109/ACCESS.2020.30246648(193777-193789)Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3024664
Show More Cited By

Index Terms

Three-level caching for efficient query processing in large Web search engines
1. General and reference
  1. Cross-computing tools and techniques
    1. Metrics
2. Information systems

Recommendations

Performance of compressed inverted list caching in search engines
WWW '08: Proceedings of the 17th international conference on World Wide Web

Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making ...
The impact of caching on search engines
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs.caching posting lists. Using a query log ...
Boosting the performance of Web search engines: Caching and prefetching query results by exploiting historical usage data

This article discusses efficiency and effectiveness issues in caching the results of queries submitted to a Web search engine (WSE). We propose SDC (Static Dynamic Cache), a new caching strategy aimed to efficiently exploit the temporal and spatial ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

WWW '05: Proceedings of the 14th international conference on World Wide Web

May 2005

781 pages

ISBN:1595930469

DOI:10.1145/1060745

Conference Chairs:
Allan Ellis
Southern Cross University
,
Tatsuya Hagino
Keio University

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2005

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

101
Total Citations
View Citations
1,401
Total Downloads

Downloads (Last 12 months)14
Downloads (Last 6 weeks)2

Reflects downloads up to 19 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Ni C(2023)Top-k query optimization on the hierarchical memory structure2023 IEEE 6th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE)10.1109/AUTEEE60196.2023.10407189(1075-1080)Online publication date: 15-Dec-2023
https://doi.org/10.1109/AUTEEE60196.2023.10407189
Siedlaczek MMallia ASuel TSelcuk Candan KLiu HAkoglu LLuna Dong XTang J(2022)Using Conjunctions for Faster Disjunctive Top-k QueriesProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498489(917-927)Online publication date: 11-Feb-2022
https://dl.acm.org/doi/10.1145/3488560.3498489
Li XWang BQiu TYu GWang N(2020)Constructing Effective Caches of Shortest Path Queries on Road NetworksIEEE Access10.1109/ACCESS.2020.30246648(193777-193789)Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3024664
Lempel RSilvestri F(2018)Web Search Result Caching and PrefetchingEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_464(4655-4660)Online publication date: 7-Dec-2018
https://doi.org/10.1007/978-1-4614-8265-9_464
De Moura ECristo M(2018)Inverted FilesEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_1136(2037-2041)Online publication date: 7-Dec-2018
https://doi.org/10.1007/978-1-4614-8265-9_1136
Kane ATompa FCamilleri KBonnici A(2017)Small-Term Distribution for Disk-Based SearchProceedings of the 2017 ACM Symposium on Document Engineering10.1145/3103010.3103022(49-58)Online publication date: 31-Aug-2017
https://dl.acm.org/doi/10.1145/3103010.3103022
Dasgupta AKumar RSarlós TBarrett RCummings RAgichtein EGabrilovich E(2017)Caching with Dual CostsProceedings of the 26th International Conference on World Wide Web Companion10.1145/3041021.3054187(643-652)Online publication date: 3-Apr-2017
https://dl.acm.org/doi/10.1145/3041021.3054187
Dong XLi RHe HGu XSarem MQiu MLi K(2017)EDSFuture Generation Computer Systems10.1016/j.future.2016.02.01474:C(220-231)Online publication date: 1-Sep-2017
https://dl.acm.org/doi/10.1016/j.future.2016.02.014
Li XQiu TWang NYang XWang BYu G(2017)Refreshment of the shortest path cache with change of single edgeExpert Systems with Applications: An International Journal10.1016/j.eswa.2016.08.01167:C(1-11)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1016/j.eswa.2016.08.011
Tolosa GFeuerstein EBecchetti LMarchetti-Spaccamela A(2017)Performance improvements for search systems using an integrated cache of lists + intersectionsInformation Retrieval Journal10.1007/s10791-017-9299-520:3(172-198)Online publication date: 11-Mar-2017
https://doi.org/10.1007/s10791-017-9299-5
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten