Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1060745.1060785acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Three-level caching for efficient query processing in large Web search engines

Published: 10 May 2005 Publication History

Abstract

Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level.We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.

References

[1]
V. Anh, O. Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 35--42, Sept. 2001.
[2]
V. Anh and A. Moffat. Compressed inverted files with reduced decoding overheads. In Proc. 21st Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 290--297, 1998.
[3]
A. Arasu, J. Cho, H. Garcia-Molina, and S. Raghavan. Searching the web. ACM Transactions on Internet Technologies, 1(1), June 2001.
[4]
C. Badue, R. Baeza-Yates, B. Ribeiro-Neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In Proc. of the 9th String Processing and Information Retrieval Symposium (SPIRE), Sept. 2002.
[5]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addision Wesley, 1999.
[6]
D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In Proc. of the 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 215--221, 2002.
[7]
B. Bhattacharjee, S. Chawathe, V. Gopalakrishnan, P. Keleher, and B. Silaghi. Efficient peer-to-peer searches using result-caching. In Proc. of the 2nd Int. Workshop on Peer-to-Peer Systems, 2003.
[8]
E. Brewer. Lessons from giant scale services. IEEE Internet Computing, pages 46--55, August 2001.
[9]
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh World Wide Web Conference, 1998.
[10]
A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, pages 21--29. IEEE Computer Society, 1997.
[11]
P. Cao and S. Irani. Cost-aware WWW proxy caching algorithms. In USENIX Symp. on Internet Technologies and Systems (USITS), 1997.
[12]
S. Chaudhuri and L. Gravano. Optimizing queries over multimedia repositories. Data Engineering Bulletin, 19(4):45--52, 1996.
[13]
E. Demaine, A. Lopez-Ortiz, and J. Munro. Adaptive set intersections, unions, and differences. In Proc. of the 11th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 743--752, 2000.
[14]
R. Fagin. Combining fuzzy information from multiple systems. In Proc. of ACM Symp. on Principles of Database Systems, 1996.
[15]
R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 43--50, Sept. 2001.
[16]
R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proc. of ACM Symp. on Principles of Database Systems, 2001.
[17]
M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP Completeness. WH Freeman and Company, 1979.
[18]
T. Haveliwala. Topic-sensitive pagerank. In Proc. of the 11th Int. World Wide Web Conference, May 2002.
[19]
B. T. Jonsson, M. J. Franklin, and D. Srivastava. Interaction of query evaluation and buffer management for information retrieval. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 118--129, June 1998.
[20]
M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS), 17(4):406--439, Oct. 1999.
[21]
R. Lempel and S. Moran. Optimizing result prefetching in web search engines with segmented indices. In Proc. of the 28th Int. Conf. on Very Large Data Bases, Aug. 2002.
[22]
R. Lempel and S. Moran. Predictive caching and prefetching of query results in search engines. In Proc. of the 12th Int. World-Wide Web Conference, 2003.
[23]
J. Li, B. Loo, J. Hellerstein, F. Kaashoek, D. Karger, and R. Morris. On the feasibility of peer-to-peer web indexing. In Proc. of the 2nd Int. Workshop on Peer-to-Peer Systems, 2003.
[24]
X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proc. of the 29th Int. Conf. on Very Large Data Bases, September 2003.
[25]
E. Markatos. On caching search engine query results. In 5th International Web Caching and Content Delivery Workshop, May 2000.
[26]
N. Megiddo and D. Modha. Outperforming LRU with an adaptive replacement cache. IEEE Computer, pages 58--65, April 2004.
[27]
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. of the 10th Int. World Wide Web Conference, May 2000.
[28]
M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749--764, May 1996.
[29]
M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in pagerank. In Advances in Neural Information Processing Systems, 2002.
[30]
K. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In First Latin American Web Congress, pages 132--143, 2003.
[31]
K. Risvik and R. Michelsen. Search engines and web dynamics. Computer Networks, 39:289--302, 2002.
[32]
P. Saraiva, E. de~Moura, N. Ziviani, W. Meira, R. Fonseca, and B. Ribeiro-Neto. Rank-preserving two-level caching for scalable search engines. In Proc. of the 24th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 51--58, Sept. 2001.
[33]
F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. of the 25th Annual SIGIR Conf. on Research and Development in Information Retrieval, pages 222--229, 2002.
[34]
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, 2002.
[35]
T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB), June 2003.
[36]
A. Tomasic and H. Garcia-Molina. Performance of inverted indices in distributed text document retrieval systems. In Proc. of the 2nd Int. Conf. on Parallel and Distributed Information Systems (PDIS), 1993.
[37]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.
[38]
Y. Xie and D. O'Hallaron. Locality in search engine queries and its implications for caching. In IEEE Infocom 2002, pages 1238--1247, 2002.
[39]
N. Young. On-line file caching. In Proc. of the 9th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 82--86, 1998.

Cited By

View all
  • (2023)Top-k query optimization on the hierarchical memory structure2023 IEEE 6th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE)10.1109/AUTEEE60196.2023.10407189(1075-1080)Online publication date: 15-Dec-2023
  • (2022)Using Conjunctions for Faster Disjunctive Top-k QueriesProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498489(917-927)Online publication date: 11-Feb-2022
  • (2020)Constructing Effective Caches of Shortest Path Queries on Road NetworksIEEE Access10.1109/ACCESS.2020.30246648(193777-193789)Online publication date: 2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
WWW '05: Proceedings of the 14th international conference on World Wide Web
May 2005
781 pages
ISBN:1595930469
DOI:10.1145/1060745
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2005

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Web search
  2. caching
  3. inverted index

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)14
  • Downloads (Last 6 weeks)2
Reflects downloads up to 19 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Top-k query optimization on the hierarchical memory structure2023 IEEE 6th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE)10.1109/AUTEEE60196.2023.10407189(1075-1080)Online publication date: 15-Dec-2023
  • (2022)Using Conjunctions for Faster Disjunctive Top-k QueriesProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498489(917-927)Online publication date: 11-Feb-2022
  • (2020)Constructing Effective Caches of Shortest Path Queries on Road NetworksIEEE Access10.1109/ACCESS.2020.30246648(193777-193789)Online publication date: 2020
  • (2018)Web Search Result Caching and PrefetchingEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_464(4655-4660)Online publication date: 7-Dec-2018
  • (2018)Inverted FilesEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_1136(2037-2041)Online publication date: 7-Dec-2018
  • (2017)Small-Term Distribution for Disk-Based SearchProceedings of the 2017 ACM Symposium on Document Engineering10.1145/3103010.3103022(49-58)Online publication date: 31-Aug-2017
  • (2017)Caching with Dual CostsProceedings of the 26th International Conference on World Wide Web Companion10.1145/3041021.3054187(643-652)Online publication date: 3-Apr-2017
  • (2017)EDSFuture Generation Computer Systems10.1016/j.future.2016.02.01474:C(220-231)Online publication date: 1-Sep-2017
  • (2017)Refreshment of the shortest path cache with change of single edgeExpert Systems with Applications: An International Journal10.1016/j.eswa.2016.08.01167:C(1-11)Online publication date: 1-Jan-2017
  • (2017)Performance improvements for search systems using an integrated cache of lists + intersectionsInformation Retrieval Journal10.1007/s10791-017-9299-520:3(172-198)Online publication date: 11-Mar-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media