Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/511446.511465acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Optimal crawling strategies for web search engines

Published: 07 May 2002 Publication History

Abstract

Web Search Engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: The frequency with which a client makes a search engine query and then clicks on a returned url only to find that the result is incorrect. The first part our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems which are highly computationally efficient -- crucial for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.

References

[1]
R. Ahuja, T. Magnanti and J. Orlin, Network Flows, Prentice Hall, 1993.
[2]
A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1), 2001.
[3]
J. Blazewicz, K. Ecker, G. Schmidt and J. Weglarz, Scheduling in Computer and Manufacturing Systems, Springer-Verlag, 1993.
[4]
A. Broder, Personal communication.
[5]
J. Challenger, P. Dantzig, A. Iyengar, M. S. Squillante, and L. Zhang. Efficiently serving dynamic data at highly accessed web sites. Preprint, May 2001.
[6]
J. Cho and H. Garcia-Molina, "Synchronizing a Database to Improve Freshness", ACM SIGMOD Conference, 2000.
[7]
E. Coffman, Z. Liu and R. Weber, "Optimal Robot Scheduling for Web Search Engines", INRIA Research Report, 1997.
[8]
F. Douglas, A. Feldmann and B. Krishnamurthy, "Rate of Change and other Metrics: A Live Study of the World Wide Web", USENIX Symposium on Internetworking Technologies and Systems, 1999.
[9]
B. Fox, "Discrete Optimization via Marginal Analysis", Management Science, 13:210--216, 1966.
[10]
G. Frederickson and D. Johnson, "The Complexity of Selection and Ranking in X+Y and Matrices with Sorted Columns", Journal of Computer and System Science, 24:197--208, 1982.
[11]
Z. Galil and N. Megiddo, "A Fast Selection Algorithm and the Problem of Optimum Distribution of Efforts", Journal of the ACM, 26:58--64, 1981.
[12]
Ibaraki, T., and Katoh, N., "Resource Allocation Problems: Algorithmic Approaches", MIT Press, Cambridge, MA, 1988.
[13]
International Business Machines Corporation, Optimization Subroutine Library Guide and Reference, IBM, 1995.
[14]
N. Katoh and T. Ibaraki, "Resource Allocation Problems", in Handbook of Combinatorial Optimization, D-Z. Du and P. Pardalos, editors, Kluwer Academic Press, 2000.
[15]
D. Knuth, The Art of Computer Programming, vol. 2, Addison Wesley, 1973.
[16]
A. Iyengar, M. Squillante and L. Zhang, "Analysis and Characterization of Large-Scale Web Server Access Patterns and Performance", World Wide Web, 2:85--100, 1999.
[17]
S. Lawrence and C. Giles, "Accessibility of Information on the Web", Nature, 400:107--109, 1999.
[18]
G. Nemhauser and L. Wolsey, Integer and Combinatorial Optimization, J. Wiley, 1988.
[19]
V. N. Padmanabhan and L. Qiu. "The Content and Access Dynamics of a Busy Web Site: Findings and Implications", ACM SIGCOMM '00 Conference, 2000.
[20]
M. Pinedo, Scheduling: Theory, Algorithms and Systems, Prentice-Hall, 1995.
[21]
J. Pitkow and P. Pirolli, "Life, Death and Lawfulness on the Electronic Frontier", CHI Conference on Human Factors in Computing Systems, 1997.
[22]
W. Press, B. Flannery, S. Teukolsky and W. Vetterling, Numerical Recipes, Cambridge University Press, 1986.
[23]
S. M. Ross. Stochastic Processes. John Wiley and Sons, Second Edition, 1997.
[24]
K. Sigman. Stationary Marked Point Processes: An Intuitive Approach. Chapman and Hall, 1995.
[25]
M. Squillante, D. Yao and L. Zhang, "Web Traffic Modeling and Web Server Performance Analysis", IEEE Conference on Decision and Control, 1999.
[26]
J. Talim, Z. Liu, P. Nain and E. Coffman, "Optimizing the Number of Robots for Web Search Engines", Telecommunication Systems Journal, 17(1-2):234-243, 2001.
[27]
C. Wills and M. Mikhailov, "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching", WWW Conference, 1999.
[28]
R. W. Wolff. Stochastic Modeling and the Theory of Queues. Prentice Hall, 1989.
[29]
G. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, 1949.

Cited By

View all
  • (2022)Scalability Challenges in Web Search EnginesundefinedOnline publication date: 10-Mar-2022
  • (2022)Search-Based ApplicationsundefinedOnline publication date: 14-Mar-2022
  • (2021)Crawler by Contextual InferenceSN Computer Science10.1007/s42979-021-00574-z2:3Online publication date: 16-Apr-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
WWW '02: Proceedings of the 11th international conference on World Wide Web
May 2002
754 pages
ISBN:1581134495
DOI:10.1145/511446
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 May 2002

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

WWW02
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)25
  • Downloads (Last 6 weeks)3
Reflects downloads up to 22 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Scalability Challenges in Web Search EnginesundefinedOnline publication date: 10-Mar-2022
  • (2022)Search-Based ApplicationsundefinedOnline publication date: 14-Mar-2022
  • (2021)Crawler by Contextual InferenceSN Computer Science10.1007/s42979-021-00574-z2:3Online publication date: 16-Apr-2021
  • (2020)Online learning for active cache synchronizationProceedings of the 37th International Conference on Machine Learning10.5555/3524938.3525436(5371-5380)Online publication date: 13-Jul-2020
  • (2020)Adversarial Bandits Policy for Crawling Commercial Web ContentProceedings of The Web Conference 202010.1145/3366423.3380125(407-417)Online publication date: 20-Apr-2020
  • (2019)Staying up to date with online content changes using reinforcement learning for schedulingProceedings of the 33rd International Conference on Neural Information Processing Systems10.5555/3454287.3454340(581-591)Online publication date: 8-Dec-2019
  • (2019)Optimal Freshness Crawl Under Politeness ConstraintsProceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3331184.3331241(495-504)Online publication date: 18-Jul-2019
  • (2019)Predictive Crawling for Commercial Web ContentThe World Wide Web Conference10.1145/3308558.3313694(627-637)Online publication date: 13-May-2019
  • (2019)Topic-Based Influence Computation in Social Networks Under Resource ConstraintsIEEE Transactions on Services Computing10.1109/TSC.2016.261968812:6(970-986)Online publication date: 1-Nov-2019
  • (2018)Analysing performance in retrieving heterogeneous information in big data environmentIndian Journal of Science and Technology10.17485/ijst/2018/v11i20/12334511:20(1-7)Online publication date: 1-May-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media