DOI: 10.1145/2063576.2063592

Discovering URLs through user feedback

Published: 24 October 2011

Abstract

Search engines rely on crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure of the Web. Because the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long and, depending on how a given page is connected to others, such a crawler may miss some pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. The mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the Yahoo! toolbar logs to characterize the URLs that users access via their browsers but that the Yahoo! Web crawler does not discover. We show that a high fraction of the URLs that appear in toolbar logs are not discovered by the crawler. We also show that a certain fraction of URLs are discovered by the crawler only after users have first accessed them. One important conclusion of our work is that Web search engines can benefit substantially from user feedback, in the form of toolbar logs, for passive URL discovery.
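The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the log format (pairs of URL and access timestamp), the crawler's discovery table (URL mapped to discovery timestamp), and all names such as `DiscoveryStats` and `compare_with_crawler` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class DiscoveryStats:
    total: int          # distinct URLs seen in the toolbar log
    never_crawled: int  # URLs the crawler never discovered
    crawled_late: int   # URLs the crawler discovered after the first user access


def first_access_times(toolbar_log: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Map each URL to the earliest timestamp at which a user accessed it."""
    first_seen: Dict[str, int] = {}
    for url, timestamp in toolbar_log:
        if url not in first_seen or timestamp < first_seen[url]:
            first_seen[url] = timestamp
    return first_seen


def compare_with_crawler(
    toolbar_log: Iterable[Tuple[str, int]],
    crawler_discovery: Dict[str, int],
) -> DiscoveryStats:
    """Count toolbar URLs the crawler missed or discovered late.

    Both inputs are hypothetical: toolbar_log holds (url, access_time) pairs,
    and crawler_discovery maps a URL to the time the crawler first saw it.
    """
    first_seen = first_access_times(toolbar_log)
    never_crawled = 0
    crawled_late = 0
    for url, accessed_at in first_seen.items():
        discovered_at = crawler_discovery.get(url)
        if discovered_at is None:
            never_crawled += 1
        elif discovered_at > accessed_at:
            crawled_late += 1
    return DiscoveryStats(len(first_seen), never_crawled, crawled_late)


if __name__ == "__main__":
    log = [("a", 10), ("a", 5), ("b", 7), ("c", 3)]
    crawler = {"a": 8, "c": 1}
    print(compare_with_crawler(log, crawler))
```

In this toy input, URL `b` is never discovered by the crawler and URL `a` is discovered only after its first user access; these are the two cases a passive source such as toolbar logs would help with.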



Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. search engines
  2. toolbar
  3. url discovery
  4. web crawling

Qualifiers

  • Research-article

Conference

CIKM '11

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


