research-article

Discovering URLs through user feedback

Authors:

B. Barla Cambazoglu,

Flavio P. Junqueira

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Pages 77 - 86

https://doi.org/10.1145/2063576.2063592

Published: 24 October 2011

Abstract

Search engines rely on crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the page altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers but not discovered by the Yahoo! Web crawler. We show that a high fraction of the URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that Web search engines can benefit substantially from user feedback in the form of toolbar logs for passive URL discovery.
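
The comparison described in the abstract can be illustrated with a minimal sketch, assuming each data source has been reduced to a map from URL to the time it was first seen. The names `compare_discovery` and `DiscoveryStats` and the toy data are hypothetical and are not taken from the paper, whose actual methodology may differ.

```python
from dataclasses import dataclass


@dataclass
class DiscoveryStats:
    total: int          # distinct URLs seen in the toolbar log
    never_crawled: int  # toolbar URLs the crawler never discovered
    crawled_late: int   # toolbar URLs the crawler discovered after the first user access


def compare_discovery(toolbar_first_seen, crawler_first_seen):
    """Compare first-access times from a toolbar log with crawler discovery times.

    Both arguments map a URL string to a timestamp (any mutually comparable type).
    """
    never = 0
    late = 0
    for url, user_time in toolbar_first_seen.items():
        crawl_time = crawler_first_seen.get(url)
        if crawl_time is None:
            never += 1   # passive discovery would be the only source for this URL
        elif crawl_time > user_time:
            late += 1    # users reached the page before the crawler did
    return DiscoveryStats(len(toolbar_first_seen), never, late)


# Toy data, invented for illustration only (timestamps are day numbers).
toolbar = {"http://a.example/1": 3, "http://a.example/2": 5, "http://b.example/x": 7}
crawler = {"http://a.example/1": 1, "http://a.example/2": 9}
stats = compare_discovery(toolbar, crawler)
print(stats)
```

In this sketch, URLs counted in `never_crawled` or `crawled_late` are the ones for which a toolbar-based feed could have helped the crawler; dividing either count by `total` gives the corresponding fraction.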


Index Terms

Discovering URLs through user feedback
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Relevance assessment




Published In


CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

October 2011

2712 pages

ISBN: 9781450307178

DOI: 10.1145/2063576

Editors: Bettina Berendt, Arjen de Vries, Wenfei Fan, Craig Macdonald (University of Glasgow, UK), Iadh Ounis (University of Glasgow, UK), Ian Ruthven (University of Strathclyde, UK)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org


Publisher

Association for Computing Machinery

New York, NY, United States


Conference

CIKM '11


CIKM '11: International Conference on Information and Knowledge Management

October 24 - 28, 2011

Glasgow, Scotland, UK

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%



