An automatic wrapper generation process for large scale crawling of news websites

Authors:

Iraklis Varlamis,

Nikos Tsirakis,

Vasilis Poulopoulos,

Panagiotis Tsantilas

PCI '14: Proceedings of the 18th Panhellenic Conference on Informatics

Pages 1 - 6

https://doi.org/10.1145/2645791.2645824

Published: 02 October 2014

Abstract

The creation and maintenance of a large-scale news content aggregator is a tedious task that requires more than a simple RSS aggregator. Many news sites appear on the Internet every day and publish new content at different refresh rates. Well-established news sites restrict access to their content to subscribers or online readers without offering RSS feeds, whereas other sites update their CMS or website template and cause crawlers to run into fetch errors. The main problem that arises from this continuous generation and alteration of pages on the Internet is the automated discovery of the appropriate, useful content and of the dynamic rules that crawlers need to apply in order not to become outdated. In this paper we present an innovative mechanism for extracting useful content (title, body and media) from news article web pages, based on the automatic extraction of the patterns that form each domain. The system achieves high performance by combining information gathered while discovering the structure of a news site with "knowledge" that it acquires at each crawling step, in order to improve the quality of the next steps of its own procedure. Additionally, the system can recognize changes in patterns and rebuild the domain rules whenever the domain changes structure. This system has been successfully implemented in palo.rs, the first news search engine in Serbia.
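
The abstract does not spell out the extraction algorithm, so the following is only a minimal sketch of the general idea of per-domain wrapper induction, not the authors' method. It assumes that text repeated verbatim across sample pages of a domain is template boilerplate, that the longest varying block is the article body, and that the shortest other varying block is the title; media extraction is left out. All names (`PathTextParser`, `induce_rules`, `apply_rules`, `rules_outdated`) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextParser(HTMLParser):
    """Collects visible text keyed by a simple tag.class path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = defaultdict(str)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1].split(".")[0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.text_by_path["/".join(self.stack)] += text + " "


def extract_paths(html):
    """Return a mapping of element path to the text found under it."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return {path: text.strip() for path, text in parser.text_by_path.items()}


def induce_rules(sample_pages):
    """Learn title/body paths from several article pages of one domain."""
    samples = [extract_paths(page) for page in sample_pages]
    common = set(samples[0]).intersection(*samples[1:])
    # Assumption: text that differs on every page is content, not template.
    varying = sorted(
        p for p in common if len({s[p] for s in samples}) == len(samples)
    )
    if not varying:
        raise ValueError("no varying text block shared by all sample pages")

    def avg_len(path):
        return sum(len(s[path]) for s in samples) / len(samples)

    rules = {"body": max(varying, key=avg_len)}
    rest = [p for p in varying if p != rules["body"]]
    if rest:
        rules["title"] = min(rest, key=avg_len)
    return rules


def apply_rules(rules, html):
    """Extract each field with its learned path; None if the path is gone."""
    paths = extract_paths(html)
    return {field: paths.get(path) for field, path in rules.items()}


def rules_outdated(rules, html):
    """True when a learned path no longer matches, hinting at a redesign."""
    return any(value is None for value in apply_rules(rules, html).values())
```

In such a sketch, `induce_rules` would run once per domain on a handful of article pages, `apply_rules` on every newly crawled page, and a run of `rules_outdated` hits would trigger a fresh call to `induce_rules`, mirroring the rule rebuilding that the abstract describes.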

Index Terms

An automatic wrapper generation process for large scale crawling of news websites
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Information & Contributors

Information

Published In

PCI '14: Proceedings of the 18th Panhellenic Conference on Informatics

October 2014

355 pages

ISBN: 9781450328975

DOI: 10.1145/2645791

General Chairs:
Katsikas Sokratis, Department of Digital Systems, University of Piraeus
Hatzopoulos Michael, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Apostolopoulos Theodoros, Department of Informatics, Athens University of Economics and Business
Anagnostopoulos Dimosthenis, Department of Informatics & Telematics, Harokopio University of Athens

Program Chairs:
Carayiannis Elias, Department of Systems & Technology Management, School of Business, George Washington University
Varvarigou Theodora, School of Electrical and Computer Engineering, National Technical University of Athens
Nikolaidou Mara, Department of Informatics & Telematics, Harokopio University of Athens

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Greek Com Soc: Greek Computer Society
Univ. of Piraeus: University of Piraeus
National and Kapodistrian University of Athens: National and Kapodistrian University of Athens
Athens U of Econ & Business: Athens University of Economics and Business

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

PCI '14

PCI '14: 18th Panhellenic Conference on Informatics

October 2 - 4, 2014

Athens, Greece

Acceptance Rates

PCI '14 Paper Acceptance Rate 51 of 102 submissions, 50%;

Overall Acceptance Rate 190 of 390 submissions, 49%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 219
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 2

Reflects downloads up to 13 Nov 2024

Citations

Cited By

Jamil H. (2024). Smart Science Needs Linked Open Data with a Dash of Large Language Models and Extended Relations. Proceedings of the Seventh International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, pages 1-11. DOI: 10.1145/3663742.3663971. Online publication date: 14-Jun-2024.
https://dl.acm.org/doi/10.1145/3663742.3663971
Osial P., Kim A., Kauranen K. (2019). Cascading Workflow of Healthcare Services. International Journal of Extreme Automation and Connectivity in Healthcare, 1:1, pages 79-95. DOI: 10.4018/IJEACH.2019010108. Online publication date: Jan-2019.
https://doi.org/10.4018/IJEACH.2019010108
Poulopoulos V., Wallace M., Varlamis I., Caridakis G., Tsantilas P. (2019). PaloAnalytics: project concept, scope and early results from the system implementation. 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), pages 1-4. DOI: 10.1109/IISA.2019.8900696. Online publication date: Jul-2019.
https://doi.org/10.1109/IISA.2019.8900696
Tsirakis N., Poulopoulos V., Tsantilas P., Varlamis I. (2015). A Platform for Real-Time Opinion Mining from Social Media and News Streams. Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 02, pages 223-228. DOI: 10.1109/Trustcom.2015.587. Online publication date: 20-Aug-2015.
https://dl.acm.org/doi/10.1109/Trustcom.2015.587
