Research article
DOI: 10.1145/2645791.2645824

An automatic wrapper generation process for large scale crawling of news websites

Published: 02 October 2014

Abstract

Creating and maintaining a large-scale news content aggregator is a tedious task that requires more than a simple RSS aggregator. New news sites appear on the Internet every day and publish content at different refresh rates; well-established news sites restrict access to their content to subscribers or online readers and offer no RSS feeds, while other sites update their CMS or website template and cause crawlers to hit fetch errors. The main problem arising from this continuous generation and alteration of pages is the automated discovery of appropriate, useful content, together with the dynamic rules that crawlers must apply so that they do not become outdated. In this paper we present an innovative mechanism for extracting useful content (title, body and media) from news article web pages, based on automatic extraction of the patterns that form each domain. The system achieves high performance by combining information gathered while discovering the structure of a news site with "knowledge" that it acquires at each crawling step, in order to improve the quality of the subsequent steps of its own procedure. Additionally, the system can recognize changes in patterns and rebuild the domain rules whenever a domain changes structure. The system has been successfully implemented in palo.rs, the first news search engine in Serbia.
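
The abstract does not spell out how the per-domain patterns are derived, so the following Python sketch is only an illustration of the general idea and not the authors' algorithm. It assumes a simple template-comparison heuristic: tag paths that occur on every sample page of a domain but carry different text on each page are treated as content rules, and a rule that no longer matches a newly crawled page signals that the domain rules need rebuilding. All names (`PathTextCollector`, `learn_content_paths`, `extract`) and the sample pages are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Collects the text found under each tag path, e.g. 'html/body/div.story/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1].split(".")[0] in {"script", "style"}:
            return  # code and CSS are never article content
        self.texts["/".join(self.stack)].append(text)


def path_texts(html):
    """Map each tag path in the page to the concatenated text found under it."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return {path: " ".join(chunks) for path, chunks in collector.texts.items()}


def learn_content_paths(pages):
    """Keep paths present on every sample page whose text differs between pages."""
    per_page = [path_texts(page) for page in pages]
    shared = set(per_page[0]).intersection(*per_page[1:])
    return {p for p in shared if len({texts[p] for texts in per_page}) > 1}


def extract(html, content_paths):
    """Apply the learned rules; return None when a rule no longer matches."""
    texts = path_texts(html)
    if not content_paths.issubset(texts):
        return None  # structure changed: the domain rules should be rebuilt
    return {path: texts[path] for path in content_paths}


# Hypothetical sample pages from one domain, plus one page after a redesign.
PAGE_A = """<html><body><div class="nav">Home | World | Sport</div>
<h1 class="headline">Flood warning issued</h1>
<div class="story"><p>Rivers rose overnight.</p><p>Roads are closed.</p></div>
<div class="footer">Contact us</div></body></html>"""
PAGE_B = """<html><body><div class="nav">Home | World | Sport</div>
<h1 class="headline">Election results announced</h1>
<div class="story"><p>Votes were counted.</p><p>Turnout was high.</p></div>
<div class="footer">Contact us</div></body></html>"""
PAGE_REDESIGNED = """<html><body><article><h2>New layout</h2>
<section>Text in a new template.</section></article></body></html>"""

rules = learn_content_paths([PAGE_A, PAGE_B])
print(sorted(rules))
print(extract(PAGE_B, rules))
print(extract(PAGE_REDESIGNED, rules))
```

In this sketch the navigation bar and footer are discarded because their text is identical on both sample pages, while the headline and story paths survive as rules. A production crawler would need far more than this (media extraction, many sample pages per domain, tolerance for optional elements), which is the kind of problem the paper addresses.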

Published In

PCI '14: Proceedings of the 18th Panhellenic Conference on Informatics
October 2014
355 pages
ISBN: 9781450328975
DOI: 10.1145/2645791
General Chairs: Katsikas Sokratis, Hatzopoulos Michael, Apostolopoulos Theodoros, Anagnostopoulos Dimosthenis
Program Chairs: Carayiannis Elias, Varvarigou Theodora, Nikolaidou Mara
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Greek Computer Society
  • University of Piraeus
  • National and Kapodistrian University of Athens
  • Athens University of Economics and Business

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Web Mining
  2. Web crawling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PCI '14

Acceptance Rates

PCI '14 Paper Acceptance Rate 51 of 102 submissions, 50%;
Overall Acceptance Rate 190 of 390 submissions, 49%

Article Metrics

  • Downloads (last 12 months): 9
  • Downloads (last 6 weeks): 2
Reflects downloads up to 13 Nov 2024.

Cited By

  • (2024) Smart Science Needs Linked Open Data with a Dash of Large Language Models and Extended Relations. Proceedings of the Seventh International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, pages 1-11. DOI: 10.1145/3663742.3663971. Online publication date: 14-Jun-2024
  • (2019) Cascading Workflow of Healthcare Services. International Journal of Extreme Automation and Connectivity in Healthcare, 1:1, pages 79-95. DOI: 10.4018/IJEACH.2019010108. Online publication date: Jan-2019
  • (2019) PaloAnalytics: project concept, scope and early results from the system implementation. 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), pages 1-4. DOI: 10.1109/IISA.2019.8900696. Online publication date: Jul-2019
  • (2015) A Platform for Real-Time Opinion Mining from Social Media and News Streams. Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 02, pages 223-228. DOI: 10.1109/Trustcom.2015.587. Online publication date: 20-Aug-2015
