research-article

Language independent web news extraction system based on text detection framework

Author:

Yu-Chieh Wu

Information Sciences—Informatics and Computer Science, Intelligent Systems, Applications: An International Journal, Volume 342, Issue C

Pages 132 - 149

https://doi.org/10.1016/j.ins.2015.12.025

Published: 10 May 2016

Abstract

Web news provides a direct and efficient way to construct large text corpora. The creation of text data requires an understanding of HTML code and the preparation of customized parsing rules to identify text content in a webpage. Typically, parsing rules are written manually and cannot be applied to pages with different layouts. In this study, we present a web news extraction system that is based on a text detection framework. The proposed method scans the input HTML page and creates text statistics as a projection profile. Then, text block identification is applied to determine a set of content candidates. To filter noise, text verification determines whether a given text block can be included with content. We evaluate the proposed approach with the L3S-GN1 corpus and 3506 multilingual news data items randomly sampled from 325 websites (15 geographic regions and 11 distinct languages). We also compare the proposed method to 23 well-known state-of-the-art techniques. The experimental results show that the proposed method outperforms the second best method (NReadability) by 7.30% in the macro F-measure rate and is 16.91 times faster than NReadability. In terms of the perfect rate, the proposed method demonstrates 46.38% accuracy, whereas the Boilerpipe algorithm demonstrates only 21.54% accuracy. The proposed method is very useful for constructing a multilingual corpus because it requires no language-specific processing component.
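
The three stages named in the abstract (projection profile, text block identification, text verification) can be illustrated with a minimal Python sketch. This is not the author's implementation: the per-line character count used as the profile, the function names, and the `max_gap` and `min_total` thresholds are all assumptions chosen only to show the general shape of such a pipeline.

```python
import re

# Assumption: script/style bodies and comments never count as visible text.
_BLOCK = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->", re.S | re.I)
_TAG = re.compile(r"<[^>]*>")


def projection_profile(html):
    """Return the number of non-whitespace text characters on each source line."""
    # Blank out script/style/comment spans but keep their newlines so that
    # profile indices still match line numbers in the original page.
    cleaned = _BLOCK.sub(lambda m: "\n" * m.group(0).count("\n"), html)
    profile = []
    for line in cleaned.split("\n"):
        # Crude tag stripping; tags that span several lines are not handled.
        text = _TAG.sub("", line)
        profile.append(sum(1 for ch in text if not ch.isspace()))
    return profile


def find_blocks(profile, min_chars=1, max_gap=1):
    """Group text-bearing lines into (start, end) blocks.

    Runs separated by more than `max_gap` empty lines become separate blocks.
    """
    blocks = []
    start = last = None
    for i, count in enumerate(profile):
        if count < min_chars:
            continue
        if start is None:
            start = i
        elif i - last - 1 > max_gap:
            blocks.append((start, last))
            start = i
        last = i
    if start is not None:
        blocks.append((start, last))
    return blocks


def verify_blocks(profile, blocks, min_total=40):
    """Keep only blocks whose total text mass reaches `min_total` characters."""
    return [(s, e) for s, e in blocks if sum(profile[s:e + 1]) >= min_total]


if __name__ == "__main__":
    page = "\n".join([
        "<html><body>",
        "<div><a href='/'>Home</a></div>",
        "",
        "",
        "<p>The first paragraph of the story has plenty of text in it.</p>",
        "<p>A second paragraph continues the article body here.</p>",
        "",
        "",
        "<div>Ads</div>",
        "</body></html>",
    ])
    prof = projection_profile(page)
    print(verify_blocks(prof, find_blocks(prof)))
```

Because the sketch only counts characters and never inspects words, it has no language-specific component, which mirrors the property the abstract emphasizes. A real system would need a more careful profile and verification step than these fixed thresholds.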

Information & Contributors

Information

Published In

Information Sciences: an International Journal Volume 342, Issue C

May 2016

222 pages

ISSN:0020-0255

Copyright © Elsevier Inc.

Publisher

Elsevier Science Inc.

United States

Cited By

Uzun E (2023) Scraping Relevant Images from Web Pages without Download. ACM Transactions on the Web 18:1 (1-27). DOI: 10.1145/3616849. Online publication date: 19-Aug-2023
https://dl.acm.org/doi/10.1145/3616849
Agun H, Uzun E (2023) An efficient regular expression inference approach for relevant image extraction. Applied Soft Computing 135:C. DOI: 10.1016/j.asoc.2023.110030. Online publication date: 1-Mar-2023
https://dl.acm.org/doi/10.1016/j.asoc.2023.110030
Chen Z, Zhou J, Sun R (2023) An efficient content extraction method for webpage based on tag-line-block analysis. Soft Computing - A Fusion of Foundations, Methodologies and Applications 27:20 (14631-14645). DOI: 10.1007/s00500-023-09076-x. Online publication date: 1-Oct-2023
https://dl.acm.org/doi/10.1007/s00500-023-09076-x
Somboonsak P (2018) Misspelling Error Detection in Thai Language Application. Proceedings of the 6th International Conference on Information Technology: IoT and Smart City (19-24). DOI: 10.1145/3301551.3301584. Online publication date: 29-Dec-2018
https://dl.acm.org/doi/10.1145/3301551.3301584
Zeleny J, Burget R, Zendulka J (2017) Box clustering segmentation. Information Processing and Management: an International Journal 53:3 (735-750). DOI: 10.1016/j.ipm.2017.02.002. Online publication date: 1-May-2017
https://dl.acm.org/doi/10.1016/j.ipm.2017.02.002
Yazdanbakhsh O, Zhou Y, Dick S (2017) An intelligent system for livestock disease surveillance. Information Sciences: an International Journal 378:C (26-47). DOI: 10.1016/j.ins.2016.10.026. Online publication date: 1-Feb-2017
https://dl.acm.org/doi/10.1016/j.ins.2016.10.026