research-article

Language independent web news extraction system based on text detection framework

Published: 10 May 2016

Abstract

Web news provides a direct and efficient way to construct large text corpora. The creation of text data requires an understanding of HTML code and the preparation of customized parsing rules to identify text content in a webpage. Typically, parsing rules are written manually and cannot be applied to pages with different layouts. In this study, we present a web news extraction system that is based on a text detection framework. The proposed method scans the input HTML page and creates text statistics as a projection profile. Then, text block identification is applied to determine a set of content candidates. To filter noise, text verification determines whether a given text block can be included with content. We evaluate the proposed approach with the L3S-GN1 corpus and 3506 multilingual news data items randomly sampled from 325 websites (15 geographic regions and 11 distinct languages). We also compare the proposed method to 23 well-known state-of-the-art techniques. The experimental results show that the proposed method outperforms the second best method (NReadability) by 7.30% in the macro F-measure rate and is 16.91 times faster than NReadability. In terms of the perfect rate, the proposed method demonstrates 46.38% accuracy, whereas the Boilerpipe algorithm demonstrates only 21.54% accuracy. The proposed method is very useful for constructing a multilingual corpus because it requires no language-specific processing component.
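
The pipeline named in the abstract (projection profile, text block identification, text verification) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the profile is computed or how blocks are verified, so the sketch assumes a per-line count of visible characters, a gap tolerance of one empty line (`max_gap`), and a simple size threshold relative to the largest block (`ratio`). The function names `projection_profile`, `find_blocks`, and `verify_blocks` are hypothetical.

```python
from html.parser import HTMLParser


class _ProfileParser(HTMLParser):
    """Adds the visible-text length of every source line to a shared profile."""

    SKIPPED = {"script", "style"}  # assumption: their content is never news text

    def __init__(self, profile):
        super().__init__()
        self.profile = profile
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        lineno, _ = self.getpos()  # 1-based line where this text run starts
        for i, part in enumerate(data.split("\n")):
            self.profile[lineno - 1 + i] += len(part.strip())


def projection_profile(html):
    """Return one visible-character count per line of the HTML source."""
    profile = [0] * len(html.split("\n"))
    parser = _ProfileParser(profile)
    parser.feed(html)
    parser.close()
    return profile


def find_blocks(profile, max_gap=1):
    """Group text-bearing lines into (start, end, total_chars) blocks.

    Lines are merged when at most `max_gap` empty lines separate them
    (an assumed rule, chosen only for illustration).
    """
    blocks = []
    start = end = None
    for i, count in enumerate(profile):
        if count == 0:
            continue
        if start is not None and i - end - 1 <= max_gap:
            end = i
        else:
            if start is not None:
                blocks.append((start, end, sum(profile[start:end + 1])))
            start = end = i
    if start is not None:
        blocks.append((start, end, sum(profile[start:end + 1])))
    return blocks


def verify_blocks(blocks, ratio=0.3):
    """Keep blocks holding at least `ratio` of the largest block's text.

    A stand-in for the paper's text verification step, which the
    abstract does not describe.
    """
    if not blocks:
        return []
    largest = max(total for _, _, total in blocks)
    return [block for block in blocks if block[2] >= ratio * largest]
```

Calling `verify_blocks(find_blocks(projection_profile(page)))` on a page's source returns the line ranges most likely to hold the article body under these assumptions. Nothing in the sketch depends on the language of the text, which mirrors the language-independence claim, although real pages would need more careful handling of minified or single-line HTML.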


    Published In

    Information Sciences: an International Journal, Volume 342, Issue C
    May 2016
    222 pages

    Publisher

    Elsevier Science Inc.

    United States

    Author Tags

    1. Block segmentation
    2. Content extraction
    3. HTML
    4. Information filtering
    5. Web mining

    Cited By

    • (2023) Scraping Relevant Images from Web Pages without Download, ACM Transactions on the Web, 18:1 (1-27), doi:10.1145/3616849. Online publication date: 19-Aug-2023
    • (2023) An efficient regular expression inference approach for relevant image extraction, Applied Soft Computing, 135:C, doi:10.1016/j.asoc.2023.110030. Online publication date: 1-Mar-2023
    • (2023) An efficient content extraction method for webpage based on tag-line-block analysis, Soft Computing - A Fusion of Foundations, Methodologies and Applications, 27:20 (14631-14645), doi:10.1007/s00500-023-09076-x. Online publication date: 1-Oct-2023
    • (2018) Misspelling Error Detection in Thai Language Application, Proceedings of the 6th International Conference on Information Technology: IoT and Smart City (19-24), doi:10.1145/3301551.3301584. Online publication date: 29-Dec-2018
    • (2017) Box clustering segmentation, Information Processing and Management: an International Journal, 53:3 (735-750), doi:10.1016/j.ipm.2017.02.002. Online publication date: 1-May-2017
    • (2017) An intelligent system for livestock disease surveillance, Information Sciences: an International Journal, 378:C (26-47), doi:10.1016/j.ins.2016.10.026. Online publication date: 1-Feb-2017
