research-article

An efficient content extraction method for webpage based on tag-line-block analysis

Published: 24 August 2023

Abstract

The World Wide Web is a vast information resource that can be used in a broad range of applications. Web content extraction is an efficient way to derive valuable information from webpages, and many efforts have been made on this subject. However, owing to the increasing complexity of webpage technology, existing methods do not adequately meet the requirements of webpage content extraction. This paper proposes an improved content extraction method based on Cx-Extractor that can handle different types of webpages. The proposed method makes several improvements: (1) Hyperlink tags are not removed directly, which avoids mistaking dense hyperlink groups for the main content. (2) The starting point of the main content is taken as the line number of the tag-line-block whose size exceeds the threshold, so the first few short texts of the main content are retained. (3) The tag-line-block threshold for the main content is calculated automatically instead of being set manually. These three changes improve the accuracy of the extracted content. Moreover, (4) blank spaces in the original webpage text are retained, which improves the readability of the extracted content by preventing English words from being run together. (5) Multimedia information (e.g., pictures and videos) can be selectively retained by users, which allows flexible use across multiple industries. Experiments on real-world webpages show that the proposed method works well for both single-content and multi-content webpages. Furthermore, its performance was compared with Cx-Extractor, a Chinese extraction method, and Readability, an English extraction method. The proposed method outperforms both in precision, recall, and readability. In addition, its extraction efficiency is superior to that of Readability.
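
The line-block idea that the abstract builds on can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the block width of three lines, the mean-based automatic threshold, and every function name below are assumptions made only for illustration. The sketch also does not model improvement (1) on hyperlink tags or improvement (5) on multimedia.

```python
import re
from statistics import mean

BLOCK_WIDTH = 3  # assumed number of consecutive lines per block


def strip_tags(html: str) -> list[str]:
    """Remove scripts, styles, and tags; return one text line per source line."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    html = re.sub(r"<[^>]+>", "", html)
    # Collapse runs of spaces and tabs to one space so words stay separated.
    return [re.sub(r"[ \t]+", " ", line).strip() for line in html.split("\n")]


def block_sizes(lines: list[str], width: int = BLOCK_WIDTH) -> list[int]:
    """Size of block i is the character count of lines i .. i+width-1."""
    return [sum(len(t) for t in lines[i:i + width]) for i in range(len(lines))]


def auto_threshold(sizes: list[int]) -> float:
    """Assumed rule: the mean size of all non-empty blocks."""
    nonzero = [s for s in sizes if s > 0]
    return mean(nonzero) if nonzero else 0.0


def extract_main_content(html: str) -> str:
    """Return the text from the first above-threshold block to the next empty block."""
    lines = strip_tags(html)
    sizes = block_sizes(lines)
    threshold = auto_threshold(sizes)
    # The start is the line number of the first block whose size exceeds the
    # threshold, so short lines just before the dense text stay in the block.
    start = next((i for i, s in enumerate(sizes) if s > threshold), None)
    if start is None:
        return ""
    end = start
    while end < len(lines) and sizes[end] > 0:
        end += 1
    return "\n".join(t for t in lines[start:end] if t)
```

Calling `extract_main_content(html)` on a page whose article text is separated from navigation and footer lines by blank lines returns the article lines joined by newlines. Because the start is a block's line number rather than the first long line, a short opening sentence directly above the dense text is kept.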



Information

Published In

Soft Computing - A Fusion of Foundations, Methodologies and Applications, Volume 27, Issue 20
Oct 2023
881 pages
ISSN: 1432-7643
EISSN: 1433-7479

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 24 August 2023
Accepted: 29 July 2023

Author Tags

  1. Web content extraction
  2. Tag-line-block distribution function
  3. Tag semantic information
  4. Automatic threshold setting

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
