
Constructing Novel Block Layouts for Webpage Analysis

Published: 10 July 2019

Abstract

Webpage segmentation is a basic building block for a wide range of webpage analysis methods. The rapid development of Web technologies has produced more dynamic and complex webpages, which bring new challenges to this area. To improve segmentation performance, we propose a two-stage method that combines the visual, logical, and semantic features of the content on a webpage. In the first stage, we devise a new model that measures the similarity of webpage elements based on both visual layout and logical organization. In the second stage, we propose a novel block regrouping method that uses semantic statistics and visual positions. This two-stage method can effectively segment complicated and dynamic webpages. We verify its performance and accuracy by comparing it with two existing webpage segmentation methods. The experimental results show that the proposed method significantly outperforms the existing state of the art in precision, recall, and accuracy.
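
The abstract gives no formulas, so the following is only a minimal sketch of what a two-stage pipeline of this general shape might look like, not the authors' model. Every name here (`Element`, `similarity`, `blocks_related`, `segment`), the equal weighting of visual and logical similarity, the thresholds, the Jaccard word overlap used as the "semantic statistic", and the single-link grouping are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Element:
    """A rendered page element (hypothetical representation)."""
    dom_path: tuple  # e.g. ("html", "body", "nav", "a")
    box: tuple       # (left, top, right, bottom) in pixels
    text: str


def box_gap(p, q):
    """Horizontal and vertical gap between two boxes (0 if they overlap)."""
    dx = max(p[0] - q[2], q[0] - p[2], 0)
    dy = max(p[1] - q[3], q[1] - p[3], 0)
    return dx, dy


def visual_similarity(a, b, scale=200.0):
    """1.0 for touching boxes, falling linearly to 0.0 at `scale` pixels apart."""
    dx, dy = box_gap(a.box, b.box)
    return max(0.0, 1.0 - (dx * dx + dy * dy) ** 0.5 / scale)


def logical_similarity(a, b):
    """Length of the shared DOM-path prefix divided by the longer path length."""
    shared = 0
    for x, y in zip(a.dom_path, b.dom_path):
        if x != y:
            break
        shared += 1
    return shared / max(len(a.dom_path), len(b.dom_path))


def similarity(a, b, weight=0.5):
    """Assumed combination: weighted mean of visual and logical similarity."""
    return weight * visual_similarity(a, b) + (1 - weight) * logical_similarity(a, b)


def cluster(items, is_linked):
    """Single-link grouping via union-find: linked items share a group."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if is_linked(items[i], items[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def blocks_related(a, b, min_overlap=0.3, max_gap=50):
    """Stage-2 test: enough shared vocabulary and close enough on the page."""
    words_a = {w.lower() for e in a for w in e.text.split()}
    words_b = {w.lower() for e in b for w in e.text.split()}
    union = words_a | words_b
    jaccard = len(words_a & words_b) / len(union) if union else 0.0

    def bounds(block):
        return (min(e.box[0] for e in block), min(e.box[1] for e in block),
                max(e.box[2] for e in block), max(e.box[3] for e in block))

    dx, dy = box_gap(bounds(a), bounds(b))
    return jaccard >= min_overlap and max(dx, dy) <= max_gap


def segment(elements, threshold=0.8):
    """Stage 1 groups elements into blocks; stage 2 regroups related blocks."""
    blocks = cluster(elements, lambda a, b: similarity(a, b) >= threshold)
    regrouped = cluster(blocks, blocks_related)
    return [[e for block in group for e in block] for group in regrouped]
```

In this sketch, two adjacent paragraphs that sit under different DOM containers can fall below the stage-1 threshold and start as separate blocks, then be merged in stage 2 because they share vocabulary and lie close together on the page. That is the kind of case a second, semantics-aware pass is meant to catch.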



Published In

ACM Transactions on Internet Technology  Volume 19, Issue 3
Special Section on Advances in Internet-Based Collaborative Technologies
August 2019
289 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/3329912
  • Editor: Ling Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 July 2019
Accepted: 01 April 2019
Revised: 01 March 2019
Received: 01 April 2018
Published in TOIT Volume 19, Issue 3


Author Tags

  1. Webpage segmentation
  2. clustering
  3. semantic regrouping
  4. similarity model
  5. webpage analysis

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • The National Key Research and Development Program of China
  • Exeter-Tsinghua Academic Fellowship award
