DOI: 10.1145/2695664.2695786
research-article

Web page segmentation evaluation

Published: 13 April 2015

Abstract

In this paper, we present a framework for evaluating Web page segmentation algorithms. Web page segmentation consists of dividing a Web page into coherent fragments, called blocks, each representing one distinct information element in the page. We define an evaluation model that includes several metrics for assessing the quality of the segmentation produced by a given algorithm. These metrics compute the distance between the obtained segmentation and a manually built segmentation that serves as ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BOM, Block Fusion, VIPS and JVIPS) on several categories (types) of Web pages. The results show that the tested algorithms usually perform rather well for text extraction but may have serious problems extracting geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.
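
To make the idea of comparing an obtained segmentation with a ground truth concrete, here is a minimal Python sketch. It is not the evaluation model or the metrics defined in the paper, which the abstract does not specify. It assumes blocks are axis-aligned rectangles and that two blocks match when their intersection-over-union reaches a chosen threshold. The names Block, iou and compare_segmentations, the 0.8 threshold, and the greedy one-to-one matching are all hypothetical choices made for illustration. It looks at geometry only and ignores the text content of blocks.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Block:
    """A rectangular page block (hypothetical representation)."""
    x: float
    y: float
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height


def iou(a: Block, b: Block) -> float:
    """Intersection-over-union of two rectangles, in [0, 1]."""
    w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    inter = max(w, 0.0) * max(h, 0.0)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def compare_segmentations(
    ground_truth: Sequence[Block],
    computed: Sequence[Block],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Dict[str, int]:
    """Greedily match each ground-truth block to one unused computed block."""
    matched_gt = set()
    matched_computed = set()
    for i, g in enumerate(ground_truth):
        for j, c in enumerate(computed):
            if j in matched_computed:
                continue  # each computed block may match at most once
            if iou(g, c) >= threshold:
                matched_gt.add(i)
                matched_computed.add(j)
                break
    return {
        "correct": len(matched_gt),
        "missed": len(ground_truth) - len(matched_gt),
        "false_alarms": len(computed) - len(matched_computed),
    }


# Usage: the second ground-truth block is split in two by the algorithm.
truth = [Block(0, 0, 100, 50), Block(0, 60, 100, 100)]
found = [Block(0, 0, 100, 48), Block(0, 60, 50, 100), Block(50, 60, 50, 100)]
print(compare_segmentations(truth, found))
```

In this example the first block is matched, while the split block counts once as missed and twice as a false alarm. A fuller evaluation would presumably distinguish such over-segmentation from unrelated blocks and would also compare the text each block contains.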

Published In

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing
April 2015
2418 pages
ISBN: 9781450331968
DOI: 10.1145/2695664
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Funding Sources

  • FP7 ICT-2009.4.1

Conference

SAC 2015: Symposium on Applied Computing
April 13 - 17, 2015
Salamanca, Spain

Acceptance Rates

SAC '15 Paper Acceptance Rate: 291 of 1,211 submissions, 24%
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 11
  • Downloads (last 6 weeks): 2
Reflects downloads up to 20 Nov 2024

Cited By

  • (2024) Internet Web page content block dataset and solutions for its data labelling simplification. DOI: 10.20334/2024-032-M. Online publication date: 2024
  • (2023) Web Page Content Block Identification with Extended Block Properties. Applied Sciences 13:9 (5680). DOI: 10.3390/app13095680. Online publication date: 5-May-2023
  • (2022) Multimodal Web Page Segmentation Using Self-organized Multi-objective Clustering. ACM Transactions on Information Systems 40:3 (1-49). DOI: 10.1145/3480966. Online publication date: 7-Mar-2022
  • (2020) Web Page Segmentation Revisited. Proceedings of the 29th ACM International Conference on Information & Knowledge Management (3047-3054). DOI: 10.1145/3340531.3412782. Online publication date: 19-Oct-2020
  • (2020) Model-Driven Web Page Segmentation for Non Visual Access. Computational Linguistics (191-205). DOI: 10.1007/978-981-15-6168-9_17. Online publication date: 2-Jul-2020
  • (2019) Designing Experiments to Compare Web Page Segmenters. Proceedings of the 2nd International Workshop on Human Factors in Hypertext (27-29). DOI: 10.1145/3345509.3349280. Online publication date: 12-Sep-2019
  • (2018) Web Page Segmentation Towards Information Extraction for Web Semantics. International Conference on Innovative Computing and Communications (431-442). DOI: 10.1007/978-981-13-2354-6_45. Online publication date: 20-Nov-2018
  • (2017) Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation. Advances in Databases and Information Systems (375-393). DOI: 10.1007/978-3-319-66917-5_25. Online publication date: 25-Aug-2017
