Abstract
Unsupervised detection of HTML records is an important step in many Web content mining applications.
In this paper we propose a method for bottom-up discovery of clusters of maximal, non-agglomerative, similar HTML ranges in a nested set representation of the HTML tree. We then demonstrate its applicability to record detection in search engine results. To measure performance, we evaluated several distance assessment strategies and prepared two test collections: one containing result pages from almost 60 global and country-specific search engines, and one containing almost 100 methodically generated complex HTML trees with pre-set properties.
An empirical study shows that our method performs well and successfully detects most clusters of search result ranges.
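The abstract names a nested set representation of the HTML tree but does not spell it out. As a rough illustration only, the sketch below shows the general nested set idea: each element gets a left and a right number from a depth-first walk, so that descendants fall strictly inside their ancestor's interval. The names `NestedSetBuilder`, `to_nested_set`, `is_descendant` and `VOID_TAGS` are hypothetical, the use of Python's standard `html.parser` is an assumption, and nothing here should be read as the authors' implementation.

```python
from html.parser import HTMLParser

# Elements without a closing tag in HTML; treated as leaves (partial list, an assumption).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class NestedSetBuilder(HTMLParser):
    """Assign (left, right) nested set numbers to elements in document order."""

    def __init__(self):
        super().__init__()
        self.counter = 0   # last number handed out
        self.stack = []    # indices (into self.nodes) of currently open elements
        self.nodes = []    # one dict per element: tag, left, right

    def _next(self):
        self.counter += 1
        return self.counter

    def handle_starttag(self, tag, attrs):
        self.nodes.append({"tag": tag, "left": self._next(), "right": None})
        if tag in VOID_TAGS:
            # A void element is a leaf: close it immediately.
            self.nodes[-1]["right"] = self._next()
        else:
            self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Assumes well-formed input, e.g. markup already cleaned by a tidy tool.
        if self.stack and self.nodes[self.stack[-1]]["tag"] == tag:
            self.nodes[self.stack.pop()]["right"] = self._next()


def to_nested_set(html):
    """Return a list of (tag, left, right) tuples in document order."""
    builder = NestedSetBuilder()
    builder.feed(html)
    builder.close()
    return [(n["tag"], n["left"], n["right"]) for n in builder.nodes]


def is_descendant(node, ancestor):
    """True if node's interval lies strictly inside ancestor's interval."""
    return ancestor[1] < node[1] and node[2] < ancestor[2]


rows = to_nested_set("<ul><li><a>x</a></li><li><a>y</a></li></ul>")
```

With this encoding, ancestry checks reduce to interval comparisons, and a run of adjacent sibling subtrees occupies one contiguous block of numbers. That property plausibly makes candidate ranges cheap to compare, although the abstract does not state how the authors define or compare ranges.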
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Flejter, D., Hryniewiecki, R. (2007). Bottom-Up Discovery of Clusters of Maximal Ranges in HTML Trees for Search Engines Results Extraction. In: Abramowicz, W. (eds) Business Information Systems. BIS 2007. Lecture Notes in Computer Science, vol 4439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72035-5_31
DOI: https://doi.org/10.1007/978-3-540-72035-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72034-8
Online ISBN: 978-3-540-72035-5
eBook Packages: Computer Science (R0)