Abstract
We present WebSelF, a framework for web scraping that models the scraping process and decomposes it into four conceptually independent, reusable, and composable constituents. We have validated the framework through a full parameterized implementation that is flexible enough to capture previous work on web scraping. We conducted an experiment that evaluated several qualitatively different web scraping constituents (including previous work and combinations thereof) on about 11,000 HTML pages taken from daily versions of 17 web sites over a period of more than one year. Our framework solves three concrete problems with current web scraping, and our experimental results indicate that composing previous techniques with our new ones achieves higher accuracy, precision, and specificity than existing techniques alone.
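The three reported measures can be illustrated with a minimal sketch, assuming the standard confusion-matrix definitions. The abstract does not state how the evaluation classifies an extraction as positive or negative, so the `Counts` type, the function names, and the example numbers below are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for one scraper over a set of pages (hypothetical type)."""
    tp: int  # scraper returned a result and it was correct
    fp: int  # scraper returned a result and it was wrong
    tn: int  # scraper correctly returned nothing
    fn: int  # scraper returned nothing although a result existed


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a measure is undefined.
    return numerator / denominator if denominator else float("nan")


def accuracy(c: Counts) -> float:
    # Share of all decisions that were correct.
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def precision(c: Counts) -> float:
    # Share of returned results that were correct.
    return _ratio(c.tp, c.tp + c.fp)


def specificity(c: Counts) -> float:
    # Share of true negatives among all actual negatives.
    return _ratio(c.tn, c.tn + c.fp)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    example = Counts(tp=8, fp=2, tn=85, fn=5)
    print(accuracy(example), precision(example), specificity(example))
```

Under these definitions, a scraper that fails silently on a changed page lowers accuracy only, while one that returns a wrong element lowers all three measures.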
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thomsen, J.G., Ernst, E., Brabrand, C., Schwartzbach, M. (2012). WebSelF: A Web Scraping Framework. In: Brambilla, M., Tokuda, T., Tolksdorf, R. (eds) Web Engineering. ICWE 2012. Lecture Notes in Computer Science, vol 7387. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31753-8_28
DOI: https://doi.org/10.1007/978-3-642-31753-8_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31752-1
Online ISBN: 978-3-642-31753-8
eBook Packages: Computer Science (R0)