Extracting General Lists from Web Documents: A Hybrid Approach

Fabio Fumarola,
Tim Weninger,
Rick Barber,
Donato Malerba &
Jiawei Han

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6703)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

The problem of extracting structured data (e.g., lists, record sets, and tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods across a varied Web corpus.

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari “Aldo Moro”, Bari, Italy
Fabio Fumarola & Donato Malerba
Computer Science Department, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
Tim Weninger, Rick Barber & Jiawei Han

Editor information

Editors and Affiliations

School of Computer and Information Science, Center for Science and Technology, Syracuse University, 13244-4100, Syracuse, NY, USA
Kishan G. Mehrotra & Chilukuri K. Mohan
Department of Electrical Engineering and Computer Science, Syracuse University, 13244, Syracuse, NY, USA
Jae C. Oh
Department of Electrical Engineering and Computer Science, Syracuse University, 13244, Syracuse, NY, USA
Pramod K. Varshney
Department of Computer Science, Texas State University San Marcos, 601 University Drive, 78666-4616, San Marcos, TX, USA
Moonis Ali

About this paper

Cite this paper

Fumarola, F., Weninger, T., Barber, R., Malerba, D., Han, J. (2011). Extracting General Lists from Web Documents: A Hybrid Approach. In: Mehrotra, K.G., Mohan, C.K., Oh, J.C., Varshney, P.K., Ali, M. (eds) Modern Approaches in Applied Intelligence. IEA/AIE 2011. Lecture Notes in Computer Science, vol 6703. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21822-4_29

DOI: https://doi.org/10.1007/978-3-642-21822-4_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21821-7
Online ISBN: 978-3-642-21822-4
eBook Packages: Computer Science, Computer Science (R0)
