Abstract
The problem of extracting structured data (e.g. lists, record sets, tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus.
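To make the hybrid idea concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes that each rendered element is available as a box with a pixel position and a tag path, and it treats elements as a candidate list only when they agree on both a visual cue (a shared left edge) and a structural cue (an identical tag path). The `Box` class, its fields, the `candidate_lists` function, and the `min_items` threshold are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rendered page element (hypothetical structure, not from the paper)."""
    tag_path: str  # structural cue, e.g. "html/body/ul/li"
    x: int         # visual cue: left edge in pixels
    y: int         # visual cue: top edge in pixels
    text: str


def candidate_lists(boxes, min_items=2):
    """Group boxes that share a left edge AND a tag path.

    Assumption: only vertically aligned lists are handled here;
    horizontal or tiled lists would need an analogous check on y.
    """
    groups = defaultdict(list)
    for box in boxes:
        groups[(box.x, box.tag_path)].append(box)
    # Keep groups with enough members, each ordered top to bottom.
    return [
        sorted(group, key=lambda b: b.y)
        for group in groups.values()
        if len(group) >= min_items
    ]


boxes = [
    Box("html/body/ul/li", 40, 100, "Apples"),
    Box("html/body/ul/li", 40, 120, "Pears"),
    Box("html/body/ul/li", 40, 140, "Plums"),
    Box("html/body/p", 40, 200, "Footer text"),   # aligned, different structure
    Box("html/body/div/a", 300, 100, "Login"),    # different position and structure
]

for group in candidate_lists(boxes):
    print([b.text for b in group])  # prints ['Apples', 'Pears', 'Plums']
```

In this toy input the footer paragraph shares the left edge of the list items but not their tag path, so a purely visual grouping would include it while the combined test excludes it.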
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fumarola, F., Weninger, T., Barber, R., Malerba, D., Han, J. (2011). Extracting General Lists from Web Documents: A Hybrid Approach. In: Mehrotra, K.G., Mohan, C.K., Oh, J.C., Varshney, P.K., Ali, M. (eds) Modern Approaches in Applied Intelligence. IEA/AIE 2011. Lecture Notes in Computer Science, vol 6703. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21822-4_29
DOI: https://doi.org/10.1007/978-3-642-21822-4_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21821-7
Online ISBN: 978-3-642-21822-4
eBook Packages: Computer Science, Computer Science (R0)