Abstract
This paper addresses the problem of automatically learning the title metadata of HTML documents. The objective is to help index Web resources that are poorly annotated. Earlier work pursued similar objectives but considered only titles in text format. In this paper we propose a general learning schema that learns textual titles from style information and image-format titles from image properties.
We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques.
Based on these features, learning algorithms such as Decision Trees and Random Forests are applied and achieve good results despite the heterogeneity of our corpus. We also show that combining both methods can improve performance.
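The feature-construction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names (`in_heading`, `in_bold`, `depth`, `n_words`, `position`) are hypothetical, and the labelling rule (a text node is a title if it matches the content of the `<title>` tag) is only an assumption about how automatic annotation might work, since the abstract does not specify it. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical style cues; the abstract does not list the actual feature set.
HEADING_TAGS = {"h1", "h2", "h3"}
BOLD_TAGS = {"b", "strong"}
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class TextNodeCollector(HTMLParser):
    """Collect body text nodes with simple style features, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.title = ""   # text found inside the <title> element
        self.nodes = []   # one feature dict per visible text node

    def handle_starttag(self, tag, attrs):
        # Void elements never get a closing tag, so they are not tracked.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if "title" in self.stack:
            self.title += text
            return
        if "script" in self.stack or "style" in self.stack:
            return
        self.nodes.append({
            "text": text,
            "in_heading": int(any(t in HEADING_TAGS for t in self.stack)),
            "in_bold": int(any(t in BOLD_TAGS for t in self.stack)),
            "depth": len(self.stack),
            "n_words": len(text.split()),
            "position": len(self.nodes),
        })


def annotate(html):
    """Return feature rows, labelled by matching against the <title> tag (assumed rule)."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    title = parser.title.lower()
    for node in parser.nodes:
        node["is_title"] = int(bool(title) and node["text"].lower() == title)
    return parser.nodes


page = """<html><head><title>Learning Titles</title></head>
<body><div><h1>Learning Titles</h1><p>Some <b>bold</b> body text.</p></div></body></html>"""
rows = annotate(page)
for row in rows:
    print(row)
```

Each row is a labelled feature vector for one text node. Rows of this kind could then be passed to a decision tree or random forest learner; that step is omitted here because it would depend on a third-party library.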
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Changuel, S., Labroche, N., Bouchon-Meunier, B. (2009). A General Learning Method for Automatic Title Extraction from HTML Pages. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_53
DOI: https://doi.org/10.1007/978-3-642-03070-3_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03069-7
Online ISBN: 978-3-642-03070-3
eBook Packages: Computer Science, Computer Science (R0)