A General Learning Method for Automatic Title Extraction from HTML Pages

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2009)

Abstract

This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Earlier work pursued similar objectives but considered only titles in text format. In this paper we propose a general learning scheme that learns textual titles from style information and image-format titles from image properties.

We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques.

Based on these features, learning algorithms such as Decision Trees and Random Forests are applied and achieve good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
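To make the idea of style-based features more concrete, here is a minimal sketch of how candidate text nodes in an HTML page might be turned into feature records before a classifier is trained on them. It is an illustration only, not the authors' feature set: the class name `StyleFeatureExtractor`, the function `extract_style_features`, the chosen features (heading level, bold flag, text length, position) and the decision to skip the `<title>` element are all assumptions, since the abstract does not list the features actually used.

```python
from html.parser import HTMLParser

# Hypothetical cue sets; the paper's real feature definitions are not given
# in the abstract.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}
SKIPPED_TAGS = {"script", "style", "title"}  # assumption: body text only
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class StyleFeatureExtractor(HTMLParser):
    """Collect one feature dict per visible text node."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.candidates = []  # one feature dict per text node

    def handle_starttag(self, tag, attrs):
        # Void elements never get a closing tag, so they are not stacked.
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating
        # unclosed inner elements.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or SKIPPED_TAGS & set(self.open_tags):
            return
        heading = next((t for t in self.open_tags if t in HEADING_TAGS), None)
        self.candidates.append({
            "text": text,
            "position": len(self.candidates),  # order of appearance
            "heading_level": int(heading[1]) if heading else 0,
            "is_bold": bool(BOLD_TAGS & set(self.open_tags)),
            "length": len(text),
        })


def extract_style_features(html):
    """Return a list of feature dicts, one per candidate text node."""
    parser = StyleFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.candidates


if __name__ == "__main__":
    page = "<body><h1>Main <b>Title</b></h1><p>Body text here.</p></body>"
    for row in extract_style_features(page):
        print(row)
```

Records of this kind could then be labelled (title or not) and passed to a decision tree or random forest learner. The labelling step and the image-based features for image-format titles are outside the scope of this sketch.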

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Changuel, S., Labroche, N., Bouchon-Meunier, B. (2009). A General Learning Method for Automatic Title Extraction from HTML Pages. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_53

  • DOI: https://doi.org/10.1007/978-3-642-03070-3_53

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03069-7

  • Online ISBN: 978-3-642-03070-3

  • eBook Packages: Computer Science, Computer Science (R0)
