A General Learning Method for Automatic Title Extraction from HTML Pages

Sahar Changuel,
Nicolas Labroche &
Bernadette Bouchon-Meunier

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5632)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that learns textual titles from style information and image-format titles from image properties.

We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques.

Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can improve performance.

Author information

Authors and Affiliations

Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris, France
Sahar Changuel, Nicolas Labroche & Bernadette Bouchon-Meunier

Editor information

Editors and Affiliations

Institut für Bildverarbeitung und angewandte Informatik, Körnerstr. 10, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Changuel, S., Labroche, N., Bouchon-Meunier, B. (2009). A General Learning Method for Automatic Title Extraction from HTML Pages. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_53

DOI: https://doi.org/10.1007/978-3-642-03070-3_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03069-7
Online ISBN: 978-3-642-03070-3
eBook Packages: Computer Science, Computer Science (R0)
