Automatic extraction of titles from general documents using machine learning

Authors:

Dmitriy Meyerzon,

Qinghua Zheng

Information Processing and Management: an International Journal, Volume 42, Issue 5

Pages 1276 - 1293

https://doi.org/10.1016/j.ipm.2005.12.001

Published: 01 September 2006

Abstract

In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previous methods were proposed mainly for title extraction from research papers, and it has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively), use them as training data to train machine learning models, and perform title extraction with the trained models. Our method is unique in that we mainly use formatting information, such as font size, as features in the models. It turns out that formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837, respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895, respectively. Other important new findings in this work are that we can train models in one domain and apply them to other domains and, more surprisingly, that we can even train models in one language and apply them to other languages. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
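The pipeline described in the abstract (annotate titles, derive formatting features, train a model, extract, then measure precision and recall) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Unit` fields, the five features, the toy documents, and the use of a plain perceptron are hypothetical stand-ins, since the abstract does not specify the paper's actual feature set or models.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Unit:
    """One text unit (e.g. a line) with hypothetical formatting attributes."""
    text: str
    font_size: float
    bold: bool
    centered: bool
    is_title: bool = False  # human annotation, used for training and evaluation


def features(unit: Unit, doc: List[Unit]) -> List[float]:
    """Formatting-only features, computed relative to the rest of the document."""
    max_size = max(u.font_size for u in doc)
    return [
        1.0,                                          # bias term
        unit.font_size / max_size,                    # relative font size
        1.0 if unit.font_size == max_size else 0.0,   # largest font in the document
        1.0 if unit.bold else 0.0,
        1.0 if unit.centered else 0.0,
    ]


def score(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(docs: List[List[Unit]], epochs: int = 20) -> List[float]:
    """Plain perceptron: adjust the weights whenever a unit is misclassified."""
    w = [0.0] * 5
    for _ in range(epochs):
        for doc in docs:
            for unit in doc:
                x = features(unit, doc)
                predicted = score(w, x) > 0
                if predicted != unit.is_title:
                    sign = 1.0 if unit.is_title else -1.0
                    w = [wi + sign * xi for wi, xi in zip(w, x)]
    return w


def extract_title(doc: List[Unit], w: List[float]) -> List[str]:
    """Return the text of every unit the model labels as a title."""
    return [u.text for u in doc if score(w, features(u, doc)) > 0]


def precision_recall(predicted: List[str], gold: List[str]) -> Tuple[float, float]:
    true_positives = len(set(predicted) & set(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy annotated training documents (invented for this sketch).
TRAIN_DOCS = [
    [
        Unit("Quarterly Sales Report", 24, True, True, is_title=True),
        Unit("Prepared by the finance team", 12, False, True),
        Unit("Revenue grew in the third quarter.", 11, False, False),
    ],
    [
        Unit("Confidential", 10, True, False),
        Unit("Project Kickoff", 20, False, True, is_title=True),
        Unit("Agenda and goals for the first meeting.", 12, False, False),
    ],
]

# A held-out document that the model has not seen during training.
TEST_DOC = [
    Unit("Draft", 10, True, False),
    Unit("Employee Handbook", 28, False, True, is_title=True),
    Unit("Welcome to the company.", 11, False, False),
]

weights = train_perceptron(TRAIN_DOCS)
predicted_titles = extract_title(TEST_DOC, weights)
gold_titles = [u.text for u in TEST_DOC if u.is_title]
precision, recall = precision_recall(predicted_titles, gold_titles)
```

The features are expressed relative to each document (for example, font size divided by the document's largest font size) rather than as raw text, which is one plausible reason formatting-based models could transfer across domains and languages as the abstract reports. This sketch does not test that claim.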

Index Terms

Automatic extraction of titles from general documents using machine learning
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval

Published In

Information Processing and Management: an International Journal Volume 42, Issue 5

September 2006

266 pages

ISSN:0306-4573

Publisher

Pergamon Press, Inc.

United States

Cited By

Landau, S. (2020). Categorizing Uses of Communications Metadata: Systematizing Knowledge and Presenting a Path for Privacy. Proceedings of the New Security Paradigms Workshop 2020 (pp. 1-19). Online publication date: 26-Oct-2020.
https://dl.acm.org/doi/10.1145/3442167.3442171
Montebruno, P., Bennett, R., Smith, H., & Lieshout, C. (2020). Machine learning classification of entrepreneurs in British historical census data. Information Processing and Management: an International Journal, 57(3). Online publication date: 29-Jun-2020.
https://dl.acm.org/doi/10.1016/j.ipm.2020.102210
Duretec, K., Rauber, A., Becker, C., McDonald, R., Worby, N., Jatowt, A., Marshall, C., & Milligan, I. (2017). A text extraction software benchmark based on a synthesized dataset. Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries (pp. 109-118). Online publication date: 19-Jun-2017.
https://dl.acm.org/doi/10.5555/3200334.3200347
Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Research-paper recommender systems. International Journal on Digital Libraries, 17(4), 305-338. Online publication date: 1-Nov-2016.
https://dl.acm.org/doi/10.1007/s00799-015-0156-0
Sharef, N., Martin, T., Kasmiran, K., Mustapha, A., Sulaiman, M., & Azmi-Murad, M. (2015). A comparative study of evolving fuzzy grammar and machine learning techniques for text categorization. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 19(6), 1701-1714. Online publication date: 1-Jun-2015.
https://dl.acm.org/doi/10.1007/s00500-014-1358-x
Lopez, C., Prince, V., & Roche, M. (2014). How can catchy titles be generated without loss of informativeness? Expert Systems with Applications: An International Journal, 41(4), 1051-1062. Online publication date: 1-Mar-2014.
https://dl.acm.org/doi/10.1016/j.eswa.2013.07.102
Beel, J., Langer, S., Genzmehr, M., Müller, C., Downie, J., McDonald, R., Cole, T., Sanderson, R., & Shipman, F. (2013). Docear's PDF inspector. Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries (pp. 443-444). Online publication date: 22-Jul-2013.
https://dl.acm.org/doi/10.1145/2467696.2467789
Sathiyamurthy, K., & Geetha, T. (2012). Automatic Organization and Generation of Presentation Slides for E-Learning. International Journal of Distance Education Technologies, 10(3), 35-52. Online publication date: 1-Jul-2012.
https://dl.acm.org/doi/10.4018/jdet.2012070103
Beel, J., Gipp, B., Shaker, A., & Friedrich, N. (2010). SciPlore Xtract. Proceedings of the 14th European conference on Research and advanced technology for digital libraries (pp. 413-416). Online publication date: 6-Sep-2010.
https://dl.acm.org/doi/10.5555/1887759.1887818
