DOI: 10.1145/1815330.1815367
research-article

An impact of linguistic features on automated classification of OCR texts

Published: 09 June 2010

Abstract

Optical character reader (OCR) systems are used to digitize print documents, and OCR texts are produced in the process. These texts usually need to be indexed and organized to simplify access and retrieval, which can be done with automatic classification techniques. However, current OCR technology cannot recognize all characters with 100% accuracy. Furthermore, it is not known whether part-of-speech (POS) analysis contributes to a discriminative representation of OCR texts. Conventionally, the bag-of-words approach is used in OCR text classification. In this paper, we experimentally evaluate POS analysis on OCR texts to formulate an informative feature set. Empirical results indicate that a combination of suitably selected POS improves the classification performance of OCR texts.
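As a rough illustration of the contrast drawn above between a plain bag-of-words and a feature set restricted to selected parts of speech, the following minimal sketch may help. It is not the authors' method: the toy lexicon `TOY_POS_LEXICON`, the tag names, the helper names `tag` and `pos_filtered_bag`, and the sample sentence are all hypothetical, and a real system would presumably use a trained POS tagger instead of a lookup table.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real POS tagger (assumption,
# not taken from the paper). Tokens missing from it are tagged "X".
TOY_POS_LEXICON = {
    "bank": "NOUN", "loan": "NOUN", "rates": "NOUN",
    "raised": "VERB", "fell": "VERB",
    "the": "DET", "a": "DET",
    "quickly": "ADV",
}


def tag(tokens):
    """Pair each token with a POS tag; unknown tokens (e.g. OCR errors) get 'X'."""
    return [(tok, TOY_POS_LEXICON.get(tok, "X")) for tok in tokens]


def pos_filtered_bag(text, keep_pos):
    """Count only the tokens whose POS tag is in keep_pos."""
    tokens = text.lower().split()
    return Counter(tok for tok, pos in tag(tokens) if pos in keep_pos)


# "qu1ckly" imitates an OCR misrecognition of "quickly".
doc = "The bank raised the loan rates qu1ckly"

# Conventional bag-of-words: every token is a feature.
full_bag = Counter(doc.lower().split())

# POS-selected features: keep nouns and verbs only.
noun_verb_bag = pos_filtered_bag(doc, {"NOUN", "VERB"})

print(sorted(full_bag))
print(sorted(noun_verb_bag))
```

In this sketch the misrecognized token falls outside the lexicon and is dropped along with the determiners, which is one plausible way POS selection could reduce noisy features. Whether that helps classification is the empirical question the paper addresses.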

Cited By

  • (2021) Indonesian ID Card Extractor Using Optical Character Recognition and Natural Language Post-Processing. 2021 9th International Conference on Information and Communication Technology (ICoICT), pages 621-626. DOI: 10.1109/ICoICT52021.2021.9527510. Online publication date: 3-Aug-2021

    Published In

    DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
    June 2010
    490 pages
    ISBN: 9781605587738
    DOI: 10.1145/1815330
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. OCR text classification
    2. feature generation
    3. feature transformation
    4. linguistic features and ocr texts
    5. parts of speech analysis
    6. text categorization

    Qualifiers

    • Research-article

    Conference

    DAS '10
