Multimodal Output Combination for Transcribing Historical Handwritten Documents

Emilio Granell¹⁵ &
Carlos-D. Martínez-Hinarejos¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 9256))

Included in the following conference series:

International Conference on Computer Analysis of Images and Patterns

3139 Accesses

Abstract

Transcription of digitalised historical documents is an interesting task in the document analysis area. This transcription can be achieved by using Handwritten Text Recognition (HTR) on digitalised pages or by using Automatic Speech Recognition (ASR) on the dictation of contents. Moreover, another option is using both systems in a multimodal combination to obtain a draft transcription, given that combining the outputs of different recognition systems will generally improve the recognition accuracy. In this work, we present a new combination method based on Confusion Network. We check its effectiveness for transcribing a Spanish historical book. Results on both unimodal combination with different optical (for HTR) and acoustic (for ASR) models, and multimodal combination, show a relative reduction of Word and Character Error Rate of $14.3\%$ and $16.6\%$, respectively, over the HTR baseline.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Automatic Bidirectional Conversion of Audio and Text: A Review from Past Research

Multimodal image and audio music transcription

Article Open access 11 November 2021

Comprehensive Survey of Audio-to-Text Conversion

References

Alabau, V., Martínez-Hinarejos, C.D., Romero, V., Lagarda, A.L.: An iterative multimodal framework for the transcription of handwritten historical documents. Pattern Recognition Letters 35, 195–203 (2014)
Article Google Scholar
Bertolami, R., Halter, B., Bunke, H.: Combination of multiple handwritten text line recognition systems with a recursive approach. In: Proc. Int. Conf. Frontiers Handwriting Recognition, pp. 61–65 (2006)
Google Scholar
Bisani, M., Ney, H.: Bootstrap estimates for confidence intervals in ASR performance evaluation. In: Proc. of Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 409–412 (2004)
Google Scholar
Collobert, R., Bengio, S., Mariéthoz, J.: Torch: a modular machine learning software library. Tech. rep., IDIAP-RR 02–46, IDIAP (2002)
Google Scholar
Dreuw, P., Jonas, S., Ney, H.: White-space models for offline Arabic handwriting recognition. In: Proc. of Int. Conf. on Pattern Recognition, pp. 1–4 (2008)
Google Scholar
Hermansky, H., Ellis, D.P., Sharma, S.: Tandem connectionist feature extraction for conventional HMM systems. In: Proc. of Int. Conf. Acoustics, Speech and Signal Processing, vol. 3, pp. 1635–1638 (2000)
Google Scholar
Ishimaru, S., Nishizaki, H., Sekiguchi, Y.: Effect of confusion network combination on speech recognition system for editing. In: Proc. of APSIPA Annual Summit and Conf., vol. 4, pp. 1–4 (2011)
Google Scholar
Johnson, D.: ICSI Quicknet soft package (2004). http://www1.icsi.berkeley.edu/Speech/qn.html
Kneser, R., Ney, H.: Improved backing-off for m-gram language modeling. In: Proc. of Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 181–184 (1995)
Google Scholar
Krishnamurthy, H.K.: Study of algorithms to combine multiple automatic speech recognition (ASR) system outputs. Master’s thesis, Department of Electrical and Computer Engineering (2009). http://hdl.handle.net/2047/d10019273
Luján-Mares, M., Tamarit, V., Alabau, V., Martínez-Hinarejos, C.D., Pastor i Gadea, M., Sanchis, A., Toselli, A.H.: iATROS: a speech and handwritting recognition system. In: V Jornadas en Tecnologías del Habla (VJTH2008), pp. 75–78 (2008)
Google Scholar
Moreno, A., Poch, D., Bonafonte, A., Lleida, E., Llisterri, J., Mariño, J.B., Nadeu, C.: Albayzin speech database: design of the phonetic corpus. In: Proc. of EuroSpeech 1993, pp. 175–178 (1993)
Google Scholar
Netpbm home page. http://netpbm.sourceforge.net/
Plamondon, R., Srihari, S.N.: On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 63–84 (2000)
Article Google Scholar
Romero, V., Leiva, L.A., Toselli, A.H., Vidal, E.: Interactive multimodal transcription of text images using a web-based demo system. In: Proc. of Conf. on Intelligent User Interfaces, pp. 477–478 (2009)
Google Scholar
Serrano, N., Castro, F., Juan, A.: The RODRIGO Database. In: Proc. of Language Resources and Evaluation Conference, pp. 2709–2712 (2010)
Google Scholar
Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. Interspeech, pp. 901–904 (2002)
Google Scholar
Woodruff, P., Dupont, S.: Bimodal combination of speech and handwriting for improved word recognition. In: Proc. of EUSIPCO 2005, pp. 1918–1921 (2005)
Google Scholar
Xue, J., Zhao, Y.: Improved confusion network algorithm and shortest path search from word lattice. In: Proc. of Int. Conf. in Acoustics, Speech and Signal Processing, vol. 1, pp. 853–856 (2005)
Google Scholar
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., et al.: The HTK book (for HTK version 3.4). Cambridge university Eng. Dept. (2006)
Google Scholar
Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. Transactions on Information Systems 22(2), 179–214 (2004)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Camino Vera s/n, 46022, Valencia, Spain
Emilio Granell & Carlos-D. Martínez-Hinarejos

Authors

Emilio Granell
View author publications
You can also search for this author in PubMed Google Scholar
Carlos-D. Martínez-Hinarejos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Emilio Granell .

Editor information

Editors and Affiliations

University of Malta, Msida, Malta
George Azzopardi
University of Groningen, Groningen, The Netherlands
Nicolai Petkov

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Granell, E., Martínez-Hinarejos, CD. (2015). Multimodal Output Combination for Transcribing Historical Handwritten Documents. In: Azzopardi, G., Petkov, N. (eds) Computer Analysis of Images and Patterns. CAIP 2015. Lecture Notes in Computer Science(), vol 9256. Springer, Cham. https://doi.org/10.1007/978-3-319-23192-1_21

Download citation

DOI: https://doi.org/10.1007/978-3-319-23192-1_21
Published: 25 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23191-4
Online ISBN: 978-3-319-23192-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Multimodal Output Combination for Transcribing Historical Handwritten Documents

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Automatic Bidirectional Conversion of Audio and Text: A Review from Past Research

Multimodal image and audio music transcription

Comprehensive Survey of Audio-to-Text Conversion

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Multimodal Output Combination for Transcribing Historical Handwritten Documents

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Automatic Bidirectional Conversion of Audio and Text: A Review from Past Research

Multimodal image and audio music transcription

Comprehensive Survey of Audio-to-Text Conversion

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation