Abstract
In this paper we describe the design, setup and results of the speech recognition task in the Evalita campaign for the Italian language, giving details on the released corpora and the tools used for the challenge. A general discussion of approaches to large-vocabulary speech recognition introduces the recognition tasks. Systems are compared for recognition accuracy on audio recordings of the Italian parliament. Although only a few systems participated in the tasks, the contest provides an overview of the state of the art in speech-to-text transcription technologies. The document reports system performance, computed as Word Error Rate (WER), showing that current approaches provide effective results: the best system achieves a WER as low as 5.4% on the released test set.
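Since the results are reported as WER, a minimal sketch of how that metric is commonly computed may help: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a system hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, and the sketch omits the text normalization that an official scoring tool would apply; it is not the campaign's scoring procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative sketch only: words are split on whitespace and compared
    verbatim, with no case folding or other normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one deleted word out of five reference words.
    print(word_error_rate("il senato approva la legge", "il senato approva legge"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.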
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Matassoni, M., Brugnara, F., Gretter, R. (2013). Evalita 2011: Automatic Speech Recognition Large Vocabulary Transcription. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds) Evaluation of Natural Language and Speech Tools for Italian. EVALITA 2012. Lecture Notes in Computer Science, vol 7689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35828-9_30
DOI: https://doi.org/10.1007/978-3-642-35828-9_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35827-2
Online ISBN: 978-3-642-35828-9
eBook Packages: Computer Science, Computer Science (R0)