Abstract
In our previous work, we explored articulatory and excitation source features to improve the performance of phone recognition systems (PRSs) using read speech corpora. In this work, we extend the use of articulatory and excitation source features to develop PRSs for the extempore and conversation modes of speech, in addition to read speech. It is well known that the overall performance of a speech recognition system depends heavily on the accuracy of phone recognition. The objective of this paper is therefore to enhance the accuracy of PRSs by using articulatory and excitation source features in addition to conventional spectral features. The articulatory features (AFs) are derived from the spectral features using feedforward neural networks (FFNNs). We consider five AF groups: manner, place, roundness, frontness and height. Five AF-based tandem PRSs are developed, each using a combination of Mel frequency cepstral coefficients (MFCCs) and the AFs derived from the FFNNs. Hybrid PRSs are developed by combining the evidence from the AF-based tandem PRSs using a weighted combination approach. The excitation source information is derived by processing the linear prediction residual of the speech signal, while the vocal tract information is captured using MFCCs. The combination of vocal tract and excitation source features is used for developing PRSs. The PRSs are developed using hidden Markov models. A Bengali speech database is used for developing PRSs for the read, extempore and conversation modes of speech. The results are analyzed and the performance is compared across the different modes of speech. The results show that using either articulatory or excitation source features together with MFCCs improves the performance of PRSs in all three modes of speech. The improvement obtained using AFs is much higher than the improvement obtained using excitation source features.
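As a rough illustration of the excitation source step mentioned in the abstract, the sketch below computes a linear prediction (LP) residual for a single frame by inverse filtering the frame with LP coefficients estimated by the autocorrelation method. The function names, the LP order and the use of NumPy are assumptions made for illustration only; the abstract does not state the analysis settings or how the residual is further processed into features.

```python
import numpy as np


def lp_coefficients(frame, order):
    """Estimate LP inverse-filter coefficients with the autocorrelation
    method (Levinson-Durbin recursion). Returns a with a[0] == 1, so the
    prediction error is e[n] = sum_k a[k] * s[n - k]."""
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: stop the recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_residual(frame, order=10):
    """Inverse-filter the frame with its own LP coefficients.
    The default order of 10 is an illustrative assumption."""
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    # FIR filtering with A(z); keep the first len(frame) output samples
    return np.convolve(frame, a)[:len(frame)]


if __name__ == "__main__":
    # Hypothetical input: a noisy synthetic frame standing in for speech
    rng = np.random.default_rng(0)
    t = np.arange(400)
    frame = np.sin(2 * np.pi * 0.05 * t) + 0.1 * rng.standard_normal(400)
    residual = lp_residual(frame, order=10)
    print(residual.shape)
```

In practice a frame would usually be windowed before analysis, and the residual would be processed further before being combined with MFCCs; both steps are omitted here.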
Acknowledgments
The work presented in this paper was performed at IIT-Kharagpur as part of the project 11(6)/2011-HCC(TDIL), dated 23-12-2011, “Prosodically guided phonetic engine for searching speech databases in Indian languages”, supported by the Department of Information Technology, Government of India.
Cite this article
Manjunath, K.E., Sreenivasa Rao, K. Articulatory and excitation source features for speech recognition in read, extempore and conversation modes. Int J Speech Technol 19, 121–134 (2016). https://doi.org/10.1007/s10772-015-9329-x
DOI: https://doi.org/10.1007/s10772-015-9329-x