Abstract
This study develops a phoneme classification model for Bengali continuous speech using a deep neural network. In the first phase, phoneme classification is performed with the deep-structured model and with two baseline models, one based on a hidden Markov model and the other on a multilayer perceptron. The deep-structured model achieves higher overall classification accuracy than both baselines. The confusion matrix that the classification model produces over all Bengali phonemes is then examined, and the phonemes are divided into nine groups; classifying at the level of these nine groups raises the overall classification accuracy to 98.7%. In the second phase, phonological features based on place and manner of articulation are detected and classified. The phonemes are regrouped into 15 groups using manner-of-articulation knowledge, and the deep-structured model is retrained, giving an overall classification accuracy of 98.9%. This is almost equal to the accuracy observed for the nine groups, but because the nine groups are redivided into 15, there is less phoneme confusion within any single group, which leads to a better phoneme classification model.
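The effect of grouping on overall accuracy can be illustrated with a minimal sketch. It is not the authors' implementation or data: the four-phoneme confusion matrix uses made-up counts, and the function names `overall_accuracy` and `collapse_to_groups` and the `group_of` mapping are hypothetical. The sketch only shows that once frequently confused phonemes share a group, confusions inside that group no longer count as errors at the group level.

```python
def overall_accuracy(confusion):
    """Fraction of samples that lie on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def collapse_to_groups(confusion, group_of):
    """Merge the rows and columns of a phoneme-level confusion matrix into groups.

    group_of[i] is the (hypothetical) group index assigned to phoneme i.
    """
    n_groups = max(group_of) + 1
    grouped = [[0] * n_groups for _ in range(n_groups)]
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            grouped[group_of[i]][group_of[j]] += count
    return grouped


# Toy confusion matrix (rows: true phoneme, columns: predicted phoneme).
# The counts are invented for illustration only.
confusion = [
    [80, 15, 3, 2],
    [12, 82, 4, 2],
    [2, 3, 85, 10],
    [1, 4, 14, 81],
]

# Assumed grouping: phonemes 0 and 1 are often confused, as are 2 and 3.
group_of = [0, 0, 1, 1]

phoneme_acc = overall_accuracy(confusion)          # accuracy over 4 phonemes
grouped = collapse_to_groups(confusion, group_of)  # 2 x 2 group-level matrix
group_acc = overall_accuracy(grouped)              # accuracy over 2 groups

print(f"phoneme-level accuracy: {phoneme_acc:.4f}")
print(f"group-level accuracy:   {group_acc:.4f}")
```

In this toy case the group-level accuracy is higher than the phoneme-level accuracy because within-group confusions move onto the diagonal. Splitting a group into smaller groups, as in the move from nine to 15 groups, leaves fewer phonemes to be confused with one another inside each group.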
About this article
Cite this article
Bhowmik, T., Mandal, S.K.D. Manner of articulation based Bengali phoneme classification. Int J Speech Technol 21, 233–250 (2018). https://doi.org/10.1007/s10772-018-9498-5
DOI: https://doi.org/10.1007/s10772-018-9498-5