Manner of articulation based Bengali phoneme classification

Abstract

This study develops a phoneme classification model for Bengali continuous speech, using a deep neural network based classifier. In the first phase, the phoneme classification task was performed with the deep-structured model and with two baseline models, one built on a hidden Markov model and the other on a multilayer perceptron. The deep-structured model gave better overall classification accuracy than both baselines. The confusion matrix it produced over all Bengali phonemes was then examined, and the phonemes were divided into nine groups. Classifying into these nine groups raised the overall classification accuracy to 98.7%. In the next phase of the study, phonological features based on place and manner of articulation were detected and classified. The phonemes were regrouped into 15 groups using knowledge of manner of articulation, and the deep-structured model was retrained. This time the system achieved an overall classification accuracy of 98.9%, almost equal to the accuracy observed for the nine phoneme groups. Because the nine groups were redivided into 15, however, each group contains fewer mutually confusable phonemes, which leads to a better phoneme classification model.
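The relationship between a phoneme-level confusion matrix and group-level accuracy can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 4-class matrix is toy data, the grouping is arbitrary, and the helper names `overall_accuracy` and `merge_classes` are hypothetical. It does not reproduce the article's data, its phoneme groups, or its method for choosing them.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal (rows = true class, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def merge_classes(confusion, groups):
    """Collapse a class-level confusion matrix into a group-level one.

    `groups` is a list of lists of class indices; each class index is
    expected to appear in exactly one group.
    """
    merged = [[0] * len(groups) for _ in groups]
    for i, true_members in enumerate(groups):
        for j, pred_members in enumerate(groups):
            merged[i][j] = sum(
                confusion[r][c] for r in true_members for c in pred_members
            )
    return merged


# Toy confusion matrix for four hypothetical classes (not data from the article).
# Classes 0/1 are often confused with each other, as are classes 2/3.
toy_confusion = [
    [80, 15, 3, 2],
    [12, 82, 4, 2],
    [2, 3, 85, 10],
    [1, 4, 14, 81],
]

# Hypothetical grouping that puts mutually confusable classes together.
toy_groups = [[0, 1], [2, 3]]

class_accuracy = overall_accuracy(toy_confusion)
group_confusion = merge_classes(toy_confusion, toy_groups)
group_accuracy = overall_accuracy(group_confusion)

print(f"class-level accuracy: {class_accuracy:.4f}")
print(f"group-level accuracy: {group_accuracy:.4f}")
```

In this toy case, merging the confusable classes moves their mutual errors onto the diagonal, so the group-level accuracy is higher than the class-level one. The abstract's final point is the converse: if accuracy stays about the same while the groups become smaller, less confusion remains to be resolved inside each group.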

Author information

Authors and Affiliations

School of Computer Science, University of Petroleum and Energy Studies, Dehradun, India
Tanmay Bhowmik
CET, Indian Institute of Technology Kharagpur, Kharagpur, India
Tanmay Bhowmik & Shyamal Kumar Das Mandal

Corresponding author

Correspondence to Tanmay Bhowmik.

About this article

Cite this article

Bhowmik, T., Mandal, S.K.D. Manner of articulation based Bengali phoneme classification. Int J Speech Technol 21, 233–250 (2018). https://doi.org/10.1007/s10772-018-9498-5

Received: 29 April 2017
Accepted: 04 March 2018
Published: 09 March 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s10772-018-9498-5
