Artificial Neural Networks Combined with the Principal Component Analysis for Non-Fluent Speech Recognition
Figure 1. The experiment outline. The speech signal is transformed with FFT, 1/3 octave filters, and an A-filter. Next, the PCA algorithm is applied. Based on the PCA model, the distances between the new (PCA) and the previous system of coordinates are calculated, and then classification with a multilayer perceptron is conducted.
Figure 2. The average contribution of variables to the PCA model according to the analysed utterance type.
Figure 3. First component (G1) factor loadings.
Figure 4. Second component (G2) factor loadings.
Figure 5. Prolongation spectrogram with the G1 (red) and G2 (black) components. The shapes of both G1 and G2 reflect the general time-frequency structure of the analysed utterance, but G2 reflects the higher frequencies, while G1 concentrates on the lower ones.
Figure 6. Third component (G3) factor loadings.
Figure 7. Fourth component (G4) factor loadings.
Figure 8. The representation of sound repetition obtained with the PCA (a) and Kohonen (b) algorithms.
Figure 9. The certainty of classification concerning the fluency type.
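The processing chain named in the experiment outline (FFT, 1/3 octave bands, A-weighting, PCA) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the 22,050 Hz sampling rate, the 1024-sample frames, the 125 Hz to 8 kHz band range, the synthetic test signal, and the function names. A-weighting is applied at the band centre frequency as an approximation, and the distance calculation and the multilayer perceptron are left out because the captions do not define them.

```python
import numpy as np

def third_octave_centres(f_min=100.0, f_max=10000.0):
    """Base-2 one-third-octave centre frequencies referenced to 1 kHz."""
    n = np.arange(-30, 31)
    fc = 1000.0 * 2.0 ** (n / 3.0)
    return fc[(fc >= f_min) & (fc <= f_max)]

def a_weighting_db(f):
    """Standard A-weighting curve in dB (approximately 0 dB at 1 kHz)."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * np.log10(ra) + 2.0

def band_levels(frame, fs, centres):
    """A-weighted level (dB) in each one-third-octave band of one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    levels = []
    for fc in centres:
        lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)  # band edges
        power = spectrum[(freqs >= lo) & (freqs < hi)].sum()
        levels.append(10.0 * np.log10(power + 1e-12))  # avoid log(0)
    # Assumption: weight each band by the A-curve value at its centre.
    return np.array(levels) + a_weighting_db(centres)

def pca_fit(X, n_components):
    """PCA via SVD of the mean-centred data; rows are time frames."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return mean, vt[:n_components], explained[:n_components]

# Synthetic one-second signal standing in for a speech sample.
rng = np.random.default_rng(0)
fs, frame_len = 22050, 1024
t = np.arange(fs) / fs
signal = (
    np.sin(2 * np.pi * 220 * t)
    + 0.5 * np.sin(2 * np.pi * 3000 * t)
    + 0.05 * rng.standard_normal(fs)
)

centres = third_octave_centres()
n_frames = fs // frame_len
frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
X = np.array([band_levels(fr, fs, centres) for fr in frames])

# Four components, matching the G1 to G4 loadings shown in the figures.
mean, components, explained = pca_fit(X, n_components=4)
scores = (X - mean) @ components.T  # frames in PCA coordinates

print(X.shape, scores.shape)
print("explained variance ratio:", np.round(explained, 3))
```

In this sketch the rows of `components` play the role of factor loadings over frequency bands, and `scores` would be the kind of reduced representation passed on to a classifier.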
Abstract
1. Introduction
2. Materials and Methods
2.1. The General Outline of the Experiment
2.2. Speech Samples Preparation and Processing
2.3. Principal Components Analysis
2.4. Kohonen Network Application
2.5. Recognition Process and Results Assessment
3. Results and Discussion
3.1. Frequency Ranges Contribution to the PCA Model
3.2. The Attempt at an Interpretation of the Role of Particular Principal Components in the Description of the Speech Signal
3.3. Distance Calculation
3.4. Classification Results
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Disfluency Type | People Who Stutter: Number | People Who Stutter: Gender | People Who Stutter: Age Range (years) | Fluent Speakers: Number | Fluent Speakers: Gender | Fluent Speakers: Age Range (years) |
---|---|---|---|---|---|---|
blocks | 11 | 9M and 2F | 10–23 | 8 | 4M and 4F | 22–50 |
syllable repetitions | 6 | 4M and 2F | 11–23 | 6 | 4M and 2F | 22–53 |
prolongations | 7 | 7M | 10–25 | 4 | 2M and 2F | 24–51 |
Total | 19 | 16M and 3F | 10–25 | 14 | 9M and 5F | 22–53 |

Sample Type | Number of Samples: Training Set | Number of Samples: Validation Set | Number of Samples: Test Set | Total |
---|---|---|---|---|
blocks | 37 | 9 | 9 | 55 |
syllable repetitions | 36 | 5 | 5 | 46 |
prolongations | 42 | 5 | 12 | 59 |
fluent | 25 | 10 | 3 | 38 |
Total | 140 | 29 | 29 | 198 |

Feature Extraction Method | Recognition Rate (%): Training | Recognition Rate (%): Validation | Recognition Rate (%): Test |
---|---|---|---|
PCA | 92.14 | 72.41 | 75.86 |
SOM | 59.29 | 55.17 | 51.72 |

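As a sanity check on the recognition rates, the short sketch below converts each percentage into the whole number of correctly classified samples it implies. It assumes, without confirmation from the article text, that the rates were computed over the set sizes of 140, 29, and 29 samples reported in the sample-count table; the helper name `implied_correct` is hypothetical.

```python
# Set sizes from the sample-count table; pairing them with the rates
# below is an assumption made for this check.
set_sizes = {"Training": 140, "Validation": 29, "Test": 29}

# Recognition rates (%) as reported for each feature extraction method.
rates = {
    "PCA": {"Training": 92.14, "Validation": 72.41, "Test": 75.86},
    "SOM": {"Training": 59.29, "Validation": 55.17, "Test": 51.72},
}

def implied_correct(rate_percent, n_samples):
    """Whole number of correct samples closest to the reported rate."""
    return round(rate_percent / 100 * n_samples)

counts = {
    method: {
        name: implied_correct(rate, set_sizes[name])
        for name, rate in per_set.items()
    }
    for method, per_set in rates.items()
}

for method, per_set in counts.items():
    for name, k in per_set.items():
        n = set_sizes[name]
        print(f"{method} {name}: {k}/{n} = {100 * k / n:.2f}%")
```

Under that assumption every reported rate corresponds to an integer count (for example, 22 of 29 test samples for PCA), which suggests the percentages are plain ratios of correct to total samples.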
Feature Extraction Method | acc | ε |
---|---|---|
PCA | 0.76 | 0.24 |
SOM | 0.52 | 0.48 |

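The two columns above appear to be complementary. The sketch below checks this under two assumptions that the tables themselves do not state: that acc is the test-set recognition rate expressed as a fraction and rounded to two decimals, and that ε is the error rate, ε = 1 − acc.

```python
# Test-set recognition rates (%) from the recognition-rate table.
test_rate_percent = {"PCA": 75.86, "SOM": 51.72}

# (acc, ε) pairs as reported.
reported = {"PCA": (0.76, 0.24), "SOM": (0.52, 0.48)}

derived = {}
for method, rate in test_rate_percent.items():
    acc = round(rate / 100, 2)   # assumed: acc is the rounded test rate
    eps = round(1 - acc, 2)      # assumed: ε is the error rate 1 - acc
    derived[method] = (acc, eps)
    print(f"{method}: acc={acc:.2f}, eps={eps:.2f}")
```

Both derived pairs match the reported values, which is consistent with that reading of the symbols.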
Feature Extraction Method | Accuracy [%]: Blocks | Accuracy [%]: Syllable Repetitions | Accuracy [%]: Prolongations | Accuracy [%]: Fluent |
---|---|---|---|---|
PCA | 71.43 | 50.00 | 90.91 | 71.43 |
SOM | 57.14 | 75.00 | 72.73 | 0.00 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Świetlicka, I.; Kuniszyk-Jóźkowiak, W.; Świetlicki, M. Artificial Neural Networks Combined with the Principal Component Analysis for Non-Fluent Speech Recognition. Sensors 2022, 22, 321. https://doi.org/10.3390/s22010321