Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG

Published: 15 August 2023

Abstract

Speech signals are more susceptible to emotional influences and acoustic interference than other forms of communication. Real-time speech processing applications therefore struggle with noisy, emotion-laden speech data, and finding a reliable method to separate the dominant signal from outside interference remains a challenge. An ideal system should precisely identify the relevant auditory events in a complex scene captured under adverse conditions. In this work, we propose and evaluate an end-to-end framework for speaker identification in adverse talking conditions that combines a pre-trained Deep Neural Network mask for voice segregation with a speech VGG classifier. The framework offers a distinctive approach to speaker recognition under challenging circumstances, including emotion and interference. Evaluated on emotional speech data in English and Arabic, the proposed model outperformed recent literature, reporting average speaker identification rates of 85.2%, 87.0%, and 86.6% on the Ryerson audio–visual database of emotional speech and song (RAVDESS), the Speech Under Simulated and Actual Stress (SUSAS) dataset, and the Emirati-accented Speech Dataset (ESD), respectively.
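The abstract describes a two-stage pipeline: a pre-trained DNN estimates a mask that segregates the target voice from noise and emotional interference, and the masked signal is then fed to a speech-VGG classifier for speaker identification. Below is a minimal sketch of how such a pipeline might be wired together in PyTorch; the module names, layer sizes, and the 24-speaker output (matching the 24 actors in RAVDESS) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: a soft time-frequency mask estimator (the
# segregation front end) followed by a small VGG-style classifier over the
# masked spectrogram. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class MaskEstimator(nn.Module):
    """Predicts a [0, 1] ratio mask for each time-frequency bin."""

    def __init__(self, n_freq: int = 257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, 512), nn.ReLU(),
            nn.Linear(512, n_freq), nn.Sigmoid(),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, freq) magnitude spectrogram
        return self.net(spec)


class SpeechVGGClassifier(nn.Module):
    """VGG-like convolutional stack mapping a spectrogram to speaker logits."""

    def __init__(self, n_speakers: int = 24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, n_speakers),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(spec.unsqueeze(1)))  # add channel dim


# Segregate first, then identify the speaker from the cleaned spectrogram.
noisy_spec = torch.rand(8, 100, 257)          # dummy batch of magnitude spectrograms
clean_est = noisy_spec * MaskEstimator()(noisy_spec)
speaker_logits = SpeechVGGClassifier()(clean_est)
print(speaker_logits.shape)                   # torch.Size([8, 24])
```

In practice the mask estimator and the VGG-style feature extractor would each be pre-trained (the latter on a large speech corpus) before fine-tuning the classifier on the target speakers.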

Highlights

A novel method is proposed to enhance speaker identification in adverse talking conditions.
The proposal combines a pre-trained Deep Neural Network mask with a speech VGG classifier.
The proposed framework is evaluated on three diverse speech databases.
The proposed model achieves superior performance over the recent literature.



Published In

Expert Systems with Applications: An International Journal, Volume 224, Issue C
August 2023
1483 pages

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 15 August 2023

Author Tags

  1. Deep Neural Network
  2. Emotional talking conditions
  3. Feature extraction
  4. Noise reduction
  5. Speaker identification
  6. Speech segregation

Qualifiers

  • Research-article
