Interpretable Probabilistic Identification of Depression in Speech
Figure 1. MAP-based adaptation of the $k$-th component of model $\mathcal{M}$ using class-specific observations $R$.
Figure 2. The topologies, including the transition probabilities, of the HMMs constructed to address the Interview and Reading tasks.
Figure 3. Effect of the number of HMM states on the F1-score for the Reading and Interview tasks.
Figure 4. The probabilities output by the UBM-HMM on recordings representing both Healthy and Control subjects with respect to the Reading and Interview tasks.
Abstract
1. Introduction
2. Problem Formalization
3. The Proposed Solution
3.1. Feature Extraction
3.1.1. Mel-Frequency Cepstral Coefficients
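As an illustration of the MFCC_D_A feature configuration referenced in the ablation tables later in this article (static MFCCs plus delta and acceleration coefficients), here is a minimal sketch; librosa, the file name, and the frame settings are assumptions for the example, not the paper's stated setup.

```python
# Sketch: MFCCs with delta (D) and acceleration (A) coefficients,
# i.e., the MFCC_D_A configuration. Library and settings are assumed.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)          # assumed file and sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # static coefficients
delta = librosa.feature.delta(mfcc)                   # first derivatives (D)
accel = librosa.feature.delta(mfcc, order=2)          # second derivatives (A)
features = np.vstack([mfcc, delta, accel])            # (39, n_frames) matrix
```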
3.1.2. Teager Energy Operator Autocorrelation Envelope
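The Teager energy operator (TEO) admits a simple discrete form, $\Psi[x(n)] = x^2(n) - x(n-1)\,x(n+1)$ [14]. The sketch below computes a per-frame normalized autocorrelation envelope over the TEO profile; the framing and lag count are illustrative assumptions, and the paper's exact feature (following Zhou et al. [14]) may differ, e.g., in its per-band processing.

```python
# Sketch: discrete Teager energy operator and a simple per-frame
# autocorrelation envelope over it. Framing choices are assumptions.
import numpy as np

def teager(x):
    """Discrete TEO: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def teo_autocorr_env(x, frame=400, hop=160, lags=10):
    """Normalized autocorrelation (first `lags` lags) of the TEO per frame."""
    feats = []
    for start in range(0, len(x) - frame, hop):
        psi = teager(x[start:start + frame])
        psi = psi - psi.mean()
        ac = np.correlate(psi, psi, mode="full")[len(psi) - 1:]  # non-negative lags
        feats.append(ac[:lags] / (ac[0] + 1e-12))                # normalized envelope
    return np.array(feats)
```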
3.1.3. Fundamental Frequency/Pitch and Harmonic Ratio
3.2. Universal Probabilistic Model
3.3. Hidden Markov Model
- the number of states S,
- the PDFs approximating the states’ emission distributions via a Gaussian mixture model (GMM), $b_i(x) = \sum_{m=1}^{M} w_{i,m}\,\mathcal{N}(x; \mu_{i,m}, \Sigma_{i,m})$, where $w_{i,m}$ are the weights associated with each mixture component, $x$ is the feature vector, $\mathcal{N}(x; \mu_{i,m}, \Sigma_{i,m})$ denotes the Gaussian with mean $\mu_{i,m}$ and covariance $\Sigma_{i,m}$, and $\sum_{m=1}^{M} w_{i,m} = 1$,
- the state transition probability matrix $A = \{a_{j,i}\}$, where entry $a_{j,i} = P(q_{t+1} = i \mid q_t = j)$ comprises the probability of moving from state $j$ at time $t$ to state $i$ at time $t+1$, and
- the initial state distribution $\pi = \{\pi_i\}$, where $\pi_i$ is the probability that the HMM starts in state $i$, i.e., $\pi_i = P(q_1 = i)$ (a minimal code sketch of this parameterization follows the list).
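The following sketch instantiates this parameterization (states $S$, GMM emissions, $A$, $\pi$) using the hmmlearn library; hmmlearn, the placeholder data, and the chosen values of $S$ and $M$ are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: an ergodic HMM with GMM emissions, matching the
# parameterization above. Library and all values are illustrative.
import numpy as np
from hmmlearn.hmm import GMMHMM

S, M = 3, 4                            # states and Gaussians per state (assumed)
hmm = GMMHMM(n_components=S,           # S hidden states
             n_mix=M,                  # M-component GMM per state
             covariance_type="diag",
             n_iter=50, random_state=0)

X = np.random.randn(500, 39)           # placeholder feature matrix (frames x dims)
lengths = [250, 250]                   # frames per recording
hmm.fit(X, lengths)                    # Baum-Welch estimates pi, A, and the GMMs

print(hmm.startprob_)                  # initial state distribution pi
print(hmm.transmat_)                   # transition matrix A
print(hmm.score(X[:250]))              # log-likelihood of one recording
```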
3.4. Correlation-Based k-Medoids for Selecting Training and Adaptation Data
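As an illustration of correlation-based k-medoids, the sketch below runs a naive alternating k-medoids over a distance matrix defined as one minus the Pearson correlation between per-recording feature summaries; both the distance definition and the implementation are assumptions made for the example.

```python
# Sketch: k-medoids over a correlation-based distance matrix.
# F holds one summary feature vector per recording; the distance
# 1 - Pearson correlation is an assumed choice for illustration.
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Naive alternating k-medoids; D is an (N, N) distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)      # nearest-medoid assignment
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)      # cluster j (never empty:
            within = D[np.ix_(members, members)]       # its medoid assigns to it)
            new[j] = members[np.argmin(within.sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, labels

F = np.random.randn(40, 39)                            # placeholder summaries
D = 1.0 - np.corrcoef(F)                               # correlation distance
medoids, labels = k_medoids(D, k=4)
```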
3.5. Maximum A Posteriori Adaptation (MAP)
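A common variant of MAP adaptation (after Reynolds et al. [28]) updates only the component means via the relevance-MAP rule $\hat{\mu}_k = \alpha_k E_k(x) + (1-\alpha_k)\mu_k$ with $\alpha_k = n_k/(n_k + r)$. The sketch below applies this update to one GMM (e.g., one HMM state); the relevance factor $r$ is an assumed hyperparameter, and this is an illustration rather than the paper's exact procedure.

```python
# Sketch: relevance-MAP adaptation of GMM means for one state, after
# Reynolds et al. [28]. r is the relevance factor (assumed hyperparameter).
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_means(X, weights, means, covs, r=16.0):
    """X: (T, D) class-specific observations; weights: (K,);
    means: (K, D); covs: (K, D) diagonal covariances."""
    K = len(weights)
    # posterior responsibility of each component for each frame
    post = np.stack([weights[k] * multivariate_normal.pdf(X, means[k], np.diag(covs[k]))
                     for k in range(K)], axis=1)
    post /= post.sum(axis=1, keepdims=True)
    n = post.sum(axis=0) + 1e-12            # soft counts n_k
    Ex = (post.T @ X) / n[:, None]          # first-order statistics E_k[x]
    alpha = n / (n + r)                     # data-dependent adaptation weights
    return alpha[:, None] * Ex + (1 - alpha[:, None]) * means
```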
3.6. Classification of Speech Depression
Algorithm 1: The speech classification algorithm.
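As a hedged sketch of the decision rule that Algorithm 1's caption suggests, a test recording can be scored under each MAP-adapted HMM and assigned to the class with the highest log-likelihood; the model objects and names below are illustrative, not the paper's API.

```python
# Hedged sketch of the classification step: score the test recording
# under each class-adapted HMM and return the maximum-likelihood label.
import numpy as np

def classify(features: np.ndarray, adapted_models: dict) -> str:
    """features: (n_frames, n_dims) matrix for one recording;
    adapted_models: label -> HMM exposing .score() (log-likelihood)."""
    scores = {label: m.score(features) for label, m in adapted_models.items()}
    return max(scores, key=scores.get)

# e.g., label = classify(feats, {"depressed": hmm_dep, "healthy": hmm_hc})
```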
4. Experimental Protocol and Analysis of the Results
4.1. The Employed Dataset
4.2. Parameterization of the Proposed Solution
4.3. Experimental Results and Analysis
4.4. Testing Model Generalization Capabilities Across Tasks
4.5. Ablation Study
5. Interpretation of the Model’s Predictions and Interaction with Medical Experts
6. Conclusions
- consider hybrid HMMs, i.e., with emission probabilities based on neural networks [49],
- modify the current framework to suit applications with similar specifications,
- integrate additional mental states and develop appropriate sets of features,
- explore temporal integration methodologies (statistics, spectral moments, autoregressive models, etc.) at the feature level, given that mental states tend not to change rapidly over time,
- examine the efficiency of identification in small data environments, specifically addressing scenarios with limited data availability, such as having few or even just one training sample per class, and
- improve the capabilities of the interpretability module, with a particular emphasis on ensuring user-friendliness and garnering acceptance among medical experts.
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mental Disorders. Available online: https://www.who.int/news-room/fact-sheets/detail/mental-disorders (accessed on 25 November 2024).
- Trautmann, S.; Rehm, J.; Wittchen, H. The economic costs of mental disorders: Do our societies react appropriately to the burden of mental disorders? EMBO Rep. 2016, 17, 1245–1249.
- Low, D.M.; Bentley, K.H.; Ghosh, S.S. Automated assessment of psychiatric disorders using speech: A systematic review. Laryngoscope Investig. Otolaryngol. 2020, 5, 96–116.
- Kapitány-Fövény, M.; Vetró, M.; Révy, G.; Fabó, D.; Szirmai, D.; Hullám, G. EEG based depression detection by machine learning: Does inner or overt speech condition provide better biomarkers when using emotion words as experimental cues? J. Psychiatr. Res. 2024, 178, 66–76.
- Yasin, S.; Othmani, A.; Raza, I.; Hussain, S.A. Machine learning based approaches for clinical and non-clinical depression recognition and depression relapse prediction using audiovisual and EEG modalities: A comprehensive review. Comput. Biol. Med. 2023, 159, 106741.
- Williamson, J.R.; Godoy, E.; Cha, M.; Schwarzentruber, A.; Khorrami, P.; Gwon, Y.; Kung, H.T.; Dagli, C.; Quatieri, T.F. Detecting Depression using Vocal, Facial and Semantic Communication Cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 15–19 October 2016; pp. 11–18.
- Yang, L.; Li, Y.; Chen, H.; Jiang, D.; Oveneke, M.C.; Sahli, H. Bipolar Disorder Recognition with Histogram Features of Arousal and Body Gestures. In Proceedings of the 2018 Audio/Visual Emotion Challenge and Workshop, Seoul, Republic of Korea, 22 October 2018; pp. 15–21.
- Shen, Y.; Yang, H.; Lin, L. Automatic Depression Detection: An Emotional Audio-Textual Corpus and a GRU/BiLSTM-Based Model. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6247–6251.
- Cummins, N.; Sethu, V.; Epps, J.; Williamson, J.R.; Quatieri, T.F.; Krajewski, J. Generalized Two-Stage Rank Regression Framework for Depression Score Prediction from Speech. IEEE Trans. Affect. Comput. 2020, 11, 272–283.
- Huang, Z.; Epps, J.; Joachim, D.; Chen, M. Depression Detection from Short Utterances via Diverse Smartphones in Natural Environmental Conditions. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3393–3397.
- Wu, P.; Wang, R.; Lin, H.; Zhang, F.; Tu, J.; Sun, M. Automatic depression recognition by intelligent speech signal processing: A systematic survey. CAAI Trans. Intell. Technol. 2022, 8, 701–711.
- Verde, L.; Raimo, G.; Vitale, F.; Carbonaro, B.; Cordasco, G.; Marrone, S.; Esposito, A. A Lightweight Machine Learning Approach to Detect Depression from Speech Analysis. In Proceedings of the 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI), Washington, DC, USA, 1–3 November 2021; pp. 330–335.
- Li, Y.; Lin, Y.; Ding, H.; Li, C. Speech databases for mental disorders: A systematic review. Gen. Psychiatry 2019, 32, e100022.
- Zhou, G.; Hansen, J.; Kaiser, J. Nonlinear feature based classification of speech under stress. IEEE Trans. Speech Audio Process. 2001, 9, 201–216.
- Tao, F.; Esposito, A.; Vinciarelli, A. The Androids Corpus: A New Publicly Available Benchmark for Speech Based Depression Detection. In Proceedings of INTERSPEECH 2023, Dublin, Ireland, 20–24 August 2023; pp. 4149–4153.
- Ntalampiras, S. Toward Language-Agnostic Speech Emotion Recognition. J. Audio Eng. Soc. 2020, 68, 7–13.
- Mantegazza, I.; Ntalampiras, S. Italian Speech Emotion Recognition. In Proceedings of the 2023 24th International Conference on Digital Signal Processing (DSP), Rhodes, Greece, 11–13 June 2023; pp. 1–5.
- Ntalampiras, S. Generalized Sound Recognition in Reverberant Environments. J. Audio Eng. Soc. 2019, 67, 772–781.
- Hidayat, R.; Bejo, A.; Sumaryono, S.; Winursito, A. Denoising Speech for MFCC Feature Extraction Using Wavelet Transformation in Speech Recognition System. In Proceedings of the 2018 10th International Conference on Information Technology and Electrical Engineering (ICITEE), Bali, Indonesia, 24–26 July 2018; pp. 280–284.
- Ariyanti, W.; Liu, K.C.; Chen, K.Y.; Tsao, Y. Abnormal Respiratory Sound Identification Using Audio-Spectrogram Vision Transformer. In Proceedings of the 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; pp. 1–4.
- Ntalampiras, S. Collaborative framework for automatic classification of respiratory sounds. IET Signal Process. 2020, 14, 223–228.
- Poirè, A.M.; Simonetta, F.; Ntalampiras, S. Deep Feature Learning for Medical Acoustics. In Artificial Neural Networks and Machine Learning—ICANN 2022; Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 39–50.
- Nosan, A.; Sitjongsataporn, S. Speech Recognition Approach using Descend-Delta-Mean and MFCC Algorithm. In Proceedings of the 2019 16th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Chonburi, Thailand, 10–13 July 2019; pp. 381–384.
- Yang, L.; Zhao, Y.; Wang, Y.; Liu, L.; Zhang, X.; Li, B.; Cui, R. The Effects of Psychological Stress on Depression. Curr. Neuropharmacol. 2015, 13, 494–504.
- Atal, B.S. Automatic Speaker Recognition Based on Pitch Contours. J. Acoust. Soc. Am. 1972, 52, 1687–1697.
- McRoberts, G.W.; Studdert-Kennedy, M.; Shankweiler, D.P. The role of fundamental frequency in signaling linguistic stress and affect: Evidence for a dissociation. Percept. Psychophys. 1995, 57, 159–174.
- France, D.; Shiavi, R.; Silverman, S.; Silverman, M.; Wilkes, M. Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Trans. Biomed. Eng. 2000, 47, 829–837.
- Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker Verification Using Adapted Gaussian Mixture Models. Digit. Signal Process. 2000, 10, 19–41.
- Rabiner, L.R.; Juang, B.H. An introduction to hidden Markov models. IEEE ASSP Mag. 1986, 3, 4–15.
- Durbin, R.; Eddy, S.R.; Krogh, A.; Mitchison, G.J. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, UK, 1998.
- Ntalampiras, S. Identification of Anomalous Phonocardiograms Based on Universal Probabilistic Modeling. IEEE Lett. Comput. Soc. 2020, 3, 50–53.
- Kaufman, L.; Rousseeuw, P. Clustering by means of medoids. In Statistical Data Analysis Based on the L1-Norm and Related Methods; Dodge, Y., Ed.; North-Holland: Amsterdam, The Netherlands, 1987; pp. 405–416.
- Neto, A.J.; Pacheco, A.G.C.; Luvizon, D.C. Improving Deep Learning Sound Events Classifiers Using Gram Matrix Feature-Wise Correlations. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3780–3784.
- Ntalampiras, S. Moving Vehicle Classification Using Wireless Acoustic Sensor Networks. IEEE Trans. Emerg. Top. Comput. Intell. 2018, 2, 129–138.
- Ntalampiras, S.; Potamitis, I. Canonical correlation analysis for classifying baby crying sound events. In Proceedings of the 22nd International Congress on Sound and Vibration, Florence, Italy, 12–16 July 2015; pp. 1–7.
- Ntalampiras, S. A Novel Holistic Modeling Approach for Generalized Sound Recognition. IEEE Signal Process. Lett. 2013, 20, 185–188.
- Sun, L.; Ji, S.; Ye, J. Canonical Correlation Analysis for Multilabel Classification: A Least-Squares Formulation, Extensions, and Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 194–200.
- Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 3rd ed.; Academic Press: Orlando, FL, USA, 2006.
- Young, S.; Evermann, G.; Gales, M.; Hain, T.; Kershaw, D.; Liu, X.; Moore, G.; Odell, J.; Ollason, D.; Povey, D.; et al. The HTK Book, Version 3.4; Cambridge University Engineering Department: Cambridge, UK, 2006.
- Kessler, R.C.; Berglund, P.; Demler, O.; Jin, R.; Koretz, D.; Merikangas, K.R.; Rush, A.J.; Walters, E.E.; Wang, P.S. The Epidemiology of Major Depressive Disorder: Results From the National Comorbidity Survey Replication (NCS-R). JAMA 2003, 289, 3095–3105.
- Torch Machine Learning Library. Available online: http://torch.ch/ (accessed on 16 February 2025).
- Gerczuk, M.; Amiriparian, S.; Ottl, S.; Schuller, B.W. EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2023, 14, 1472–1487.
- Ntalampiras, S. Transfer Learning for Generalized Audio Signal Processing. In Handbook of Artificial Intelligence for Music; Springer International Publishing: Cham, Switzerland, 2021; pp. 679–691.
- Perikos, I.; Kardakis, S.; Hatzilygeroudis, I. Sentiment analysis using novel and interpretable architectures of Hidden Markov Models. Knowl. Based Syst. 2021, 229, 107332.
- Ntalampiras, S.; Potamitis, I. A Statistical Inference Framework for Understanding Music-Related Brain Activity. IEEE J. Sel. Top. Signal Process. 2019, 13, 275–284.
- Tjoa, E.; Guan, C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4793–4813.
- Zhang, Y.; Tino, P.; Leonardis, A.; Tang, K. A Survey on Neural Network Interpretability. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 5, 726–742.
- Akman, A.; Schuller, B.W. Audio Explainable Artificial Intelligence: A Review. Intell. Comput. 2024, 3, 74.
- Razavi, M.; Rasipuram, R.; Magimai-Doss, M. On modeling context-dependent clustered states: Comparing HMM/GMM, hybrid HMM/ANN and KL-HMM approaches. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 7659–7663.
| Band ID | Lower (Hz) | Center (Hz) | Upper (Hz) |
|---|---|---|---|
| 1 | 100 | 250 | 400 |
| 2 | 400 | 500 | 600 |
| 3 | 600 | 700 | 800 |
| 4 | 800 | 910 | 1020 |
| 5 | 1020 | 1140 | 1260 |
| 6 | 1260 | 1400 | 1540 |
| 7 | 1540 | 1690 | 1840 |
| 8 | 1840 | 2000 | 2160 |
| 9 | 2160 | 2350 | 2540 |
| 10 | 2540 | 2750 | 2960 |
| 11 | 2960 | 3200 | 3440 |
| 12 | 3440 | 3720 | 4000 |
| 13 | 4000 | 4310 | 4620 |
| 14 | 4620 | 5010 | 5400 |
| 15 | 5400 | 5850 | 6300 |
| 16 | 6300 | 6850 | 7400 |
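For illustration, the tabulated (lower, center, upper) edges can be turned into triangular filters on an FFT bin grid as sketched below; the sample rate and FFT size are assumptions made for the example, not values stated in the table.

```python
# Sketch: triangular filters from the (lower, center, upper) edges above.
# Sample rate and FFT size are illustrative assumptions.
import numpy as np

bands = [(100, 250, 400), (400, 500, 600), (600, 700, 800), (800, 910, 1020),
         (1020, 1140, 1260), (1260, 1400, 1540), (1540, 1690, 1840),
         (1840, 2000, 2160), (2160, 2350, 2540), (2540, 2750, 2960),
         (2960, 3200, 3440), (3440, 3720, 4000), (4000, 4310, 4620),
         (4620, 5010, 5400), (5400, 5850, 6300), (6300, 6850, 7400)]

def triangular_filterbank(bands, sr=16000, n_fft=512):
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)     # frequency of each FFT bin
    fb = np.zeros((len(bands), freqs.size))
    for i, (lo, c, hi) in enumerate(bands):
        rise = (freqs - lo) / (c - lo)             # 0 -> 1 over [lo, c]
        fall = (hi - freqs) / (hi - c)             # 1 -> 0 over [c, hi]
        fb[i] = np.clip(np.minimum(rise, fall), 0.0, 1.0)
    return fb

fb = triangular_filterbank(bands)                  # apply as fb @ power_spectrum
```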
Reading Task

| Approach | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| Random | 50.1 | 51.8 | 51.8 | 51.8 |
| UBM-HMM | 96.2 ± 1.2 | 99.1 ± 0.7 | 93.5 ± 0.9 | 96.2 ± 1.1 |
| Linear SVM | 69.6 ± 5.3 | 73.6 ± 19.1 | 68.8 ± 12.0 | 68.4 ± 7.7 |
| LSTM [15] | 84.4 ± 1.1 | 84.5 ± 2.1 | 85.6 ± 2.8 | 83.7 ± 1.1 |
Interview Task

| Approach | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| Random | 50.5 | 55.2 | 55.2 | 55.2 |
| UBM-HMM | 87.0 ± 1.2 | 85.8 ± 1.9 | 92.3 ± 1.8 | 88.9 ± 1.7 |
| Linear SVM | 73.3 ± 10.6 | 73.5 ± 16.1 | 74.5 ± 13.2 | 73.6 ± 13.6 |
| LSTM [15] | 83.9 ± 1.3 | 85.8 ± 3.1 | 86.1 ± 2.7 | 84.7 ± 0.9 |
Cross-task generalization (Section 4.4)

| Task | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| Interview | 68.5 ± 5.9 | 73.3 ± 5.4 | 66.9 ± 4.8 | 70.0 ± 6.1 |
| Reading | 80.0 ± 3.7 | 76.5 ± 4.1 | 99.2 ± 0.5 | 86.4 ± 3.5 |
Reading Task

| Approach | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| UBM-HMM MFCC | 91.3 ± 1.9 | 99.0 ± 0.8 | 84.6 ± 3.4 | 91.2 ± 2.1 |
| UBM-HMM MFCC_D | 92.1 ± 2.1 | 99.1 ± 0.9 | 85.7 ± 3.1 | 91.9 ± 2.5 |
| UBM-HMM MFCC_D_A | 96.2 ± 1.2 | 99.1 ± 0.7 | 93.5 ± 0.9 | 96.2 ± 1.1 |
| UBM-HMM MFCC_D_A_TEO | 92.4 ± 2.2 | 99.1 ± 0.9 | 87.0 ± 2.5 | 92.6 ± 1.8 |
| UBM-HMM MFCC_D_A_TEO_F0_HR | 93.1 ± 2.1 | 99.1 ± 0.9 | 88.2 ± 1.9 | 93.3 ± 1.7 |
| class-specific HMM MFCC_D_A | 85.3 ± 3.2 | 84.2 ± 2.9 | 86.1 ± 3.6 | 85.1 ± 2.9 |
| left-right UBM-HMM MFCC_D_A | 83.8 ± 4.1 | 83.9 ± 3.5 | 82.8 ± 4.3 | 83.3 ± 3.9 |
| clustering UBM-HMM MFCC_D_A | 95.0 ± 1.8 | 95.8 ± 2.0 | 93.9 ± 2.3 | 94.8 ± 1.5 |
Interview Task

| Approach | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| UBM-HMM MFCC | 69.6 ± 4.7 | 87.5 ± 5.3 | 53.8 ± 4.5 | 66.7 ± 5.1 |
| UBM-HMM MFCC_D | 78.3 ± 4.2 | 90.0 ± 4.4 | 69.2 ± 3.9 | 78.3 ± 3.8 |
| UBM-HMM MFCC_D_A | 79.6 ± 4.0 | 92.4 ± 4.2 | 70.3 ± 3.8 | 79.8 ± 3.5 |
| UBM-HMM MFCC_D_A_TEO | 84.6 ± 3.5 | 95.4 ± 2.7 | 71.3 ± 2.6 | 81.6 ± 3.1 |
| UBM-HMM MFCC_D_A_TEO_F0_HR | 87.0 ± 1.2 | 85.8 ± 1.9 | 92.3 ± 1.8 | 88.9 ± 1.7 |
| class-specific HMM MFCC_D_A_TEO_F0_HR | 77.4 ± 5.1 | 84.3 ± 4.7 | 68.9 ± 5.2 | 75.8 ± 4.3 |
| left-right UBM-HMM MFCC_D_A_TEO_F0_HR | 84.3 ± 2.7 | 85.1 ± 2.8 | 80.9 ± 2.4 | 82.9 ± 2.2 |
| clustering UBM-HMM MFCC_D_A_TEO_F0_HR | 88.5 ± 2.4 | 88.1 ± 1.9 | 88.6 ± 2.0 | 87.3 ± 2.1 |
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ntalampiras, S. Interpretable Probabilistic Identification of Depression in Speech. Sensors 2025, 25, 1270. https://doi.org/10.3390/s25041270