DOI: 10.1145/3459637.3481895

AudiBERT: A Deep Transfer Learning Multimodal Classification Framework for Depression Screening

Published: 30 October 2021

Abstract

Depression is a leading cause of disability with tremendous socioeconomic costs. In spite of early detection being crucial to improving prognosis, this mental illness remains largely undiagnosed. Depression classification from voice holds the promise to revolutionize diagnosis by ubiquitously integrating this screening capability into virtual assistants and smartphone technologies. Unfortunately, due to privacy concerns, audio datasets with depression labels have a small number of participants, causing current classification models to suffer from low performance. To tackle this challenge, we introduce Audio-Assisted BERT (AudiBERT), a novel deep learning framework that leverages the multimodal nature of human voice. To alleviate the small data problem, AudiBERT integrates pretrained audio and text representation models for the respective modalities augmented by a dual self-attention mechanism into a deep learning architecture. AudiBERT applied to depression classification consistently achieves promising performance with an increase in F1 scores between 6% and 30% compared to state-of-the-art audio and text models for 15 thematic question datasets. Using answers from medically targeted and general wellness questions, our framework achieves F1 scores of up to 0.92 and 0.86, respectively, demonstrating the feasibility of depression screening from informal dialogue using voice-enabled technologies.
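The fusion strategy the abstract describes — per-modality self-attention over pretrained audio and text representations, followed by late fusion into a classifier — can be illustrated with a minimal sketch. This is not the authors' implementation (AudiBERT builds on pretrained encoders such as wav2vec and BERT and a deep classification head); all function names, dimensions, and the random features below are hypothetical placeholders standing in for real encoder outputs.

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, dim) frame- or token-level embeddings from a pretrained encoder.
    scores = x @ x.T / np.sqrt(x.shape[1])              # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return (weights @ x).mean(axis=0)                   # attend, then mean-pool to (dim,)

def dual_attention_score(audio_emb, text_emb, w, b):
    # Dual self-attention: one attention block per modality,
    # then late fusion by concatenation and a linear head.
    fused = np.concatenate([self_attention(audio_emb), self_attention(text_emb)])
    return 1.0 / (1.0 + np.exp(-(fused @ w + b)))       # sigmoid screening score

# Stand-in features: in the real framework these would come from
# pretrained audio (e.g. wav2vec) and text (e.g. BERT) models.
rng = np.random.default_rng(0)
audio = rng.normal(size=(40, 16))   # 40 audio frames, 16-dim features
text = rng.normal(size=(12, 16))    # 12 text tokens, 16-dim features
w, b = rng.normal(size=32), 0.0
p = dual_attention_score(audio, text, w, b)
print(f"depression screening score: {p:.3f}")
```

The design point the sketch captures is that each modality is weighted internally (self-attention) before the modalities are combined, so a weak signal in one channel does not drown out the other; the actual model learns these weights end-to-end rather than using random parameters.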

Supplementary Material

MP4 File (CIKM21-afp0975.mp4)
A presentation of AudiBERT: A Deep Transfer Learning Multimodal Classification Framework for Depression Screening.





      Published In

      cover image ACM Conferences
      CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
      October 2021
      4966 pages
      ISBN:9781450384469
      DOI:10.1145/3459637
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Badges

      • Best Paper

      Author Tags

      1. audio classification
      2. depression
      3. multimodal
      4. transfer learning

      Qualifiers

      • Research-article

      Funding Sources

      • GAANN grants and AFRI Grant
      • US Department of Education

      Conference

      CIKM '21

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


      Article Metrics

      • Downloads (last 12 months): 276
      • Downloads (last 6 weeks): 25
      Reflects downloads up to 14 Dec 2024


      Cited By

      View all
      • (2024) Machine Learning for Multimodal Mental Health Detection: A Systematic Review of Passive Sensing Approaches. Sensors 24(2), 348. DOI: 10.3390/s24020348. Online publication date: 6 Jan 2024
      • (2024) Symptom Detection with Text Message Log Distributions for Holistic Depression and Anxiety Screening. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(1), 1-28. DOI: 10.1145/3643554. Online publication date: 6 Mar 2024
      • (2024) MMPF: Multimodal Purification Fusion for Automatic Depression Detection. IEEE Transactions on Computational Social Systems 11(6), 7421-7434. DOI: 10.1109/TCSS.2024.3411616. Online publication date: Dec 2024
      • (2024) Uncertainty-Aware Label Contrastive Distribution Learning for Automatic Depression Detection. IEEE Transactions on Computational Social Systems 11(2), 2979-2989. DOI: 10.1109/TCSS.2023.3311013. Online publication date: Apr 2024
      • (2024) Enhancing early depression detection with AI: a comparative use of NLP models. SICE Journal of Control, Measurement, and System Integration 17(1), 135-143. DOI: 10.1080/18824889.2024.2342624. Online publication date: 23 Apr 2024
      • (2024) How to identify patient perception of AI voice robots in the follow-up scenario? A multimodal identity perception method based on deep learning. Journal of Biomedical Informatics 160, 104757. DOI: 10.1016/j.jbi.2024.104757. Online publication date: Dec 2024
      • (2024) Behind the Screen: A Narrative Review on the Translational Capacity of Passive Sensing for Mental Health Assessment. Biomedical Materials & Devices 2(2), 778-810. DOI: 10.1007/s44174-023-00150-4. Online publication date: 22 Feb 2024
      • (2024) DPD (DePression Detection) Net: a deep neural network for multimodal depression detection. Health Information Science and Systems 12(1). DOI: 10.1007/s13755-024-00311-9. Online publication date: 12 Nov 2024
      • (2024) Cascaded cross-modal transformer for audio–textual classification. Artificial Intelligence Review 57(9). DOI: 10.1007/s10462-024-10869-1. Online publication date: 2 Aug 2024
      • (2023) Enhancing Disaster Resilience Studies: Leveraging Linked Data and Natural Language Processing for Consistent Open-Ended Interviews. CONVR 2023 - Proceedings of the 23rd International Conference on Construction Applications of Virtual Reality, 998-1009. DOI: 10.36253/979-12-215-0289-3.100
      • Show More Cited By
