DOI: 10.1145/3459637.3481895

AudiBERT: A Deep Transfer Learning Multimodal Classification Framework for Depression Screening

Published: 30 October 2021

Abstract

Depression is a leading cause of disability with tremendous socioeconomic costs. In spite of early detection being crucial to improving prognosis, this mental illness remains largely undiagnosed. Depression classification from voice holds the promise to revolutionize diagnosis by ubiquitously integrating this screening capability into virtual assistants and smartphone technologies. Unfortunately, due to privacy concerns, audio datasets with depression labels have a small number of participants, causing current classification models to suffer from low performance. To tackle this challenge, we introduce Audio-Assisted BERT (AudiBERT), a novel deep learning framework that leverages the multimodal nature of human voice. To alleviate the small data problem, AudiBERT integrates pretrained audio and text representation models for the respective modalities augmented by a dual self-attention mechanism into a deep learning architecture. AudiBERT applied to depression classification consistently achieves promising performance with an increase in F1 scores between 6% and 30% compared to state-of-the-art audio and text models for 15 thematic question datasets. Using answers from medically targeted and general wellness questions, our framework achieves F1 scores of up to 0.92 and 0.86, respectively, demonstrating the feasibility of depression screening from informal dialogue using voice-enabled technologies.
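The fusion strategy the abstract describes — per-modality self-attention over pretrained audio and text representations, followed by late fusion into a classifier — can be illustrated with a minimal sketch. This is not the authors' implementation (AudiBERT builds on pretrained encoders such as wav2vec and BERT and a deep classification head); all function names, dimensions, and the random features below are hypothetical placeholders standing in for real encoder outputs.

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, dim) frame- or token-level embeddings from a pretrained encoder.
    scores = x @ x.T / np.sqrt(x.shape[1])              # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return (weights @ x).mean(axis=0)                   # attend, then mean-pool to (dim,)

def dual_attention_score(audio_emb, text_emb, w, b):
    # Dual self-attention: one attention block per modality,
    # then late fusion by concatenation and a linear head.
    fused = np.concatenate([self_attention(audio_emb), self_attention(text_emb)])
    return 1.0 / (1.0 + np.exp(-(fused @ w + b)))       # sigmoid screening score

# Stand-in features: in the real framework these would come from
# pretrained audio (e.g. wav2vec) and text (e.g. BERT) models.
rng = np.random.default_rng(0)
audio = rng.normal(size=(40, 16))   # 40 audio frames, 16-dim features
text = rng.normal(size=(12, 16))    # 12 text tokens, 16-dim features
w, b = rng.normal(size=32), 0.0
p = dual_attention_score(audio, text, w, b)
print(f"depression screening score: {p:.3f}")
```

The design point the sketch captures is that each modality is weighted internally (self-attention) before the modalities are combined, so a weak signal in one channel does not drown out the other; the actual model learns these weights end-to-end rather than using random parameters.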

Supplementary Material

MP4 File (CIKM21-afp0975.mp4)
A presentation of AudiBERT: A Deep Transfer Learning Multimodal Classification Framework for Depression Screening.





      Published In

      cover image ACM Conferences
      CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
      October 2021
      4966 pages
      ISBN:9781450384469
      DOI:10.1145/3459637
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Badges

      • Best Paper

      Author Tags

      1. audio classification
      2. depression
      3. multimodal
      4. transfer learning

      Qualifiers

      • Research-article

      Funding Sources

      • GAANN grants and AFRI Grant
      • US Department of Education

      Conference

      CIKM '21

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


      Article Metrics

      • Downloads (last 12 months): 276
      • Downloads (last 6 weeks): 25
      Reflects downloads up to 14 Dec 2024


      Cited By

      View all
      • (2024) Machine Learning for Multimodal Mental Health Detection: A Systematic Review of Passive Sensing Approaches. Sensors 24(2), 348. DOI: 10.3390/s24020348. Online publication date: 6 Jan 2024
      • (2024) Symptom Detection with Text Message Log Distributions for Holistic Depression and Anxiety Screening. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(1), 1-28. DOI: 10.1145/3643554. Online publication date: 6 Mar 2024
      • (2024) MMPF: Multimodal Purification Fusion for Automatic Depression Detection. IEEE Transactions on Computational Social Systems 11(6), 7421-7434. DOI: 10.1109/TCSS.2024.3411616. Online publication date: Dec 2024
      • (2024) Uncertainty-Aware Label Contrastive Distribution Learning for Automatic Depression Detection. IEEE Transactions on Computational Social Systems 11(2), 2979-2989. DOI: 10.1109/TCSS.2023.3311013. Online publication date: Apr 2024
      • (2024) Enhancing early depression detection with AI: a comparative use of NLP models. SICE Journal of Control, Measurement, and System Integration 17(1), 135-143. DOI: 10.1080/18824889.2024.2342624. Online publication date: 23 Apr 2024
      • (2024) How to identify patient perception of AI voice robots in the follow-up scenario? A multimodal identity perception method based on deep learning. Journal of Biomedical Informatics 160, 104757. DOI: 10.1016/j.jbi.2024.104757. Online publication date: Dec 2024
      • (2024) Behind the Screen: A Narrative Review on the Translational Capacity of Passive Sensing for Mental Health Assessment. Biomedical Materials & Devices 2(2), 778-810. DOI: 10.1007/s44174-023-00150-4. Online publication date: 22 Feb 2024
      • (2024) DPD (DePression Detection) Net: a deep neural network for multimodal depression detection. Health Information Science and Systems 12(1). DOI: 10.1007/s13755-024-00311-9. Online publication date: 12 Nov 2024
      • (2024) Cascaded cross-modal transformer for audio–textual classification. Artificial Intelligence Review 57(9). DOI: 10.1007/s10462-024-10869-1. Online publication date: 2 Aug 2024
      • (2023) Enhancing Disaster Resilience Studies: Leveraging Linked Data and Natural Language Processing for Consistent Open-Ended Interviews. CONVR 2023 - Proceedings of the 23rd International Conference on Construction Applications of Virtual Reality, 998-1009. DOI: 10.36253/979-12-215-0289-3.100
      • Show More Cited By
