Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2304.06910 (eess)

[Submitted on 14 Apr 2023 (v1), last revised 9 Jan 2024 (this version, v2)]

Title: HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition

Abstract: Emotion recognition in conversations is challenging due to the multi-modal nature of emotion expression. We propose a hierarchical cross-attention model (HCAM) for multi-modal emotion recognition that combines recurrent and co-attention neural network models. The input to the model consists of two modalities: (i) audio data, processed through a learnable wav2vec approach, and (ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed using a set of bi-directional recurrent neural network layers with self-attention that convert each utterance in a given conversation to a fixed-dimensional embedding. To incorporate contextual knowledge and information across the two modalities, the audio and text embeddings are combined using a co-attention layer that attempts to weigh the utterance-level embeddings relevant to the task of emotion recognition. The neural network parameters in the audio layers, the text layers, and the multi-modal co-attention layers are hierarchically trained for the emotion classification task. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, where we illustrate that the proposed model improves significantly over other benchmarks and helps achieve state-of-the-art results on all these datasets.
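
The two attention steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain dot-product attention pooling for the per-utterance step and scaled dot-product cross-attention as a stand-in for the co-attention layer, and it omits the wav2vec and BERT front ends, the recurrent layers, and all learned projections. Every name in it (`attention_pool`, `cross_attend`, `DIM`, the random toy inputs) is hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(frames, query):
    """Collapse a variable-length list of frame vectors into one
    fixed-dimensional vector via softmax-weighted averaging.
    `query` stands in for a learned attention vector (assumption)."""
    weights = softmax([dot(f, query) for f in frames])
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def cross_attend(queries, keys_values):
    """For each utterance embedding in one modality, return a weighted
    sum of the other modality's utterance embeddings (scaled dot-product
    attention without learned projections, an assumption of this sketch)."""
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


random.seed(0)
DIM = 4  # toy feature size, not a value from the paper


def random_frames(n):
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]


# A toy conversation of three utterances; frame and token counts differ.
audio_conv = [random_frames(n) for n in (5, 8, 3)]
text_conv = [random_frames(n) for n in (6, 2, 7)]
audio_query = [random.uniform(-1, 1) for _ in range(DIM)]
text_query = [random.uniform(-1, 1) for _ in range(DIM)]

# Step 1: one fixed-dimensional embedding per utterance and modality.
audio_emb = [attention_pool(u, audio_query) for u in audio_conv]
text_emb = [attention_pool(u, text_query) for u in text_conv]

# Step 2: each modality attends over the other across the conversation.
audio_ctx = cross_attend(audio_emb, text_emb)
text_ctx = cross_attend(text_emb, audio_emb)

# Concatenate both views into one fused vector per utterance.
fused = [a + t for a, t in zip(audio_ctx, text_ctx)]
```

In this sketch, `fused` holds one vector of size `2 * DIM` per utterance regardless of how many frames or tokens the utterance had; a classifier over emotion labels would be applied on top of such vectors.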

Comments:	11 pages, 6 figures
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2304.06910 [eess.AS]
	(or arXiv:2304.06910v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2304.06910

Submission history

From: Soumya Dutta Mr
[v1] Fri, 14 Apr 2023 03:25:00 UTC (1,833 KB)
[v2] Tue, 9 Jan 2024 11:45:34 UTC (1,833 KB)
