
TYPE Original Research
PUBLISHED 22 May 2023
DOI 10.3389/fnbot.2023.1181598

Multimodal transformer augmented fusion for speech emotion recognition

Yuanyuan Wang1, Yu Gu1*, Yifei Yin2, Yingping Han1, He Zhang3, Shuang Wang1, Chenyu Li1 and Dou Quan1

1 School of Artificial Intelligence, Xidian University, Xi'an, China, 2 Guangzhou Huya Technology Co., Ltd., Guangzhou, China, 3 School of Journalism and Communication, Northwest University, Xi'an, China

OPEN ACCESS

EDITED BY
Christian Wallraven, Korea University, Republic of Korea

REVIEWED BY
Vittorio Cuculo, University of Modena and Reggio Emilia, Italy
Erwei Yin, Tianjin Artificial Intelligence Innovation Center (TAIIC), China

*CORRESPONDENCE
Yu Gu, guyu@xidian.edu.cn

RECEIVED 07 March 2023
ACCEPTED 28 April 2023
PUBLISHED 22 May 2023

CITATION
Wang Y, Gu Y, Yin Y, Han Y, Zhang H, Wang S, Li C and Quan D (2023) Multimodal transformer augmented fusion for speech emotion recognition. Front. Neurorobot. 17:1181598. doi: 10.3389/fnbot.2023.1181598

COPYRIGHT
© 2023 Wang, Gu, Yin, Han, Zhang, Wang, Li and Quan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Speech emotion recognition is challenging due to the subjectivity and ambiguity of emotion. In recent years, multimodal methods for speech emotion recognition have achieved promising results. However, due to the heterogeneity of data from different modalities, effectively integrating different modal information remains a difficulty and breakthrough point of the research. Moreover, in view of the limitations of feature-level fusion and decision-level fusion methods, capturing fine-grained modal interactions has often been neglected in previous studies. We propose a method named multimodal transformer augmented fusion that uses a hybrid fusion strategy, combining feature-level fusion and model-level fusion methods, to perform fine-grained information interaction within and between modalities. A Model-fusion module composed of three Cross-Transformer Encoders is proposed to generate multimodal emotional representations for modal guidance and information fusion. Specifically, the multimodal features obtained by feature-level fusion and the text features are used to enhance the speech features. Our proposed method outperforms existing state-of-the-art approaches on the IEMOCAP and MELD datasets.

KEYWORDS
speech emotion recognition, multimodal enhancement, hybrid fusion, modal interaction, transformer

1. Introduction
Speech emotion recognition (SER) is a branch of affective computing that aims to
determine a person’s emotional state from their speech (Ayadi et al., 2011). With the
increasing demand for human-computer interaction and emotional interaction, research
into SER has practical significance and broad application prospects. A SER system can be
used in a vehicle system to determine the driver’s psychological state and ensure the safe
operation of the vehicle in real time (Wani et al., 2021). In hospitals and other medical
facilities, an SER system can help doctors analyze patients' emotions, enhancing
communication between doctors and patients and supporting disease diagnosis
(Tao and Tan, 2005). SER has also been applied in customer service centers, such as those at train
stations, to detect the emotional state of customers in real time, which can help customer
service personnel provide more efficient and higher-quality services (Schuller, 2018). SER
is also used to help children with autism who may encounter difficulties in identifying and
expressing emotions, thereby improving their socioemotional communication skills (Marchi
et al., 2012). However, due to the complexity, subjectivity and ambiguity of emotions, it is still
a challenge to accurately recognize emotions from speech.

In recent years, the introduction of multimodal methods for SER has attracted the attention of researchers (Sebe et al., 2005). Emotion is a form of multi-channel expression, and people generally use multimodal information such as speech, text, and facial expressions to express emotions (Shimojo and Shams, 2001). Moreover, when noise occurs in one modality, the complementary information of different modalities can increase the robustness of the system. SER based on multimodal fusion information can therefore be expected to outperform SER based on speech only (Wang et al., 2022).

This paper focuses on using the fusion of speech and text modalities to improve the performance of SER. First, the text modality provides the semantic content. The semantic information is rich and direct, but it is easily affected by the speech recognition task and may therefore contain ambiguity and bias (Wu J. et al., 2021). The speech modality provides information about the tone, speed, and volume of the speech delivery. Its advantage is that it can help one perceive the speaker's emotions, but it is difficult to obtain semantic information directly from speech. Second, text can be transcribed from speech, and the text features are part of the speech features. Text and speech can complement each other well (Atmaja et al., 2022). In the event of ambiguity and bias in the text, the emotional information carried by the speaker in the speech can be used as a reference. If it is difficult to obtain semantic information from the speech, the text can provide supplementary information.

In view of the limitations of feature-level fusion and decision-level fusion methods, capturing comprehensive and fine-grained modal interaction has often been neglected in previous studies. Compared with decision-level and feature-level fusion methods, model-level fusion can better use the advantages of deep neural networks, better integrate the features of different modalities, and obtain more accurate emotional representations. In hybrid fusion, the advantages of the different fusion strategies can be combined to capture more fine-grained information on intra-modal and inter-modal interaction.

Furthermore, inspired by the attention mechanism, researchers have proposed the transformer (Vaswani et al., 2017), which has achieved promising results in the field of natural language processing (NLP). The Transformer has an excellent ability to model long-term dependencies in sequences. Although the original Transformer was proposed to solve machine translation problems in the field of NLP, researchers are studying its adaptability to the field of speech signal processing. The multi-head self-attention mechanism can learn long-term time dependence; the multi-head cross-attention mechanism can realize the fusion of different modal features at the model level and generate intermediate multimodal emotional representations from the common semantic feature space, thereby improving the accuracy of SER.

Therefore, we propose a method named multimodal transformer augmented fusion (MTAF) that uses a hybrid fusion strategy, combining feature-level fusion and model-level fusion methods. The feature-level fusion method is used to fuse the speech and text features to obtain multimodal features. Self-Transformer Encoders are then used to model the long-term time dependence of the different modal features. A Model-fusion module is proposed to generate multimodal emotional intermediate representations for modal guidance and information fusion by Cross-Transformer Encoders. Specifically, the multimodal features are used to enhance the speech and text features. The enhanced text features are then used to further enhance the speech features. Finally, the enhanced speech features are used for sentiment classification. The superior performance of MTAF over recent state-of-the-art methods is demonstrated through a variety of experiments conducted on the IEMOCAP and MELD datasets.

2. Related work

Although multimodal methods have achieved significant success in the field of SER, there are great differences between modalities, both in their relative independence and in their synchronous or asynchronous information interaction. Hence, effectively integrating the information from different modalities remains a difficulty and breakthrough point of the research (Poria et al., 2017a). In the field of multimodal emotion recognition, researchers have mainly sought to determine at what stage the model could perform the fusion of different modal features. Fusion methods can be divided into feature-level fusion (early fusion), decision-level fusion (late fusion), model-level fusion, and hybrid-level fusion.

Feature-level fusion fuses the features of various modalities (such as visual features, text features, and audio features) into general feature vectors and uses the combined features for analysis. Wu et al. proposed a new deep learning architecture, the Parallel Inception Convolutional Neural Network (PICNN). They performed convolution in parallel to process sEMG signals from six channels and then used the concatenation method to combine features of different scales directly before entering them into the remainder of the common convolutional neural network (Wu J. et al., 2021; Wu et al., 2022). Joshi et al. combined audio, video, and text features by adding them and sent them to a Transformer Encoder (Joshi et al., 2022). The advantage of feature-level fusion is that the low-level features of the data are used in the early stage. More information from the original data is used, and the task is completed based on the correlation between multimodal features (Lian et al., 2020). However, the features obtained by this fusion method belong to different modalities and may vary greatly in many respects. The features must therefore be converted to the same format before the fusion process. Moreover, this fusion method lacks information interaction within each modality, and the high-dimensional feature set may be susceptible to data sparsity problems (Lian et al., 2021), making the method prone to modal information redundancy and leading to data overfitting. The advantages of feature-level fusion are therefore limited.

To overcome this limitation, decision-level fusion uses unimodal decision values and fuses them by ensemble learning (Chen and Zhao, 2020), tensor fusion (Zadeh et al., 2017), or multiplication layer fusion (Mittal et al., 2020). In the decision-level or late fusion process, the features of each modality are examined and classified independently, and the results are fused into decision vectors to obtain the final decision. The advantage of decision-level fusion is that, compared with feature-level fusion, the fusion of decisions obtained from various modalities becomes easier, because decisions generated by multiple modalities often have the same data form (Poria et al., 2016). Another advantage of this fusion process is that each modality can use its most appropriate classifier or model to learn its features (Sun et al., 2021).
However, due to the use of different classifiers or models in the analysis task, the learning process in the decision-level fusion phase becomes cumbersome and time-consuming. Moreover, this method cannot capture more fine-grained modal dynamics, because it does not take into account the interaction and correlation between different modalities.

Model-level fusion, in contrast, fuses intermediate representations of different modalities by using various deep learning models, such as Long Short-Term Memory (LSTM), attention, and transformers. In model-level fusion, previous work has used kernel-based methods (Gönen and Alpaydın, 2011) to fuse multimodal features, showing performance improvements. Subsequently, Sutton and McCallum (2010) achieved good results by combining the ability of graphical models to compactly model diverse data with the ability of classification methods to make predictions using a large number of input features. Recently, more advanced methods use attention-based neural networks for model-level fusion. Chen and Jin (2016) proposed a multi-modal conditional attention fusion method to accomplish a continuous multimodal emotion prediction task. Their method can use the temporal information of the video combined with the historical information and the different levels of features of different modalities, and dynamically gives different weights to the visual and auditory modalities input by LSTM at each time step. Poria et al. (2017c) introduced an attention-based fusion mechanism called AT-Fusion that uses the attention score of each modality to fuse multimodal features. It amplifies higher-quality and more informative modalities in the fusion process of multimodal classification and has achieved promising results in emotion recognition. Wang et al. (2021) proposed a multimodal transformer with shared weights for SER. The proposed network shares cross-modal weights in each Transformer layer to learn the correlation between multiple modalities. However, the effect of the model-level fusion method mainly depends on the fusion model used. This method lacks fine-grained interactions within and between modalities and cannot make full use of the complementary information between modalities.

Hybrid-level fusion (Poria et al., 2017a) is a combination of the first three fusion methods and is therefore more complex. Sebastian and Pierucci (2019) proposed a combination of early and late fusion techniques, using complementary information from the speech and text modalities. Wu W. et al. (2021) proposed a dual-branch structure combining time-synchronous and time-asynchronous features for multimodal emotion recognition. A time-synchronous branch (TSB) captures the correlation between each word in each time step and its acoustic realization, while a time-asynchronous branch (TAB) integrates sentence embeddings from context sentences. Shen et al. (2020) designed a hierarchical representation of audio at the word, phoneme, and frame levels to form more emotionally relevant word-level acoustic features. Xu et al. (2020) established a hierarchical granularity and feature model, which helps to capture more subtle clues and obtain a more complete representation from the original acoustic data. The hybrid fusion method varies depending on the combination of the different fusion methods and is currently the most comprehensive fusion method. However, although the hybrid-level fusion method combines the advantages of different fusion methods and makes different modalities interact well, it increases the complexity and training difficulty.

Most of the methods above use a single fusion strategy or a single fusion model and lack fine-grained modal interactions. In contrast, our proposed method uses a variety of fusion strategies and multi-level fusion models to capture fine-grained intra-modal and inter-modal information interactions and achieve high recognition accuracy. This paper presents the multimodal transformer augmented fusion (MTAF) method for emotion recognition, focusing on the speech and text domains. The novelty of the work lies in the combination of feature-level and model-level fusion methods and the introduction of a Model-fusion module to facilitate fine-grained interactions between and within modalities. We first use feature-level fusion to perform early modal interactions between the speech and text modalities. Then, we construct three independent branches using Self-Transformer Encoders to capture the intra-modality dynamics. Finally, a Model-fusion module composed of three Cross-Transformer Encoders is used to perform late modal interactions. By using a joint model, fine-grained intermodal dynamic interactions are captured for the combined speech and text modalities.

3. Proposed method

As shown in Figure 1, we propose a method that uses both the speech and text modalities for emotion recognition. The extracted low-level features are fed in sequence to the transformer encoders. The model consists of five parts: the Speech, Text, Feature-fusion, and Model-fusion modules, and a classification layer.

3.1. Feature-level fusion

The input speech feature sequence of an utterance is represented as x_a. The text feature sequence of an utterance is represented as x_l. The multimodal feature sequence of an utterance is as follows:

    x_e = [x_a; x_l]    (1)

where [ ; ] is the concatenation operator.
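As a concrete illustration of Equation (1), feature-level fusion here is simply concatenation of the utterance-level speech and text vectors. The following is a minimal PyTorch sketch; the tensor names and batch size are illustrative, and the 199- and 1,890-dimensional sizes anticipate the features described in Section 4.2 rather than coming from released code.

```python
import torch

# Illustrative utterance-level features for a batch of 8 utterances.
x_a = torch.randn(8, 199)    # speech features (dimension per Section 4.2.1)
x_l = torch.randn(8, 1890)   # text features (dimension per Section 4.2.2)

# Equation (1): x_e = [x_a; x_l], concatenation along the feature axis.
x_e = torch.cat([x_a, x_l], dim=-1)
print(x_e.shape)  # torch.Size([8, 2089])
```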


3.2. Multimodal transformer

We first map the speech, text, and multimodal features obtained in the previous step to the same dimension through a linear layer. The features are then sent to the Self-Transformer Encoders to capture the time dependence. Finally, the Model-fusion module, which is composed of three Cross-Transformer Encoders, is used to generate multimodal emotional intermediate representations for modal guidance and information fusion. Specifically, the multimodal features are used to enhance the speech and text features. The enhanced text features are then used to further enhance the speech features.

FIGURE 1. Architecture of the MTAF model.

The core components of the Self-Transformer Encoder and the Cross-Transformer Encoder are a multi-head self-attention mechanism and a multi-head cross-attention mechanism, respectively. Each transformer encoder has n layers and m attention heads.

3.2.1. Scaled dot-product attention

The query Q, key K, and value V of the multi-head self-attention mechanism come from the same modality. However, for the multi-head cross-attention mechanism, the source modality feature is transformed into the pair of K and V, while the target modality feature is transformed into Q. We compute the matrix of outputs as follows:

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V    (2)

where d_k is the dimension of Q.
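The sketch below implements Equation (2) directly in PyTorch; the single-head form and the tensor shapes are illustrative assumptions rather than the authors' released implementation. For cross-attention, the query comes from the target modality while the keys and values come from the source modality, as described above.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Equation (2): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Illustrative 256-dimensional sequences (the dimension follows Section 4.3).
target = torch.randn(8, 50, 256)   # e.g., speech branch (provides Q)
source = torch.randn(8, 50, 256)   # e.g., multimodal branch (provides K and V)

self_out  = scaled_dot_product_attention(target, target, target)  # self-attention
cross_out = scaled_dot_product_attention(target, source, source)  # cross-attention
```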


3.2.2. Multihead attention

Multihead attention allows the model to focus on information from different representation subspaces at different positions:

    Multihead(Q, K, V) = Concat(head_1, ..., head_n) W^O    (3)

    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (4)

where n is the number of attention heads, and W^O, W_i^Q, W_i^K, and W_i^V are learned model parameters.

3.3. Classification layer

After the Model-fusion module, the final multimodal emotional intermediate representation H is passed through a fully-connected network and a softmax layer to predict the emotion class, with the cross-entropy loss as the cost function:

    ŷ = softmax(wH + b)    (5)

    Loss = -(1/N) Σ_{i=1}^{N} y_i log ŷ_i    (6)

where y_i is the true label, ŷ_i is the predicted probability distribution from the softmax layer, w and b are learned model parameters, and N is the total number of samples used in training.

4. Experiment set-up

4.1. Datasets

4.1.1. IEMOCAP
The IEMOCAP dataset (Busso et al., 2008) contains approximately 12 h of audiovisual data. We used the speech and text transcription data, which include 7,487 utterances conveying seven emotions: frustration (1,849), neutral (1,708), anger (1,103), sadness (1,084), excitement (1,041), happiness (595), and surprise (107). Excitement is incorporated into happiness. We randomly split the dataset into a training (80%) and a test (20%) set.

4.1.2. MELD
The MELD dataset (Poria et al., 2018) is a new multimodal dataset for emotion recognition. It consists of 13,708 utterances with seven emotions (anger, disgust, fear, joy, neutral, sadness, and surprise) taken from 1,433 dialogues from the classic TV series Friends. The whole dataset is divided into training, validation, and test sets. In this work, we only use the training and test sets.

4.2. Speech and text features

4.2.1. Speech features
Librosa (McFee et al., 2015), a Python package, was used to extract utterance-level speech features. Features with a total of 199 dimensions were extracted, including Mel-Frequency Cepstral Coefficients (MFCC), chroma, pitch, zero-crossing rate, and spectral features, together with their statistical measures (HSDs) such as mean, standard deviation, minimum, and maximum.

4.2.2. Text features
The transcripts in the IEMOCAP and MELD datasets were used to extract a 1,890-dimensional Term Frequency-Inverse Document Frequency (TFIDF) feature vector. TFIDF is a numerical statistic that shows the correlation between a word and a document in a collection or corpus (Sahu, 2019).
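As a rough illustration of this feature-extraction setup, the sketch below computes a few utterance-level statistics with librosa and fits a TF-IDF vectorizer with scikit-learn. The exact 199-dimensional speech recipe and the 1,890-word vocabulary are not specified in full detail above, so the particular descriptors and parameters chosen here are assumptions.

```python
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer

def utterance_speech_features(wav_path):
    """Utterance-level statistics of frame-level descriptors (illustrative subset)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # (12, T)
    zcr = librosa.feature.zero_crossing_rate(y)               # (1, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # (1, T)
    frames = np.vstack([mfcc, chroma, zcr, centroid])
    # Statistical measures over time: mean, standard deviation, minimum, maximum.
    stats = [frames.mean(1), frames.std(1), frames.min(1), frames.max(1)]
    return np.concatenate(stats)

# TF-IDF text features over the utterance transcripts.
transcripts = ["I can't believe this happened", "That sounds wonderful"]
vectorizer = TfidfVectorizer(max_features=1890)  # vocabulary size is an assumption
text_features = vectorizer.fit_transform(transcripts).toarray()
```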
4.3. Implementation details

Through a linear layer, we obtain 256-dimensional speech, text, and multimodal features. We feed them into the Self-Transformer Encoders, each of which has 2 Transformer Encoder layers and 4 multi-head attention heads. Next, the three newly generated 256-dimensional vectors are sent to the Model-fusion module, which is composed of three Cross-Transformer Encoders. Q serves as the first residual part of the Cross-Transformer Encoder to perform deep interactions between modalities. The Cross-Transformer Encoder and Self-Transformer Encoder have the same number of layers and attention heads. After the Model-fusion module, a 256-dimensional emotional feature vector is finally obtained for sentiment classification.

The training procedure was implemented using PyTorch on a GTX3090. We used the Adam optimizer (Kingma and Ba, 2014), setting the learning rate to 0.0001. The batch size was 200. To alleviate overfitting, we used the dropout method with a rate of 0.4. We trained the model for at most 50,000 epochs, until the accuracy no longer changed. Weighted accuracy (WA) and unweighted accuracy (UA) were used as the evaluation metrics.
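To make the pipeline concrete, the sketch below assembles the pieces described above (256-dimensional projections, Self-Transformer Encoders with 2 layers and 4 heads, a Model-fusion stage built from cross-attention, and a softmax classifier trained with cross-entropy) into a minimal PyTorch module. It is a simplified reading of the architecture under the stated hyperparameters, not the authors' code; in particular, the Cross-Transformer Encoder is reduced to a single multi-head cross-attention block with a residual connection, each utterance is treated as a length-1 sequence, and all module names and the class count are our own assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Simplified stand-in for a Cross-Transformer Encoder: the target stream
    provides Q (and the residual path); the source stream provides K and V."""
    def __init__(self, dim=256, heads=4, dropout=0.4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):
        out, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + out)

class MTAFSketch(nn.Module):
    def __init__(self, speech_dim=199, text_dim=1890, dim=256, heads=4,
                 layers=2, num_classes=6, dropout=0.4):
        # num_classes=6 assumes IEMOCAP with excitement merged into happiness.
        super().__init__()
        self.proj_a = nn.Linear(speech_dim, dim)              # speech branch
        self.proj_l = nn.Linear(text_dim, dim)                # text branch
        self.proj_e = nn.Linear(speech_dim + text_dim, dim)   # feature-level fusion branch

        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               dropout=dropout, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers)

        self.self_a, self.self_l, self.self_e = encoder(), encoder(), encoder()

        # Model-fusion module: multimodal -> text, multimodal -> speech, text -> speech.
        self.cross_e_l = CrossAttentionBlock(dim, heads, dropout)
        self.cross_e_a = CrossAttentionBlock(dim, heads, dropout)
        self.cross_l_a = CrossAttentionBlock(dim, heads, dropout)

        self.classifier = nn.Sequential(nn.Dropout(dropout), nn.Linear(dim, num_classes))

    def forward(self, x_a, x_l):
        x_e = torch.cat([x_a, x_l], dim=-1)                   # Equation (1)
        h_a = self.self_a(self.proj_a(x_a).unsqueeze(1))      # length-1 sequences
        h_l = self.self_l(self.proj_l(x_l).unsqueeze(1))
        h_e = self.self_e(self.proj_e(x_e).unsqueeze(1))

        h_l = self.cross_e_l(h_l, h_e)                        # multimodal features enhance text
        h_a = self.cross_e_a(h_a, h_e)                        # multimodal features enhance speech
        h   = self.cross_l_a(h_a, h_l)                        # enhanced text further enhances speech
        return self.classifier(h.squeeze(1))                  # logits for the cross-entropy loss

model = MTAFSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # learning rate per Section 4.3
criterion = nn.CrossEntropyLoss()
```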


5. Experiment results

5.1. Comparison with state-of-the-art approaches

To verify the effectiveness of our proposed method, we compared MTAF with the following thirteen state-of-the-art approaches, all of which use multiple modalities for emotion recognition. These methods can be divided into four groups according to the fusion level.

(1) Feature-level fusion: (a) Audio + Text_LSTM (Sahu, 2019) directly sends the concatenated features to a bidirectional LSTM network. (b) COGMEN (Joshi et al., 2022) proposes a COntextualized Graph Neural Network based Multimodal Emotion recognitioN (COGMEN) system that leverages local information (inter/intra-speaker dependency) and global information (context).

(2) Decision-level fusion: (a) In Kumar et al. (2021), the audio and textual features were extracted separately using an attention-based Gated Recurrent Unit (GRU) and pre-trained Bidirectional Encoder Representations from Transformers (BERT), respectively. They were then concatenated and used to predict the final emotion class. (b) In MDNN (Zhou et al., 2018), the proposed framework trains raw features in groups with local classifiers to avoid high dimensionality; the high-level features of each local classifier are then concatenated as the input of a global classifier. (c) bcLSTM (Poria et al., 2017b) proposes an LSTM network that takes as input the sequence of utterances in a video and extracts contextual unimodal and multimodal features by modeling the dependencies.

(3) Model-level fusion: (a) Xu et al. (2019) utilized an attention network to learn the alignment between speech and text. (b) MCSAN (Sun et al., 2021) employed parallel cross- and self-attention modules to explicitly model both inter- and intra-modal interactions of audio and text. (c) CAN (Yoonhyung et al., 2020) applied the attention weights of each modality to the other modality in a crossed way, so that the CAN gathers the audio and text information from the same time steps based on each modality. (d) CMA + Raw waveform (Krishna and Patil, 2020) applied cross-modal attention to the output sequences from the audio encoder and text encoder, which helps in finding the interactive information between the audio and text sequences and thus improves the performance. (e) CTNet (Lian et al., 2021) proposed to use a transformer-based structure to model intra-modal and cross-modal interactions among multimodal features.

(4) Hybrid-level fusion: (a) Late Fusion-III (Sebastian and Pierucci, 2019) employed various fusion techniques to provide relevance to intermodality dynamics, while keeping the separate models to capture the intra-modality dynamics. (b) HGFM (Xu et al., 2020) took the output of a frame-level structure as the input of an utterance-level structure and extracted the acoustic features of these two levels respectively for effective and complementary fusion. (c) STSER (Chen and Zhao, 2020) applied a multi-scale fusion strategy, including feature fusion and ensemble learning, to improve the overall performance.

TABLE 1 Model performance comparisons on the IEMOCAP dataset.

Model                 WA (%)   UA (%)
Audio+Text_LSTM       64.20    —
COGMEN                68.20    —
Kumar et al.          71.70    75.00
Xu et al.             70.40    69.50
MCSAN                 61.20    56.00
CAN                   57.90    48.70
CMA+Raw waveform      —        72.82
CTNet                 —        67.60
Late Fusion-III       61.20    59.30
STSER                 71.06    72.05
MTAF                  72.31    75.08

The bold values indicate the model with the best results (highest WA and UA).

TABLE 2 Model performance comparisons on the MELD dataset.

Model     WA (%)
MDNN      34.00
bcLSTM    39.10
HGFM      42.30
MTAF      48.12

The bold values indicate the model with the best results (highest WA).

All the results are listed in Tables 1, 2. On the IEMOCAP dataset, as we can see, our proposed method achieves 72.31% WA and 75.08% UA. Compared with the other state-of-the-art approaches, the WA of our method is 0.61 to 14.41% higher and the UA is 0.08 to 26.38% higher. On the MELD dataset, our proposed method achieves 48.12% WA, an improvement of 5.82 to 14.12% over the other approaches. Although our method is superior to the other algorithms, the overall performance on the MELD dataset is not ideal. We speculate that this is because, compared with the IEMOCAP dataset, the MELD dataset is a relatively large dataset in the field of emotion recognition. Its data comes from the TV series Friends, which is not closer to real life than the data in IEMOCAP, and its data collection conditions are not as standardized as those of IEMOCAP. The recognition accuracy for MELD is therefore not as high as for IEMOCAP. Nevertheless, we believe that MELD is a good platform on which to compare and validate our method, because our method improves considerably over the other models even though the overall results are low.

On the surface, the improvement in both WA and UA over Kumar's work appears limited. However, our experimental results were obtained by averaging the results of 10 runs, each carried out under the same experimental conditions, including datasets, model parameters, and training and testing processes, in order to reduce the influence of random errors and increase the reliability and stability of the results. The highest single-run results on the IEMOCAP dataset were 73.18% WA and 75.49% UA; compared to Kumar's work, our method achieves 1.48% higher WA and 0.49% higher UA. In addition, the training speed of our proposed model is very fast, so it can process large amounts of data quickly. The memory occupied by the model is also very small, which is achieved by optimizing and streamlining the model to minimize unnecessary computing and storage operations. We therefore believe that the experimental results show the superiority of our method.

Based on the above experimental results, we analyze the performance enhancements of the model. (1) Through a hybrid fusion strategy, our model combines the advantages of feature-level and model-level fusion methods to better integrate the two modalities of speech and text. It uses the complementary information of the two modalities to better generate emotional representations. The improvement of the accuracy in the experimental results effectively verifies this point. (2) The multimodal Transformer Encoders achieve fine-grained inter-modal and intra-modal interactions between the speech and text modalities. The Model-fusion module composed of three Cross-Transformer Encoders can generate multimodal emotional intermediate representations for modal guidance and information fusion.
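The WA and UA figures compared above are not spelled out in the text; they are commonly computed as overall accuracy over all test utterances (WA) and the unweighted mean of per-class recalls (UA). A small sketch under that common definition, which is an assumption on our part:

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """WA: overall accuracy over all utterances.
    UA: mean of per-class recalls (each emotion class weighted equally)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = np.mean(recalls)
    return wa, ua

# Toy example with three emotion classes.
wa, ua = weighted_unweighted_accuracy([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 1])
print(f"WA={wa:.2f}, UA={ua:.2f}")  # WA=0.67, UA=0.67
```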


5.2. Confusion matrix analysis

Figure 2 shows the confusion matrices of the models using speech only and text only and the confusion matrix of our proposed model MTAF on the IEMOCAP dataset. The Speech-Only and Text-Only models have a Speech module and a Text module, respectively, but not a Model-fusion module.

FIGURE 2. Confusion matrices of each model on the IEMOCAP dataset. (A) Speech-only confusion matrix. (B) Text-only confusion matrix. (C) MTAF confusion matrix.

For the two emotional categories of frustration and neutral in Figure 2, the recognition accuracy of the multimodal model is close to that of the single-modality models. For the other emotional categories, the recognition accuracy of the multimodal model is much higher than that of the single-modality models. We also observe a significant improvement of 10–27% in recognition accuracy for the happiness category after combining speech and text for emotion recognition. Furthermore, the recognition accuracy of the anger category is significantly improved by 15–18%, and that of the sadness category by 9–14%. These experimental results confirm the effectiveness of emotion recognition that combines the speech and text modalities. Multimodal methods combine the advantages of different modalities to obtain richer emotional representations. Moreover, the complementary information of different modalities can also increase the robustness of the system when noise occurs in one modality.


It is notable that the anger, happiness, and neutral categories are misclassified as sadness with a relatively large probability. Additionally, the happiness and sadness categories are also often classified as neutral. According to Mehrabian (1971), 55% of interpersonal communication relies on facial expressions or body movements, 38% relies on speech, and only 7% relies on text. Facial expressions therefore give very important clues to human emotions (Kim et al., 2017; Dai et al., 2021; Lee et al., 2021). Adding facial expressions to the speech and text modalities can improve recognition accuracy (Yoon et al., 2019; Kumar et al., 2022). Thus, we infer that humans express these emotions more through facial expressions than through speech and semantic content. These are interesting findings that require more research and may lead to further improvement in the recognition accuracy.

5.3. Ablation study

A variety of ablation experiments were conducted on the IEMOCAP dataset to evaluate the fusion methods, transformer encoders, and model parameters in our proposed method. Tables 3, 4 present the results.

TABLE 3 Ablation study of our proposed method.

Model                               WA (%)   UA (%)
Feature fusion                      67.15    68.30
Model fusion                        63.66    64.95
Without self-transformer encoder    70.09    73.48
Without cross-transformer encoder   69.65    72.24
MTAF                                72.31    75.08

The bold values indicate the model with the best results (highest WA and UA).

TABLE 4 Ablation study of the transformer encoder.

n    m    WA (%)   UA (%)
1    3    70.89    73.47
2    4    72.31    75.08
3    5    71.57    73.86
5    8    70.95    73.64
6    10   70.69    73.17

The bold values indicate the number of Transformer Encoder layers (n) and attention heads (m) with the best results (highest WA and UA).

Table 3 shows the ablation study results for four variants: Feature Fusion, Model Fusion, Without Self-Transformer Encoder, and Without Cross-Transformer Encoder. The Feature Fusion model includes the Feature-fusion module in the middle of Figure 1, but not the Model-fusion module. The Model Fusion model does not include the Feature-fusion module branch and uses only one Cross-Transformer Encoder for multimodal fusion. The Without Self-Transformer Encoder model directly sends the extracted low-level speech, text, and concatenated multimodal features into the Model-fusion module. Finally, the Without Cross-Transformer Encoder model removes the Model-fusion module and concatenates the outputs of the three branches directly into a fully-connected layer.

In comparison to the Feature Fusion and Model Fusion models, our proposed model, MTAF, achieves 5.16 to 8.65% higher WA and 6.78 to 10.13% higher UA. The experimental results confirm the effectiveness of our proposed hybrid fusion strategy. It combines the advantages of feature-level and model-level fusion methods, captures more fine-grained intra-modal and inter-modal interactions, and makes full use of the complementary information of the speech and text modalities to obtain richer emotional representations.

When the Self-Transformer Encoder is removed, the model's performance decreases by 2.22% in terms of WA and 1.60% in terms of UA. This finding highlights the importance of using multi-head self-attention mechanisms for intra-modal interaction to capture contextual time dependence. Furthermore, the model's performance decreases when the Cross-Transformer Encoder is removed, which indicates that the multi-head cross-attention mechanisms can integrate the features of different modalities for information exchange and generate multimodal emotional intermediate representations.

Table 4 shows the results of using n Transformer Encoder layers and m multi-head attention heads. Through comparison of the results, we found that n = 2 and m = 4 achieves the best results. This finding suggests that deeper models are not suitable for SER tasks.

6. Conclusion

We propose a method named multimodal transformer augmented fusion that uses a hybrid fusion of both speech and text features, combining feature-level fusion and model-level fusion methods, to effectively integrate different modal information. A Model-fusion module composed of three Cross-Transformer Encoders is proposed to generate multimodal emotional representations for modal guidance and information fusion. Specifically, the Transformer Encoders are used to perform fine-grained dynamic intra- and inter-modality interactions. Experimental results demonstrate the effectiveness of our proposed method on the IEMOCAP and MELD datasets. In future work, we will try to add facial expressions for multimodal emotion recognition and further improve the accuracy of speech emotion recognition.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.


Author contributions

YW: conceptualization and methodology. YG: validation and supervision. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 62271377 and 62201407), the Key Research and Development Program of Shanxi (Program Nos. 2023-YBGY-244, 2021ZDLGY0106, and 2022ZDLGY0112), the Fundamental Research Funds for the Central Universities under Grant No. ZYTS23064, the National Key R&D Program of China under Grant Nos. 2021ZD0110404 and SQ2021AAA010888, and the Key Scientific Technological Innovation Research Project by the Ministry of Education.

Conflict of interest

YY was employed by the company Guangzhou Huya Technology Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References
Atmaja, B. T., Sasou, A., and Akagi, M. (2022). Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion. Speech Commun. 140, 11–28. doi: 10.1016/j.specom.2022.03.002

Ayadi, M. E., Kamel, M. S., and Karray, F. (2011). Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44, 572–587. doi: 10.1016/j.patcog.2010.09.020

Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359. doi: 10.1007/s10579-008-9076-6

Chen, M., and Zhao, X. (2020). "A multi-scale fusion framework for bimodal speech emotion recognition," in Interspeech (Cary, NC), 374–378.

Chen, S., and Jin, Q. (2016). "Multi-modal conditional attention fusion for dimensional emotion prediction," in Proceedings of the 24th ACM International Conference on Multimedia (New York, NY: ACM), 571–575.

Dai, W., Cahyawijaya, S., Liu, Z., and Fung, P. (2021). Multimodal end-to-end sparse model for emotion recognition. arXiv preprint arXiv:2103.09666.

Gönen, M., and Alpaydın, E. (2011). Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268.

Joshi, A., Bhat, A., and Jain, A. (2022). "Contextualized GNN based multimodal emotion recognition," in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Stroudsburg, PA), 4148–4164.

Kim, H. R., Kim, S. J., and Lee, I. K. (2017). Building emotional machines: recognizing image emotions through deep neural networks. IEEE Trans. Multimedia 20, 2980–2992. doi: 10.1109/TMM.2018.2827782

Kingma, D., and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krishna, D. N., and Patil, A. (2020). "Multimodal emotion recognition using cross-modal attention and 1D convolutional neural networks," in Interspeech (Cary, NC), 4243–4247.

Kumar, P., Kaushik, V., and Raman, B. (2021). "Towards the explainability of multimodal speech emotion recognition," in Interspeech (Cary, NC), 1748–1752.

Kumar, P., Malik, S., and Raman, B. (2022). Interpretable multimodal emotion recognition using hybrid fusion of speech and image data. arXiv preprint arXiv:2208.11868.

Lee, S., Han, D. K., and Ko, H. (2021). Multimodal emotion recognition fusion analysis adapting BERT with heterogeneous feature unification. IEEE Access 9, 94557–94572. doi: 10.1109/ACCESS.2021.3092735

Lian, Z., Liu, B., and Tao, J. (2021). "CTNet: conversational transformer network for emotion recognition," in IEEE/ACM Transactions on Audio, Speech, and Language Processing (New York, NY: IEEE).

Lian, Z., Tao, J., Liu, B., Huang, J., Yang, Z., and Li, R. (2020). "Context-dependent domain adversarial neural network for multimodal emotion recognition," in Interspeech (Cary, NC), 394–398.

Marchi, E., Schuller, B., Batliner, A., Fridenzon, S., and Golan, O. (2012). "Emotion in the speech of children with autism spectrum conditions: prosody and everything else," in Proceedings of the 3rd Workshop on Child, Computer and Interaction (WOCCI 2012).

McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., et al. (2015). "librosa: audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, 18–25.

Mehrabian, A. (1971). Silent Messages. Belmont, CA: Wadsworth Publishing Company, Inc.

Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., and Manocha, D. (2020). "M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues," in Proceedings of the AAAI Conference on Artificial Intelligence (Menlo Park, CA: AAAI Press), 1359–1367.

Poria, S., Cambria, E., Bajpai, R., and Hussain, A. (2017a). A review of affective computing: from unimodal analysis to multimodal fusion. Inform. Fusion 37, 98–125. doi: 10.1016/j.inffus.2017.02.003

Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., and Morency, L.-P. (2017b). "Context-dependent sentiment analysis in user-generated videos," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Stroudsburg), 873–883.

Poria, S., Cambria, E., Hazarika, D., Mazumder, N., Zadeh, A., and Morency, L.-P. (2017c). "Multi-level multiple attentions for contextual multimodal sentiment analysis," in 2017 IEEE International Conference on Data Mining (ICDM) (Piscataway, NJ: IEEE), 1033–1038.

Poria, S., Cambria, E., Howard, N., Huang, G., and Hussain, A. (2016). Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing 174, 50–59. doi: 10.1016/j.neucom.2015.01.095

Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., and Mihalcea, R. (2018). "MELD: a multimodal multi-party dataset for emotion recognition in conversations," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Stroudsburg).

Sahu, G. (2019). Multimodal speech emotion recognition and ambiguity resolution. arXiv preprint arXiv:1904.06022.

Schuller, B. W. (2018). Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 61, 90–99. doi: 10.1145/3129340

Sebastian, J., and Pierucci, P. (2019). "Fusion techniques for utterance-level emotion recognition combining speech and transcripts," in Interspeech (Cary, NC), 51–55.

Sebe, N., Cohen, I., Gevers, T., and Huang, T. S. (2005). "Multimodal approaches for emotion recognition: a survey," in Proceedings of SPIE - The International Society for Optical Engineering (Bellingham).

Shen, G., Lai, R., Chen, R., Zhang, Y., Zhang, K., Han, Q., et al. (2020). "WISE: word-level interaction-based multimodal fusion for speech emotion recognition," in Interspeech (Cary, NC), 369–373.

Shimojo, S., and Shams, L. (2001). Sensory modalities are not separate modalities: plasticity and interactions. Curr. Opin. Neurobiol. 11, 505–509. doi: 10.1016/S0959-4388(00)00241-5

Sun, L., Liu, B., Tao, J., and Lian, Z. (2021). "Multimodal cross- and self-attention network for speech emotion recognition," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE), 4275–4279.

Sutton, C., and McCallum, A. (2010). An introduction to conditional random fields. Found. Trends Mach. Learn. 4, 267–373. doi: 10.1561/2200000013

Tao, J., and Tan, T. (2005). "Affective computing: a review," in Affective Computing and Intelligent Interaction: First International Conference, ACII 2005 (Beijing), 981–995.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.

Wang, C., Ren, Y., Zhang, N., Cui, F., and Luo, S. (2022). Speech emotion recognition based on multi-feature and multi-lingual fusion. Multimedia Tools Appl. 81, 4897–4907. doi: 10.1007/s11042-021-10553-4

Wang, Y., Shen, G., Xu, Y., Li, J., and Zhao, Z. (2021). "Learning mutual correlation in multimodal transformer for speech emotion recognition," in Interspeech (Cary, NC), 4518–4522.

Wani, T. M., Gunawan, T. S., Qadri, S., Kartiwi, M., and Ambikairajah, E. (2021). A comprehensive review of speech emotion recognition systems. IEEE Access 9, 47795–47814. doi: 10.1109/ACCESS.2021.3068045

Wu, J., Zhang, Y., Xie, L., Yan, Y., Zhang, X., Liu, S., et al. (2022). A novel silent speech recognition approach based on parallel inception convolutional neural network and mel frequency spectral coefficient. Front. Neurorobot. 16, 971446. doi: 10.3389/fnbot.2022.971446

Wu, J., Zhao, T., Zhang, Y., Xie, L., Yan, Y., and Yin, E. (2021). "Parallel-inception CNN approach for facial sEMG based silent speech recognition," in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 554–557.

Wu, W., Zhang, C., and Woodland, P. C. (2021). "Emotion recognition by fusing time synchronous and time asynchronous representations," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Piscataway, NJ: IEEE), 6269–6273.

Xu, H., Zhang, H., Han, K., Wang, Y., Peng, Y., and Li, X. (2019). Learning alignment for multimodal emotion recognition from speech. arXiv preprint arXiv:1909.05645.

Xu, Y., Xu, H., and Zou, J. (2020). "HGFM: a hierarchical grained and feature model for acoustic emotion recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Piscataway, NJ: IEEE), 6499–6503.

Yoon, S., Dey, S., Lee, H., and Jung, K. (2019). Attentive modality hopping mechanism for speech emotion recognition. arXiv preprint arXiv:1912.00846.

Yoonhyung, L., Yoon, S., and Jung, K. (2020). "Multimodal speech emotion recognition using cross attention with aligned audio and text," in Interspeech (Cary, NC), 2717–2721.

Zadeh, A., Chen, M., Poria, S., Cambria, E., and Morency, L. P. (2017). "Tensor fusion network for multimodal sentiment analysis," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (Stroudsburg), 1103–1114.

Zhou, S., Jia, J., Wang, Q., Dong, Y., Yin, Y., and Lei, K. (2018). "Inferring emotion from conversational voice data: a semi-supervised multi-path generative neural network approach," in Proceedings of the AAAI Conference on Artificial Intelligence (Menlo Park, CA: AAAI Press).
