Applied Acoustics 218 (2024) 109886

Contents lists available at ScienceDirect

Applied Acoustics
journal homepage: www.elsevier.com/locate/apacoust

Design of smart home system speech emotion recognition model based on ensemble deep learning and feature fusion
Mengsheng Wang, Hongbin Ma ∗ , Yingli Wang, Xianhe Sun
School of Electronics and Engineering, Heilongjiang University, No. 74 Xuefu Road, Nangang District, 150006, Heilongjiang Province, China

A R T I C L E   I N F O

Keywords:
Smart home
Data augmentation
Feature fusion
Convolutional neural network
Bidirectional long short-term memory network
Transformer
Ensemble learning

A B S T R A C T

In the realm of consumer technology, Artificial Intelligence (AI)-based Speech Emotion Recognition (SER) has rapidly gained traction and integration into smart home systems. Its precision in recognition has become a pivotal factor significantly impacting user experience. However, the intricate task of selecting suitable features has emerged as a daunting challenge due to the variances in speech features induced by emotional nuances. Present research predominantly concentrates on localized speech characteristics, neglecting the broader contextual cues inherent in speech signals. This oversight contributes to relatively diminished accuracy in emotion recognition within smart home systems. To tackle this challenge, this paper introduces an enhanced Speech Emotion
Recognition approach named TF-Mix. This methodology enriches emotional prediction from speech by leveraging
audio data augmentation and embracing multiple features, thereby achieving superior performance in emotion
recognition. To augment the model’s adaptability, TF-Mix adeptly amalgamates various feature extraction
techniques, encompassing Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs),
and Transformer architecture. The synergy among these methodologies culminates in the formulation of
three distinct architectural models. The primary architecture is founded on a 1-dimensional Convolutional
Neural Network (CNN), closely followed by a Fully Connected Network (FCN). Subsequent architectures,
notably BiLSTM-FCN and BiLSTM-Transformer-FCN, retain their respective structures while incorporating CNNs.
Moreover, the amalgamation of individual models into an ensemble model, designated as D, via weighted
averaging, further amplifies the efficacy of emotion recognition. Experimental outcomes showcase exceptional
performance across all four models in the SER task. The ensemble Model D achieves noteworthy accuracy across
multiple datasets: 87.513% on RAVDESS, 86.233% on SAVEE, 99.857% on TESS, 82.295% on CREMA-D, and
97.546% on the TOTAL dataset.

1. Introduction

The dynamic evolution of human-computer interaction is reshaping the technological landscape of today's world. Human-Computer Interaction (HCI) delves deeply into this intricate relationship, striving to elevate user experiences through meticulous design endeavors [1]. Speech, as a primary medium of interpersonal communication, inherently carries a crucial element: emotion. Speech Emotion Recognition (SER), as a vibrant research domain, serves as a bridge between human-computer interaction and digital signal processing. Within the realm of HCI, SER assumes a pivotal role, profoundly influencing the trajectory of modern electronic devices, steering them toward more sophisticated and emotion-responsive interactive modes [2].

The integration of smart home systems stands as the epitome of the fusion between contemporary technology and lifestyle, embodying humanity's aspirations for the future. It surpasses a mere collection of smart devices, serving as an emblem of redefining quality living and convenience. Its core ambition extends beyond mere convenience; it strives to craft a personalized, intelligent, comfortable, and eco-friendly living space [3]. Within this system, Speech Emotion Recognition (SER) holds an indispensable role. Acting as a perceptive observer, it intricately perceives and interprets users' emotional shifts to offer a more intelligent, tailored, and intimate interactive experience. SER goes beyond mere emotion identification; it delves deeply into the subtleties of emotional expression. It customizes the analysis and identification of users' emotional traits, better catering to their needs through personalized services. In the domain of smart furniture, speech emotion analysis

* Corresponding author.
E-mail addresses: 18756045935@163.com (M. Wang), mahongbin@hlju.edu.cn (H. Ma), wangyingli@hlju.edu.cn (Y. Wang), XH321076856@163.com (X. Sun).

https://doi.org/10.1016/j.apacoust.2024.109886
Received 5 September 2023; Received in revised form 29 December 2023; Accepted 20 January 2024
0003-682X/© 2024 Elsevier Ltd. All rights reserved.

Fig. 1. Workflow of SER system and proposal of TF-mix structure.

acts as a conduit connecting users' emotional experiences with smart home systems. It transcends a mere technological function, becoming a medium for emotional communication. By accurately capturing user emotions, speech emotion analysis aids smart furniture in comprehending and responding to user needs, strengthening the emotional bond between users and smart furniture, ultimately enhancing the overall user experience. As technology evolves incessantly, this advancement will propel ongoing innovation and progress in the smart furniture industry. It signifies not just technological growth but a genuine embodiment of a human-centric, intelligent home experience. In the coming days, smart home systems will further integrate into people's lives, offering users a more intelligent, comfortable, and personalized home experience. This not only signifies technological progression but also a redefinition of quality living, providing users with increased convenience and delight within their home environments.

Over the past decade, researchers have made remarkable strides in developing robust and lightweight systems for Speech Emotion Recognition [4]. However, accurately discerning human emotional states from speech remains a complex and challenging task, stemming from various underlying factors. One of the central challenges revolves around the lack of advanced technologies and tools capable of effectively analyzing and interpreting emotions embedded within speech signals. The inherent ambiguity of emotions further intensifies the complexity, as emotions often present intricate and nuanced features, making their accurate identification a formidable endeavor. Additionally, the diversity of languages and accents across different cultural backgrounds introduces further complexity to the process of speech emotion recognition. Variations in frequency and amplitude within human utterances, influenced by factors such as gender and age, also contribute to the intricacy of precisely identifying emotional states from speech [5]. Despite these challenges, researchers persistently explore innovative methodologies and leverage advancements in machine learning and deep learning techniques to enhance the precision of emotion recognition. The pursuit of refining Speech Emotion Recognition systems remains a pivotal research direction in this field, given its substantial potential across diverse domains, including human-computer interaction and smart home systems.

Fig. 1a provides an outline of the SER system framework discussed in this paper. The initial preprocessing phase concentrates on standardizing the lengths of all sample audio files and extracting both temporal and spectral features. Presently, the field of SER extensively explores continuous speech attributes, including pitch, sound intensity, and factors related to speech quality. Moreover, extensive research has delved into spectral attributes such as Mel Frequency Cepstral Coefficients (MFCC) [6], Linear Predictive Coding (LPC) [7], and Power Spectral Density [8]. Following stages of processing introduce data augmentation techniques aimed at enriching the extracted features. This augmentation not only broadens the dataset but also effectively mitigates inherent imbalances. Current research primarily delves into localized, user-specific speech characteristics within smart home systems, considering the diverse backgrounds and the potential impact of user emotional fluctuations on speech features. However, their efficacy in more intricate environments tends to be less prominent. Consequently, the precision of emotion recognition in smart furniture systems significantly falls short of expectations. This presents substantial challenges in strategically manipulating user speech and conducting meticulous feature selection.

In the Speech Emotion Recognition (SER) system, typically during the third phase, features derived from speech signals are merged into customized feature vectors for distinct classification tasks. Researchers employ various classifiers, covering both linear and nonlinear variants. Linear classifiers like Support Vector Machine (SVM) [9], logistic regression [10], Bayesian Networks (BN) [11], and stochastic gradient descent (SGD) [12] attract attention. However, due to the non-stationary nature of speech signals, certain nonlinear classifiers, such as Hidden Markov Model (HMM) [13], Gaussian Mixture Model (GMM) [14], and K-Nearest Neighbors (KNN) [15], are prevalent. Nevertheless, with the swift evolution of deep learning, it has emerged as the preferred approach for identifying latent features and efficiently executing classification tasks. Deep learning models harness their ability to automatically extract intricate patterns from data, thereby significantly boosting the accuracy of emotion recognition.

In recent years, Deep Learning (DL) has emerged as a focal point in academic research, surpassing conventional Machine Learning (ML)


by automating feature extraction and uncovering latent patterns within handcrafted features. This has resulted in substantial advancements in speech emotion recognition tasks. Extensive research has investigated the effective utilization of one-dimensional Convolutional Neural Networks (1D CNNs) as models for speech emotion recognition [16]. Notably, 1D CNN models exhibit robust capabilities in handling time-series data and audio classification tasks. Furthermore, various Recurrent Neural Network architectures, including Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Gated Recurrent Units (GRUs), and Bidirectional LSTMs [17–19], have been proposed to delve deeper into comprehending long-term contextual relationships and potential emotional cues in speech. The integration of these architectures with 1D CNNs enables a comprehensive capture of intricate temporal patterns and relationships within audio data, resulting in more precise and robust emotion recognition from speech signals.

Motivated by the complex analysis of multidimensional information observed in the human brain during intricate decision-making processes, and inspired by the significant achievements of deep learning in classification tasks, we introduced four deep learning-based frameworks aimed at addressing the challenges of speech emotion recognition in smart home systems, as shown in Fig. 1b. To evaluate these frameworks' effectiveness, comprehensive experiments were conducted using four widely available benchmark datasets (RAVDESS [20], SAVEE [21], TESS [22], and CREMA-D [23]). Given the relatively limited sample sizes in these datasets, they were merged into a fifth dataset named TOTAL. Subsequent experiments were aimed at mitigating overfitting issues during the training process of deep learning models. The proposed methods have made substantial strides in the field of speech emotion recognition, showcasing their potential for practical applications. Overall, the primary contributions of this paper encompass:

1: In the domain of smart home speech emotion recognition, the TF-Mix model innovatively combines time-domain and frequency-domain features. Departing from the reliance on single-feature approaches, it expands across multiple feature dimensions. Time-domain features scrutinize the temporal aspects of audio signals, while frequency-domain features concentrate on their distribution across the frequency spectrum. This amalgamation comprehensively captures diverse information within speech signals, as different emotions may manifest unique patterns across varying frequency ranges or specific time segments in the audio. The fusion of time-domain and frequency-domain features furnishes a richer, more varied representation, empowering the model to more precisely identify and differentiate between distinct emotional states.

2: Data augmentation techniques play a pivotal role in overcoming the scarcity of training samples and mitigating overfitting. They assist the model in capturing richer, more nuanced features, particularly in discerning minor emotional nuances embedded within speech signals. This often leads to more accurate and reliable outcomes in emotional classification and recognition. Moreover, integrating data augmentation techniques bolsters the model's robustness. By exposing the model to a more diverse spectrum of data variations during training, it fortifies its adaptability and stability. Augmenting data significantly strengthens the model's robustness and applicability.

3: This study introduces four deep neural network-based models for emotion recognition. Firstly, Model-A leverages convolutional neural networks to extract local features from speech data and derive advanced latent features through multiple convolutional modules. Secondly, Model-B extends Model-A by incorporating bidirectional long short-term memory networks and fully connected networks to capture long-term contextual features from speech signals. Subsequently, Model-C adopts a BiLSTM-Transformer-FCNs architecture, employing multi-head attention mechanisms to extract the most relevant emotional features. Lastly, Model-D serves as an ensemble framework, integrating predictions from these three independent models through weighted averaging. Ensemble learning effectively mitigates bias and variance in individual models, thereby enhancing overall model performance. Additionally, it mitigates overfitting risks and enhances adaptability to new data and prediction accuracy, rendering the model more robust and applicable across diverse datasets and real-world scenarios.

2. Related works

Achieving accurate speech emotion recognition within smart home systems is influenced by various factors. Researchers have proposed diverse methodologies to address the challenges posed by emotion recognition tasks. In these tasks, precise feature selection is critical as irrelevant features can directly impact subsequent speech emotion classification. Presently, many researchers are increasingly favoring the utilization of deep learning techniques for emotion recognition tasks. This inclination arises from the capability of these techniques to extract deeper features from speech feature sets, capturing intricate internal relationships within the data and enhancing emotion recognition performance.

Speech signals encompass various features. In the time domain representation, fundamental descriptors such as amplitude envelope, amplitude, zero-crossing rate, and root mean square [24] play a significant role in capturing distinct attributes of speech signals and are crucial for emotion recognition. Frequency domain features comprise band energy ratio, Mel-frequency cepstral coefficients (MFCC), spectral centroid, Chromagram [25], among others. Among these features, MFCC is notably the most crucial and widely adopted cepstral feature. Recently, researchers have started exploring the use of Gammatone Cepstral Coefficients (GTCC) [26,27] in emotion recognition. Additionally, statistical descriptors like entropy, skewness, and kurtosis find common use. Spectrograms, extensively employed in current emotion recognition studies, visually depict signal power evolution over time across varying frequencies. MFCC is pivotal in emotion recognition tasks as it transforms traditional frequencies into the Mel-scale, considering human sensitivity to different frequencies, rendering it well-suited for emotion recognition applications [28]. Moreover, numerous researchers in the emotion recognition field are presently applying machine learning, deep learning, and their amalgamation to enhance emotion recognition performance [29–32]. Simultaneously, active exploration is underway to combine various attention mechanisms with deep learning methods to focus more on specific regions of interest within speech signals. These advancements underscore continual efforts aimed at refining emotion recognition models and deepening the understanding of emotional cues embedded within speech signals.

In the realm of speech emotion analysis, pivotal roles are assumed by deep learning techniques like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers, serving as significant feature extractors. CNNs, in particular, stand out for their outstanding feature extraction capabilities in speech emotion recognition [33–35]. For instance, Zhang et al. [33] utilized CNNs to extract distinct discriminative features and subsequently employed a linear Support Vector Machine (SVM) for classification. Conversely, researchers like Mao et al. [36] autonomously employed CNNs to learn emotion-specific feature representations from segmental-level logarithmic Mel filterbanks. Their model integrated three classifiers: Extreme Learning Machine (ELM), Random Forest (RF), and SVM. In an alternate strategy, Zhang et al. [37] separately applied 2D CNNs and 3D CNNs to derive pertinent features from speech and video data. They then combined the outputs of these CNNs using a Deep Belief Network and employed SVM for classification.

RNN efficiently utilizes contextual information when handling sequential data, providing a notable advantage in this context. However, its ability to extract features is somewhat limited. To tackle this, a common strategy involves using CNN to extract low-level features from the raw signal, subsequently feeding these features into RNN for higher-level representation learning. Furthermore, the attention mechanism is frequently deployed to manage RNN outputs, producing significant emotional features for Speech Emotion Recognition (SER) [38]. Re-


searchers have proposed several methods to improve RNN performance in SER tasks. For instance, Luo et al. [39] presented a dual-channel SER system that amalgamates CRNN-learned features with high-level statistical functional (HSF) features, achieving optimal representation learning. Chen et al. [40] applied the attention mechanism to Convolutional Recurrent Neural Networks (CRNN) and introduced a 3D attention-based CRNN for learning discriminative features in speaker-independent SER. Xie et al. [41] proposed an enhanced attention-based LSTM by adjusting the forgetting gate of the traditional LSTM, reducing computational complexity without compromising performance.

In recent years, attention mechanisms have become prevalent in the field of emotion recognition, enabling exploration across various facets of data. Zhao et al. introduced an innovative hybrid deep CNN architecture, merging parallel convolutional layers with a squeeze-and-excitation network and a self-attention-driven dilated residual network [42]. Trained on sequence-to-sequence classification loss, this model adeptly captures intricate long-range contextual dependencies pivotal for discrete emotion recognition tasks. Additionally, Xie et al. presented an LSTM framework with enhanced attention mechanisms for emotion recognition [43]. This framework directly extracts frame-level speech features from audio waveforms, preserving temporal relationships within frame sequences and boosting model efficiency through attention mechanisms across time and feature dimensions. These innovative strategies notably bolster the performance and capabilities of emotion recognition models.

On a different front, Li et al. combined self-attention mechanisms with bidirectional LSTM networks to delve into phoneme autocorrelation within speech [44]. This mechanism assigns varying weights based on emotional intensity between frames and quantifies the autocorrelation between frames. The attention mechanism in Transformer models amplifies the comprehension and processing of sequential data, surpassing traditional neural networks in handling long-range dependencies and showcasing exceptional parallel computation and sequence processing abilities. Transformer structures have gained widespread use in emotion recognition. Zhang et al. conducted an extensive review, revisiting the historical development of emotion recognition and exploring various principles of deep learning models [45]. They integrated Transformer structures, feature extraction, and specialized information fusion strategies tailored for emotion recognition. Fan et al. utilized multimodal feature extraction, Transformer-based feature enhancement, and graph fusion networks to detect and infer patients' negative emotions, offering novel perspectives for diagnosing depression [46]. Concurrently, Zhang et al. proposed an unsupervised cross-corpus speech emotion recognition solution based on Transformer and mutual information [47]. This method effectively mitigated feature distribution differences between training and testing corpora and improved speech emotion recognition performance through a multi-head attention fusion strategy. Diverse research teams persist in exploring and refining feature extraction, model architecture, and optimization techniques. Their collaborative endeavors lay a robust groundwork for enhancing the accuracy, resilience, and practical utility of speech emotion recognition systems.

3. Proposed TF-mix architecture

The TF-Mix system incorporates three primary components: feature extraction, data augmentation, and emotion classification. In the feature extraction phase, the system deeply analyzes the inherent properties of speech signals. The TF-Mix model amalgamates temporal and spectral features, broadening the multidimensional aspects and evading dependence on a sole feature. Temporal features focus on the time-related traits of audio signals, while spectral features center on their distribution across the frequency spectrum. This fusion captures a wide array of information within speech signals, as different emotions may manifest distinct patterns across various frequency ranges or specific audio time segments. The fusion of temporal and spectral features offers a more extensive, diverse representation, aiding in accurately distinguishing various emotional states.

To diversify and adapt the training data, the model employs diverse audio augmentation strategies. Data augmentation plays a crucial role in overcoming sparse training samples and curbing overfitting. It aids in capturing richer, subtle features, particularly minute emotional differences within speech signals. Typically, this approach yields more accurate and reliable emotion classification and identification outcomes. Furthermore, data augmentation bolsters the model's robustness by exposing it to a broader spectrum of data variations during training, thereby fortifying its adaptability and stability. Hence, data augmentation is pivotal in enhancing the model's resilience and applicability.

During emotion classification, the proposed method adeptly integrates multiple models. Initially, Model-A extracts local features from speech data using convolutional neural networks and gathers advanced latent features through multiple convolution modules. Subsequently, Model-B extends Model-A by introducing bidirectional long short-term memory networks and fully connected networks to capture long-term contextual features in speech signals. Following this, Model-C adopts a BiLSTM-Transformer-FCNs structure, leveraging the Transformer's multi-head attention mechanism to extract the most pertinent emotional features. Finally, Model-D serves as an ensemble framework, amalgamating the predictions of these three independent models through weighted averaging. Ensemble learning effectively mitigates bias and variance in individual models. Moreover, it reduces overfitting risks and enhances the model's adaptability and predictive accuracy with new data, rendering the model more robust and suitable across diverse datasets and real-world scenarios. This approach significantly amplifies the overall model performance.

3.1. Audio preprocessing

In the realm of acoustic emotion recognition, features serve a critical function. Nonetheless, deep learning models may face hurdles when handling atypical audio inputs directly. Thus, the conversion of raw speech signals into coherent feature patterns becomes highly significant. An effective approach entails harnessing automatically extracted features directly from the spectrogram of the speech signal. By circumventing conventional feature extraction techniques, deep learning models can exploit the encoded information within the spectrogram to facilitate emotion recognition. This strategy empowers the model to more adeptly discern emotions from speech signals, ultimately bolstering the model's performance and efficiency.

3.1.1. Spectrogram

A spectrogram serves as a graphical representation of an audio signal, aiding in explaining the interplay between its frequency content and temporal characteristics. By visually portraying how the signal's frequency components shift over time, this visualization type empowers individuals to observe and comprehend the evolutionary dynamics of frequency components within the audio [16]. To craft a spectrogram for a speech signal, the Short-Time Fourier Transform (STFT) technique is applied. The STFT dissects the audio signal into brief, overlapping segments and computes the frequency components inherent in each segment. This scrutiny of temporal variations equips individuals with a comprehensive insight into the manifestation of frequency content across the entire duration of the audio.

X(s, n) = \sum_{i=0}^{M-1} X[i] \, w[i - n] \, e^{-2\pi j s i / M}    (1)

In Equation (1), X[i] symbolizes the input audio signal, while w[i] denotes an M-point window function. The spectrogram serves as a two-dimensional portrayal of the audio signal, wherein the vertical axis corresponds to frequency amplitude, and the horizontal axis signifies time. The implementation of the Hanning window ensures a seamless


Fig. 2. The original audio signal and its spectrogram.

transition at both the inception and culmination of the sampling process.

w(i) = 0.5 - 0.5 \cos\left(\frac{2\pi i}{M - 1}\right), \quad 0 \le i \le M - 1    (2)

The formula for calculating the length of the Hann window is shown in equation (3).

L = M + 1    (3)
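As a concrete illustration of Eqs. (1)-(3), the short sketch below computes a Hann-windowed STFT magnitude spectrogram with librosa; the file path, window length, and hop size are illustrative assumptions rather than values prescribed by the paper.

```python
import librosa
import numpy as np

# Load a speech clip (path and parameters are placeholders).
signal, sr = librosa.load("speech_sample.wav", sr=22050)

# Hann-windowed STFT, Eq. (1): M-point window over brief, overlapping segments.
M = 1024          # window length (assumed)
hop = 256         # hop size between frames (assumed)
stft = librosa.stft(signal, n_fft=M, hop_length=hop, window="hann")

# Magnitude spectrogram in dB: rows = frequency bins, columns = time frames.
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(spectrogram_db.shape)
```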
3.1.2. Mute removal

In the field of speech emotion recognition, achieving high classification accuracy demands meticulous preprocessing of raw audio data. Silent segments at the start or end of audio snippets have a notable impact on model performance. Hence, eliminating these silent parts from the original audio data is vital to significantly enhance model accuracy. The audio signal operates at a sampling frequency of 22,050 hertz, and segments with signal intensity below 30 decibels are flagged as silent and consequently excluded. This preprocessing ensures that only meaningful audio information remains for subsequent analysis. Fig. 2 showcases the waveform of the audio signal post the removal of silent sections, accompanied by its corresponding spectrogram.
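A minimal sketch of this silence-removal step, assuming librosa's trimming utility is an acceptable stand-in for the authors' implementation; the 22,050 Hz sampling rate and 30 dB threshold come from the text.

```python
import librosa

# Load audio at the stated 22,050 Hz sampling rate.
signal, sr = librosa.load("speech_sample.wav", sr=22050)

# Drop leading/trailing segments quieter than 30 dB below the peak level.
trimmed, interval = librosa.effects.trim(signal, top_db=30)

print(f"kept samples {interval[0]}..{interval[1]} of {len(signal)}")
```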
3.2. Feature extraction

In the realm of emotion recognition within smart homes, factors such as pitch, speaking rate, timbre, and other vocal attributes convey distinct emotions and imbue sound with unique qualities. Consequently, the meticulous selection of elements capable of precisely capturing these emotional attributes becomes paramount. Acoustic features encompass an array of categories, spanning frequency, time-domain, and prosodic aspects, and are widely acknowledged as an optimal choice for emotion recognition. To further elevate the accuracy of emotion recognition, the proposed TF-Mix model adeptly integrates a diverse array of features, encompassing both frequency-domain and time-domain characteristics. Through the extraction and fusion of these multifaceted features, the model attains a deeper understanding from the input signal, thereby achieving a more refined and nuanced classification of emotions.

3.2.1. Zero crossing rate (ZCR)

The Zero Crossing Rate (ZCR) is a valuable feature used to assess the smoothness of audio signals. ZCR quantifies how frequently a signal changes its polarity, shifting from positive to zero and then to negative, or vice versa from negative to zero and then to positive. By measuring these changes in polarity, ZCR offers crucial insights into the swift fluctuations and variations inherent in speech signals. These fluctuations often carry significant emotional cues or characteristics. Computing ZCR involves counting the instances when the signal crosses the zero level within a predefined time interval. This calculation facilitates a deeper understanding of the temporal dynamics and texture of speech signals, aiding in the identification and analysis of patterns associated with emotions. The formula for computing the Zero Crossing Rate is:

ZCR_n = \sum_{m=-\infty}^{\infty} \left| \mathrm{sign}[x(m)] - \mathrm{sign}[x(m-1)] \right| \, w(n - m)    (4)

where

\mathrm{sign}[x(m)] = \begin{cases} -1, & x(m) \ge 0 \\ 1, & x(m) < 0 \end{cases}    (5)

w(n) is a window function used for windowing N samples, where the window is

w(n) = \begin{cases} \frac{1}{2N}, & 0 \le n \le N - 1 \\ 0, & \text{otherwise} \end{cases}    (6)

3.2.2. Root mean square (RMS)

The Root Mean Square feature is extensively utilized to calculate the power content of speech signals at the frame level. This feature proficiently captures the amplitude or energy magnitude of speech signals, thereby providing pivotal discriminatory insights for tasks related to emotion classification. The inclusion of RMS feature extraction within the broader feature extraction process substantially contributes to the augmentation of emotion classification accuracy. When extracting RMS features from speech signals, the mathematical formula employed for computation is as follows:

RMS = \sqrt{\frac{1}{N} \sum_{n=1}^{N} X^2(n)}    (7)

In Equation (7), X(n) represents the speech signal, and n signifies the count of samples within each frame of the speech signal. Fig. 3 visually depicts the time-domain characteristics of ZCR and RMS features extracted for various emotions from the dataset.
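The frame-level ZCR and RMS features of Eqs. (4)-(7) can be obtained directly with librosa, as in the sketch below; the frame and hop lengths are illustrative assumptions.

```python
import librosa
import numpy as np

signal, sr = librosa.load("speech_sample.wav", sr=22050)

frame_length, hop_length = 2048, 512   # assumed analysis window and hop

# Eq. (4): rate of sign changes per analysis frame.
zcr = librosa.feature.zero_crossing_rate(signal, frame_length=frame_length,
                                         hop_length=hop_length)

# Eq. (7): root mean square energy per frame.
rms = librosa.feature.rms(y=signal, frame_length=frame_length,
                          hop_length=hop_length)

features = np.concatenate([zcr, rms], axis=0)   # shape: (2, n_frames)
print(features.shape)
```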
3.2.3. Mel frequency cepstral coefficients (MFCC)

In the field of speech emotion analysis, MFCCs play a pivotal role as acoustic features, crucial for uncovering emotional states. Different emotions exhibit unique patterns in the frequency spectrum. For example, anger may lead to heightened energy in the high-frequency range, while sadness might result in increased energy in the low-frequency range. MFCCs adeptly capture and portray these spectral characteristics. During the feature extraction process, the initial step involves applying a series of Mel filters to the audio signal, converting the linear frequency scale into the Mel scale, which better aligns with human auditory perception. Subsequently, the application of the Discrete Cosine Transform (DCT) converts spectrum information into cepstral coefficients, compressing and reducing the dimensionality of the spectrum. This procedure significantly enhances the accuracy and robustness of emotion analysis. As a result, MFCCs offer a wealth of informative resources for emotion recognition, thus amplifying the effectiveness of emotion analysis. The formula to convert the frequency f to the Mel scale frequency Mel is:

Mel(f) = 1127 \times \ln\left(1 + \frac{f}{700}\right)    (8)

Fig. 3. Time-domain features: ZCR features of speech signals with different emotions are depicted as (a), (b), and (c), and the RMS features are represented as (d),
(e), and (f).

The pre-emphasis filter is employed to mitigate certain glottal effects present in the original speech signal, as indicated by formula (9). Furthermore, the application of the Hanning window is utilized to segment the speech signal into frames, following the utilization of the formula:

H(z) = 1 - b z^{-1}    (9)

3.2.4. Log-Mel spectrogram (log-Mel)

Due to its precise representation of the spectral attributes of audio signals, the logarithmic Mel (log Mel) feature finds widespread application in speech emotion analysis tasks. Despite the linear distribution of frequency curves in spectrograms, human auditory perception aligns more closely with a logarithmic scale. This observation highlights the heightened sensitivity of the human ear to changes in lower frequencies, contrasting with its relatively lower sensitivity to variations in higher frequencies. Hence, utilizing a linear spectrogram for feature extraction may potentially result in the loss of valuable information. In response to this concern, the introduction of the logarithmic Mel spectrogram comes into play, transforming the spectrogram into a logarithmic scale. Within this logarithmic scale, intricacies within the low-frequency realm are ac-


centuated, while the high-frequency region is purposefully compressed. This transformation more precisely mirrors the attributes of human auditory perception of sound. As a result, within the sphere of speech emotion recognition, the integration of log Mel features aptly captures the spectral characteristics of audio signals. The formula for computing logarithmic Mel frequency is duly delineated in equation (10):

LogMel(f) = \log\left(1 + \frac{f}{f_0}\right) \times \frac{f_{max}}{\log\left(1 + \frac{f_{max}}{f_0}\right)}    (10)

where LogMel(f) is the logarithmic Mel frequency of the frequency f, f_0 is the starting frequency of the Mel filter bank, typically set to 700 Hz, and f_max is the ending frequency of the Mel filter bank, usually half of the sampling rate.
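As a sketch of how the MFCC and log-Mel features of Eqs. (8)-(10) are typically computed in practice (librosa handles the Mel filter bank, logarithm, and DCT internally); the number of coefficients and Mel bands here are assumptions.

```python
import librosa
import numpy as np

signal, sr = librosa.load("speech_sample.wav", sr=22050)

# MFCCs: Mel filter bank followed by a DCT, as described in Section 3.2.3.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)        # 13 coefficients (assumed)

# Log-Mel spectrogram: Mel-scaled power spectrogram mapped to a dB (log) scale.
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128,
                                     fmax=sr / 2)              # f_max = half the sampling rate
log_mel = librosa.power_to_db(mel, ref=np.max)

print(mfcc.shape, log_mel.shape)   # (n_mfcc, frames), (n_mels, frames)
```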
3.2.5. Constant-Q transform (CQT)

In the general process of speech emotion classification, the spectrogram obtained through the Short-Time Fourier Transform (STFT) maintains a consistent frequency resolution. However, this unvarying resolution can lead to a diminished low-frequency resolution for each octave within the low-frequency range, potentially resulting in the loss of certain distinct low-frequency features present in audio scenes. To tackle this challenge, the Constant-Q Transform is employed to enhance the efficacy of low-frequency features [48]. CQT employs a constant Q-factor, which augments the frequency resolution for each octave within the low-frequency range, thereby more effectively capturing low-frequency characteristics. The specific procedure for extracting CQT features from speech involves segmenting the audio signal and applying a window function, followed by the application of the CQT transformation to derive the CQT spectrogram. The formula for the CQT extraction process is presented as equation (11).

CQT(f, t) = |X(f, t)| * H(f)    (11)

where |X(f, t)| represents the magnitude spectrum of the spectrogram obtained from the audio signal x(t) after undergoing the Short-Time Fourier Transform (STFT), as shown in equation (12).

|X(f, t)| = |STFT(x(t), f, t)|    (12)

H(f) is a constant-Q transform filter, with its center frequencies following an exponential distribution. The mathematical expression of the filter in the frequency domain is given by equation (13).

H(f) = \sum_{n=-\infty}^{\infty} h(n) \, e^{-j 2\pi f n}    (13)

h(n) represents the time-domain response of the filter, usually constructed using Gaussian functions or other window functions.

Fig. 4, Fig. 5 and Fig. 6 display the frequency-domain features of different emotions from the dataset: MFCC, Log-Mel, and CQT.
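A minimal sketch of CQT feature extraction with librosa, corresponding to Section 3.2.5; the number of bins and bins-per-octave are illustrative assumptions rather than values taken from the paper.

```python
import librosa
import numpy as np

signal, sr = librosa.load("speech_sample.wav", sr=22050)

# Constant-Q transform: frequency resolution is constant per octave,
# which preserves detail in the low-frequency range.
cqt = librosa.cqt(signal, sr=sr, n_bins=84, bins_per_octave=12)  # assumed settings
cqt_db = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)

print(cqt_db.shape)   # (n_bins, n_frames)
```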
3.3. Data augmentation

In the realm of speech emotion analysis, datasets often confront challenges such as limited data volume and class imbalance, resulting in insufficient samples for certain emotional states and impeding the model's ability to recognize minority emotional states. To tackle this issue, data augmentation techniques are widely employed. These methods enrich the diversity of training data by simulating variations in sound across different emotional states. As a result, throughout the training process, the model is exposed to a wider array of emotional expression samples, thereby enhancing its capacity to learn features from sounds associated with various emotional states.

In practical implementation, we employed four data augmentation methods: white noise injection, pitch adjustment, time stretching, and audio shifting. These techniques were pivotal in creating synthetic data and strengthening the deep learning model's ability to generalize.

3.3.1. White noise injection

To enhance the resilience of speech models and improve speech emotion recognition under challenging conditions, data augmentation was utilized. Within the data augmentation process, Gaussian white noise was introduced into the original audio, thereby amplifying the variability of the resulting speech emotion models. The scaling of white noise was confined to the interval [0, 1], and a standard probability value of 0.5 was adopted within this speech emotion model. This strategy introduced controlled fluctuations into the speech data, imparting greater adaptability to the model across diverse environmental conditions and augmenting its efficacy in real-world scenarios.

3.3.2. Pitch tuning

Pitch shifting constitutes one of the augmentation techniques implemented in the speech emotion model of this study. It encompasses the manipulation of the pitch of the audio signal while maintaining its original speed. Within pitch tuning, controlled deviations are introduced into the pitch by randomly altering the frequency components of the audio signal. The parameter "bins per octave" is fixed at 12, determining the extent of pitch modification. By integrating pitch variations into the training data, the model gains increased robustness and versatility in accommodating diverse pitch conditions.

3.3.3. Time stretching

Time stretching is a valuable data augmentation technique used in speech emotion models. It entails altering the duration of the audio signal while preserving the pitch. By either stretching or compressing the audio's duration, time stretching introduces temporal variations, effectively simulating the rhythmic and intonation changes that occur in different emotional states of the voice. Incorporating augmented data with time-stretched audio empowers the model to better acclimate to speech signals with diverse rhythms and intonations, ultimately bolstering the robustness and generalization capacity of emotion recognition.

3.3.4. Audio shifting

Audio time shifting is a data augmentation technique that involves manipulating the time axis of speech signals, causing the audio to shift forwards or backwards. This technique aims to simulate the temporal variations in voice caused by different emotional states, thereby enriching the diversity of training data and enhancing the model's understanding of temporal dynamics in sound. By introducing audio time shifting, the model is exposed to a broader range of temporal variations, thereby strengthening its comprehension and discrimination of time-related features associated with different emotional states.

Fig. 7 illustrates the impact of different data augmentation techniques on speech signals, comparing the original audio with the variations introduced through data augmentation.
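The sketch below illustrates the four augmentation operations with numpy and librosa; the noise scale, pitch-step value, stretch rate, and shift range are illustrative assumptions, not the exact settings used by the authors (the text fixes only bins_per_octave = 12 and an application probability of 0.5).

```python
import numpy as np
import librosa

rng = np.random.default_rng()

def add_white_noise(signal, noise_scale=0.01):
    # 3.3.1: inject Gaussian white noise scaled relative to the signal amplitude.
    noise = rng.normal(0.0, 1.0, size=signal.shape)
    return signal + noise_scale * np.max(np.abs(signal)) * noise

def pitch_tune(signal, sr, n_steps=2.0):
    # 3.3.2: shift pitch while keeping the original speed (12 bins per octave).
    return librosa.effects.pitch_shift(signal, sr=sr, n_steps=n_steps,
                                       bins_per_octave=12)

def time_stretch(signal, rate=1.1):
    # 3.3.3: change duration while preserving pitch (rate > 1 speeds up).
    return librosa.effects.time_stretch(signal, rate=rate)

def audio_shift(signal, sr, max_shift_s=0.2):
    # 3.3.4: roll the waveform forwards or backwards along the time axis.
    shift = rng.integers(-int(max_shift_s * sr), int(max_shift_s * sr))
    return np.roll(signal, shift)

signal, sr = librosa.load("speech_sample.wav", sr=22050)
augmented = [
    add_white_noise(signal),
    pitch_tune(signal, sr),
    time_stretch(signal),
    audio_shift(signal, sr),
]
```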
3.4. Proposed methods

The primary aim of this study was to explore integrating multiple deep learning models to enhance speech emotion recognition accuracy within smart home systems. Throughout the research, careful considerations were given to factors such as dataset selection, sample quantity, suitable feature extraction methods, and the selection of appropriate deep learning model classifiers. These factors were identified as crucial components in constructing a comprehensive framework for emotion recognition. The study employed four benchmark emotion recognition datasets—RAVDESS, SAVEE, TESS, and CREMA-D—which were amalgamated into a comprehensive dataset named TOTAL. Given the limited audio samples in these datasets, the research team implemented four proposed data augmentation methods to expand the sample pool.

The proposed deep learning approach involved extracting temporal and spectral features from preprocessed audio data, serving as inputs for models A, B, and C. Subsequently, a complex process of weighted


Fig. 4. Frequency-domain feature: MFCC spectrograms of speech signals with different emotions are represented as (a), (b), (c), (d), (e), (f), (g).

averaging of predictions from each deep learning model led to the development of an integrated model, referred to as model D. This model showcased notable capabilities in effectively discerning emotions from speech.

3.4.1. Proposed model-A

1D CNNs showcase significant advantages within the realm of speech emotion analysis. Firstly, they efficiently capture temporal intricacies within speech data, effectively discerning temporal nuances and variations in speech rates present in audio signals. Secondly, 1D CNNs autonomously acquire feature representations from speech signals, eliminating the necessity for manually crafted feature extractors. They directly derive both local and global features from raw speech waveforms. Moreover, the intrinsic convolutional operations in 1D CNNs leverage parameter sharing, reducing the model's parameter count and thereby mitigating the risk of overfitting.

This study introduced a foundational model named "Model A," specifically designed for speech emotion recognition, as illustrated in Fig. 8. The model seamlessly integrates a 1D CNN with a fully connected network. During the feature extraction phase, the research team carefully extracted a set of crucial features from the raw speech signals. These features include key elements such as ZCR, RMS, MFCCs, log-Mel features, and CQT. These systematically organized key features form well-structured feature vectors. To enhance the data quality of these


Fig. 5. Frequency-domain feature: Log-Mel spectrograms of speech signals with different emotions are represented as (a), (b), (c), (d), (e), (f), (g).

feature vectors, augmentation techniques were applied before utilizing them as inputs for "Model A."

The architecture of "Model A" comprises a dedicated Local Feature Extraction (LFE) module that seamlessly integrates convolutional layers, max-pooling layers, and Dropout layers. This module is designed to unveil the latent local patterns concealed within the audio signals of speech. In addition, FCNs are harnessed to aggregate the ultimate global features. The input to the model assumes the form of a feature vector array with dimensions 4284x1. The initial layer is constituted by a convolutional layer, boasting 256 filters, a kernel size of 5, a stride of 1, and a ReLU activation function. Following this, a judiciously employed max-pooling layer with a pool size of 5 and a stride of 2 achieves the dual purpose of output downsampling and the reduction of data dimensionality. This iterative process is subsequently reiterated multiple times. Further enriching the architecture, another convolutional layer, endowed with 128 filters and leveraging a ReLU activation function, is introduced. Subsequent to this, an additional max-pooling layer, characterized by a pool size of 5, is implemented. The output stemming from this stage is then directed to a convolutional layer equipped with 64 filters and a kernel size of 3, employing the ReLU activation function once more. Lastly, a flattening layer is introduced to seamlessly consolidate the diverse features into a single vector pivotal for emotional classification. This vector is subsequently channeled into a FCN, culminating in the aggregation of comprehensive global features. This


Fig. 6. Frequency-domain feature: CQT spectrograms are displayed as (a), (b), (c), (d), (e), (f), (g).

meticulous process significantly bolsters the precision of emotion classification outcomes.

In Model A, assuming the input is a one-dimensional feature vector x = [x_1, x_2, x_3, ..., x_n], where n represents the sequence length, the calculation formula of the one-dimensional convolutional layer is as follows:

y_i = f(W * x_{i:i+k} + b)    (14)

In equation (14), y_i represents the i-th element in the output sequence, f is the activation function, such as ReLU, W denotes the weight vector of the convolutional kernel (or filter), x_{i:i+k} denotes the subsequence of the input sequence from the i-th element to the (i+k)-th element, and b is the bias term.

To normalize the output into probability distributions, the Softmax activation function is employed within the dense layer. The calculation of the output can be succinctly expressed as follows:

p_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}    (15)

In equation (15), z_i represents the i-th element of the input vector, and \sum_{j=1}^{n} e^{z_j} is the sum of the exponential values of all elements in the vector. Here, n represents the number of possible emotional classifications for sentiment analysis.
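A condensed Keras sketch of the Model-A pipeline described above (stacked 1D convolutions with max-pooling, a flatten layer, and a Softmax dense head), assuming TensorFlow/Keras and the stated 4284x1 input; kernel sizes, strides, and padding beyond those given in the text are assumptions, and Table 1 remains the authoritative configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model_a(input_length=4284, num_classes=7):
    # 1D CNN local feature extraction followed by a fully connected head.
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),
        layers.Conv1D(256, kernel_size=5, strides=1, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),
        layers.Conv1D(256, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(num_classes, activation="softmax"),   # Eq. (15)
    ])
    return model

model_a = build_model_a()
model_a.summary()
```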


Fig. 7. The effect of different data augmentation techniques on speech representation. (a) raw audio (b) injected additive white noise (c) time stretching, (d) pitch
tuning (e) audio shifting.

Fig. 8. The architecture of the proposed baseline model-A.


Fig. 9. The architecture of the proposed baseline model-B.

An optimizer with a learning rate of 0.0001 is employed, and Stochastic Gradient Descent (SGD) is utilized for optimization. Under the SGD approach, the subsequent iterations are conducted.

Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w)    (16)

The parameter w represents the one that minimizes Q(w). Here, i denotes the iteration value of the data. At the i-th iteration, Q_i(w) represents the value of the loss function. The empirical risk is indicated by Q(w).

w = w - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} \Delta Q_i(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \Delta Q_i(w)    (17)

\eta is the learning rate, also known as the step size. In Stochastic Gradient Descent (SGD), the approximation of the true gradient Q(w) using the gradient from a single sample is calculated as follows.

w = w - \eta \, \Delta Q_i(w)    (18)

To measure the loss, categorical cross-entropy is employed, which quantifies the error using the following formula.

Loss = -\sum_{j=1}^{M} \hat{y}_j \log y_j    (19)

In this equation, M represents the output size, y_j represents the j-th scalar value of the model's output, and \hat{y}_j is the corresponding target value for that particular class. Notably, the sum of all y_j is 1 since y_j represents the probability of event j occurring.

Throughout the model implementation process, specific parameters are selected based on the unique task at hand. These parameter choices can be fine-tuned according to the task's demands in order to enhance overall performance. Comprehensive parameter configurations are available in Table 1.

Table 1
Model-A network setup.

Layer (type)                       Output Shape        Parameters
conv1d (Conv1D)                    (None, 4284, 256)   1,536
max_pooling1d (MaxPooling1D)       (None, 2142, 256)   0
conv1d_1 (Conv1D)                  (None, 2142, 256)   327,936
max_pooling1d_1 (MaxPooling1D)     (None, 1071, 256)   0
conv1d_2 (Conv1D)                  (None, 1071, 128)   163,968
max_pooling1d_2 (MaxPooling1D)     (None, 536, 128)    0
conv1d_3 (Conv1D)                  (None, 536, 64)     41,024
max_pooling1d_3 (MaxPooling1D)     (None, 268, 64)     0
flatten (Flatten)                  (None, 17152)       0
dense (Dense)                      (None, 32)          548,896
batch_normalization_5              (None, 32)          0
dense_1 (Dense)                    (None, 7)           231

Total parameters: 1,083,591
Trainable parameters: 1,083,591
Non-trainable parameters: 0
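In Keras terms, the optimizer and loss described by Eqs. (16)-(19) correspond to a compile call such as the hedged sketch below (model_a is the Model-A sketch given earlier; only the 0.0001 learning rate, SGD, and categorical cross-entropy are taken from the text).

```python
import tensorflow as tf

model_a.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.0001),  # Eqs. (16)-(18)
    loss="categorical_crossentropy",                          # Eq. (19)
    metrics=["accuracy"],
)
```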
3.4.2. Proposed model-B

To optimize speech emotion recognition, we introduced Model B—a Bidirectional Long Short-Term Memory network (BiLSTM)—building upon the framework of Model A. As a recurrent neural network, BiLSTM addresses the constraints of one-dimensional Convolutional Neural Networks (1D CNNs) in modeling intricate speech sequences and capturing enduring dependencies. Although 1D CNNs excel in temporal information processing within speech data, they might struggle with intricate temporal dynamics and sequential intricacies. By integrating a BiLSTM layer, our model adeptly extracts features from both preceding and succeeding contexts simultaneously. This capability enables the model to capture more nuanced dependencies and enduring connections within speech sequences, fostering a deeper comprehension of temporal nuances and contextual intricacies within the speech data. This augmentation significantly bolsters speech emotion recognition performance.

Model B, as shown in Fig. 9, employs a fusion of 1D CNN and Bidirectional Long Short-Term Memory (BiLSTM) layers for the task of speech emotion recognition. In contrast to Model A, Model B demands a lengthier training period. The BiLSTM architecture incorporates storage units linked to both input and output gates. The control over all information entering and exiting these storage units is managed by the respective gates. In the context of LSTM, a sequence of data denoted as X = (x_1, x_2, x_3, ..., x_m) serves as input, generating an output sequence Y = (y_1, y_2, y_3, ..., y_m).


During model training, in order to store new important information, some less important information must be disregarded. This is addressed using forget gates, which discard information after specific timestamps by multiplying values with 0 and 1. If the final result is 0, it will be forgotten. X^t represents the input data at time t. Activation of the input, output, and forget gates is as follows:

fx^t = \sigma(w_x X^t + p_x Y^{t-1} + q_x c^{t-1} + b_x)    (20)

fy^t = \sigma(w_y X^t + p_y Y^{t-1} + q_y c^{t-1} + b_y)    (21)

ff^t = \sigma(w_f X^t + p_f Y^{t-1} + q_f c^{t-1} + b_f)    (22)

The weights of the input gate in BiLSTM are denoted as w_x, w_y, and w_f \in R^{N \times M}. The weights of the output gate are represented by p_x, p_y, and p_f \in R^{N \times M}. Additionally, the memory cells in BiLSTM are characterized by weights q_x, q_y, and q_f \in R^{N \times M}. Furthermore, the activations in BiLSTM are modulated by the biases b_x, b_y, and b_f. The calculation of the memory cell value and the output gate is as follows.

c^t = fx^t \cdot fy^t + c^{t-1} \cdot ff^t    (23)

o^t = \sigma(w_0 X^t + p_0 Y^{t-1} + q_0 c^{t-1} + b_0)    (24)

w_0, p_0, q_0, and b_0 are the weights and biases of the output gate in BiLSTM. The ultimate output of the memory cell is calculated as follows:

Y^t = h(c^t) \cdot o^t    (25)

The output of the CNN layer is fed as input to the BiLSTM layer. The BiLSTM layer is equipped with a 64-dimensional output, and the chosen activation function for this model is tanh.

\tanh(X_j) = \frac{e^{X_j} - e^{-X_j}}{e^{X_j} + e^{-X_j}}    (26)

X represents the input vector to the tanh layer. To enhance the learning process, the RMSprop optimizer is utilized, along with categorical cross-entropy as the loss function.
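For intuition, a single recurrence step of Eqs. (20)-(25) can be written out in numpy as below; the sketch follows the paper's notation (fx, fy, ff for the gate activations, c for the memory cell, h = tanh), and the toy weight shapes are illustrative assumptions rather than the model's actual dimensions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, params):
    """One recurrence step written in the paper's notation, Eqs. (20)-(25)."""
    W, P, Q, b = params["W"], params["P"], params["Q"], params["b"]
    fx = sigmoid(W["x"] @ x_t + P["x"] @ y_prev + Q["x"] * c_prev + b["x"])   # Eq. (20)
    fy = sigmoid(W["y"] @ x_t + P["y"] @ y_prev + Q["y"] * c_prev + b["y"])   # Eq. (21)
    ff = sigmoid(W["f"] @ x_t + P["f"] @ y_prev + Q["f"] * c_prev + b["f"])   # Eq. (22)
    c_t = fx * fy + c_prev * ff                                               # Eq. (23)
    o_t = sigmoid(W["o"] @ x_t + P["o"] @ y_prev + Q["o"] * c_prev + b["o"])  # Eq. (24)
    y_t = np.tanh(c_t) * o_t                                                  # Eq. (25), h = tanh
    return y_t, c_t

# Toy dimensions (assumed): input size 4, hidden size 3.
rng = np.random.default_rng(0)
mk = lambda *shape: rng.normal(size=shape)
params = {
    "W": {k: mk(3, 4) for k in "xyfo"},
    "P": {k: mk(3, 3) for k in "xyfo"},
    "Q": {k: mk(3) for k in "xyfo"},   # weights applied to the previous cell state
    "b": {k: mk(3) for k in "xyfo"},
}
y, c = lstm_step(mk(4), np.zeros(3), np.zeros(3), params)
```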
Model B's input layer receives one-dimensional time-series data, which is then passed through four convolutional layers for local feature extraction. The number of convolutional filters in each layer is 256, 256, 128, and 64, respectively, utilizing the ReLU activation function. Subsequently, a max-pooling layer is added after each convolutional layer to reduce data dimensions, preserve essential information, and effectively reduce computational complexity. After the final convolutional layer, the model incorporates a Bidirectional LSTM (BiLSTM) layer with 64 LSTM units. LSTM, as a type of recurrent neural network, possesses remarkable ability in capturing long-term dependencies, making it especially suitable for time-series data processing. The unique aspect of BiLSTM lies in including both forward and backward LSTM layers, which proves highly beneficial in capturing bidirectional contextual information from the time-series data. To prevent overfitting, the model integrates multiple Dropout layers with dropout rates set at 0.2. These strategies contribute to enhancing the model's generalization capability. Following the BiLSTM layer, a flatten layer is added to transform the data from a three-dimensional form to a one-dimensional form. Subsequently, two fully connected layers, consisting of 32 and 7 neurons respectively, are utilized to map the features and ultimately produce the classification results for the emotion categories. The specific parameters of the model are detailed in Table 2.

Table 2
Model-B network setup.

Layer (type)                       Output Shape        Parameters
conv1d (Conv1D)                    (None, 4284, 256)   1,536
max_pooling1d (MaxPooling1D)       (None, 2142, 256)   0
conv1d_1 (Conv1D)                  (None, 2142, 256)   327,936
max_pooling1d_1 (MaxPooling1D)     (None, 1071, 256)   0
conv1d_2 (Conv1D)                  (None, 1071, 128)   163,968
max_pooling1d_2 (MaxPooling1D)     (None, 536, 128)    0
dropout (Dropout)                  (None, 536, 128)    0
conv1d_3 (Conv1D)                  (None, 536, 64)     41,024
max_pooling1d_3 (MaxPooling1D)     (None, 268, 64)     0
bidirectional (Bidirectional)      (None, 268, 128)    66,048
dropout_1 (Dropout)                (None, 268, 128)    0
flatten (Flatten)                  (None, 34304)       0
dense (Dense)                      (None, 32)          1,097,760
dropout_2 (Dropout)                (None, 32)          0
dense_1 (Dense)                    (None, 7)           231

Total parameters: 1,698,503
Trainable parameters: 1,698,503
Non-trainable parameters: 0
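A hedged Keras sketch of Model-B matching the description and Table 2 (four Conv1D/MaxPooling1D stages, Dropout at 0.2, a 64-unit BiLSTM with tanh activation, and 32/7-unit dense layers); kernel sizes and padding are assumptions, while the RMSprop optimizer and categorical cross-entropy loss follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model_b(input_length=4284, num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),
        layers.Conv1D(256, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(256, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Dropout(0.2),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Bidirectional(layers.LSTM(64, activation="tanh",
                                         return_sequences=True)),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model_b = build_model_b()
```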
3.4.3. Proposed model-C

When handling extended sequences, conventional BiLSTMs might confront performance hurdles, particularly in capturing localized sentiment nuances where their effectiveness is limited. To tackle this challenge, Model-C integrates a Transformer architecture, showcased in Fig. 10, with its self-attention mechanism at its core. The Transformer framework empowers the model to dynamically assign varying attention weights across different positions within the input sequence. This mechanism enables adaptive focus on pivotal elements linked to sentiment expression, consequently minimizing attention towards irrelevant details. With each time step, the model assimilates attention weights, intensifying its focus on pivotal attributes pertinent to sentiment expression. This enhancement elevates the model's proficiency in accurately capturing the intricate emotional subtleties embedded within speech.

Fig. 10. The structure of the transformer.

To compute the attention weights \alpha_i, the model utilizes the following formula for each vector x_i within the input sequence x:

\alpha_i = \frac{\exp(f(x_i))}{\sum_j \exp(f(x_j))}    (27)


Fig. 11. The architecture of the proposed baseline model-C.

Table 3
TransformerBlock architecture.
Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) [(None, None, 128)] 0 []
masking (Masking) (None, None, 128) 0 [input_1[0][0]]
positional_encoding (None, None, 128) 0 [masking[0][0]]
multi_head_attention_1 (None, None, 128) 66048 [positional_encoding[0][0],
positional_encoding[0][0]]
dropout (Dropout) (None, None, 128) 0 [multi_head_attention_1[0][0]]
add (Add) (None, None, 128) 0 [positional_encoding[0][0], dropout[0][0]]
layer_normalization (None, None, 128) 256 [add[0][0]]
dense_2 (Dense) (None, None, 256) 33024 [layer_normalization[0][0]]
dense_3 (Dense) (None, None, 128) 32896 [dense_2[0][0]]
dropout_1 (Dropout) (None, None, 128) 0 [dense_3[0][0]]
add_1 (Add) (None, None, 128) 0 [layer_normalization[0][0], dropout_1[0][0]]
layer_normalization_1 (None, None, 128) 256 [add_1[0][0]]

Wherein, the scoring function f(x) is a linear function f(x) = w^T x, where w represents trainable parameters. The model utilizes these attention weights to perform a weighted sum over the input sequence, i.e.:

attentive_x = \sum_i \alpha_i x_i    (28)

The TransformerBlock structure proposed by Model C is illustrated in Table 3.
The architecture of Model C, as shown in Fig. 11, comprises one- ture capture and augments performance and generalization in sentiment
dimensional convolutional layers, batch normalization layers, and max- recognition tasks. Specific model parameter information is outlined in
pooling layers to extract pivotal features from acoustic signal data. Sub- Table 4.
sequently, bidirectional LSTM layers generate comprehensive sequence
outputs, while the Transformer structure processes the bidirectional 3.4.4. Proposed model-D
LSTM output using an attention mechanism. This mechanism assists the Ensemble learning is a robust technique that bolsters the effec-
model in focusing on critical features, thereby enhancing performance tiveness and reliability of predictive capabilities by amalgamating the
in sentiment recognition tasks. outcomes from multiple deep learning models, surpassing the accuracy

achieved by individual models. In the domain of emotion speech recognition, a myriad of features encapsulates speech emotions. By harmonizing multiple models and harnessing their distinct attributes, notable improvements in recognition performance can be achieved.

In this study, the weighted average ensemble approach is employed, amalgamating models A, B, and C. Initially, the most fitting weights for each individual model are determined using the random search technique. Subsequently, these optimal weights are combined with the predictions generated by each model using the tensor dot product function. The weighted prediction results are computed by summing the element-wise products along the specified axis. Following this step, the class exhibiting the highest predicted probability is selected from the weighted prediction results to serve as the ultimate prediction. By comprehensively leveraging the unique strengths of each model, ensemble model D excels in capturing diverse facets of the data, ultimately resulting in a more robust and potent predictive performance.

Table 4
Model-C network setup.
Layer (type) Output Shape Parameters
conv1d (Conv1D) (None, 4284, 512) 3,072
batch_normalization (None, 4284, 512) 2,048
max_pooling1d (MaxPooling1D) (None, 2142, 512) 0
conv1d_1 (Conv1D) (None, 2142, 512) 1,311,232
batch_normalization_1 (None, 2142, 512) 2,048
max_pooling1d_1 (MaxPooling1D) (None, 1071, 512) 0
conv1d_2 (Conv1D) (None, 1071, 256) 655,616
batch_normalization_2 (None, 1071, 256) 1,024
max_pooling1d_2 (MaxPooling1D) (None, 536, 256) 0
conv1d_3 (Conv1D) (None, 536, 256) 196,864
batch_normalization_3 (None, 536, 256) 1,024
max_pooling1d_3 (MaxPooling1D) (None, 268, 256) 0
conv1d_4 (Conv1D) (None, 268, 128) 98,432
batch_normalization_4 (None, 268, 128) 512
max_pooling1d_4 (MaxPooling1D) (None, 134, 128) 0
bidirectional (Bidirectional) (None, 134, 128) 98,816
batch_normalization_5 (None, 134, 128) 512
model_1 (TransformerBlock) (None, None, 128) 132,480
flatten (Flatten) (None, 17152) 0
dense (Dense) (None, 512) 8,782,336
batch_normalization_6 (None, 512) 2,048
dense_1 (Dense) (None, 7) 3,591
Total parameters: 11,159,303
Trainable parameters: 11,154,695
Non-trainable parameters: 4,608

4. Experimental analysis

The experiment involved the utilization of four datasets. After preprocessing, a diverse set of temporal and frequency domain features were extracted from these datasets. To further enhance the extracted features, data augmentation techniques were applied. These enriched features were then employed to perform the emotion recognition task. The study incorporated several classifier models, namely the proposed Model A, Model B, Model C, and an ensemble model known as Model D. The Weighted Average (WA) score was used to address imbalanced class distribution within the datasets.

4.1. Datasets

The study comprehensively evaluated the proposed model using four distinct English emotion speech datasets, offering a comprehensive and genuine portrayal of emotional expressions. In the realm of emotion speech recognition, the RAVDESS dataset integrates both audio and video recordings, involving 24 actors with an equal gender distribution. The SAVEE dataset comprises 480 speech expressions performed by four British actors, encapsulating seven distinct emotions. Meanwhile, the TESS dataset features audio snippets of targeted words narrated by two British female actors, ensuring a diverse range of vocal expressions across various age groups. Notably, the CREMA-D dataset encompasses recordings from 48 male and 43 female actors, spanning diverse racial and cultural backgrounds, enriching its relevance within the emotion speech recognition domain. The TOTAL dataset amalgamates these four datasets. It's important to note that among these datasets, only the RAVDESS dataset includes the "Calm" emotional category. To ensure a more meaningful comparison, when constructing the TOTAL dataset, the "Calm" emotional segment from the RAVDESS dataset was deliberately excluded.

Fig. 12 vividly presents the distribution pattern of speech samples across diverse emotional categories within the dataset.

4.2. Model training

In this experiment, four publicly available datasets along with the TOTAL dataset were utilized, merging them to extract a diverse set of temporal and frequency domain features. Subsequently, data augmentation techniques were applied to enhance dataset diversity. The subsequent steps involved data normalization, computing feature means and standard deviations, followed by partitioning into training, validation, and test sets in an 8:1:1 ratio. Training was conducted using the Keras deep learning framework, fine-tuning hyperparameters via grid search for optimal values. After determining a batch size of 64, weights were chosen to maximize the Weighted Average Accuracy (WAA) for ensemble model D. Additionally, the 'Adam' optimizer was selected, employing "categorical cross-entropy" as the loss function. To comprehensively learn data features, the models underwent 100 training epochs on a Tesla P100 GPU. Leveraging weighted predictions, models A, B, and C effectively collaborated within ensemble model D, culminating in the final prediction results.

5. Experimental analysis

When evaluating the accuracy of speech emotion classification, four key metrics come into play: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These metrics represent different scenarios of correct and incorrect classifications. True Positives refer to instances correctly labeled with an emotion category, while True Negatives signify instances accurately classified as not belonging to the emotion label. Conversely, False Positives involve instances mistakenly labeled as emotion categories, and False Negatives entail instances inaccurately classified without emotion labels. Together, these metrics establish the foundation for the confusion matrix.

A comprehensive analysis of the experimental outcomes was conducted across all five datasets, encompassing the three distinct individual models (A, B, and C) and the ensemble model D. Notably, given the exceptional performance of ensemble model D, the confusion matrices for this model are provided, offering comparisons before and after the application of data augmentation.

5.1. Evaluation metrics

Precision, recall, and F1 score metrics were also provided for each individual model on every dataset. For ensemble model D, the Weighted Average Accuracy (WAA) metric was employed, which calculates accuracy by adjusting the weights of each model. To assess the performance of a specific emotional label $L_i$, evaluations were based on the counts of $TP_i$, $TN_i$, $FN_i$, and $FP_i$, using the label's counts $L_i$ for computing accuracy, precision, and recall.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \tag{29}$$

The WAA metric is derived from the ensemble Model-D classifier by considering individual model weights, resulting in the calculation of the average accuracy.

Fig. 12. The class-wise distribution of the number of categories in the four datasets is as follows: (a) RAVDESS (b) SAVEE (c) TESS (d) CREMA-D.

$$\mathrm{WAA} = \frac{\sum_{i=1}^{K}\left(\dfrac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}\right)}{K} \tag{30}$$

The precision metric calculates the ratio of accurately recognized positive speech emotion labels ($TP_i$) to the sum of true positive instances and false positive instances ($TP_i + FP_i$).

$$\mathrm{Precision} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} TP_i + \sum_{i=1}^{K} FP_i} \tag{31}$$

Across all labels, the recall signifies the proportion of accurately identified positive speech emotion labels.

$$\mathrm{Recall} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} TP_i + \sum_{i=1}^{K} FN_i} \tag{32}$$

The F1-score merges precision and recall into a singular metric, encompassing both attributes within a multi-class SER task.

$$F1\text{-}score = \frac{2 \times \sum_{i=1}^{K} \mathrm{Precision}_i \times \sum_{i=1}^{K} \mathrm{Recall}_i}{\sum_{i=1}^{K} \mathrm{Precision}_i + \sum_{i=1}^{K} \mathrm{Recall}_i} \tag{33}$$

The F1 score is calculated for each class in the dataset, followed by the computation of the average value, without taking into account the frequency of each label or applying any weights.

$$\mathrm{Macro}\text{-}F1 = \frac{1}{K}\cdot\frac{2 \times \sum_{i=1}^{K} \mathrm{Precision}_i \times \sum_{i=1}^{K} \mathrm{Recall}_i}{\sum_{i=1}^{K} \mathrm{Precision}_i + \sum_{i=1}^{K} \mathrm{Recall}_i} \tag{34}$$

In the same manner, we can derive the macro-recall and macro-precision scores by computing the metrics within each category.

$$\mathrm{Macro\text{-}recall} = \frac{1}{K}\cdot\frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} TP_i + \sum_{i=1}^{K} FN_i} \tag{35}$$

$$\mathrm{Macro\text{-}precision} = \frac{1}{K}\cdot\frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} TP_i + \sum_{i=1}^{K} FP_i} \tag{36}$$

Weighted Average Recall (WAR) stands as one of the benchmarks for evaluating the performance of multi-class classification models, focusing on the variations in importance among different categories. This evaluation metric assigns appropriate weights to each category when
𝑖=1 𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛𝑖 + 𝑖=1 𝑅𝑒𝑐𝑎𝑙𝑙𝑖

Table 5
Comparison of proposed model performance before and after data augmentation based
on RAVDESS, SAVEE, TESS, and CREMA-D datasets.
Datasets Model Without data augmentation (%) With data augmentation (%)
RAVDESS Model-A 53.476 77.819
Model-B 58.249 77.917
Model-C 60.573 84.931
Model-D 66.458 87.513

SAVEE Model-A 53.780 77.916


Model-B 56.423 80.208
Model-C 59.942 83.958
Model-D 63.743 86.233

TESS Model-A 95.213 99.321


Model-B 95.471 98.785
Model-C 96.751 99.828
Model-D 96.957 99.857

CREMA-D Model-A 51.353 63.168


Model-B 55.426 71.473
Model-C 59.314 80.542
Model-D 63.761 82.295

TOTAL Model-A 68.651 85.467


Model-B 70.354 87.729
Model-C 74.513 92.323
Model-D 77.140 97.546

computing recall, thus conducting a weighted averaging of recall values to provide a more comprehensive assessment of model performance.

$$\mathrm{WAR} = \frac{\sum_{i=1}^{N} \mathrm{Recall}_i \times \mathrm{Weight}_i}{\sum_{i=1}^{N} \mathrm{Weight}_i} \tag{37}$$

In this equation, $\mathrm{Recall}_i$ represents the recall of the i-th category, $\mathrm{Weight}_i$ stands for the weight of the corresponding category, and N is the total number of categories.

5.2. Performance analysis

In this study, we conducted a comprehensive evaluation of three models across various datasets. Additionally, we compared the weighted average accuracy of the ensemble model (Model-D) before and after data augmentation. As depicted in Table 5, before data augmentation, each model exhibited diverse performances across different datasets. Notably, Model-D showcased superior performance on the RAVDESS and SAVEE datasets, achieving average accuracies of 66.458% and 63.743%, respectively, while the other models demonstrated comparatively lower accuracies. The notable improvement in Model-D's performance can be attributed to the broader scope of the RAVDESS dataset, incorporating intricate emotional expressions alongside variations in speech rate, pitch, and accent, thereby challenging the emotion recognition task. Conversely, the relatively simpler nature of the SAVEE dataset allowed the models to directly discern emotions. However, its smaller scale might have constrained its ability to generalize, contributing to lower accuracy. Across the TESS dataset, all models performed exceptionally well, surpassing an average accuracy of 95%, with Model-D once again achieving the highest accuracy of 96.957%. This can be attributed to the smaller sample size of the TESS dataset, making the emotion recognition task more straightforward, with speech samples displaying direct emotional expressions, resulting in higher accuracy. However, on the more intricate CREMA-D dataset, all models displayed relatively lower average accuracies, with Model-A at only 51.353% and Model-D at 63.761%. This is owing to the diverse emotional expressions and dimensions encompassed in the CREMA-D dataset, rendering the emotion recognition task more challenging. The models encountered difficulties in fully capturing emotional features within this dataset, ultimately leading to reduced accuracy.

Following data augmentation, we thoroughly analyzed the models' performances. The impact of this augmentation on the RAVDESS dataset was significant, resulting in substantial improvements across all models. Model-A and Model-B saw their accuracies rise from 53.476% and 58.249% to 77.819% and 77.917%, respectively. Likewise, Model-C and Model-D displayed marked increases in accuracy. The SAVEE dataset also showed positive effects on all models' accuracies, with Model-A, Model-B, and Model-C achieving 77.916%, 80.208%, and 83.958%, respectively, while Model-D further elevated to 86.233%. For the TESS dataset, the impact of data augmentation on model performance was relatively minor due to the dataset's simplicity. However, all models still achieved commendable accuracies.

Within the complex CREMA-D dataset, data augmentation significantly boosted the average accuracy of all models. It lifted the accuracies of Model-A through Model-D from 51.353%, 55.426%, 59.314%, and 63.761% to 63.168%, 71.473%, 80.542%, and 82.295%, respectively. Considering the overall effect on the TF-Mix, it's evident that data augmentation notably enhanced the emotion analysis models' performance. Especially on the relatively complex RAVDESS and CREMA-D datasets, this technique substantially improved the models' ability to generalize and extract features, effectively boosting accuracy. However, its impact was relatively limited on the straightforward TESS dataset. After employing a grid search technique and adjusting Model-A, B, and C's appropriate weights, we derived the confusion matrix for Ensemble Model-D. It is displayed in Fig. 13, evaluating the performance pre and post data augmentation across all datasets. Furthermore, we provide accuracy and loss function curves for Model-D after data augmentation, covering all four datasets and the combined TOTAL dataset (Fig. 14 and Fig. 15).

Tables 6 to 10 showcase the specific recognition outcomes of proposed Model-D across various emotions in the RAVDESS, SAVEE, TESS, CREMA-D, and TOTAL datasets. Complementing these, Tables 11 and 12, along with Fig. 16, present the category-level performance of each individual model and the ensemble Model-D after data augmentation. These assessments cover accuracy, precision, recall, F1 score, as well as average accuracy, macro-precision, macro-recall, and macro-F1 score. A closer examination of Fig. 16 highlights the exceptional performance of Model-D, showcasing the highest levels of macro-precision, macro-recall, and macro-F1 score across nearly all datasets. Particularly notable is its outstanding performance on the TESS and TOTAL datasets, where it approaches nearly 100% in macro-precision, macro-recall, and macro-F1 score. This underscores its superior balance between

Fig. 13. Comparison of ensemble model D before and after data augmentation (DA).

Fig. 14. Loss and accuracy in speech emotion analysis on four datasets.

Fig. 15. Loss and accuracy on the TOTAL dataset.

Table 6
Confusion matrix of our method in the RAVDESS dataset.
True Emotion Predicted Emotion
Angry Calm Disgust Fear Happy Neutral Sad Surprise
Angry 90.96 0.00 0.60 1.81 2.41 0.00 1.81 2.41
Calm 0.48 88.46 0.96 0.48 0.96 2.88 5.29 0.48
Disgust 3.79 0.47 89.57 0.95 0.00 0.47 3.32 1.42
Fear 3.83 0.00 4.37 81.97 3.28 0.55 2.19 3.83
Happy 3.02 1.01 0.50 3.52 82.91 3.52 1.01 4.52
Neutral 0.00 6.67 1.11 1.11 0.00 84.44 4.44 2.22
Sad 0.00 4.98 1.49 3.98 3.98 3.48 81.59 0.50
Surprise 0.00 0.00 1.65 3.30 3.85 1.10 1.65 88.46

Table 7
Confusion matrix of our method in the SAVEE dataset.
True Emotion Predicted Emotion
Angry Disgust Fear Happy Neutral Sad Surprise
Angry 85.00 0.00 1.67 3.33 1.67 1.67 6.67
Disgust 0.00 69.64 7.14 1.79 12.50 7.14 1.79
Fear 6.06 1.52 77.27 1.52 1.52 1.52 10.61
Happy 0.00 3.85 3.85 86.54 0.00 0.00 5.77
Neutral 0.00 3.17 0.00 0.79 92.86 2.38 0.79
Sad 0.00 4.76 3.17 0.00 1.59 90.48 0.00
Surprise 0.00 3.51 5.26 0.00 0.00 0.00 91.23

Table 8
Confusion matrix of our method in the TESS dataset.
True Emotion Predicted Emotion
Angry Disgust Fear Happy Neutral Sad Surprise
Angry 100.00 0.00 0.00 0.00 0.00 0.00 0.00
Disgust 0.00 100.00 0.00 0.00 0.00 0.00 0.00
Fear 0.00 0.00 100.00 0.00 0.00 0.00 0.00
Happy 0.00 0.00 0.00 99.22 0.00 0.00 0.78
Neutral 0.00 0.00 0.00 0.00 100.00 0.00 0.00
Sad 0.00 0.25 0.00 0.00 0.00 99.75 0.00
Surprise 0.00 0.00 0.00 0.00 0.00 0.00 100.00

classification accuracy and recall comprehensiveness. Moreover, Model-C and Model-B also demonstrate commendable performance by exhibiting higher macro-precision, macro-recall, and macro-F1 scores across the datasets, indicating their overall competence in the classification task. However, Model-A's performance appears comparatively weaker on the CREMA-D dataset, displaying the lowest macro-precision, macro-recall, and macro-F1 score. Despite its strong performance on other datasets, it falls behind when compared to Model-D, Model-B, and Model-C. Considering the metrics of macro-precision, macro-recall, and macro-F1 score collectively, Model-D emerges as the optimal choice.

Table 9
Confusion matrix of our method in the CREMA-D dataset.
True Emotion Predicted Emotion
Angry Disgust Fear Happy Neutral Sad
Angry 88.59 3.54 1.18 5.04 1.26 0.39
Disgust 3.27 81.39 2.80 3.27 3.59 5.67
Fear 2.43 3.84 77.10 5.57 3.92 7.14
Happy 5.35 4.36 5.89 79.36 2.98 2.06
Neutral 1.57 4.98 2.40 2.67 80.46 7.93
Sad 0.40 5.36 6.08 1.76 5.60 80.82
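For reference, the sketch below shows how per-class TP/FP/FN counts and the per-emotion precision, recall, and F1 values of the kind reported in Tables 11 and 12 can be derived from a confusion matrix of raw counts; the 3-class example matrix and the plain (unweighted) macro averaging at the end are illustrative assumptions.

```python
import numpy as np

def per_class_report(cm):
    """Per-class precision, recall and F1 from a confusion matrix of raw counts
    (rows = true labels, columns = predicted labels)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class i, but true label differs
    fn = cm.sum(axis=1) - tp          # true class i, predicted as something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative 3-class count matrix (the matrices in Tables 6-10 are row-normalised
# percentages, so real counts would first have to be recovered).
cm = np.array([[50, 3, 2],
               [4, 45, 6],
               [1, 5, 44]])
precision, recall, f1 = per_class_report(cm)
print("macro precision/recall/F1:", precision.mean(), recall.mean(), f1.mean())
```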

Fig. 16. Comparison of models A, B, C, and ensemble model D in terms of (a) macro precision, (b) macro recall, (c) mean accuracy, and (d) macro F1-score using
bar plot representations.

Table 10
Confusion matrix of our method in the TOTAL dataset.
True Emotion Predicted Emotion
Angry Disgust Fear Happy Neutral Sad Surprise
Angry 97.98 0.34 0.27 0.94 0.34 0.13 0.00
Disgust 0.77 96.60 0.51 0.58 0.58 0.90 0.06
Fear 0.86 0.60 95.35 0.33 1.53 1.26 0.07
Happy 1.42 0.80 1.17 95.06 0.99 0.25 0.31
Neutral 0.58 0.96 0.32 0.71 96.98 0.45 0.00
Sad 0.34 0.74 1.15 0.07 0.88 96.75 0.07
Surprise 0.00 0.19 0.57 0.38 0.38 0.00 98.48

5.3. Comparison analysis with other methods

In the domain of human-computer interaction, speech emotion recognition has consistently garnered considerable attention, especially in applications involving smart furniture. However, the current research landscape concerning various methods to enrich emotion recognition-related data through augmentation remains relatively limited. Experimental findings have showcased a significant enhancement in emotion recognition accuracy through the adoption of multiple data augmentation techniques, underscoring the crucial role of data augmentation in boosting performance. Nevertheless, the utilization of distinct features across different models aimed at enhancing performance has resulted in notable fluctuations in experimental outcomes. This study introduces the innovative TF-Mix model, which intelligently combines temporal and frequency domain features, aiming to more accurately capture the diverse spectrum of emotional expressions. Following data augmentation across diverse datasets, we conducted a 10-fold cross-validation and integrated it into ensemble model D to compare its accuracy against other models. A comprehensive comparative analysis elucidates the remarkable performance of the TF-Mix model in enhancing overall effectiveness. This proposed model not only demonstrates exceptional generalization capabilities but also significantly improves accuracy while demanding lower computational resources, making it better suited for real-time human-machine interaction applications. The experimental results highlighted in Table 13 robustly corroborate this conclusion.
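The comparison figures for our model in Table 13 rest on 10-fold cross-validation. A minimal sketch of such an evaluation loop using scikit-learn's StratifiedKFold is shown below; the feature matrix, label vector, and build_model factory (assumed to return a compiled Keras model that reports an accuracy metric) are placeholders rather than the actual pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(features, labels, build_model, n_splits=10, epochs=100, batch_size=64):
    """Stratified 10-fold cross-validation; returns the mean and standard deviation
    of the per-fold test accuracies."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    fold_acc = []
    for train_idx, test_idx in skf.split(features, labels):
        model = build_model()   # fresh, untrained (compiled) model for every fold
        model.fit(features[train_idx], labels[train_idx],
                  epochs=epochs, batch_size=batch_size, verbose=0)
        _, acc = model.evaluate(features[test_idx], labels[test_idx], verbose=0)
        fold_acc.append(acc)
    return float(np.mean(fold_acc)), float(np.std(fold_acc))
```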

Table 11
Precision, recall, F1-score, and mean accuracy for different emotions on Model A and Model B.
Category Model A Model B
Precision Recall F1score Accuracy Precision Recall F1score Accuracy
(%) (%) (%) (%) (%) (%) (%) (%)
RAVDESS
Angry 78 86 82 73.819 84 81 82 77.917
Calm 83 76 79 87 85 86
Disgust 81 80 81 91 75 82
Fear 71 71 71 77 78 77
Happy 73 65 69 76 71 73
Neutral 68 56 61 75 68 71
Sad 56 69 62 63 75 69
Surprise 83 80 81 74 86 79

SAVEE
Angry 76 70 73 77.916 89 78 83 80.208
Disgust 88 52 65 67 68 67
Fear 72 83 77 88 68 77
Happy 73 83 77 75 88 81
Neutral 81 90 85 83 92 87
Sad 83 78 80 90 71 80
Surprise 73 75 74 71 84 77

TESS
Angry 99 100 100 99.321 99 98 98 98.785
Disgust 100 100 100 98 100 99
Fear 100 99 100 100 99 100
Happy 98 99 99 100 97 99
Neutral 100 98 99 99 100 99
Sad 98 100 99 99 98 98
Surprise 100 99 99 97 100 98

CREMA-D
Angry 75 74 75 63.168 85 76 80 71.473
Disgust 64 52 58 69 67 68
Fear 62 62 62 70 69 69
Happy 63 53 58 68 71 69
Neutral 56 69 62 72 67 70
Sad 60 70 64 68 78 73

TOTAL
Angry 88 90 89 85.467 85 93 89 87.729
Disgust 83 84 83 89 85 87
Fear 84 81 83 84 88 86
Happy 84 85 85 91 81 86
Neutral 89 86 87 90 89 89
Sad 82 85 83 85 91 88
Surprise 94 93 93 98 88 93

For a holistic evaluation of the overall model performance, it was imperative to amalgamate RAVDESS, SAVEE, TESS, and CREMA-D into a unified corpus, given the constrained sample sizes in individual datasets. Prior to the experiments, comprehensive training of models A, B, and C was indispensable. The empirical evidence vividly illustrates the remarkable advancements made by ensemble model D in augmenting the accuracy of emotion recognition. This advancement brings forth promising prospects for employing speech emotion recognition in smart home environments. The amalgamation of multiple datasets into a unified corpus not only expands the training sample size but also facilitates a more comprehensive assessment of model robustness and generalization capabilities. This approach holds profound significance in enhancing the recognition performance of emotional states.

6. Conclusion and future work

Insufficient data presents a significant challenge for deep neural network models in the domain of sentiment emotion recognition, often limiting their capabilities. Scarcity in data samples commonly triggers overfitting issues in complex deep models. This study undertook a comprehensive exploration of deep learning-based systems for sentiment recognition, spanning across multiple datasets. It involved extracting a diverse range of temporal and spectral features from each audio file. In Model A, several local feature extraction modules were designed to capture inherent speech signal characteristics. Models B and C integrated global feature extraction modules to derive enduring global contextual dependencies and correlations from these local features. Rigorous evaluations were conducted on standard sentiment recognition datasets, validating the effectiveness of three innovative deep learning-based weighted ensemble configuration models. Leveraging data augmentation techniques, the proposed weighted ensemble Model D exhibited remarkable achievements across the RAVDESS, SAVEE, TESS, CREMA-D, and TOTAL datasets, achieving weighted average accuracies of 87.513%, 86.233%, 99.857%, 82.295%, and 97.546%, respectively.

While strides have been made in methodologies, features, and accuracy within sentiment recognition, persistent limitations call for attention to develop efficient solutions applicable in industrial settings. Presently, most existing datasets undergo deductive and scripted processing, capturing only a restricted range of discrete utterances and expressions from corpora. Discrepancies between real-world data and deduced datasets may introduce notable biases. Furthermore, the inadequacy of sample sizes in datasets presents a challenge for comprehensive training of deep learning models. Experimental setups for these datasets often simulate semi-natural scenarios, lacking real-world environmental noise. This raises concerns regarding the capability of systems built on such datasets to accurately detect emotions amidst real-world noise.

Table 12
Precision, recall, F1-score, and mean accuracy for different emotions on Model C and Model D.
Category Model C Model D
Precision Recall F1score Accuracy Precision Recall F1score Accuracy
(%) (%) (%) (%) (%) (%) (%) (%)
RAVDESS
Angry 87 91 89 84.931 90 92 91 87.513
Calm 94 88 91 89 88 89
Disgust 93 87 90 90 90 90
Fear 78 81 79 86 90 85
Happy 78 79 78 85 85 85
Neutral 73 78 75 82 83 82
Sad 84 84 84 84 80 82
Surprise 88 89 88 90 92 91

SAVEE
Angry 84 87 85 83.958 93 85 89 86.233
Disgust 79 66 72 78 70 73
Fear 86 76 81 81 77 80
Happy 84 79 81 90 87 88
Neutral 88 97 92 92 93 92
Sad 83 87 85 86 90 88
Surprise 78 81 79 76 91 84

TESS
Angry 100 100 100 99.828 100 100 100 99.857
Disgust 99 100 100 100 100 100
Fear 100 100 100 100 100 100
Happy 100 100 99 100 99 100
Neutral 100 100 100 100 100 100
Sad 99 100 100 100 100 100
Surprise 100 100 100 99 100 100

CREMA-D
Angry 86 87 87 80.542 87 89 88 82.295
Disgust 79 80 80 80 81 80
Fear 78 78 78 81 82 80
Happy 82 77 79 82 79 81
Neutral 81 82 82 80 80 80
Sad 78 78 78 78 81 80

TOTAL
Angry 93 95 94 92.323 96 98 97 97.546
Disgust 91 90 91 97 97 97
Fear 92 90 91 96 95 96
Happy 93 91 92 96 97 96
Neutral 93 92 91 97 97 97
Sad 95 96 95 97 97 97
Surprise 98 99 98 98 98 98

Table 13
Accuracy comparison.
Datasets Methodology Split ratio WAR (%)
RAVDESS
Kim et al. [49] BiLSTM-Transformer 2D CNN 10-fold 80.19
Luna-Jiménez et al. [50] CNN BiLSTM 10-fold 80.19
Mustaqeem et al. [51] DCNN 8:1:1 85.00
Alnuaim et al. [52] MLP 8:1:1 81.00
Our model Model D 10-fold 84.50

SAVEE
Hajarolasvadi et al. [34] 3D CNN 10-fold 81.05
Xu et al. [53] CNN Attention 10-fold 77.8
Mekruksavanich et al. [54] 1D CNN 10-fold 65.83
Kwon et al. [51] DCNN 10-fold 82.00
Our model Model D 10-fold 83.71

TESS
Chatterjee et al. [3] 1D CNN 8:1:1 95.79
Krishnan et al. [55] DNN, RNN, GRU 10-fold 95.82
Our model Model D 10-fold 99.70

CREMA-D
Feng et al. [56] CNN 10-fold 69.80
Huang et al. [57] CNN Transformer 10-fold 76.90
Aggarwal et al. [58] SVM 8:1:1 58.22
Our model Model D 10-fold 80.34

TOTAL (Our model) Model D 10-fold 94.46
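Table 13 reports WAR, the Weighted Average Recall of Eq. (37). A short sketch of that computation follows; using the class sample counts as the category weights is an assumption, as the weighting choice is not spelled out here, and the example values are illustrative.

```python
import numpy as np

def weighted_average_recall(recalls, weights):
    """WAR as in Eq. (37): per-category recalls weighted and normalised by the category weights."""
    recalls = np.asarray(recalls, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((recalls * weights).sum() / weights.sum())

# Illustrative usage: seven per-class recalls with class sample counts as weights (assumed choice).
recalls = [0.91, 0.88, 0.82, 0.85, 0.84, 0.81, 0.92]
supports = [192, 208, 211, 183, 199, 201, 182]
print("WAR:", weighted_average_recall(recalls, supports))
```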

Further exploration is crucial in the domain of sentiment detection, particularly in continuous voice dialogues involving multiple speakers. Addressing the complexity of multiple emotions within utterances might entail tackling a multi-label problem. Additionally, deeper investigation into integrating information from various optimal acoustic features using a multi-modal CNN architecture is warranted. While conventional acoustic features mainly encompass amplitude and phase data, existing sentiment recognition systems predominantly emphasize amplitude. Therefore, exploring effective phase-based features holds promise for future research. Despite demonstrating significant sentiment recognition performance across five datasets in this study, a more comprehensive analysis remains necessary. Future endeavors will prioritize optimal feature selection methods and the integration of diverse attention mechanisms to reduce training time requirements for individual model ensemble predictions. These efforts aim to further augment the performance of sentiment recognition tasks.

CRediT authorship contribution statement

Mengsheng Wang: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Hongbin Ma: Data curation, Funding acquisition, Investigation, Project administration, Supervision, Validation. Yingli Wang: Investigation, Project administration, Software, Validation, Visualization. Xianhe Sun: Software, Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] Seaborn K, Miyake NP, Pennefather P, Otake-Matsuura M. Voice in human–agent interaction: a survey. ACM Comput Surv 2021;54(4):1–43.
[2] de Lope J, Graña M. An ongoing review of speech emotion recognition. Neurocomputing 2023;528:1–11.
[3] Chatterjee R, Mazumdar S, Sherratt RS, Halder R, Maitra T, Giri D. Real-time speech emotion analysis for smart home assistants. IEEE Trans Consum Electron 2021;67(1):68–76.
[4] Abbaschian BJ, Sierra-Sosa D, Elmaghraby A. Deep learning techniques for speech emotion recognition, from databases to models. Sensors 2021;21(4):1249.
[5] Singh YB, Goel S. A systematic literature review of speech emotion recognition approaches. Neurocomputing 2022;492:245–63.
[6] Xu X, Li D, Zhou Y, Wang Z. Multi-type features separating fusion learning for speech emotion recognition. Appl Soft Comput 2022;130:109648.
[7] Kethireddy R, Kadiri SR, Gangashetty SV. Exploration of temporal dynamics of frequency domain linear prediction cepstral coefficients for dialect classification. Appl Acoust 2022;188:108553.
[8] Wu Q, Xiong S, Zhu Z. Replay speech answer-sheet detection on intelligent language learning system based on power spectrum decomposition. IEEE Access 2021;9:104197–204.
[9] Bhavan A, Chauhan P, Shah RR, et al. Bagged support vector machines for emotion recognition from speech. Knowl-Based Syst 2019;184:104886.
[10] Rahman MR, Arefin MS, Hossain MB, Habib MA, Kayes A. Towards a framework for acquisition and analysis of speeches to identify suspicious contents through machine learning. Complexity 2020;2020:1–14.
[11] Ramesh S, Gomathi S, Sasikala S, Saravanan T. Automatic speech emotion detection using hybrid of gray wolf optimizer and naïve Bayes. Int J Speech Technol 2021:1–8.
[12] Yao B, Li G, Wu W. State space representation and phase analysis of gradient descent optimizers. Sci China Inf Sci 2023;66(4):142102.
[13] Wu X, Zhang Q. Design of aging smart home products based on radial basis function speech emotion recognition. Front Psychol 2022;13:882709.
[14] Singh JB, Lehana PK. Emotional speech analysis using harmonic plus noise model and Gaussian mixture model. Int J Speech Technol 2019;22:483–96.
[15] Asghar A, Sohaib S, Iftikhar S, Shafi M, Fatima K. An urdu speech corpus for emotion recognition. PeerJ Comput Sci 2022;8:e954.
[16] Jothimani S, Premalatha K. Mff-saug: multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network. Chaos Solitons Fractals 2022;162:112512.
[17] Ahmed MR, Islam S, Islam AM, Shatabda S. An ensemble 1d-cnn-lstm-gru model with data augmentation for speech emotion recognition. Expert Syst Appl 2023;218:119633.
[18] Murugaiyan S, Uyyala SR. Aspect-based sentiment analysis of customer speech data using deep convolutional neural network and bilstm. Cogn Comput 2023;15(3):914–31.
[19] Singh J, Saheer LB, Faust O. Speech emotion recognition using attention model. Int J Environ Res Public Health 2023;20(6):5140.
[20] Livingstone SR, Russo FA. The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018;13(5):e0196391.
[21] Jackson P, Haq S. Surrey audio-visual expressed emotion (savee) database. Guildford, UK: University of Surrey; 2014.
[22] Pichora-Fuller MK, Dupuis K. Toronto emotional speech set (tess). Scholars Portal Dataverse 2020;1:2020.
[23] Cao H, Cooper DG, Keutmann MK, Gur RC, Nenkova A, Verma R. Crema-d: crowd-sourced emotional multimodal actors dataset. IEEE Trans Affect Comput 2014;5(4):377–90.
[24] Panda SK, Jena AK, Panda MR, Panda S. Speech emotion recognition using multimodal feature fusion with machine learning approach. Multimed Tools Appl 2023:1–19.
[25] Alnuaim AA, Zakariah M, Alhadlaq A, Shashidhar C, Hatamleh WA, Tarazi H, et al. Human-computer interaction with detection of speaker emotions using convolution neural networks. Comput Intell Neurosci 2022:2022.
[26] Salvati D, Drioli C, Foresti GL. A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients. Expert Syst Appl 2023;222:119750.
[27] Tarwireyi P, Terzoli A, Adigun MO. Using multi-audio feature fusion for Android malware detection. Comput Secur 2023;131:103282.
[28] Jagadeeshwar K, Sreenivasarao T, Pulicherla P, Satyanarayana K, Mohana Lakshmi K, Kumar PM. Asernet: automatic speech emotion recognition system using mfcc-based lpc approach with deep learning cnn. Int J Model Simul Sci Comput 2022:2341029.
[29] Mao K, Wang Y, Ren L, Zhang J, Qiu J, Dai G. Multi-branch feature learning based speech emotion recognition using scar-net. Connect Sci 2023;35(1):2189217.
[30] Cao X, Jia M, Ru J, Pai T-w. Cross-corpus speech emotion recognition using subspace learning and domain adaption. EURASIP J Audio Speech Music Process 2022;2022(1):32.
[31] Wang M, You H, Ma H, Sun X, Wang Z. Sentiment analysis of online new energy vehicle reviews. Appl Sci 2023;13(14):8176.
[32] Pham NT, Dang DNM, Nguyen ND, Nguyen TT, Nguyen H, Manavalan B, et al. Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition. Expert Syst Appl 2023:120608.
[33] Huang Z, Dong M, Mao Q, Zhan Y. Speech emotion recognition using cnn. In: Proceedings of the 22nd ACM international conference on multimedia; 2014. p. 801–4.
[34] Hajarolasvadi N, Demirel H. 3d cnn-based speech emotion recognition using k-means clustering and spectrograms. Entropy 2019;21(5):479.
[35] Mustaqeem, Kwon S. A cnn-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2019;20(1):183.
[36] Mao S, Ching P, Lee T. Deep learning of segment-level feature representation with multiple instance learning for utterance-level speech emotion recognition. In: Interspeech; 2019. p. 1686–90.
[37] Zhang S, Zhang S, Huang T, Gao W, Tian Q. Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Trans Circuits Syst Video Technol 2017;28(10):3030–43.
[38] Zhao Z, Bao Z, Zhang Z, Cummins N, Schuller BW. Attention-enhanced connectionist temporal classification for discrete speech emotion recognition; 2019.
[39] Luo D, Zou Y, Huang D. Investigation on joint representation learning for robust feature extraction in speech emotion recognition. In: Interspeech; 2018. p. 152–6.
[40] Chen M, He X, Yang J, Zhang H. 3-d convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process Lett 2018;25(10):1440–4.
[41] Xie Y, Liang R, Liang Z, Huang C, Zou C, Schuller B. Speech emotion classification using attention-based lstm. IEEE/ACM Trans Audio Speech Lang Process 2019;27(11):1675–85.
[42] Zhao Z, Li Q, Zhang Z, Cummins N, Wang H, Tao J, et al. Combining a parallel 2d cnn with a self-attention dilated residual network for ctc-based discrete speech emotion recognition. Neural Netw 2021;141:52–60.
[43] Liang R, Kong F, Xie Y, Tang G, Cheng J. Real-time speech enhancement algorithm based on attention lstm. IEEE Access 2020;8:48464–76.
[44] Li S, Xing X, Fan W, Cai B, Fordson P, Xu X. Spatiotemporal and frequential cascaded attention networks for speech emotion recognition. Neurocomputing 2021;448:238–48.
[45] Zhang S, Yang Y, Chen C, Zhang X, Leng Q, Zhao X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: a systematic review of recent advancements and future prospects. Expert Syst Appl 2023:121692.

[46] Fan H, Zhang X, Xu Y, Fang J, Zhang S, Zhao X, et al. Transformer-based multimodal feature enhancement networks for multimodal depression detection integrating video, audio and remote photoplethysmograph signals. Inf Fusion 2023:102161.
[47] Zhang S, Liu R, Yang Y, Zhao X, Yu J. Unsupervised domain adaptation integrating transformer and mutual information for cross-corpus speech emotion recognition. In: Proceedings of the 30th ACM international conference on multimedia; 2022. p. 120–9.
[48] Singh P, Saha G, Sahidullah M. Non-linear frequency warping using constant-q transformation for speech emotion recognition. In: 2021 international conference on computer communication and informatics (ICCCI). IEEE; 2021. p. 1–6.
[49] Kim S, Lee S-P. A bilstm–transformer and 2d cnn architecture for emotion recognition from speech. Electronics 2023;12(19):4034.
[50] Luna-Jiménez C, Griol D, Callejas Z, Kleinlein R, Montero JM, Fernández-Martínez F. Multimodal emotion recognition on ravdess dataset using transfer learning. Sensors 2021;21(22):7665.
[51] Mustaqeem, Kwon S. Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network. Int J Intell Syst 2021;36(9):5116–35.
[52] Alnuaim AA, Zakariah M, Shukla PK, Alhadlaq A, Hatamleh WA, Tarazi H, et al. Human-computer interaction for recognizing speech emotions using multilayer perceptron classifier. J Healthc Eng 2022;2022.
[53] Xu M, Zhang F, Zhang W. Head fusion: improving the accuracy and robustness of speech emotion recognition on the iemocap and ravdess dataset. IEEE Access 2021;9:74539–49.
[54] Mekruksavanich S, Jitpattanakul A, Hnoohom N. Negative emotion recognition using deep learning for Thai language. In: 2020 joint international conference on digital arts, media and technology with ECTI northern section conference on electrical, electronics, computer and telecommunications engineering (ECTI DAMT & NCON). IEEE; 2020. p. 71–4.
[55] Krishnan PT, Raj AN Joseph, Rajangam V. Emotion classification from speech signal based on empirical mode decomposition and non-linear features: speech emotion recognition. Complex Intell Syst 2021;7:1919–34.
[56] Feng T, Hashemi H, Annavaram M, Narayanan SS. Enhancing privacy through domain adaptive noise injection for speech emotion recognition. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2022. p. 7702–6.
[57] Huang J, Tao J, Liu B, Lian Z. Learning utterance-level representations with label smoothing for speech emotion recognition. In: Interspeech; 2020. p. 4079–83.
[58] Aggarwal A, Srivastava A, Agarwal A, Chahal N, Singh D, Alnuaim AA, et al. Two-way feature extraction for speech emotion recognition using deep learning. Sensors 2022;22(6):2378.
