
Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR872

Enhancements in Immediate Speech Emotion Detection: Harnessing Prosodic and Spectral Characteristics

Zewar Shah¹; Shan Zhiyong*¹; Adnan²
¹College of Information Science and Technology, Donghua University, Shanghai 201620, P.R. China
²School of Computer Science and Technology, Donghua University, Shanghai 201620, China

Abstract:- Speech is essential to human communication for expressing and understanding feelings. Emotional speech processing faces challenges with expert data sampling, dataset organization, and computational complexity in large-scale analysis. This study aims to reduce data redundancy and high dimensionality by introducing a new speech emotion recognition system. The system employs a Diffusion Map to reduce dimensionality and includes Decision Tree and K-Nearest Neighbors (KNN) ensemble classifiers. These strategies are suggested to increase voice emotion recognition accuracy. Speech emotion recognition is gaining popularity in affective computing for use in medicine, industry, and academia. This project aims to provide an efficient and robust real-time emotion identification framework. In order to identify emotions using supervised machine learning models, this work makes use of paralinguistic features such as intensity, pitch, and MFCC. In order to classify data, the experimental analysis integrates prosodic and spectral information utilizing methods such as Random Forest, Multilayer Perceptron, SVM, KNN, and Gaussian Naïve Bayes. Fast training times make these machine learning models well suited to real-time applications. SVM and MLP have the highest accuracy, at 70.86% and 79.52%, respectively. Comparisons to benchmarks show significant improvements over earlier models.

Keywords:- Feature Extraction, KNN, Speech Emotions, Diffusion Map, MFCC, and Feature Engineering.

I. INTRODUCTION

Speech is one of the most basic and pervasive forms of human communication, capable of expressing and conveying a broad range of emotions. Emotional speech processing is an interdisciplinary area of study that tries to understand and analyze the subtle emotional undertones in spoken language [1]. This task calls for a wide range of computational approaches and resources, all with the common goal of extracting and analyzing emotional indicators embedded in voice signals. Emotional voice processing has numerous potential uses in many fields, such as virtual assistants, speech recognition, customer service, mental health diagnostics, education, and human-robot interaction. Researchers have devoted a great deal of time and energy over the last 20 years to deciphering the complex signals that speech uses to convey emotion [2]. In response to this need, several machine learning models have emerged with the express purpose of extracting and categorizing emotional content from audio recordings. At its core, our effort revolves around feature extraction, the first and most important stage in classifying emotional states in spoken words. The significance of feature engineering in classification is shown by the fact that the effectiveness of speech identification systems depends on the quantity and quality of these derived features [3]. One of the main obstacles to automated emotion recognition in speech is the complex nature of acoustic feature extraction. Regardless, numerous studies have tried to overcome these obstacles by developing speech emotion recognition systems that use machine learning paradigms, all with the ultimate aim of better comprehending human emotions [4]. The intricate nature of human communication makes it extremely difficult to incorporate speech emotion detection into larger computing frameworks. In addition to being able to recognize and understand human emotions in speech, intelligent computer systems should be able to mimic the subtle behavioral reactions seen in real-life interactions. Consequently, there is a plethora of machine learning techniques that may be employed to tackle the complex problem of voice emotion detection and categorization. The overarching goal of this research is to advance our understanding of human emotional expression in computational linguistics by delving into the many tactics and methodologies used to build strong speech emotion detection mechanisms [5]. Precise and computationally efficient classification is essential for the effectiveness of speech emotion recognition (SER) systems. Developing features involves using strict selection criteria, estimating the number of features, and implementing dependable extraction processes, all of which are crucial for this endeavor [6].

Developers of SER models should carefully consider the feature development procedure, as it impacts temporal complexity and classification accuracy. Identifying reliable emotional cues in datasets necessitates access to extensive and diverse repositories that encompass a broad spectrum of emotions and languages [7]. The main objectives of speech emotion recognition (SER) [8] are to (1) define the semantic basis of emotions and (2) determine the speech signal components or parameters that are important for emotional expression. Research in this area has highlighted the significance of feature aspects such as quality metrics, prosodic qualities, spectral features, and voice-related variables in determining the efficiency of SER systems [9-11]. Developing effective SER frameworks requires more than just robust feature selection; it also entails utilizing sophisticated classification methods [12]. Machine learning techniques for feature engineering and categorization have gained increased attention from the academic community. The three most commonly utilized classifiers in SER [13] are K-Nearest Neighbor (KNN), Decision Trees (DT), and Support Vector Machine (SVM), each offering distinct advantages and considerations during model development [14]. Rosalind Picard's pioneering research in affective computing had a significant impact on Speech Emotion Recognition (SER), which is crucial for human-machine interaction (HMI). The utilization of physical cues, such as facial expressions, speech, and body posture, enhances the field of human-computer interaction. Two potential applications include the detection of driver drowsiness and the dissemination of public education. This study proposes a system that integrates ensemble classifiers with dimension reduction, specifically emphasizing paralinguistic attributes such as intensity and pitch for speech emotion recognition (SER). The emotion categorization process uses the Ryerson Audio-Visual Database of Emotional Speech and Song in conjunction with machine learning techniques such as Support Vector Machine and Gaussian Naïve Bayes. This study emphasizes the need for carefully choosing distinguishing characteristics and enhancing model performance by tuning hyperparameters, and it underscores the crucial role played by feature selection and classification methodologies. The study aims to enhance the accuracy and efficiency of emotion identification in speech utterances by addressing the issues of redundancy and high dimensionality associated with feature extraction. It accomplishes these objectives by implementing a non-linear approach for feature extraction using a principled technique based on learning models that utilize Diffusion Maps. The proposed method, utilizing Mel Frequency Cepstral Coefficients (MFCC) for feature extraction together with techniques such as batch normalization and z-score standardization, achieves 3.1% higher classification accuracy than conventional genetic algorithms. The rest of this paper is structured as follows: Section 2 provides an extensive literature analysis, Section 3 outlines the methodology and data sources, Section 4 presents the experimental results and discussion, and Section 5 concludes by synthesizing the findings and proposing directions for future study.

II. LITERATURE REVIEW

When it comes to Speech Emotion Recognition (SER), optimization solutions are crucial, especially for feature extraction and selection. The key goal is to find the best solutions as quickly as possible without sacrificing accuracy. Specifically, Yoon et al. sought to determine the most important characteristics in speech data for emotion detection and proposed a particle swarm optimization method for feature selection in SER. On the other hand, another study used deep convolutional neural networks to extract spectrogram properties; nevertheless, its performance was limited because the solutions were optimized without taking additional contextual aspects into account.

Contrarily, Kanwal et al. suggested using DBSCAN and PCA to improve the precision of speech emotion recognition, specifically for the diverse types of emotions described in the INTERSPEECH 2010 feature set. Their strategy outperformed the current state-of-the-art approaches in terms of accuracy and recall. On the other hand, problems surfaced throughout testing, such as the use of inappropriate feature sets and inefficient feature engineering techniques, which increased computational costs and resulted in less-than-ideal convergence rates.

Previous studies have attempted to address the time complexity of SER optimization; for example, [15] aimed to decrease processing time by using a smaller feature set and a one-class-in-one neural network, a type of specialized neural network. Optimal emotion prediction using a subset of Linear Predictive Coding (LPC) and Delta LPC parameters was the basis of this strategy, which sought to simplify computational operations. Still, it has been observed that the reliance of these approaches on indirect comparisons increases processing time and requires large datasets to validate results effectively [16]. To conclude, SER optimization solutions offer great potential for improving classification accuracy and computational efficiency. However, to guarantee dependable performance in real-world applications, it is crucial to address inherent limitations such as biases in feature selection and indirect comparisons. A careful approach to feature comparison is needed to avoid high-dimensional speech emotion feature datasets that are prone to feature redundancy, owing to the complex relationships between individual characteristics in speech data. Feature data with too many dimensions lengthens training periods and puts a load on computational resources. A key to overcoming these obstacles is zeroing in on the most important aspects of speech that convey different emotions. Feature dimensionality reduction techniques have been the subject of several investigations [17], but large-scale access to professionally labeled datasets is required for the development of effective classification systems.


The authors have come up with a new extraction and reduction method to address these issues; this method has potential in many other areas, including pattern recognition and speech analysis. The Diffusion Map (DM) strategy, which uses Mel Frequency Cepstral Coefficients (MFCC) in conjunction with dimensionality reduction and exhaustive feature selection methodologies, provides a new way to improve the extraction of speech emotion data. Applying the benchmark EMO-DB dataset, we test DM's effectiveness across seven different emotion classes and find encouraging numerical results that point to its superiority. Some argue that a large variety of features is necessary to capture the complex range of emotions expressed in speech data, while others argue that a smaller set of features is more appropriate for this purpose. Feature extraction and selection are promising ways to get around data dimensionality problems, and adding independent data projection techniques opens up even more possibilities for improving classification accuracy. Evaluation of ML models for speaker-independent emotion recognition from voice data highlights the importance of strong feature engineering techniques for pushing the boundaries of speech emotion recognition. The effectiveness of a system in the field of bimodal Speech Emotion Recognition (SER) is heavily dependent on the input features. Criteria controlling feature extraction, whether using model-driven or data-driven learning approaches, are inextricably bound up with the categorization process. To overcome the difficulties caused by different language settings, a new method has been suggested that uses cross-corpus multilingual ensemble learning supported by a majority voting strategy [3]. Applying this strategy to three separate ML algorithms, Sequential Minimal Optimization (SMO), Random Forest (RF), and J48, proved to significantly boost weighted accuracy (WA) and unweighted accuracy (UA) on the Urdu, EMO-DB, and SAVEE (Surrey Audio-Visual Expressed Emotion) datasets.

The availability of corpora covering several languages in naturalistic contexts is still low, despite progress [2]. In an effort to decipher emotional signals in audio signals and frequency parameterizations, researchers have investigated various feature extraction methodologies, such as Mel-Gammatone-Frequency Cepstral Coefficients (MGFCC), Gammatone-Frequency Cepstral Coefficients (GFCC), and Mel Frequency Cepstral Coefficients (MFCC). Although there are some limitations, such as the dataset's categorical representation of emotional classes and difficulties in speech interpretation and data annotation, the results of analyses performed on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset were promising [4]. Innovative spectral feature modulation methods have been suggested to overcome these restrictions [18,7]. These methods include Random Forest (RF) classifiers, Support Vector Machine (SVM) combinations with Single Frequency Filtering (SFF), and Linear Discriminant scoring systems. Utilizing SFF to extract the amplitude envelope of the speech signal in conjunction with classifiers like SVM and RF has significantly improved emotion detection rates on multiple datasets, such as EMO-DB, FAU-AIBO, and IEMOCAP [4][2]. But obstacles remain, especially when it comes to identifying happy emotions, because there are not enough linguistic and auditory cues [4]. The novel Bidirectional Long Short-Term Memory with Directional Self-Attention (BLSTM-DSA) framework addresses these concerns. Incorporating bi-directional mechanisms into encoding and decoding processes may improve emotion identification systems in bimodal environments and diversify information representations. Multi-class SVM, Random Forest (RF), and single-layered Adaptive Boosting have been applied to the SAVEE and Polish datasets to improve Speech Emotion Recognition (SER); random forest models using smaller feature sets reduced processing time and attained testing accuracies of 75.71% and 87.91% on these datasets [11]. Further SER designs have been explored using 3D Mel spectrograms, a Parallel AG-TFNN, and a 2D Attention Gated Tensor Neural Network architecture. These architectures, augmented with LSTM+CNN baselines using the Adam optimizer, performed differently across emotion classes on the EMO-DB and IEMOCAP datasets in speaker-independent mode [7]. Both datasets showed promising results, but independent testing highlighted issues with the dimensionality and redundancy of the 2D and 3D representations, requiring further parameter adjustments [3]. To address the high dimensionality and redundancy of emotion feature extraction, Diffusion Map-based learning algorithms were proposed. Non-linear feature extraction allowed this strategy to outperform random-selection genetic algorithms. A diffusion map-based non-linear technique boosted accuracy by 3.1%, achieving an impressive 89.01% accuracy rate when extracting speaker-independent experimental features from the EMO-DB dataset [3]. This approach lays the groundwork for extracting complex emotional elements while eliminating redundancy and high dimensionality, and with rigorous testing and improvement it could advance Speech Emotion Recognition in affective computing, HCI, and clinical diagnostics. Using a very deep CNN to create extra feature maps for training acoustic models, the author in [23] discovered that the enriched data aids in the development of strong voice recognition systems. Using a blend of Gaussian distributions as the random source, the author in [24] synthesized feature vectors through an AAE. Although synthetic samples have the potential to enhance classification performance, they often adhere to a random distribution instead of the actual data distribution. Their approach for creating synthetic feature vectors was also built on conditional generative adversarial networks (cGANs). The cGAN has been trained using a number of training methods, including initializing the generator with the weights from the AAE decoder [25] and updating the generator's weights multiple times before updating the discriminator in each training epoch. Finding an optimal trade-off between the generator's and discriminator's capabilities is a big challenge when training GANs. Applying dynamic alternation training can help overcome this challenge: instead of fixing the number of training steps between the discriminator and generator, it is changed dynamically. Our suggested network's goal is to enhance learning stability by making it easier for the generator to learn the target distribution, rather than optimizing the number of training epochs.


Our suggested network's generator, in particular, doesn't learn a fixed distribution but rather the distribution of a latent representation generated by a concurrently learned encoder.

III. MATERIAL & METHOD

This section provides an extensive synopsis of the suggested approach, covering the architecture of feature extraction with a DM included for fitness assessment. In addition, the dataset, the classification algorithms, and the application of the DM are discussed in depth, with Figure 1 serving to clarify the overall design.

Fig 1. Developed Methodology

A. Extraction of Features
The proposed framework integrates characteristics acquired from several sources, encompassing intensity, pitch, LFCCs, MFCCs, formants, and spectral centroids. Non-linear pitch perception is captured by MFCCs through the representation of acoustic data on the Mel scale. The comprehension of high-frequency vocal tract excitation can be enhanced by the utilization of Linear Frequency Cepstral Coefficients (LFCCs), whose band-pass filters vary on a linear scale. Resonances in the vocal tract are represented by formants, while spectral centroids give the weighted average of sound frequencies. Pitch serves as a representation of the fundamental frequency, whereas intensity denotes the magnitude of sound power per unit area. In order to account for the dynamics of speech, velocity and acceleration characteristics are included for MFCCs, LFCCs, intensity, and pitch. The feature vector, with a dimensionality of 652, is obtained by carrying out the feature extraction process with the Librosa and Parselmouth packages; these tools employ statistical methods to combine frame-level feature vectors over various signal durations.
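To make this pipeline concrete, the following is a minimal sketch of how such frame-level features could be extracted and aggregated with Librosa and Parselmouth. The sampling rate, number of coefficients, and the mean/standard-deviation statistics are illustrative assumptions; the paper's full 652-dimensional vector additionally includes LFCCs, formants, and further velocity/acceleration terms.

```python
import numpy as np
import librosa
import parselmouth

def extract_features(path, sr=16000, n_mfcc=13):
    """Sketch: prosodic and spectral features aggregated over one utterance."""
    y, sr = librosa.load(path, sr=sr)

    # Spectral features: MFCCs plus delta (velocity) and delta-delta (acceleration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d_mfcc = librosa.feature.delta(mfcc)
    dd_mfcc = librosa.feature.delta(mfcc, order=2)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

    # Prosodic features via Praat (Parselmouth): pitch (F0) and intensity
    snd = parselmouth.Sound(path)
    f0 = snd.to_pitch().selected_array['frequency']   # 0 for unvoiced frames
    intensity = snd.to_intensity().values.flatten()

    # Aggregate frame-level trajectories into fixed-length statistics
    def stats(x):
        x = np.atleast_2d(x)
        return np.concatenate([x.mean(axis=1), x.std(axis=1)])

    return np.concatenate([stats(mfcc), stats(d_mfcc), stats(dd_mfcc),
                           stats(centroid), stats(f0), stats(intensity)])
```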
B. Selecting Features from Subsets
The architecture of machine learning models strongly depends on the selection of feature subsets to enhance classifier performance and reduce dimensionality. The Filter, Wrapper, and Embedded approaches are among the most often employed techniques. Wrapper methods involve the iterative evaluation of subsets, which can be computationally expensive; in contrast, filter approaches are independent of the classifier. Embedded strategies integrate feature selection into the model itself to address the limitations of both filter and wrapper methods. Techniques employed in recent research on speech emotion recognition include swarm-based optimization, Boruta, principal component analysis (PCA), and forward and backward selection, among others. This work utilizes a filter-based technique that ranks features by their statistical dependency on the target, measured through mutual information, owing to its non-model-specific nature, simplicity, and efficiency.
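Such a mutual-information filter can be expressed in a few lines with scikit-learn. The sketch below assumes a prepared feature matrix X (utterances × features) and an emotion label vector y; the 20% percentile mirrors the setting reported later in the Results section and is otherwise an illustrative choice.

```python
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# X: (n_samples, n_features) array of extracted speech features (assumed prepared)
# y: corresponding emotion labels
selector = SelectPercentile(score_func=mutual_info_classif, percentile=20)
X_selected = selector.fit_transform(X, y)          # reduced feature matrix
kept_indices = selector.get_support(indices=True)  # which columns survived the filter
```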
The methodology reduces the initial collection of 1582 characteristics to a more manageable set of 90 features, which are crucial for classification, by computing a family of embeddings in Euclidean space drawn from the eigenvectors and eigenvalues of a diffusion operator. While research into the specific features that promote efficient clustering of the data into different categories is continuing, these extracted features are crucial for training models to determine the desired output labels. An opportunity for additional optimization and refinement within the suggested methodology is to test a large number of different features and incorporate them into a standardized feature set. Fix $\alpha \in \mathbb{R}$ and a rotation-invariant kernel

$$k_\varepsilon(x, y) = h\!\left(\frac{\lVert x - y \rVert^{2}}{\varepsilon}\right)$$

Now suppose that

$$S_\varepsilon(x) = \int k_\varepsilon(x, y)\, s(y)\, dy$$

The new kernel is

$$k_\varepsilon^{\alpha}(x, y) = \frac{k_\varepsilon(x, y)}{S_\varepsilon^{\alpha}(x)\, S_\varepsilon^{\alpha}(y)}$$

Apply the weighted graph Laplacian normalization to this kernel by setting

$$d_\varepsilon(x) = \int k_\varepsilon^{\alpha}(x, y)\, s(y)\, dy$$

which yields the anisotropic transition kernel

$$t_\varepsilon(x, y) = \frac{k_\varepsilon^{\alpha}(x, y)}{d_\varepsilon(x)}$$

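For illustration, this construction can be written out on a finite sample of feature vectors with plain NumPy. The sketch below is an unoptimized rendering of an anisotropic diffusion-map embedding; the bandwidth ε, the exponent α, and the number of retained eigenvectors are arbitrary illustrative values, not settings taken from the paper.

```python
import numpy as np

def diffusion_map(X, eps=1.0, alpha=0.5, n_components=10):
    """Sketch of an anisotropic diffusion-map embedding of the rows of X."""
    # Rotation-invariant Gaussian kernel: k_eps(x, y) = exp(-||x - y||^2 / eps)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / eps)

    # Density normalization: k_eps^alpha = K / (S^alpha outer S^alpha)
    S = K.sum(axis=1)
    K_alpha = K / np.outer(S ** alpha, S ** alpha)

    # Graph-Laplacian style row normalization gives the transition kernel
    d = K_alpha.sum(axis=1)
    T = K_alpha / d[:, None]

    # The leading non-trivial eigenvectors give the low-dimensional embedding
    eigvals, eigvecs = np.linalg.eig(T)
    order = np.argsort(-eigvals.real)
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
    return eigvecs[:, 1:n_components + 1] * eigvals[1:n_components + 1]
```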
C. Classification Method (Naïve Bayes)
The penultimate stage, after feature extraction and subset selection, is to use five machine learning models to categorize speech into eight different emotions. To evaluate the models' accuracy in both overall and class-wise terms, they are evaluated individually using stratified k-fold cross-validation (k = 10). For its implementation, the study makes use of the scikit-learn package. Assuming conditional independence among features, the Gaussian Naïve Bayes classifier, one of the models, depends on Bayes' theorem. Although it is computationally simple, the independence assumption leads to suboptimal performance because it is rarely realistic. Stratified tenfold cross-validation with all 652 features yielded a classifier accuracy of 43.48%. The accuracy rate was 44.98% when just 177 features were considered.
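As a sketch of this evaluation protocol (the same stratified ten-fold procedure is reused for the remaining classifiers), again assuming a prepared feature matrix X and label vector y; the shuffle and random seed are illustrative choices:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stratified 10-fold cross-validation of the Gaussian Naive Bayes baseline
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```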


Table 1. Selection of features and feature subsets

Feature Description | Total Features | Selected Features
Intensity | 18 | 18
Pitch | 18 | 18
LFCCs | 234 | 47
Spectral Centroids | 4 | 4
MFCCs | 360 | 72
D. Classification Method (KNN)
A simple machine learning method, k-Nearest Neighbors (k-NN) stores all training instances without immediately learning from them, following a lazy-learner approach. To find the k points closest to the test case in the feature space, the classification algorithm uses the Euclidean distance. A grid search determines the ideal value of k, which can range from 1 to 15. With k set to 1, stratified tenfold cross-validation attains an overall accuracy of 53.25%. The classifier distributes weight equally among all neighbors. The overall accuracy improves to 65.17% when a subset of 177 features is used with all hyperparameters kept at their original values; this enhancement is noteworthy.
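A grid search of this kind might be written as follows; the uniform neighbor weighting and the 1-15 range follow the description above, while the remaining settings are illustrative assumptions.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Search k in 1..15 with uniform neighbor weights, scored by stratified 10-fold CV
param_grid = {'n_neighbors': list(range(1, 16))}
search = GridSearchCV(KNeighborsClassifier(weights='uniform'),
                      param_grid,
                      cv=StratifiedKFold(n_splits=10),
                      scoring='accuracy')
search.fit(X, y)
print(search.best_params_, search.best_score_)
```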

E. Classification Method (RF)
Ensemble learning is a technique that combines predictions from numerous weaker models in order to create a resilient learner. Random forests, which consist of ensembles of decision trees, use majority voting to determine the final predictions. Bootstrapping is employed to generate the individual trees, while grid search is used to choose the split criterion (Gini impurity or entropy). The performance of random forests generally improves as the number of decision trees increases, particularly with Gini impurity, which indicates a higher level of split quality. With 100 trees and Gini impurity, the RF classifier attains an overall accuracy of 61.84%. The classifier's accuracy increases to 64.06% when it is trained on the selected subset of 177 features.
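A corresponding sketch with scikit-learn, using the 100-tree, Gini-impurity configuration described above (the random seed and other settings are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 100 bootstrapped trees with Gini impurity as the split criterion
rf = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=42)
rf_scores = cross_val_score(rf, X, y, cv=StratifiedKFold(n_splits=10))
```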
F. Classification Method (SVM)
A strong binary classifier, the Support Vector Machine (SVM) maximizes the margin by finding a hyperplane between two classes. Using the kernel trick, SVMs build this hyperplane in a higher-dimensional space, where the training points become linearly separable given an appropriate kernel function. This research makes use of SVMs with Gaussian (RBF) kernels. The optimal values for the penalty term (C) and the kernel coefficient (γ) were determined to be 100 and 0.0029, respectively, by employing stratified tenfold cross-validation and grid search. Using these hyperparameters, the model achieves 67.57% accuracy with all 652 features and 70.86% accuracy with the subset of 177 features, demonstrating a significant boost in performance.
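The reported configuration can be sketched directly with scikit-learn's SVC, reusing the grid-searched values C = 100 and γ = 0.0029:

```python
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# RBF-kernel SVM with the grid-searched hyperparameters reported above
svm = SVC(kernel='rbf', C=100, gamma=0.0029)
svm_scores = cross_val_score(svm, X, y, cv=StratifiedKFold(n_splits=10))
```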
G. Classification Method (MLP)
A multilayer perceptron (MLP) is made up of layers of neurons, each represented by a node. Each layer computes its output as a linear combination of its inputs, with nonlinearity introduced by an activation function (σ); sigmoid, ReLU, and tanh are among the most common activation functions. The two-hidden-layer MLP used in this research was trained with the Adam optimizer for twenty iterations, with categorical cross-entropy serving as the loss function. When all features are used under stratified tenfold cross-validation, the overall accuracy is 73.76%. Performance improves to 79.62% accuracy when the subset of 177 selected features is used.
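A rough scikit-learn equivalent of this setup is sketched below. The hidden-layer widths are illustrative assumptions, since the paper does not state them, and scikit-learn's MLPClassifier uses the (categorical) cross-entropy loss internally.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Two hidden layers (widths assumed), Adam optimizer, 20 training iterations
mlp = MLPClassifier(hidden_layer_sizes=(256, 128), solver='adam', max_iter=20)
mlp_scores = cross_val_score(mlp, X, y, cv=StratifiedKFold(n_splits=10))
```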
IV. RESULTS

This part encompasses an overview of the implementation procedure, as well as a comprehensive analysis of the study's findings and outcomes. The dataset utilized in this study is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). It comprises a total of 1440 audio files, encompassing the expressions of 24 performers across 8 distinct emotions, each characterized by two intensity levels. The Librosa library in Python is utilized to read the data, followed by the extraction of features using Librosa and Parselmouth. Additionally, as part of the data preparation process, the data is scaled to conform to a standard normal distribution. Feature selection, which relies on mutual information, retains the top 20% of LFCC and MFCC features. The Synthetic Minority Oversampling Technique (SMOTE) was employed to address the class imbalance, resulting in a total of 1536 audio files, with 192 files matching each emotion. In the process of training and validating the models, stratified k-fold cross-validation is employed to ensure that the subsets utilized for training and evaluation accurately reflect the full range of emotions. In this study, the classification models are assessed utilizing a stratified tenfold cross-validation methodology.
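These preparation steps can be sketched as follows, assuming the imbalanced-learn package for SMOTE; the random seed is an illustrative choice.

```python
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Scale each feature to zero mean / unit variance (standard normal), then
# oversample minority emotion classes so every class has the same count
X_scaled = StandardScaler().fit_transform(X)
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X_scaled, y)
```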


This section provides a summary of the implementation process, along with a thorough examination of the study's results and consequences. This study employs the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) as the dataset. The collection consists of 1440 audio files, which include the vocalizations of 24 actors representing 8 different emotions, each with two levels of intensity.

Fig 2. SVM Confusion Matrix

Fig 3. MLP Confusion Matrix

Fig 4. Different Emotion Classifier Performance

Fig 5. MLP Model ROC Plots

Fig 6. SVM Model ROC Plots

The data is read using the Librosa library in Python, and subsequently, features are extracted using Librosa and Parselmouth. Furthermore, as part of data preparation, the data undergoes scaling to ensure its conformity to a standard normal distribution. Feature selection retains the top 20% of MFCC and LFCC features based on mutual information. To rectify the class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was utilized, yielding a total of 1536 audio files, with 192 files corresponding to each emotion.


Stratified k-fold cross-validation is used during the training and validation of the models to verify that the subsets used for training and evaluation accurately represent all emotions. This study employs a stratified tenfold cross-validation methodology to evaluate the categorization models.

Table 2. Developed Methodology Comparison with Existing State of the Art.


Ref | Deployed Model (with article reference number) | Accuracy | Dataset
[1] | Features optimization with DGA+PCA [1] | 77.49% | EMO-DB, RAVDESS
[4] | SMO, RF, J48, Ensemble Learning [4] | 69.31% | URDU (SMOEL), EMO-DB (SMOEL), SAVEE (RFEL), EMOVO (SMOEL)
[5] | TLFMRF [5] | 68.89% | CASIA Corpus, Emo-DB
[7] | T-FNN, 2D and 3D (AG-TFNN), Parallel AG-TFNN [7] | 70.09% | Emo-DB, IEMOCAP
[19] | SFF and nonlinear energy operator, SVM, RF [19] | - | EMO-DB, FAU-AIBO, IEMOCAP
[6] | iMEMD, SVM and k-NN [6] | 71.70% | SEED, DEAP
[8] | D-MFCC [8] | 86.96% | CASIA
[14] | EMD and its second-order difference plots (SODP), SVM [14] | - | DEAP
Developed Methodology | DM+DT+KNN [Developed Methodology] | 89.01% | EMO-DB

For speaker-independent fitness evaluation, our research presents a new feature extraction architecture that combines a dimension reduction mechanism (DM) with Decision Tree (DT) and K-Nearest Neighbor (KNN) classifiers. This method outperforms methods that use random selection to evaluate fitness, with an emphasis on a non-linear, diffusion-map-based approach to feature extraction. To make emotion recognition more reliable, we use the Mel Frequency Cepstral Coefficients (MFCC) feature selection approach to extract features from audio signals and frequency parameterizations. Results from our speaker-independent studies showed an average accuracy of 88.82%, with a range of 86.93% to 89.01%; compared to earlier results on the EMO-DB dataset, our suggested methodology showed a significant 3.1% improvement in accuracy. In addition, Table 2 presents extensive findings, including emotion-wise accuracy for speaker-independent scenarios, showing how well the suggested approach works across different emotional categories.

V. DISCUSSION

Our SER framework is compared to other SER frameworks in the literature, including SVM and MLP models, to see how well it performs. Table 2 compares the SVM and MLP models' overall accuracy with that of existing SER models published for the RAVDESS dataset in the literature. Tenfold stratified cross-validation was employed to compile and present the findings of our study. A study in [20] employed support vector machines (SVMs) with different kernel functions to classify data using characteristics obtained from the continuous wavelet transform (CWT); the maximum reported accuracy for the RAVDESS dataset was 60.1%. The author in [21] attained an accuracy of 64.48% by utilizing spectrograms in combination with a classifier that relies on deep neural networks. To train a Logistic Model Tree, the authors in [22] use a 13-dimensional feature vector made up of MFCCs; a 70% accuracy rate is their claimed maximum overall performance. Keep in mind that MFCCs cannot pick up on every nuance of human emotion in speech. The spectral features, in particular the delta and delta-delta coefficients, determine the exact classification of emotions. In comparison to the previously mentioned models on the RAVDESS dataset, our suggested models perform better. Because a human observer can only achieve a 67% accuracy rate on the RAVDESS dataset, it poses a serious challenge to emotion detection researchers [23]. Researchers used a bagged ensemble of SVMs with MFCCs and spectral centroids in their work [24]; they achieved an impressive performance of 75.69% on the RAVDESS dataset using this approach and a train-test split ratio of 90:10. As a point of comparison, our best result with MLP is 84.96%, and with SVM it is 86.27%. To train a convolutional neural network to determine the emotional state of spoken language, researchers in [25] used a mix of MFCCs and modulation spectral characteristics; they achieved a maximum accuracy rate of 78.10%. It is noteworthy that the Support Vector Machine (SVM) model, which achieves an accuracy of 77.86%, is a favorable choice because of its comparatively shorter training time, while remaining a strong contender among the recommended classification strategies. Consequently, the SER framework employed in this study demonstrates superior performance compared to previously reported frameworks in terms of overall accuracy. The SVM system we propose is computationally efficient and highly dependable, as it keeps just 177 features after feature subset selection. The recommended Support Vector Machine (SVM) framework is highly suitable for inclusion in real-time speech emotion recognition applications due to its shorter training times in comparison to its Multilayer Perceptron (MLP) counterpart.


VI. CONCLUSION

The study's Speech Emotion Recognition model makes use of machine learning algorithms such as k-Nearest Neighbours, Random Forest, Support Vector Machine (SVM), and Multilayer Perceptron (MLP). Analysis of spectral and prosodic features leads to the selection of 177 features out of 652 overall, based on the information they share with the target attribute. Filter-based feature subset selection techniques are utilised in real-time applications because of their computational efficiency. Results demonstrate that the MLP and SVM classifiers both achieve better validation accuracy than the others, while SVM offers shorter training durations. Results from statistical tests show that SVM outperforms MLP in real-time scenarios, making it a good substitute for MLP. Adding audio quality features and adjusting the MFCC and LFCC coefficients are potential areas for future study to improve system performance. Further research into gender-based SER systems and alternative feature selection algorithms might lead to useful improvements.

Feature engineering and classification are the main phases in speech emotion recognition (SER). Our research introduces a novel dimensionality and feature reduction method utilising the Ryerson Audio-Visual Database of Emotional Speech and Song dataset, outperforming baseline methods. The SER model that uses the suggested dimensionality reduction technique is compared to various state-of-the-art studies for recall and accuracy; Table 2 shows the results. By incorporating numerous sets of characteristics at the data classification level, the SER model that follows the suggested reduction and extraction strategy performs comparably. Fusion weights help us understand how much each feature set affects SER performance. In earlier research without gender consideration, standalone KNN classifiers have not consistently given optimal results. Ensemble approaches, especially AdaBoost, perform better even when they employ only a subset of the training samples or attributes, which shows the value of feature subsets in new ensemble classification methods. Our preliminary Emo-Berlin dataset results suggested that our ensemble approaches and state-of-the-art results may be improved. A binary-based classifier that displays confusion matrices may enhance SER accuracy, and we anticipate that exploring these alternatives may improve SER approaches and boost emotion recognition performance. In recognizing emotions from speech patterns, ensemble, decision tree (DT), and K-Nearest Neighbor (KNN) classifiers combined with diffusion map-based machine learning have shown promising results. Diffusion maps can help us understand complex speech feature interactions and improve emotion recognition algorithms. Finally, the simple and effective DT and KNN classifiers make it easy to categorize emotions using extracted voice data. The diffusion map-based technique captures data structure and reduces feature space dimensionality. The nearest neighbor network integrates easily with pairwise-based classification models and improves computing efficiency, especially with high-dimensional data. Redundancy can be reduced by removing features with negligible effect on accuracy, through dimensionality reduction and feature selection. Different people can react differently to the same speech, especially in real-life situations without a script, making emotion identification harder. A robust Speech Emotion Recognition (SER) system with diverse datasets to train models and increase accuracy is needed to handle this complexity. Future research employing the diffusion map-based method and neural-based approaches could increase diagnostic accuracy and efficiency by analysing clinical images and facial expressions to learn about health issues and diseases.
REFERENCES

[1]. S. Kanwal and S. Asghar, “Speech Emotion Recognition using Clustering Based GA-Optimized Feature Set”, IEEE Access, vol. 9, pp. 125830-125842, 2021.
[2]. R. Hidayat, “Frequency Domain Analysis of MFCC Feature Extraction in Children’s Speech Recognition System”, JURNAL INFOTEL (Informatics, Telecommunication, and Electronics), vol. 14, no. 1, pp. 30-36, 2022.
[3]. S. Yildirim, Y. Kaya and F. Kılıç, “A Modified Feature Selection Method Based on Metaheuristic Algorithms for Speech Emotion Recognition”, Applied Acoustics, vol. 173, 107721, 2021.
[4]. M. Swain, A. Routray and P. Kabisatpathy, “Databases, Features and Classifiers for Speech Emotion Recognition: A Review”, International Journal of Speech Technology, vol. 21, pp. 93-120, 2018.
[5]. Z. Zhang, “Mechanics of Human Voice Production and Control”, The Journal of the Acoustical Society of America, vol. 140, no. 4, pp. 2614-2635, 2016.
[6]. M. B. Akçay and K. Oğuz, “Speech Emotion Recognition: Emotional Models, Databases, Features, Preprocessing Methods, Supporting Modalities, and Classifiers”, Speech Communication, vol. 116, pp. 56-76, 2020.
[7]. N. Salankar, P. Mishra and L. Garg, “Emotion Recognition from EEG Signals using Empirical Mode Decomposition and Second-Order Difference Plot”, Biomedical Signal Processing and Control, vol. 65, 102389, 2021.
[8]. R. Thirumuru, K. Gurugubelli and A. K. Vuppala, “Novel Feature Representation using Single Frequency Filtering and Nonlinear Energy Operator for Speech Emotion Recognition”, Digital Signal Processing, vol. 120, 103293, 2022.
[9]. C. K. Yogesh, M. Hariharan, R. Ngadiran, A. H. Adom, S. Yaacob, C. Berkai and K. Polat, “A New Hybrid PSO Assisted Biogeography-Based Optimization for Emotion and Stress Recognition from Speech Signal”, Expert Systems with Applications, vol. 69, pp. 149-158, 2017.
[10]. R. B. Lanjewar, S. Mathurkar and N. Patel, “Implementation and Comparison of Speech Emotion Recognition System using Gaussian Mixture Model (GMM) and K-Nearest Neighbor (K-NN) Techniques”, Procedia Computer Science, vol. 49, pp. 50-57, 2015.


[11]. C. C. Lee, E. Mower, C. Busso, S. Lee and S. Narayanan, “Emotion Recognition using a Hierarchical Binary Decision Tree Approach”, Speech Communication, vol. 53, no. 9-10, pp. 1162-1171, 2011.
[12]. K. S. Rao, S. G. Koolagudi and R. R. Vempada, “Emotion Recognition from Speech using Global and Local Prosodic Features”, International Journal of Speech Technology, vol. 16, pp. 143-160, 2013.
[13]. S. Prasomphan and S. Doungwichain, “Detecting Human Emotions in a Large Size of Database by using Ensemble Classification Model”, Mobile Networks and Applications, vol. 23, pp. 1097-1102, 2018.
[14]. M. B. Mustafa, M. A. M. Yusoof, Z. M. Don and M. Malekzedeh, “Speech Emotion Recognition Research: An Analysis of Research Focus”, International Journal of Speech Technology, vol. 21, pp. 137-156, 2018.
[15]. S. Yoon, S. Byun and K. Jung, “Multimodal Speech Emotion Recognition using Audio and Text”, IEEE Spoken Language Technology Workshop (SLT), 18-21 December 2018, Greece, pp. 112-118.
[16]. E. Bingham and H. Mannila, “Random Projection in Dimensionality Reduction: Applications to Image and Text Data”, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 26-29 August 2001, California, pp. 245-250.
[17]. S. R. Livingstone and F. A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English”, PLoS One, vol. 13, no. 5, e0196391, 2018.
[18]. S. R. Kadiri and P. Alku, “Excitation Features of Speech for Speaker-Specific Emotion Detection”, IEEE Access, vol. 8, pp. 60382-60391, 2020.
[19]. B. Schuller, R. Muller, M. Lang and G. Rigoll, “Speaker Independent Emotion Recognition by Early Fusion of Acoustic and Linguistic Features Within Ensembles”, 9th European Conference on Speech Communication and Technology, 04-08 September 2005, Portugal, pp. 1-4.
[20]. P. Shegokar and P. Sircar, “Continuous Wavelet Transform Based Speech Emotion Recognition”, 2016 10th International Conference on Signal Processing and Communication Systems (ICSPCS), 19 December 2016, pp. 1-8.
[21]. Y. Zeng, H. Mao, D. Peng and Z. Yi, “Spectrogram Based Multi-Task Audio Classification”, Multimedia Tools and Applications, vol. 78, pp. 3705-3722, 2019.
[22]. A. A. Zamil, S. Hasan, S. M. Baki, J. M. Adam and I. Zaman, “Emotion Detection from Speech Signals using Voting Mechanism on Classified Frames”, 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), 10 January 2019, pp. 281-285.
[23]. J. de Lope and M. Graña, “An Ongoing Review of Speech Emotion Recognition”, Neurocomputing, vol. 528, pp. 1-11, 2023.
[24]. J. Ye, X. C. Wen, Y. Wei, Y. Xu, K. Liu and H. Shan, “Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition”, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4 June 2023, pp. 1-5.
[25]. T. Jha, R. Kavya, J. Christopher and V. Arunachalam, “Machine Learning Techniques for Speech Emotion Recognition using Paralinguistic Acoustic Features”, International Journal of Speech Technology, vol. 25, no. 3, pp. 707-725, 2022.
AUTHORS

First Author – Zewar Shah, College of Information Science and Technology, Donghua University, Shanghai 201620, P.R. China.

Second Author – Shan Zhiyong, College of Information Science and Technology, Donghua University, Shanghai 201620, P.R. China.

Third Author – Adnan, School of Computer Science and Technology, Donghua University, Shanghai 201620, China.

Correspondence Author – Shan Zhiyong, School of Information and Technology, Donghua University, Shanghai 201620, P.R. China.
