Robust Speech Emotion Recognition Under Different Encoding Conditions
Abstract

In an era where large speech corpora annotated for emotion are hard to come by, and especially ones where emotion is expressed freely instead of being acted, the importance of using free online sources for collecting such data cannot be overstated. Most of those sources, however, contain encoded audio due to storage and bandwidth constraints, often at very low bitrates. In addition, with the increased industry interest in voice-based applications, it is inevitable that speech emotion recognition (SER) algorithms will soon find their way into production environments, where the audio might be encoded at a different bitrate than the one available during training. Our contribution is threefold. First, we show that encoded audio still contains enough relevant information for robust SER. Next, we investigate the effects of mismatched encoding conditions in the training and test set, both for traditional machine learning algorithms built on hand-crafted features and for modern end-to-end methods. Finally, we investigate the robustness of those algorithms in the multi-condition scenario, where the training set is augmented with encoded audio but still differs from the test set. Our results indicate that end-to-end methods are more robust, even in the more challenging scenario of mismatched conditions.

Index Terms: speech emotion recognition, speech and audio compression

1. Introduction

Data collection is an ongoing issue in the speech and audio machine learning community. We require large, clean and well-controlled data sets to perform rigorous experiments. However, we also need to test the robustness of our algorithms in real-world conditions, and this involves 'in the wild' recordings with natural, mixed content and potentially poor-quality signals. The vast number of audio sources available online constitutes a virtually inexhaustible supply of data for developing machine learning applications. However, due to storage and bandwidth constraints, this data is usually stored in compressed form. Audio compression is known, in many cases, to degrade the perceptual quality of the audio [1, 2]. However, the consequences of using compressed audio for recognition models have not been sufficiently studied, since most of the available emotional corpora are recorded under very high-quality conditions. With the rapid rise of speech emotion recognition (SER) research and its imminent advance into industry applications [3], we wish to investigate both the effects of using compressed audio to train SER models and the potential pitfalls of using an existing model in a production environment under mismatched encoding conditions.

Previous studies covering the topic of SER models trained with compressed speech can be found in the literature. Albahri et al. [4] tested the low-bitrate speech codecs AMR, AMR-WB, and AMR-WB+ and showed that accuracy does not always decrease with a decreasing bitrate. In fact, they showed how emotion classification accuracy can vary across different codecs and acoustic feature sets. On the other hand, Siegert et al. [5] showed that the psychoacoustic model within the OPUS codec can, as a by-product, remove emotion-redundant information from a speech signal, resulting in improved model accuracy. In this case, the data rate of the speech signal is maintained, but redundant and imperceivable audio information is removed. Siegert et al. [6] focused on emotion classification accuracy, obtained via support vector machines (SVMs) [7], when using MP3, AMR-WB, and SPX coded speech. The authors found that MP3 with a bitrate of 32 kbit/s or higher was suitable for achieving satisfactory unweighted average recall (UAR) results. García et al. [8] investigated the call-centre use case for SER and focused on band-limiting codecs like AMR-NB and SILK, as well as on downsampled original speech. The authors were able to show that there was little degradation in model accuracy when features were extracted from voiced segments only. On the other hand, model accuracy was severely degraded when features were extracted from unvoiced segments only. Frühholz et al. [9] investigated the narrow-band encoded and low-pass filtered cases for short-term speaker state and long-term speaker trait recognition. The study focused on narrow-band low-bitrate speech coders used in telecommunications and high-dimensional feature extraction as input to an SVM classifier. The authors showed that, under the given conditions, the matched and multi-condition training methods showed only a slight performance degradation, even at the lower bitrates, for arousal and valence recognition. On the other hand, the mismatched training condition resulted in a performance degradation.

Much of the previous work on SER with encoded speech has focused on the telecommunications use case. In contrast, we see the large amount of 'in the wild' data on platforms like YouTube as an invaluable resource that can be used both for training and for testing a SER model. The most interesting case is that of testing an already trained classifier. Since the presence of coding artefacts is largely inconsequential to a human's ability to recognise an emotion [10], one would expect the same from a SER model which has truly built an internal representation of human emotion. Moreover, access to large amounts of speech data makes modern end-to-end deep neural networks (DNNs), which have demonstrated superior performance over traditional machine learning algorithms in other domains, a viable option for classification.

This paper will focus on the following three applications of SER:
1. SER models trained and tested on compressed speech
of choice, as it is commonly used as a baseline algorithm for SER. We also normalised our data using a mean and standard deviation normalisation, which was computed on the training set and applied to the test set.
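The following is a minimal sketch of this per-fold preprocessing and classification step, assuming the openSMILE functionals have already been exported to NumPy arrays; the helper name, the kernel, and the hyper-parameters are illustrative and not taken from the paper.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import recall_score

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Fit normalisation on the training fold only, then train and score an SVM."""
    scaler = StandardScaler()              # mean/std computed on the training set
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)      # same statistics applied to the test set

    clf = SVC(kernel="linear", C=1.0)      # illustrative hyper-parameters
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    # Unweighted average recall (UAR) is macro-averaged recall
    return recall_score(y_test, y_pred, average="macro")
```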
2.3.2. End-to-end

We adopt the end-to-end approach of Trigeorgis et al. [23] to test the robustness of DNNs under different encodings. Our network consists of two 1D convolution layers, with 20 and 40 filter banks respectively, each followed by a max pooling layer with strides of 2 and 10. The two max pooling layers effectively downsample the signal to allow for a more efficient implementation and help with learning higher-level representations. The output of the convolutional layers is first flattened and then fed to two uni-directional LSTM layers [13], each of size 256. We also used a dropout of 0.5 after each convolutional layer to prevent overfitting. The output of the last LSTM is mapped to the arousal prediction through a fully-connected layer followed by a tanh activation.
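As a concrete illustration of this architecture, here is a minimal PyTorch sketch with the layer sizes described above (two 1D convolutions with 20 and 40 filters, max pooling with strides 2 and 10, dropout of 0.5, two LSTMs of size 256, and a tanh-activated linear output). The kernel sizes, the frame-wise reshaping before the LSTM, and the per-frame output are assumptions made for illustration; they are not specified in the text above.

```python
import torch
import torch.nn as nn

class EndToEndSER(nn.Module):
    """Sketch of a convolutional recurrent arousal regressor; kernel sizes assumed."""
    def __init__(self, hidden_size=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 20, kernel_size=8, padding=4),   # 20 filter banks
            nn.ReLU(),
            nn.Dropout(0.5),                               # dropout after each conv layer
            nn.MaxPool1d(kernel_size=2, stride=2),         # first pooling, stride 2
            nn.Conv1d(20, 40, kernel_size=8, padding=4),   # 40 filter banks
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.MaxPool1d(kernel_size=10, stride=10),       # second pooling, stride 10
        )
        self.lstm = nn.LSTM(input_size=40, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)  # two uni-directional LSTMs
        self.output = nn.Linear(hidden_size, 1)              # arousal prediction

    def forward(self, x):
        # x: (batch, 1, samples) raw waveform
        h = self.features(x)               # (batch, 40, frames)
        h = h.transpose(1, 2)              # (batch, frames, 40) for the LSTM
        h, _ = self.lstm(h)                # (batch, frames, hidden_size)
        # tanh-bounded, time-continuous arousal prediction per frame
        return torch.tanh(self.output(h)).squeeze(-1)
```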
We train the model to maximize the concordance correlation coefficient (ρc) for 50 epochs, with a batch size of 25 examples and a learning rate of 0.0001, and choose the model that performed best on the validation set. We also performed all of the post-processing steps outlined in Trigeorgis et al. [23], namely median filtering, centring, scaling and time shifting.
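Training to maximize ρc requires a differentiable loss; a common choice is 1 − CCC computed over each sequence. The sketch below follows the standard CCC definition; the exact formulation and the optimiser used in the paper are not given in this excerpt (Adam with the stated learning rate of 0.0001 would be one plausible choice, e.g. torch.optim.Adam(model.parameters(), lr=1e-4)).

```python
import torch

def ccc_loss(prediction, target, eps=1e-8):
    """1 - CCC between two 1D sequences, suitable as a loss to minimise."""
    pred_mean, targ_mean = prediction.mean(), target.mean()
    pred_var, targ_var = prediction.var(unbiased=False), target.var(unbiased=False)
    covariance = ((prediction - pred_mean) * (target - targ_mean)).mean()
    ccc = (2 * covariance) / (pred_var + targ_var + (pred_mean - targ_mean) ** 2 + eps)
    return 1.0 - ccc
```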
Figure 1: UAR of SER models in the matched condition setting (panels: EMO-DB, Polish-EMO, eNTERFACE; feature sets: ComParE, IS09_emotion, eGeMAPS, emo_large; codecs: AAC, OPUS, MP3, Original; x-axis: kilobits per second; y-axis: unweighted average recall).

[Figure 2: UAR panels for EMO-DB, Polish-EMO and eNTERFACE; feature sets ComParE, IS09_emotion, eGeMAPS, emo_large; codecs AAC, OPUS, MP3, Original; x-axis: kilobits per second; y-axis: unweighted average recall.]

3. Experiments

We present the results of our experiments in Sections 3.1 and 3.2. As we shall see in Section 3.2, the end-to-end approach does not suffer from the effects of encoding; we therefore only include results for the most challenging scenario of mismatched conditions. For the acoustic features approach, we report the results from all scenarios.

3.1. openSMILE & SVM

Given the limited size of our data sets, we decided to report results using leave-one-speaker-out (LOSO), speaker-independent cross-validation (CV) for all our experiments, and we report UAR.
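A minimal sketch of such a LOSO evaluation loop, assuming a per-utterance array of speaker identifiers is available for grouping and reusing the illustrative train_and_evaluate helper from the earlier sketch:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_uar(X, y, speakers):
    """Leave-one-speaker-out CV; returns the mean UAR across folds."""
    logo = LeaveOneGroupOut()
    fold_scores = []
    for train_idx, test_idx in logo.split(X, y, groups=speakers):
        uar = train_and_evaluate(X[train_idx], y[train_idx],
                                 X[test_idx], y[test_idx])
        fold_scores.append(uar)
    return float(np.mean(fold_scores))
```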
In Figures 1 to 3, we show the average UAR across all folds and
[Figure 3: UAR panels for EMO-DB, Polish-EMO and eNTERFACE; feature sets ComParE, IS09_emotion, eGeMAPS, emo_large; codecs AAC, OPUS, MP3, Original; x-axis: kilobits per second; y-axis: unweighted average recall.]

As expected, the OPUS codec now achieves UAR results similar to the uncompressed results across all data sets, all feature sets, and all but the lowest bitrates. In many cases, by augmenting the uncompressed training set with just two compressed versions of itself, we are able to build a model which is robust to the temporal and spectral artefacts introduced by the compression process.
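Producing such compressed copies of a training set can be scripted around FFmpeg [11]; the sketch below shows one possible way, where the codec/bitrate pairs, the file layout, and the decode-back-to-WAV step are chosen for illustration rather than taken from the paper's experimental setup.

```python
import subprocess
from pathlib import Path

# Illustrative codec/bitrate pairs; the paper's exact augmentation set may differ.
ENCODINGS = [("libmp3lame", "mp3", "32k"), ("libopus", "opus", "32k")]

def make_compressed_copies(wav_dir, out_dir):
    """Encode every WAV file with each codec, then decode back to WAV for feature extraction."""
    for wav in Path(wav_dir).glob("*.wav"):
        for codec, ext, bitrate in ENCODINGS:
            coded = Path(out_dir) / f"{wav.stem}_{ext}_{bitrate}.{ext}"
            decoded = coded.with_suffix(".wav")
            subprocess.run(["ffmpeg", "-y", "-i", str(wav),
                            "-c:a", codec, "-b:a", bitrate, str(coded)], check=True)
            subprocess.run(["ffmpeg", "-y", "-i", str(coded), str(decoded)], check=True)
```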
3.2. End-to-end

As seen in Table 2, the end-to-end approach does not suffer from the effects of encoding. We see only marginal differences in the concordance correlation coefficient (CCC) results under all encodings and bitrates. This indicates that such architectures can potentially be more robust to encoding noise than traditional machine learning approaches.

Table 2: RECOLA CCC results in mismatched conditions
Original    Bitrate (kbit/s)    OPUS    MP3    AAC
6. References

[1] A. Hines, E. Gillen, D. Kelly, J. Skoglund, A. Kokaram and N. Harte, 'Perceived audio quality for streaming stereo music', in ACM International Conference on Multimedia, 2014, pp. 1173–1176. DOI: 10.1145/2647868.2655025.

[2] C. Hoene, J. Valin, K. Vos and J. Skoglund, 'Summary of Opus listening test results', IETF, 2013. [Online]. Available: https://tools.ietf.org/html/draft-ietf-codec-results-03.

[3] B. W. Schuller, 'Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends', Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018. DOI: 10.1145/3129340.

[4] A. Albahri, M. Lech and E. Cheng, 'Effect of speech compression on the automatic recognition of emotions', International Journal of Signal Processing Systems, vol. 4, no. 1, pp. 55–61, 2015. DOI: 10.12720/ijsps.4.1.55-61.

[5] I. Siegert, A. Requardt, O. Egorow and S. Wolff, 'Utilizing psychoacoustic modeling to improve speech-based emotion recognition', in International Conference on Speech and Computer, 2018, pp. 625–635. DOI: 10.1007/978-3-319-99579-3_64.

[6] I. Siegert, A. Requardt, L. Linda Duong and A. Wendemuth, 'Measuring the impact of audio compression on the spectral quality of speech data', in Elektronische Sprachsignalverarbeitung, O. Jokisch, Ed., TUD Press, 2016, pp. 229–236.

[7] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[8] N. García, J. Vasquez, J. D. Arias-Londoño, J. Vargas-Bonilla and J. R. Orozco, 'Automatic emotion recognition in compressed speech using acoustic and non-linear features', in Symposium on Signal Processing, Images and Computer Vision, 2015, pp. 1–7. DOI: 10.1109/STSIVA.2015.7330399.

[9] S. Frühholz, E. Marchi and B. Schuller, 'The effect of narrow-band transmission on recognition of paralinguistic information from human vocalizations', IEEE Access, vol. 4, Sep. 2016. DOI: 10.1109/ACCESS.2016.2604038.

[10] A. Requardt, I. Siegert, M. Maruschke and A. Wendemuth, 'Audio compression and its impact on emotion recognition in affective computing', Mar. 2017.

[11] FFmpeg codecs documentation, 2019. [Online]. Available: https://www.ffmpeg.org/ffmpeg-codecs.html.

[12] F. Eyben, M. Wöllmer and B. Schuller, 'openSMILE – the Munich versatile and fast open-source audio feature extractor', in ACM International Conference on Multimedia, 2010, pp. 1459–1462. DOI: 10.1145/1873951.1874246.

[13] S. Hochreiter and J. Schmidhuber, 'Long short-term memory', Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: 10.1162/neco.1997.9.8.1735.

[14] F. Burkhardt, A. Paeschke, M. A. Rolfes, W. F. Sendlmeier and B. Weiss, 'A database of German emotional speech', in Interspeech, 2005, pp. 1517–1520. [Online]. Available: https://www.isca-speech.org/archive/interspeech_2005/i05_1517.html.

[15] O. Martin, I. Kotsia, B. Macq and I. Pitas, 'The eNTERFACE'05 audio-visual emotion database', in International Conference on Data Engineering Workshops, 2006. DOI: 10.1109/ICDEW.2006.145.

[16] P. Staroniewicz and W. Majewski, 'Polish emotional speech database – recording and preliminary validation', in Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions, A. Esposito and R. Vích, Eds., Springer, 2009, pp. 42–49. DOI: 10.1007/978-3-642-03320-9_5.

[18] K. Brandenburg, 'MP3 and AAC explained', in International Conference on High-Quality Audio Coding, 1999. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=8079.

[19] F. Eyben, M. Wöllmer and B. Schuller, 'OpenEAR – introducing the Munich open-source emotion and affect recognition toolkit', in International Conference on Affective Computing and Intelligent Interaction, 2009, pp. 1–6. DOI: 10.1109/ACII.2009.5349350.

[20] B. Schuller, S. Steidl and A. Batliner, 'The Interspeech 2009 Emotion Challenge', in Interspeech, 2009, pp. 312–315. [Online]. Available: https://www.isca-speech.org/archive/interspeech_2009/i09_0312.html.

[21] B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. Burgoon, A. Baird, A. Elkins, Y. Zhang, E. Coutinho and K. Evanini, 'The Interspeech 2016 Computational Paralinguistics Challenge: Deception, sincerity and native language', in Interspeech, 2016, pp. 2001–2005. DOI: 10.21437/Interspeech.2016-129.

[22] F. Eyben, K. R. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan and K. Truong, 'The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing', IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015. DOI: 10.1109/TAFFC.2015.2457417.

[23] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller and S. Zafeiriou, 'Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network', in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5200–5204. DOI: 10.1109/ICASSP.2016.7472669.