INTERSPEECH 2019

September 15–19, 2019, Graz, Austria

Robust Speech Emotion Recognition under Different Encoding Conditions


Christopher Oates¹, Andreas Triantafyllopoulos¹, Ingmar Steiner¹, Björn Schuller¹,²,³

¹ audEERING GmbH, Gilching, Germany
² ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
³ GLAM – Group on Language, Audio & Music, Imperial College London, UK

coates@audeering.com, atriant@audeering.com

Abstract

In an era where large speech corpora annotated for emotion are hard to come by, and especially ones where emotion is expressed freely instead of being acted, the importance of using free online sources for collecting such data cannot be overstated. Most of those sources, however, contain encoded audio due to storage and bandwidth constraints, often at very low bitrates. In addition, with the increased industry interest in voice-based applications, it is inevitable that speech emotion recognition (SER) algorithms will soon find their way into production environments, where the audio might be encoded at a different bitrate than the one available during training. Our contribution is threefold. First, we show that encoded audio still contains enough relevant information for robust SER. Next, we investigate the effects of mismatched encoding conditions between the training and test set, both for traditional machine learning algorithms built on hand-crafted features and for modern end-to-end methods. Finally, we investigate the robustness of those algorithms in the multi-condition scenario, where the training set is augmented with encoded audio but still differs from the test set. Our results indicate that end-to-end methods are more robust, even in the more challenging scenario of mismatched conditions.

Index Terms: speech emotion recognition, speech and audio compression

1. Introduction

Data collection is an ongoing issue in the speech and audio machine learning community. We require large, clean and well-controlled data sets to perform rigorous experiments. However, we also need to test the robustness of our algorithms in real-world conditions, and this involves 'in the wild' recordings with natural, mixed content and potentially poor quality signals. The vast amount of available online audio constitutes a virtually inexhaustible source of data for developing machine learning applications. However, due to storage and bandwidth constraints, this data is usually stored in a compressed form. Audio compression is known, in many cases, to degrade the perceptual quality of the audio [1, 2]. However, the consequences of using compressed audio for recognition models have not been sufficiently studied, since most of the available emotional corpora are recorded in very high quality conditions. With the rapid rise of speech emotion recognition (SER) research and its imminent advance into industry applications [3], we wish to investigate both the effects of using compressed audio to train SER models, and the potential pitfalls of using an existing model in a production environment under mismatched encoding conditions.

Previous studies covering the topic of SER models trained with compressed speech can be found in the literature. Albahri et al. [4] tested the low bitrate speech codecs AMR, AMR-WB, and AMR-WB+ and showed that accuracy does not always decrease with a decreasing bitrate. In fact, they showed how emotion classification accuracy can vary across different codecs and acoustic feature sets. On the other hand, Siegert et al. [5] showed that the psychoacoustic model within the OPUS codec can, as a by-product, remove emotion-redundant information from a speech signal, resulting in improved model accuracy. In this case, the data rate of the speech signal is maintained, but redundant and imperceivable audio information is removed. Siegert et al. [6] focused on emotion classification accuracy, obtained via support vector machines (SVMs) [7], when using MP3, AMR-WB and SPX coded speech. The authors found that MP3 with a bitrate of 32 kbit/s or higher was suitable for achieving satisfactory unweighted average recall (UAR) results. García et al. [8] investigated the call centre use case for SER and focused on band-limiting codecs like AMR-NB and SILK, as well as downsampled original speech. The authors were able to show that there was little degradation in model accuracy when features were extracted from voiced segments only. On the other hand, model accuracy was severely degraded when features were extracted from unvoiced segments only. Frühholz et al. [9] investigated the narrow-band encoded and low-pass filtered cases for short-term speaker state and long-term speaker trait recognition. The study focused on narrow-band low-bitrate speech coders used in telecommunications and high-dimensional feature extraction as input to an SVM classifier. The authors showed that, under the given conditions, the matched and multi-condition training methods showed only a slight performance degradation, even at the lower bitrates, for arousal and valence recognition. On the other hand, the mismatched training condition resulted in a performance degradation.

Much of the previous work on the topic of SER with encoded speech has focused on the telecommunications use case. In contrast, we see the large amount of 'in the wild' data on platforms like YouTube as an invaluable resource that can be used both for training and testing an SER model. The most interesting case is that of testing an already trained classifier. Since the presence of coding artefacts is largely inconsequential to a human's ability to recognise an emotion [10], one would expect the same from an SER model which has truly built an internal representation of human emotion. Moreover, access to large amounts of speech data makes modern end-to-end deep neural networks (DNNs), which have demonstrated superior performance over traditional machine learning algorithms in other domains, a viable option for classification.

This paper will focus on the following three applications of SER:



1. SER models trained and tested on compressed speech signals from the same codec (matched conditions)

2. SER models trained on uncompressed speech signals and tested on compressed speech signals (mismatched conditions)

3. SER models augmented with compressed speech signals and tested on different compressed speech signals (multi-condition)

To the best of the authors' knowledge, evaluating an SER model in the mismatched and multi-condition settings for wide-band encoding has not been thoroughly investigated before, and neither has the use of an end-to-end algorithm for classification in such a scenario. In addition, our study encompasses a broad range of experimental conditions relevant to developing a production SER model, namely: (1) codec, (2) bitrate, (3) classification algorithm, (4) training method, (5) data set and (6) feature set. We will focus our attention on three ubiquitous codecs, namely MP3, AAC and OPUS [11], and test a wide range of bitrates common to all three codecs. We will evaluate the consistency of two different kinds of algorithms for three training methods over four data sets and four feature sets. This study aims to extend the body of knowledge on the topic with the unique and broad test cases presented.

The remainder of the paper is organised as follows: In Section 2 we provide an overview of the data sets, codecs, and algorithms used in our evaluation. In Section 3 the results are presented. Finally, in Section 4 we present our conclusions.

2. Experiment design

Speech emotion recognition is usually formulated as a simple classification of emotional labels, or regression on emotional dimensions, for single utterances, where it is assumed that the emotion is held constant within a relatively short temporal window. One of the most common approaches to modelling is to use a set of acoustic features over this temporal window, and then a machine learning algorithm to classify the utterance as belonging to one of the classes in the set. In contrast, modern deep learning methods attempt to fuse the feature extraction and classification steps, and train an algorithm that jointly learns rich representations and solves the downstream task.

In order to investigate the effects of encoding on both kinds of algorithms, we will first use a standard approach that utilises the openSMILE [12] open-source feature extraction toolkit and SVMs [7] for classification. Then, we will also use an end-to-end approach that makes use of convolutional and long short-term memory (LSTM) [13] layers to predict the arousal dimension.

2.1. Data sets

In this study, four standard emotion data sets were used which cover a range of acted and spontaneous emotions [14, 15, 16, 17]. The relevant information for each data set can be found in Table 1. EMO-DB, eNTERFACE, and Polish-Emo contain acted emotional speech, with each utterance limited to expressing a single emotion, whereas RECOLA contains spontaneous, long-term interactions where the emotion of the speaker evolves naturally over time. In order to compare SER results across data sets, a common sampling rate must be used. By using the same sample rate we ensure that the number of bits available per sample of audio is the same for all speech samples. All data sets have therefore been resampled to 16 kHz.

Table 1: Emotion data set information

Data set         EMO-DB       eNTERFACE    Polish-Emo   RECOLA
Subjects (m/f)   5/5          34/8         12/12        19/27
Sample rate      16 kHz       48 kHz       44.1 kHz     44.1 kHz
Language         German       English      Polish       French
Emotions         anger,       anger,       anger,       arousal,
                 boredom,     boredom,     disgust,     valence
                 fear, joy,   fear, joy,   fear,
                 sadness,     sadness,     happiness,
                 neutral      neutral,     sadness,
                              disgust      surprise

2.2. Codecs

We used three ubiquitous codecs, MP3, AAC [18] and OPUS [2], because they find widespread use on many online platforms such as YouTube and iTunes. In addition, they are freely available, allowing for easy reproduction of the results presented. As reported in the FFmpeg Codecs Documentation [11], the libraries libmp3lame, libfdk_aac, and libopus are the best performing implementations. All three codecs largely share the same available bitrates and were forced to use their constant bitrate mode to ensure a meaningful comparison between codec-bitrate combinations. Variable bitrate modes are difficult or impossible to compare, as they are based on subjective quality levels. The speech data was encoded at the following bitrates: 12, 16, 24, 32, 48, 64, 96, 128, 192 and 256 kbit/s. However, due to a limited frame buffer size in the AAC and MP3 codecs, the uppermost bitrates available are 96 and 128 kbit/s, respectively. The OPUS codec was forced to use its internal SILK codec [2], since it is optimised for speech signals. MP3 and AAC, on the other hand, were primarily designed for music content, which may prove to be a disadvantage in the SER case.
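For illustration, the encode/decode round trip described above could be scripted with FFmpeg roughly as follows. The paper does not give the exact command-line options used, so the flags, the file-naming scheme, and the helper itself are assumptions; the encoder names (libmp3lame, libfdk_aac, libopus), the constant-bitrate setting, and the bitrate grid follow the description in this section.

# Minimal sketch (not the authors' script): encode 16 kHz speech at constant
# bitrates with FFmpeg, then decode back to WAV for feature extraction.
# Assumes an ffmpeg build with libmp3lame, libfdk_aac and libopus on PATH.
import subprocess
from pathlib import Path

BITRATES_KBPS = [12, 16, 24, 32, 48, 64, 96, 128, 192, 256]
CODECS = {
    "mp3":  {"encoder": "libmp3lame", "ext": "mp3",  "max_kbps": 128, "extra": []},
    "aac":  {"encoder": "libfdk_aac", "ext": "m4a",  "max_kbps": 96,  "extra": []},
    # "-vbr off" requests constant bitrate; "-application voip" biases libopus
    # towards its speech-oriented (SILK-based) mode.
    "opus": {"encoder": "libopus",    "ext": "opus", "max_kbps": 256,
             "extra": ["-vbr", "off", "-application", "voip"]},
}

def encode_decode(wav_in: Path, out_dir: Path) -> None:
    for name, cfg in CODECS.items():
        for kbps in BITRATES_KBPS:
            if kbps > cfg["max_kbps"]:
                continue  # AAC tops out at 96 kbit/s here, MP3 at 128 kbit/s
            coded = out_dir / f"{wav_in.stem}_{name}_{kbps}k.{cfg['ext']}"
            decoded = coded.with_suffix(".wav")
            subprocess.run(["ffmpeg", "-y", "-i", str(wav_in),
                            "-c:a", cfg["encoder"], "-b:a", f"{kbps}k",
                            *cfg["extra"], str(coded)], check=True)
            # Decode back to 16 kHz mono PCM so that every condition is
            # presented to the feature extractors in the same format.
            subprocess.run(["ffmpeg", "-y", "-i", str(coded),
                            "-ar", "16000", "-ac", "1", str(decoded)], check=True)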
2.3. Speech emotion recognition algorithms

2.3.1. openSMILE & SVM

In the traditional SER setting, where the emotion is considered static within a single utterance, it is typical to extract a set of high-level features in the pre-processing stage, which are then fed into a standard machine learning model for classification or regression.

Previous studies have predominantly utilised one feature set, usually the emobase [19] feature set from the openSMILE toolkit [5, 12, 6]. openSMILE ships with a range of example feature sets of various sizes and combinations of features. We have selected the feature sets IS09_emotion with 384 features [20], ComParE with 6373 features [21], and emo_large with 6552 features [19], as they have been used as baseline feature sets in multiple Interspeech Computational Paralinguistics Challenges. These large feature sets aim to generate a rich feature space, allowing the model to discover useful patterns for SER. In addition, we have also selected the eGeMAPS feature set with 88 features [22]. This feature set was built using physiological and signal processing expert knowledge of the voice production system. It aims to focus on known voice characteristics to distinguish between emotions. Each set provides a single feature vector per utterance. Each feature in the set is a statistical summary of a frame-wise low-level descriptor (LLD); one such example would be the mean fundamental frequency calculated over an utterance.

We chose a simple SVM with linear kernel as our algorithm of choice, as it is commonly used as a baseline algorithm for SER. We also normalised our data using mean and standard deviation normalisation, which was computed on the training set and applied on the test set.
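As a concrete illustration of this pipeline, the sketch below combines functional feature extraction, the training-set-only normalisation, and a linear SVM with the leave-one-speaker-out evaluation used in Section 3.1. It relies on the opensmile Python wrapper (a package audEERING released around the same toolkit after this paper) and scikit-learn; the file list, label handling and the SVM regularisation constant are illustrative assumptions rather than the authors' exact configuration.

# Minimal sketch (assumed setup): eGeMAPS functionals -> per-fold z-normalisation
# -> linear SVM -> unweighted average recall under leave-one-speaker-out CV.
import numpy as np
import opensmile  # Python wrapper around the openSMILE toolkit
from sklearn.metrics import recall_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,  # one vector per utterance
)

def loso_uar(wav_paths, labels, speakers, C=1.0):
    """Mean UAR over leave-one-speaker-out folds (C is an assumed default)."""
    X = np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])
    y = np.asarray(labels)
    groups = np.asarray(speakers)
    uars = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        # Normalisation statistics come from the training fold only, so no
        # information about the held-out speaker leaks into the model.
        scaler = StandardScaler().fit(X[train_idx])
        clf = LinearSVC(C=C).fit(scaler.transform(X[train_idx]), y[train_idx])
        pred = clf.predict(scaler.transform(X[test_idx]))
        # UAR is the unweighted (macro-averaged) recall over the emotion classes.
        uars.append(recall_score(y[test_idx], pred, average="macro"))
    return float(np.mean(uars))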
2.3.2. End-to-end

The RECOLA data set was successfully used to train an end-to-end architecture for predicting arousal by Trigeorgis et al. [23]. We consequently adopt this approach as well, to test the robustness of DNNs under different encodings. Our network consists of two 1D convolution layers with 20 and 40 filter banks, respectively, each followed by a max pooling layer, with strides of 2 and 10. The two max pooling layers effectively downsample the signal to allow for a more efficient implementation and help with learning higher-level representations.

This first part of the network acts as a feature extractor. Its output is first flattened and then fed to two uni-directional LSTM layers [13], each of them having a size of 256. We also used a dropout of 0.5 after each convolutional layer to prevent overfitting. The output of the last LSTM is mapped to the arousal prediction through a fully-connected layer followed by a tanh activation.

We train the model to maximise the concordance correlation coefficient (ρc) for 50 epochs with a batch size of 25 examples and a learning rate of 0.0001, and choose the model that performed best on the validation set. We also performed all of the post-processing steps outlined in Trigeorgis et al. [23], namely median filtering, centring, scaling and time shifting.
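The following PyTorch sketch shows one possible reading of this architecture together with a concordance-correlation-based training objective. The paper does not state the convolution kernel sizes, how the raw waveform is framed, the optimiser, or the exact flattening before the recurrent layers, so those details (and the choice of PyTorch itself) are assumptions made for illustration only.

# Minimal sketch (assumed hyperparameters where the paper is silent):
# conv(20) -> pool(2) -> conv(40) -> pool(10) -> 2x LSTM(256) -> linear -> tanh,
# trained by maximising the concordance correlation coefficient (CCC).
import torch
import torch.nn as nn

class EndToEndArousal(nn.Module):
    def __init__(self, kernel_size=8):               # kernel size: assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 20, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Dropout(0.5),                          # dropout after each conv layer
            nn.MaxPool1d(2, stride=2),
            nn.Conv1d(20, 40, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.MaxPool1d(10, stride=10),
        )
        self.lstm = nn.LSTM(input_size=40, hidden_size=256,
                            num_layers=2, batch_first=True)
        self.head = nn.Sequential(nn.Linear(256, 1), nn.Tanh())

    def forward(self, wave):                          # wave: (batch, 1, samples)
        z = self.features(wave)                       # (batch, 40, frames)
        z = z.transpose(1, 2)                         # (batch, frames, 40) for the LSTM
        out, _ = self.lstm(z)
        # One arousal value per input window; a time-continuous variant would
        # instead apply the head at every LSTM time step.
        return self.head(out[:, -1, :]).squeeze(-1)

def ccc_loss(pred, gold):
    """Return 1 - CCC, so that minimising the loss maximises concordance."""
    p_mean, g_mean = pred.mean(), gold.mean()
    p_var, g_var = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - p_mean) * (gold - g_mean)).mean()
    ccc = 2 * cov / (p_var + g_var + (p_mean - g_mean) ** 2 + 1e-8)
    return 1.0 - ccc

# Training sketch: 50 epochs, batch size 25, learning rate 1e-4 (as in the text);
# the optimiser is not specified in the paper, Adam is an assumption:
# optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)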
3. Experiments

We present the results of our experiments in Sections 3.1 and 3.2. As we shall see in Section 3.2, the end-to-end approach does not suffer from the effects of encoding; we therefore only include its results for the most challenging scenario of mismatched conditions. For the acoustic features approach, we report the results from all scenarios.

3.1. openSMILE & SVM

Given the limited size of our data sets, we report results using leave-one-speaker-out (LOSO), speaker-independent cross-validation (CV) for all our experiments, and report UAR. In Figures 1 to 3, we show the average UAR across all folds with an error bar denoting its standard deviation.

3.1.1. Matched conditions

We initially investigate the effects of using compressed audio on each feature set's ability to accurately represent emotion. This is tested by comparing the performance of our models when we train on compressed audio and test on compressed audio of the same kind, as opposed to training and testing on the original audio itself.

As we see in Figure 1, for most combinations of feature sets, codecs and bitrates, the UAR does not differ from the uncompressed counterpart. With a bitrate greater than 64 kbit/s, there is little to no decrease in UAR across all codecs. This result implies that while there is a substantial perceptual loss in quality with a decreasing bitrate, there is still enough information in the compressed speech to maintain UAR results comparable to its uncompressed counterpart.

[Figure 1: UAR of SER models in the matched condition setting. Panels: EMO-DB, Polish-EMO and eNTERFACE, for the ComParE, IS09_emotion, eGeMAPS and emo_large feature sets; x-axis: kilobits per second; y-axis: unweighted average recall; series: AAC, MP3, OPUS, Original.]

3.1.2. Mismatched conditions

After establishing that compressed audio still contains enough relevant information to identify emotion, we investigate whether an algorithm trained on uncompressed speech can perform satisfactorily in a production environment, where the incoming audio might have undergone any possible form of encoding.
The results presented in Figure 2 show that in most cases there is a large drop in UAR on the compressed test sets. The AAC codec shows on average the worst performance, with none of the bitrates achieving consistent results across data sets or feature sets. This result suggests that the AAC codec is not suitable in such a scenario. For the MP3 codec, only bitrates greater than 96 kbit/s perform well across data sets and return UAR scores on par with the uncompressed test sets.

[Figure 2: UAR of SER models trained in the mismatched condition setting. Same layout and axes as Figure 1.]

A closer examination of the Polish-EMO data set, where OPUS suffers from the biggest decrease in UAR, −20 % on average, revealed that there is a DC offset in all of its samples. The OPUS codec applies a highpass filter to remove the DC offset, a pre-processing step that is not performed in the MP3 and AAC codecs, and this is the likely cause of this large drop in performance, especially compared with OPUS' consistency across the other two data sets. It should, however, be noted that a negligible DC offset is standard recording practice, as in the EMO-DB and eNTERFACE data sets, where the OPUS codec is able to maintain a UAR score comparable to its uncompressed counterpart for all but the lowest bitrates. Nevertheless, this inconsistency across data sets means that in such a scenario OPUS is also not suitable for highly robust SER.
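A DC offset of this kind is easy to detect and remove before encoding or training; the small sketch below illustrates such a pre-processing step (it is not something the paper applies, and the soundfile I/O library and file handling are assumptions).

# Minimal sketch: report and remove the DC offset of a mono recording.
import numpy as np
import soundfile as sf

def remove_dc_offset(in_path: str, out_path: str) -> float:
    signal, sample_rate = sf.read(in_path)
    offset = float(np.mean(signal))     # a non-zero mean indicates a DC offset
    sf.write(out_path, signal - offset, sample_rate)
    return offset                       # useful for logging how biased the file was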
3.1.3. Multi-condition

Finally, we investigate the most realistic production setting, where our training set is augmented with compressed audio signals of specific bitrates, but our test set consists of audio signals compressed with a different bitrate. Ensuring different bitrates between the training and test sets simulates a production environment in which the actual bitrate is unknown or variable. We always use higher bitrates to augment our training set, because the more interesting case is when the test set contains lower audio quality. Specifically, we use the two immediately higher available bitrates for augmenting our training set; e.g. for testing the 12 kbit/s audio, we augment our training set with 24 and 32 kbit/s samples.
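To make the augmentation recipe explicit, the helper below sketches how a multi-condition training list could be assembled from the uncompressed files plus two encoded versions of each. The file-naming scheme follows the earlier encoding sketch and the default bitrates follow the 12 kbit/s example above, so these details are assumptions rather than the authors' implementation.

# Minimal sketch: uncompressed training files plus two encoded-and-decoded
# copies per file form the multi-condition training set.
from pathlib import Path

def multi_condition_training_set(train_wavs, codec, aug_kbps=(24, 32),
                                 encoded_dir=Path("encoded")):
    """Return the original paths plus their encoded copies at aug_kbps."""
    augmented = list(train_wavs)
    for wav in train_wavs:
        for kbps in aug_kbps:
            augmented.append(encoded_dir / f"{Path(wav).stem}_{codec}_{kbps}k.wav")
    return augmented

# e.g. multi_condition_training_set(train_wavs, "aac") augments with the
# 24 and 32 kbit/s versions, matching the 12 kbit/s test condition above.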
The results presented in Figure 3 show that the worst-case AAC tests have greatly improved after data augmentation. In particular, for the ComParE feature set, the AAC codec improves by 35 % on average across the data sets. A similar improvement can be seen for the MP3 codec at its lowest bitrates.

As expected, the OPUS codec now achieves UAR results similar to the uncompressed results across all data sets, feature sets and all but the lowest bitrates. In many cases, by augmenting the uncompressed training set with just two compressed versions of itself, we are able to build a model which is robust to the temporal and spectral artefacts introduced by the compression process.

[Figure 3: UAR of SER models in the multi-condition setting. Same layout and axes as Figure 1.]

3.2. End-to-end

As seen in Table 2, the end-to-end approach does not suffer from the effects of encoding. We see only marginal differences in the concordance correlation coefficient (CCC) results under all encodings and bitrates. This indicates that such architectures can potentially be more robust to encoding noise than traditional machine learning approaches.

Table 2: RECOLA CCC results in mismatched conditions (CCC on the original, uncompressed test set: 0.4195)

Bitrate (kbit/s)   OPUS     MP3      AAC
12                 0.4068   0.4221   0.4189
16                 0.4089   0.4221   0.4205
24                 0.4089   0.4221   0.4216
32                 0.4091   0.4221   0.4121
64                 0.4162   0.4198   0.4130
96                 0.4151   0.4195   0.4135
128                0.4147   0.4195   n/a
192                0.4153   n/a      n/a
256                0.4150   n/a      n/a

4. Conclusion

In this study, we established that enough emotion-relevant information survives the encoding process to produce UAR scores similar to the uncompressed counterpart. This effect was demonstrated across data sets and feature sets. However, in the mismatched condition scenario, the changes imposed by the encoding process can result in a large loss in SER UAR. In many cases, a model trained on high-quality uncompressed audio will not work with low-bitrate encoded input. In addition, even at the higher bitrates, large inconsistencies in SER UAR were found across data sets, most notably and consistently for the AAC codec, but also for the OPUS codec with the Polish-EMO data set. These negative effects can be mitigated by augmenting the training set with encoded audio. In fact, we have shown that in the majority of cases, augmenting with only two encoded versions is required to regain the lost UAR. Indeed, if, in multi-condition training, a representative subset of a codec's bitrate range is used for augmentation, the resulting model is likely to be robust to the artefacts of that codec. Within the conditions investigated in the multi-condition case, this result means that speech data can be procured from an online source without concern for the encoding process affecting the SER result. End-to-end architectures, on the other hand, seem not to suffer from this effect at all and can be reliably used independent of the encoding conditions.

5. Acknowledgements

We would like to thank Uwe Reichel and Hagen Wierstorf for their helpful and constructive comments on the manuscript.

6. References

[1] A. Hines, E. Gillen, D. Kelly, J. Skoglund, A. Kokaram and N. Harte, 'Perceived audio quality for streaming stereo music', in ACM International Conference on Multimedia, 2014, pp. 1173–1176. DOI: 10.1145/2647868.2655025.

[2] C. Hoene, J. Valin, K. Vos and J. Skoglund, 'Summary of Opus listening test results', IETF, 2013. [Online]. Available: https://tools.ietf.org/html/draft-ietf-codec-results-03.

[3] B. W. Schuller, 'Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends', Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018. DOI: 10.1145/3129340.

[4] A. Albahri, M. Lech and E. Cheng, 'Effect of speech compression on the automatic recognition of emotions', International Journal of Signal Processing Systems, vol. 4, no. 1, pp. 55–61, 2015. DOI: 10.12720/ijsps.4.1.55-61.

[5] I. Siegert, A. Requardt, O. Egorow and S. Wolff, 'Utilizing psychoacoustic modeling to improve speech-based emotion recognition', in International Conference on Speech and Computer, 2018, pp. 625–635. DOI: 10.1007/978-3-319-99579-3_64.

[6] I. Siegert, A. Requardt, L. Linda Duong and A. Wendemuth, 'Measuring the impact of audio compression on the spectral quality of speech data', in Elektronische Sprachsignalverarbeitung, O. Jokisch, Ed., TUD Press, 2016, pp. 229–236.

[7] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[8] N. García, J. Vasquez, J. D. Arias-Londoño, J. Vargas-Bonilla and J. R. Orozco, 'Automatic emotion recognition in compressed speech using acoustic and non-linear features', in Symposium on Signal Processing, Images and Computer Vision, 2015, pp. 1–7. DOI: 10.1109/STSIVA.2015.7330399.

[9] S. Frühholz, E. Marchi and B. Schuller, 'The effect of narrow-band transmission on recognition of paralinguistic information from human vocalizations', IEEE Access, vol. 4, Sep. 2016. DOI: 10.1109/ACCESS.2016.2604038.

[10] A. Requardt, I. Siegert, M. Maruschke and A. Wendemuth, 'Audio compression and its impact on emotion recognition in affective computing', Mar. 2017.

[11] FFmpeg codecs documentation, 2019. [Online]. Available: https://www.ffmpeg.org/ffmpeg-codecs.html.

[12] F. Eyben, M. Wöllmer and B. Schuller, 'openSMILE – the Munich versatile and fast open-source audio feature extractor', in ACM International Conference on Multimedia, 2010, pp. 1459–1462. DOI: 10.1145/1873951.1874246.

[13] S. Hochreiter and J. Schmidhuber, 'Long short-term memory', Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: 10.1162/neco.1997.9.8.1735.

[14] F. Burkhardt, A. Paeschke, M. A. Rolfes, W. F. Sendlmeier and B. Weiss, 'A database of German emotional speech', in Interspeech, 2005, pp. 1517–1520. [Online]. Available: https://www.isca-speech.org/archive/interspeech_2005/i05_1517.html.

[15] O. Martin, I. Kotsia, B. Macq and I. Pitas, 'The eNTERFACE'05 audio-visual emotion database', in International Conference on Data Engineering Workshops, 2006. DOI: 10.1109/ICDEW.2006.145.

[16] P. Staroniewicz and W. Majewski, 'Polish emotional speech database – recording and preliminary validation', in Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions, A. Esposito and R. Vích, Eds., Springer, 2009, pp. 42–49. DOI: 10.1007/978-3-642-03320-9_5.

[17] F. Ringeval, A. Sonderegger, J. Sauer and D. Lalanne, 'Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions', in IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2013, pp. 1–8. DOI: 10.1109/FG.2013.6553805.

[18] K. Brandenburg, 'MP3 and AAC explained', in International Conference on High-Quality Audio Coding, 1999. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=8079.

[19] F. Eyben, M. Wöllmer and B. Schuller, 'openEAR – introducing the Munich open-source emotion and affect recognition toolkit', in International Conference on Affective Computing and Intelligent Interaction, 2009, pp. 1–6. DOI: 10.1109/ACII.2009.5349350.

[20] B. Schuller, S. Steidl and A. Batliner, 'The Interspeech 2009 Emotion Challenge', in Interspeech, 2009, pp. 312–315. [Online]. Available: https://www.isca-speech.org/archive/interspeech_2009/i09_0312.html.

[21] B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. Burgoon, A. Baird, A. Elkins, Y. Zhang, E. Coutinho and K. Evanini, 'The Interspeech 2016 Computational Paralinguistics Challenge: Deception, sincerity and native language', in Interspeech, 2016, pp. 2001–2005. DOI: 10.21437/Interspeech.2016-129.

[22] F. Eyben, K. R. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan and K. Truong, 'The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing', IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015. DOI: 10.1109/TAFFC.2015.2457417.

[23] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller and S. Zafeiriou, 'Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network', in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5200–5204. DOI: 10.1109/ICASSP.2016.7472669.
