Background & Summary

Emotion is an umbrella term referring to various mental states that are brought on by changes in an individual’s environment and that can result in behavioral and cognitive changes in that individual. Emotion research has recently attracted considerable attention in many fields, such as neuroscience, affective computing, ergonomics, medicine, and psychology. In particular, research on emotion recognition has increased significantly because next-generation human-computer interaction (HCI) applications will be developed as adaptive systems that recognize user emotions1,2.

Many studies have attempted to automatically recognize human emotions, in particular by using various biosignals such as electroencephalography (EEG), electrocardiography (ECG), electromyography (EMG), photoplethysmography (PPG), and galvanic skin response (GSR)3,4,5. Some of these studies attempted to distinguish dichotomous emotions, such as happiness vs. sadness6,7,8,9, whereas others attempted to quantify emotions, for example by measuring arousal level or valence10,11. Among the various biosignals, EEG has been the most widely used for emotion recognition, and many EEG-based open-access datasets are currently available for emotion recognition studies, such as DEAP12,13, MAHNOB-HCI11,14, and SEED15,16. In previous EEG-based emotion studies1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16, visual or auditory stimuli pre-defined to trigger certain emotions were used to artificially evoke the corresponding brain activity; pictures of natural scenes were used to evoke positive emotions, such as happiness and relaxation, whereas pictures of abominable scenes were used to evoke negative emotions, such as sadness and disgust. Although previous EEG-based emotion studies have shown that decoding human emotions is feasible, emotions induced in natural environments, such as human-to-human communication, must also be decoded to develop more reliable applications based on emotion recognition.

In this study, we provide a novel EEG dataset containing the emotional information induced during a realistic HCI using a voice user interface (VUI) system that mimics natural human-to-human communication, thereby contributing to the advancement of emotion recognition research. The EEG data were acquired while fifty subjects were interacting with the VUI system. To induce emotional changes during the experiment, we controlled the two answer parameters of the VUI system, i.e., voice type (child/adult) and information quantity (simple/detailed), and a combination of the two parameters was randomly used to answer a user’s question. During the experiment, we surveyed users concerning their satisfaction with the answers provided by the VUI system for each trial, and the satisfaction value was used as the ground truth for the induced emotions. We also simultaneously recorded other types of physiological data, such as ECG, PPG, GSR, and facial images, along with EEG data.

For data verification, we analyzed the EEG data using a series of signal processing and machine learning methods that are widely used in EEG-based emotion recognition studies; only a simple detrending was applied to the other physiological data, which could also be used for emotion recognition alongside the EEG data.

Methods

Subjects

Fifty undergraduate students were recruited for this study, and they had no history of any psychiatric disease that could affect the results of the study. Prior to the experiment, the subjects were provided with the details of the experimental procedure, and they signed a form consenting to participate in the experiment and for videos of their likenesses to be shared in an open dataset. Among the subjects, 44 subjects (26 males and 18 females; 24.64 ± 2.13 years) agreed to data publication, whereas the others disagreed for privacy reasons. Adequate reimbursement was provided after the experiment. This study was approved by the Institutional Review Board (IRB) of Kumoh National Institute of Technology (202207-HR-003-01) and was conducted in accordance with the principles of the Declaration of Helsinki.

VUI system

We constructed a virtual VUI system with two different answer parameters, i.e., voice type (child/adult) and information quantity (simple/detailed), which were selected based on our previous studies that investigated various answer parameters of a VUI system in terms of emotional changes assessed by a subjective survey17,18. Additionally, to prevent potential biases from subjects familiar with the experimental concept due to previous participation in stimulus rating, we recruited new subjects for the current experiment, with no overlap with the previous studies. Because commercial VUI products, such as AWS, Bixby, Naver, and Google, generally use a female adult voice, we used female voices for both levels of the first VUI parameter. Moreover, we found in our previous studies17,18 that age difference was a significant factor influencing emotions, and thus age was used to define the two voice types; in other words, a girl’s voice and a woman’s voice were used for the child and adult voices, respectively. In terms of information quantity, a short, correct answer to a question was provided as the simple answer, whereas the correct answer along with its source was provided as the detailed answer (Fig. 1). Because it was impossible to prepare answers to all possible, unpredictable questions from the subjects, we prepared 80 questions and their answers such that the four answer types were equal in number (20 for each type: child/simple, child/detailed, adult/simple, and adult/detailed). These 80 questions are provided in a supplementary file (questions_answers.pdf). We produced the audio files for the answers to each of the 80 questions using a text-to-speech service (TYPECAST, https://typecast.ai), an audio creation tool that synthesizes speech in given voice types. All question-answer pairs were based solely on factual information without emotional content, ensuring that emotions would be induced only by the two VUI parameters. Additionally, each question-answer pair was used only once, eliminating any potential learning effect.

Fig. 1
figure 1

Example of a simple and a detailed answer in terms of information quantity.

Experimental paradigm

During the experiment, the subjects sat in a comfortable armchair in front of a 21-inch monitor at a distance of 1 m, and they were instructed to remain relaxed and not to move. A single-blind design was used to prevent the subjects from anticipating the answer types, as well as to mimic a realistic conversational setting. The questions were presented to the subject in the order written in the supplementary file (questions_answers.pdf). The subject was instructed to ask the VUI system a question by reading a sentence presented on the monitor, without prior knowledge of the sequence of questions. Then, a corresponding answer was provided to the subject from among the four answer types (child/simple, child/detailed, adult/simple, and adult/detailed). Note that the duration of the answer period varied across trials depending on the question, with a mean duration of 3.01 ± 1.02 s. Additionally, event information was recorded at the time points indicated by the green arrows in Fig. 2, and the corresponding temporal information can be found in the mrk file. After listening to the answer, the subjects were given 3 seconds to think about it so that they could recognize their emotions and participate in the survey19,20,21. Then, the subject evaluated their feelings about the response made by the VUI system. In contrast to previous studies that elicited specific emotions associated with valence and arousal through audio-visual stimuli, we focused on emotions induced in bidirectional interaction situations and shaped by individual characteristics. Therefore, we adopted factor analysis based on the Kansei engineering technique to decode individual emotions, a method commonly used in emotional engineering to assess user evaluations22. We initially adopted Kansei words related to emotions that can affect individual emotions in the conversation between the VUI system and its users22. More specifically, we first empirically extracted 30 pairs of emotion-related adjectives by expert judgment. Then, we determined the final 9 pairs of adjectives using a card-sorting method based on the opinions of the experts and of users who had experienced the VUI system23. We created a questionnaire sheet to determine the user’s emotional state in response to the answer of the VUI system using the nine extracted adjective pairs. After each VUI interaction, the subjects completed the questionnaire based on a 7-point numerical rating scale. Each subject performed 20 trials for each of the four answer types, resulting in a total of 80 trials. The sequence of the 80 questions and their answers was the same for all subjects. The questions and their answers are provided in a supplementary file (questions_answers.pdf).

Fig. 2
figure 2

Experimental paradigm for a single trial. A question is presented to the subject, and the subject asks the question by reading it out loud to the voice user interface (VUI) system. After the VUI system recognizes the question, an answer corresponding to one of the four answer types (child/simple, child/detailed, adult/simple, and adult/detailed) is provided to the subject, and then a 3 s break is given to the subject. The duration of the answer period varied across trials depending on the question. A questionnaire about the response of the VUI system is then presented, and the subject answers it using a 7-point numeric rating scale for 9 contrasting adjective pairs. The single trial is repeated 80 times, with 20 trials for each of the four answer types. The green arrows indicate the time points at which event information is recorded; the start and end points of the VUI answer are recorded as S1 and S2 in the mrk file, respectively.

Figure 2 shows the time sequence of the mentioned procedure for a single trial. To avoid excessive fatigue, a break was given to each subject whenever the subject wanted during the experiment.

Data recording

While the subjects were interacting with the VUI system, EEG data were measured using an ActiChamp EEG amplifier (Brain Products GmbH, Germany) with a sampling rate of 1,000 Hz. Ground and reference electrodes were attached at Fpz and FCz, respectively. We used sixty-three active electrodes mounted on the scalp according to the international 10-10 system to measure the EEG data (Fp1, Fz, F3, F7, FT9, FC5, FC1, C3, T7, TP9, CP5, CP1, Pz, P3, P7, O1, Oz, O2, P4, P8, TP10, CP6, CP2, Cz, C4, T8, FT10, FC6, FC2, F4, F8, Fp2, AF7, AF3, AFz, F1, F5, FT7, FC3, C1, C5, TP7, CP3, P1, P5, PO7, PO3, POz, PO4, PO8, P6, P2, CPz, CP4, TP8, C6, C2, FC4, FT8, F6, AF8, AF4, and F2).

We simultaneously measured other biosignals along with the EEG data, namely PPG, GSR, ECG, and facial expressions. We attached a PPG sensor to the index finger of the left hand, bipolar GSR sensors to the middle and ring fingers of the left hand, and three ECG electrodes at the lead-I position (Einthoven’s triangle). Facial expressions were recorded for each trial between the start of the answer period and the end of the 3 s waiting period using a web camera (Logitech HD C920, 1080p/30 fps) attached to the monitor and were analyzed with FaceReader 8 (Noldus Information Technology, Wageningen, Netherlands). The other biosignals were sampled at 1,000 Hz, the same as the EEG data, except for the facial video (30 frames per second), and all physiological signals were synchronously recorded19,23,24,25,26,27,28. Figure 3 shows the positions of the physiological sensors (EEG, PPG, GSR, and ECG) and the web camera. The subjects were instructed to concentrate on the instructions presented on the screen and to minimize unnecessary movements, such as eye and muscle movements. No datasets were excluded from the analysis specifically due to motion artifacts.

Fig. 3
figure 3

Schematic diagram of sensor positions for recording EEG, ECG, PPG, GSR, and facial image data.

Factor analysis for questionnaire data

In this study, we used factor analysis to establish a criterion by identifying common factors underlying subjective individual emotions. Factor analysis is a multivariate statistical method that examines the correlations between multiple variables to identify underlying common factors29,30. The questionnaire results were statistically analyzed by factor analysis in Minitab 20 (Minitab Inc., State College, PA, USA) to extract the main factors representing emotions18. The factors were derived based on the correlations between the scores of the 9 adjective pairs. The number of meaningful factors was determined by the eigenvalue criterion, and two factors with eigenvalues greater than 1 were extracted29,30.

Table 1 shows the results of the factor analysis. The major adjective pairs were determined by the variance of each factor. For example, the variance of factor 1 is 3.563, which means that approximately four adjective pairs had a significant influence in constructing that factor. Accordingly, we identified the most influential adjective pairs based on the absolute values of the factor loadings, which indicate the correlation between each adjective pair and the respective factor. Consequently, we selected four adjective pairs for factor 1 and two adjective pairs for factor 2, as denoted by asterisks in Table 1. Factors 1 and 2 explain approximately 39.6% and 17.5% of the total variance, respectively. We examined the overall meaning of the major adjective pairs (i.e., those marked with asterisks in Table 1) for the two factors, and we defined factor 1 as “stability” and factor 2 as “favorability” based on the empirical judgment of experts18. The factor scores for each trial were then calculated for the two main factors (stability and favorability). The factor scores are defined by the formula:

$$F_i = b_{i1}X_1 + b_{i2}X_2 + \cdots + b_{ip}X_p$$
(1)

where Fi is the score of the i-th factor, \(X_1, X_2, \ldots, X_p\) are the variables (the rated scores of the adjective pairs), and \(b_{i1}, b_{i2}, \ldots, b_{ip}\) are the factor loadings between the i-th factor and each variable. In the following sections, we performed binary classification by labelling the data based on these factor scores; the resulting individual classification accuracies are provided in a supplementary file (Individual_Classification_Accuracy.pdf).
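To illustrate how the factor scores can be reproduced from the released questionnaire data, a minimal Python sketch is given below. The published analysis was performed in Minitab 20; here, scikit-learn's maximum-likelihood factor analysis with varimax rotation serves as a stand-in, and the file name, sheet layout, and column selection are assumptions rather than a description of the released file.

```python
# Sketch only: factor analysis of the 9 adjective-pair ratings (the published
# analysis used Minitab 20). File name, sheet layout, and column order are
# assumptions; inspect Questionnaire.xlsx for the actual structure.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# ratings: n_trials x 9 matrix of 7-point scores (one row per trial)
ratings = pd.read_excel("Questionnaire.xlsx").iloc[:, :9].to_numpy(dtype=float)

# Kaiser criterion: keep factors whose correlation-matrix eigenvalues exceed 1
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(ratings, rowvar=False)))[::-1]
n_factors = int((eigvals > 1.0).sum())        # two factors in this dataset

fa = FactorAnalysis(n_components=n_factors, rotation="varimax")
scores = fa.fit_transform(ratings)            # trial-wise factor scores, cf. Eq. (1)
loadings = fa.components_.T                   # 9 x n_factors loading matrix (b_ip)

print("eigenvalues:", np.round(eigvals, 3))
print("loadings:\n", np.round(loadings, 3))
```

Note that scikit-learn estimates factor scores from the fitted latent model rather than by the explicit weighted sum of Eq. (1), so the loadings and scores obtained this way may differ slightly from the Minitab output reported in Table 1.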

Table 1 Results of a factor analysis for 9 contrasting adjective pairs.

EEG data analysis

The EEG data were pre-processed using the EEGLAB toolbox in MATLAB 2019b (MathWorks, Natick, MA, USA). The measured EEG data were notch-filtered at 59–61 Hz to remove power-line noise. Subsequently, the data were bandpass-filtered at 1–80 Hz to remove slow drifts (baseline correction) and to retain the frequency bands of interest, because emotion-related EEG studies have frequently investigated high-frequency bands above 50 Hz alongside low-frequency bands31,32. The filtered EEG data were downsampled from 1,000 Hz to 200 Hz to reduce the computational load. To investigate emotional changes in response to the answers of the VUI, the downsampled EEG data were segmented into the answer period of each trial without any overlap, because user emotions were expected to be induced mostly during this answer period in our question-answer paradigm. The entire, unsegmented data can be found in the raw data folder and can be used flexibly according to specific purposes. Because the level of EEG amplitudes varied between subjects, we empirically identified bad channels showing exceptionally high activation by visually inspecting spectral topographic maps based on power spectral densities (PSDs), and excluded six bad channels (FT9, T7, TP9, FT10, T8, and TP10) from further analysis (see Supplementary Figure 2). Independent component analysis (ICA) was then applied to the EEG data concatenated across all segmented trials to remove physiological artifacts33. In addition, we excluded significantly contaminated EEG trials in which any channel showed an amplitude exceeding ±75 µV34,35, which resulted in an average of 7.39 ± 6.08 of the 80 trials being eliminated across subjects.
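The pre-processing chain described above can be approximated with the following Python/MNE sketch; the published pipeline used EEGLAB in MATLAB, so this is only an illustration. The file name, marker label, ICA dimensionality, and the fixed 3 s epoch length are assumptions (the actual answer period varied per trial).

```python
# Approximate sketch of the pre-processing steps described above using
# MNE-Python (the published pipeline used EEGLAB/MATLAB). File name, marker
# label, ICA dimensionality, and epoch length are assumptions.
import mne

raw = mne.io.read_raw_brainvision("sub01.vhdr", preload=True)   # hypothetical file
raw.notch_filter(freqs=60.0)                  # suppress 60 Hz power-line noise
raw.filter(l_freq=1.0, h_freq=80.0)           # band-pass 1-80 Hz
raw.resample(200)                             # 1,000 Hz -> 200 Hz
raw.drop_channels(["FT9", "T7", "TP9", "FT10", "T8", "TP10"])   # bad channels

# ICA to attenuate physiological artifacts (component selection not shown)
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
ica.apply(raw)

# Epoch the answer period starting at the S1 marker and reject noisy trials
events, all_ids = mne.events_from_annotations(raw)
answer_id = all_ids.get("Stimulus/S  1", 1)   # marker name is an assumption
epochs = mne.Epochs(raw, events, event_id=answer_id, tmin=0.0, tmax=3.0,
                    baseline=None, preload=True,
                    reject=dict(eeg=150e-6))  # peak-to-peak; approximates ±75 µV
```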

Based on the two adjective factors, “stability” and “favorability”, derived from the factor analysis, all single EEG trials were categorized and labeled according to their factor scores on the 2D factor coordinates (stability-favorability), both to confirm differences between the emotions felt during the experiment in terms of EEG patterns and to automatically discriminate different emotions along the factor axes using EEG features. Figure 4 shows the emotional distribution of all subjects in the factor coordinates. For better visibility, mean factor scores for each subject within each quadrant are displayed on the two-dimensional factor coordinate system (stability-favorability) instead of all 80 factor scores for each subject. Detailed individual factor scores are provided in the open-access repository on Figshare (Questionnaire.xlsx)36.
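As an illustration of the labelling step, the short sketch below assigns each trial to one of the four quadrants of the stability-favorability plane; placing the quadrant boundary at a factor score of zero is our assumption.

```python
# Sketch: assign each trial to a quadrant of the 2D factor coordinates.
# `scores` is the (n_trials, 2) matrix [stability, favorability] from the
# factor-analysis sketch above; a zero boundary is assumed for high vs. low.
import numpy as np

high_stab = scores[:, 0] > 0
high_fav = scores[:, 1] > 0
labels = np.where(high_stab & high_fav, "HS-HF",
         np.where(high_stab & ~high_fav, "HS-LF",
         np.where(~high_stab & high_fav, "LS-HF", "LS-LF")))
```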

Fig. 4
figure 4

Grand-averaged emotional distribution for each subject on the 2D factor coordinates (stability-favorability). Each point was estimated by averaging all factor analysis results of 80 trials for each subject in the 2D factor coordinates.

We performed emotion classification for all possible binary combinations of the emotions represented by the factor coordinates; thus, six classifications were performed independently for each subject, e.g., high stability and high favorability vs. high stability and low favorability. However, we excluded a binary combination of emotions from classification when the ratio of inter-class trials was more extreme than 7:3, owing to the significantly unbalanced numbers of trials per class; six subjects had no available cases for binary emotion classification under this criterion and were therefore excluded from the emotion classification. Phase-locking values (PLVs), which are widely used as features for emotion recognition, were computed between all possible pairs of EEG electrodes for six frequency bands37: theta (4–8 Hz), alpha (8–13 Hz), low-beta (13–20 Hz), high-beta (20–30 Hz), low-gamma (30–50 Hz), and high-gamma (50–80 Hz)13,37,38,39,40. Thus, a total of 9,576 PLV features (57C2 × 6 frequency bands) were extracted for each subject. To determine the optimal features for each subject, the inter-class significance of each feature was assessed using a t-test, aiming to extract the most discriminable features41. The derived p-values were sorted in ascending order, and classification was then performed independently while gradually increasing the number of top-ranked features from 1 to 10042,43,44. Classification accuracy was evaluated using 5-fold cross-validation based on a support vector machine (SVM) with a linear kernel, standardized features, and a uniform prior setting to account for class prior probabilities. We evaluated performance using balanced accuracy and F1 score, defined in Eqs. (2) and (3), to compensate for the class imbalance between emotions. The optimal number of features differed across individuals, but maximum classification accuracies were achieved with 3.95 ± 3.18 features, on average, across subjects. Additionally, we investigated the top four PLV features most frequently selected during classification across all subjects for the six binary combinations of emotions represented by the factor coordinates, with respect to frequency band.
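The PLV features can be computed as sketched below; this is a generic Hilbert-transform implementation in Python rather than the authors' MATLAB code, and the band-pass filter order is an assumption.

```python
# Sketch: PLV features between all channel pairs for one frequency band
# (generic Hilbert-transform implementation; the filter order is an assumption).
import numpy as np
from itertools import combinations
from scipy.signal import butter, filtfilt, hilbert

def plv_features(epoch, fs=200.0, band=(4.0, 8.0)):
    """epoch: (n_channels, n_samples) array for a single trial."""
    b, a = butter(4, np.asarray(band) / (fs / 2.0), btype="bandpass")
    phase = np.angle(hilbert(filtfilt(b, a, epoch, axis=1), axis=1))
    plvs = [np.abs(np.mean(np.exp(1j * (phase[i] - phase[j]))))
            for i, j in combinations(range(epoch.shape[0]), 2)]
    return np.asarray(plvs)   # 57C2 = 1,596 values per band; 9,576 over six bands
```

Applying this function to each of the six bands and concatenating the outputs yields the 9,576-dimensional feature vector per trial described above.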

$${Balanced\; accuracy}=\frac{{Specificity}+{Sensitivity}}{2}$$
(2)
$$F1\,{score}=\frac{2({Sensitivity}\times {Precision})}{{Sensitivity}+{Precision}}$$
(3)
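A minimal sketch of the t-test feature ranking and the SVM evaluation with balanced accuracy and F1 score (Eqs. 2 and 3) is given below. The class_weight="balanced" option is used to approximate the uniform-prior setting, and other implementation details are assumptions.

```python
# Sketch: rank PLV features by inter-class t-test p-value, then evaluate a
# linear SVM with 5-fold cross-validation while increasing the feature count.
# class_weight="balanced" approximates the uniform-prior setting.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(X, y, max_features=100):
    """X: (n_trials, n_plv_features) matrix; y: binary labels for one emotion pair."""
    _, p = ttest_ind(X[y == 0], X[y == 1], axis=0)
    order = np.argsort(p)                                 # ascending p-values
    best = {"balanced_accuracy": 0.0, "f1": 0.0, "n_features": 0}
    for k in range(1, max_features + 1):
        clf = make_pipeline(StandardScaler(),
                            SVC(kernel="linear", class_weight="balanced"))
        cv = cross_validate(clf, X[:, order[:k]], y, cv=5,
                            scoring=("balanced_accuracy", "f1"))
        bacc = cv["test_balanced_accuracy"].mean()
        if bacc > best["balanced_accuracy"]:
            best = {"balanced_accuracy": bacc,
                    "f1": cv["test_f1"].mean(), "n_features": k}
    return best
```

In this sketch the feature ranking is computed on the full dataset for simplicity; nesting the t-test within each training fold would give a stricter estimate of generalization performance.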

Because our main concern was to determine whether emotion classification is possible using EEG data measured in an interactive environment, the other physiological data, including the ECG, PPG, GSR, and facial image data, were analyzed only to check their reliability. Thus, only detrending was applied to these data to remove drift, except for the facial expression dataset, which contains 30 image frames per second.
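For reference, this simple detrending step can be reproduced as follows; the variable names are illustrative.

```python
# Sketch: remove slow drift from the auxiliary channels (variable names are
# illustrative; each array holds one continuous 1,000 Hz recording).
from scipy.signal import detrend

ecg_d = detrend(ecg, type="linear")
ppg_d = detrend(ppg, type="linear")
gsr_d = detrend(gsr, type="linear")
```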

Data Records

The datasets are freely downloadable from the open access repository of Figshare36.

We provide raw and pre-processed data for all biosignals. The raw data refer to the unprocessed initial state of the data, measured at a sampling rate of 1,000 Hz. The pre-processed data include noise rejection and segmentation of the answer period, as described in the ‘EEG data analysis’ section. Figure 5 shows the folder structures of the raw and pre-processed data. The datasets are provided as MATLAB files (.mat) because the data pre-processing was performed in MATLAB R2013b (MathWorks, Natick, MA, USA).

Fig. 5
figure 5

Folder structures of raw and pre-processed datasets.

Each subject has a folder (e.g., Sub 1), and the raw data for each subject are stored in two sub-folders, one for the EEG data and one for the other biosignal data. Each sub-folder contains two MATLAB files (cnt and mrk), which include the continuous time-series data for all physiological signals (cnt) and the event information (mrk). Facial expression data are additionally included in the sub-folder for the other biosignal data as video files (.avi) for each trial. The cnt and mrk files of some subjects were split into several files because data recording was temporarily stopped and resumed whenever the subjects wanted to take a rest during the experiment. The questionnaire results for each trial are also included in the subject-specific folders (Questionnaire.xls). The time needed to complete the survey following the VUI’s response varied for each participant, with breaks taken as needed, but the total recording time did not exceed 1.5 hours. The pre-processed data for each subject are stored in two sub-folders, for the EEG and the other biosignal data, in the same way as the raw data, but they contain only cnt files without mrk files because the information in the mrk files was merged into the cnt files during pre-processing. In the pre-processed dataset, the facial expression data are image files (.png; 30 frames per second).
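As a starting point for users of the dataset, the sketch below loads one subject's raw cnt and mrk files with SciPy. The folder path follows the layout in Fig. 5, but the variable names stored inside the .mat files should be checked by inspecting the returned dictionaries; if a file was saved in MATLAB v7.3 format, an HDF5 reader such as h5py would be required instead.

```python
# Sketch: load one subject's raw cnt/mrk files. The path follows the folder
# layout in Fig. 5; inspect the keys of each .mat file before use.
from scipy.io import loadmat

cnt = loadmat("Sub 1/EEG/cnt.mat", squeeze_me=True, struct_as_record=False)
mrk = loadmat("Sub 1/EEG/mrk.mat", squeeze_me=True, struct_as_record=False)

print([k for k in cnt if not k.startswith("__")])   # data variables in cnt
print([k for k in mrk if not k.startswith("__")])   # event variables in mrk
```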

Technical Validation

Figure 6 shows the classification accuracies of each subject in descending order of balanced accuracy. Because each individual responds differently to VUI answers, leading to different discriminable emotions, the pair of most discriminable emotions was identified for each individual, and its classification accuracy was selected as the representative accuracy for that subject. The mean balanced accuracy and F1 score were 69.8 ± 8.8% and 73.3 ± 7.6%, respectively. Twenty-eight of the 39 subjects showed a classification accuracy greater than 70% for at least one of the performance metrics. Individual classification accuracies for all six test cases are provided in a supplementary file (Individual_Classification_Accuracy.pdf). Note that we excluded cases from classification in which the ratio between the two classes was more extreme than 7:3 in terms of the number of trials, owing to the unbalanced numbers of inter-class trials; their classification accuracies were therefore not estimated and are denoted by a dash (−) in the table.

Fig. 6
figure 6

Classification accuracies for each subject based on balanced accuracy (gray) and F1 score (green). The maximum of the six classification accuracies was used for each subject. The red line indicates the 70% threshold based on balanced accuracy.

Figure 7 shows the PLV features most frequently selected during classification for the six binary combinations of emotions represented by the factor coordinates, with respect to frequency band. The selected PLV features were generally found in the theta band, followed by the alpha and low-beta bands, and most of the frontal EEG channels were functionally connected with channels in the temporal or occipital areas. These neurophysiological findings are consistent with reports that frontal-occipital connections in the theta, alpha, and low-beta bands are closely related to emotional changes45,46,47,48,49,50. For instance, in Fig. 7(b), which depicts the condition of ‘high stability and high favorability vs. low stability and low favorability’ (i.e., positive vs. negative emotions), significant patterns are observed in the frontal, temporal, and occipital regions of the left hemisphere, consistent with prior investigations on word-based emotion recognition51,52,53. In the other classification scenarios, frontal-occipital connections were also prevalent, albeit with variations in the frequency bands exhibiting significant connections. Nonetheless, consistent with previous research45,46,47,48,49,50, meaningful connections were consistently observed in the theta, alpha, and low-beta bands.

Fig. 7
figure 7

PLVs most frequently selected for the six combinations of binary emotions for each frequency band: (a) high stability and high favorability vs. low stability and high favorability, (b) high stability and high favorability vs. low stability and low favorability, (c) high stability and high favorability vs. high stability and low favorability, (d) low stability and high favorability vs. low stability and low favorability, (e) low stability and high favorability vs. high stability and low favorability, (f) low stability and low favorability vs. high stability and low favorability.

Figure 8 shows exemplary results for the other physiological data that were measured simultaneously with the EEG data. The representative data were extracted from Subject 25 for the 18th trial, while the subject was receiving an answer from the VUI system; regular ECG and PPG patterns were clearly observed along with a slow GSR fluctuation. Only a single sample facial image is presented because no significant change in facial expression was observed during the trial in the conversational experimental environment.

Fig. 8
figure 8

Examples of four types of physiological data: (A) ECG, (B) PPG, (C) GSR, and (D) facial image. Note that the representative subject gave permission for sharing their facial image.

Usage Notes

In this study, we provided a novel EEG dataset containing emotion-related information acquired during interaction with a VUI-based HCI system. Because most existing EEG datasets for emotion recognition were acquired by presenting external stimuli pre-defined to induce certain emotions, research results obtained using these datasets may not be applicable to the development of real-world HCI applications based on emotion recognition. Because our EEG dataset contains information about emotions induced during realistic interaction, it may be useful for the development of reliable HCI systems based on the classification of induced emotions; note that few such EEG datasets have been published. Although we conducted only a simple verification of data availability in this study, our EEG dataset could be used in various ways for emotion recognition studies. For example, multi-class emotion classification based on the 2D factor coordinates would be possible, e.g., high stability vs. low stability vs. high favorability vs. low favorability, and the estimation of discrete emotional levels is also possible for each of the 9 contrasting adjective pairs based on the questionnaire results. Moreover, we tested various machine learning algorithms, including convolutional neural network-based deep learning algorithms, to determine the optimal classification algorithm for our dataset; the SVM model exhibited the highest mean classification accuracy. In general, a sufficient amount of data is needed to effectively train deep learning models, typically larger than the amount required to train traditional machine learning models, such as the SVM used in this study. Therefore, this multi-channel physiological dataset for emotion recognition can also be used to support the development of advanced emotion recognition classifiers. Furthermore, the effects of the answer parameters and of gender on induced emotions could be explored because our dataset was acquired using two different answer parameters, i.e., voice type and information quantity, from 26 males and 18 females.

Auxiliary physiological data, including ECG, PPG, GSR, and facial images, were measured simultaneously with the EEG data. Even though we only provided an example result for these physiological data because our main concern was EEG-based emotion recognition, they could also be utilized for emotion recognition studies. In fact, many previous studies have used ECG, PPG, GSR, and facial images for emotion recognition3,4,5,53,54,55,56,57,58, and thus the auxiliary physiological data could be used for emotion recognition either independently or together with the EEG dataset in the same way as suggested above. In particular, when combining two or more different physiological signals, the performance of emotion recognition could be improved by appropriately hybridizing them from a machine learning point of view, as shown in hybrid brain-computer interface studies based on the fusion of EEG and near-infrared spectroscopy19,25,26,27,28,41,48,49,50,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79.