1 Introduction

Analysis of EEG in the time domain mainly includes two perspectives: one is task-related EEG latency characteristics, which are mainly analyzed through event-related potentials; the other is memory-related EEG periodicity characteristics, which are closely related to memory attributes in cognitive theory. Previous studies have shown that emotions have a short-term memory attribute; that is, an emotion persists for some time until the next emotional stimulus, and this phenomenon can be measured using EEG [1]. Because short-term EEG signals are usually considered stationary, most studies use 1–4-s EEG segments to identify emotional states [2]. This article focuses on emotion-related temporal memory attributes and explores the correlations between different time scales and emotional states under different rhythms.

We define a window function on the basis of traditional full-response time-scale analysis and determine the local brainwave components of the time-varying signal through continuous movement of the window function. The wavelet transform is used to extract EEG signals of different rhythms, and the whole-time-domain process of each rhythmic brain wave is then decomposed into several stationary, equal-length sub-processes before subsequent analysis and processing. Physiological signals are non-stationary: a long-window physiological signal shows great variability, while a short window cannot provide sufficient information. Choosing a suitable time-window length is therefore crucial for the accuracy and computational efficiency of emotion recognition [3]. The windowing method can be applied to estimate the onset and duration of different emotional states (such as high arousal). In particular, when movie clips or music videos are used to induce emotions, different stimulus materials have different durations, and because their plots differ, the induced emotions arise quickly or slowly. It is therefore more practical and useful to estimate the onset and duration of different emotional states through windowing.

Recurrent neural networks, inspired and validated by cognitive models and supervised learning methods, have proven effective for modeling sequential inputs and outputs (especially temporal data). For example, in cognitive science and computational neuroscience, many physiological research results have laid the foundation for the study of recurrent neural networks [4]. In addition, the idea of biological inspiration has been validated by various experiments [5]. Based on this theoretical support, we use a recurrent neural network to model and identify emotional EEG signals at multiple time scales.

Section 2 first discusses the physiological (temporal) characteristics of emotional EEG, and then mines, analyzes, and applies the binding relationship between emotion and rhythm and between emotion and time. The following sections elaborate on the relevant technologies, principles, and methods involved in the model.

2 Method

2.1 Rhythm and time characteristics analysis of EEG

A large number of studies in neurophysiology and cognitive science have shown that the brain exhibits temporal consistency, latency, and memory attributes during emotional processing. This paper explores the binding relationship between emotion and time scale under different rhythms based on LSTM neural networks, and then addresses emotion recognition. The LSTM-based analysis of the “time” characteristics of EEG mainly includes three parts: rhythm signal extraction, time-scale division, and emotion recognition. These are explained in detail below.

2.1.1 Rhythm signal extraction

The EEG signal can be divided into several frequency bands: δ (0.5–4 Hz, generally appears when infants, or adults in states of quietness, lethargy, or fatigue, are measured), θ (4–8 Hz, generally appears as a person drifts from wakefulness toward sleepiness, or as emotion gradually calms), α (8–13 Hz, generally appears when people are awake and relaxed or have their eyes closed), β (14–30 Hz, generally appears when people are alert or focused), and γ (> 30 Hz, generally appears during short-term memory processes, multisensory information integration, etc.) [6].

We use the discrete wavelet transform to extract the rhythms of the full-band EEG signal. The formula is as follows:

$$W_{f} \left( {j,k} \right) = \mathop \int \nolimits_{R}^{{}} 2^{{\frac{j}{2}}} f\left( t \right)\overline{{\psi \left( {2^{j} t - k} \right)}} \,{\text{d}}t$$
(1)

Here, \(\psi_{j,k} \left( t \right) = \left| a \right|^{{ - \frac{1}{2}}} \psi \left( {\frac{t - b}{a}} \right) = 2^{{\frac{j}{2}}} \psi \left( {2^{j} t - k} \right),\quad j,k \in Z\), where j is the scale parameter and k the translation parameter. As j changes, \(\psi_{j,k} \left( t \right)\) occupies different frequency bands in the frequency domain; as k changes, \(\psi_{j,k} \left( t \right)\) occupies different time intervals in the time domain.

Unlike analyses that operate directly on the wavelet coefficients of different rhythms, we consider the time-domain properties of the rhythms. The wavelet coefficients are therefore reconstructed to obtain the time-domain signals corresponding to the different rhythms. The formula is as follows:

$$f\left( t \right) = C\mathop \sum \limits_{j = - \infty }^{ + \infty } \mathop \sum \limits_{k = - \infty }^{ + \infty } W_{f} \left( {j,k} \right)\psi_{j,k} \left( t \right)$$
(2)
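For concreteness, the following is a minimal Python sketch of this rhythm extraction using the PyWavelets library. The db4 mother wavelet, the 256-Hz sampling rate, and the decomposition-level-to-band mapping are illustrative assumptions rather than choices specified above.

```python
import numpy as np
import pywt

def extract_rhythms(signal, wavelet="db4", levels=5):
    """Decompose a 256-Hz EEG signal (Eq. 1) and reconstruct each rhythm (Eq. 2)."""
    # wavedec returns [cA5, cD5, cD4, cD3, cD2, cD1]; with fs = 256 Hz the
    # detail bands are approximately: cD5: 4-8 Hz (theta), cD4: 8-16 Hz (alpha),
    # cD3: 16-32 Hz (beta), cD2: 32-64 Hz (gamma).
    coeffs = pywt.wavedec(np.asarray(signal), wavelet, level=levels)
    band_index = {"theta": 1, "alpha": 2, "beta": 3, "gamma": 4}
    rhythms = {}
    for name, idx in band_index.items():
        # Keep only the coefficients of the band of interest, zero the rest,
        # then reconstruct to obtain the rhythm's time-domain signal.
        kept = [c if i == idx else np.zeros_like(c) for i, c in enumerate(coeffs)]
        rhythms[name] = pywt.waverec(kept, wavelet)[: len(signal)]
    return rhythms
```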

2.1.2 Division of time scales

To satisfy different time-scale analysis requirements, the rhythm signal is segmented with a rectangular window function. The time scales used for segmentation are 0.25 s, 0.5 s, 0.75 s, 1 s, 2 s, 3 s, 4 s, 5 s, and 6 s, as shown in Fig. 1 and in the sketch after the figure.

Fig. 1

Block diagram of window segmentation
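A minimal sketch of this rectangular-window segmentation, assuming a single-channel rhythm signal sampled at 256 Hz:

```python
import numpy as np

TIME_SCALES = [0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6]  # seconds (Sect. 2.1.2)

def segment(rhythm, fs=256, time_scale=1.0):
    rhythm = np.asarray(rhythm)
    win = int(time_scale * fs)        # samples per rectangular window
    n_win = len(rhythm) // win        # number of complete windows
    # Drop the incomplete tail and stack the windows row by row
    return rhythm[: n_win * win].reshape(n_win, win)
```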

2.2 Long–short-term memory neural network

Recurrent neural networks (RNNs) are a very effective class of connectionist models. On the one hand, they can learn from input data at different time scales in real time; on the other hand, the recurrent connections between units allow them to capture past model states, giving them the function of a memory module. The RNN model was originally proposed by Jordan [7] and Elman [8], and many variants were subsequently derived, such as the time-delay neural network (TDNN) [9] and the echo state network (ESN) [10]. Owing to its recursive design, an RNN can in theory learn historical information of any length. In practice, however, the length of history a standard RNN can learn is limited. The main problem is that a given input affects the state of the hidden units, which in turn affects the network output; as the number of recurrence steps increases, this influence grows or shrinks exponentially, which is known as the vanishing and exploding gradient problem [11]. A large number of research efforts have attempted to solve these problems; the most popular result is the long short-term memory (LSTM) network structure proposed by Hochreiter and Schmidhuber [12].

The LSTM network structure is similar to the standard RNN model except that the summation units of its hidden layer are replaced by memory modules. Each module contains one or more self-connected memory cells and three multiplicative units (input gate, output gate, and forget gate), which implement write, read, and reset functions. Because these multiplicative units allow the LSTM’s memory cells to store and retrieve information over long periods, the vanishing gradient problem is mitigated.

The learning process of LSTM is divided into two steps: forward propagation and back propagation. The back-propagation step computes a loss function from the model’s training output and the true labels, and then adjusts the model weights. Two well-known algorithms are used to compute these weight adjustments: real-time recurrent learning (RTRL) and back propagation through time (BPTT). In this article, we use BPTT for training because it is easy to understand and has lower computational complexity.
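As a concrete illustration (not the paper’s reported implementation), the following is a minimal PyTorch sketch of an LSTM classifier trained with BPTT via automatic differentiation; the layer sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):               # x: (batch, T, input_size)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_size)
        return self.fc(h_n[-1])         # classify from the last hidden state

model = EmotionLSTM(input_size=128)     # hypothetical feature dimension
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """One BPTT update: gradients flow back through all time steps."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```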

The LSTM model has been widely applied to tasks that require long-term memory, such as learning context-sensitive languages [13] and tasks requiring precise timing and counting [14]. In addition, the LSTM model is widely used in practice, for example in protein structure prediction [15], music generation [16], and speech recognition [17].

3 LSTM-based EEG emotion recognition model

Unlike the analysis part, in this part we directly use the optimal time and rhythm characteristics obtained from the analysis to construct an EEG emotion recognition method inspired by the “rhythm–time” characteristics (RT-ERM), and then perform emotion recognition. The framework is shown in Fig. 2. The input is the original multi-channel EEG signal, and the output is the emotion classification based on valence and arousal.

Fig. 2

An emotion recognition model inspired by “rhythm–time” characteristic


Step 1:

The RT-ERM method receives the multi-channel original EEG signals:

$$X\left( t \right) = \left[ {x^{{{\text{CH}}_{1} }} \left( t \right),x^{{{\text{CH}}_{2} }} \left( t \right), \ldots ,x^{{{\text{CH}}_{n} }} \left( t \right)} \right] \in R^{n \times N}$$
(3)

where \(n\) is the number of EEG channels, \(N\) is the number of sample points, and \(x^{{{\text{CH}}_{i} }} \left( t \right)\) is the EEG signal of the \(i\)th channel.

Then, we use the open-source toolbox EEGLab to perform artifact removal and blind source separation based on independent component analysis on the multi-channel EEG signals. The most representative signal of each brain source is denoted S(t).
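As a Python alternative to the MATLAB-based EEGLab workflow described above, the following sketch performs ICA-based artifact removal with MNE-Python; the file name and the excluded component indices are illustrative assumptions.

```python
import mne
from mne.preprocessing import ICA

raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)  # hypothetical file
raw.filter(l_freq=1.0, h_freq=None)        # high-pass helps ICA converge

ica = ICA(n_components=20, random_state=42)
ica.fit(raw)
ica.exclude = [0, 3]                       # components judged to be artifacts
clean = ica.apply(raw.copy())              # S(t): artifact-free source signal
```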


Step 2:

Furthermore, the EEG signal is down-sampled to 256 Hz to obtain the preconditioned EEG signal, as follows:

$$F\left( t \right) = \left[ {f^{{{\text{CH}}_{1} }} \left( t \right),f^{{{\text{CH}}_{2} }} \left( t \right), \ldots ,f^{{{\text{CH}}_{n} }} \left( t \right)} \right] \in R^{n \times M} ,$$
(4)

where F(t) is the preconditioned EEG signal and \(M\) is the number of sample points per channel after downsampling. Rhythm extraction is performed on the preprocessed EEG signal to obtain the rhythm signal of interest:

$$F_{\kappa } \left( t \right) = \left[ {f_{\kappa }^{{{\text{CH}}_{1} }} \left( t \right),f_{\kappa }^{{{\text{CH}}_{2} }} \left( t \right), \ldots ,f_{\kappa }^{{{\text{CH}}_{n} }} \left( t \right)} \right] \in R^{n \times M} ,$$
(5)

where \(\kappa\) represents the emotion-related rhythm obtained from the analysis.


Step 3:

Let tS be the time scale and sR the sampling frequency; the rhythm signals are cut and merged as follows:

$$I_{\kappa } \left( t \right) = \left[ {I_{\kappa }^{1} \left( t \right),I_{\kappa }^{2} \left( t \right), \ldots ,I_{\kappa }^{T} \left( t \right)} \right] \in R^{E \times T} ,$$
(6)

where \(E = n * {\text{tS}}* {\text{sR}}\), T is obtained by dividing the total sample time by tS, and the EEG data vector of the \(i\)th time node is:

$$I_{\kappa }^{i} \left( t \right) = \left[ {f_{\kappa }^{{{\text{CH}}_{1} }} \left( {{\text{tS}}*{\text{sR}}*\left( {i - 1} \right),{\text{tS}}*{\text{sR}}*i} \right), \ldots ,f_{\kappa }^{{{\text{CH}}_{n} }} \left( {{\text{tS}}*{\text{sR}}*\left( {i - 1} \right),{\text{tS}}*{\text{sR}}*i} \right)} \right]$$
(7)
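A minimal sketch of this cut-and-merge step (Eqs. 6–7), assuming the rhythm signal \(F_{\kappa}\) is stored as an n × M NumPy array:

```python
import numpy as np

def cut_and_merge(F_kappa, tS=0.5, sR=256):
    """Cut an (n, M) rhythm signal into T windows of tS seconds and flatten
    each window across channels into a vector of length E = n * tS * sR."""
    n, M = F_kappa.shape
    win = int(tS * sR)                    # samples per window and channel
    T = M // win                          # number of time nodes
    segments = F_kappa[:, : T * win].reshape(n, T, win)
    # I_kappa: one row per time node i, concatenating all n channels
    return segments.transpose(1, 0, 2).reshape(T, n * win)
```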

Step 4:

After being cut and merged, the signal \(I_{\kappa } \left( t \right)\) is input into the LSTM model for recognition learning.


Step 5:

Finally, the emotion classification results based on valence and arousal are obtained from the output of the LSTM network.

4 Results and discussion

4.1 Data description

EEG data: The performance of the proposed emotion recognition model is investigated using the DEAP dataset. DEAP [18] is a multimodal dataset for the analysis of human affective states. Thirty-two healthy participants (50% female), aged between 19 and 37 (mean age 26.9), took part in the experiment. Forty 1-min-long excerpts of music videos were presented in 40 trials to each subject, giving 1280 (32 subjects × 40 trials) emotional state samples. Each sample has a valence rating (ScoreV, an integer between 1 and 9, dividing emotions into positive and negative according to the degree of pleasure they cause) and an arousal rating (ScoreA, an integer between 1 and 9, reflecting the intensity of the emotion felt) [19]. During the experiments, EEG signals were recorded at a 512-Hz sampling frequency, down-sampled to 256 Hz, filtered between 4.0 and 45.0 Hz, and cleaned of EEG artifacts.

Sample distribution: Based on the above DEAP dataset, the proposed model is trained and tested for classifying negative–positive states (ScoreV ≤ 3 or ≥ 7) and passive–active states (ScoreA ≤ 3 or ≥ 7), respectively. The sample size of the negative state is 222; that of the positive state is 373; that of the passive state is 226; and that of the active state is 297.
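For reference, a minimal sketch of this label thresholding on the DEAP preprocessed Python files; the file path is an illustrative assumption, while the 'data'/'labels' layout (labels column 0 = valence, column 1 = arousal) follows the published DEAP distribution.

```python
import pickle
import numpy as np

# One subject's preprocessed DEAP file (hypothetical path)
with open("data_preprocessed_python/s01.dat", "rb") as f:
    trial = pickle.load(f, encoding="latin1")

valence = trial["labels"][:, 0]           # ScoreV in [1, 9]
keep = (valence <= 3) | (valence >= 7)    # drop ambiguous mid-range trials
X = trial["data"][keep]                   # (trials, channels, samples)
y = (valence[keep] >= 7).astype(int)      # 1 = positive, 0 = negative
```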

4.2 Assessment method overview

This section uses four measures to assess the final classification results: accuracy, sensitivity, specificity, and macro-F1. Their formulas and definitions are as follows:

The accuracy: The accuracy (ACC) measures the overall effectiveness of the classification model; it is the ratio of correctly classified samples to the total sample size. The formula is:

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} \times 100\%$$
(8)

The sensitivity: The sensitivity characterizes the validity of the classifier’s recognition of positive samples, also known as the true positive rate (TPR). The formula is:

$${\text{Sensitivity}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} \times 100\%$$
(9)

The specificity: The specificity characterizes the validity of the classifier’s recognition of negative samples, also known as the true negative rate (TNR). The formula is:

$${\text{Specificity}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} \times 100\%$$
(10)

The macro-F1: The macro-F1 jointly considers the recall and precision of the algorithm and therefore reflects its overall performance. The formulas are:

$${\text{macro-F1}} = \frac{{2 \times {\text{macro-P}} \times {\text{macro-R}}}}{{{\text{macro-P}} + {\text{macro-R}}}}$$
(11)
$${\text{macro-P}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} P_{i} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \frac{{{\text{TP}}_{i} }}{{{\text{TP}}_{i} + {\text{FP}}_{i} }}$$
(12)
$${\text{macro-R}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} R_{i} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \frac{{{\text{TP}}_{i} }}{{{\text{TP}}_{i} + {\text{FN}}_{i} }}$$
(13)

Among them, TP counts samples that belong to the positive class and are recognized as positive, while negative-class samples recognized as positive are counted as FP. Similarly, TN counts negative-class samples recognized correctly, and FN counts positive-class samples wrongly recognized as negative.
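For the binary case used here, a minimal sketch computing the four measures (Eqs. 8–13):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """y_true, y_pred: 0/1 arrays, 1 = positive class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)                  # true positive rate (Eq. 9)
    spec = tn / (tn + fp)                  # true negative rate (Eq. 10)
    # Macro averages treat each class as "positive" in turn (Eqs. 12-13)
    macro_p = np.mean([tp / (tp + fp), tn / (tn + fn)])
    macro_r = np.mean([tp / (tp + fn), tn / (tn + fp)])
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
    return acc, sens, spec, macro_f1
```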

In this paper, positive classes correspond to high valence (HV) and high arousal (HA) states, while negative classes correspond to low valence (LV) and low arousal (LA) states. In addition, tenfold cross-validation was used to verify the validity of the recognition, and the average (Mean) and standard deviation (Std.) of the evaluation indices over the 10 experiments were calculated.

4.3 Analysis of binding relationship between time and rhythm

Based on the analysis method in Sect. 3, the “rhythm–time” characteristics of EEG under emotional valence and arousal are analyzed separately. The results and discussion are as follows.

Tables 1, 2, 3, 4 present the recognition results obtained at different time scales for the EEG signals corresponding to the emotional valence dimension under the θ, α, β, and γ rhythms, respectively.

Table 1 The classification results of RT-ERM with different time scales for θ rhythm under the dimension of emotion valence
Table 2 The classification results of RT-ERM with different time scales for α rhythm under the dimension of emotion valence
Table 3 The classification results of RT-ERM with different time scales for β rhythm under the dimension of emotion valence
Table 4 The classification results of RT-ERM with different time scales for γ rhythm under the dimension of emotion valence

As can be seen from Tables 1, 2, 3, 4, the four rhythms perform differently from one another. For the θ rhythm, the 2.0-s time scale achieves the highest ACC (61.59%), TNR (63.17%), and macro-F1 (60.9783%), corresponding to the best recognition effect, while the 0.25-s time scale clearly degrades recognition. For the α rhythm, the 6.0-s time scale reaches the best ACC (61.06%) and TNR (62.1%) and the 5.0-s scale the best TPR (63.16%); however, the 0.25-s scale gives the best overall recognition effect with the highest macro-F1 (61.0404%). For the β rhythm, the 0.75-s time scale behaves like the 2.0-s scale in the θ rhythm, achieving the highest ACC (62.12%), TPR (66.85%), and macro-F1 (63.7077%) and hence the best recognition effect. When the time scale is smaller than 4.0 s, the β rhythm is better at identifying positive samples, and the opposite holds beyond 4.0 s. For the γ rhythm, the 5.0-s time scale has an ACC of 60.52% and the best macro-F1 (60.9008%), while the 2.0-s scale has the highest ACC (60.54%), the highest TNR (61.58%), and a macro-F1 of 60.358%. These two scales behave so similarly that we regard both as corresponding to the best recognition effect, and we conclude that high rhythms are good at recognizing valence emotions (positive and negative emotions).

Tables 5, 6, 7, 8 present the recognition results obtained at different time scales for the EEG signals corresponding to the emotional arousal dimension under the θ, α, β, and γ rhythms, respectively.

Table 5 The classification results of RT-ERM with different time scales for θ rhythm under the dimension of emotion arousal
Table 6 The classification results of RT-ERM with different time scales for α rhythm under the dimension of emotion arousal
Table 7 The classification results of RT-ERM with different time scales for β rhythm under the dimension of emotion arousal
Table 8 The classification results of RT-ERM with different time scales for γ rhythm under the dimension of emotion arousal

According to Tables 5, 6, 7, 8, for the θ rhythm, the 0.5-s time scale corresponds to the best recognition effect, with the highest average ACC (69.1%), the highest average TPR (65.5%), and the highest average macro-F1 (67.9658%). Under the arousal dimension, the θ rhythm achieves its best results with small time scales (such as 0.25 s and 0.5 s), contrary to the valence dimension. For the α rhythm, the 0.25-s time scale corresponds to the best recognition, and it performs better in classifying negative samples. For the β rhythm, the 0.5-s time scale does well in recognizing positive samples while 0.75 s does the opposite, and the two scales obtain close macro-F1 results (63.733% and 63.8073%, respectively). The results for the γ rhythm are more complicated: the 0.25-s and 2.0-s time scales achieve the highest average ACC; 2.0 s and 6.0 s perform best in recognizing negative samples; and 1.0 s and 3.0 s achieve the best results in distinguishing positive samples. We consider 0.25 s and 3.0 s to correspond to the best recognition effect, owing to their highest macro-F1 values (61.8742% and 61.8743%). Overall, the results show that low rhythms (such as the θ rhythm) better identify emotional arousal.

4.4 Emotion recognition results comparison and analysis

From Table 9, it can be seen that most emotion recognition studies using the DEAP database select a time window of 1–8 s, and the time window with the highest recognition accuracy is 1–2 s.

Table 9 Comparison of results that use EEG signals of DEAP dataset for emotion recognition

In the statistical results in Table 9, Kuai [25], using rhythm synchronization patterns with a joint time–frequency–space correlation model (RSP-ERM) to distinguish emotions, obtained average classification rates of 64% (arousal) and 66.6% (valence). In our work, for valence, RT-ERM obtains its highest average recognition accuracy (62.12%) at the 0.75-s time scale with the β rhythm; for arousal, RT-ERM obtains its highest average recognition accuracy (69.1%) at the 0.5-s time scale with the θ rhythm, which is 0.7% higher than traditional SVM or KNN models [20] and 2.5% higher than Kuai’s [25] result. These statistical results show that the LSTM-based deep learning network can effectively identify the emotional state and achieve a good recognition effect.

5 Conclusions

This paper discusses the temporal memory characteristics of the brain during emotional information processing, describes the theoretical basis and advantages of recurrent neural networks for mining temporal characteristics, and constructs an emotion analysis and recognition model to achieve effective recognition and analysis of emotions. Taking the rhythm oscillation mechanism as the brain’s default mode, we discussed the emotion mechanisms at the different time scales corresponding to different rhythms. The experimental results show that high rhythms, such as the β and γ rhythms, are good at recognizing valence emotions, while low rhythms, such as the θ rhythm, do well in recognizing arousal emotions. For example, in our experiments the average recognition accuracy reaches 69.1% at the 0.5-s time scale with the θ rhythm, an increase of 2.5% over the existing EEG-based emotion analysis using rhythm characteristics (the RSP-ERM model [25]). It is noteworthy that smaller time scales show better recognition performance in both the valence and arousal dimensions. In summary, the “rhythm–time” characteristics obtained through the RT-ERM affective model analysis are not only of great significance for a deeper understanding of the physiological properties of the brain during emotional information processing, but also help guide the application of physiologically inspired emotion recognition models.