EEGNet
E-mail: vernon.j.lawhern.civ@mail.mil
Abstract
Objective. Brain–computer interfaces (BCI) enable direct communication with a computer,
using neural activity as the control signal. This neural signal is generally chosen from a
variety of well-studied electroencephalogram (EEG) signals. For a given BCI paradigm,
feature extractors and classifiers are tailored to the distinct characteristics of its expected EEG
control signal, limiting its application to that specific signal. Convolutional neural networks
(CNNs), which have been used in computer vision and speech recognition to perform
automatic feature extraction and classification, have successfully been applied to EEG-based
BCIs; however, they have mainly been applied to single BCI paradigms and thus it remains
unclear how these architectures generalize to other paradigms. Here, we ask if we can design
a single CNN architecture to accurately classify EEG signals from different BCI paradigms,
while simultaneously being as compact as possible. Approach. In this work we introduce
EEGNet, a compact convolutional neural network for EEG-based BCIs. We introduce the
use of depthwise and separable convolutions to construct an EEG-specific model which
encapsulates well-known EEG feature extraction concepts for BCI. We compare EEGNet,
both for within-subject and cross-subject classification, to current state-of-the-art approaches
across four BCI paradigms: P300 visual-evoked potentials, error-related negativity responses
(ERN), movement-related cortical potentials (MRCP), and sensory motor rhythms (SMR).
Main results. We show that EEGNet generalizes across paradigms better than, and achieves
comparably high performance to, the reference algorithms when only limited training data is
available across all tested paradigms. In addition, we demonstrate three different approaches
to visualize the contents of a trained EEGNet model to enable interpretation of the learned
features. Significance. Our results suggest that EEGNet is robust enough to learn a wide
variety of interpretable features over a range of BCI tasks. Our models can be found at:
https://github.com/vlawhern/arl-eegmodels.
Author to whom any correspondence should be addressed.
Table 1. Summary of the data collections used in this study. Class imbalance, if present, is given as odds; i.e. an odds of 2:1 means the
class imbalance is 2/3 of the data for class 1 to 1/3 of the data for class 2. For the P300 and ERN datasets, the class imbalance is
subject-dependent; therefore, the odds are given as the average class imbalance over all subjects.
Paradigm  Feature type     Bandpass filter (Hz)  # of subjects  Trials per subject  # of classes  Class imbalance?
P300      ERP              1–40                  15             ∼2000               2             Yes, ∼5.6:1
ERN       ERP              1–40                  26             340                 2             Yes, ∼3.4:1
MRCP      ERP/Oscillatory  0.1–40                13             ∼1100               2             No
SMR       Oscillatory      4–40                  9              288                 4             No
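As a small worked illustration of the odds notation in the caption of table 1, the sketch below converts odds into class proportions and into a minority-class loss weight. The weighting rule (majority class weight 1, minority class weight equal to the odds rounded up to the next integer) follows the description in section 2.3; the function names are hypothetical and are not taken from the authors' software.

```python
import math


def odds_to_proportions(majority_odds, minority_odds=1.0):
    """Convert odds such as 2:1 into the fraction of trials in each class."""
    total = majority_odds + minority_odds
    return majority_odds / total, minority_odds / total


def minority_class_weight(majority_odds, minority_odds=1.0):
    """Loss weight of the minority class when the majority class has weight 1.

    Fractional odds are rounded up to the next integer (rule from section 2.3).
    """
    return math.ceil(majority_odds / minority_odds)


p_major, p_minor = odds_to_proportions(2, 1)
print(round(p_major, 3), round(p_minor, 3))  # 0.667 0.333
print(minority_class_weight(5.6))            # 6 (P300 odds from table 1)
print(minority_class_weight(3.4))            # 4 (ERN odds from table 1, same rule)
```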
detect a high amplitude and low frequency EEG response to a known, time-locked external stimulus. They are generally robust across subjects and contain well-stereotyped waveforms, enabling the time course of the ERP to be modeled through machine learning efficiently [46]. In contrast to ERP-based BCIs, which rely mainly on the detection of the ERP waveform from some external event or stimulus, oscillatory BCIs use the signal power of specific EEG frequency bands for external control and are generally asynchronous [47]. When oscillatory signals are time-locked to an external stimulus, they can be represented through event-related spectral perturbation (ERSP) analyses [48]. Oscillatory BCIs are more difficult to train, generally due to the lower signal-to-noise ratio (SNR) as well as greater variation across subjects [47]. A summary of the data used in this manuscript can be found in table 1.

2.1.1. Dataset 1: P300 event-related potential (P300). The P300 event-related potential is a stereotyped neural response to novel visual stimuli [49]. It is commonly elicited with the visual oddball paradigm, where participants are shown repetitive ‘non-target’ visual stimuli that are interspersed with infrequent ‘target’ stimuli at a fixed presentation rate (for example, 1 Hz). Observed over the parietal cortex, the P300 waveform is a large positive deflection of electrical activity observed approximately 300 ms post stimulus onset, the strength of the observed deflection being inversely proportional to the frequency of the target stimuli. The P300 ERP is one of the strongest neural signatures observable by EEG, especially when targets are presented infrequently [49]. When the image presentation rate increases to 2 Hz or more, it is commonly referred to as rapid serial visual presentation (RSVP), which has been used to develop BCIs for large image database triage [50–52].
The EEG data used here have been previously described in [51]; a brief description is given below. 18 participants volunteered for an RSVP BCI study. Participants were shown images of natural scenery at a 2 Hz rate, with images either containing a vehicle or person (target), or with no vehicle or person present (non-target). Participants were instructed to press a button with their dominant hand when a target image was shown. The target/non-target ratio was 20%/80%. Data from three participants were excluded from the analysis due to excessive artifacts and/or noise within the EEG data. Data from the remaining 15 participants (9 male and 14 right-handed) who ranged in age from 18 to 57 years (mean age 39.5 years) were further analyzed. EEG recordings were digitally sampled at 512 Hz from 64 scalp electrodes arranged in a 10–10 montage using a BioSemi active two system (Amsterdam, The Netherlands). Continuous EEG data were referenced offline to the average of the left and right earlobes, digitally bandpass filtered, using an FIR filter implemented in EEGLAB [53], to 1–40 Hz and downsampled to 128 Hz. EEG trials of target and non-target conditions were extracted at [0, 1] s post stimulus onset, and used for a two-class classification.

2.1.2. Dataset 2: Feedback error-related negativity (ERN). Error-related negativity potentials are perturbations of the EEG following an erroneous or unusual event in the subject’s environment or task. They can be observed in a variety of tasks, including time interval production paradigms [54] and forced-choice paradigms [55, 56]. Here we focus on the feedback error-related negativity (ERN), which is an amplitude perturbation of the EEG following the perception of an erroneous feedback produced by a BCI. The feedback ERN is characterized as a negative error component approximately 350 ms, followed by a positive component approximately 500 ms, after visual feedback (see figure 7 of [57] for an illustration). The detection of the feedback ERN provides a mechanism to infer, and to possibly correct in real-time, the incorrect output of a BCI. This two-stage system has been proposed as a hybrid BCI in [58, 59] and has been shown to improve the performance of a P300 speller in online applications [60].
The EEG data used here come from [57] and were used in the ‘BCI Challenge’ hosted by Kaggle (www.kaggle.com/c/inria-bci-challenge); a brief description is given below. 26 healthy participants (16 for training, 10 for testing) participated in a P300 speller task, a system which uses a random sequence of flashing letters, arranged in a 6 × 6 grid, to elicit the P300 response [61]. The goal of the challenge was to determine whether the feedback of the P300 speller was correct or incorrect. The EEG data were originally recorded at 600 Hz using 56 passive Ag/AgCl EEG sensors (VSM-CTF compatible system) following the extended 10–20 system for electrode placement. Prior to our analysis, the EEG data were bandpass filtered, using an FIR filter implemented in EEGLAB [53], to 1–40 Hz and downsampled to 128 Hz. EEG trials of correct and incorrect feedback were extracted at [0, 1.25] s post feedback presentation and used as features for a two-class classification.

2.1.3. Dataset 3: Movement-related cortical potential (MRCP). Some neural activities contain both an ERP and an oscillatory component. One particular example of this is the movement-related cortical potential (MRCP), which can be elicited by voluntary movements of the hands and feet and is observable through EEG along the central and midline
J. Neural Eng. 15 (2018) 056013 V J Lawhern et al
electrodes, contralateral to the hand or foot movement [62–65]. The MRCP components can be seen before movement onset (a slow 0–5 Hz readiness potential [66, 67] and an early desynchronization in the 10–12 Hz frequency band), at movement onset (a slow motor potential [67, 68]), and after movement onset (a late synchronization of 20–30 Hz activity approximately 1 s after movement execution). The MRCP has been used previously to develop motor control BCIs for both healthy and physically disabled patients [69–71].
The EEG data used here have been previously described in [72]; a brief description is given below. In this study, 13 subjects performed self-paced finger movements using the left index, left middle, right index, or right middle fingers. The data were recorded using a 256 channel BioSemi Active II system at 1024 Hz. Due to extensive signal noise present in the data, the EEG data were first processed with the PREP pipeline [73]. The data were referenced to linked mastoids, bandpass filtered, using an FIR filter implemented in EEGLAB [53], between 0.1 Hz and 40 Hz, and then downsampled to 128 Hz. We further downsampled the channel space to the standard 64 channel BioSemi montage. The index and middle finger blocks for each hand were combined for binary classification of movements originating from the left or right hand. EEG trials of left and right hand finger movements were extracted at [−0.5, 1] s around finger movement onset and used for a two-class classification.

2.1.4. Dataset 4: Sensory motor rhythm (SMR). A common control signal for oscillatory-based BCI is the sensorimotor rhythm (SMR), wherein mu (8–12 Hz) and beta (18–26 Hz) bands desynchronize over the sensorimotor cortex contralateral to an actual or imagined movement. The SMR is very similar to the oscillatory component of the MRCP. Although SMR-based BCIs can facilitate nuanced, endogenous BCI control, the underlying signals tend to be weak and highly variable across and within subjects, conventionally demanding user-training (neurofeedback) and long calibration times (20 min) in order to achieve reasonable performance [45].
The EEG data used here come from BCI Competition IV Dataset 2A [74] (called the SMR dataset for the remainder of the manuscript). The data consist of four classes of imagined movements of left and right hands, feet and tongue recorded from nine subjects. The EEG data were originally recorded using 22 Ag/AgCl electrodes, sampled at 250 Hz and bandpass filtered between 0.5 and 100 Hz. We resampled the timeseries to 128 Hz, and followed the same EEG pre-processing procedure as described in [32], using software that was provided by the authors; a brief summary is given here. The data were causally filtered using a third-order Butterworth filter in the 4–40 Hz frequency band to minimize the influence of class-discriminative eye movements. The EEG signals were then standardized with an exponential moving average window with a decay factor of 0.999 (further details can be found in section A.7 of [32]). For both the training and test sets we epoched the data at [0.5, 2.5] s post cue onset (the same window which was used in [40, 45]). Note that we make predictions for only this time range on the test set. We perform a four-class classification using accuracy as the summary measure.

2.2. Classification methods

2.2.1. EEGNet: compact CNN architecture. Here we introduce EEGNet, a compact CNN architecture for EEG-based BCIs that (1) can be applied across several different BCI paradigms, (2) can be trained with very limited data and (3) can produce neurophysiologically interpretable features. A visualization and full description of the EEGNet model can be found in figure 1 and table 2, respectively, for EEG trials, collected at a 128 Hz sampling rate, having C channels and T time samples. We fit the model using the Adam optimizer, using default parameters as described in [75], minimizing the categorical cross-entropy loss function. We run 500 training iterations (epochs) and perform validation stopping, saving the model weights which produced the lowest validation set loss. All models were trained on an NVIDIA Quadro M6000 GPU, with CUDA 9 and cuDNN v7, in TensorFlow [76], using the Keras API [77]. We omit the use of bias units in all convolutional layers. Note that, while all convolutions are one-dimensional, we use two-dimensional convolution functions for ease of software implementation. Our software implementation can be found at https://github.com/vlawhern/arl-eegmodels.

• In block 1, we perform two convolutional steps in sequence. First, we fit F1 2D convolutional filters of size (1, 64), with the filter length chosen to be half the sampling rate of the data (here, 128 Hz), outputting F1 feature maps containing the EEG signal at different band-pass frequencies. Setting the length of the temporal kernel at half the sampling rate allows for capturing frequency information at 2 Hz and above. We then use a depthwise convolution [43] of size (C, 1) to learn a spatial filter. In CNN applications for computer vision the main benefit of a depthwise convolution is reducing the number of trainable parameters to fit, as these convolutions are not fully-connected to all previous feature maps (see figure 1 for an illustration). Importantly, when used in EEG-specific applications, this operation provides a direct way to learn spatial filters for each temporal filter, thus enabling the efficient extraction of frequency-specific spatial filters (see the middle column of figure 1). A depth parameter D controls the number of spatial filters to learn for each feature map (D = 1 is shown in figure 1 for illustration purposes). This two-step convolutional sequence is inspired in part by the filter-bank common spatial pattern (FBCSP) algorithm [78] and is similar in nature to another decomposition technique, bilinear discriminant component analysis [79]. We keep both convolutions linear as we found no significant gains in performance when using nonlinear activations. We apply batch normalization [80] along the feature map dimension before applying the exponential linear unit (ELU) nonlinearity [81]. To help regularize our model, we use the dropout technique [82]. We set the dropout probability to 0.5 for within-subject classification to help prevent over-fitting when training on small sample sizes, whereas we set the dropout probability to 0.25 in cross-subject classification, as the training set sizes are much larger
Figure 1. Overall visualization of the EEGNet architecture. Lines denote the convolutional kernel connectivity between inputs and
outputs (called feature maps). The network starts with a temporal convolution (second column) to learn frequency filters, then uses
a depthwise convolution (middle column), connected to each feature map individually, to learn frequency-specific spatial filters. The
separable convolution (fourth column) is a combination of a depthwise convolution, which learns a temporal summary for each feature
map individually, followed by a pointwise convolution, which learns how to optimally mix the feature maps together. Full details about the
network architecture can be found in table 2.
Table 2. EEGNet architecture, where C = number of channels, T = number of time points, F1 = number of temporal filters, D = depth
multiplier (number of spatial filters), F2 = number of pointwise filters, and N = number of classes, respectively. For the Dropout layer, we
use p = 0.5 for within-subject classification and p = 0.25 for cross-subject classification (see section 2.2.1 for more details).
Block       Layer            # filters  Size     # params                     Output               Activation  Options
1           Input            -          -        -                            (C, T)               -           -
1           Reshape          -          -        -                            (1, C, T)            -           -
1           Conv2D           F1         (1, 64)  64 * F1                      (F1, C, T)           Linear      Mode = same
1           BatchNorm        -          -        2 * F1                       (F1, C, T)           -           -
1           DepthwiseConv2D  D * F1     (C, 1)   C * D * F1                   (D * F1, 1, T)       Linear      Mode = valid, depth = D, max norm = 1
1           BatchNorm        -          -        2 * D * F1                   (D * F1, 1, T)       -           -
1           Activation       -          -        -                            (D * F1, 1, T)       ELU         -
1           AveragePool2D    -          (1, 4)   -                            (D * F1, 1, T // 4)  -           -
1           Dropout*         -          -        -                            (D * F1, 1, T // 4)  -           p = 0.25 or p = 0.5
2           SeparableConv2D  F2         (1, 16)  16 * D * F1 + F2 * (D * F1)  (F2, 1, T // 4)      Linear      Mode = same
2           BatchNorm        -          -        2 * F2                       (F2, 1, T // 4)      -           -
2           Activation       -          -        -                            (F2, 1, T // 4)      ELU         -
2           AveragePool2D    -          (1, 8)   -                            (F2, 1, T // 32)     -           -
2           Dropout*         -          -        -                            (F2, 1, T // 32)     -           p = 0.25 or p = 0.5
2           Flatten          -          -        -                            (F2 * (T // 32))     -           -
Classifier  Dense            -          -        N * (F2 * T // 32)           N                    Softmax     Max norm = 0.25
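To make the Output column of table 2 concrete, the following minimal sketch traces the feature-map shapes through the network for a given configuration. It only restates the shape expressions from table 2 in plain Python; it is not the authors' Keras implementation, and the function name is hypothetical.

```python
def eegnet_output_shapes(C, T, F1, D, F2, N):
    """Return (layer, output shape) pairs following the Output column of table 2."""
    return [
        ("Input", (C, T)),
        ("Reshape", (1, C, T)),
        ("Conv2D", (F1, C, T)),                          # temporal filters, mode = same
        ("DepthwiseConv2D", (D * F1, 1, T)),             # spatial filters, mode = valid
        ("AveragePool2D (1, 4)", (D * F1, 1, T // 4)),
        ("SeparableConv2D", (F2, 1, T // 4)),            # mode = same
        ("AveragePool2D (1, 8)", (F2, 1, T // 32)),
        ("Flatten", (F2 * (T // 32),)),
        ("Dense", (N,)),
    ]


# EEGNet-8,2 on a 64-channel, 1 s trial sampled at 128 Hz, two classes
for layer, shape in eegnet_output_shapes(C=64, T=128, F1=8, D=2, F2=16, N=2):
    print(f"{layer:22s} {shape}")
```

For this configuration the flattened feature vector has 16 * (128 // 32) = 64 entries, which is what the softmax classifier receives.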
(see section 2.3 for more details on our within- and cross-subject analyses). We apply an average pooling layer of size (1, 4) to reduce the sampling rate of the signal to 32 Hz. We also regularize each spatial filter by using a maximum norm constraint of 1 on its weights; $\|w\|_2 < 1$.
• In block 2, we use a separable convolution, which is a depthwise convolution (here, of size (1, 16), representing 500 ms of EEG activity at 32 Hz) followed by F2 (1, 1) pointwise convolutions [43]. The main benefits of separable convolutions are (1) reducing the number of parameters to fit and (2) explicitly decoupling the relationship within and across feature maps by first learning a kernel summarizing each feature map individually, then optimally merging the outputs afterwards. When used for EEG-specific applica-
Table 3. Number of trainable parameters per model and per dataset for all CNN-based models. We see that the EEGNet models are up
to two orders of magnitude smaller than both DeepConvNet and ShallowConvNet across all datasets. Note that we use a temporal kernel
length of 32 samples for the SMR dataset (marked *) as the data were high-passed at 4 Hz.
Dataset  Trial length (s)  DeepConvNet  ShallowConvNet  EEGNet-4,2  EEGNet-8,2
P300     1                 174 127      104 002         1066        2258
ERN      1.25              169 927      91 602          1082        2290
MRCP     1.5               175 727      104 722         1098        2322
SMR*     2                 152 219      40 644          796         1716
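The EEGNet columns of table 3 can be cross-checked against the per-layer expressions in table 2. The sketch below is such a check under stated assumptions: F2 = D * F1, trials sampled at 128 Hz, 64 channels for P300 and MRCP and 22 for SMR, and N bias terms in the final dense layer. The bias term is our assumption (table 2 lists only the dense weights), chosen because it makes the totals agree with table 3. The function name is hypothetical.

```python
def eegnet_trainable_params(C, T, F1, D, N, kern_length=64):
    """Sum the '# params' entries of table 2, assuming F2 = D * F1."""
    F2 = D * F1
    temporal = kern_length * F1              # Conv2D, no bias
    bn_temporal = 2 * F1                     # BatchNorm scale and shift
    spatial = C * D * F1                     # DepthwiseConv2D, no bias
    bn_spatial = 2 * D * F1
    separable = 16 * D * F1 + F2 * (D * F1)  # depthwise (1, 16) + pointwise
    bn_separable = 2 * F2
    dense = N * F2 * (T // 32) + N           # '+ N' is the assumed bias term
    return (temporal + bn_temporal + spatial + bn_spatial
            + separable + bn_separable + dense)


print(eegnet_trainable_params(C=64, T=128, F1=4, D=2, N=2))                  # 1066 (P300)
print(eegnet_trainable_params(C=64, T=128, F1=8, D=2, N=2))                  # 2258 (P300)
print(eegnet_trainable_params(C=64, T=192, F1=8, D=2, N=2))                  # 2322 (MRCP)
print(eegnet_trainable_params(C=22, T=256, F1=8, D=2, N=4, kern_length=32))  # 1716 (SMR)
```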
tions this operation separates learning how to summarize individual feature maps in time (the depthwise convolution) from how to optimally combine the feature maps (the pointwise convolution). This operation is also particularly useful for EEG signals as different feature maps may represent data at different time-scales of information. In our case we first learn a 500 ms ‘summary’ of each feature map, then combine the outputs afterwards. An average pooling layer of size (1, 8) is used for dimension reduction.
• In the classification block, the features are passed directly to a softmax classification with N units, N being the number of classes in the data. We omit the use of a dense layer for feature aggregation prior to the softmax classification layer to reduce the number of free parameters in the model, inspired by the work in [83].

We investigate several different configurations of the EEGNet architecture by varying the number of temporal filters, F1, and the number of spatial filters per temporal filter, D, to learn. We set F2 = D ∗ F1 (the number of temporal filters along with their associated spatial filters from block 1) for the duration of the manuscript, although in principle F2 can take any value; F2 < D ∗ F1 denotes a compressed representation, learning fewer feature maps than inputs, whereas F2 > D ∗ F1 denotes an overcomplete representation, learning more feature maps than inputs. We use the notation EEGNet-F1,D to denote the number of temporal and spatial filters to learn; i.e. EEGNet-4,2 denotes learning four temporal filters and two spatial filters per temporal filter.

2.2.2. Comparison with existing CNN approaches. We compare the performance of EEGNet against the DeepConvNet and ShallowConvNet models proposed by [32]; full table descriptions of both models can be found in the appendix. We implemented these models in TensorFlow and Keras, following the descriptions found in the paper. As their architectures were originally designed for 250 Hz EEG signals (as opposed to the 128 Hz signals used here) we divided the lengths of temporal kernels and pooling layers in their architectures by two to correspond approximately to the sampling rate used in our models. We train these models in the same way we train the EEGNet model (see section 2.2.1).
The DeepConvNet architecture consists of five convolutional layers with a softmax layer for classification (see figure 1 of [32]). The ShallowConvNet architecture consists of two convolutional layers (temporal, then spatial), a squaring nonlinearity ($f(x) = x^2$), an average pooling layer and a log nonlinearity ($f(x) = \log(x)$). We would like to emphasize that the ShallowConvNet architecture was designed specifically for oscillatory signal classification (by extracting features related to log band-power); thus, it may not work well on ERP-based classification tasks. However, the DeepConvNet architecture was designed to be a general-purpose architecture that is not restricted to specific feature types [32], and thus it serves as a more valid comparison to EEGNet. Table 3 shows the number of trainable parameters per model across all CNN models.

2.2.3. Comparison with traditional approaches. We also compare the performance of EEGNet to that of the best performing traditional approach for each individual paradigm. For all ERP-based data analyses (P300, ERN, MRCP) the traditional approach is the approach which won the Kaggle BCI competition (code and documentation at http://github.com/alexandrebarachant/bci-challenge-ner-2015), which uses a combination of xDAWN spatial filtering [84], Riemannian geometry [85, 86], channel subset selection and L1 feature regularization (referred to as xDAWN + RG for the remainder of the manuscript). Here we provide a summary of the approach, which is done in five steps:

1. Train two sets of five xDAWN spatial filters, one set for each class of a binary classification task, using the ERP template concatenation method as described in [86, 87].
2. Perform EEG electrode selection through backward elimination [88] to keep only the most relevant 35 channels.
3. Project the covariance matrices onto the tangent space using the log-Euclidean metric [85, 89].
4. Perform feature normalization using an L1 ratio of 0.5, signifying an equal weight for L1 and L2 penalties. An L1 penalty encourages the sum of the absolute values of the parameters to be small, whereas an L2 penalty encourages the sum of the squares of the parameters to be small (a theoretical overview of these penalties can be found in [90]).
5. Perform classification using an elastic net regression.

We use the same xDAWN + RG model parameters across all comparisons (P300, ERN, MRCP) with the exception of the initial number of EEG channels to use, which was set to 56 for ERN and 64 for P300 and MRCP. While the original solution used an ensemble of bagged classifiers, for this analysis we only compared a single model with this approach to a single EEGNet model on identical training and test sets, as we expect any gains from ensemble learning to benefit both approaches equally. The original solution also used a set of ‘meta features’ that were specific to that data collection. As the goal of this work is to investigate a general-purpose CNN model for EEG-based BCIs, we omitted the use of these features as they are specific to that particular data collection.
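As an illustration of the L1 ratio mentioned in step 4 of the xDAWN + RG summary, the sketch below evaluates an elastic-net penalty for a toy weight vector. The exact scaling is an assumption: we use the common convention in which the L2 term carries a factor of 1/2, which may differ from the implementation in the competition code. The function name, the `strength` parameter and the weights are hypothetical.

```python
def elastic_net_penalty(weights, l1_ratio=0.5, strength=1.0):
    """Elastic-net penalty mixing L1 and L2 terms.

    l1_ratio = 1 gives a pure L1 (lasso) penalty, l1_ratio = 0 a pure L2
    (ridge) penalty; 0.5 weights both terms equally. The 1/2 factor on the
    L2 term is an assumed convention.
    """
    l1 = sum(abs(w) for w in weights)   # sum of absolute values
    l2 = sum(w * w for w in weights)    # sum of squares
    return strength * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


toy_weights = [0.5, -1.0, 0.0, 2.0]     # hypothetical values
print(elastic_net_penalty(toy_weights, l1_ratio=0.5))  # 3.0625
print(elastic_net_penalty(toy_weights, l1_ratio=1.0))  # 3.5 (L1 only)
print(elastic_net_penalty(toy_weights, l1_ratio=0.0))  # 2.625 (L2 only)
```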
For oscillatory-based classification of SMR, the traditional approach is our own implementation of the one-versus-rest (OVR) filter-bank common spatial pattern (FBCSP) algorithm as described in [78]. Here we provide a brief summary of our approach:

1. Bandpass filter the EEG signal into nine non-overlapping filter banks in 4 Hz steps, starting at 4 Hz: 4–8 Hz, 8–12 Hz, ..., 36–40 Hz.
2. As the classification problem is multi-class, we use OVR classification, which requires that we train a classifier for each OVR combination, of which there are four here (class 1 versus all others, class 2 versus all others, etc). We train two CSP filter pairs (four filters total) for each filter bank on the training data using the auto-covariance shrinkage method by [91]. This gives a total of 36 features (nine filter banks × four CSP filters) for each trial and each OVR combination.
3. Train an elastic-net logistic regression classifier [92] for each OVR combination. We set the elastic net penalty α = 0.95.
4. Find the optimal λ value for the elastic-net logistic regression that maximizes the validation set accuracy by evaluating the trained classifiers on a held-out validation set. The multi-class label for each trial is given by the classifier that produces the highest probability among the four OVR classifiers.
5. Apply the trained classifiers to the test set, using the λ values obtained in step 4.

Note that this approach differs slightly from the original technique as proposed in [78], where they use a naive Bayes Parzen window classifier. We opted to use an elastic net logistic regression for ease of implementation, and because it has been used in existing software implementations of FBCSP (for example, in BCILAB [93]).

2.3. Data analysis

Classification results are reported for two sets of analyses: within-subject and cross-subject. Within-subject classification uses a portion of the subject's data to train a model specifically for that subject, whereas cross-subject classification uses the data from other subjects to train a subject-agnostic model. While within-subject models tend to perform better than cross-subject models on a variety of tasks, there is ongoing research investigating techniques to minimize (or possibly eliminate) the need for subject-specific information to train robust systems [45, 52].
For within-subject classification, we use four-fold blockwise cross-validation, where two of the four blocks are chosen to be the training set, one block as the validation set, and the final block as the test set. We perform statistical testing using a repeated-measures analysis of variance (ANOVA), modeling classification results (AUC for P300/MRCP/ERN and classification accuracy for SMR) as the response variable with subject number and classifier type as factors. For cross-subject analysis in P300 and MRCP we choose, at random, four subjects for the validation set, one subject for the test set, and all remaining subjects for the training set (see table 1 for number of subjects per dataset). This process was repeated 30 times, producing 30 different folds. We follow the same procedure for the ERN dataset, except we use the ten test subjects from the original Kaggle competition as the test set for each fold. We perform statistical testing using a one-way analysis of variance, using classifier type as the factor. For the SMR dataset, we partitioned the data as follows: for each subject, select the training data from five other subjects at random to be the training set and the training data from the remaining three subjects to be the validation set. The test set remains the same as the original test set for the competition. Note that this enforces a fully cross-subject classification analysis as we never use the test subjects’ training data. This process is repeated ten times for each subject, creating 90 different folds. The mean and standard error of classification performance were calculated over the 90 folds. We perform statistical testing for this analysis using the same testing procedure as the within-subject analysis.
When training both the within-subject and cross-subject models, we apply a class-weight to the loss function whenever the data are imbalanced (unequal number of trials for each class). The class-weight we apply is the inverse of the proportion in the training data, with the majority class set to 1. For example, in the P300 dataset, there is a 5.6:1 odds between non-targets and targets (table 1). In this case the class-weight for non-targets was set to 1, while the class-weight for targets was set to 6 (when the odds are a fraction, we take the next highest integer). This procedure was applied to the P300 and ERN datasets only, as these were the only datasets where significant class imbalance was present.
Note that for the SMR analysis, we set the temporal kernel length to be 32 samples long (as opposed to 64 samples long as given in table 2) since the data were high-passed at 4 Hz.

2.4. EEGNet feature explainability

The development of methods for enabling feature explainability from deep neural networks has become an active research area over the past few years, and has been proposed as an essential component of a robust model validation procedure, to ensure that the classification performance is being driven by relevant features as opposed to noise or artifacts in the data [16, 94–100]. We present three different approaches for understanding the features derived by EEGNet:

1. Summarizing averaged outputs of hidden unit activations: This approach focuses on summarizing the activations of hidden units at layers specified by the user. In this work we choose to summarize the hidden unit activations representing the data after the depthwise convolution (the spatial filter operation in EEGNet). Because the spatial filters are tied directly to a particular temporal filter, they provide additional insights into the spatial localization of narrow-band frequency activity. Here we summarize the spatially-filtered data by calculating the difference in averaged time-frequency representations between classes, using Morlet wavelets [101].
2. Visualizing the convolutional kernel weights: This approach focuses on directly visualizing and interpreting the convolutional kernel weights from the model. Generally speaking, interpreting the convolutional kernel
Figure 2. 4-fold within-subject classification performance for the P300, ERN and MRCP datasets for each model, averaged over all folds
and all subjects. Error bars denote two standard errors of the mean. We see that, while there is minimal difference between all the CNN
models for the P300 dataset, there are significant differences in the MRCP dataset, with both EEGNet models outperforming all other
models. For the ERN dataset we also see both EEGNet models performing better than all others (p < 0.05).
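For readers reproducing error bars of the kind described in this caption (two standard errors of the mean), a minimal sketch is given below. The AUC values are hypothetical placeholders, not results from this study, and the use of the sample standard deviation (n − 1 denominator) is our assumption, as the caption does not state which estimator was used.

```python
import math
import statistics


def mean_and_two_sem(values):
    """Return the mean and the half-width of a two-standard-error bar."""
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation (assumed)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, 2.0 * sem


fold_aucs = [0.90, 0.92, 0.88, 0.94]  # hypothetical per-fold AUC values
mean, half_width = mean_and_two_sem(fold_aucs)
print(f"{mean:.3f} +/- {half_width:.3f}")
```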
Figure 4. Cross-Subject classification performance for the P300, ERN and MRCP datasets for each model, averaged for 30 folds. Error
bars denote two standard errors of the mean. For the P300 and MRCP datasets there is minimal difference between the DeepConvNet and
the EEGNet models, with both models outperforming ShallowConvNet. For the ERN dataset the reference algorithm (xDAWN + RG)
significantly outperforms all other models.
We illustrate three different approaches to characterize the features learned by EEGNet: (1) summarizing averaged outputs of hidden unit activations, (2) visualizing convolutional kernel weights, and (3) calculating single-trial feature relevances on the classification decision. We illustrate approach 1 on the P300 dataset for a cross-subject trained EEGNet-4,1 model. We chose to analyze the filters from the P300 dataset due to the fact that multiple neurophysiological events occur simultaneously: participants were told to press a button with their dominant hand whenever a target image appeared on the screen. Because of this, target trials contain both the P300 event-related potential and the alpha/beta desynchronizations in contralateral motor cortex due to button presses. Here we were interested in whether or not the EEGNet architecture was capable of separating out these confounding events. We were also interested in quantifying the classification performance of the architecture whenever specific filters were removed from the model.
Figure 6 shows the spatial topographies of the four filters along with an average wavelet time-frequency difference, calculated using Morlet wavelets [101], between all target trials and all non-target trials. Here we see four distinct filters appear. The time-frequency analysis of filter 1 shows an increase in low-frequency power approximately 500 ms after image presentation, followed by desynchronizations in alpha frequency. As nearly all subjects in the P300 dataset are right-handed, we also see significant activity along the left motor cortex. Time-frequency analysis of filter 2 appears to show a significant theta-beta relationship; while increases in theta activity have been previously noted in the P300 literature in response to targets [102], a relationship between theta and beta has not previously been noted. The time-frequency difference for filter 4 appears to correspond with the P300, with an increase in low-frequency power approximately 350 ms after image presentation.
We also conducted a feature ablation study, where we iteratively removed a set of filters (by replacing the filters with zeros) and re-applied the model to predict trials in the test set. We do this for all combinations of the four filters. Classification results for this ablation study are shown in table 4. We see that test set performance is minimally impacted by the removal of any single filter, with the largest
Figure 6. Visualization of the features derived from an EEGNet-4, 1 model configuration for one particular cross-subject fold in the P300
dataset. (A) Spatial topoplots for each spatial filter. (B) The mean wavelet time-frequency difference between target and non-target trials for
each individual filter.
decrease occurring when removing filter 4. As expected, when removing pairs of filters the decrease in
performance is more pronounced, with the largest decrease observed when removing filters 3 and 4. Removing
filters 2 and 3 results in practically no change in classification performance when compared to the full model,
suggesting that the most important features in this task are being captured by filters 1 and 4. This finding is
further reinforced when looking at classification performance when three filters are removed; a model that
contains only filter 4 (0.8637 AUC) performs fairly well when compared to models that contain only filter 2
(0.7108 AUC) or filter 1 (0.7970 AUC).
Figure 7 shows the filters learned for the EEGNet-8,2 model for a within-subject classification of subject 3 for
the SMR dataset. Each column of this figure denotes the learned temporal kernel (top row) with its two
associated spatial filters (bottom two rows). Note that we are learning temporal filters of length 32 samples,
which correspond to 0.25 s in time; hence, we estimate the frequency for each temporal filter as four times the
number of observed cycles. Here we see that EEGNet-8,2 learns both slow-frequency activity at approximately
12 Hz (filters 1, 2, 6 and 8, which show three cycles in a 0.25 s window) and high-frequency activity at
approximately 32 Hz (filter 3, which shows eight cycles). Figure 8 compares the spatial filters associated with
8–12 Hz frequency band learned by EEGNet-8,2 with the spatial filters learned by FBCSP in the 8–12 Hz
filter-bank for each of the four OVR combinations. For ease of description we will use the notation X-Y to
denote the row-column filter. Here we see many of the filters are strongly positively correlated across models
(i.e.: the 1–1 filter of EEGNet-8,2 with the 3–1 filter for FBCSP (ρ = 0.93) and the 2–1 filter of EEGNet-8,2
with the 3–4 filter of FBCSP (ρ = 0.83)), while some are strongly negatively correlated (the 3–1 filter of
EEGNet-8,2 with the 1–1 filter of FBCSP (ρ = −0.93)), indicating a similar filter up to a sign ambiguity.
Figure 9 shows the single-trial feature relevances for EEGNet-8,2, calculated using DeepLIFT, for three
different test trials for one cross-subject fold of the MRCP dataset. Here we see that the high-confidence
predictions (figures 9(A) and (B), for left and right finger movement, respectively) both correctly show the
contralateral motor cortex relevance as expected, whereas for a low-confidence prediction (figure 9(C)), the
feature relevance is more broadly distributed, both in time and in space on the scalp.
Figure 10 shows an additional example of using DeepLIFT to analyze feature relevance for a cross-subject
trained EEGNet-4,2 model for one test subject of the ERN
Table 4. Performance of a cross-subject trained EEGNet-4, 1 model when removing certain filters from the model,
then using the model to predict the test set for one randomly chosen fold of the P300 dataset. AUC values in
bold denote the best performing model when removing one, two or three filters at a time. As the number of
filters removed increases, we see decreases in classification performance, although the magnitude of the
decrease depends on which filters are removed.
Filters removed    Test set AUC
(1)                0.8866
(2)                0.9076
(3)                0.8910
(4)                0.8747
(1, 2)             0.8875
(1, 3)             0.8593
(1, 4)             0.8325
(2, 3)             0.8923
(2, 4)             0.8721
(3, 4)             0.8206
(1, 2, 3)          0.8637
(1, 2, 4)          0.8202
(1, 3, 4)          0.7108
(2, 3, 4)          0.7970
None               0.9054
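The filter ablation procedure (zeroing every combination of the four filters and re-scoring the test set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: a toy linear readout over four per-filter features stands in for the trained EEGNet-4,1 model, zeroing a feature column stands in for replacing that filter's weights with zeros, the data are synthetic, and the names `auc_score` and `ablation_table` are hypothetical. Filters are 0-indexed in the code, so `(3,)` corresponds to removing filter 4.

```python
from itertools import combinations

import numpy as np


def auc_score(y_true, y_score):
    """ROC AUC as the fraction of (target, non-target) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)


def ablation_table(filter_outputs, readout, y_true):
    """Score the toy model with every subset of 0..n-1 filters zeroed out."""
    n_filters = filter_outputs.shape[1]
    results = {}
    for n_removed in range(n_filters):  # remove none, one, two, ... filters
        for removed in combinations(range(n_filters), n_removed):
            masked = filter_outputs.copy()
            if removed:
                masked[:, list(removed)] = 0.0  # ablate the chosen filters
            results[removed] = auc_score(y_true, masked @ readout)
    return results


# Synthetic stand-in data: only the last filter carries class information.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
features = rng.normal(size=(200, 4))
features[:, 3] += 2.0 * y

table = ablation_table(features, np.ones(4), y)
for removed, auc in table.items():
    print(removed if removed else "None", round(float(auc), 4))
```

With four filters this yields 15 rows (one full model, four single removals, six pairs and four triples), matching the layout of table 4.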
Figure 7. Visualization of the features derived from a within-subject trained EEGNet-8, 2 model for Subject 3 of the SMR dataset. Each of
the 8 columns shows the learned temporal kernel for a 0.25 s window (top) with its two associated spatial filters (bottom two). We see that,
while many of the temporal filters are isolating slower-wave activity, the network identifies a higher-frequency filter at approximately 32 Hz
(temp. filter 3, which shows eight cycles in a 0.25 s window).
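The cycle-counting rule used for the temporal kernels (frequency equals four times the number of cycles in a 0.25 s window) can be checked with a short sketch. The 128 Hz sampling rate below is an assumption implied by 32 samples spanning 0.25 s, the kernel is a synthetic sine rather than a learned filter, and `cycles_to_hz` and `dominant_frequency` are hypothetical helper names.

```python
import numpy as np

KERNEL_LEN = 32   # temporal kernel length in samples
FS = 128.0        # assumed sampling rate in Hz (32 samples <-> 0.25 s)


def cycles_to_hz(n_cycles, kernel_len=KERNEL_LEN, fs=FS):
    """Convert a cycle count within the kernel window to a frequency in Hz."""
    return n_cycles * fs / kernel_len


def dominant_frequency(kernel, fs=FS):
    """Frequency of the largest FFT magnitude bin of a (mean-removed) kernel."""
    kernel = np.asarray(kernel, dtype=float)
    spectrum = np.abs(np.fft.rfft(kernel - kernel.mean()))
    freqs = np.fft.rfftfreq(kernel.size, d=1.0 / fs)
    return freqs[np.argmax(spectrum)]


# Synthetic kernel with exactly three cycles in the 0.25 s window.
t = np.arange(KERNEL_LEN) / FS
toy_kernel = np.sin(2 * np.pi * 12.0 * t)

print(cycles_to_hz(3), cycles_to_hz(8), dominant_frequency(toy_kernel))
```

Note that a 32-sample window gives a frequency resolution of 4 Hz under this assumption, so such estimates are necessarily approximate.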
Figure 8. Comparison of the four spatial filters learned by FBCSP in the 8–12 Hz filter bank for each OVR class combination (A) with the
spatial filters learned by EEGNet-8, 2 (B) for four temporal filters that capture 12 Hz frequency activity for subject 3 of the SMR dataset
(temporal filters 1, 2, 6 and 8, see figure 7). We see that, for this subject, similar filters appear across both FBCSP and EEGNet-8, 2.
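The comparison between EEGNet-8,2 and FBCSP spatial filters relies on a correlation coefficient ρ that is meaningful only up to sign. A minimal sketch of that comparison is given below, assuming ρ is the Pearson correlation across channels (the text does not name the estimator); the 22-channel vectors are random placeholders and `filter_similarity` is a hypothetical helper.

```python
import numpy as np


def filter_similarity(w_a, w_b):
    """Pearson correlation between two spatial filters, plus its magnitude.

    The magnitude is the sign-invariant similarity: a spatial filter and its
    negation extract the same source up to polarity.
    """
    w_a = np.asarray(w_a, dtype=float).ravel()
    w_b = np.asarray(w_b, dtype=float).ravel()
    rho = np.corrcoef(w_a, w_b)[0, 1]
    return rho, abs(rho)


# Placeholder filters: the second is a sign-flipped, rescaled copy of the first.
rng = np.random.default_rng(1)
w_eegnet = rng.normal(size=22)   # hypothetical channel count
w_fbcsp = -2.0 * w_eegnet

rho, abs_rho = filter_similarity(w_eegnet, w_fbcsp)
print(round(float(rho), 3), round(float(abs_rho), 3))
```

A strongly negative ρ, as for the pair reported with ρ = −0.93, therefore indicates the same spatial pattern with flipped polarity rather than a dissimilar filter.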
Figure 9. (Top row) Single-trial EEG feature relevance for a cross-subject trained EEGNet-8,2 model, using DeepLIFT, for three different
test trials of the MRCP dataset: (A) a high-confidence, correct prediction of left finger movement, (B) a high-confidence, correct prediction
of right finger movement and (C) a low-confidence, incorrect prediction of left finger movement. Titles include the true class label and the
predicted probability of that label. (Bottom row) Spatial topoplots of the relevances at two time points: approximately 50 ms and 150 ms
after button press. As expected, the high-confidence trials show the correct relevances corresponding to contralateral motor cortex for left
(A) and right (B) button presses, respectively. For the low-confidence trial we see the relevances are more mixed and broadly distributed,
without a clear spatial localization to motor cortices.
an EEG-specific model which encapsulates well-known EEG feature extraction concepts. Finally, through the use of
feature visualization and ablation analysis, we show that neurophysiologically interpretable features can be
extracted from the EEGNet model. This last finding is particularly important, as it is a critical component to
understanding the validity and robustness of CNN model architectures not just for EEG [32, 33], but for CNN
architectures in general [16, 95, 100].
The learning capacity of CNNs comes in part from their ability to automatically extract intricate feature
representations from raw data. However, since the features are not hand-designed by human engineers,
understanding the meaning of those features poses a significant challenge in producing interpretable models
[96]. This is especially true when CNNs are used for the analysis of EEG data where features from neural
signals are often non-stationary and corrupted by noise artifacts [103, 104]. In this study, we illustrated
three different approaches for visualizing the features learned by EEGNet: (1) analyzing spatial filter outputs,
averaged over trials, on the P300 dataset, (2) visualizing the convolutional kernel weights on the SMR dataset
and comparing them to the weights learned by FBCSP, and (3) performing single-trial relevance analysis on the
MRCP and SMR datasets. For the ERN dataset we compared single-trial feature relevances to averaged ERPs and
showed that relevant features coincided with the peak of the positive potential for correct and incorrect
feedback trials, which has been shown in previous literature to be positively correlated to classifier
performance [57]. In addition, we conducted a feature ablation study to understand the impact of the presence
or absence of a particular feature on the classification decision for the P300 dataset. In each of these
analyses, we showed that EEGNet was capable of extracting interpretable features that generally corresponded to
known neurophysiological phenomena.
Generally speaking, the classification performance of DeepConvNet and EEGNet was similar across all
cross-subject analyses, whereas DeepConvNet performance was lower across nearly all within-subject analyses
(with the exception of P300). One possible explanation for this discrepancy is the amount of training data used
to train the model; in cross-subject analyses the training set sizes were about 10–15 times larger than that of
within-subject analyses. This suggests that DeepConvNet is more data-intensive compared to EEGNet, an
unsurprising result given that the model size of DeepConvNet is two orders of magnitude larger than EEGNet (see
table 3). We believe this intuition is consistent with the findings originally reported by the developers of
DeepConvNet [32], where they state that a training data augmentation strategy was needed to obtain good
classification performance on the SMR dataset. In contrast to their work, we show that EEGNet performed well
across all tested datasets without the need for data augmentation, making the model simpler to use in practice.
In general we found that, both in within- and cross-subject analyses, ShallowConvNet tended to perform worse on
the ERP BCI datasets than on the oscillatory BCI dataset (SMR), while the opposite behavior was observed with
DeepConvNet. We believe this is due to the fact that the ShallowConvNet architecture was designed specifically
to extract log band-power features; in situations where the dominant feature is signal amplitude (as is the
case in many ERP BCIs), ShallowConvNet performance tended to suffer. The opposite situation occurred with
DeepConvNet; as its architecture was not designed to extract frequency features, its performance was lower in
situations where frequency power is the dominant feature. In contrast, we found that EEGNet performed just as
well as ShallowConvNet in SMR classification and just as well as DeepConvNet in ERP classification (and
outperforming in the case of within-subject MRCP, ERN and SMR classifications), suggesting that EEGNet is
robust enough to learn a wide variety of features over a range of BCI tasks.
The severe underperformance of ShallowConvNet on within-subject MRCP classification was unexpected, given the
similarity in neural responses between the MRCP and SMR,
Figure 10. Single-trial EEG feature relevance for a cross-subject trained EEGNet-4,2 model, using DeepLIFT, for the one test subject of
the ERN dataset. (top row) Feature relevances for three correctly predicted trials of incorrect feedback, along with its predicted probability
P. (bottom row) Same as the top row but for three correctly predicted trials of correct feedback. The black line denotes the average ERP,
calculated at channel Cz, for incorrect feedback trials (top row) and for correct feedback trials (bottom row). The thin vertical line denotes
the positive peak of the average ERP waveform. Here we see feature relevances coincide strongly with the positive peak of the average ERP
waveform for each trial. We also see the positive peak occurring slightly earlier for correct feedback trials versus incorrect feedback trials,
consistent with the results in [57].
and the fact that ShallowConvNet performed well on SMR. This discrepancy in performance is not due to the amount
of training data used, as within-subject MRCP classification has approximately 700 training trials, evenly
split among left and right finger movements, whereas the SMR dataset has only 192 training trials, evenly split
among four classes. In addition, we did not observe large deviations in ShallowConvNet performance on the other
datasets (P300 and ERN). In fact, ShallowConvNet performed fairly well on within-subject ERN classification,
even though this dataset is the smallest among all datasets used in this study (only having 170 training trials
total). Determining the underlying source of this phenomenon will be explored in future research.
Deep learning models for EEG generally employ one of three input styles, depending on their targeted
application: (1) the EEG signal of all available channels, (2) a transformed EEG signal (generally a
time-frequency decomposition) of all available channels [37] or (3) a transformed EEG signal of a subset of
channels [38]. Models that fall in (2) generally see a significant increase in data dimensionality, thus
requiring either more data or more model regularization (or both) to learn an effective feature representation.
This introduces more hyperparameters that must be learned, increasing the potential variability in model
performance due to hyperparameter misspecification. Models that fall in (3) generally require a priori
knowledge about the channels to select. For example, the model proposed in [38] uses the time-frequency
decomposition of channels Cz, C3 and C4 as the inputs for a motor imagery classification task. This channel
selection is intentional, given the fact that neural responses to motor actions (the sensory motor rhythm) are
observed strongest at those channels and are easily observed through a time-frequency analysis. Also, by only
working with three channels, the authors reduce the significant increase in dimensionality of the data. While
this approach works well if the feature of interest is known beforehand, this approach is not guaranteed to
work well in other applications where the features are not observed at those channels, limiting the overall
utility of this approach. We believe models that fall in (1), such as EEGNet and others [28, 30, 31], offer the
best tradeoff between input dimensionality and the flexibility to discover relevant features by providing all
available channels. This is especially important as BCI technologies evolve into novel application spaces, as
the features needed for these future BCIs may not be known beforehand [3–5, 10–12].
Acknowledgments
This project was sponsored by the US Army Research Laboratory under ARL-H70-HR52, ARL-74A-HRCYB and through the
Cooperative Agreement Number W911NF-10-2-0022. The views and conclusions contained in this document are those
of the authors and should not be interpreted as representing the official policies, either expressed or
implied, of the US Government. The US Government is authorized to reproduce and distribute reprints for
Government purposes notwithstanding any copyright notation herein.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships
that could be construed as a potential conflict of interest.
Appendix
Table A1. DeepConvNet architecture, where C = number of channels, T = number of time points and N = number of classes, respectively.
Table A2. ShallowConvNet architecture, where C = number of channels, T = number of time points and N = number of classes,
respectively. Here, the ‘square’ and ‘log’ activation functions are given as f(x) = x^2 and f(x) = log(x), respectively. Note that we clip the
log function such that the minimum input value is a very small number (ε = 10 × 10^−7) for numerical stability.
Layer           # filters   Size        # params      Activation   Options
Input                       (C, T)
Reshape                     (1, C, T)
Conv2D          40          (1, 13)     560           Linear       Mode = same, max norm = 2
Conv2D          40          (C, 1)      40 * 40 * C   Linear       Mode = valid, max norm = 2
BatchNorm                               2 * 40                     epsilon = 1 × 10^−05, momentum = 0.1
Activation                                            Square
AveragePool2D               (1, 35)                                stride (1, 7)
Activation                                            Log
Flatten
Dropout                                                            p = 0.5
Dense           N                                     Softmax      Max norm = 0.5
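The square, average-pool and clipped-log sequence in table A2 amounts to a log band-power estimate. The following NumPy sketch illustrates that sequence in isolation, under stated assumptions: the convolution and batch normalization layers are omitted, the clipping floor is taken as 10 × 10^−7 = 1e-6, pooling is applied along the time axis only, and the function names are hypothetical rather than taken from any released implementation.

```python
import numpy as np

EPS = 1e-6  # clipping floor for the log, i.e. 10 x 10^-7


def square(x):
    """'Square' activation: f(x) = x^2."""
    return x ** 2


def safe_log(x, eps=EPS):
    """'Log' activation with inputs clipped below at eps for numerical stability."""
    return np.log(np.maximum(x, eps))


def average_pool_time(x, size=35, stride=7):
    """Average pooling along the last (time) axis without padding."""
    n_out = (x.shape[-1] - size) // stride + 1
    windows = [x[..., i * stride:i * stride + size].mean(axis=-1) for i in range(n_out)]
    return np.stack(windows, axis=-1)


def log_power_features(x):
    """Square -> average pool -> clipped log, as in the ShallowConvNet table."""
    return safe_log(average_pool_time(square(x)))


# An all-zero input would give log(0) = -inf without the clipping step.
silent = np.zeros((40, 70))           # 40 feature maps, 70 time points (toy sizes)
features = log_power_features(silent)
print(features.shape, features.min())
```

The all-zero input shows why the clip matters: every pooled power is zero, and the output stays finite at log(1e-6) instead of diverging.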
A.1. DeepConvNet and ShallowConvNet architectures
The DeepConvNet and ShallowConvNet architectures are given in tables A1 and A2, respectively. The DeepConvNet
was designed to be a general-purpose architecture that is not restricted to specific feature types, whereas
ShallowConvNet is designed specifically for oscillatory signal classification.
ORCID iDs
Vernon J Lawhern https://orcid.org/0000-0002-3921-8723
[40] Sakhavi S, Guan C and Yan S 2015 Parallel convolutional-linear neural network for motor imagery classification 23rd European Signal Processing Conf. pp 2736–40
[41] Lu N, Li T, Ren X and Miao H 2017 A deep learning scheme for motor imagery classification based on restricted Boltzmann machines IEEE Trans. Neural Syst. Rehabil. Eng. 25 566–76
[42] Yin Z and Zhang J 2017 Cross-session classification of mental workload levels using eeg and an adaptive deep learning model Biomed. Signal Process. Control 33 30–47
[43] Chollet F 2016 Xception: deep learning with depthwise separable convolutions CoRR (arXiv:1610.02357)
[44] Yang Z, Moczulski M, Denil M, Freitas N D, Smola A, Song L and Wang Z 2015 Deep fried convnets IEEE Int. Conf. on Computer Vision pp 1476–83
[45] Lotte F 2015 Signal processing approaches to minimize or suppress calibration time in oscillatory activity-based brain–computer interfaces Proc. IEEE 103 871–90
[46] Fazel-Rezai R, Allison B Z, Guger C, Sellers E W, Kleih S C and Kübler A 2012 P300 brain–computer interface: current challenges and emerging trends Frontiers Neuroeng. 5 14
[47] Pfurtscheller G and Neuper C 2001 Motor imagery and direct brain–computer communication Proc. IEEE 89 1123–34
[48] Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr. Clin. Neurophysiol. 86 283–93
[49] Polich J 2007 Updating p300: an integrative theory of P3a and P3b Clin. Neurophysiol. 118 2128–48
[50] Sajda P, Pohlmeyer E, Wang J, Parra L C, Christoforou C, Dmochowski J, Hanna B, Bahlmann C, Singh M K and Chang S F 2010 In a blink of an eye and a switch of a transistor: cortically coupled computer vision Proc. IEEE 98 462–78
[51] Marathe A R, Lawhern V J, Wu D, Slayback D and Lance B J 2016 Improved neural signal classification in a rapid serial visual presentation task using active learning IEEE Trans. Neural Syst. Rehabil. Eng. 24 333–43
[52] Waytowich N, Lawhern V, Bohannon A, Ball K and Lance B 2016 Spectral transfer learning using information geometry for a user-independent brain-computer interface Frontiers Neurosci. 10 430
[53] Delorme A and Makeig S 2004 Eeglab: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis J. Neurosci. Methods 134 9–21
[54] Miltner W H R, Braun C H and Coles M G H 1997 Event-related brain potentials following incorrect feedback in a time-estimation task: evidence for a generic neural system for error detection J. Cogn. Neurosci. 9 788–98
[55] Gehring W J, Goss B, Coles M G H, Meyer D E and Donchin E 1993 A neural system for error detection and compensation Psychol. Sci. 4 385–90
[56] Falkenstein M, Hohnsbein J, Hoormann J and Blanke L 1991 Effects of crossmodal divided attention on late ERP components. II. Error processing in choice reaction tasks Electroencephalogr. Clin. Neurophysiol. 78 447–55
[57] Margaux P, Emmanuel M, Sébastien D, Olivier B and Jérémie M 2012 Objective and subjective evaluation of online error correction during p300-based spelling Adv. Hum. Comput. Interact. 2012 4
[58] Zander T O, Kothe C, Welke S and Roetting M 2009 Utilizing secondary input from passive brain–computer interfaces for enhancing human–machine interaction Foundations of Augmented Cognition. Neuroergonomics and Operational Neuroscience (FAC 2009) (Lect. Not. Comput. Sci. vol 5638) ed D D Schmorrow et al (Berlin: Springer) pp 759–71
[59] Millán J D R et al 2010 Combining brain–computer interfaces and assistive technologies: state-of-the-art and challenges Frontiers Neurosci. 4 161
[60] Spüler M, Bensch M, Kleih S, Rosenstiel W, Bogdan M and Kübler A 2012 Online use of error-related potentials in healthy users and people with severe motor impairment increases performance of a p300-BCI Clin. Neurophysiol. 123 1328–37
[61] Krusienski D J, Sellers E W, McFarland D J, Vaughan T M and Wolpaw J R 2008 Toward enhanced p300 speller performance J. Neurosci. Methods 167 15–21
[62] Toro C, Deuschl G, Thatcher R, Sato S, Kufta C and Hallett M 1994 Event-related desynchronization and movement-related cortical potentials on the ECoG and EEG Electroencephalogr. Clin. Neurophysiol. Evoked Potentials Sect. 93 380–9
[63] Pfurtscheller G and Aranibar A 1977 Event-related cortical desynchronization detected by power measurements of scalp EEG Electroencephalogr. Clin. Neurophysiol. 42 817–26
[64] Pfurtscheller G and da Silva F L 1999 Event-related eeg/meg synchronization and desynchronization: basic principles Clin. Neurophysiol. 110 1842–57
[65] Liao K, Xiao R, Gonzalez J and Ding L 2014 Decoding individual finger movements from one hand using human EEG signals PLoS ONE 9 1–12
[66] Barrett G, Shibasaki H and Neshige R 1986 Cortical potentials preceding voluntary movement: evidence for three periods of preparation in man Electroencephalogr. Clin. Neurophysiol. 63 327–39
[67] Yilmaz O, Birbaumer N and Ramos-Murguialday A 2015 Movement related slow cortical potentials in severely paralyzed chronic stroke patients Frontiers Hum. Neurosci. 8 1033
[68] Deecke L, Scheid P and Kornhuber H H 1969 Distribution of readiness potential, pre-motion positivity, and motor potential of the human cerebral cortex preceding voluntary finger movements Exp. Brain Res. 7 158–68
[69] Leuthardt E C, Schalk G, Moran D and Ojemann J G 2006 The emerging world of motor neuroprosthetics: a neurosurgical perspective Neurosurgery 59 1–14
[70] Yom-Tov E and Inbar G F 2003 Detection of movement-related potentials from the electro-encephalogram for possible use in a brain–computer interface Med. Biol. Eng. Comput. 41 85–93
[71] Karimi F, Kofman J, Mrachacz-Kersting N, Farina D and Jiang N 2017 Detection of movement related cortical potentials from EEG using constrained ICA for brain–computer interface applications Frontiers Neurosci. 11 356
[72] Gordon S, Lawhern V, Passaro A and McDowell K 2015 Informed decomposition of electroencephalographic data J. Neurosci. Methods 256 41–55
[73] Bigdely-Shamlo N, Mullen T, Kothe C, Su K M and Robbins K A 2015 The prep pipeline: standardized preprocessing for large-scale EEG analysis Frontiers Neuroinformatics 9 16
[74] Tangermann M et al 2012 Review of the BCI competition iv Frontiers Neurosci. 6 55
[75] Kingma D P and Ba J 2014 Adam: a method for stochastic optimization (arXiv:1412.6980)
[76] Abadi M et al 2016 Tensorflow: a system for large-scale machine learning Proc. 12th USENIX Conf. on Operating Systems Design and Implementation (Berkeley, CA, USA: USENIX Association) pp 265–83
[77] Chollet F 2015 Keras https://github.com/fchollet/keras
[78] Ang K K, Chin Z Y, Wang C, Guan C and Zhang H 2012 Filter bank common spatial pattern algorithm on BCI competition iv datasets 2a and 2b Frontiers Neurosci. 6 39
[79] Dyrholm M, Christoforou C and Parra L C 2007 Bilinear discriminant component analysis J. Mach. Learn. Res. 8 1097–111
[80] Ioffe S and Szegedy C 2015 Batch normalization: accelerating deep network training by reducing internal covariate shift (arXiv:1502.03167)
[81] Clevert D, Unterthiner T and Hochreiter S 2015 Fast and accurate deep network learning by exponential linear units (elus) CoRR (arXiv:1511.07289)
[82] Srivastava N, Hinton G, Krizhevsky A, Sutskever I and Salakhutdinov R 2014 Dropout: a simple way to prevent neural networks from overfitting J. Mach. Learn. Res. 15 1929–58
[83] Springenberg J T, Dosovitskiy A, Brox T and Riedmiller M A 2014 Striving for simplicity: the all convolutional net (arXiv:1412.6806)
[84] Rivet B, Souloumiac A, Attina V and Gibert G 2009 xDAWN algorithm to enhance evoked potentials: application to brain–computer interface IEEE Trans. Biomed. Eng. 56 2035–43
[85] Barachant A, Bonnet S, Congedo M and Jutten C 2012 Multiclass brain–computer interface classification by Riemannian geometry IEEE Trans. Biomed. Eng. 59 920–8
[86] Barachant A and Congedo M 2014 A plug & play P300 BCI using information geometry (arXiv:1409.0107 [cs, stat])
[87] Congedo M, Barachant A and Andreev A 2013 A new generation of brain–computer interface based on riemannian geometry CoRR (arXiv:1310.8115)
[88] Barachant A and Bonnet S 2011 Channel selection procedure using riemannian distance for bci applications 5th Int. IEEE/EMBS Conf. on Neural Engineering pp 348–51
[89] Barachant A, Bonnet S, Congedo M and Jutten C 2013 Classification of covariance matrices using a Riemannian-based kernel for BCI applications Neurocomputing 112 172–8
[90] Ng A Y 2004 Feature selection, l1 versus l2 regularization, and rotational invariance Proc. 21st Int. Conf. on Machine Learning (New York, NY: ACM) p 78
[91] Ledoit O and Wolf M 2004 A well-conditioned estimator for large-dimensional covariance matrices J. Multivariate Anal. 88 365–411
[92] Zou H and Hastie T 2005 Regularization and variable selection via the elastic net J. R. Stat. Soc. B 67 301–20
[93] Kothe C A and Makeig S 2013 Bcilab: a platform for brain–computer interface development J. Neural Eng. 10 056014
[94] Baehrens D, Schroeter T, Harmeling S, Kawanabe M, Hansen K and Müller K-R 2010 How to explain individual classification decisions J. Mach. Learn. Res. 11 1803–31
[95] Zeiler M D and Fergus R 2014 Visualizing and understanding convolutional networks Computer Vision—ECCV ed D Fleet et al (Cham: Springer) pp 818–33
[96] Nguyen A M, Yosinski J and Clune J 2014 Deep neural networks are easily fooled: high confidence predictions for unrecognizable images CoRR (arXiv:1412.1897)
[97] Ribeiro M T, Singh S and Guestrin C 2016 ‘Why should I trust you?’: explaining the predictions of any classifier Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (New York, NY: ACM) pp 1135–44
[98] Shrikumar A, Greenside P and Kundaje A 2017 Learning important features through propagating activation differences CoRR (arXiv:1704.02685)
[99] Ancona M, Ceolini E, Öztireli C and Gross M 2018 Towards better understanding of gradient-based attribution methods for deep neural networks Int. Conf. on Learning Representations
[100] Montavon G, Samek W and Müller K-R 2018 Methods for interpreting and understanding deep neural networks Digit. Signal Process. 73 1–15
[101] Torrence C and Compo G P 1998 A practical guide to wavelet analysis Bull. Am. Meteorol. Soc. 79 61–78
[102] Mazaheri A and Picton T W 2005 EEG spectral dynamics during discrimination of auditory and visual targets Cogn. Brain Res. 24 81–96
[103] Johnson G, Waytowich N and Krusienski D J 2011 The challenges of using scalp-EEG input signals for continuous device control Foundations of Augmented Cognition. Directing the Future of Adaptive Systems. Int. Conf. on Foundations of Augmented Cognition ed D D Schmorrow and C M Fidopiastis (Berlin: Springer) pp 525–7
[104] Lawhern V, Hairston W D, McDowell K, Westerfield M and Robbins K 2012 Detection and classification of subject-generated artifacts in EEG signals using autoregressive models J. Neurosci. Methods 208 181–9