1 s2.0 S1110016823000327 Main

Alexandria Engineering Journal (2023) 68, 817–840
H O S T E D BY
Alexandria University
Alexandria Engineering Journal

www.elsevier.com/locate/aej
www.sciencedirect.com
REVIEW
A Comprehensive Survey on Deep Facial

Expression Recognition: Challenges, Applications,
and Future Guidelines
Muhammad Sajjad a,*, Fath U Min Ullah b, Mohib Ullah a, Georgia Christodoulou c,
Faouzi Alaya Cheikh a,*, Mohammad Hijji d, Khan Muhammad e,*,
Joel J.P.C. Rodrigues f,g
a
The Software, Data and Digital Ecosystems (SDDE) Research Group, Department of Computer Science (IDI),
Norwegian University of Science and Technology (NTNU), 2815 Gjøvik, Norway
b
Sejong University, Seoul 143-747, Republic of Korea
c
Catalink Limited, Charistinis Sakkada 5, Nicosia 1040, Cyprus
d
Faculty of Computers and Information Technology (FCIT), University of Tabuk, Tabuk 47711, Saudi Arabia
e
Visual Analytics for Knowledge Laboratory (VIS2KNOW Lab), Department of Applied Artificial Intelligence, School
of Convergence, College of Computing and Informatics, Sungkyunkwan University, Seoul 03063, Republic of Korea
f
College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266555, China
g
Instituto de Telecomunicações, 6201-001 Covilhã, Portugal
Received 12 July 2022; revised 31 December 2022; accepted 9 January 2023

Available online 10 February 2023
KEYWORDS Abstract Facial expression recognition (FER) is an emerging and multifaceted research topic.
Facial expression recogni- Applications of FER in healthcare, security, safe driving, and so forth have contributed to the cred-
tion; ibility of these methods and their adoption in human-computer interaction for intelligent outcomes.
Edge vision; Computational FER mimics human facial expression coding skills and conveys important cues that
Deep learning; complement speech to assist listeners. Similarly, FER methods based on deep learning and artificial
Machine learning; intelligence (AI) techniques have been developed with edge modules to ensure efficiency and real-
Health care; time processing. To this end, numerous studies have explored different aspects of FER. Surveys
Security; of FER have focused on the literature on hand-crafted techniques, with a focus on general methods
Artificial intelligence for local servers but largely neglecting edge vision-inspired deep learning and AI-based FER tech-
nologies. To consider these missing aspects, in this study, the existing literature on FER is thor-
oughly analyzed and surveyed, and the working flow of FER methods, their integral and
intermediate steps, and pattern structures are highlighted. Further, the limitations in existing
FER surveys are discussed. Next, FER datasets are investigated in depth, and the associated chal-
lenges and problems are discussed. In contrast to existing surveys, FER methods are considered for
edge vision (on e.g., smartphone or Raspberry Pi, devices, etc.), and different measures to evaluate
the performance of FER methods are comprehensively discussed. Finally, recommendations and
* Corresponding authors.
E-mail addresses: muhammad.sajjad@ntnu.no (M. Sajjad), faouzi.cheikh@ntnu.no (F. Alaya Cheikh), khan.muhammad@ieee.org (K. Muhammad).
Peer review under responsibility of Faculty of Engineering, Alexandria University.
https://doi.org/10.1016/j.aej.2023.01.017
1110-0168 Ó 2023 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Alexandria University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
818 M. Sajjad et al.
some avenues for future research are suggested to facilitate further development and implementa-
tion of FER technologies.
Ó 2023 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Alexandria
University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/
licenses/by-nc-nd/4.0/).
Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
1.1. Managerial and social implications of FER. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
1.2. Applications of FER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
1.2.1. FER for the prognosis and diagnosis of neurological disorders. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
1.2.2. FER in security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
1.2.3. FER for learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
2. Overview of the existing FER literature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
3. Working flow of FER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
3.1. Data acquisition and preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
3.2. ROI detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
3.3. Emotion recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824
3.3.1. Conventional learning-based FER techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824
3.3.2. Deep learning-based FER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824
3.4. Output emotion and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
4. FER datasets and associated statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 830
5. Challenges and future research directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 832
5.1. FER challenges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 832
5.1.1. Scarcity of FER datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 832
5.2. Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
5.2.1. Surveillance-scaled FER datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
5.2.2. FER with lower computational resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
5.2.3. FER via E2E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
5.2.4. Group expression analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
5.2.5. FER everywhere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
5.2.6. Federated learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
5.2.7. AML for FER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
Declaration of competing interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
1. Introduction tions of FER. Similarly, individuals suffering from a stroke

may have an impairment of the left anterior or posterior insula
The exponential growth of the facial expression recognition cortex, pallidus, and putamen, which can render the recogni-
(FER) methods performed using computer vision, deep learn- tion of some of emotions more difficult. Generally, these emo-
ing, and AI has been observed over the last few owing to its tions are complex and, in some cases, may be confused because
well-known applications in security [1,2], lecturing [3,4], med- their manifestation varies considerably among different people
ical rehabilitation [5], FER in the wild [6,7], and safe driving owing to differences in age, personal characteristics, gender,
[8]. Facial expressions are remarkably essential in human com- methods of communication, and so forth.
munication and are produced through the movement of facial FER is significantly affected by the illumination, pose,
muscles and communicate with a range of signal types, from a background, and camera viewpoint of a source image, as well
state of deep-rooted survival to subtle communicative signals, as occlusion or misalignment. Efficient FER relies on both the
such as raising the eyebrow in a conversational context [9]. computations that occur in the visual-perceptual system, sup-
Most psychological studies have reported that half of the ported by perceptual processes, and extrapolated information
information in a given speech is conveyed through emotions. from the perceptual system [11]. Three main representations
Patients with Parkinson’s disease develop symptoms of stiff- of visual FER are sufficient and necessary, namely 1) a series
ness in facial muscle movements and reduced facial expression, of visual-perceptual representations of postures and the move-
known as hypomania [10], which may contribute to percep- ment of observed expressions, 2) storage of the structural
A comprehensive Survey on Deep Facial Expression Recognition 819
description of features characterizing the known expressions, (b). Additionally, the working strategy and contributions of
and 3) semantic representations characterizing expressions. FER from 2015 onward were studied, and its coverage by dif-
An extensive body of literature has emerged on FER in the ferent sources such as journals, publishers, ArXiv, and confer-
form of articles, surveys, literature reviews, and proposals. ences are shown in Fig. 1 (c). Similarly, FER must be
However, the works thus far have focused primarily only on examined in terms of baseline strategies used to recognize
working flow and feature extraction. To address the missing expressions in video or image content. These strategies were
aspects, this article first analyzed and considered FER in detail broadly divided into three parts, and their visual representa-
by presenting a taxonomy of and statistics on prior works. To tions are given in Fig. 1 (d). A list of abbreviations used in this
collect FER literature, a yearly search strategy that helped work with their expansions are provided in Table 1.
cover a wide range of articles from each year sequentially FER has been considered from both academic and indus-
was applied; subsequently, the articles were correspondingly trial perspectives, and can provide a window to the tempera-
categorized. The search for articles involved several search ment, cognitive ability, personality, and psychopathology of
engines, including Google, Google Scholar, ScienceDirect, individuals. For example, an increase in the use of FER tech-
and IEEE Explore. The search revealed an increasing interest nology in the clinical investigation of the effects of neuropsy-
in FER in the form of more published articles, and the latest chiatric disorders on expression and perception has been
methods tended to be inspired by neural networks and end- shown to be tractable for quantitative research. Growth in
to-end network models. Additionally, newly developed emo- the field of FER has been achieved owing to their wide range
tion recognition datasets were constructed as the field was of applications in real-life scenarios, science fields, and medical
developed further over subsequent years. The number of arti- services. Some applications of FER include gauging con-
cles published each year is shown in Fig. 1 (a). Next, the qual- sumer’s emotions regarding products or identifying suspicious
ity of research on FER was investigated. Highly cited articles activity. Automotive companies applying FER technology aim
have a greater impact on the research community. High cita- to make cars safer and more personalized for individual
tion scores indicate the influence of leading research directions. customers.
Hence, article citation scores for each year were considered Human emotional expressions profoundly enrich our inter-
herein. Some statistics on these citations are shown in Fig. 1 actions with one another [12]. FER technology has been
Fig. 1 Statistics of FER publications across search engines in terms of their citations score and publisher-wise distribution of FER
methods. (a) Number of publications in each year ranging from 2015 to 2021. (b) Citations achieved by FER research in each year, where
2016 is the most cited year as the FER methods of this year have been well explored in the later research. (c) Division of publications in
each portal. (d) Categorization of FER methods based on their baseline strategy.
Table 1 Abbreviations used throughout the survey.

Word Description Word Description
ACNN Attention mechanism CNN KTN Knowledgeable teacher network
AI Artificial intelligence KDEF Karolinska directed emotional faces
AFEW Acted facial expressions in the wild LBP Local binary patterns
BOVW Bag-of-words LGIN Learnable graph inception network
BDLSTM Bi-directional LSTM LFW Labelled faces in the wild
BIGRU Bi-directional GRU LSTM Long short-term memory
BU-4DFE Binghamton University 4d Facial Expression LPDP Local prominent directional pattern
BU-3DFE Binghamton University 3d Facial Expression MTCNN Multitask cascaded convolutional networks
BAUM Bahçesßehir University Multimodal Affective MDSTFN Multichannel deep spatial–temporal feature fusion neural
Database network
CNN Convolutional neural network MLP Multi-layer perceptron
CLAHE Contrast limited adaptive histogram equalization MSCNN Multi-signal convolutional neural network
DCNN Difference CNN MUG Multimedia understanding group
DWT Discrete wavelet transform NB Naı̈ve Bayesian
DAM- Deep attentive multi-path CNN NCUFE Nanchang University Facial Expression
CNN
DNN Deep neural network ORB Oriented FAST and rotated BRIEF
DISFA Denver intensity of spontaneous facial action PNN Parallel neural network
EEM Emotional education mechanism PHRNN Part-based hierarchical bidirectional RNN
EDLM Ensemble deep learning model RAFD Radboud face database
ELM Extreme learning machine RF Random forest
FER Facial expression recognition R-CNN Region-based CNN
FACS Facial action coding system RNN Recurrent neural network
FNN Feedforward neural network RML Ryerson multimedia research lab
FMPN Facial motion prior networks RAVDESS Ryerson audio–visual database of emotional speech and
song
FERET Facial recognition technology SSD Single-shot detection
FEW Facial expressions in the wild databases SVM Support vector machine
GRU Gated recurrent unit SFEW Static facial expressions in the wild
GRNN General regression neural network SIFT Scale-invariant feature transform
GFT Group formation task SURF Speeded up robust features
HOG Histogram of oriented gradients SCN Self-cure network
IWFER Iranian wild facial expression recognition SSN Self-taught student network
IoT Internet of Things SWE Stationary wavelet entropy
KNN K-nearest neighbor TFEID Taiwanese facial expression image database
JAFFE Japanese female facial expression TFD Toronto faces dataset
applied in healthcare with AI-empowered recognition to recog- grammed behavior rather than the experience of a sentient
nize patients’ needs for medications or assist physicians in being. The expressions shown by robots’ faces are not reflex-
inquiring as to which patient may require more attention. ive but rather comprise a communication interface. In man-
Methods to exploring patients’ emotions for better health sys- agerial or social human interaction, expressions can deliver a
tem outcomes are being developed owing to their observed vast amount of information quite rapidly through the con-
positive impacts in several medical fields. Automatic FER traction of facial muscles in response to a particular action
can assist doctors in operating smart centers to detect stress or question [14]. For instance, if an individual asks a certain
and depression among patients for health purposes. This question or asks for permission to perform some action, a
approach may also help patients recognize psychological prob- response can be delivered through the movement of the
lems related to existing or previous medications [13]. Hospitals eye muscles or head pose. Similarly, a person’s state can
worldwide have begun to incorporate AI to handle patients’ be easily understood and discovered by observing only their
medication schedules as researchers have focused on applying facial appearance and muscle movements in response to a
neural networks to perform FER on patients. particular action. Thus, automatic FER methods are needed
to enable computational systems to accurately gauge a per-
1.1. Managerial and social implications of FER son’s mood. Regarding this, the proposed survey covers the
aspects of FER systems and their challenges in detail as a
Human expressions can show or conceal a variety of com- step toward the development of improved expression recog-
plex cognitive processes. Facial expressions elicit a rapid nition systems.
response and often imitate emotions. These effects occur
on peoples’ faces in a natural way and can be easily 1.2. Applications of FER
observed. By contrast, people recognize the expressions per-
formed by robots but understand that they exhibit pro- In this section, FER applications are discussed in detail.
Fig. 2 Work flow of this survey.
1.2.1. FER for the prognosis and diagnosis of neurological 1.2.2. FER in security
disorders FER also plays an important role in security, where the malicious
FER is widely utilized in rehabilitation to help and monitor intentions of criminal suspects or perpetrators may be recognized
the patients; herein, the emotions of the patients are analyzed by analyzing their expressions [17]. At present, ubiquitous surveil-
to help and provide medical care. Similarly, the doctors or a lance has been implemented using security cameras installed in
leading counsel can judge their patients or clients’ emotional various locations, such as subways, markets, and stores. These
states from their appearance and body movements to note camera feeds can be used to detect and analyze individual’s facial
damaged or affected parts of their body. Patients inpatient emotions. These systems can identify suspicious activity, which
care can be treated on a priority basis by capturing data on can thus be prevented beforehand [18].
their state and moods through FER. Similarly, FER has been
incorporated to facilitate the prognosis and diagnosis of neu- 1.2.3. FER for learning
rological disorders (i.e., brain conditions or diseases), such as Educators can adjust their style of presentation according to
stroke, multiple sclerosis, and Parkinson’s disease [10,15]. This learners by understanding learner’s emotional expressions of
enables clinicians to evaluate the mood of patients with neuro- their internal states. Students’ enthusiasm may be improved
logical disorders. For example, a patient may express inappro- by understanding their feelings in classroom or laboratory
priate or excessive emotions to express their state of mind or work [19].
conditions. Therefore, recognizing these emotions is of value Numerous groups are rapidly working on developing FER
in monitoring patients via smartphone cameras [16]. technology to improve performance and ensure real-time pro-
cessing capability in various potential applications. Research- resented FER in two ways, namely message-based and facial-
ers must confront several issues and challenges due to the sen- component movement-based methods. They further catego-
sitive nature of changes in facial expressions. This survey rized message-based methods into discrete and continuous
provides information on the development of platforms for dimensional methods. According to their review, the discrete
FER methods to show how they can be generalized to deliver categorical method is a long-standing method that has been
a compact representation and learning terminology. Studies on widely adopted by psychologists to describe emotions. Simi-
FER are limited and largely describe only particular methods, larly, the continuous method was adopted from psychology;
with little or no focus on the deployment of such models on it describes emotions in terms of continuous axes of a multidi-
mobile platforms, such as edge devices and smart phones. Fur- mensional space. By contrast, movement-based components
ther, to the best of our knowledge, no detailed overview of use the movement of facial muscles for expression encoding.
deep learning and AI-based methods applied to this task has Similarly, Rajan et al. [21] covered FER techniques, the con-
been conducted. ventional classifiers used for FER classification, and FER
To overcome the existing challenges faced by current datasets. Another recently published survey [22] considered
surveys, this study provides a comprehensive survey of the an in-depth study of FER datasets and their creation, and sub-
development and implementation of FER technologies, as sequently properly aligned all the steps of conventional FER
shown in Fig. 2. The main contributions of this study are processes. Further, they overviewed the deep networks,
summarized as follows. sequential learning mechanisms, issues related to FER, and
Contributions challenges faced by the researchers during FER; next, they
highlighted some possible directions for future research on
1. To the best of our knowledge, this survey is the first to pro- FER. These details are given Table 2.
vide a thorough taxonomy of recent literature on FER that Finally, this study presents the main contributions of this
considers deep learning, conventional learning, hybrid survey. The proposed survey presents a thorough FER taxon-
approaches, and edge vision by analyzing the patterns of omy and the most recent FER literature developed for medical
these works. In addition, the manner in which the FER applications targeting patients with Parkinson’s disease,
has been considered is described from a medical perspec- stroke, multiple sclerosis, and CFS. Similarly, the preprocess-
tive, such as for monitoring patients with Parkinson’s dis- ing, main architecture steps, and evaluation metrics used to
ease, stroke, or dementia. evaluate the performance of FER methods are extensively dis-
2. Existing surveys are largely limited to methods deployed to cussed. Furthermore, the current challenges and issues in FER,
cloud computing or PC setups. However, this study covers and the directions for future research on FER are presented.
both edge- and cloud-based FER methods. In addition, dif-
ferent platforms and products are investigated for this pur- 3. Working flow of FER
pose. Further, an extensive set of information on debates
on FER methods targeting the diagnosis of various dis- This section describes the stepwise working flow of FER for
eases, as well as the corresponding journal details, their real-time processing of the generic pipeline of FER, as shown
impact, and the number of citations are provided. in Fig. 3, and the details of the working procedure of the FER
3. A general framework followed by the FER methods is pre- are given in Fig. 4. A comprehensive discussion of each step of
sented. The datasets and challenges faced by researchers in the pipeline are provided below.
this field are discussed comprehensively. Furthermore,
these challenges are addressed by suggesting some promis- 3.1. Data acquisition and preprocessing
ing directions for future research.
The remainder of this paper is organized as follows. Sec- Data collection through vision sensors and preprocessing are
tion 2 focuses on existing surveys and their downsides. Sec- essential steps. The data are typically acquired from different
tion 3 covers the working flow of FER systems in detail sources such as Pi Cam devices, mobile phones, or surveillance
while considering of deep learning and conventional learning cameras. Different data variations, such as illumination, head
methods. Section 4 discusses existing FER datasets and some poses, and background, are common in uncertain scenarios.
associated challenges. Section 5 sheds light on FER challenges Therefore, before training a recognition model, preprocessing
and research guidelines. Finally, Section 6 concludes the paper is applied to normalize and align the visual semantic informa-
with some final remarks and suggests possible avenues for tion of the faces. Several face alignment techniques, such as
future research. holistic [26], part-based [27,28], DL-based [29–31], and cas-
caded alignment [32–34], have been widely applied for this
purpose.
2. Overview of the existing FER literature
State-of-the-art AI-models contain a considerable number
of parameters, typically in the order of millions. A sufficient
This section explains recently published articles that have sur- amount of training data is required to ensure the generalizabil-
veyed FER technology. This survey discusses the contributions ity of such models. However, most existing datasets available
and disadvantages of these previous articles and compares the for training are insufficient for this purpose. To overcome this
proposed article with state-of-the-art FER surveys. First, the challenge, FER methods must apply data augmentation
work presented by Zhang et al. [20] explained the advance- techniques.
ments made in the creation of FER datasets and technique Data augmentation methods are designed to expand the
development. They focused primarily on occlusion problems size of a dataset and its diversity by applying random pertur-
and studied their effects on FER systems. Moreover, they rep- bations, such as image shifting, skew, rotation, adding noise,
Table 2 Comparative analysis of the present work with existing recent surveys in terms of their categorization as considering deep
learning (DL), conventional learning (CL), and hybrid approaches (HA).
Ref Year Platform Categorization Contributions Remarks
PC Edge DL CL HA
[20] 2018 U ✗ U U ✗ -Data creation, technique development, and -Only partial occlusion is widely considered.
occlusion problem are investigated for FER –No workflow mechanism is provided to
systems and associated challenges are describe FER steps.
discussed. –No comparative study of mainstream FER
surveys.
[21] 2019 U ✗ ✗ U ✗ -FER techniques, classifiers, and datasets are -Most traditional FER techniques are
surveyed. Some discussion on face detection covered.
methods and features extraction is provided. -A concrete and easily understandable
framework is missing.
[23] 2020 U ✗ ✗ ✗ ✗ -Three aspects regarding to 3D FER such as -The entire paper is based only on the
face structure and its preprocessing and occlusion problem under conditions of real-
classification are investigated. time emotion recognition.
[24] 2021 U ✗ U U ✗ -FER methods based on CNN are widely –No coverage of challenges in FER. Methods
focused on with applications of FER. are limited to CNN techniques only.
[25] 2022 U U ✗ U ✗ Major steps including preprocessing, features -Most popular challenges are not covered.
extraction, and classification are explained. Further, directions and recommendations for
future research are not provided.
Our 2023 U U U U U -A thorough taxonomy of FER and the most -Widely focused on FER literature and
recent FER literature is covered. Next, both properly categorizing the FER algorithms as
edge- and cloud-based FER methods are DL, CL, and HL techniques. Open challenges
highlighted. An extensive set of discussions on in FER are discussed, along with
journals, citations, and FER applications is recommendations for future work.
performed.
Fig. 3 Generic pipeline of FER (in the case of conventional learning).
and image scaling. More unseen training samples [35] can be detection, including feature-based [42], knowledge-based [43],
generated through combinations of multiple operations that and appearance-based methods [44] as well as template match-
ensure a model’s robustness to rotated and deviated faces [36]. ing [45]. In knowledge- or rule-based methods, the human face
is described via defined rules and the representation depends
3.2. ROI detection entirely on how the rules are proposed. Similarly, feature
invariant methods use different types of features, such as
Region of interest (ROI) detection (in this study, the face) is human eyes or nose, for face detection. However, this tech-
also referred to as facial detection. ROI detection is performed nique can be negatively affected by light and noise. In template
by AI-based techniques to identify and locate faces in images. matching, an image is compared with features that were previ-
These methods have been widely adopted in several applica- ously stored or compared with standard face patterns and cor-
tions, such as security [37], law enforcement [38], entertain- related for face detection. Furthermore, appearance-based
ment [39], and personal safety [40] which involve tracking or techniques apply machine learning or statistical analysis to
surveillance. They have advanced considerably from rudimen- identify important face characteristics and have been widely
tary vision techniques to enhanced machine learning and arti- applied to perform emotion recognition.
ficial neural networks (ANN) [41]. Facial detection is A major improvement in face detection occurred in 2001
performed using conventional machine learning or deep learn- when Viola and Jones proposed a face detection framework
ing approaches. Several techniques have been studied for face with high accuracy [46]. They proposed the use of Haar-like
Fig. 4 Working flow of FER techniques using conventional and deep learning techniques. First, the data acquired from any source, such
as Raspberry Pi, onboard camera or mobile phone camera devices, is fed into the face detection step. The second step performs face
detection. The detected face is forwarded to the emotion recognition step.
features to detect faces. The algorithm observes numerous to feature extraction. A mapped LBP feature was proposed
small subregions and attempts to determine a face by looking in [56] for illumination-invariant FER. SIFT [57] features that
for specific features in each subregion. It passes through are robust against image rotation and scaling are employed for
numerous different positions and scales because an image multiview FER tasks. Combining several descriptors of tex-
may contain several faces of various sizes. ture, orientation, and color and using them as inputs helps
The Viola–Jones algorithm remains popular for the detec- enhance the performance of network [58,59].
tion of faces in real time but fails when a face is masked or cov- Similarly, part-based representation extracts features by
ered by a scarf, or may be limited when a face is not oriented removing noncritical parts from the image and exploiting the
or aligned properly. Therefore, to avoid such problems in con- key parts that are sensitive to the task. The authors in [60]
ventional techniques and improve face detection algorithms, reported that three regions of interest (ROIs), including the
deep learning algorithms, such as R-CNN [47], SSD [48], eyes, mouth, and eyebrows, are predominantly related to vari-
VGG-Face [49], FaceNet [50], have been developed. Among ations in emotion. Table 3 highlights recently published con-
these, R-CNN was initially introduced for object detection ventional machine learning FER methods.
and is significant for its capability of achieving high CNN
accuracy on classification task in face detection tasks. 3.3.2. Deep learning-based FER
Recently, deep learning has attracted considerable attention
3.3. Emotion recognition for research interest, and has achieved state-of-the-art perfor-
mance in numerous applications in a wide variety of fields
After face detection and ROI extraction, the flow proceeds to [78] such as computer vision [79,80], and time-series analysis
the FER stage. Numerous techniques, including conventional and prediction [81]. Deep learning attempts to capture high-
and deep-learning methods, are available for this. In conven- level abstractions via hierarchical networks comprising numer-
tional approaches, to conduct feature extraction, FER meth- ous nonlinear representations and transformations. Unlike
ods use hand-crafted feature engineering techniques, and the conventional learning for FER, where the feature extraction
extracted features are subsequently fed into the classifier. By and classification steps are independent, deep networks per-
contrast, deep learning approaches can automatically extract form FER in an end-to-end manner. In particular, a loss layer
features and perform classification in an end-to-end manner, is inserted at the network end to control the generated back-
where a loss layer is substituted to the end of the network to propagation error. Thus, the prediction probability obtained
regulate the backpropagation error. for each sample is directly produced as an output by the net-
work. Typically, in a CNN, the SoftMax loss function is used.
3.3.1. Conventional learning-based FER techniques In particular, these models aim to minimize the cross-entropy
Conventional learning approaches include HOG [51], SVM of the model across the entire training dataset. This is achieved
[52], SURF [53], SIFT [54], and Naive Bayes [55]. Conven- by calculating the average cross-entropy loss across all training
tional practices use hand-crafted feature engineering tech- examples and then back-propagating the loss through the net-
niques, such as preprocessing and data augmentation, prior work to optimize the defined loss function by tuning the
Table 3 FER methods based on conventional machine learning techniques with their contributions and corresponding training
datasets.
Ref Technique Contributions Dataset
[17] ORB, SVM -ORB features were extracted and fed into an SVM. MMI, JAFFE
[61] CNN, BoVW, -Features from a CNN were combined with handcrafted features FER-2013, FER+, AFFECTNET
SVM computed using BOVW. -SVM is applied for final classification.
[62] LPDP -An edge descriptor LPDP was developed which considered statistical CK+, MMI, FACES, ISED, GEMEP-
details of pixel neighborhoods to collect meaningful and reliable FERA, BU-3DFE
information.
[63] FERAtt -An end-to-end architecture which focused on human faces was CK+, BU-3DFE
proposed.
-The model applied a Gaussian space representation to recognize an
expression.
[64] CNN -Four-staged deep learning architectures were proposed. RAFD
-The first three networks segmented the essential facial components,
whereas the fourth combined the holistic facial information for better
robustness.
[65] CNN, C4.5 -Features from CNN are combined with C4.5. JAFFE, CK+, FER2013, RAFD
classifier
[66] SCN -SCN is proposed to efficiently suppresses uncertainties to prevent the RAFD, AFFECTNET, FERPLUS
network from overfitting.
-This suppression enabled a self-attention mechanism and careful
relabeling to perform well.
[67] FACS -FACS was developed to measure human facial behavior based on N/A
muscle movement.
[68] N/A -Bias and fairness were systematically investigated through three RAFD, CELEBA
approaches such as attribute-aware, baseline, and disentangled
approaches.
[69] 3D CNN -Deep spatiotemporal features were extracted based on deep appearance CK+, MMI, FERA
and neural network.
[70] CNN -An activation function was proposed for CNN models, and a piecewise JAFFE, FER-2013
activation technique was proposed for the procedure of FER tasks.
[71] LBP -An end-to-end network using an attention mechanism was proposed. JAFFE, OULU-CASIA, NCUFE, CK+
-The network comprised features extraction, attention module,
reconstruction module, and classification module components.
[72] N/A An FER system validation study was performed for a school in this NA
method.
[73] LBP, MSAU-Net -Fine-grained FER in the wild was primarily considered and FG- FG-EMOTIONS, CK+, MMI, FER-
Emotion was proposed. 2013, RAFD-BASIC, RAFD-
-FG-Emotions provided several features such as LBP and dense COMPOUND
trajectories that facilitated the research.
[74] Channel State -A system based on Wi-Fi signals known as WiFace was developed for CSI (PRIVATE DATA)
Information FER.
Processing -Series of algorithms were developed to process the channel state
information signal to extract the most representative waveform
patterns.
[75] KNN, NB, SVM, -A system for FER based on multi-channel, electro-encephalogram, and N/A
RF multi-modal physiological signals was developed.
[76] HOG, SVM -TV-series were considered for human behavior analysis using facial KDEF
expressions.
-The authors detected and tracked faces using the Viola-Jones and
Kanade-Lucas-Tomasi (KLT) algorithms
-They extracted HOG features and classified the expression using an
SVM model.
[77] EMM, KTN, SSN -A supervised objective AdaReg loss and a re-weighting category was RAFD, AFFECTNET, FERPLUS
proposed to address class imbalance and increase discrimination
expression power.
parameters of the network. In addition to end-to-end net- Furthermore, the works [84,85] presented a covariance descrip-
works, DNN models can be used to extract features. Subse- tor computed via deep CNN features, and its classification was
quently, a traditional classifier, such as an SVM or RF performed by Gaussian kernels on a symmetric positive defini-
model, is applied to the extracted feature descriptor [82,83]. tion. Table 4 highlights recently published conventional
Table 4 FER methods-based on deep learning mechanism with their contributions and data usage.
[99] CNN, MTCNN -MTCCN was used for face detection, while features were extracted via EMOTIW
ResNet-64 and were classified at a large margin; a softmax loss was used
for discriminative learning.
[100] CNN -A method based on the LeNet-5 architecture, comprising five trainable CK+
parameter layers, two subsampling, and a fully connected layer, was
proposed.
-A SoftMax function was used for the final FER classification.
[101] PHRNN, MSCNN -A deep evolutional spatial–temporal network (composed of PHRNN CK+, OULU-CASIA, MMI
and MSCNN) was used to extract the partial-whole, geometry-
appearance, and dynamic-still information, thus effectively improving the
performance of FER.
[102] LSTM-CNN -For the facial label prediction, the authors used LSTM-CNN. CK+, DISFA
[103] 3D inception-ResNet- -A model with layers of an Inception-ResNet model were followed by an CK+, MMI, FERA, DISFA
LSTM LSTM unit was proposed.
-This method extracted temporal and spatial relations within facial images
between different frames in video
[104] LSTM-CNN -Using temporal dependencies, the LSTMs were stacked. GFT, BP4D
-Outputs of CNN and LSTM were aggregated into a fusion network for
per-frame prediction.
[105] CNN -A prepressing step was used to clean and augment the data. CK+, JAFFE, BU-3DFE
-Subsequently, a CNN was used for feature extraction and classification.
[106] CNN -Four layers of CNN were used for features extraction and classification. FER-2013
[107] CNN, ACNN -A CNN with ACNN was proposed to perceive occlusion regions in the RAFD, AFFECTNET,
face and emphasize the most discriminative un-occluded regions. SFEW, CK+, MMI, OULU-
CASIA
[108] CNN-RNN -A hybrid CNN and RNN model was used for FER. JAFFE, MMI
[109] GoogLeNet, AlexNet -The performance of two different models was compared for FER. FER-2013
[110] Pre-trained CNN -Pre-trained state-of-the-art models were used for FER. CK+, JAFFE, FACES
Inception, VGG, VGG-
Face
[111] ConvNet, FaceNet -Facial parts were focused on based on depth learning in the field of LFW FACE
biometrics
[112] 3D and 2D CNN 3D FER was developed to accurately extract parts of face. BU-3DFE
[113] SWE and FNN -FER based on Jaya algorithm was performed, using SWE for features PRIVATE DATA: 700 FER
extraction and an FNN for classification. IMAGES.
[114] AlexNet CNN, FER- -Five different techniques for real-time basic expression recognition from CK+, KDEF
CNN, SVM, MLP images were compared.
[115] Hybrid CNN-SVM -Humanoid robot for real-time FER was proposed based on KDEF, CK+
convolutional self-learning feature extraction and an SVM classifier.
[116] FMPN -An FER framework called FMPN was proposed, in which a branch was CK+, MMI, AFFECTNET
introduced for facial mask generation to focus on muscle movement
regions.
[117] NA -Features extracted from an appearance-based network were fused with CK+, JAFFE
geometric features in hierarchical manner.
[118] Spatial CNN, Temporal -A hybrid deep learning model was proposed for FER. BAUM-1, RML, MMI
CNN -Two CNNs models, including Spatial and Temporal CNNs, were
investigated for FER.
[119] Ensembles of CNNs -Different aspects of ensemble generation and other factors influencing FER-2013, CK+, SFEW
the FER performance were studied.
[120] CNN -An FER approach was presented using a CNN. FER-2013
[121] SIFT, CNN -Features were extracted from SIFT and CNN. CK+, MMI
[122] Deep CNN -Different deep learning methods were employed, with a CNN selected as JAFFE
the best algorithm for FER.
[123] CNN -A framework that combines the discriminative features learned via CNN CK+
and handcrafted features was proposed.
[124] CNN, SVM -SIFT and deep features from CNN for FER were combined and CK+
classified by SVM.
[125] Light-CNN -Three CNN models, namely, the light-CNN, dual-branch CNN, and pre- CK+, BU-3DFE, FER-2013
trained CNN models, were used to extract features for FER.
[126] CNN -A CNN was employed for FER. FER-2013
[127] CNN -An FER system was developed based on a CNN model with data CK+, FER-2013, MUG
augmentation
Table 4 (continued)
[128] CNN -The Viola–Jones algorithm was applied for face detection, CLAHE for JAFFE, CK+
image enhancement, DWT to extract the features, and CNN for learning.
[129] DAM-CNN -A model called DAM-CNN was introduced for FER to automatically JAFFE, CK+, TFEID,
locate expression-based regions. BAUM-2I, SFEW
[130] CNN -Handcrafted features were proposed with a multi-stream structure to CK+, MUG, IWFER
improve performance.
[131] CNN, LBP -The abstract facial features learned via a deep CNN were fused with the ORL, CMU-PIE, FERET,
modified LBP features. FACE-SCRUB FACE
[132] DCNN -A two-staged framework based on a DCNN was proposed that was CK+, BU-4DFE
inspired by the nonstationary nature of facial expressions.
[133] MDSTFN -A multi-channel network was proposed to fuse and learn spatiotemporal CK+, RAFD, MMI
features for FER.
-An optical flow was extracted from the changes between the neutral and
peak expression.
[134] CNN, Auto encoder, -A CNN-based pre-trained model was used in core cloud to extract deep RML, ENTERFACE’05
SVM features.
[135] CNN, ELM, SVM -Speech signal was processed to obtain a mel-spectrogram treated as an PRIVATE DATA
image. The spectrogram was fed into a CNN.
-The most representative frames were provided to a CNN model and were
fused with the output obtained from another CNN model.
[136] CNN, EDLM -Based on ensemble learning model, an algorithm was proposed FER-2013, JAFFE,
comprising three sub-networks with different depths. AFFECTNET
-The sub-networks comprised CNN models that were trained separately.
[137] PNN, CNN, Residual -A PNN model designed to combine texture features was applied for CK+
Network, Capsule FER.
Network -This network was constructed using CNN, capsule network, and residual
network models.
[138] CNN -The impact of CNN parameters, such as kernel size and number of filters, FER-2013
was investigated for FER.
[139] CNN -A vectorized CNN model introducing the attention mechanism to extract CK+, FER2013
features in ROI of face was proposed. AFFECT-NET, JAFFE
-ROIs were marked before feeding them into the network.
[93] CNN, LSTM -An FER algorithm was proposed based on a multilayer maxout linear JAFFE, CK+
activation function to initialize CNN and LSTM models.
[140] CNN, LSTM -A framework based on CNN and LSTM structures was developed. CK+, MMI, SFEW
-Images were preprocessed and input to the CNN architecture.
[141] Fast R-CNN -A video-based infant monitoring system was proposed to analyze infant PRIVATE DATA
expressions.
-The expressions included discomfort, joy, unhappiness, and neutrality.
-The system was based on Fast R-CNN.
[142] CNN, LBP -A system for FER was proposed based on CNN and LBP models. FER-2013
[143] CNN-BDLSTM -An enhanced DNN framework was reported for pain intensity detection VGG-FACE
via facial expression image using four level thresholds.
[144] CNN -A CNN-based FER system was proposed from facial images considering JAFFE, CK+
edge computing.
-The authors trained the model in the cloud and tested the trained model
on edge devices.
[145] LGIN -A LGIN model proposed that was designed to learn to identify an RML, ENTERFACE,
underlying graph structure to recognize emotions. RAVDESS
[146] Transfer learning -A pre-trained CNN was utilized recognize facial emotions. CK+, JAFFE
[147] Firefly algorithm -An FER technique was proposed based on the firefly algorithm, which CK+, JAFFE, MMI
was mainly used for feature optimization.
[148] HOG, Deep CNN -A DNN model was proposed for real-time FER. KAGGLE FER DATASET
-The model was able to detect, track, and classify the human face with
high performance.
[149] Fusion Technique -Facial expressions were localized based on audio and video frames. RML AUDIO-VISUAL
-A network for audio recognition and facial recognition was proposed. DATABASE
-Both the networks were assembled as fusion network.
[150] Hybrid 3D CNN, RNN -A DNN was proposed for FER based on videos and a network was used AFEW-6.0, HAPPEI
for audio as well.
[151] VGGNet, ResNet, -First, the structure of CNN models was studied. Next, four different FER-2013
GoogleNet, AlexNet CNNs models were applied to recognize human emotion.
(continued on next page)
Table 4 (continued)
[152] DNN -A DNN was proposed for the classification of facial expression based on JAFFE, CK+
a naturalistic dataset.
[153] LBP, ANN -LBP was implemented for feature extraction from images. JAFFE, TFEID, CK+
-GRNN was implemented for the classification of FER based on frame
features.
[154] LSM-RNN, SVM -FER was performed based on LSTM-RNN and SVM models. EMOTIW-2015
[155] Deep learning methods -A DNN was proposed based on a webcam for a smart TV environment FER-2013, CK+
to recognize human facial expressions.
[156] DNNRL -A deep learning method with relativity learning was proposed. FER-2013, SFEW-2.0
-This model learned a mapping from the original images into a Euclidean
space, where relative distances corresponded to a measure of facial
expression similarity.
[157] CNN -A deep CNN was presented for accurate detection of human face FER-2013, JAFFE
expressions.
[158] CFS based on landmark -An ANN model was presented to classify facial expressions. N/A
and ANN -A points/landmark technique was applied to enhance the performance of
the ANN.
[159] DNN -Multiple DNNs were presented to detect face expressions and combine SFEW-2.0, FER-2013, TFD,
their performance. GENKI
machine learning methods. Table 5 summarizes FER for dif- the current output. However, training RNNs is challenging
ferent edge devices and platforms for different application set- owing to the vanishing or exploding gradient problem; this is
tings. Some of these methods were developed to be deployed a situation in which the network is unable to propagate gradi-
over IoT devices; a detailed explanation of the libraries, train- ents from the output end of the model back to the layers near
ing, settings, and other experiments involved is included in the the input end of the model. A solution to this problem is the
same table. long short-term memory (LSTM) networks, a category of
As discussed above, directly training deep networks on rel- RNNs that can learn long-term dependencies. LSTMs have a
atively small FER datasets leads to problems of overfitting. To chain-like structure comprising memory cells, which include
mitigate this problem, several studies have applied pre-training four neurons each, designed to interact in a very special way.
techniques, wherein popular networks such as AlexNet [86], Gated recurrent unit (GRU) models are a variation of the
VGG-face [49], and VGG [87] are pre-trained on benchmark LSTM architecture. GRU models use fewer training parame-
datasets (such as ImageNet), and their last layers are fine- ters and, therefore, less memory. GRUs execute computations
tuned to adapt the network to a particular task. The authors faster compared with LSTM models, whereas LSTM is more
of [88] experimented with the VGG-Face model, which was ini- accurate for larger datasets. Existing state-of-the-art results
tially trained for face recognition, and then fine-tuned using have been obtained using LSTM or GRU networks. Training
the FER 2013 dataset. The results of their experiments such networks for FER further improves performance. A
revealed that the VGG-Face model was more suitable for the sequence of frames is provided to an LSTM [90] or GRU
FER task, compared with other networks that were pre- [91] network to learn variations in facial expressions and deter-
trained on the ImageNet dataset, which was developed for mine a person’s emotional or mental state. Some of these
object recognition. Similarly, [89] observed that pre-training methods are listed in Table 4.
on large emotion recognition datasets positively affected
recognition performance, and found that fine-tuning with 3.3.2.2. CNN-LSTM and CNN-GRU. Several pre-trained
more FER data could improve performance. models based on CNN architectures and other related variants
Existing techniques commonly adopt RNN models and have been developed and trained for FER. These networks
their variants to recognize emotions in sequences of video include self-encoder and CNN models as well as confidence
frames. Hybrid connections with ConvNets models have networks. They typically exhibit a strong capability for auto-
achieved remarkable performance in several real-world appli- mated feature learning but have no ability to capture contex-
cations. Details of these networks are provided in the following tual time information. For this purpose, several variants
subsections. RNN models have been combined with CNNs to improve
their performance on FER takes such as CNN-LSTM [92–
3.3.2.1. LSTM and GRU. To capture the temporal dependen- 94], CNN-GRU [95]. Such networks obtain richer and more
cies of sequential data, deep recurrent networks, particularly discriminative expression information from facial expression
LSTMs, have achieved promising performance. Recurrent sequences by eliminating the influence of differences and the
neural networks (RNNs) are neural networks that contain cyc- external environment to improve recognition accuracy. In
lic connections (loops). This characteristic enables them to these networks, the CNN extracts deep visual information,
learn the temporal dynamics of sequential data well. RNNs and the LSTM learns to synthesize and identify the temporal
can connect past information to the present task to predict dynamic sequence details. These networks focus on the influ-
Table 5 FER over different edge and IoT platforms along with recent products.
Ref/Paper Description Platform
[77] -Training was performed on an NVIDIA TITAN Xp GPUs and deployed on a phone. Smartphone
[144] -Three prototypes were used.
-The first prototype was an end device implemented on Android version 10, and the second was an edge
component implemented using CUDA 10.0-enabled NVIDIA GeForce RTX 2070 8 GB GPU drivers with
cuDNN v7.6 for deep learning models. The final result was a communication component with two parts, one
running on a smartphone using Apache HttpClient to communicate with server and the other is running in the
server with Django.
[145] -PyTorch was used with an NVIDIA RTX-2080Ti GPU for experiments.
[160] -An algorithm implemented in Python with PyTorch and OpenCV was used for the preprocessing operations on
the images. The training of the CNN took approximately one hour with a single NVIDIA Titan X GPU.
-To run the trained model on mobile device, it was converted into ONNX format and used ONNX-CoreML to
obtain a CoreML model for use on iOS v1 1 or higher.
[161] -A smartphone app was used to analyze facial expressions and to construct a classifier to predicts emotional states
in mobile settings.
-In a testing phase, the feasibility of the approach was demonstrated for certain emotions using a person-
dependent classifier.
[74] -The proposed model was easily deployable to smartphone devices.
[105] N/A
[162] N/A
[144] N/A Raspberry
Pi
[163] N/A Samsung S3
[137] -The Python programming language on a GTX1070 GPU was used to train the model. IoT devices
-A model was proposed for IoT; however, the device was not defined.
[134] -The model was proposed for IoT; however, the device was not defined.
[135] -The model was proposed for edge devices; however, the device was not defined. Edge
[136] -The model was proposed for IoT devices; however, the device was not defined. devices
Different Products
Product Link Platform
Name
AffdexMe [AffdexMe on the App Store (apple.com)] IPhone,
IPad
MorphCasto [MorphCast - Facial Expression and Emotion Recognition AI | Face Emotion Analysis] Mac/Apple
Emotient [20 + Emotion Recognition APIs That Will Leave You Impressed, and Concerned | Nordic APIs |] Apple
Affectiva Smart
phone
ence of micro-expression recognition. Some of these methods clouds. The output emotion is generally one of seven emotions:
are listed in Table 4. happy, angry, fear, disgust, sad, surprise, or neutral. Perfor-
mance is evaluated using several metrics, including precision,
3.3.2.3. CNN-BDLSTM and CNN-BIGRU. BDLSTM and accuracy, recall, specificity, and F1-score. (Eq. (1)–(6)) More-
Bidirectional GRU (BIGRU) are extensions of traditional over, the method uses a confusion matrix that consists of true
LSTM and GRU architectures, respectively; they improve positive (TP), true negative (TN), false positive (FP), and false
the performance of learning models for more effective FER. negative (FN) rates. Similarly, models are analyzed in terms of
BDLSTM trains two LSTM, and the sequence is processed their real-time deployment and sentiment analysis. The time
in both the forward and backward directions. Thus, an addi- complexity and FER model size were investigated for real-
tional context is provided to the network, which results in fas- time deployment on edge devices.
ter learning of the sequence of an expression. Therefore, for TP þ TN
FER, a CNN is inserted at the end as a hybrid connection Accuracy ¼ ð1Þ
TP þ TN þ FP þ FN
to help the model to deeply process the changes evident in
facial expressions. These hybrid connection models include TP
CNN-BDLSTM [96,97] and CNN-BIGRU [98]. Table 4 lists Precision ¼ ð2Þ
ðTP þ FPÞ
some of the hybrid methods.
TP
3.4. Output emotion and evaluation Recall ¼ ð3Þ
ðTP þ FNÞ
Once an FER model is competent in distinguishing different TPR þ TNR

AccuracyB ¼ ð4Þ
expressions in real-time, it is deployed on edge devices or 2
TN CK+ [165]: CK+ is a widely used laboratory-controlled

Specificity ¼ ð5Þ
ðTN þ FPÞ dataset for evaluating FER methods. This dataset consists
of sequences that convert a neutral expression into a peak
Precision Recall facial appearance. For assessment purposes, data selection
F1 score ¼ 2 ð6Þ
Precision þ Recall is performed by selecting the latter one or two frames that
contain peak information.
4DFAB [166]: This is a large-scale expression dataset with
4. FER datasets and associated statistics
1,800,000 high-resolution 3D facial images recorded from
180 subjects captured in distinct sessions. Videos of subjects
For the effective and efficient design of deep ER, FER meth- are presented in 4D dynamics, displaying both posed and
ods require a large, labeled training dataset that includes spontaneous facial variations of six basic emotions.
numerous variations of the surrounding environment and face MMI [167]: This is another laboratory-controlled dataset.
structures. In this section, the public benchmark FER data- However, unlike CK+, the sequences of frames present
bases that include basic facial expressions, which are widely in MMI are labeled in terms of onset and offset. For exam-
used in the studied papers, are discussed. Table 6 provides sev- ple, the frame sequence may start with a neutral expression,
eral FER datasets that have been widely used in different and reach a peak between the first and last neutral
applications, such as security and law-enforcement, and expressions.
Fig. 5 depicts frame samples from each expression of the JAFFE [168]: This contains approximately 213 samples of
FER datasets. facial expressions taken from ten Japanese women, with
3–4 images corresponding to each individual showing six
KDEF [164]: KDEF consists of 4900 total set of human basic expressions with a neutral expression.
expression images, where the averaged KDEF (AKDEF) ExpW [169]: The Expression in the Wild dataset consists of
is an image set proposed from the original KDEF. This 91,793 faces that are downloaded from Google where each
dataset was announced in 1998, and since then, it has been face image was manually annotated with a category from
publicly available. KDEF has been applied in more than among seven expressions. Images with no clear facial
1500 research articles. appearance were removed.
Table 6 Overview of the FER dataset with some statistical details.

Dataset Classes Samples Individuals Link
KDEF/AKDEF 7 4900 272 [https://www.kdef.se/]
[164]
CK+ [165] Sequence: 593 123 [https://www.consortium.ri.cmu.edu/ckagree/]
4DFAB [166] 1.8 million 3D 180 N/A
faces
GEMEP FERA 5 750,000 10 [https://www.cs.nott.ac.uk/]
[183]
MMI [167] 7 Images: 740 25 [https://mmifacedb.eu/]
Videos: 2900
JAFFE [168] Images: 213 10 [https://www.kasrl.org/jaffe.html]
TFD [184] Images: 112,234 N/A [josh@mplab.ucsd.edu]
ExpW [169] Images: 91,793 [https://mmlab.ie.cuhk.edu.hk/projects/socialrelation/index.html]
FER-2013 [170] Images: 35,887 [https://www.kaggle.com/c/challenges-in-representation-learning-facial-
expression-recognition-challenge]
AFEW [171] Videos: 1809 [https://sites.google.com/site/emotiwchallenge/]
SFEW [172] Images: 1766 [https://cs.anu.edu.au/few/emotiw2015.html]
BP4D+ [185] 8 N/A 140 [https://suny.technologypublisher.com/]
Multi-PIE [173] 6 Images: 755,370 337 [https://www.flintbox.com/public/project/4742/]
EB+ [186] N/A — N/A [https://www.cs.binghamton.edu/]
BU 3DFE [174] 7 Images: 2500 100 [https://www.cs.binghamton.edu/lijun/Research/3DFE/3DFEAnalysis.html]
BU 4DFE [175] 3D sequences: 101
606
Oulu CASIA 6 Image sequence: 80 [https://www.cse.oulu.fi/CMV/Downloads/Oulu-CASIA]
[176] 2880
RAF-DB [177] 7 Images: 29,672 N/A [https://www.whdeng.cn/RAF/model1.html]
KDEF [187] Images: 4900 70 [https://www.emotionlab.se/kdef/]
EmotioNet [178] 23 Images: N/A [https://cbcsl.ece.ohio-state.edu/dbformemotionet.html]
1000,000
CASME II [179] N/A N/A N/A [https://fu.psych.ac.cn/]
AffectNet [180] 7 Images: 450,000 1 million [https://mohammadmahoor.com/databases-codes/]
HAPPEI [181] 6 Images: 4886 greater [https://cs.anu.edu.au/few/Group.htm]
than1
The existing FER challenges are comprehensively discussed in detail.
Fig. 5 Visual representation of facial expressions from different well-known datasets: (a) Cohn_kanade, (b) JAFFE, (c) MMI, (d)
KDEF, and (e) BU-3DFE.
FER-2013 [170]: This is an unrestrained large-scale dataset BU 4DFE [175]: This dataset is used to analyze facial
collected from the API of Google image search, wherein the actions from static 3D space to dynamic 3D space. It con-
images were registered and resized to 48 48 pixels after tains 606 3D expression sequences in approximately 60,600
discarding incorrectly labeled frames. This dataset consists frames.
of 35,887 total images with seven emotion labels. Oulu CASIA [176]: This includes 2880 sequences obtained
AFEW [171]: This dataset consists of videos clips gathered from 80 individuals, of which each video was recorded
from movies with impulsive expressions, diverse head and processed by either infrared or visible light systems
poses, illuminations, and occlusions. This is a multimodal installed with three distinct illumination settings. The initial
dataset that provides a wide range of environmental condi- frame shows a neutral expression, while the peak expression
tions for video and audio. is given in the last frame. The initial frame with neutral
SFEW [172]: This dataset was gathered from the static expression and the last three frames from 480 videos deliv-
frames of the AFEW dataset. The most commonly applied ered by the visible light system under illumination were
version of SFEW 2.0 comprise three sets: training, testing, investigated experimentally.
and validation. These labels are publicly accessible. RAF-DB [177]: The real-world affective face dataset (RAF-
Multi-PIE [173]: This dataset comprises 755,370 images DB), contains 29,672 diverse ranges of facial images col-
ranging from 337 subjects with 19 illumination conditions lected from different sources on the Internet. Seven are
up to four recorded sessions and 15 viewpoints, where each basic, and eleven are compound emotion labels that were
face image is labeled as one of six expressions. A multiview manually annotated.
FER can be achieved using this dataset. EmotionNet [178]: This is a large dataset with one million
BU 3DFE [174]: The BU 3DFE consists of 606 emotion facial expressions collected from the Internet, of which
sequences captured from 100 individuals. The six expres- 950,000 images were annotated using an automatic detec-
sions were developed from each subject in different man- tion model in [178] and 25,000 images are annotated via
ners consisting of multiple intensities. Multi-PIE is also 11 automatic detections.
applicable to multiview FER analyses. CASME II [179]: This is also laboratory-controlled dataset
from which roughly 3000 facial movements, 247 expres-
sions were chosen for the dataset with action units labeled. cantly affects the results. Changes in illumination can drasti-
The samples showed spontaneous and dynamic expressions. cally change facial appearance. Hence, the difference between
AffectNet [180]: This consists of more than one million two faces captured under different illuminations is higher than
images gathered from the Internet by querying different that of two distinct faces captured under the same illumina-
search engines with search terms related to emotions. tion. This issue makes FER particularly challenging and has
HAPPEI [181]: The happy people images is provided to attracted attention over the last few decades. Numerous algo-
evaluate the intensity of happiness in a group of people. rithms have been proposed to handle illumination, and they
This dataset contains 4886 samples sourced from Flickr broadly involve three distinctions. The first approach deals
using keywords that are associated with groups of people with image processing methods that are helpful for the normal-
and occasions, such as parties, marriages, reunions, and ization of faces with distinct lighting effects. For this purpose,
bars. All collected samples contained more than one indi- histogram equalization (HE) [188,189], logarithm transforms
vidual subject that was annotated with group-level mood. [190], or gamma intensity correlation [189] have been consid-
Synthetic FER Dataset [182]: Existing techniques have vari- ered. Another approach is 3D facial modeling. Researchers
ous limitations, such as sharpness, translation of distinct in [191,192] suggested that a face viewed form the front with
images, and preservation of identity. These issues are addressed different illumination creates a cone known as the illumination
via the texture deformation-based generative adversarial net- cone. Similarly, in the third approach, the features of the face
work, which disentangles the texture from a new image and are extracted where they are illuminated, and the features are
based on the extracted textures, and transfers the domains. subsequently forwarded for recognition.
Challenges in FER Datasets: Several challenges and issues 5.1.1.2. Face pose. Face pose is another major challenge; FER
related to FER datasets, such as a lack of large-scale expres- systems are very sensitive to slight changes in pose. The face pose
sion data, image quality, and size, widely influence the recog- varies with the head movement and changes in viewing angle.
nition of emotion in both indoor and outdoor conditions. The head movement or variation in the camera point of view
Numerous solutions have been applied to overcome these chal- can cause changes in the facial appearance, thus creating intra-
lenges. If the images are of very low quality, a diverse range of class variations and considerably decreasing the performance of
cleansing and smoothing filters can improve the quality of the FER methods [193]. However, despite the powerful recognition
frames and thus increase the accuracy of FER. Typically, data- rate of CNN models to extract features, their recognition rate
sets contain a limited amount of data. However, as deep learn- decreases significantly with the introduction of face poses [194].
ing models require large-scale data for training, data The human face is roughly shaped like a convex spheroid, and
augmentation methods have been exploited to improve the pose leads to the self-occlusion phenomenon and reduces the
diversity of training data and assist in training the network. FER accuracy. Therefore, performing FER reliably for different
head posed remains a significant challenge.
5. Challenges and future research directions
5.1.1.3. Occlusion. Occlusion refers to cases in which a certain
This section explains some notable challenges and identifies part of the face is not visible or is hidden. Occlusions occur
possible directions for future research. because of beards, accessories, moustaches, masks, and so forth.
The presence of such components makes the subjects more
5.1. FER challenges diverse and can causes recognition systems to fail. Owing to
the complex and variable environment in which a face is pre-
sented, occlusion may change significantly. Occlusions in FER
Defining an expression as representative of a certain emotion can be can be categorized into temporary and systematic [20]. Tempo-
difficult even for humans. Studies have shown that different people rary occlusions occur when the face portion are temporarily
recognize different types of emotions in the same facial expression. obscured by other objects; for example, a hand-covering face,
FER involves numerous challenges, such as the fact that diverse people moving across the face, or different environmental
training data are required, as well as imagery with diverse back- changes, such as lightening and shadows. Sometimes, self-
grounds, different genders, and different nationalities, etc. occlusion may occurs owing to variation in head pose. Whereas,
systematic occlusion is produced by the occurrence of individual
5.1.1. Scarcity of FER datasets facial components, such as hair, scars, or a moustaches [195].
Existing publicly available datasets do not suffice for effective
FER, nor are they sufficiently diverse. These problems require 5.1.1.4. Ageing. Human facial features tend to change with age,
effective solutions, such as data augmentation, combination of such as, lines, shapes, and some other aspects. Recognizing
several datasets, modification of existing data, or creating a emotions in such cases is a very challenging, and solving this
new dataset [79]. Typically, complex deep learning models problem requires a considerable amount of training data. Con-
are extremely ‘‘data-hungry,” and require data in different sidering the age of the face, the majority of the mainstream
forms for more effective and easier training. This solution research has investigated whether posed facial expressions
avoids the overfitting problem in training the network. There- are decoded less accurately compared to young people faces
fore, FER requires data where the expression should be cap- [196,197], regardless of the expression. This occurs with facial
tured from all possible angles for effective outcomes. muscle contraction and the actual landmark change [198]. Ear-
lier literature attempted to discuss the decline in the recogni-
5.1.1.1. Illumination. Illumination refers to light variation from tion of expressions in several ways. For example, older
different or single angles. A slight change in light conditions is people are presumed to focus on the lower half of their face
a significant challenge for emotion recognition and signifi-
during communications. Therefore, they can fail FER, which 5.2.1. Surveillance-scaled FER datasets
is expressed primarily in the eye regions. As the focus of FER research shifts toward challenging in-
the-wild environmental conditions, several researchers have
5.1.1.5. Low resolution. Low-resolution images or videos in focused on deep learning technologies designed to handle dif-
FER systems represent another challenge. The minimum reso- ficulties such as occlusions, illumination problems, nonfrontal
lution for a standard image is 16 16, whereas an image less poses, and recognition of lower-intensity emotion. As FER is
than 16 16 is considered as low resolution for FER. Images a data-driven task in which the training of a deep network
with low resolution lead to the loss of feature information requires a large amount of training data to capture subtle
extracted via traditional techniques and the degradation of facial expression-related deformations, the lack of large-
better recognition. Similarly, the feature distribution changes scale training data is a major challenge in terms of quality
with a reduction in the resolution. This reduction occurs and quantity. Owing to different genders and cultures, emo-
because of the limitations in the quality of the camera equip- tions are interpreted in different ways. An ideal dataset must
ment and the distance of the person from the lens; therefore, include images with precise facial attribute labels, along with
the captured face image has different resolutions. Image other attributes, such as gender, race, ethnicity, and age, thus
super-resolution technology can recover high-resolution facilitating related research on different genders, distant age
images from low-resolution images with rich information ranges, and distinct cultural FER via deep learning methods,
[199–201]. Some studies [202] have used image super- such as transfer learning approaches and deep networks.
resolution to enhance low-resolution images for better FER. Similarly, existing FER datasets are widely captured using
normal cameras, whereas FER patterns are only recorded
5.2. Recommendations in terms of regular patterns. Models trained on such data
are less effective in recognizing expressions in surveillance
A thorough investigation of FER methods throughout the lit- footage or expressions that occur far from the camera view-
erature reveals numerous drawbacks and limitations that need point. The problem of occlusion and face pose has also
to be solved and addressed. A summary of these limitations is attracted significant attention to overcome the scarcity of a
provided in Table 7, and a detailed discussion of these limita- diverse range of FER datasets covering different head-
tions and future research directions is provided below. posing annotations and surveillance-based captured
expressions.
Table 7 Summarized form of limitation/drawbacks in existing 5.2.2. FER with lower computational resources
FER. Combining edge computing with the deep learning technolo-
# Terms Remarks gies is expected to further enhance data processing and ensure
real-time processing to provide instant decisions. FER over
1 Bias and imbalanced Bias and inconsistency exist in
data distribution annotations that occur owing to an edge improves connectivity and security, and the data
different conditions and subjectivity of are processed over the edge. Edge intelligence further
annotations. Therefore, the algorithms improves the network control of data and communication
using intra-datasets lack management, and helps reduce the time delay. Thus, the
generalizability on unseen data and FER is performed with less computation, and the decision
exhibit reduced performance. is made on the same platform where the entire processing
2 Single modalities Humans with different behaviors in the is performed. For the FER domain, this may be considered
real world include an encoding from a ‘‘missing concept” of performing recognition of emotions
various perspectives, whereas facial
at the edge and making real-time decisions. Similarly, several
expressions in existing methods are
devices can be clustered, thereby forming an IoT-assisted net-
based primarily on single modality.
3 Head motions, These variations widely effect the work, where all devices are interconnected and share infor-
illumination, and performance of the FER methods, mation [17]. Such methods enable complex applications to
aging particularly in videos and 2D images, be executed on the network edge with limited process power
whereas 3D data is somewhat robust to [203].
such variations.
4 Dependency FER algorithms are dependent 5.2.3. FER via E2E
predominantly on large number of
Although a various technique that choose the learned features
features points.
5 Manual intervention Although FER methods are automatic, for FER as a prerequisite step can be found in the literature,
several systems still require deep networks or models that obtain a single video image as
intervention. input and process it directly to generate the type of expression
6 Age Most methods do not consider the time are lacking. The FER literature lacks such end-to-end (E2E)
and effects of age. deep CNN models that can directly process frames and pro-
7 Dissimilarity in data Facial data exhibit a high degree of vide real-time expressions. Thus, the development of such
dissimilarity, and FER systems can models is highly recommended in the future for FER with sat-
accurately recognize the expressions isfactory accuracy. Such networks are intended to process
only for faces similar learned in
frames or sequences of frames from the camera through differ-
training.
ent convolutional layers and pooling layers. These models are
8 Action Units- (AU) Detection of AU or combination of
several AUS has not been addressed. expected to be relatively user-friendly, easy to operate, and
employed for real-time FER.
5.2.4. Group expression analysis siders deep learning and AI using edge modules to ensure
Recognition of emotions of a single individual is compara- efficiency. To this end, numerous studies have contributed to
tively easy for deep network models. However, a collective the literature on FER. Most existing FER surveys focus on
and group emotion may positively provide a thorough sce- the features and characteristics of emotions from methods with
nario of the ongoing action to analyze the mood and examine different application directions. However, they have ignored
the subject’s actions and probable gestures. Therefore, a group the challenges of existing datasets and their solutions. Further-
FER method, wherein the overall expression of all individuals more, most studies do not provide any direction or motivation
is computed, is required. AI-based deep models should be pro- towards the edge/IoT setup for facial emotion recognition. In
posed and fine-tuned for this purpose. Similarly, deep models this study, the existing FER techniques were surveyed, and the
are developed for deployment on the network edge to be easily relevant literature was thoroughly analyzed and surveyed,
equipped in a learning class or workplace. essentially highlighting the FER working flow, integral and
intermediate steps of most methods, as well as pattern struc-
5.2.5. FER everywhere tures and limitations in existing FER surveys. In contrast to
current surveys, the FER for edge vision (that is, on mobile
The exposure of FER-based code and implementation
devices such as smartphones or Raspberry Pi computers) has
resources is a very important consideration in future research
been deliberately examined, and different FER evaluation tac-
owing to its positive impact on real-world applications [76].
tics have been comprehensively discussed. Finally, a discussion
Although several techniques either introduce a novel way of
on the challenges in FER along with some possible directions
learning expressions using hybrid frameworks or modified
for future research were presented.
ER-based systems, such methods are limit to applications in
In the future, we plan to provide a detailed comparative
homes, organizations, or other private sectors. Their imple-
analysis of FER methods applied for different purposes by
mentation and related resources are private and unavailable
exploring their implementation resources and algorithms.
for the development of real-time FER systems. Therefore, pub-
Our efforts will focus on the investigation and inclusion of
licizing source codes along with all the resources used on dif-
FER in security, performance on edge devices, precision, and
ferent websites, including GitHub and ‘‘Papers with Code,”
so forth. Similarly, data from different genders, races, and sce-
is highly recommended for effective usage by the FER
narios are not widely available; therefore, we plan to explore
researchers.
such datasets and evaluate their performance in terms of dif-
ferent aspects considering different modalities.
5.2.6. Federated learning
Federated learning (FL) is a novel concept in machine learn- Declaration of competing interest
ing; herein, an algorithm is dispersed among other edge devices
or servers storing the data sample locally without exchanging it The authors declare that they have no known competing
[204]. This procedure is different from the commonly applied financial interests or personal relationships that could have
centralized algorithms that require all local datasets to be appeared to influence the work reported in this paper.
loaded on a single server [205]. This learning enables the model
to gain more experience from a wide range of datasets at dif-
ferent locations. Features are extracted from both audio and Acknowledgement
images [204] and the collected information recognizes facial
expressions. This research was funded by the European Union through
the Horizon 2020 Research and Innovation Program, in the
5.2.7. AML for FER context of the ALAMEDA (Bridging the Early Diagnosis
In adversarial machine learning (AML), adversaries act as and Treatment Gap of Brain Diseases via Smart, Connected,
malicious inputs designed to ensure that the model fails to pre- Proactive and Evidence-based Technological Interventions)
dict the correct labels. In recent years, AML has become a cru- project under grant agreement No GA 101017558. This work
cial part of computer vision tasks, such as FER, object is also partially funded by FCT/MCTES through national
detection, and activity recognition. In [206], an AML approach funds and when applicable co-funded EU funds under the Pro-
was proposed that provides anonymity for individual subjects ject UIDB/50008/2020; and by the Brazilian National Council
whose expressions have to be recognized by applying convolu- for Scientific and Technological Development-CNPq, via
tional transformation, which degrades the individual relevant Grant No. 313036/2020-9.
data for fully connected layers. The output was passed to
two classifiers to recognize the expression. References
6. Conclusion [1] Y. Nan, J. Ju, Q. Hua, H. Zhang, B. Wang, A-MobileNet: An

approach of facial expression recognition, Alex. Eng. J. 61 (6)
Facial/emotion Recognition in real-time has a wide range of (2022) 4435–4444.
applications in healthcare, security, mood analysis, and safety [2] Z. Li, T. Zhang, X. Jing, Y. Wang, Facial expression-based
analysis on emotion correlations, hotspots, and potential
measurements. Numerous studies have been conducted on this
occurrence of urban crimes, Alex. Eng. J. 60 (1) (2021) 1411–
topic in the form of proposals, techniques, networks, and sur- 1420.
veys. Computational FER mimics human coding skills and [3] K. Mannepalli, P.N. Sastry, M. Suman, A novel adaptive
conveys important cues that complement speech to assist lis- fractional deep belief networks for speaker emotion
teners. Similarly, the latest progress in FER development con- recognition, Alex. Eng. J. 56 (4) (2017) 485–497.
[4] G. Tonguç, B.O. Ozkara, Automatic recognition of student network: A review, J. Soft Comput. Data Min. 2 (1) (2021)
emotions from facial expressions during a lecture, Comput. 53–65.
Educ. 148 (2020) 103797. [25] I.M. Revina, W.S. Emmanuel, A survey on human face
[5] S.S. Yun, J. Choi, S.K. Park, G.Y. Bong, H. Yoo, Social skills expression recognition techniques, J. King Saud Univ.-
training for children with autism spectrum disorder using a Comput. Inform. Sci. 33 (6) (2021) 619–628.
robotic behavioral intervention system, Autism Res. 10 (7) [26] T. Cootes, J. Edwards, C. Taylor, Active apperance models.
(2017) 1306–1323. ieee transactions on pattern analysis and machine intelligence,
[6] H. Li, M. Sui, F. Zhao, Z. Zha, and F. Wu, ‘‘Mvt: Mask vision IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (1998) 681685.
transformer for facial expression recognition in the wild,” arXiv [27] X. Zhu and D. Ramanan, ‘‘Face detection, pose estimation, and
preprint arXiv:2106.04520, 2021. landmark localization in the wild,” in 2012 IEEE conference on
[7] X. Liang, L. Xu, W. Zhang, Y. Zhang, J. Liu, Z. Liu, A computer vision and pattern recognition, 2012: IEEE, pp. 2879-2886.
convolution-transformer dual branch network for head-pose [28] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, ‘‘Robust
and occlusion facial expression recognition, Vis. Comput. discriminative response map fitting with constrained local
(2022) 1–14. models,” in Proceedings of the IEEE conference on computer
[8] M. Jeong, B.C. Ko, Driver’s facial expression recognition in vision and pattern recognition, 2013, pp. 3444-3451.
real-time for safe driving, Sensors 18 (12) (2018) 4270. [29] Y. Sun, X. Wang, and X. Tang, ‘‘Deep convolutional network
[9] K. Kaulard, D.W. Cunningham, H.H. Bülthoff, C. Wallraven, cascade for facial point detection,” in Proceedings of the IEEE
The MPI facial expression database—a validated database of conference on computer vision and pattern recognition, 2013, pp.
emotional and conversational facial expressions, PLoS One 7 3476-3483.
(3) (2012) e32321. [30] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and
[10] M.R. Ali, T. Myers, E. Wagner, H. Ratnu, E. Dorsey, E. alignment using multitask cascaded convolutional networks,
Hoque, Facial expressions can detect Parkinson’s disease: IEEE Signal Process Lett. 23 (10) (2016) 1499–1503.
preliminary evidence from videos collected online, npj Digital [31] F.U.M. Ullah, M.S. Obaidat, A. Ullah, K. Muhammad, M.
Med. 4 (1) (2021) 1–4. Hijji, S.W. Baik, A Comprehensive Review on Vision-based
[11] Y. Du, F. Zhang, Y. Wang, T. Bi, J. Qiu, Perceptual learning of Violence Detection in Surveillance Videos, ACM Comput.
facial expressions, Vision Res. 128 (2016) 19–29. Surv. (2023) 1–44.
[12] A. A. Varghese, J. P. Cherian, and J. J. Kizhakkethottam, [32] X. Xiong and F. De la Torre, ‘‘Supervised descent method and
‘‘Overview on emotion recognition system,” in 2015 its applications to face alignment,” in Proceedings of the IEEE
International Conference on Soft-Computing and Networks conference on computer vision and pattern recognition, 2013, pp.
Security (ICSNS), 2015: IEEE, pp. 1-5. 532-539.
[13] M. Egger, M. Ley, S. Hanke, Emotion recognition from [33] S. Ren, X. Cao, Y. Wei, and J. Sun, ‘‘Face alignment at 3000
physiological signal analysis: A review, Electron. Notes Theor. fps via regressing local binary features,” in Proceedings of the
Comput. Sci. 343 (2019) 35–55. IEEE Conference on Computer Vision and Pattern Recognition,
[14] G. Mattavelli et al, Facial expressions recognition and 2014, pp. 1685-1692.
discrimination in Parkinson’s disease, J. Neuropsychol. 15 (1) [34] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic,
(2021) 46–68. ‘‘Incremental face alignment in the wild,” in Proceedings of
[15] B. Sonawane, P. Sharma, Review of automated emotion-based the IEEE conference on computer vision and pattern recognition,
quantification of facial expression in Parkinson’s patients, Vis. 2014, pp. 1859-1866.
Comput. 37 (5) (2021) 1151–1167. [35] W. Li, M. Li, Z. Su, and Z. Zhu, ‘‘A deep-learning approach to
[16] Y.-S. Lee, W.-H. Park, Diagnosis of Depressive Disorder facial expression recognition with candid images,” in 2015 14th
Model on Facial Expression Based on Fast R-CNN, IAPR International Conference on Machine Vision Applications
Diagnostics 12 (2) (2022) 317. (MVA), 2015: IEEE, pp. 279-282.
[17] M. Sajjad, M. Nasir, F.U.M. Ullah, K. Muhammad, A.K. [36] Z. Yu and C. Zhang, ‘‘Image based static facial expression
Sangaiah, S.W. Baik, Raspberry Pi assisted facial expression recognition with multiple deep network learning,” in
recognition framework for smart security in law-enforcement Proceedings of the 2015 ACM on international conference on
services, Inf. Sci. 479 (2019) 416–431. multimodal interaction, 2015, pp. 435-442.
[18] Y. Huang, X. Li, W. Wang, T. Jiang, and Q. Zhang, ‘‘Towards [37] J. Tan et al, Face detection and verification using lensless
cross-modal forgery detection and localization on live cameras, IEEE Trans. Comput. Imaging 5 (2) (2018) 180–194.
surveillance videos,” in IEEE INFOCOM 2021-IEEE [38] R. Ranjan et al, A fast and accurate system for face detection,
Conference on Computer Communications, 2021: IEEE, pp. 1- identification, and verification, IEEE Trans. Biometrics,
10. Behavior, Identity Sci. 1 (2) (2019) 82–96.
[19] K. Wang, X. Peng, J. Yang, D. Meng, Y. Qiao, Region [39] C. Hong, J. Yu, J. Zhang, X. Jin, K.-H. Lee, Multimodal face-
attention networks for pose and occlusion robust facial pose estimation with multitask manifold deep learning, IEEE
expression recognition, IEEE Trans. Image Process. 29 (2020) Trans. Ind. Inf. 15 (7) (2018) 3952–3961.
4057–4069. [40] G. Sikander, S. Anwar, Driver fatigue detection systems: A
[20] L. Zhang, B. Verma, D. Tjondronegoro, V. Chandran, Facial review, IEEE Trans. Intell. Transp. Syst. 20 (6) (2018) 2339–2352.
expression analysis under partial occlusion: A survey, ACM [41] Z.-Q. Zhao, P. Zheng, S.-T. Xu, X. Wu, Object detection with
Computing Surveys (CSUR) 51 (2) (2018) 1–49. deep learning: A review, IEEE Trans. Neural Networks Learn.
[21] S. Rajan, P. Chenniappan, S. Devaraj, N. Madian, Facial Syst. 30 (11) (2019) 3212–3232.
expression recognition techniques: a comprehensive survey, [42] W. Kim, S. Suh, J.-J. Han, Face liveness detection from a single
IET Image Proc. 13 (7) (2019) 1031–1040. image via diffusion speed model, IEEE Trans. Image Process.
[22] S. Li and W. Deng, ‘‘Deep facial expression recognition: A 24 (8) (2015) 2456–2465.
survey,” IEEE transactions on affective computing, 2020. [43] S. Zafeiriou, C. Zhang, Z. Zhang, A survey on face detection in
[23] G.R. Alexandre, J.M. Soares, G.A.P. Thé, Systematic review of the wild: past, present and future, Comput. Vis. Image
3D facial expression recognition methods, Pattern Recogn. 100 Underst. 138 (2015) 1–24.
(2020) 107108. [44] H. Yang, L. Liu, W. Min, X. Yang, X. Xiong, Driver yawning
[24] S.M.S. Abdullah, A.M. Abdulazeez, Facial expression detection based on subtle facial action recognition, IEEE
recognition based on deep learning convolution neural Trans. Multimedia 23 (2020) 572–583.
[45] T. Zhang, J. Li, W. Jia, J. Sun, H. Yang, Fast and robust [66] K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, ‘‘Suppressing
occluded face detection in ATM surveillance, Pattern Recogn. uncertainties for large-scale facial expression recognition,” in
Lett. 107 (2018) 33–40. Proceedings of the IEEE/CVF Conference on Computer Vision
[46] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. and Pattern Recognition, 2020, pp. 6897-6906.
Computer Vision 57 (2) (2004) 137–154. [67] B. Waller, E. Julle-Daniere, J. Micheletta, Measuring the
[47] W. Wu, Y. Yin, X. Wang, D. Xu, Face detection with different evolution of facial ‘expression’using multi-species FACS,
scales based on faster R-CNN, IEEE Trans. Cybern. 49 (11) Neurosci. Biobehav. Rev. 113 (2020) 1–11.
(2018) 4017–4028. [68] T. Xu, J. White, S. Kalkan, H. Gunes, Investigating bias and
[48] R. Ranjan, V.M. Patel, R. Chellappa, Hyperface: A deep multi- fairness in facial expression recognition, in: European
task learning framework for face detection, landmark Conference on Computer Vision, Springer, 2020, pp. 506–523.
localization, pose estimation, and gender recognition, IEEE [69] D. Jeong, B.-G. Kim, S.-Y. Dong, Deep joint spatiotemporal
Trans. Pattern Anal. Mach. Intell. 41 (1) (2017) 121–135. network (DJSTN) for efficient facial expression recognition,
[49] O. M. Parkhi, A. Vedaldi, and A. Zisserman, ‘‘Deep face Sensors 20 (7) (2020) 1936.
recognition,” 2015. [70] Y. Wang, Y. Li, Y. Song, X. Rong, The influence of the
[50] F. Schroff, D. Kalenichenko, and J. Philbin, ‘‘Facenet: A activation function in a convolution neural network model of
unified embedding for face recognition and clustering,” in facial expression recognition, Appl. Sci. 10 (5) (2020) 1897.
Proceedings of the IEEE conference on computer vision and [71] J. Li, K. Jin, D. Zhou, N. Kubota, Z. Ju, Attention
pattern recognition, 2015, pp. 815-823. mechanism-based CNN for facial expression recognition,
[51] P. Carcagnı̀, M. Del Coco, M. Leo, C. Distante, Facial Neurocomputing 411 (2020) 340–350.
expression recognition and histograms of oriented gradients: a [72] M. Andrejevic, N. Selwyn, Facial recognition technology in
comprehensive study, Springerplus 4 (1) (2015) 1–25. schools: Critical questions and concerns, Learn. Media
[52] L. Chen, C. Zhou, L. Shen, Facial expression recognition based Technol. 45 (2) (2020) 115–128.
on SVM in E-learning, Ieri Procedia 2 (2012) 781–787. [73] L. Liang, C. Lang, Y. Li, S. Feng, J. Zhao, Fine-grained facial
[53] Q. Rao, X. Qu, Q. Mao, and Y. Zhan, ‘‘Multi-pose facial expression recognition in the wild, IEEE Trans. Inf. Forensics
expression recognition based on SURF boosting,” in 2015 Secur. 16 (2020) 482–494.
international conference on affective computing and intelligent [74] Y. Chen, R. Ou, Z. Li, K. Wu, WiFace: Facial Expression
interaction (ACII), 2015: IEEE, pp. 630-635. Recognition Using Wi-Fi Signals, IEEE Trans. Mob. Comput.
[54] H. Soyel and H. Demirel, ‘‘Improved SIFT matching for pose (2020).
robust facial expression recognition,” in 2011 IEEE [75] J. Zhang, Z. Yin, P. Chen, S. Nichele, Emotion recognition
International Conference on Automatic Face & Gesture using multi-modal data and machine learning techniques: A
Recognition (FG), 2011: IEEE, pp. 585-590. tutorial and review, Information Fusion 59 (2020) 103–126.
[55] N. Sebe, M.S. Lew, I. Cohen, A. Garg, T.S. Huang, Emotion [76] M. Sajjad, S. Zahir, A. Ullah, Z. Akhtar, K. Muhammad,
recognition using a cauchy naive bayes classifier, Object Human behavior understanding in big multimedia data using
recognition supported by user interaction for service robots, CNN based facial expression recognition, Mobile Networks
vol. 1, IEEE, 2002, pp. 17–20. Appl. 25 (4) (2020) 1611–1621.
[56] G. Levi and T. Hassner, ‘‘Emotion recognition in the wild via [77] H. Li, N. Wang, X. Ding, X. Yang, X. Gao, Adaptively
convolutional neural networks and mapped binary patterns,” Learning Facial Expression Representation via CF Labels and
in Proceedings of the 2015 ACM on international conference on Distillation, IEEE Trans. Image Process. 30 (2021) 2016–2028.
multimodal interaction, 2015, pp. 503-510. [78] L. Deng, D. Yu, Deep learning: methods and applications,
[57] D. G. Lowe, ‘‘Object recognition from local scale-invariant Found. Trends Signal Processing 7 (3–4) (2014) 197–387.
features,” in Proceedings of the seventh IEEE international [79] F.U.M. Ullah, M.S. Obaidat, K. Muhammad, A. Ullah, S.W.
conference on computer vision, 1999, vol. 2: Ieee, pp. 1150-1157. Baik, F. Cuzzolin, J.J. Rodrigues, V.H. de Albuquerque, An
[58] Z. Luo, J. Chen, T. Takiguchi, and Y. Ariki, ‘‘Facial intelligent system for complex violence pattern analysis and
Expression Recognition with deep age,” in 2017 IEEE detection, Int. J. Intell. Syst. 37 (12) (2022) 10400–10422.
International Conference on Multimedia & Expo Workshops [80] F.U.M. Ullah, K. Muhammad, I.U. Haq, N. Khan, A.A.
(ICMEW), 2017: IEEE, pp. 657-662. Heidari, S.W. Baik, V.H. de Albuquerque, et al, AI assisted
[59] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, A.M. Dobaie, Edge Vision for Violence Detection in IoT based Industrial
Facial expression recognition via learning deep sparse Surveillance Networks, IEEE Trans. Ind. Inf. 18 (8) (2021)
autoencoders, Neurocomputing 273 (2018) 643–649. 5359–5370.
[60] L. Chen, M. Zhou, W. Su, M. Wu, J. She, K. Hirota, Softmax [81] F.U.M. Ullah, N. Khan, T. Hussain, M.Y. Lee, S.W. Baik,
regression based deep sparse autoencoder network for facial Diving Deep into Short-Term Electricity Load Forecasting:
emotion recognition in human-robot interaction, Inf. Sci. 428 Comparative Analysis and a Novel Framework, Mathematics 9
(2018) 49–61. (6) (2021) 611.
[61] M.-I. Georgescu, R.T. Ionescu, M. Popescu, Local learning [82] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson,
with deep and handcrafted features for facial expression ‘‘CNN features off-the-shelf: an astounding baseline for
recognition, IEEE Access 7 (2019) 64827–64836. recognition,” in Proceedings of the IEEE conference on
[62] F. Makhmudkhujaev, M. Abdullah-Al-Wadud, M.T.B. Iqbal, B. computer vision and pattern recognition workshops, 2014, pp.
Ryu, O. Chae, Facial expression recognition with local prominent 806-813.
directional pattern, Signal Process. Image Commun. 74 (2019) 1–12. [83] J. Donahue et al., ‘‘Decaf: A deep convolutional activation
[63] P. D. M. Fernandez, F. A. G. Pena, T. I. Ren, and A. Cunha, feature for generic visual recognition,” in International
‘‘Feratt: Facial expression recognition with attention net,” conference on machine learning, 2014: PMLR, pp. 647-655.
arXiv preprint arXiv:1902.03284, vol. 3, 2019. [84] D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool,
[64] G. Yolcu et al, Facial expression recognition for monitoring ‘‘Covariance pooling for facial expression recognition,” in
neurological disorders based on convolutional neural network, Proceedings of the IEEE Conference on Computer Vision and
Multimed. Tools Appl. 78 (22) (2019) 31581–31603. Pattern Recognition Workshops, 2018, pp. 367-374.
[65] Y. Wang, Y. Li, Y. Song, X. Rong, Facial expression [85] N. Otberdout, A. Kacem, M. Daoudi, L. Ballihi, and S.
recognition based on random forest and convolutional neural Berretti, ‘‘Deep covariance descriptors for facial expression
network, Information 10 (12) (2019) 375. recognition,” arXiv preprint arXiv:1805.03869, 2018.
[86] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet [105] A.T. Lopes, E. de Aguiar, A.F. De Souza, T. Oliveira-Santos,
classification with deep convolutional neural networks, Adv. Facial expression recognition with convolutional neural
Neural Inf. Proces. Syst. 25 (2012) 1097–1105. networks: coping with few data and the training sample
[87] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional order, Pattern Recogn. 61 (2017) 610–628.
networks for large-scale image recognition,” arXiv preprint [106] H. Yar, T. Jan, A. Hussain, and S. Din, ‘‘Real-Time Facial
arXiv:1409.1556, 2014. Emotion Recognition and Gender Classification for Human
[88] H. Kaya, F. Gürpınar, A.A. Salah, Video-based emotion Robot Interaction Using CNN,” ed.
recognition in the wild using deep transfer learning and score [107] Y. Li, J. Zeng, S. Shan, X. Chen, Occlusion aware facial
fusion, Image Vis. Comput. 65 (2017) 66–75. expression recognition using CNN with attention mechanism,
[89] B. Knyazev, R. Shvetsov, N. Efremova, and A. Kuharenko, IEEE Trans. Image Process. 28 (5) (2018) 2439–2450.
‘‘Convolutional neural networks pretrained on large face [108] N. Jain, S. Kumar, A. Kumar, P. Shamsolmoali, and M. J. P.
recognition datasets for emotion classification from video,” R. L. Zareapoor, ‘‘Hybrid deep neural networks for face
arXiv preprint arXiv:1711.04598, 2017. emotion recognition,” vol. 115, pp. 101-106, 2018.
[90] Z. Yu, G. Liu, Q. Liu, J. Deng, Spatio-temporal convolutional [109] P. Giannopoulos, I. Perikos, I. Hatzilygeroudis, Deep learning
features with nested LSTM for facial expression recognition, approaches for facial emotion recognition: A case study on
Neurocomputing 317 (2018) 50–57. FER-2013, in: Advances in hybridization of intelligent
[91] R.-H. Huan, J. Shu, S.-L. Bao, R.-H. Liang, P. Chen, K.-K. methods, Springer, 2018, pp. 1–16.
Chi, Video multimodal emotion recognition based on Bi-GRU [110] A. Sajjanhar, Z. Wu, and Q. Wen, ‘‘Deep learning models for
and attention fusion, Multimed. Tools Appl. 80 (6) (2021) facial expression recognition,” in 2018 digital image computing:
8213–8240. Techniques and applications (dicta), 2018: IEEE, pp. 1-6.
[92] B.T. Hung, L.M. Tien, Facial expression recognition with [111] X. Han, Q. Du, Research on face recognition based on deep
CNN-LSTM, in: Research in Intelligent and Computing in learning, in: Sixth International Conference on Digital
Engineering, Springer, 2021, pp. 549–560. Information, Networking, and Wireless Communications
[93] F. An, Z. Liu, Facial expression recognition algorithm based (DINWC), Beirut, 2018, pp. 53–58, https://doi.org/10.1109/
on parameter adaptive initialization of CNN and LSTM, Vis. DINWC.2018.8356995.
Comput. 36 (3) (2020) 483–498. [112] A. Jan, H. Ding, H. Meng, L. Chen, and H. Li, ‘‘Accurate
[94] W. M. S. Abedi, A. T. Sadiq, and I. Nadher, ‘‘Modified CNN- facial parts localization and deep learning for 3D facial
LSTM for Pain Facial Expressions Recognition,” 2020. expression recognition,” in 2018 13th IEEE International
[95] M. T. Vu, M. Beurton-Aimar, and S. Marchand, ‘‘Multitask Conference on Automatic Face & Gesture Recognition (FG
multi-database emotion recognition,” in Proceedings of the 2018), 2018: IEEE, pp. 466-472.
IEEE/CVF International Conference on Computer Vision, 2021, [113] S.-H. Wang, P. Phillips, Z.-C. Dong, Y.-D. Zhang, Intelligent
pp. 3637-3644. facial emotion recognition based on stationary wavelet entropy
[96] Z.-X. Liu, D.-G. Zhang, G.-Z. Luo, M. Lian, B. Liu, A new and Jaya algorithm, Neurocomputing 272 (2018) 668–676.
method of emotional analysis based on CNN–BiLSTM hybrid [114] A. Kartali, M. Roglić, M. Barjaktarović, M. Ðurić-Jovičić, and
neural network, Clust. Comput. 23 (4) (2020) 2901–2913. M. M. Janković, ‘‘Real-time Algorithms for Facial Emotion
[97] P. Du, X. Li, and Y. Gao, ‘‘Dynamic Music emotion Recognition: A Comparison of Different Approaches,” in 2018
recognition based on CNN-BiLSTM,” in 2020 IEEE 5th 14th Symposium on Neural Networks and Applications
Information Technology and Mechatronics Engineering (NEUREL), 2018: IEEE, pp. 1-4.
Conference (ITOEC), 2020: IEEE, pp. 1372-1376. [115] A. Ruiz-Garcia, M. Elshaw, A. Altahhan, V. Palade, A hybrid
[98] W. Yan, L. Zhou, Z. Qian, L. Xiao, H. Zhu, Sentiment deep learning neural approach for emotion recognition from
Analysis of Student Texts Using the CNN-BiGRU-AT Model, facial expressions for socially assistive robots, Neural Comput.
Sci. Program. 2021 (2021). & Applic. 29 (7) (2018) 359–373.
[99] L. Tan, K. Zhang, K. Wang, X. Zeng, X. Peng, and Y. Qiao, [116] Y. Chen, J. Wang, S. Chen, Z. Shi, and J. Cai, ‘‘Facial motion
‘‘Group emotion recognition with individual facial emotion prior networks for facial expression recognition,” in 2019 IEEE
CNNs and global image based CNNs,” in Proceedings of the Visual Communications and Image Processing (VCIP), 2019:
19th ACM International Conference on Multimodal Interaction, IEEE, pp. 1-4.
2017, pp. 549-552. [117] J.-H. Kim, B.-G. Kim, P.P. Roy, D.-M. Jeong, Efficient facial
[100] M. Mohammadpour, H. Khaliliardali, S. M. R. Hashemi, and expression recognition algorithm based on hierarchical deep
M. M. AlyanNezhadi, ‘‘Facial emotion recognition using deep neural network structure, IEEE Access 7 (2019) 41273–41285.
convolutional networks,” in 2017 IEEE 4th international [118] S. Zhang, X. Pan, Y. Cui, X. Zhao, L. Liu, Learning affective
conference on knowledge-based engineering and innovation video features for facial expression recognition via hybrid deep
(KBEI), 2017: IEEE, pp. 0017-0021. learning, IEEE Access 7 (2019) 32297–32304.
[101] K. Zhang, Y. Huang, Y. Du, L. Wang, Facial expression [119] ‘‘IEA, Electricity mix in the European Union, January-May
recognition based on deep evolutional spatial-temporal 2020, IEA, Paris https://www.iea.org/data-and-statistics/
networks, IEEE Trans. Image Process. 26 (9) (2017) 4193– charts/electricity-mix-in-the-european-union-january-may-
4203. 2020.”.
[102] D.K. Jain, Z. Zhang, K. Huang, Multi angle optimal pattern- [120] I. Talegaonkar, K. Joshi, S. Valunj, R. Kohok, A. Kulkarni,
based deep learning for automatic facial expression Real time facial expression recognition using deep learning, in
recognition, Pattern Recogn. Lett. (2017). Proceedings of International Conference on Communication and
[103] B. Hasani, M.H. Mahoor, Facial expression recognition using Information Processing (ICCIP), 2019.
enhanced deep 3D convolutional neural networks, in: in [121] X. Sun, M. Lv, Facial expression recognition based on a hybrid
Proceedings of the IEEE Conference on Computer Vision and model combining deep and shallow features, Cogn. Comput. 11
Pattern Recognition Workshops, 2017, pp. 30–40. (4) (2019) 587–597.
[104] W.-S. Chu, F. De la Torre, and J. F. Cohn, ‘‘Learning spatial [122] C. M. M. Refat and N. Z. Azlan, ‘‘Deep learning methods for
and temporal cues for multi-label facial action unit detection,” facial expression recognition,” in 2019 7th International
in 2017 12th IEEE International Conference on Automatic Face Conference on Mechatronics Engineering (ICOM), 2019:
& Gesture Recognition (FG 2017), 2017: IEEE, pp. 25-32. IEEE, pp. 1-6.
[123] X. Fan, T. Tjahjadi, Fusing dynamic deep learned features and Information Technology, Networking, Electronic and
handcrafted features for facial expression recognition, J. Vis. Automation Control Conference (ITNEC), 2020, vol. 1:
Commun. Image Represent. 65 (2019) 102659. IEEE, pp. 2304-2308.
[124] F. Wang, J. Lv, G. Ying, S. Chen, C. Zhang, Facial expression [143] G. Bargshady, X. Zhou, R.C. Deo, J. Soar, F. Whittaker, H.
recognition from image based on hybrid features Wang, Enhanced deep learning algorithm development to
understanding, J. Vis. Commun. Image Represent. 59 (2019) detect pain intensity from facial expression images, Expert Syst.
84–88. Appl. 149 (2020) 113305.
[125] J. Shao, Y. Qian, Three convolutional neural network models [144] G. Muhammad and M. S. Hossain, ‘‘Emotion Recognition for
for facial expression recognition in the wild, Neurocomputing Cognitive Edge Computing Using Deep Learning,” IEEE
355 (2019) 82–92. Internet of Things Journal, 2021.
[126] K.-C. Liu, C.-C. Hsu, W.-Y. Wang, and H.-H. Chiang, ‘‘Real- [145] A. Shirian, S. Tripathi, T. Guha, Dynamic Emotion Modeling
Time facial expression recognition based on cnn,” in 2019 with Learnable Graphs and Graph Inception Network, IEEE
International Conference on System Science and Engineering Trans. Multimedia (2021).
(ICSSE), 2019: IEEE, pp. 120-123. [146] D. Duncan, G. Shine, C. English, ‘‘Facial emotion recognition
[127] T. U. Ahmed, S. Hossain, M. S. Hossain, R. ul Islam, and K. in real time,”, Comput. Sci. (2016) 1–7.
Andersson, ‘‘Facial expression recognition using convolutional [147] T. Zhang, W. Jia, X. He, J. Yang, Discriminative dictionary
neural network with data augmentation,” in 2019 Joint 8th learning with motion weber local descriptor for violence
International Conference on Informatics, Electronics & Vision detection, IEEE Trans. Circuits Syst. Video Technol. 27 (3)
(ICIEV) and 2019 3rd International Conference on Imaging, (2016) 696–709.
Vision & Pattern Recognition (icIVPR), 2019: IEEE, pp. 336- [148] J. Jeon et al, A real-time facial expression recognizer using deep
341. neural network, in: Proceedings of the 10th international
[128] R.I. Bendjillali, M. Beladgham, K. Merit, A. Taleb-Ahmed, conference on ubiquitous information management and
Improved facial expression recognition based on DWT feature communication, 2016, pp. 1–4.
for deep CNN, Electronics 8 (3) (2019) 324. [149] S. Zhang, S. Zhang, T. Huang, W. Gao, Multimodal deep
[129] S. Xie, H. Hu, Y. Wu, Deep multi-path convolutional neural convolutional neural network for audio-visual emotion
network joint with salient region attention for facial expression recognition, in: Proceedings of the 2016 ACM on International
recognition, Pattern Recogn. 92 (2019) 177–191. Conference on Multimedia Retrieval, 2016, pp. 281–284.
[130] J.A. Aghamaleki, V. Ashkani Chenarlogh, Multi-stream CNN [150] Y. Fan, X. Lu, D. Li, Y. Liu, Video-based emotion recognition
for facial expression recognition in limited training data, using CNN-RNN and C3D hybrid networks, in: Proceedings of
Multimed. Tools Appl. 78 (16) (2019) 22861–22882. the 18th ACM international conference on multimodal
[131] F. Kong, Facial expression recognition method based on deep interaction, 2016, pp. 445–450.
convolutional neural network combined with improved LBP [151] Y. Gan, ‘‘Facial expression recognition using convolutional
features, Pers. Ubiquit. Comput. 23 (3) (2019) 531–539. neural network,” in Proceedings of the 2nd international
[132] J. Chen, Y. Lv, R. Xu, C. Xu, Automatic social signal analysis: conference on vision, image and signal processing, 2018, pp. 1-5.
Facial expression recognition using difference convolution [152] X. Peng, Z. Xia, L. Li, and X. Feng, ‘‘Towards facial expression
neural network, J. Parallel Distrib. Comput. 131 (2019) 97–102. recognition in the wild: A new database and deep recognition
[133] N. Sun, Q. Li, R. Huan, J. Liu, G. Han, Deep spatial-temporal system,” in Proceedings of the IEEE conference on computer vision
feature fusion for facial expression recognition in static images, and pattern recognition workshops, 2016, pp. 93-99.
Pattern Recogn. Lett. 119 (2019) 49–61. [153] K. Talele, A. Shirsat, T. Uplenchwar, and K. Tuckley, ‘‘Facial
[134] M.S. Hossain, G. Muhammad, Emotion recognition using expression recognition using general regression neural
secure edge and cloud computing, Inf. Sci. 504 (2019) 589–601. network,” in 2016 IEEE Bombay Section Symposium (IBSS),
[135] M.S. Hossain, G. Muhammad, Emotion recognition using 2016: IEEE, pp. 1-6.
deep learning approach from audio–visual emotional big data, [154] L. Chao, J. Tao, M. Yang, Y. Li, and Z. Wen, ‘‘Long short
Information Fusion 49 (2019) 69–78. term memory recurrent neural network based encoding method
[136] W. Hua, F. Dai, L. Huang, J. Xiong, G. Gui, HERO: Human for emotion recognition in video,” in 2016 IEEE International
emotions recognition for realizing intelligent Internet of Conference on Acoustics, Speech and Signal Processing
Things, IEEE Access 7 (2019) 24321–24332. (ICASSP), 2016: IEEE, pp. 2752-2756.
[137] X. Zhenghao, Y. Niu, J. Chen, X. Kan, H. Liu, Facial [155] I. Lee, H. Jung, C. H. Ahn, J. Seo, J. Kim, and O. Kwon,
Expression Recognition of Industrial Internet of Things by ‘‘Real-time personalized facial expression recognition system
Parallel Neural Networks Combining Texture Features, IEEE based on deep learning,” in 2016 IEEE International
Trans. Ind. Inf. (2020). Conference on Consumer Electronics (ICCE), 2016: IEEE,
[138] A. Agrawal, N. Mittal, Using CNN for facial expression pp. 267-268.
recognition: a study of the effects of kernel size and number of [156] Y. Guo, D. Tao, J. Yu, H. Xiong, Y. Li, and D. Tao, ‘‘Deep
filters on accuracy, Vis. Comput. 36 (2) (2020) 405–412. neural networks with relativity learning for facial expression
[139] X. Sun, S. Zheng, H. Fu, ROI-attention vectorized CNN recognition,” in 2016 IEEE International Conference on
model for static facial expression recognition, IEEE Access 8 Multimedia & Expo Workshops (ICMEW), 2016: IEEE, pp.
(2020) 7183–7194. 1-6.
[140] S. Rajan, P. Chenniappan, S. Devaraj, N. Madian, Novel deep [157] A. Jaiswal, A. K. Raju, and S. Deb, ‘‘Facial emotion detection
learning model for facial expression recognition based on using deep learning,” in 2020 International Conference for
maximum boosted CNN and LSTM, IET Image Proc. 14 (7) Emerging Technology (INCET), 2020: IEEE, pp. 1-5.
(2020) 1373–1381. [158] A. Durmusßoğlu and Y. Kahraman, ‘‘Facial expression
[141] C. Li, A. Pourtaherian, L. van Onzenoort, W. T. a Ten, and P. recognition using geometric features,” in 2016 International
de With, ‘‘Infant facial expression analysis: towards a real-time Conference on Systems, Signals and Image Processing
video monitoring system using r-cnn and hmm,” IEEE Journal (IWSSIP), 2016: IEEE, pp. 1-5.
of Biomedical and Health Informatics, vol. 25, no. 5, pp. 1429- [159] B.-K. Kim, J. Roh, S.-Y. Dong, S.-Y. Lee, Hierarchical
1440, 2020. committee of deep convolutional neural networks for robust
[142] Q. Xu and N. Zhao, ‘‘A facial expression recognition algorithm facial expression recognition, J. Multimodal User Interfaces 10
based on CNN and LBP feature,” in 2020 IEEE 4th (2) (2016) 173–189.
[160] D. Sokolov and M. Patkin, ‘‘Real-time emotion recognition on [178] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez,
mobile devices,” in 2018 13th IEEE International Conference on ‘‘Emotionet: An accurate, real-time algorithm for the
Automatic Face & Gesture Recognition (FG 2018), 2018: IEEE, automatic annotation of a million facial expressions in the
pp. 787-787. wild,” in Proceedings of the IEEE conference on computer vision
[161] T. Kosch, M. Hassib, R. Reutter, and F. Alt, ‘‘Emotions on the and pattern recognition, 2016, pp. 5562-5570.
Go: Mobile Emotion Assessment in Real-Time using Facial [179] W.-J. Yan et al, CASME II: An improved spontaneous micro-
Expressions,” in Proceedings of the International Conference on expression database and the baseline evaluation, PLoS One 9
Advanced Visual Interfaces, 2020, pp. 1-9. (1) (2014) e86041.
[162] H. Alshamsi, V. Kepuska, H. Meng, Real time automated [180] A. Mollahosseini, B. Hasani, M.H. Mahoor, Affectnet: A
facial expression recognition app development on smart database for facial expression, valence, and arousal computing
phones, IEEE, 2017, pp. 384–392. in the wild, IEEE Trans. Affect. Comput. 10 (1) (2017) 18–31.
[163] M. Suk, B. Prabhakaran, Real-time facial expression [181] A. Dhall, J. Joshi, I. Radwan, R. Goecke, Finding happiest
recognition on smartphones, IEEE, 2015, pp. 1054–1059. moments in a social context, in: Asian Conference on
[164] E. Goeleven, R. De Raedt, L. Leyman, B. Verschuere, The Computer Vision, Springer, 2012, pp. 613–626.
Karolinska directed emotional faces: a validation study, Cogn. [182] W. Chen, X. Xie, X. Jia, L. Shen, Texture Deformation Based
Emot. 22 (6) (2008) 1094–1118. Generative Adversarial Networks for Multi-domain Face
[165] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and Editing, in: Pacific Rim International Conference on Artificial
I. Matthews, ‘‘The extended cohn-kanade dataset (ck+): A Intelligence, Springer, 2019, pp. 257–269.
complete dataset for action unit and emotion-specified [183] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer,
expression,” in 2010 ieee computer society conference on ‘‘The first facial expression recognition and analysis challenge,”
computer vision and pattern recognition-workshops, 2010: in 2011 IEEE International Conference on Automatic Face &
IEEE, pp. 94-101. Gesture Recognition (FG), 2011: IEEE, pp. 921-926.
[166] S. Cheng, I. Kotsia, M. Pantic, and S. Zafeiriou, ‘‘4dfab: A [184] J. M. Susskind, A. K. Anderson, and G. E. Hinton, ‘‘The
large scale 4d database for facial expression analysis and toronto face database. Department of Computer Science,
biometric applications,” in Proceedings of the IEEE conference University of Toronto, Toronto, ON,” Canada, Tech. Rep, 3,
on computer vision and pattern recognition, 2018, pp. 5117-5126. 2010.
[167] M. Valstar and M. Pantic, ‘‘Induced disgust, happiness and [185] Z. Zhang et al., ‘‘Multimodal spontaneous emotion corpus for
surprise: an addition to the mmi facial expression database,” in human behavior analysis,” in Proceedings of the IEEE
Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): conference on computer vision and pattern recognition, 2016,
Corpora for Research on Emotion and Affect, 2010: Paris, pp. 3438-3446.
France., p. 65. [186] I. O. Ertugrul, J. F. Cohn, L. A. Jeni, Z. Zhang, L. Yin, and Q.
[168] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, ‘‘Coding Ji, ‘‘Cross-domain au detection: Domains, learning
facial expressions with gabor wavelets,” in Proceedings Third approaches, and measures,” in 2019 14th IEEE International
IEEE international conference on automatic face and gesture Conference on Automatic Face & Gesture Recognition (FG
recognition, 1998: IEEE, pp. 200-205. 2019), 2019: IEEE, pp. 1-8.
[169] Z. Zhang, P. Luo, C.C. Loy, X. Tang, From facial expression [187] D. Lundqvist, A. Flykt, and A. Öhman, ‘‘The Karolinska
recognition to interpersonal relation prediction, Int. J. directed emotional faces (KDEF),” CD ROM from Department
Comput. Vis. 126 (5) (2018) 550–569. of Clinical Neuroscience, Psychology section, Karolinska
[170] I.J. Goodfellow et al, Challenges in representation learning: A Institutet, vol. 91, no. 630, pp. 2-2, 1998.
report on three machine learning contests, in: International [188] S.M. Pizer et al, Adaptive histogram equalization and its
conference on neural information processing, Springer, 2013, variations, Comput. Vision, Graphics, Image Processing 39 (3)
pp. 117–124. (1987) 355–368.
[171] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. [189] S. Shan, W. Gao, B. Cao, and D. Zhao, ‘‘Illumination
Gedeon, ‘‘From individual to group-level emotion recognition: normalization for robust face recognition against varying
Emotiw 5.0,” in Proceedings of the 19th ACM international lighting conditions,” in 2003 IEEE International SOI
conference on multimodal interaction, 2017, pp. 524-528. Conference. Proceedings (Cat. No. 03CH37443), 2003: IEEE,
[172] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. pp. 157-164.
Gedeon, ‘‘Video and image based emotion recognition [190] M. Savvides, B.V. Kumar, Illumination normalization using
challenges in the wild: Emotiw 2015,” in Proceedings of the logarithm transforms for face authentication, in: International
2015 ACM on international conference on multimodal Conference on Audio-and Video-Based Biometric Person
interaction, 2015, pp. 423-426. Authentication, Springer, 2003, pp. 549–556.
[173] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi- [191] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman,
pie, Image Vis. Comput. 28 (5) (2010) 807–813. ‘‘From few to many: Generative models for recognition under
[174] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, ‘‘A 3D variable pose and illumination,” in Proceedings fourth ieee
facial expression database for facial behavior research,” in 7th international conference on automatic face and gesture
international conference on automatic face and gesture recognition (cat. no. pr00580), 2000: IEEE, pp. 277-284.
recognition (FGR06), 2006: IEEE, pp. 211-216. [192] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few
[175] X. Zhang et al., ‘‘A high-resolution spontaneous 3d dynamic to many: Illumination cone models for face recognition under
facial expression database,” in 2013 10th IEEE international variable lighting and pose, IEEE Trans. Pattern Anal. Mach.
conference and workshops on automatic face and gesture Intell. 23 (6) (2001) 643–660.
recognition (FG), 2013: IEEE, pp. 1-6. [193] J. Liu, Y. Feng, H. Wang, Facial expression recognition using
[176] G. Zhao, X. Huang, M. Taini, S.Z. Li, M. PietikäInen, Facial pose-guided face alignment and discriminative features based
expression recognition from near-infrared videos, Image Vis. on deep learning, IEEE Access 9 (2021) 69267–69277.
Comput. 29 (9) (2011) 607–619. [194] W. Wu, Y. Yin, Y. Wang, X. Wang, and D. Xu, ‘‘Facial
[177] S. Li, W. Deng, Reliable crowdsourcing and deep locality- expression recognition for different pose faces based on special
preserving learning for unconstrained facial expression landmark detection,” in 2018 24th International Conference on
recognition, IEEE Trans. Image Process. 28 (1) (2018) 356–370. Pattern Recognition (ICPR), 2018: IEEE, pp. 1524-1529.
[195] X. Zhu, Z. He, L. Zhao, Z. Dai, Q. Yang, A Cascade Attention [201] M. Sajjad, I. Mehmood, S.W. Baik, Image super-resolution
Based Facial Expression Recognition Network by Fusing using sparse coding over redundant dictionary based on
Multi-Scale Spatio-Temporal Features, Sensors 22 (4) (2022) effective image representations, J. Vis. Commun. Image
1350. Represent. 26 (2015) 50–65.
[196] N.C. Ebner, M.K. Johnson, H. Fischer, Neural mechanisms of [202] Z. Liu, L. Li, Y. Wu, C. Zhang, Facial expression restoration
reading facial emotions in young and older adults, Front. based on improved graph convolutional networks, in:
Psychol. 3 (2012) 223. International Conference on Multimedia Modeling, Springer,
[197] N.C. Ebner, M.R. Johnson, A. Rieckmann, K.A. Durbin, M. 2020, pp. 527–539.
K. Johnson, H. Fischer, Processing own-age vs. other-age [203] J. Yang, T. Qian, F. Zhang, S.U. Khan, Real-time facial
faces: neuro-behavioral correlates and effects of emotion, expression recognition based on edge computing, IEEE Access
Neuroimage 78 (2013) 363–371. 9 (2021) 76178–76190.
[198] M. Sajjad, A. Shah, Z. Jan, S.I. Shah, S.W. Baik, I. Mehmood, [204] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y
Facial appearance and texture feature-based robust facial Arcas, ‘‘Communication-efficient learning of deep networks
expression recognition framework for sentiment knowledge from decentralized data,” in Artificial intelligence and statistics,
discovery, Clust. Comput. 21 (1) (2018) 549–567. 2017: PMLR, pp. 1273-1282.
[199] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, ‘‘Second- [205] I. Feki, S. Ammar, Y. Kessentini, K. Muhammad, Federated
order attention network for single image super-resolution,” in learning for COVID-19 screening from Chest X-ray images,
Proceedings of the IEEE/CVF conference on computer vision Appl. Soft Comput. 106 (2021) 107330.
and pattern recognition, 2019, pp. 11065-11074. [206] V. Narula and T. Chaspari, ‘‘An adversarial learning
[200] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, ‘‘Enhanced framework for preserving users’ anonymity in face-based
deep residual networks for single image super-resolution,” in emotion recognition,” arXiv preprint arXiv:2001.06103, 2020.
Proceedings of the IEEE conference on computer vision and
pattern recognition workshops, 2017, pp. 136-144.

1 s2.0 S1110016823000327 Main

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

1 s2.0 S1110016823000327 Main

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S1110016823000327 Main

Uploaded by

Copyright:

Available Formats

Alexandria Engineering Journal (2023) 68, 817–840

Alexandria Engineering Journal

A Comprehensive Survey on Deep Facial

Received 12 July 2022; revised 31 December 2022; accepted 9 January 2023

1. Introduction tions of FER. Similarly, individuals suffering from a stroke

Table 1 Abbreviations used throughout the survey.

Fig. 2 Work ﬂow of this survey.

Fig. 3 Generic pipeline of FER (in the case of conventional learning).

Once an FER model is competent in distinguishing different TPR þ TNR

TN CK+ [165]: CK+ is a widely used laboratory-controlled

Table 6 Overview of the FER dataset with some statistical details.

6. Conclusion [1] Y. Nan, J. Ju, Q. Hua, H. Zhang, B. Wang, A-MobileNet: An

You might also like