An Ensemble Approach for Facial Expression Analysis in Video

Hong-Hai Nguyen1, Van-Thong Huynh1, Soo-Hyung Kim*


Department of Artificial Intelligence Convergence
Chonnam National University
Gwangju, South Korea
honghaik14@gmail.com, {vthuynh,shkim}@jnu.ac.kr
arXiv:2203.12891v1 [cs.CV] 24 Mar 2022

* Corresponding author
1 Equal contribution

Abstract

Human emotion recognition contributes to the development of human-computer interaction. Machines that understand human emotions in the real world will contribute significantly to life in the future. This paper addresses the Affective Behavior Analysis in-the-wild (ABAW3) 2022 challenge, focusing on valence-arousal estimation and action unit detection. For valence-arousal estimation, we conduct two stages: creating new features with a multi-model architecture, and temporal learning to predict valence and arousal. First, we create new features: a Gated Recurrent Unit (GRU) and a Transformer are combined on top of a Regular Networks (RegNet) feature extracted from the images. In the next step, a GRU combined with Local Attention predicts valence and arousal. The Concordance Correlation Coefficient (CCC) is used to evaluate the model.

1. Introduction

People's emotions affect their lives and work. Researchers are trying to create machines that can detect and analyze human emotions, which contributes to the development of intelligent machines and enables many applications, such as medicine, health care, and driver fatigue tracking [6].

Practical applications face many challenges from uncontrolled environments. However, data sources from social networks and applications are increasingly varied, and deep learning networks have improved the analysis and recognition process. The ABAW3 2022 challenge [6] was therefore organized for affective behavior analysis in the wild. The challenge includes four tasks: Valence-Arousal (VA) Estimation, Expression (Expr) Classification, Action Unit (AU) Detection, and Multi-Task Learning (MTL). This paper focuses only on the VA task, in which participants predict the valence-arousal dimensions from video data.

In this study, we utilize features extracted with a deep learning model. Image features are extracted with a RegNet network [17]. A multi-model combination of Gated Recurrent Units (GRUs) [1] and a Transformer [5] is applied to the RegNet features to create new features in stage 1. In stage 2, we use these new features: a GRU captures temporal information, and Local Attention is applied to improve the model.

In this work, we focus on valence-arousal prediction. Our contributions are summarized as follows:

• Utilizing features from a deep learning model.

• Using a multi-model combination to create new features, which increases training speed.

• Combining Local Attention with a GRU for sentiment analysis.

• Conducting experiments with different models to compare against the baseline method.

The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 presents the methodology, Section 4 reports the experimental results, and Section 5 concludes.

2. Related work

In this section, we briefly summarize datasets and works related to affective behavior analysis from the previous challenges.

2.1. Affect Annotation Dataset

For the previous challenges [7–13, 20], the organizers provided Aff-Wild2, a large-scale dataset for affective behavior analysis in-the-wild, which extends Aff-Wild [20].

The dataset contains annotations for three challenges: Valence-Arousal regression, basic expressions, and Action Units. Aff-Wild2 expands the number of videos to 567 videos annotated for valence-arousal, 548 videos annotated with 8 expression categories, and 547 videos annotated with 12 AUs; 172,360 images carry annotations for valence-arousal, the 6 basic expressions plus the neutral state plus the 'other' category, and 12 action units.

2.2. Affective Behavior Analysis in the wild

The affective behavior analysis in the wild challenge has attracted many researchers. Deng et al. [3] applied deep ensemble models learned with a multi-generational self-distillation algorithm to improve emotion uncertainty estimation; their architecture uses feature extractors from an efficient CNN model with GRUs as temporal models. Wei Zhang et al. [21] introduced a multi-task streaming network that exploits the hierarchical relationships between different emotion representations in the second ABAW challenge. Vu et al. [19] used a multi-task deep learning model for valence-arousal estimation and facial expression prediction; they applied a knowledge distillation architecture to train a teacher and a student network, because the dataset does not include labels for both tasks. Kuhnke et al. [14] introduced a two-stream network for multi-task training that uses multimodal information extracted from audio and vision. Deng et al. [2] addressed two difficulties of the competition: the dataset is highly imbalanced, and it does not include labels for all three tasks; they applied balancing techniques and proposed a Teacher-Student structure to learn from the imbalanced labels.

3. Methodology

In this section, we introduce the proposed method for continuous emotion estimation. Our approach contains two stages: creating new features to increase training speed (Figure 1), and using a GRU to learn temporal information (Figure 2). In addition, Local Attention is applied to improve the model.

3.1. Visual feature extraction

Our visual feature is based on the RegNet [17] architecture, a lightweight and efficient network. RegNet consists of four stages that operate at progressively reduced resolution, each built from a sequence of identical blocks. Weights pretrained on ImageNet [4] are used for initialization, and the last three stages are unfrozen to learn new representations from facial data.
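As a concrete illustration, the snippet below is a minimal PyTorch sketch of such a backbone: an ImageNet-pretrained RegNet with the early layers frozen and the later stages left trainable, used as a per-frame feature extractor. The paper does not state the RegNet variant, input resolution, or the exact freezing boundary, so torchvision's regnet_y_800mf and the choices below are assumptions.

```python
# Hypothetical per-frame feature extractor in the spirit of Sec. 3.1: an ImageNet-pretrained
# RegNet whose stem and first stage are frozen while the last three stages stay trainable.
# The variant (regnet_y_800mf) and input size are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from torchvision import models


def build_regnet_backbone() -> nn.Module:
    backbone = models.regnet_y_800mf(weights=models.RegNet_Y_800MF_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Identity()  # keep the pooled features, drop the ImageNet classifier

    # Freeze the stem and the first stage; leave the last three stages trainable.
    for module in (backbone.stem, backbone.trunk_output.block1):
        for p in module.parameters():
            p.requires_grad = False
    return backbone


if __name__ == "__main__":
    net = build_regnet_backbone()
    frames = torch.randn(8, 3, 112, 112)  # a batch of cropped face frames
    feats = net(frames)                   # (8, C) pooled visual features per frame
    print(feats.shape)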
Figure 1. Feature extraction architecture: multi-model fusion with a combined loss (frame sequence → RegNet backbone → GRU block and Transformer block → concatenation).

Figure 2. Overview of the prediction model: a Gated Recurrent Unit combined with Local Attention (GRU → two Local-Attention layers → average → valence/arousal), where s denotes the sequence.

3.2. Valence and arousal estimation

Our temporal learning for the unimodal setting combines a Gated Recurrent Unit (GRU) block [1], a standard recurrent network, with a Transformer block [18], an attention-based model for sequential learning, as shown in Figure 1. The representations from the GRU and the Transformer are concatenated to form a new feature vector and fed to a fully connected (FC) layer that produces valence and arousal scores. FC layers are also attached to the GRU and Transformer blocks to obtain emotion scores, whose losses are combined into the final loss used to optimize the whole system.
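A minimal sketch of this stage-1 fusion is given below: GRU and Transformer branches over the per-frame RegNet features, concatenated into a new feature vector, with an FC head on the fused feature and auxiliary heads on each branch whose losses are summed into the final loss. The feature dimension, Transformer depth, and the use of MSE as the per-head loss are assumptions, not taken from the paper.

```python
# Sketch of the GRU + Transformer fusion with auxiliary heads and a combined loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GRUTransformerFusion(nn.Module):
    def __init__(self, feat_dim: int = 784, hidden: int = 256, nhead: int = 4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head_gru = nn.Linear(hidden, 2)               # auxiliary valence/arousal head
        self.head_trf = nn.Linear(feat_dim, 2)             # auxiliary valence/arousal head
        self.head_fused = nn.Linear(hidden + feat_dim, 2)  # head on the fused feature

    def forward(self, x):                                  # x: (B, T, feat_dim)
        g, _ = self.gru(x)                                 # (B, T, hidden)
        t = self.transformer(x)                            # (B, T, feat_dim)
        fused = torch.cat([g, t], dim=-1)                  # the new per-frame feature
        return self.head_fused(fused), self.head_gru(g), self.head_trf(t), fused


def combined_loss(outputs, target):
    # Final loss = fused-head loss + the two auxiliary branch losses (MSE as a placeholder).
    y_fused, y_gru, y_trf, _ = outputs
    return (F.mse_loss(y_fused, target)
            + F.mse_loss(y_gru, target)
            + F.mse_loss(y_trf, target))
```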
We conduct K-fold cross-validation with K = 5 to obtain different models for ensemble learning. The scores from the folds are combined into a single vector for each frame of a video, F_j, which can be formulated as

F_j = \{ V_{j,i}^{1}, V_{j,i}^{2}, V_{j,i}^{3}, V_{j,i}^{4}, V_{j,i}^{5}, A_{j,i}^{1}, A_{j,i}^{2}, A_{j,i}^{3}, A_{j,i}^{4}, A_{j,i}^{5} \},   (1)

where V_{j,i}^{k} and A_{j,i}^{k} are the valence and arousal scores of fold k for the i-th frame of the j-th video, with k = 1, …, K, i = 1, …, N_j, and N_j the length of the j-th video. To ensemble the results of the K-fold models, we deploy a GRU-based architecture to model the temporal relationship, followed by two Local Attention layers that adjust the contribution of the features, as in Figure 2.
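To make the two-stage ensemble concrete, the sketch below assembles the 10-dimensional per-frame vector of Eq. (1) from the five fold models and feeds it to a GRU followed by two local attention layers. "Local Attention" is not defined precisely in the text; the windowed self-attention used here (attention restricted to a neighborhood around each frame) is only one plausible reading, and the window size is an assumption.

```python
# Hypothetical stage-2 ensemble: per-frame fold scores -> GRU -> two local attention layers.
import torch
import torch.nn as nn


def build_fold_features(valence, arousal):
    # valence, arousal: tensors of shape (K=5, N_j) with per-fold scores for one video.
    # Returns F_j of shape (N_j, 2K): [V^1..V^5, A^1..A^5] per frame, as in Eq. (1).
    return torch.cat([valence, arousal], dim=0).transpose(0, 1)


class LocalAttention(nn.Module):
    def __init__(self, dim: int, window: int = 8, nhead: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, x):                              # x: (B, T, dim)
        T = x.size(1)
        idx = torch.arange(T)
        # Mask out frames farther than `window` steps away, so attention stays local.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(x, x, x, attn_mask=mask.to(x.device))
        return out


class Stage2Model(nn.Module):
    def __init__(self, in_dim: int = 10, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=4, batch_first=True)
        self.attn = nn.Sequential(LocalAttention(hidden), LocalAttention(hidden))
        self.out = nn.Linear(hidden, 2)                # valence and arousal per frame

    def forward(self, x):                              # x: (B, T, 10) fold features
        h, _ = self.gru(x)
        return self.out(self.attn(h))
```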

3.3. Action unit detection

In this task, two feature branches are deployed with Transformer [18] blocks, T1 and T2 (Figure 3). In T2, the source features are expanded to a higher dimension and then compressed back to the original dimension, with the aim of improving the robustness of the model. The new representation from this block is also fed to an FC layer whose output is fused with the results from T1 and T2.
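One possible reading of this two-branch design is sketched below: T1 runs a Transformer directly on the RegNet feature sequence, T2 first expands and then compresses the features before its Transformer, and an extra FC head on the expanded-and-compressed representation is fused (here by averaging) with the heads of T1 and T2. Dimensions, depths, and the fusion rule are assumptions.

```python
# Hypothetical two-branch action unit detector in the spirit of Figure 3.
import torch
import torch.nn as nn


def transformer_block(dim: int, nhead: int = 4, layers: int = 2) -> nn.Module:
    enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(enc_layer, num_layers=layers)


class AUDetector(nn.Module):
    def __init__(self, feat_dim: int = 784, expand_dim: int = 1024, num_aus: int = 12):
        super().__init__()
        self.t1 = transformer_block(feat_dim)            # branch T1
        self.expand = nn.Linear(feat_dim, expand_dim)    # feature expansion ...
        self.compress = nn.Linear(expand_dim, feat_dim)  # ... and compression
        self.t2 = transformer_block(feat_dim)            # branch T2
        self.fc_t1 = nn.Linear(feat_dim, num_aus)
        self.fc_t2 = nn.Linear(feat_dim, num_aus)
        self.fc_mid = nn.Linear(feat_dim, num_aus)       # head on the expanded/compressed feature

    def forward(self, x):                                # x: (B, T, feat_dim)
        z_mid = self.compress(self.expand(x))
        z1 = self.t1(x)
        z2 = self.t2(z_mid)
        # Fuse the three outputs into per-frame action unit logits.
        return (self.fc_t1(z1) + self.fc_t2(z2) + self.fc_mid(z_mid)) / 3
```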

Figure 3. An overview of our action unit detection system (frame sequence → RegNet backbone → feature sequence; two Transformer branches with fully connected layers, feature expansion and compression, and feature fusion produce the action unit scores).

4. Experiments and results

4.1. Dataset

The Valence-Arousal Estimation task includes 567 videos containing valence and arousal annotations; 455 subjects (277 male and 178 female) were annotated by four experts. Valence and arousal values range from -1 to 1.

The Action Unit Detection task includes 548 videos annotated with the six basic expressions, plus the neutral state, plus an 'other' category that denotes expressions/affective states other than the six basic ones. Approximately 2.6 million frames, with 431 participants (265 male and 166 female), have been annotated by seven experts.

4.2. Valence and arousal estimation

The GRU network was implemented with the PyTorch deep learning toolkit. The GRU uses 256-dimensional hidden states, with two layers in the multi-model stage and four layers in the prediction stage. The networks are trained for 25 epochs with an initial learning rate of 0.001 and the Adam optimizer.
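For reference, a bare-bones training loop with these settings might look as follows. Here `model` is assumed to return per-frame valence/arousal predictions of shape (B, T, 2), `train_loader` is a sequence dataloader defined elsewhere, and the MSE criterion is a placeholder since the paper does not state the training loss.

```python
# Minimal VA training loop sketch: Adam, initial learning rate 0.001, 25 epochs.
import torch
import torch.nn.functional as F


def train_va(model, train_loader, device="cuda", epochs=25, lr=1e-3):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for feats, targets in train_loader:        # feats: (B, T, D), targets: (B, T, 2)
            feats, targets = feats.to(device), targets.to(device)
            optimizer.zero_grad()
            preds = model(feats)
            loss = F.mse_loss(preds, targets)      # placeholder regression loss
            loss.backward()
            optimizer.step()
```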
Net feature. All methods give better results than baseline by
between valence and arousal Concordance Correlation Co-
approximate twice time. The local attention made a small
efficient (CCC), P, is used to evaluate the performance of
improvement of the ensemble model, which proving the po-
the model as
tential of applying attention to the system. Furthermore, our
PV + PA method is higher than previous works [3, 21], respectively
PVA = , (2)
2 0.494 and 0.495.
where PV and PA are the CCC of valence and arousal, re- 4.3. Action Unit Detection
spectively, which is defined as
Our network architectures is trained by using SGD with
2ρσŶ σY learning rate of 0.9 combine and Cosine annealing warm
P= 2 2 (3) restarts scheduler [16]. We optimized the network in 20
σŶ + σŶ + (µŶ − µY )2
epochs with Focal loss function [15] and evaluate with F1
where µY was the mean of the label Y , µŶ was the mean score.
of the prediction Ŷ , σŶ and σY were the corresponding
5. Conclusions
standard deviations, ρ was the Pearson correlation coeffi-
cient between Ŷ and Y . This paper utilized features from deep learning represen-
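For completeness, Eqs. (2)-(3) translate directly into a few lines of NumPy, using the identity 2ρσ_Ŷσ_Y = 2·cov(Ŷ, Y):

```python
# Direct implementation of the evaluation metric in Eqs. (2)-(3).
import numpy as np


def ccc(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Concordance Correlation Coefficient between predictions and labels (Eq. 3)."""
    mu_p, mu_t = y_pred.mean(), y_true.mean()
    var_p, var_t = y_pred.var(), y_true.var()
    cov = np.mean((y_pred - mu_p) * (y_true - mu_t))
    return 2 * cov / (var_p + var_t + (mu_p - mu_t) ** 2)


def ccc_va(valence_pred, valence_true, arousal_pred, arousal_true) -> float:
    """Mean of the valence and arousal CCC (Eq. 2), the challenge evaluation metric."""
    return 0.5 * (ccc(valence_pred, valence_true) + ccc(arousal_pred, arousal_true))
```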
Table 1 shows the results of our K-fold cross-validation experiments, trained on the Aff-Wild2 training set and evaluated on the Aff-Wild2 validation set. Every fold beats the baseline on the combined valence-arousal score, and the best result, 0.465, is obtained by averaging the 5 folds.

Fold          Valence  Arousal  Combined
1             0.290    0.491    0.391
2             0.348    0.435    0.391
3             0.339    0.500    0.419
4             0.294    0.491    0.392
5             0.362    0.492    0.427
Average       0.390    0.540    0.465
Baseline [6]  0.31     0.17     0.24

Table 1. Predicted results for valence-arousal on the validation set; Combined = 0.5 * Valence + 0.5 * Arousal. We used the RegNet feature and the multi-model (GRU combined with Transformer) for training.

Table 2 presents the results of our experiments. We evaluated a GRU alone and a GRU combined with Local Attention on the k-fold features, as well as a GRU combined with a Transformer on the RegNet features. All methods achieve roughly twice the baseline performance. Local Attention gives a small improvement over the plain ensemble model, which shows the potential of applying attention to the system. Furthermore, our method outperforms the previous works [3, 21], which reached 0.494 and 0.495, respectively.

Method              Feature                     P_V    P_A    P_VA
Baseline [6]        ResNet                      0.31   0.17   0.24
Deng et al. [3]     MobilefaceNet + MarbleNet   0.442  0.546  0.494
Zhang et al. [21]   Expression Embedding [22]   0.488  0.502  0.495
GRU + Transformer   RegNet                      0.391  0.565  0.478
GRU                 k-fold                      0.432  0.575  0.504
GRU + Attention     k-fold                      0.437  0.576  0.507

Table 2. Comparison with previous works on valence-arousal estimation on the Aff-Wild2 validation set.

4.3. Action Unit Detection

Our network is trained with SGD at a learning rate of 0.9, combined with a cosine annealing warm restarts scheduler [16]. We optimize the network for 20 epochs with the focal loss [15] and evaluate with the F1 score. Table 3 compares our results with the baseline on the Aff-Wild2 validation set.
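A minimal sketch of this training setup is shown below, assuming PyTorch's SGD and CosineAnnealingWarmRestarts together with torchvision's sigmoid focal loss; the restart period T_0 and the focal-loss defaults are assumptions, and the learning rate of 0.9 is taken verbatim from the text.

```python
# Sketch of the AU training setup: SGD + cosine annealing warm restarts [16] + focal loss [15].
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
from torchvision.ops import sigmoid_focal_loss


def train_au(model, train_loader, device="cuda", epochs=20):
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.9)   # learning rate as stated in the text
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=5)  # restart period assumed

    for epoch in range(epochs):
        for feats, au_labels in train_loader:                  # au_labels: (B, T, 12) in {0, 1}
            feats, au_labels = feats.to(device), au_labels.to(device)
            optimizer.zero_grad()
            logits = model(feats)
            loss = sigmoid_focal_loss(logits, au_labels.float(), reduction="mean")
            loss.backward()
            optimizer.step()
        scheduler.step()
```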

Method                    Feature  F1
Baseline [6]              VGG16    0.39
Our method                RegNet   0.533
Our method with only T1   RegNet   0.544

Table 3. Comparison with prior methods on the Aff-Wild2 validation set for action unit detection.

5. Conclusions

This paper applied features from deep learning representations to the Valence-Arousal Estimation sub-challenge of ABAW3 2022. To extract information over time, a GRU is used for sentiment analysis, and to enhance its advantages we connect a Local Attention mechanism to our model. The CCC metric was used to evaluate the arousal/valence predictions. Experimental results show that our proposed model outperforms the baseline method, and that results improve further when GRUs are combined with Local Attention. Furthermore, we introduced a new feature, extracted from the multi-model by combining folds, and showed that the new features improve not only speed but also accuracy.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2020R1A4A1019191) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1I1A3A04036408).

References

[1] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[2] Didan Deng, Zhaokang Chen, and Bertram E Shi. Multitask emotion recognition with incomplete labels. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 592–599. IEEE, 2020.
[3] Didan Deng, Liang Wu, and Bertram E Shi. Iterative distillation for better uncertainty estimates in multitask emotion recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3557–3566, 2021.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[6] Dimitrios Kollias. ABAW: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. arXiv preprint arXiv:2202.10659, 2022.
[7] D Kollias, A Schulc, E Hajiyev, and S Zafeiriou. Analysing affective behavior in the first ABAW 2020 competition. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 794–800.
[8] Dimitrios Kollias, Viktoriia Sharmanska, and Stefanos Zafeiriou. Face behavior a la carte: Expressions, affect and action units in a single network. arXiv preprint arXiv:1910.11111, 2019.
[9] Dimitrios Kollias, Viktoriia Sharmanska, and Stefanos Zafeiriou. Distribution matching for heterogeneous multi-task learning: A large-scale face study. arXiv preprint arXiv:2105.03790, 2021.
[10] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou. Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, pages 1–23, 2019.
[11] Dimitrios Kollias and Stefanos Zafeiriou. Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. arXiv preprint arXiv:1910.04855, 2019.
[12] Dimitrios Kollias and Stefanos Zafeiriou. Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792, 2021.
[13] Dimitrios Kollias and Stefanos Zafeiriou. Analysing affective behavior in the second ABAW2 competition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3652–3660, 2021.
[14] Felix Kuhnke, Lars Rumberg, and Jörn Ostermann. Two-stream aural-visual affect analysis in the wild. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 600–605. IEEE, 2020.
[15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[16] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[17] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[19] Manh Tu Vu, Marie Beurton-Aimar, and Serge Marchand. Multitask multi-database emotion recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3637–3644, 2021.
[20] Stefanos Zafeiriou, Dimitrios Kollias, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, and Irene Kotsia. Aff-Wild: Valence and arousal 'in-the-wild' challenge. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1980–1987. IEEE, 2017.
[21] Wei Zhang, Zunhu Guo, Keyu Chen, Lincheng Li, Zhimeng Zhang, and Yu Ding. Prior aided streaming network for multi-task affective recognition at the 2nd ABAW2 competition. arXiv preprint arXiv:2107.03708, 2021.
[22] Wei Zhang, Xianpeng Ji, Keyu Chen, Yu Ding, and Changjie Fan. Learning a facial expression embedding disentangled from identity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6759–6768, 2021.
