An Ensemble Approach For Facial Expression Analysis in Video
which was extended from Aff-wild1 [20]. The dataset contains annotations for three challenges: Valence-Arousal regression, basic emotions, and Action Units. Aff-wild2 expands the number of videos to 567 videos annotated in terms of valence-arousal.
2.2. Affective Behavior Analysis in the wild
The affective behavior analysis in the wild challenge has attracted many researchers. Deng et al. [3] applied deep ensemble models learned by a multi-generational self-distillation algorithm to improve emotion uncertainty estimation. For the architecture, they used feature extractors from an efficient CNN model and applied a GRU as the temporal model. Wei Zhang et al. [21] introduced a multi-task streaming network that exploits the hierarchical relationships between different emotion representations in the second ABAW challenge.
In their paper, Vu et al. [19] used a multi-task deep learning model for valence-arousal estimation and facial expression prediction. The authors applied a knowledge distillation architecture to train two networks, a teacher and a student model, because the dataset does not include labels for both tasks. Kuhnke et al. [14] introduced a two-stream network for multi-task training; the model uses multimodal information extracted from audio and vision. The authors of [2] tackled two challenges of the competition: first, the dataset is highly imbalanced; second, the datasets do not include labels for all three tasks. They applied balancing techniques and proposed a Teacher-Student structure to learn from the imbalanced labels.

3. Methodology

In this section, we introduce the proposed method for continuous emotion estimation. Our approach contains two stages: creating new features to increase training speed (Figure 1) and using a GRU to learn temporal information, illustrated in Figure 2. In addition, Local Attention is applied to improve the model.

3.2. Valence and arousal estimation

Our temporal learning for the unimodal case consists of a combination of a Gated Recurrent Unit (GRU) block [1], a standard recurrent network, and a Transformer block [18], an attention-based model for sequential learning, as shown in Figure 1. The representations from the GRU and the Transformer are concatenated to form a new feature vector and fed to a fully connected (FC) layer to produce the valence and arousal scores. FC layers are also attached to the GRU and Transformer blocks to obtain emotion scores for additional loss terms, which are combined into the final loss to optimize the whole system.
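A minimal PyTorch sketch of this fusion design follows; the feature dimension (feat_dim), the number of attention heads, and the tanh output activation are illustrative assumptions rather than details from the paper, while the 256-dimensional hidden size matches the value reported in Section 4.2.

```python
import torch
import torch.nn as nn

class GRUTransformerFusion(nn.Module):
    """GRU branch + Transformer branch, concatenated and mapped to valence/arousal."""
    def __init__(self, feat_dim=512, hidden=256, n_heads=4, n_layers=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=n_layers, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # auxiliary heads attached to each branch (used for extra loss terms)
        self.gru_head = nn.Linear(hidden, 2)
        self.trans_head = nn.Linear(feat_dim, 2)
        # main head on the concatenated representation
        self.fusion_head = nn.Linear(hidden + feat_dim, 2)

    def forward(self, x):                        # x: (batch, seq_len, feat_dim)
        g, _ = self.gru(x)                       # (batch, seq_len, hidden)
        t = self.transformer(x)                  # (batch, seq_len, feat_dim)
        fused = torch.cat([g, t], dim=-1)        # (batch, seq_len, hidden + feat_dim)
        va = torch.tanh(self.fusion_head(fused))     # valence/arousal, squashed to [-1, 1]
        va_gru = torch.tanh(self.gru_head(g))        # auxiliary branch predictions
        va_trans = torch.tanh(self.trans_head(t))
        return va, va_gru, va_trans
```

During training, the three outputs would each contribute a loss term that is summed into the final objective, matching the description above.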
We conducted K-fold cross-validation with K = 5 to obtain different models for ensemble learning. The scores from each fold are combined to form a single vector for each frame in a video, F_{j,i}, which can be formulated as

F_{j,i} = \{ V^1_{j,i}, V^2_{j,i}, V^3_{j,i}, V^4_{j,i}, V^5_{j,i}, A^1_{j,i}, A^2_{j,i}, A^3_{j,i}, A^4_{j,i}, A^5_{j,i} \}, \quad (1)

where V^k_{j,i} and A^k_{j,i} are the valence and arousal scores from the k-th fold for the i-th frame of the j-th video, with k = 1, ..., K, i = 1, ..., N_j, and N_j the number of frames in the j-th video. To ensemble the results from the K-fold models, we deploy a GRU-based architecture to model the temporal relationship, followed by two local attention layers that adjust the contribution of the features, as in Figure 2.

Figure 2. Overview of the prediction model: a Gated Recurrent Unit combined with Local Attention, where s is the sequence.
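As a rough illustration of Eq. (1) and the ensemble stage, the sketch below stacks the per-fold valence/arousal predictions into a 10-dimensional vector per frame and runs a small GRU regressor over the resulting sequence; the hidden size is arbitrary and the two local attention layers are omitted, so this is not the authors' exact model.

```python
import torch
import torch.nn as nn

def build_fold_features(valence_preds, arousal_preds):
    """Stack per-fold scores into F of shape (num_frames, 2 * K), following Eq. (1).

    valence_preds, arousal_preds: lists of K tensors, each of shape (num_frames,),
    holding one fold's predictions for every frame of a video.
    """
    v = torch.stack(valence_preds, dim=-1)    # (num_frames, K)
    a = torch.stack(arousal_preds, dim=-1)    # (num_frames, K)
    return torch.cat([v, a], dim=-1)          # (num_frames, 2 * K) = 10 when K = 5

class EnsembleGRU(nn.Module):
    """GRU over the per-frame fold-score vectors; the local attention layers are omitted here."""
    def __init__(self, k_folds=5, hidden=64):
        super().__init__()
        self.gru = nn.GRU(2 * k_folds, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)       # final valence/arousal per frame

    def forward(self, f):                      # f: (batch, seq_len, 2 * K)
        h, _ = self.gru(f)
        return torch.tanh(self.head(h))
```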
4.2. Valence and arousal estimation

The GRU network was implemented with the PyTorch deep learning toolkit. The GRU was configured with 256-dimensional hidden states, two layers for the multimodal model and four layers for prediction. The networks were trained for 25 epochs with an initial learning rate of 0.001 and the Adam optimizer.
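A self-contained sketch of this training configuration; the hidden size, number of layers, epochs, learning rate, and optimizer follow the values above, while the feature dimension, the MSE loss, and the toy data loader are placeholders rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class GRURegressor(nn.Module):
    """Placeholder GRU regressor: 256-dimensional hidden states, two layers, valence/arousal head."""
    def __init__(self, feat_dim=512, hidden=256, layers=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):                      # x: (batch, seq_len, feat_dim)
        h, _ = self.gru(x)
        return torch.tanh(self.head(h))

model = GRURegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # initial learning rate 0.001
criterion = nn.MSELoss()                                     # placeholder loss; the paper's exact loss is not given here

# toy stand-in for a DataLoader of (frame-feature sequences, valence/arousal targets)
train_loader = [(torch.randn(4, 32, 512), torch.rand(4, 32, 2) * 2 - 1) for _ in range(8)]

for epoch in range(25):                                      # 25 epochs, as reported
    for feats, targets in train_loader:
        preds = model(feats)
        loss = criterion(preds, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```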
The mean of the valence and arousal Concordance Correlation Coefficients (CCC), P_VA, is used to evaluate the performance of the model:

P_{VA} = \frac{P_V + P_A}{2}, \quad (2)

where P_V and P_A are the CCC of valence and arousal, respectively, each defined as

P = \frac{2 \rho \sigma_{\hat{Y}} \sigma_Y}{\sigma_{\hat{Y}}^2 + \sigma_Y^2 + (\mu_{\hat{Y}} - \mu_Y)^2}, \quad (3)

where µ_Y is the mean of the label Y, µ_Ŷ is the mean of the prediction Ŷ, σ_Ŷ and σ_Y are the corresponding standard deviations, and ρ is the Pearson correlation coefficient between Ŷ and Y.
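For reference, a small NumPy helper implementing Eqs. (2) and (3); this is the standard CCC computation written for illustration rather than code from the paper.

```python
import numpy as np

def ccc(y_pred, y_true):
    """Concordance Correlation Coefficient between predictions and labels, Eq. (3)."""
    mu_p, mu_t = y_pred.mean(), y_true.mean()
    s_p, s_t = y_pred.std(), y_true.std()
    rho = np.corrcoef(y_pred, y_true)[0, 1]    # Pearson correlation coefficient
    return (2 * rho * s_p * s_t) / (s_p ** 2 + s_t ** 2 + (mu_p - mu_t) ** 2)

def mean_ccc(valence_pred, valence_true, arousal_pred, arousal_true):
    """P_VA: the average of the valence and arousal CCCs, Eq. (2)."""
    return 0.5 * (ccc(valence_pred, valence_true) + ccc(arousal_pred, arousal_true))
```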
Table 1 shows the results of our K-fold cross-validation experiments on the Aff-wild2 training set, each fold evaluated separately on the Aff-wild2 validation set. All folds give better results than the baseline on the combined valence and arousal score. In addition, the best result, 0.465, is obtained by averaging the 5 folds.

Table 2 presents the results of our experiments. We conducted experiments with the GRU alone and with the GRU combined with Local Attention on the k-fold features. Moreover, we also experimented with the GRU combined with the Transformer on the RegNet features. All methods give results roughly twice as good as the baseline. Local Attention yields a small improvement over the plain ensemble model, which shows the potential of applying attention to the system. Furthermore, our method outperforms previous works [3, 21], which obtained 0.494 and 0.495, respectively.
4.3. Action Unit Detection

Our network architecture is trained using SGD with a learning rate of 0.9, combined with a cosine annealing warm restarts scheduler [16]. We optimized the network for 20 epochs with the Focal loss function [15] and evaluated with the F1 score.
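A hedged sketch of this setup in PyTorch; the learning rate of 0.9, the 20 epochs, the cosine annealing warm restarts scheduler, and the focal loss follow the text, while the placeholder action-unit head, the toy loader, and the scheduler period T_0 are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import sigmoid_focal_loss

model = nn.Linear(512, 12)       # placeholder AU head: 512-d features -> 12 action-unit logits
# toy stand-in for a real loader over (features, binary action-unit labels)
train_loader = [(torch.randn(8, 512), torch.randint(0, 2, (8, 12)).float()) for _ in range(10)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.9)     # learning rate as quoted in the text
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=5)  # T_0 assumed

for epoch in range(20):                                      # 20 epochs, as in the text
    for feats, au_labels in train_loader:
        logits = model(feats)
        loss = sigmoid_focal_loss(logits, au_labels, reduction="mean")  # Focal loss [15]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```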
5. Conclusions

This paper utilized features from deep learning representations for the Valence-Arousal Estimation sub-challenge of ABAW3 2022. To extract information over time, a GRU is used for sentiment analysis. To enhance the advantages of the GRU, we have connected the Local Attention mechanism to it.
Method | Feature | P_V | P_A | P_VA
Baseline [6] | ResNet | 0.31 | 0.17 | 0.24
Deng et al. [3] | MobilefaceNet + MarbleNet | 0.442 | 0.546 | 0.494
Zhang et al. [21] | Expression Embedding [22] | 0.488 | 0.502 | 0.495
GRU + Transformer | RegNet | 0.391 | 0.565 | 0.478
GRU | k-fold | 0.432 | 0.575 | 0.504
GRU + Attention | k-fold | 0.437 | 0.576 | 0.507

Table 2. Comparison with previous works on valence-arousal estimation on the Aff-wild2 validation set.
[15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and
Piotr Dollár. Focal loss for dense object detection. In Pro-
ceedings of the IEEE international conference on computer
vision, pages 2980–2988, 2017. 3
[16] Ilya Loshchilov and Frank Hutter. Sgdr: Stochas-
tic gradient descent with warm restarts. arXiv preprint
arXiv:1608.03983, 2016. 3
[17] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick,
Kaiming He, and Piotr Dollár. Designing network design
spaces. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 10428–
10436, 2020. 1, 2
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. Advances in neural
information processing systems, 30, 2017. 2
[19] Manh Tu Vu, Marie Beurton-Aimar, and Serge Marchand.
Multitask multi-database emotion recognition. In Proceed-
ings of the IEEE/CVF International Conference on Com-
puter Vision, pages 3637–3644, 2021. 2
[20] Stefanos Zafeiriou, Dimitrios Kollias, Mihalis A Nicolaou,
Athanasios Papaioannou, Guoying Zhao, and Irene Kot-
sia. Aff-wild: Valence and arousal ‘in-the-wild’ challenge.
In Computer Vision and Pattern Recognition Workshops
(CVPRW), 2017 IEEE Conference on, pages 1980–1987.
IEEE, 2017. 1, 2
[21] Wei Zhang, Zunhu Guo, Keyu Chen, Lincheng Li, Zhimeng
Zhang, and Yu Ding. Prior aided streaming network for
multi-task affective recognition at the 2nd abaw2 competi-
tion. arXiv preprint arXiv:2107.03708, 2021. 2, 3, 4
[22] Wei Zhang, Xianpeng Ji, Keyu Chen, Yu Ding, and Changjie
Fan. Learning a facial expression embedding disentangled
from identity. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 6759–
6768, 2021. 4