Abstract
Fast and discriminative feature extraction has always been a critical issue for spontaneous micro-expression recognition. In this paper, a micro-expression analysis framework based on a new facial representation is proposed. First, to remove redundant information from the micro-expression video sequence, a key frame is adaptively selected using the structural similarity index (SSIM) between face images. Robust principal component analysis (RPCA) then extracts the sparse information of the key frame, which retains the expression attributes of the micro-expression sequence while eliminating useless information. Finally, dual-cross patterns (DCP) are used to extract features from the sparse key frame. Repeated comparison experiments were performed on the SMIC database to evaluate the performance of the method. Experimental results demonstrate that the proposed method achieves promising performance for micro-expression recognition.
Keywords
- Micro-expression recognition
- Key frame
- Robust principal component analysis
- Dual-cross patterns
- Feature extraction
1 Introduction
In recent years, micro-expressions have received increasing attention. In many situations people hide, camouflage, or suppress their true emotions [1], producing partial, rapid facial movements known as micro-expressions. Compared with ordinary expressions, their short duration is the defining characteristic: they typically last between 1/25 s and 1/3 s [2]. Micro-expressions have potential applications in many areas, such as national security, interrogation, and medical care. Notably, only trained observers can distinguish micro-expressions, and even after training the recognition rate is only 47% [3]. Research on micro-expression recognition is therefore of great significance.
Early research on facial expressions focused on micro-expressions spotted within macro-expressions [4]. In recent years, spontaneous micro-expressions have attracted growing attention from researchers. Micro-expression recognition requires a large amount of data for training and modeling, yet collecting such data is difficult for non-professionals, which is one of the main obstacles in this field. Commonly used spontaneous micro-expression datasets are SMIC [5] from the University of Oulu and CASME [6] and CASME II [7] from the Chinese Academy of Sciences. The SMIC dataset consists of three subsets, HS, VIS, and NIR, captured by a high-speed camera, a normal camera, and a near-infrared camera, respectively.
The object of micro-expression processing is a video clip, and a gray-scale clip can be regarded as a 3D volume, so many micro-expression algorithms focus on extracting 3D texture features. Local binary patterns from three orthogonal planes (LBP-TOP) [8] extend LBP to three-dimensional space and are widely used in micro-expression analysis. LBP-TOP has proven effective for micro-expression recognition, and many researchers have proposed improvements based on it. For example, Huang et al. proposed the completed local quantized pattern (CLQP) [9] to reduce feature dimensionality. Subsequently, an integral projection method based on difference images (STLBP-IP) [10] was proposed: it first obtains the difference images of the micro-expression sequence and then combines integral projection with LBP to obtain the feature vector. In 2017, Huang et al. [11] proposed an RPCA-based integral projection method (STLBP-RIP) for recognizing spontaneous micro-expressions, which outperforms the earlier methods.
In this paper, we present a new feature extraction algorithm for micro-expressions, dual-cross patterns with RPCA of the key frame (DCP-RKF). For each video sequence, the first and last frames serve as reference frames, and the structural similarity index (SSIM) [12] is used to find the key frame of the sequence. Sparse information is then extracted from the key frame using RPCA, and features are extracted using DCP [13].
2 Key Frame Based on Structural Similarity (SSIM)
The spatial-domain SSIM index measures the similarity of local luminance, contrast, and structure between a reference image and a distorted image. Because it is a symmetric measure, it can also be regarded as a similarity measure for comparing any two signals [16]. Given two images \( x \) and \( y \), the SSIM index is defined as

$$SSIM(x, y) = \frac{(2\mu_{x}\mu_{y} + c_{1})(2\sigma_{xy} + c_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + c_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2})} \quad (1)$$

where \( \mu_{x} \) and \( \mu_{y} \) are the pixel means of the images \( x \) and \( y \), \( \sigma_{x}^{2} \) and \( \sigma_{y}^{2} \) are their variances, and \( \sigma_{xy} \) is their covariance. The constants \( c_{1} \) and \( c_{2} \) maintain stability when the means and variances are close to zero. By default, \( c_{1} = (0.01L)^{2} \) and \( c_{2} = (0.03L)^{2} \), where \( L \) is the dynamic range of the pixel values. The SSIM index is bounded above by 1, and equals 1 exactly when the two images are identical.
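To make the computation concrete, the following is a minimal NumPy sketch of Eq. (1) using global image statistics; the function name and default dynamic range are our own choices, not taken from the paper. In practice SSIM is often computed over local windows and averaged, but the global form matches the statistics as defined above.

```python
import numpy as np

def ssim(x, y, L=255):
    """SSIM index of Eq. (1) computed from global image statistics.

    x, y: grayscale images of equal shape; L: dynamic range (255 for 8-bit).
    """
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()            # sigma_x^2, sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # sigma_xy
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```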
The pioneering work of Wang et al. [12] showed that SSIM-motivated optimization plays an important role in video coding, a problem closely related to micro-expression recognition.
Traditional feature extraction methods for micro-expression video sequences process the entire sequence or a large part of it. Since micro-expression databases always suffer from alignment and lighting problems, redundant data hinders accurate recognition [14]. We therefore propose to use only one image per video, called the key frame. The key frame contains the highest intensity of expression change among all frames, while the onset and offset frames, which show a nearly neutral expression, are natural reference frames; SSIM is used to extract the key frame.
Given a micro-expression video sequence \( \{f_{i} \,|\, i = 1, \ldots, n\} \), let \( R_{1} = f_{1} \) and \( R_{2} = f_{n} \) be the reference frames of the sequence, i.e., its first and last frames. For each remaining frame, the total SSIM is defined as

$$TSSIM(f_{i}) = SSIM(f_{i}, R_{1}) + SSIM(f_{i}, R_{2}) \quad (2)$$

Substituting the SSIM index of Eq. (1) into Eq. (2), the total SSIM can be rewritten as

$$TSSIM(f_{i}) = \frac{(2\mu_{f_{i}}\mu_{R_{1}} + c_{1})(2\sigma_{f_{i}R_{1}} + c_{2})}{(\mu_{f_{i}}^{2} + \mu_{R_{1}}^{2} + c_{1})(\sigma_{f_{i}}^{2} + \sigma_{R_{1}}^{2} + c_{2})} + \frac{(2\mu_{f_{i}}\mu_{R_{2}} + c_{1})(2\sigma_{f_{i}R_{2}} + c_{2})}{(\mu_{f_{i}}^{2} + \mu_{R_{2}}^{2} + c_{1})(\sigma_{f_{i}}^{2} + \sigma_{R_{2}}^{2} + c_{2})} \quad (3)$$

where \( i = 2, 3, \ldots, n-1 \), so TSSIM is computed for every frame except the first and last. The key frame is the frame with the smallest TSSIM, i.e., the frame that differs most from both reference frames:

$$f_{key} = \mathop{\arg\min}_{f_{i},\; 2 \le i \le n-1} TSSIM(f_{i}) \quad (4)$$
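A short sketch of the key-frame selection of Eqs. (2)–(4), reusing the ssim helper above; the function name and return convention are ours:

```python
import numpy as np

def select_key_frame(frames, L=255):
    """Return the index of the key frame of a sequence per Eqs. (2)-(4).

    frames: sequence of n grayscale frames; frames[0] and frames[-1]
    act as the neutral reference frames R1 and R2.
    """
    r1, r2 = frames[0], frames[-1]
    # Total SSIM for the interior frames f_2, ..., f_{n-1}
    tssim = [ssim(f, r1, L) + ssim(f, r2, L) for f in frames[1:-1]]
    # Smallest TSSIM = largest difference from both references
    return int(np.argmin(tssim)) + 1  # +1 offsets the skipped first frame
```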
3 Sparse Information Extracted from Key Frame Using RPCA
Although the key frame selected by structural similarity preserves the main, discriminative information of a micro-expression, it still contains a large amount of expression-irrelevant facial information. STLBP-IP [10] showed that difference images derived from video clips characterize micro-expressions well. Following this idea, we use robust principal component analysis to extract the motion information from the key frame.
Using Eq. (4) we obtain the key frame of a video sequence, denoted \( M \) for convenience. Since \( M \) is a data matrix whose dominant component lies in a low-rank subspace, it may be decomposed as

$$M = L_{0} + S_{0} \quad (5)$$

where \( L_{0} \) is a low-rank matrix and \( S_{0} \) is a sparse matrix; our aim is to recover \( S_{0} \). This can be done by tractable convex optimization: the decomposition of Eq. (5) is formulated as the principal component pursuit (PCP) problem

$$\min_{L, S} \; ||L||_{*} + \lambda ||S||_{1} \quad \text{subject to} \quad L + S = M \quad (6)$$

where \( ||\cdot||_{*} \) denotes the nuclear norm (the sum of the singular values), \( ||\cdot||_{1} \) the \( \ell_{1} \) norm, and \( \lambda \) a positive weighting parameter. Iterative thresholding techniques that minimize a combination of the \( \ell_{1} \) norm and the nuclear norm can solve this problem, but they converge very slowly.
A faster alternative is the augmented Lagrangian multiplier (ALM) method, which operates on the augmented Lagrangian

$$l(L, S, Y) = ||L||_{*} + \lambda ||S||_{1} + \langle Y, M - L - S \rangle + \frac{\mu}{2} ||M - L - S||_{F}^{2} \quad (7)$$

where \( Y \) is a Lagrange multiplier and \( \mu \) is a positive scalar. A generic Lagrange multiplier algorithm solves PCP by repeatedly setting \( (L_{k}, S_{k}) = \arg\min_{L,S} l(L, S, Y_{k}) \) and then updating the multiplier matrix via \( Y_{k+1} = Y_{k} + \mu (M - L_{k} - S_{k}) \). Equation (7) can be solved by the ALM method proposed by Candès et al. [15]. Fig. 1 shows the key frame selected from a micro-expression video clip labeled as negative. After applying RPCA, only the sparse part is kept for feature extraction, which greatly reduces the amount of information and reflects the simplicity of the proposed method. As seen from Fig. 1, the subtle motion image obtained by RPCA characterizes well the specific regions of facial movement.
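For reference, a compact sketch of an inexact-ALM solver for the PCP problem of Eqs. (6)–(7); the parameter schedule (the initialization of \( Y \) and \( \mu \), and the growth factor \( \rho \)) follows common defaults for this solver rather than values stated in the paper:

```python
import numpy as np

def rpca_ialm(M, lam=None, tol=1e-7, max_iter=500):
    """Decompose M into low-rank L plus sparse S by principal component
    pursuit, solved with an inexact ALM on Eq. (7): alternate singular
    value thresholding (L-step) and soft thresholding (S-step), then
    update the multiplier Y.
    """
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))        # weighting suggested in [15]
    norm_M = np.linalg.norm(M, 'fro')
    spec = np.linalg.norm(M, 2)               # largest singular value
    Y = M / max(spec, np.abs(M).max() / lam)  # common dual initialization
    mu, rho = 1.25 / spec, 1.5
    mu_bar = mu * 1e7
    S = np.zeros(M.shape)

    for _ in range(max_iter):
        # L-step: singular value thresholding of M - S + Y/mu
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # S-step: elementwise soft thresholding of M - L + Y/mu
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        # Multiplier and penalty updates
        Z = M - L - S
        Y = Y + mu * Z
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(Z, 'fro') / norm_M < tol:
            break
    return L, S
```

Applied to the key frame \( M \), the returned sparse component plays the role of \( S_{0} \) and is passed on to the DCP descriptor.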
4 Dual-Cross Patterns (DCP)
DCP is a local binary descriptor built on local sampling and pattern encoding, the two key components of a face-image descriptor. DCP encodes second-order statistical information along the most informative directions of the face image. The work of Ding et al. [13] shows that DCP has strong discriminative power and is robust to pose, expression, and illumination changes. DCP differs from LBP mainly in its local sampling scheme, as shown in Fig. 2.
The aim of DCP is to perform local sampling and pattern encoding along the directions that carry the most information in a face image. After the face image is normalized, facial components such as the eyes, nose, mouth, and eyebrows extend either horizontally or along the diagonal directions (\( \pi/4 \) and \( 3\pi/4 \)). As shown in Fig. 2(a), each pixel is sampled in 8 directions: 0, \( \pi/4 \), \( \pi/2 \), \( 3\pi/4 \), \( \pi \), \( 5\pi/4 \), \( 3\pi/2 \), and \( 7\pi/4 \), with two points sampled in each direction. The resulting sampling points are \( \{A_{0}, B_{0}; A_{1}, B_{1}; \ldots; A_{7}, B_{7}\} \), where the points \( A_{i} \) lie at radius \( R_{in} \) and the points \( B_{i} \) at radius \( R_{ex} \).
The encoding for each direction is defined as

$$DCP_{i} = S(I_{A_{i}} - I_{O}) \times 2 + S(I_{B_{i}} - I_{A_{i}}), \quad i = 0, 1, \ldots, 7 \quad (8)$$

where \( S(t) = \begin{cases} 1, & t \ge 0 \\ 0, & t < 0 \end{cases} \), and \( I_{O} \), \( I_{A_{i}} \), \( I_{B_{i}} \) are the gray values of the points \( O \), \( A_{i} \), and \( B_{i} \), respectively.
To capture the horizontal-vertical and the diagonal information of the image separately, the \( DCP_{i} \) codes are divided between two cross encoders. We define \( \{DCP_{0}, DCP_{2}, DCP_{4}, DCP_{6}\} \) as the first subset, named DCP-1, and \( \{DCP_{1}, DCP_{3}, DCP_{5}, DCP_{7}\} \) as the second subset, named DCP-2, as shown in Fig. 2(b). The codes at each pixel are represented as

$$DCP\text{-}1 = \sum_{i=0}^{3} DCP_{2i} \times 4^{i}, \qquad DCP\text{-}2 = \sum_{i=0}^{3} DCP_{2i+1} \times 4^{i} \quad (9)$$
Thus, the DCP descriptor for each pixel in an image can be represented by the two codes generated by the cross encoders.
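The sketch below computes the DCP-1 and DCP-2 code maps of Eqs. (8)–(9) under two assumptions of ours: sampling offsets are rounded to integer pixels (no interpolation), and the image border is handled by edge padding.

```python
import numpy as np

def dcp_codes(img, r_in=4, r_ex=9):
    """DCP-1 and DCP-2 code maps of a grayscale image (Eqs. (8)-(9)).

    For every pixel O, points A_i and B_i are sampled at radii
    r_in < r_ex in each of the 8 directions i*pi/4; each direction is
    encoded as DCP_i = S(I_A - I_O)*2 + S(I_B - I_A), S the unit step.
    Even directions form DCP-1, odd directions DCP-2.
    """
    img = img.astype(np.float64)
    h, w = img.shape
    pad = r_ex
    p = np.pad(img, pad, mode='edge')

    dcp1 = np.zeros((h, w), dtype=np.int32)
    dcp2 = np.zeros((h, w), dtype=np.int32)
    for i in range(8):
        ang = i * np.pi / 4
        dy, dx = -np.sin(ang), np.cos(ang)  # row axis points downward
        ay, ax = int(round(r_in * dy)), int(round(r_in * dx))
        by, bx = int(round(r_ex * dy)), int(round(r_ex * dx))
        I_A = p[pad + ay: pad + ay + h, pad + ax: pad + ax + w]
        I_B = p[pad + by: pad + by + h, pad + bx: pad + bx + w]
        code = (I_A >= img).astype(np.int32) * 2 + (I_B >= I_A)
        if i % 2 == 0:
            dcp1 += code * 4 ** (i // 2)        # directions 0, 2, 4, 6
        else:
            dcp2 += code * 4 ** ((i - 1) // 2)  # directions 1, 3, 5, 7
    return dcp1, dcp2  # each code map takes values in [0, 255]
```

With 4 possible codes per direction and 4 directions per encoder, each code map has \( 4^{4} = 256 \) distinct values, so each encoder yields a 256-bin histogram per region.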
5 Results and Discussion
To evaluate DCP-RKF, experiments are conducted on the SMIC-HS database for micro-expression recognition [10]. The SMIC-HS database contains 164 spontaneous micro-expression samples from 16 subjects, recorded by a 100-fps camera at a spatial resolution of 640 × 480 pixels. The database covers 3 classes of micro-expressions: negative (70 samples), positive (51 samples), and surprise (43 samples).
For the SMIC-HS database, we first use an active shape model (ASM) to extract 68 facial landmarks from each micro-expression image and align it to a standard frame, and then crop the facial images to 170 × 139 pixels. In the experiments we use the leave-one-sample-out cross-validation protocol, in which one sample is used for testing and the remaining samples for training. For classification we use the chi-square distance, as sketched below.
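The paper specifies the chi-square distance but not the classifier; a nearest-neighbour rule is the usual companion of this distance, so the protocol sketch below assumes it:

```python
import numpy as np

def chi_square(p, q, eps=1e-10):
    """Chi-square distance between two feature histograms."""
    return np.sum((p - q) ** 2 / (p + q + eps))

def loso_accuracy(features, labels):
    """Leave-one-sample-out protocol with a nearest-neighbour
    classifier under the chi-square distance."""
    n, correct = len(features), 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        dists = [chi_square(features[i], features[j]) for j in others]
        correct += labels[others[int(np.argmin(dists))]] == labels[i]
    return correct / n
```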
We now examine the parameters of dual-cross patterns with RPCA of the key frame (DCP-RKF). The block size \( N \) of the sparse key frame and the inner and outer radii \( (R_{in}, R_{ex}) \) of DCP are the two important parameters of DCP-RKF; they determine the complexity of the algorithm and the classification performance. In this subsection we evaluate the effects of \( N \) and \( (R_{in}, R_{ex}) \) on the SMIC-HS database. The number of blocks of a sparse key frame is given by the numbers of rows and columns, \( N = (row, col) \) (see the sketch after this paragraph). To compare the features at a general level and avoid bias, we vary the number of blocks while keeping the DCP radii fixed at \( (R_{in}, R_{ex}) = (5, 7) \). The results of DCP-RKF on the SMIC-HS database are presented in Fig. 3.
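As an illustration of the blocking scheme, this sketch splits the sparse key frame into \( row \times col \) blocks, histograms the DCP codes of each block, and concatenates the histograms; the normalization and concatenation order are our assumptions, and dcp_codes is the helper sketched in Sect. 4:

```python
import numpy as np

def blocked_dcp_feature(sparse_key_frame, rows=8, cols=9, r_in=4, r_ex=9):
    """Concatenate per-block histograms of the DCP-1/DCP-2 code maps."""
    dcp1, dcp2 = dcp_codes(sparse_key_frame, r_in, r_ex)
    h, w = sparse_key_frame.shape
    feats = []
    for code_map in (dcp1, dcp2):
        for r in range(rows):
            for c in range(cols):
                block = code_map[r * h // rows:(r + 1) * h // rows,
                                 c * w // cols:(c + 1) * w // cols]
                hist, _ = np.histogram(block, bins=256, range=(0, 256))
                feats.append(hist / max(block.size, 1))  # normalized
    return np.concatenate(feats)
```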
Fig. 3 shows that with DCP radii (5, 7), the DCP-RKF recognition rate reaches 62.8% at block sizes (8, 9) and (10, 10). With block size (1, 1), i.e., no blocking, the overall recognition rate is lower, which indicates that blocking helps recognition; we believe it preserves the positional information of the micro-expressions. Within a certain range, the recognition rate generally rises as the number of blocks increases; beyond that range, however, further increasing the number of blocks does not improve the recognition rate.
Based on the chosen block size \( N = (8, 9) \), we examine the influence of the radii \( (R_{in}, R_{ex}) \), with \( R_{in}, R_{ex} \in \{1, 2, \ldots, 9\} \) and \( R_{in} < R_{ex} \), on the SMIC-HS database; the results are shown in Table 1. With the sparse key frame divided into 8 × 9 blocks, the recognition rate of DCP-RKF clearly depends on the DCP radii: the greater the difference between \( R_{in} \) and \( R_{ex} \), the better the recognition, although if the radii differ too much the performance drops again. With 8 × 9 blocks, the best recognition rate of 63.41% is obtained at radii (4, 9).
To verify the proposed method, we compare the recognition rate of DCP-RKF with LBP-TOP [8], STLBP-IP [10], and STLBP-RIP [11] on the SMIC-HS database. Note that DCP-1-RKF and DCP-2-RKF denote variants that use only DCP-1 or only DCP-2, respectively, for feature extraction from the sparse key frame.
For DCP-RKF we use 8 × 9 blocks on the sparse key frame and DCP radii (4, 9). The leave-one-sample-out cross-validation protocol selects the training and testing samples, and the chi-square distance is used for classification. For LBP-TOP, STLBP-IP, and STLBP-RIP, to make the comparison fair, we use the same 8 × 9 blocking, the optimal parameters reported in the respective papers, and the same classification method, re-running each algorithm on the SMIC-HS database. The recognition rates are reported in Table 2. LBP-TOP achieves a recognition rate of 55.49%, while STLBP-IP and STLBP-RIP only reach 50%. DCP-RKF attains the best recognition rate of 63.41%, 7.92 percentage points higher than LBP-TOP. These results show that DCP-RKF captures good geometric and texture information and is well suited to micro-expression feature extraction.
The confusion matrices of LBP-TOP, STLBP-IP, STLBP-RIP, and our method are shown in Fig. 4. Compared with the other methods, DCP-RKF performs better on all three micro-expression classes (negative, positive, surprise). On negative micro-expressions, DCP-RKF achieves a recognition rate of 67.14%, higher than the 62.86% of LBP-TOP, 50% of STLBP-IP, and 60% of STLBP-RIP. Similarly, on positive and surprise micro-expressions, DCP-RKF achieves 64.7% and 55.81%, also higher than the other three methods.
6 Conclusions
In this paper, we propose dual-cross patterns with RPCA of the key frame (DCP-RKF) for micro-expression recognition. Specifically, we first use SSIM to obtain the key frame of a micro-expression sequence, then apply RPCA to obtain the sparse information of the key frame, and finally use DCP to extract the features. Experimental results on the SMIC-HS micro-expression database demonstrate that the proposed method achieves higher recognition rates than state-of-the-art methods and shows promising performance.
References
Ekman, P., Friesen, W.V., O'Sullivan, M., et al.: Universals and cultural differences in the judgments of facial expressions of emotion. J. Pers. Soc. Psychol. 53(4), 712–717 (1987)
Shen, X., Wu, Q., Fu, X.: Effects of the duration of expressions on the recognition of micro-expressions. J. Zhejiang Univ. Sci. B 13(3), 221–230 (2012)
Frank, M.G., Herbasz, M., Sinuk, K., et al.: I see how you feel: training laypeople and professionals to recognize fleeting emotions. In: The Annual Meeting of the International Communication Association, Sheraton New York, pp. 1–2 (2009)
Shreve, M., Godavarthy, S., Goldgof, D., et al.: Macro-and micro-expression spotting in long videos using spatio-temporal strain. In: 2011 IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011), pp. 51–56. IEEE (2011)
Pfister, T., Li, X., Zhao, G., et al.: Recognizing spontaneous facial micro-expressions. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1449–1456. IEEE (2011)
Yan, W.J., Wu, Q., Liu, Y.J., et al.: CASME database: a dataset of spontaneous micro-expressions collected from neutralized faces. In: 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pp. 1–7. IEEE (2013)
Yan, W.J., Li, X., Wang, S.J., et al.: CASME II: an improved spontaneous micro-expression database and the baseline evaluation. PLoS ONE 9(1), e86041 (2014)
Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns with an application to facial expression. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928 (2007)
Huang, X., Zhao, G., Hong, X., Pietikäinen, M., Zheng, W.: Texture description with completed local quantized patterns. In: Kämäräinen, J.-K., Koskela, M. (eds.) SCIA 2013. LNCS, vol. 7944, pp. 1–10. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38886-6_1
Huang, X., Wang, S.J., Zhao, G., et al.: Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1–9. IEEE (2015)
Huang, X., Wang, S.J., Liu, X., et al.: Discriminative spatiotemporal local binary pattern with revisited integral projection for spontaneous facial micro-expression recognition. IEEE Trans. Affect. Comput. 10(1), 32–47 (2019)
Wang, S., Rehman, A., Wang, Z., et al.: SSIM-motivated rate-distortion optimization for video coding. IEEE Trans. Circuits Syst. Video Technol. 22(4), 516–529 (2012)
Ding, C., Choi, J., Tao, D., et al.: Multi-directional multi-level dual-cross patterns for robust face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38(3), 518–531 (2016)
Liong, S.T., See, J., Wong, K.S., et al.: Less is more: micro-expression recognition from video using apex frame. Signal Process.: Image Commun. 62(1), 82–92 (2018)
Candès, E.J., Li, X., Ma, Y., et al.: Robust principal component analysis? J. ACM 58(3), Article 11 (2011)
Acknowledgments
This paper is supported by the National Nature Science Foundation of China (No. 61861020), the Natural Science Foundation of Jiangxi Province of China (No. 20171BAB202006).