Abstract
In film production, a good blendshape model is needed to give a virtual human rich and realistic facial expressions. However, selecting and capturing the base expressions for a blendshape model requires a great deal of manual work, time, and effort, and the resulting model may still lack expressiveness. This paper proposes a method for automatically selecting a set of base expressions from a sequence of facial motions. In this method, Procrustes analysis is used to estimate the difference between face meshes and to determine the composition of the base expressions, which are then used to build a local blendshape model that enhances expressiveness. The results of reconstructing facial expressions with the local blendshape model are shown in this paper. With this method, the base expressions can be selected automatically from the expression sequence, reducing manual operation.
This project was funded by the National Key Research and Development Program of China (No. 2017YFB1002805), the National Natural Science Foundation of China (No. U1605254) and Microsoft Research Asia.
1 Introduction
In film production, lifelike facial expressions of a virtual human play an important role in conveying the character’s emotions and speech. Facial expressions are commonly acquired with facial motion capture devices and blendshape models. To achieve a good blending result, it is necessary to capture as many expressions as possible. During facial expression scanning, the Facial Action Coding System (FACS) [7] is commonly used to guide the selection of the expressions to be captured. FACS is based on the anatomical structure of the face and the laws of facial movement, and it contains more than 100 expressions. During model scanning, other expressions are also added empirically on top of FACS. For example, in the movie “The Curious Case of Benjamin Button”, 170 blendshapes based on FACS principles were used [6]. However, choosing the expressions to be scanned depends mostly on the experience of artists. During scanning, an actor may perform an expression differently each time, and the descriptions of the expressions are often confusing, so the same expression needs to be scanned many times. As a result, constructing a blendshape model costs considerable time and manual operation, and the actors also need to make a great effort. How to select the base expressions for a blendshape model automatically is therefore worth studying.
Disney Research proposed a local blendshape model based on facial anatomical constraints [15]. They selected 10 expressions from a FACS subset of 26 as the base expressions of their blendshape model, thus realizing monocular facial expression capture. Their way of choosing base expressions is inspiring.
In a local blendshape model, the entire face is divided into several local regions, and each region can be blended independently. The local blendshape model is more flexible because it provides more degrees of freedom and has the advantage of exploiting hidden data [19]. Because the face regions are independent of each other, multiple local deformations of the face can be fused together, which can further reduce the number of base expressions needed for the blendshape model.
In this paper, a method for automatically selecting the base expressions of a blendshape model from a set of random expressions is proposed. The method is based on the local blendshape model. From a set of facial expression models with randomness, a subset is selected iteratively as the base expressions, and the unselected expressions are reconstructed from the selected subset by a facial expression reconstruction algorithm. The candidate model with the greatest reconstruction error is selected as the next base expression. Finally, a most suitable set of base expressions is obtained. This paper shows the automatically selected set of base expressions and the results of facial expression reconstruction.
2 Related Work
For building a blendshape model, most base expression selections are based on FACS. Ichim [9] proposed a physics-based facial animation method; they used 48 blendshapes inspired by FACS, sculpted by artists, as templates for facial animation. In the movie “King Kong”, the expression space used in the reconstruction of facial animation was a superset of FACS [12]. Cao created a 3D facial expression database [4] that contains the expression data of 150 individuals and built a blendshape model of 46 expression units for each person; these 46 expression units come from FACS. Weise proposed a performance-driven real-time facial expression system [14] in which 39 facial expressions based on FACS were used to construct the blendshape model. FACS plays an important role in guiding the construction of blendshape models. However, in current methods the number of base expressions that need to be scanned is large, and the actor may need to repeat an expression many times because each attempt may come out differently, which is challenging for the actor; the scans also need to be selected afterwards.
The local blendshape model divides the face into different regions and constructs a blendshape for each part so that every face region can deform independently. The local blendshape model is more flexible and can express face shapes that are not in the base expressions. As early as 1995, Black [1] studied local parametric models for the recovery and recognition of non-rigid facial motion. Decarlo [5] manually created a parameterized 3D face model for tracking faces in video; the shape of the model is controlled by parameterized deformations applied to specific regions of the face, from a single region to the entire face. Blanz [2] manually divided the face into five parts, creating a deformation model for each part. Joshi [10] used a region-based blendshape model for keyframe facial animation and automatically determined the best segmentation through a physical model. Zhang [17] proposed a system for synthesizing facial expressions for 2D images and 3D face models; they empirically divided the face into several regions in order to synthesize asymmetric expressions. Tena [13] built a region-based PCA model from motion capture data, allowing manual manipulation of face models. Brunton [3] used many local multilinear models to reconstruct faces from noisy point clouds. Neumann et al. [11] proposed a method for extracting sparse, spatially local deformation patterns from animated mesh sequences. The local blendshape model has more parameters than the global blendshape model and provides more degrees of freedom, so it is more flexible and can still perform well with a small number of base expressions.
Based on the local blendshape model, this paper iteratively and automatically selects the base expressions from a set of random face models and reconstructs the unselected expressions, covering the whole set with a small number of expressions.
3 Overview
The overall process of automatically selecting the base expressions is shown in Fig. 1. First, a series of random facial expression models is obtained. Starting from the neutral expression, base expressions are selected from the set of models, and the unselected expressions are used for reconstruction. The selected base expressions are used to construct a local blendshape model. Next, each unselected expression is treated as a target model, and monocular facial expression reconstruction is performed with the constructed local blendshape model to obtain the corresponding reconstructed expression model. The reconstructed expression meshes are then compared with the target models, and the difference \(d_i\) between the reconstructed expression and the target expression is computed as the error using Procrustes analysis [8]. After that, the expression model with the largest error is selected as the new base expression, and the next iteration is performed. When the rate of change of the error becomes relatively small, a suitable set of base expressions is obtained. Each step is introduced below.
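The control flow of this selection loop can be sketched as follows; `build_local_blendshape`, `reconstruct_expression`, and `procrustes_distance` are hypothetical helpers standing in for the components described in Sects. 5 and 6, and the stopping thresholds are placeholders, so this is only an illustration of the iteration rather than the exact implementation.

```python
def select_base_expressions(meshes, neutral_idx, err_thresh=0.1, rate_thresh=0.01):
    """Iteratively pick base expressions from a list of (V, 3) vertex arrays
    that share the same topology. Helper functions are hypothetical."""
    selected = [neutral_idx]
    prev_max_err = None
    while True:
        model = build_local_blendshape([meshes[i] for i in selected])
        errors = {}
        for i, target in enumerate(meshes):
            if i in selected:
                continue
            recon = reconstruct_expression(model, target)   # Sect. 6.1
            errors[i] = procrustes_distance(recon, target)  # Procrustes error d_i
        if not errors:
            break
        worst = max(errors, key=errors.get)
        max_err = errors[worst]
        # Stop once the error is small and changes slowly between iterations.
        if prev_max_err is not None and max_err < err_thresh \
                and abs(prev_max_err - max_err) < rate_thresh:
            break
        selected.append(worst)
        prev_max_err = max_err
    return selected
```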
4 Facial Expression Sequence
Before 3D face modeling, it is necessary to determine all the expressions to be scanned, inform the actor of these expressions in advance, explain the essentials of each facial expression, and let the actor practice. During scanning, in order to prevent insufficient movement, each expression is generally scanned several times. The goal of this paper is to simplify this process.
Referring to the requirements of the facial animation capture system Dynamixyz, this paper selects 56 extreme expressions as the expression sequence. These expressions are also based on FACS. This set of 56 expressions can be used as the base expressions of a global blendshape model for reconstructing most facial expressions. If a subset can be chosen that covers this set of expressions, then the subset can also be used to construct most facial expressions. Another 10 expressions, supplementing some Chinese pronunciations and random expressions, are added to this set.
There are many ways to obtain 3D facial expression models. This paper uses a multi-view face reconstruction method similar to [18]. Images of the face from different angles are acquired by a multi-view camera array, the facial feature points are matched, the point cloud is computed, and finally the face model is obtained. In order to build a blendshape model, all models need to have the same topology, so a face model template is fitted to each reconstructed face with the help of the ZBrush and R3DS Wrap3 software.
5 Local Blendshape Model
The base expressions need to cover the deformations of all face regions, and with a global blendshape model this requires selecting a large number of expressions. The local blendshape model is more flexible and has more degrees of freedom: the deformation state of each region can be determined with a small number of base expressions, thus enabling the reconstruction of facial expressions. Therefore, the local blendshape model is essential for this paper.
The first step in constructing the local blendshape model is to divide the facial area and segment the face model. There are many ways to divide the facial area, such as along the lines of the skin’s surface or according to the distribution of muscles; in short, face segmentation is a challenging task. Wu [15] used a UV-based partitioning method to segment a face model with 700k points into 1000 patches. This paper uses a similar approach: the facial area is selected in the UV map and divided in the horizontal and vertical directions, and the corresponding portions of the face model are segmented. To ensure that each patch contains an appropriate number of points, the face model used in this paper, which has about 50k points, is divided into 100 patches. To keep the face model complete, there are overlap areas between the patches, which makes it convenient to fuse the patches into a whole. By adjusting the patch size, each patch shares approximately 20% of its vertices with neighboring patches.
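A minimal sketch of such a UV-grid segmentation with overlap is given below; the 10×10 grid, the overlap margin, and the function name are assumptions used only to illustrate the idea, not the authors' implementation.

```python
import numpy as np

def segment_by_uv(uv, grid=(10, 10), overlap=0.2):
    """Assign vertices to overlapping patches on a regular UV grid.

    uv : (N, 2) array of per-vertex UV coordinates in [0, 1].
    Each cell is enlarged by `overlap` of its size on every side so that
    neighbouring patches share part of their vertices.
    Returns a list of vertex-index arrays, one per patch.
    """
    rows, cols = grid
    cell_h, cell_w = 1.0 / rows, 1.0 / cols
    patches = []
    for r in range(rows):
        for c in range(cols):
            u0, u1 = (c - overlap) * cell_w, (c + 1 + overlap) * cell_w
            v0, v1 = (r - overlap) * cell_h, (r + 1 + overlap) * cell_h
            inside = ((uv[:, 0] >= u0) & (uv[:, 0] < u1) &
                      (uv[:, 1] >= v0) & (uv[:, 1] < v1))
            patches.append(np.flatnonzero(inside))
    return patches
```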
After segmenting the face model, all the blendshape models are segmented in the same way. Let \(N\) be the number of expressions other than the neutral expression; then each patch has \(N+1\) shapes and can be blended separately. Let \(X_i\) be the shape of the \(i\)-th patch; it can be obtained by the following formula:

\(X_i = U_i + \sum _{n=1}^{N}\alpha _i^n D_i^n\)    (1)
Where \(U_i\) represents the shape of the i-th patch in the neutral expression, \(\alpha _i^n\) is the weight of the blendshape in the patch and \(D_i^n\) refers to the deformation of the i-th patch with the n-th blendshape. Let \(S_i^n\) be the n-th shape of the i-th patch, then \(D_i^n=S_i^n-U_i\).
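As a sketch, Eq. 1 can be evaluated per patch as below; the array shapes are assumptions.

```python
import numpy as np

def patch_shape(U_i, S_i, alpha_i):
    """Evaluate Eq. 1 for one patch.

    U_i     : (V_i, 3) neutral shape of the i-th patch.
    S_i     : (N, V_i, 3) the N non-neutral shapes S_i^n of the patch.
    alpha_i : (N,) blendshape weights alpha_i^n of the patch.
    """
    D_i = S_i - U_i[None, :, :]                       # D_i^n = S_i^n - U_i
    return U_i + np.tensordot(alpha_i, D_i, axes=1)   # U_i + sum_n alpha_i^n D_i^n
```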
6 Select the Base Expression
This paper selects a subset of \(N\) expressions as the base expressions from 66 random face models. To ensure the efficiency of facial expression reconstruction while maintaining accuracy, it is important to select an appropriate number of base expressions. This paper uses an iterative approach: the facial expression reconstruction method based on the local blendshape model is used to reconstruct the unselected expressions, Procrustes analysis is used to estimate the errors of the reconstructed models, and the expression with the largest error is added to the set of base expressions. Then the next iteration proceeds. Finally, an appropriate set of base expressions is obtained.
6.1 Facial Expression Reconstruction
In order to reconstruct facial expressions, it is necessary to obtain the blendshape parameters of each patch, as well as the position and pose of each patch in three-dimensional space. Since the current trend of facial expression reconstruction algorithms is toward the monocular RGB camera, the method used in this paper also targets this setting. In the energy function built in this paper, the blendshape parameters and pose of each patch are the unknowns, and the selected two-dimensional feature points in the face image and the corresponding three-dimensional points in the model are the inputs. The formula is as follows:

\(E = E_M + E_O\)
Here the energy \(E\) is composed of \(E_M\) and \(E_O\). \(E_M\) is the constraint on the patch shape and pose; it is used to determine the blendshape parameters of each patch in the face model and the position of the patch in space. \(E_O\) is the constraint on the overlap between a patch and its neighboring patches.
In the local blendshape model, each patch can be deformed independently. The shape \(X_i\) of each patch can be obtained by Eq. 1. Let \(x_i\) be the three-dimensional coordinates of the feature points selected in the \(i\)-th patch, and let \(P_i = K\cdot [R_i |T_i]\) be the projection matrix of the \(i\)-th patch, where \(K\) is the camera intrinsic matrix obtained by calibration, and \(R_i\) and \(T_i\) are the rotation matrix and the translation vector of the \(i\)-th patch. Let \(p_i\) be the corresponding 2D feature point coordinates in the \(i\)-th patch. \(E_M\) is computed as:

\(E_M = \lambda _M\sum _{i=1}^{V}\left\| P_i x_i - p_i\right\| ^2\)
Where \(V\) represents the number of patches in the local blendshape model, \(V=100\), and \(\lambda _M\) is the weight of the shape constraint; \(\lambda _M=1\) in this paper.
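A sketch of this shape-and-pose term as a per-patch reprojection residual is given below; the exact form of the paper's energy is not reproduced here, and the data layout and function names are assumptions.

```python
import numpy as np

def project(K, R, T, pts3d):
    """Project (M, 3) points with intrinsics K, rotation R, translation T."""
    cam = pts3d @ R.T + T           # camera-space coordinates
    pix = cam @ K.T                 # homogeneous pixel coordinates
    return pix[:, :2] / pix[:, 2:3]

def shape_term(patch_pts3d, patch_pts2d, K, poses, lam_M=1.0):
    """Reprojection error summed over all patches (in the spirit of E_M).

    patch_pts3d[i] : (M_i, 3) 3D feature points x_i of the i-th patch.
    patch_pts2d[i] : (M_i, 2) corresponding image points p_i.
    poses[i]       : (R_i, T_i) per-patch rotation and translation.
    """
    total = 0.0
    for x3d, x2d, (R, T) in zip(patch_pts3d, patch_pts2d, poses):
        total += np.sum((project(K, R, T, x3d) - x2d) ** 2)
    return lam_M * total
```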
The overlap constraint of the patches is used to limit the positions and shapes of neighboring patches to ensure the integrity of the facial expression:

\(E_O = \lambda _O\sum _{i=1}^{V}\sum _{j\in \mathcal {N}(i)}\left\| x_i - x_j\right\| ^2\)
Where \(x_i\) and \(x_j\) represent the three-dimensional coordinates of the overlapping vertices in the \(i\)-th patch and in the neighboring \(j\)-th patch, \(\mathcal {N}(i)\) denotes the set of patches neighboring the \(i\)-th patch, and \(\lambda _O\) is the weight of the overlap constraint. Through adjustment, \(\lambda _O=7\) was found to be suitable for this paper.
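A corresponding sketch of the overlap term is given below, under an assumed data layout: each entry pairs the positions that two neighboring patches assign to their shared vertices.

```python
import numpy as np

def overlap_term(shared_vertex_pairs, lam_O=7.0):
    """Penalise disagreement between neighbouring patches on shared vertices
    (in the spirit of E_O).

    shared_vertex_pairs : iterable of (x_i, x_j) pairs, where x_i and x_j are
    (M, 3) arrays giving the positions that patch i and a neighbouring patch j
    assign to the same overlapping vertices.
    """
    total = 0.0
    for x_i, x_j in shared_vertex_pairs:
        total += np.sum((x_i - x_j) ** 2)
    return lam_O * total
```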
By minimizing the energy function \(E=E_M+E_O\), we obtain the blendshape weights \(\alpha _i^n\) and the poses \(R_i\) and \(T_i\) of each patch. From these parameters, the face model can be reconstructed. However, there are still seams between the patches, so all the patches need to be fused into a full face model. According to the distance between each vertex in a patch and the center of the patch, a fusion weight for each vertex is computed, and the coordinates of the patch vertices in the overlapping portions are adjusted according to these weights to obtain a whole face model.
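One possible realization of this distance-based fusion is sketched below; the inverse-distance weighting is an assumption, since the paper does not specify the exact weighting function.

```python
import numpy as np

def fuse_patches(patch_vertices, patch_indices, patch_centers, n_vertices):
    """Blend overlapping patches into one mesh with distance-based weights.

    patch_vertices[i] : (M_i, 3) reconstructed vertex positions of patch i.
    patch_indices[i]  : (M_i,) global vertex indices of patch i.
    patch_centers[i]  : (3,) center of patch i.
    A vertex shared by several patches receives a weighted average, where the
    weight decays with the distance from the vertex to the patch center.
    """
    accum = np.zeros((n_vertices, 3))
    weights = np.zeros(n_vertices)
    for verts, idx, center in zip(patch_vertices, patch_indices, patch_centers):
        d = np.linalg.norm(verts - center, axis=1)
        w = 1.0 / (d + 1e-6)              # closer to the center -> larger weight
        accum[idx] += w[:, None] * verts
        weights[idx] += w
    return accum / weights[:, None]
```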
6.2 Select Base Expression Iteratively
The selection of base expressions begins with the neutral expression. The neutral expression is compared with all the remaining expressions, Procrustes analysis is used to compute the differences between them, and the expression with the largest difference is selected as the first additional base expression. The first local blendshape model is constructed with the selected base expressions. This local blendshape model is then used to reconstruct all the unselected expressions, yielding the reconstructed models. Procrustes analysis is used again to compare each reconstructed model with the corresponding expression model, and the expression model with the largest error is added to the base expressions. Then the next iteration proceeds. When all the errors are small enough and the rate of error change is small, the current set of expressions constitutes the selected base expressions.
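The Procrustes error used here can be computed directly with the Procrustes routine available in SciPy; whether the authors normalize translation, scale, and rotation in exactly the same way is an assumption of this sketch.

```python
from scipy.spatial import procrustes

def procrustes_distance(mesh_a, mesh_b):
    """Procrustes disparity between two (V, 3) vertex arrays with identical topology.

    scipy.spatial.procrustes removes translation, scale, and rotation and
    returns the residual sum of squared differences (the disparity).
    """
    _, _, disparity = procrustes(mesh_a, mesh_b)
    return disparity

# Example: pick the unselected expression with the largest reconstruction error.
# errors = {i: procrustes_distance(reconstructed[i], targets[i]) for i in unselected}
# next_base = max(errors, key=errors.get)
```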
7 Results
For each iteration, the currently selected base expression and the reconstruction error are recorded. Figure 2 shows the line chart of the number of base expressions and the maximum error in each iteration.
In Fig. 2, the abscissa represents the number of base expressions, and the ordinate represents the maximum Procrustes distance between the reconstructed models and the corresponding unselected models, which also represents the reconstruction error of the blendshape model. As can be seen from the figure, when the number of base expressions reaches 10 (including the neutral expression), the error has been significantly reduced, and the rate of change of the error is 0.0878, indicating that the error changes slowly. These 10 facial expressions are chosen as the set of base expressions and are shown in Fig. 3.
Using the local blendshape model constructed from this set of base expressions, the remaining expressions are reconstructed, and some of the results are shown in Fig. 4. The color of the reconstructed mesh represents the reconstruction error, with red corresponding to 0.5 and blue to 0. Assuming that the distance between the inner corners of the human eyes is 3 cm, and that this distance is 2.2572 units in the model, an error of 0.5 corresponds to about 0.66 cm. As can be seen from the figure, the overall error is relatively small. In some expressions, there is some error at the corners of the mouth and the chin, which may be because the overall number of expressions is small and not enough to cover the rich deformations of the mouth.
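The conversion quoted above follows directly from scaling the model-space error by the assumed inner-corner distance: \(0.5 \times \frac{3\ \text{cm}}{2.2572} \approx 0.66\ \text{cm}\).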
In addition to this set of scanned face models, two public face datasets are also used in this paper. A set of facial blendshape models containing 47 facial expressions (including the neutral face) is selected from FaceWarehouse [4]. Face retopology is applied to all the models to facilitate the segmentation process. With the same procedure as in Fig. 1, the base expressions are selected iteratively. The Procrustes distances are shown in the line chart of Fig. 5.
Four base expressions are selected according to Fig. 5. The remaining expressions are reconstructed with the selected base expressions, and the results are shown in Fig. 6.
The other dataset used is provided by the University of Washington Graphics and Imaging Laboratory [16]; it contains 384 frames of facial meshes with different facial expressions. In order to reduce computing time, this paper selects one mesh out of every 4 frames, so 96 face models are used as the random expressions. The results of base expression selection and reconstruction are shown in Figs. 7 and 8.
These results show that the number and composition of the base expressions differ depending on the random expression sequence used, but the errors of facial expression reconstruction are all relatively small.
8 Conclusion
This paper attempts to select the base expressions automatically. From a set of 66 random facial expression scans, a set of 10 base expressions is selected by an iterative method, and a local blendshape model is constructed with these base expressions to reconstruct the unselected expressions. This paper also tests the base expression selection method on two public face model datasets, and the results are similar to those obtained with the expression sequence acquired in this paper, which indicates that the method is able to choose a set of base expressions from a sequence of facial expressions. The local blendshape model constructed with the base expressions can cover all the facial expressions. According to the reconstruction results, the method proposed in this paper is practical for facial expression animation.
The method of automatically selecting a set of base expressions proposed in this paper can be used in the process of facial expression tracking and reconstruction. Instead of requiring an actor to repeat a specific set of facial expressions many times, it is only necessary to capture a sequence of facial expressions, and the base expressions needed to construct the local blendshape model are then selected automatically. This simplifies the construction of the local blendshape model, and facial expression reconstruction can be performed more conveniently.
This paper attempts to automatically select a set of base expressions and build a local blendshape model. However, in the reconstruction process, the scanned expression model, rather than a real face image, is used as the input. In the future, feature points extracted from real face images will be used for reconstructing facial expressions.
The local blendshape model used in this paper can also be improved. The current method segments the face by UV mapping, but different parts of the face deform differently. In the future, the face segmentation method can be studied further.
References
Black, M.J., Yacoob, Y.: Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In: Proceedings of IEEE International Conference on Computer Vision, pp. 374–381. IEEE (1995)
Blanz, V., Vetter, T., et al.: A morphable model for the synthesis of 3D faces. In: Siggraph 1999, pp. 187–194 (1999)
Brunton, A., Bolkart, T., Wuhrer, S.: Multilinear wavelets: a statistical shape space for human faces. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 297–312. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_20
Cao, C., Weng, Y., Zhou, S., Tong, Y., Zhou, K.: FaceWarehouse: a 3D facial expression database for visual computing. IEEE Trans. Visual Comput. Graph. 20(3), 413–425 (2014)
Decarlo, D., Metaxas, D.: Optical flow constraints on deformable models with applications to face tracking. Int. J. Comput. Vision 38(2), 99–127 (2000)
Flueckiger, B.: Computer-generated characters in avatar and Benjamin button. Digitalitat und Kino 1 (2011). Translation from German by B. Letzler
Friesen, E., Ekman, P.: Facial action coding system: a technique for the measurement of facial movement. Palo Alto 3 (1978)
Gower, J.C.: Generalized procrustes analysis. Psychometrika 40(1), 33–51 (1975)
Ichim, A.E., Kadleček, P., Kavan, L., Pauly, M.: Phace: physics-based face modeling and animation. ACM Trans. Graph. (TOG) 36(4), 153 (2017)
Joshi, P., Tien, W.C., Desbrun, M., Pighin, F.: Learning controls for blend shape based realistic facial animation. In: ACM SIGGRAPH 2006 Courses, p. 17. ACM (2006)
Neumann, T., Varanasi, K., Wenger, S., Wacker, M., Magnor, M., Theobalt, C.: Sparse localized deformation components. ACM Trans. Graph. (TOG) 32(6), 179 (2013)
Sagar, M.: Facial performance capture and expressive translation for King Kong. In: ACM SIGGRAPH 2006 Courses, p. 7. ACM (2006)
Tena, J.R., De la Torre, F., Matthews, I.: Interactive region-based linear 3D face models. ACM Trans. Graph. (TOG) 30, 76 (2011)
Weise, T., Bouaziz, S., Li, H., Pauly, M.: Realtime performance-based facial animation. ACM Trans. Graph. (TOG) 30, 77 (2011)
Wu, C., Bradley, D., Gross, M., Beeler, T.: An anatomically-constrained local deformation model for monocular face capture. ACM Trans. Graph. (TOG) 35(4), 115 (2016)
Zhang, L., Snavely, N., Curless, B., Seitz, S.M.: Spacetime faces: high-resolution capture for modeling and animation. In: ACM Annual Conference on Computer Graphics, pp. 548–558, August 2004
Zhang, Q., Liu, Z., Quo, G., Terzopoulos, D., Shum, H.Y.: Geometry-driven photorealistic facial expression synthesis. IEEE Trans. Visual Comput. Graph. 12(1), 48–60 (2006)
Zhang, Y., Ji, Q., Zhu, Z., Yi, B.: Dynamic facial expression analysis and synthesis with MPEG-4 facial animation parameters. IEEE Trans. Circuits Syst. Video Technol. 18(10), 1383–1396 (2008)
Zollhöfer, M., et al.: State of the art on monocular 3D face reconstruction, tracking, and applications. In: Computer Graphics Forum, vol. 37, pp. 523–550. Wiley Online Library (2018)