1 Introduction
Physics-based simulation is widely used to drive animations of both human bodies and faces. However, in order to obtain the highest levels of visual quality and realism, traditional simulation pipelines based on anatomic first principles resort to costly design choices. Detailed specifications of geometry and materials are essential, including the muscle and tendon shapes and attachment; bone geometry and motion; and constitutive properties of soft tissue and skin. Collision and frictional contact are ubiquitous in faces, and the resolution of such effects is dependent on mesh detail and the sophistication of detection and response algorithms. Finally, recreating intricate local shapes to match performance details from real actors may impose further directability demands on the simulation pipeline. Such feature demands in conjunction with the sheer geometric mesh resolution necessary for detailed facial expressions often place reference-quality face simulation well beyond the cost that would allow for real-time performance.
This article explores an alternative approach to achieving faithful and accurate facial animation at a much reduced execution cost, ideally as close as possible to real-time. Our method (Figure 1) seeks to convincingly approximate a full, high-resolution (HR) 3D simulation by combining a simulator that uses lower resolution and model simplifications with a deep neural network that boosts the resolution, detail, and accuracy of this coarse simulated deformation. Our simulation super-resolution (SR) module is trained on a dataset of coordinated performances crafted using the high- and low-resolution (LR) face simulators, and generalizes to novel performances by boosting the output of the LR simulator to the quality anticipated from its HR counterpart.
We aspire to create the best preconditions for the success of such an SR module by focusing our attention on types of physics-based simulation where it may be possible to craft animations from the LR and HR simulators that have strong semantic correspondence on a frame-by-frame basis. In other words, we look for types of simulation where it might be possible to infer, at some level of abstraction, what the fine-resolution simulation would want to do by observing what the LR simulator was able to do. Face simulation is a good exemplar of this concept; regardless of resolution, the same core drivers of deformation are present in both cases: the action of muscles, and the kinematic state of skeletal bones and other collision objects. This allows us to create a training set by simply dialing in the same control parameters for these driving factors in both the LR and HR models. Hence, we can hope that an SR neural network can learn this semantic correspondence and generalize it to unseen performances.
We highlight that even “semantically corresponding” simulated poses from the respective simulators described above can be quite different. In particular, the LR result can deviate significantly from the mere downsampling of the HR simulation, with discrepancies extending beyond high-frequency details. There are at least three core causes of such discrepancy: First, and most obvious, the reduced mesh resolution of the coarser simulation will be unable to resolve fine geometric features such as localized folds, wrinkles, and bulges that the fine-resolution mesh would capture. Second, the fact that governing physics and topology have to be represented using a coarser discretization may create bulk deviations from the expected behavior of the continuous medium. For example, the action of thin muscles might have to be dissipated over larger elements, reducing the crispness of their action. Fine topological features like the corners of the lips may be under-resolved, especially if at lower resolution we opt for an embedding simulation mesh that does not conform to the model boundary. Non-conforming embedded simulation offers well-conditioned elements and improved convergence that is attractive for real-time performance, but it also leads to a crude first-order approximation of the material volume for elements on the model boundary, leading to artificial stiffness and resistance to bending. The third and final contributor to bulk discrepancy between resolutions could be conscious design choices for the sake of interactive performance; for example, we may choose to perform elaborate contact/collision processing in our reference-quality simulation but forego collision processing altogether in the LR simulator (as in our examples). Thus, our SR module must account for much more than localized high-frequency deformation details and should compensate for all factors (mesh resolution, discretization non-convergence, and physical simplifications) of bulk differences between the two simulation resolutions.
Our objective is to build a framework capable of producing high-accuracy animations without incurring the cost of simulations on HR meshes. We achieve this by training a deep neural network to act as an SR upsampler of simulations performed on a coarser 3D mesh. In practice, this allows for real-time simulations of facial animations that preserve many of the qualities associated with much slower HR simulations.
We simulate a coarse LR face mesh with significantly fewer mesh elements, allowing for real-time simulation, and reconstruct the HR details learned from data. Our upsampling module accounts for high-frequency details, bulk differences between resolutions, and responses to dynamics and external forces, and can also approximate a degree of collision response even if collision handling is omitted from the LR simulator. Our end-to-end animation attains near-realtime performance at 18.46 FPS, combining a 30.06 FPS simulation with 47.82 FPS upsampling. We also emphasize that true real-time end-to-end animation (i.e., 24 or more FPS) is attainable by scaling down to coarser representations at a modest sacrifice of upsampling accuracy (discussed further in Section 5.5.1).
Previous efforts to accelerate physics-based simulations of deforming elastic bodies have focused on building faster numerical methods [Hauth and Etzmuss 2001; Kharevych et al. 2006; Stern and Grinspun 2009; Su et al. 2013], employing alternative constraint-based formulations such as Position Based Dynamics [Müller et al. 2007; Macklin et al. 2016; Bender et al. 2013] and its variants [Bouaziz et al. 2014; Liu et al. 2013; Stam 2009], and other techniques such as adaptively computing higher resolutions only when needed [Bergou et al. 2007]. However, given the real-time performance afforded by regular, embedded models for LR simulations and the fast inferencing time of deep models, our framework can reconstruct HR facial expressions faster and with reduced developmental effort.
We extend the concept of SR to the domain of physics-based simulation, contrasting with most prior applications of this process to purely geometric 3D models without regard to the fact that the data originated from simulation. We summarize our core contributions as follows:
—We demonstrate a neural network-based pipeline that can convincingly approximate an HR facial simulation, using as input a real-time LR approximate simulation and a fast inference step that performs the resolution boost. We show that this pipeline can robustly compensate for discrepancies between the two simulation resolutions extending beyond localized high-frequency deformation details.
—We identify the opportunity to create a training set for our SR module with a high degree of semantic correspondence between LR and HR simulation frames, by giving the two simulators the same anatomical controls of muscle activations and bone kinematics.
—We demonstrate near-realtime performance of the end-to-end pipeline, and a robust ability to generalize to expressions not in the training set. We can even demonstrate this ability on deformations that extend beyond the parametric space used in the simulations that generated the training set (e.g., dynamics, external forces, collisions, or constraints not present in the training data).
3 Method
In this section, we present the specific design choices for our model architecture, aimed at learning to map from an LR volumetric mesh to an HR surface mesh depicting the same facial expression (Figure 2). The input LR volumetric mesh contains 15,872 vertices and is derived from regular body-centered cubic (BCC) lattices for real-time simulation, leveraging their sparse and regular vertex distribution at a compromise in accuracy and visual fidelity (Figure 4(c)). On the other hand, the target HR mesh contains 35,637 vertices and is a triangular mesh conforming to a denser volumetric mesh capable of producing fine deformation details, but at a significantly slower simulation speed (Figure 4(b)). More information about the data generation is outlined in Section 4.
We represent our input and output as a set of 3D displacement vectors from a rest pose stacked in an arbitrary yet consistent order. We divide our pipeline into three modules for (1) feature encoding (FE), (2) coordinate-based upsampling (CU), and (3) surface reconstruction. The hyperparameters are specified in Appendix A.1.
3.1 FE Network
The FE network computes a feature embedding for each input vector. We first concatenate each input displacement vector with a positional encoding \(\in \mathbb {R}^{32}\) using sine and cosine functions as done in Transformers [Vaswani et al. 2017]. Then, the concatenated input \(\in \mathbb {R}^{D_0}\) (in our implementation, \(D_0=35\)) goes through the submodules of the FE network.
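For illustration, the sketch below builds a Transformer-style sine/cosine encoding and concatenates it with the per-vertex displacements to form the 35-dimensional FE input. The quantity being encoded (here the vertex index in the fixed stacking order) and the frequency schedule are assumptions; only the use of Transformer-style sinusoidal encodings is stated above.

```python
import numpy as np

def sinusoidal_encoding(indices, dim=32):
    """Transformer-style sine/cosine encoding (Vaswani et al. 2017).
    `indices` holds each vertex's position in the fixed stacking order;
    encoding the index (rather than, e.g., rest coordinates) is an assumption."""
    pe = np.zeros((len(indices), dim), dtype=np.float32)
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = indices[:, None] * freq[None, :]                      # (N, dim/2)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Concatenate with the 3D displacement vectors: D_0 = 3 + 32 = 35 per vertex.
N = 15872                                                 # LR volumetric mesh vertices
displacements = np.zeros((N, 3), dtype=np.float32)        # placeholder input frame
fe_input = np.concatenate([displacements, sinusoidal_encoding(np.arange(N))], axis=1)
```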
While deformations in the human face are primarily attributed to the activation and motion of the underlying muscles and bones, respectively, they can also result from deformations in other parts of the face (e.g., a wide smile can cause the skin around the eyes to fold); therefore, the localized per-vertex deformation information needs to be shared with other vertices. For this reason, we model the submodules of the FE network with edge convolution layers, dubbed EdgeConv, introduced in DGCNN [Wang et al. 2019], which aggregate neighborhood information in feature space rather than coordinate space by dynamically constructing a \(k\)-NN graph in each layer.
We initialize the first \(k\)-NN graph of the network using geodesic distances based on the edge information of the LR mesh in the rest pose. The subsequent graphs are constructed on the fly in their learned feature spaces. The motivation is to encourage capturing local spatial correlations in the first submodule and potentially global feature correlations in the subsequent submodules (discussed more in Section 5.5.4).
We apply max and average pooling on the intermediate outputs from EdgeConv to extract global features. They are repeated and concatenated with the outputs from EdgeConv and the preceding input encoding feature, which are then passed through a shared fully connected network. We repeat the submodule \(S=2\) times with the intermediate outputs from one module passed as input to the next. The output of the last submodule is concatenated with all of the previous \(S\) intermediate features (including the position-encoded input) to construct the final encoded feature. Specifically, denoting the output of the \(s^{th}\) submodule for the \(i^{th}\) LR mesh vertex as \(\mathbf {z}^L_i \in \mathbb {R}^{D_s}\), the final encoded output has the dimension of \(\mathbf {z}^L_i\in \mathbb {R}^{\sum _{s=0}^SD_s}\). In our implementation, we used \(S=2\) with \(D_1=64\) and \(D_2=128\).
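A minimal PyTorch sketch of one FE submodule is shown below. The EdgeConv operator follows Wang et al. [2019]; the MLP widths, activations, and exact concatenation order are assumptions beyond what is stated above, and the first submodule would receive the precomputed geodesic neighbor indices instead of building its graph in feature space.

```python
import torch
import torch.nn as nn

def knn_graph(feats, k):
    """Indices of the k nearest neighbors of every vertex in feature space."""
    dists = torch.cdist(feats, feats)                          # (N, N)
    return dists.topk(k + 1, largest=False).indices[:, 1:]     # drop self, (N, k)

class EdgeConv(nn.Module):
    """DGCNN-style edge convolution: an MLP applied to (x_i, x_j - x_i) for each
    neighbor j, followed by max-aggregation over the neighbors."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.LeakyReLU(0.2))

    def forward(self, x, nbr_idx):
        center = x.unsqueeze(1).expand(-1, nbr_idx.shape[1], -1)   # (N, k, C)
        nbrs = x[nbr_idx]                                          # (N, k, C)
        edge = torch.cat([center, nbrs - center], dim=-1)          # (N, k, 2C)
        return self.mlp(edge).max(dim=1).values                    # (N, out_dim)

class FESubmodule(nn.Module):
    """One FE submodule: EdgeConv, global max/avg pooling, shared FC. The exact
    MLP structure is an assumption; widths D_1 = 64, D_2 = 128 follow the text."""
    def __init__(self, in_dim, out_dim, k=5):
        super().__init__()
        self.k = k
        self.conv = EdgeConv(in_dim, out_dim)
        # local EdgeConv feature + two pooled global copies + incoming feature
        self.fc = nn.Sequential(nn.Linear(3 * out_dim + in_dim, out_dim), nn.LeakyReLU(0.2))

    def forward(self, x, nbr_idx=None):
        if nbr_idx is None:                     # build the k-NN graph on the fly
            nbr_idx = knn_graph(x, self.k)      # in the learned feature space
        local = self.conv(x, nbr_idx)           # (N, out_dim)
        g_max = local.max(dim=0, keepdim=True).values.expand_as(local)
        g_avg = local.mean(dim=0, keepdim=True).expand_as(local)
        return self.fc(torch.cat([local, g_max, g_avg, x], dim=-1))
```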
3.2 Coordinate-Based Upsampling Network
The upsampling network takes as input a set of encoded per-vertex features from the LR mesh and outputs per-vertex features for the HR surface. To generalize over arbitrary and non-integer upsampling ratios, we propose to formulate the upsampling operation as a continuous local interpolation of the input features.
Formally, let the set of encoded features contributing to the upsampled \(j^{th}\) feature be \(\lbrace \mathbf {z}^L_i\rbrace _{i\in \mathcal {N}_j}\), where \(\mathbf {z}^L_i\) denotes the encoded \(i^{th}\) LR mesh feature and \(\mathcal {N}_j\) denotes the set of local interpolation neighbors for the \(j^{th}\) feature. Then, the upsampling operation can be expressed as
\[ \mathbf {z}^H_j = \sum _{i\in \mathcal {N}_j} w_{ij}\, \mathbf {z}^L_i, \]
where \(w_{ij}\) indicates the contribution of the \(i^{th}\) LR mesh feature to the \(j^{th}\) HR mesh feature. Different modeling options can be explored for defining the local neighbors set \(\mathcal {N}_j\) (e.g., the number and criteria of neighbors) and computing the interpolation weight \(w_{ij}\) (e.g., inverse distance weighting (IDW), RBF), which we describe next.
Neighborhood locality. We define the local neighbors set \(\mathcal {N}_j\) as the indices of the \(k\) nearest LR mesh vertices from the \(j^{th}\) HR mesh vertex in terms of geodesic distance (illustrated by the blue point cloud at the center-bottom of Figure 2). Since the LR and HR vertices do not live on the same surface, we first map the LR vertices \(\lbrace \mathbf {x}^L_i\rbrace\) to the HR vertices (we temporarily denote the resulting mapped vertices as \(\lbrace \mathbf {x}^{\prime L}_i\rbrace\)) using the linear assignment algorithm [Crouse 2016], which finds the optimal one-to-one mapping between the LR and HR vertices by minimizing the total Euclidean mapping distance. Then, we use Dijkstra's algorithm to find the \(k\) nearest mapped vertices \(\lbrace \mathbf {x}^{\prime L}_i\rbrace\) (which correspond directly to the original LR vertices \(\lbrace \mathbf {x}^L_i\rbrace\)) for every HR vertex, using the edges of the HR surface mesh as paths (Figure 3). The local neighbor information is pre-computed offline once. In this work, we use \(k=20\) and additionally explore the effects of different values of \(k\) in Section 5.5.
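The offline neighbor precomputation can be sketched with SciPy as follows. The function and argument names are illustrative, and the dense cost matrix shown here is only practical at reduced vertex counts (a sparse or chunked assignment would be used at full scale); this is a sketch of the idea, not necessarily the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def precompute_neighbors(x_lr, x_hr, hr_edges, k=20):
    """For every HR surface vertex, return the k geodesically nearest LR vertices.
    x_lr: (N, 3) and x_hr: (M, 3) rest-pose positions; hr_edges: (E, 2) HR surface edges."""
    # 1. One-to-one map of LR vertices onto HR vertices (linear assignment,
    #    minimizing total Euclidean distance).
    cost = np.linalg.norm(x_lr[:, None, :] - x_hr[None, :, :], axis=-1)   # (N, M)
    lr_rows, hr_cols = linear_sum_assignment(cost)    # each LR vertex -> one HR vertex

    # 2. Geodesic distances along HR surface edges from the mapped LR vertices.
    i, j = hr_edges[:, 0], hr_edges[:, 1]
    w = np.linalg.norm(x_hr[i] - x_hr[j], axis=-1)
    num_hr = x_hr.shape[0]
    graph = csr_matrix((np.r_[w, w], (np.r_[i, j], np.r_[j, i])), shape=(num_hr, num_hr))
    dist = dijkstra(graph, directed=False, indices=hr_cols)               # (N, M)

    # 3. Keep, per HR vertex, the k nearest LR vertices by geodesic distance.
    order = np.argsort(dist, axis=0)[:k, :]            # (k, M) indices into lr_rows
    return lr_rows[order].T                            # (M, k) original LR indices
```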
Weighting function. The weighting function \(w^{\prime }_{ij}=f_\theta (\mathbf {u}_{ij})\) outputs the interpolation weight \(w^{\prime }_{ij}\in \mathbb {R}\) for the \(i^{th}\) LR mesh vertex neighboring the \(j^{th}\) HR mesh vertex, given some input vector \(\mathbf {u}_{ij}\).
Conceptually, the HR surface mesh can be thought of as a discretization of a continuous and smooth limit surface, i.e., its vertices are approximations of points sampled from that continuous surface. Thus, one could sample an infinite number of continuously varying features from any point on this surface. For this reason, we model \(f_\theta\) as a trainable coordinate-based MLP, employing SIREN [Sitzmann et al. 2020] for its strength in modeling continuous (and differentiable) functions.
As the input to \(f_\theta (\mathbf {u}_{ij})\), we provide the spatial information using a concatenated vector of the coordinates of the HR and LR mesh vertices (\(\mathbf {x}^H_j, \mathbf {x}^L_i \in \mathbb {R}^3\), respectively) and their mutual Euclidean distance, written as
\[ \mathbf {u}_{ij} = \left[ \mathbf {x}^H_j,\; \mathbf {x}^L_i,\; \Vert \mathbf {x}^H_j - \mathbf {x}^L_i \Vert _2 \right] \in \mathbb {R}^7. \]
Then, we normalize the output weight \(w^{\prime }_{ij}\) across the local neighbors \(\mathcal {N}_j\) using the softmax function \(\sigma _{j}\) and obtain the final interpolation weight \(w_{ij}\), expressed as
\[ w_{ij} = \sigma _j\left(w^{\prime }_{ij}\right) = \frac{\exp \left(w^{\prime }_{ij}\right)}{\sum _{i^{\prime }\in \mathcal {N}_j}\exp \left(w^{\prime }_{i^{\prime }j}\right)} \]
for \(j=1, \ldots , M\) and \(i\in \mathcal {N}_j\).
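A minimal sketch of the CU module follows, assuming a small SIREN MLP for \(f_\theta\); the hidden width, depth, and initialization details are not specified above and are chosen for illustration.

```python
import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    """Sine-activated layer (Sitzmann et al. 2020); w0 and the init scheme follow
    the SIREN paper and are not necessarily the authors' settings."""
    def __init__(self, in_dim, out_dim, w0=30.0, first=False):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.w0 = w0
        bound = 1.0 / in_dim if first else (6.0 / in_dim) ** 0.5 / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

class CoordinateUpsampler(nn.Module):
    """Coordinate-based upsampling: a SIREN MLP scores each (HR vertex, LR neighbor)
    pair from u_ij = [x_j^H, x_i^L, ||x_j^H - x_i^L||]; the scores are softmax-normalized
    over the k neighbors and used to blend the encoded LR features."""
    def __init__(self, hidden=128):
        super().__init__()
        self.f_theta = nn.Sequential(SirenLayer(7, hidden, first=True),
                                     SirenLayer(hidden, hidden),
                                     nn.Linear(hidden, 1))

    def forward(self, z_lr, x_lr, x_hr, neighbors):
        # z_lr: (N, C) encoded LR features, x_lr: (N, 3), x_hr: (M, 3),
        # neighbors: (M, k) precomputed LR indices per HR vertex.
        nbr_pos = x_lr[neighbors]                                  # (M, k, 3)
        hr_pos = x_hr[:, None, :].expand_as(nbr_pos)               # (M, k, 3)
        dist = (hr_pos - nbr_pos).norm(dim=-1, keepdim=True)       # (M, k, 1)
        u = torch.cat([hr_pos, nbr_pos, dist], dim=-1)             # (M, k, 7)
        w = torch.softmax(self.f_theta(u), dim=1)                  # (M, k, 1)
        return (w * z_lr[neighbors]).sum(dim=1)                    # (M, C)
```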
3.3 Surface Reconstruction Network
The surface reconstruction network predicts the per-vertex displacements \(\Delta \mathbf {x}_j^H\) from the upsampled features \(\mathbf {z}_j^H\). Since \(\mathbf {z}_j^H\) implicitly inherits the coordinate information \(\mathbf {x}_j^H\) from the upsampling network, and to reconstruct fine deformation details on the HR surface, we also model the surface reconstruction network using SIREN [Sitzmann et al. 2020] to exploit its ability to model high-frequency signals from coordinate information. As the last step, the predicted displacements are added to the HR mesh in its rest pose to reconstruct the final deformed HR surface.
We also note that we use a minimal modeling technique for the surface reconstruction network, not only to reduce the computational overhead of processing a relatively large number of HR mesh vertices (\(\gt\)36k) but also because we assume that all the information needed for fine-detailed surface reconstruction is already encoded in the LR mesh features.
3.4 Loss Function
We minimize the reconstruction loss \(\mathcal {L}_{recon}\) between the predicted and ground-truth per-vertex deformations of the HR surface mesh, denoted \(\Delta \hat{\mathbf {x}}_j^H\) and \(\Delta \mathbf {x}_j^H\), respectively:
\[ \mathcal {L}_{recon} = \frac{1}{M}\sum _{j=1}^{M}\left\Vert \Delta \hat{\mathbf {x}}_j^H - \Delta \mathbf {x}_j^H\right\Vert _2^2. \]
Moreover, we introduce the loss term \(\mathcal {L}_{fn}\) for local smoothness, which encourages the face normals of triangles on the predicted and target HR surface meshes (denoted \(\hat{\mathbf {n}}_k\) and \(\mathbf {n}_k\), respectively) to be equivalent in terms of cosine similarity:
\[ \mathcal {L}_{fn} = \frac{1}{F}\sum _{k=1}^{F}\left(1 - \frac{\hat{\mathbf {n}}_k \cdot \mathbf {n}_k}{\Vert \hat{\mathbf {n}}_k\Vert \, \Vert \mathbf {n}_k\Vert }\right), \]
where \(F\) is the number of triangles on the HR surface mesh.
We also include the regularization term \(\mathcal {L}_{reg}\) to encourage the encoded intermediate features \(\lbrace \lbrace \bar{\mathbf {z}}_{s, i}\rbrace _{i=1}^N\rbrace _{s=1}^S\) (Figure 2) to center around 0, encouraging their prior to follow a multivariate normal distribution [Park et al. 2019; Chabra et al. 2020]:
\[ \mathcal {L}_{reg} = \frac{1}{SN}\sum _{s=1}^{S}\sum _{i=1}^{N}\left\Vert \bar{\mathbf {z}}_{s, i}\right\Vert _2^2. \]
We find that the face normal loss improves the visual fidelity of the reconstructed face and the regularization term helps prevent overfitting.
The final loss function \(\mathcal {L}\) is written as
\[ \mathcal {L} = \mathcal {L}_{recon} + \alpha \, \mathcal {L}_{fn} + \beta \, \mathcal {L}_{reg}, \]
where \(\alpha\) and \(\beta\) are the scalar weight terms whose values are reported in Table 4 of the Appendix.
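A compact sketch of the training loss follows; the exact norms and normalizations of each term are assumptions consistent with the equations above.

```python
import torch
import torch.nn.functional as F

def face_normals(verts, faces):
    """Unnormalized per-triangle normals of a surface mesh."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    return torch.cross(v1 - v0, v2 - v0, dim=-1)

def total_loss(pred_disp, gt_disp, rest_verts, faces, z_bars, alpha, beta):
    # Reconstruction: per-vertex displacement error (mean squared L2 assumed).
    l_recon = ((pred_disp - gt_disp) ** 2).sum(dim=-1).mean()

    # Face-normal term: cosine similarity between predicted and target normals.
    n_pred = face_normals(rest_verts + pred_disp, faces)
    n_gt = face_normals(rest_verts + gt_disp, faces)
    l_fn = (1.0 - F.cosine_similarity(n_pred, n_gt, dim=-1)).mean()

    # Regularization: pull the intermediate encoded features toward zero.
    l_reg = sum((z ** 2).mean() for z in z_bars)

    return l_recon + alpha * l_fn + beta * l_reg
```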
5 Experiments and Evaluation
We report performance metrics in terms of reconstruction speed (Section 5.1) as well as quantitative and qualitative reconstruction errors (Section 5.2). We use the unseen performances in the test set to evaluate the generalization capacity of the trained model. We also evaluate our framework's ability to generalize to unseen dynamics and forces (Section 5.3). Additionally, we present experimental results on using blendshape inputs as a substitute for the LR physics-based simulator in generating the input LR tetrahedral mesh (Section 5.4).
We also conduct ablation experiments. In Section 5.5.1, we explore the tradeoffs in the reconstruction performance of our model when trained using the coarser LR volumetric mesh capable of attaining true real-time end-to-end animation at 28.04 FPS, as compared to our recommended near-realtime setting at 18.46 FPS. In Section 5.5.2, we explore how the submodules of our framework, namely the FE and CU modules, contribute to the reconstruction accuracy, and, in Section 5.5.3, we evaluate the effects of using different interpolation neighbors \(\mathcal {N}_j\) for the CU network and different neighbors \(k\) for the \(k\)-NN graph of the FE network. Then, in Section 5.5.4, we qualitatively evaluate the correlations among different parts of the face learned by the EdgeConv layers in the FE submodules. In addition, we investigate our framework's capability to approximate self-collisions between the upper and lower lips in Section A.3, and we conduct ablation experiments to assess the impact of incorporating higher degrees of wrinkle details on the target surface mesh in Section A.5.
5.1 Near-Realtime High-Resolution Facial Animations
Simulation speed. The average time to simulate the HR conforming simulation with 1,944,549 tetrahedral elements is 6.22s per frame, or a frame rate of 0.16 FPS. Conversely, the average time to simulate the LR embedding mesh with 73,128 tetrahedral elements is 0.033s, corresponding to 30.06 FPS, i.e., 188\(\times\) faster than the HR simulation. These simulation times are recorded on a workstation with a single GeForce RTX 4090 GPU.
SR inference speed. To approximate the HR surface from the LR simulation, we need to infer the HR displacements from our model. The computational overhead of our model inference on a single GeForce RTX 4090 GPU is 0.0209s per frame, corresponding to 47.82 FPS for inference alone.
End-to-end speed and additional performance boosting. Consequently, our simulation SR framework takes a total of 0.054s per frame, or 18.46 FPS, which implies that we achieve a speedup of 115\(\times\) relative to the HR simulation that takes 6.22s per frame (0.16 FPS). We emphasize that there are multiple ways to bridge the gap from near-realtime, e.g., 18.46 FPS, to true real-time, i.e., 24 or more FPS.
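To make the accounting explicit, the end-to-end figures above follow from simply summing the two serial stages (all numbers as reported in this section):
\[ t_{\text{total}} = t_{\text{sim}} + t_{\text{SR}} = \tfrac{1}{30.06} + \tfrac{1}{47.82} \approx 0.0333 + 0.0209 \approx 0.054\,\text{s} \;\Rightarrow\; 18.46\ \text{FPS}, \qquad \text{speedup} \approx \tfrac{6.22}{0.054} \approx 115\times . \]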
First and foremost, using a coarser LR simulation mesh can easily attain true real-time end-to-end animation, given tolerance for a minor tradeoff in the reconstruction quality that our current LR mesh enjoys (we explore this tradeoff in Section 5.5.1). Similarly, we can achieve faster inference by using fewer interpolation neighbors in the CU module, at a cost in overall reconstruction accuracy (see Section 5.5), as we identify the bottleneck of inference to be the neighborhood-information-gathering step in the CU module. Alternatively, while adhering to a strict bar for permissible reconstruction quality, we could pipeline the LR simulation and inference steps on a two-GPU workstation; in such a setup, we could achieve an end-to-end rate of 30.06 FPS after tolerating a single frame of latency. Finally, we could move away from the inference library (we use ONNX Runtime for PyTorch) and implement custom GPU inference kernels that speed up computation.
5.2 Generalization to Unseen Facial Expressions
Using the simulation data generated as described in Section 4, we select the amazement and pain sequences for training (435 frames) and test on the anger and fear sequences (445 frames), ensuring that the test set contains unseen performances. We use the trained model to infer the HR face surface from unseen LR volumetric mesh performances in the test set.
Quantitative evaluation. As we have access to the HR simulations of the test data, we can readily compute the reconstruction error in terms of per-point Euclidean distance between the reconstructed and the target (reference) mesh, whose dimensions are \(179.8\times 257.3\times 164.5\) mm (Figure 4). We also set up other commonly used reconstruction methods to serve as comparisons for our method. We train a \(\beta\)-VAE [Higgins et al. 2016] on the same dataset to serve as a baseline generative neural framework. We implement two commonly used surface reconstruction methods, RBF- and moving least-squares (MLS)-based, as representative global and local methods, respectively, where we employ the Gaussian function for the RBF. Lastly, we compare with the Deep Detail Enhancement (DDE) framework [Zhang et al. 2021] as a representative state-of-the-art SR framework for 3D garment surfaces, which uses normal maps to synthesize plausible wrinkle details on coarse geometry. The formulations for RBF and MLS along with details on the \(\beta\)-VAE and DDE can be found in Sections A.2.1, A.2.2, A.2.3, and A.2.4, respectively.
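For reference, a generic per-frame Gaussian-RBF baseline of the kind described above can be sketched as follows; the SciPy interpolator and kernel width are illustrative stand-ins, and the formulation actually used in the comparison is the one given in Appendix A.2.1.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def rbf_upsample(x_lr_rest, d_lr, x_hr_rest, epsilon=50.0):
    """Gaussian-RBF baseline (sketch): fit an interpolant to the LR per-vertex
    displacements at the rest-pose LR positions, then evaluate it at the HR
    rest-pose positions. `epsilon` (kernel width) is an illustrative value."""
    rbf = RBFInterpolator(x_lr_rest, d_lr, kernel="gaussian", epsilon=epsilon)
    return rbf(x_hr_rest)          # (M, 3) HR displacement estimates

# Per frame: hr_verts = x_hr_rest + rbf_upsample(x_lr_rest, d_lr, x_hr_rest)
```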
Our method outperformed the others and robustly achieved the lowest mean reconstruction error per frame (\(\lt\)0.59mm). We plot the frame-wise mean reconstruction errors of all methods to validate that ours has the least error for every test performance in Figure 6. The evaluation results are summarized in Table 1.
Qualitative evaluation. In Figure 7, we evaluate the visual fidelity of the inferred face mesh by visualizing the reconstructed HR surfaces and heatmaps of the corresponding reconstruction errors for all methods. Our method infers the target facial expression from the input LR volumetric mesh more faithfully than the other methods, allowing us to preserve both the expression and the subtle deformation details that would otherwise be compromised by using the LR simulation.
5.3 Generalization Beyond Parametric Space
We test the ability of our framework to handle deformations that extend beyond the parametric space used in simulations. To evaluate, we simulate the LR simulation mesh with unseen dynamics and external forces, respectively, and qualitatively evaluate the inference accuracy.
5.3.1 Unseen Dynamics.
To evaluate our model’s capability in generalizing to non-quasi-static simulations, we simulate the dynamics of the LR simulation mesh using a semi-implicit backward Euler scheme. This allows us to model ballistic effects that are not present in our training dataset which was simulated under the quasi-static assumption. We further exaggerate the ballistic effects in the simulation by shaking the head back and forth in conjunction with the muscle contractions and jaw motion.
We compare the reconstructed surface inferred from the input mesh with unseen dynamics (middle row of Figure 8(b)) against the reference surface conforming to the quasi-static simulation mesh (middle row of Figure 8(a)). Also, we visualize heatmaps showing the average facial deformations across the training data (top row of Figure 8(c)) and the deformation differences between the predicted and reference surfaces (middle row of Figure 8(c)). We highlight that although the nose shows little or no deformation throughout the training data (hence appearing as a dark blue region in the first heatmap), our model is capable of inferring its deformation from the unseen input (appearing as a lighter blue region in the second heatmap).
Similarly, we visualize the dynamic simulations (with yaw rotation motions of the head) and their reconstructions as a time sequence in Figure 9, along with heatmaps (Figure 9(e)–(f)) showing the deformation differences between the quasi-static/dynamic simulation meshes (Figure 9(a)/(b)), and between the reference conforming quasi-static surface (Figure 9(c)) and the reconstructed surface inferred from the dynamic LR simulation mesh (Figure 9(d)), respectively. Regions with distinctive facial deformations in the inferred faces (Figure 9(e)) are in line with the deformed regions of the input simulation meshes (Figure 9(f)), implying generalization beyond the quasi-static simulation data.
5.3.2 Unseen Forces.
We craft two quasi-static simulation examples with external forces applied to the rest-pose mesh (Figure 8(d)). In the first example (Figure 8(e)), we apply a spring force pulling the side of the lips; this force can also be interpreted as a candy cane pulling on one side of the lips. In the second example (Figure 8(f)), we collide the LR simulation mesh with a sphere, pushing the cheek inward. The LR performances, reconciled by the simulator, are given as input to our framework. The predictions indicate that our framework is able to handle inputs with deformations not seen in the training performances. Moreover, for side-by-side comparisons, we visualize the surface mesh embedded in the LR simulation mesh in Appendix A.4.
5.4 Experiments with Blendshape Inputs
Employing an LR physics-based simulator to produce the input mesh is computationally affordable and absorbs much of the nonlinearity in mapping from the simulation parameters (e.g., muscle activations) to the input mesh. Moreover, incorporating dynamics or external forces into the input mesh is a straightforward application of the physics-based simulator, providing an inherent advantage to its usage. Additionally, our SR framework can produce the intended facial expressions of the HR surface mesh from its semantically corresponding LR input while compensating for topological discrepancies, and can extrapolate to unseen physical effects after being trained only on purely quasi-static simulations.
In this section, we further investigate whether our SR framework can still predict the intended facial expressions from a non-physics-based LR input animated using blendshapes. Specifically, we conduct two experiments employing the blendshape system as a replacement for the LR physics-based simulator. First, we construct volumetric blendshapes of our LR input mesh and generate the training dataset using a blendshape animator instead of the physics simulator. We also go a step further and use the low-dimensional blendshape weights to approximate the HR facial performances by training a decoder-style neural network with around \(628\times\) more trainable parameters than our method. The architecture of this neural network is specified in Appendix A.2.5. We highlight that in both approaches, incorporating dynamics or external forces into the input mesh presents significant challenges compared to the straightforward application of the LR physics-based simulator, which inherently confers an advantage to its use.
In the following subsections, we describe our blendshape system setup used for constructing the volumetric blendshapes and weights for producing facial performances. Then, we provide the evaluation results of the two approaches.
5.4.1 Construction of Low-Resolution Tetrahedral Mesh Blendshapes.
For each blendshape in the blendshape muscle rig constructed in Section 4.2, we set its weight to 1.0 and zero out the remaining weights in order to obtain the kinematic muscle deformation corresponding to solely that blendshape. Then, we run the quasi-static solver to obtain the muscle-driven deformation of the LR tetrahedral mesh, which is then stored as the corresponding LR tetrahedral mesh blendshape.
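A minimal sketch of this construction, together with the standard linear blendshape combination assumed for animating the tetrahedral mesh, is given below; quasi_static_solve is a hypothetical stand-in for the quasi-static solver described above.

```python
import numpy as np

def build_tet_blendshapes(rest_tet_verts, num_blendshapes, quasi_static_solve):
    """Construct LR tetrahedral-mesh blendshapes: activate one rig blendshape at a
    time and store the solver's displacement from the rest pose."""
    deltas = []
    for b in range(num_blendshapes):
        weights = np.zeros(num_blendshapes)
        weights[b] = 1.0
        deformed = quasi_static_solve(weights)            # (N, 3) tet vertex positions
        deltas.append(deformed - rest_tet_verts)
    return np.stack(deltas)                               # (B, N, 3)

def animate_tet_mesh(rest_tet_verts, deltas, weights):
    """Volumetric blendshape animation: a linear combination of the stored
    per-blendshape displacements (jaw transform handled separately)."""
    return rest_tet_verts + np.tensordot(weights, deltas, axes=1)
```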
Volumetric blendshape animation as input. In the first scenario, we use the tetrahedral mesh animated using the blendshape weights constructed in Section 4.2 as input, as a replacement for the LR physics-based simulator. We then re-initialize and train our existing neural network (Section 3) to learn to predict the corresponding HR surface mesh.
Blendshape weights as lower-dimensional input. In the second scenario, we directly use the blendshape weights of the facial performances as inputs, bypassing the use of the simulator. To achieve this, we construct a fully connected neural network with ample capacity (443,840,125 trainable parameters) to learn the mapping from the 38-dimensional blendshape weight vector (comprising 31 blendshape weights and a 7-dimensional vector for the rigid transformation of the jaw: a quaternion and a translation vector) to the HR surface mesh.
5.4.2 Evaluation Results.
We infer the HR surface mesh on the test dataset and plot the frame-wise errors for both methods and for ours utilizing the LR physics-based simulator. We overlay the plots in Figure 6 to highlight the overall differences. As shown in Figure 10 and detailed in Table 2, using the blendshape weights as inputs (in blue) yields the largest reconstruction error compared to the other two methods (in red and green). We explain the larger error by noting that the neural network, despite having 628\(\times\) more learnable parameters than our method, must learn the blendshapes and produce accurate jaw transformations, tasks that the blendshape animator can easily perform.
On the other hand, using the input tetrahedral mesh produced by the blendshape animator (in green) leads to marginally higher error when compared to using the LR physics-based simulator (in red). This finding aligns with our expectations, given that the physics-based simulator can generate an input mesh that more faithfully adheres to the target surface mesh, accommodating the highly nonlinear and intricate nature of the physics-based simulations.
Notably, relying on blendshape weights as inputs often leads to difficulties in generalizing to unseen jaw transformations. This is clearly observed in the close-up side view of the mouth in the third row of Figure 11(d), where the red background highlights the reconstruction difference from the target mesh (Figure 11(a)). Employing the blendshape animator helps to mitigate this issue by generating the LR tetrahedral mesh with accurate jaw motions, as depicted in Figure 11(c). Nevertheless, using the LR physics-based simulator demonstrates superior performance in faithfully predicting the target facial deformations, particularly evident in the close-up front views of the mouth in the second rows of Figure 11(a)–(c).
5.5 Additional Experiments
In this section, we compare the quality of reconstructed faces inferred by our model trained using the original LR simulation mesh with 73k elements (Figure 4(c)) and another one trained using a coarser LR simulation mesh with 34k elements (Figure 4(d)). The coarser mesh attains true real-time end-to-end animation at 28.04 FPS (67.79 FPS simulation and 47.82 FPS inference) on the same hardware setup.
Furthermore, we evaluate the contributions of our FE (Section 3.1) and CU (Section 3.2) modules. We explore the effects of the key parameters in each of the two modules, namely the number of neighbors \(k\) in the FE module and the number of interpolation neighbors in the upsampling module. Additionally, we qualitatively validate the correlations among different parts of the face learned by our FE network.
5.5.1 Comparison with Coarser Low-Resolution Simulation Mesh.
For training, we use the same hyperparameters as for the training on the original LR simulation mesh. Following the same procedure as in Section 5.2, we evaluate the surface reconstruction errors on the unseen facial expressions in the test dataset.
As shown in the error plot of Figure 12, using the coarser LR mesh expectedly attains slightly larger reconstruction errors across most frames compared to the original mesh. We observe increased artifacts in the inferred surfaces, especially around the mouth region, in Figure 12(a)-(b). We highlight that, in practice, true real-time end-to-end animation is easily attainable if we tolerate a minute deterioration of the reconstruction quality, which could become unnoticeable to human eyes under different rendering techniques, such as using a texture map as opposed to plain diffuse rendering. However, we choose to adhere to the current resolution for the robustness of generalization beyond the parametric space used in the simulation (e.g., unseen dynamics and external forces), given that true real-time animation is also attainable, in practice, if we tolerate one frame of latency.
5.5.2 Contributions of FE and Coordinate-Based Upsampling Modules.
We evaluate the contributions of the FE and CU modules by excluding them (one at a time). We compare the predictions on test performances.
Specifically, we train three different models using the same dataset and hyperparameters for the same number of epochs (1000). The first model includes both the FE and CU modules (our proposed framework). The second model excludes the FE module and feeds the output of the positional encoding directly to the CU module. In the third model, we reintroduce the FE module and exclude the CU module. To replace the CU module, we opt for a different, standard upsampling method (with a fixed upsampling ratio) that uses the transposed convolution operation widely adopted in upsampling images for SR [Yang et al. 2019]. To mimic the transposed convolution operator, we find the 20 nearest LR mesh vertices from each HR mesh vertex in terms of Euclidean distance (the same number as in our neighbor interpolation in the CU module). We then compute weighted sums of the 20 LR mesh features for every HR mesh vertex. For a fair comparison, we learn these weights, similar to the weights learned in our CU module.
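A sketch of this ablation baseline is shown below; assigning one learned weight per (HR vertex, neighbor) pair is an assumption, as the exact parameterization of the learned weights is not detailed above.

```python
import torch
import torch.nn as nn

class FixedRatioUpsampler(nn.Module):
    """Ablation baseline mimicking a transposed convolution: each HR vertex blends
    its 20 Euclidean-nearest LR features with directly learned weights, with no
    coordinate-based weighting network."""
    def __init__(self, num_hr_vertices, k=20):
        super().__init__()
        self.weights = nn.Parameter(torch.full((num_hr_vertices, k, 1), 1.0 / k))

    def forward(self, z_lr, neighbors):
        # z_lr: (N, C) LR features, neighbors: (M, k) precomputed LR indices.
        return (self.weights * z_lr[neighbors]).sum(dim=1)   # (M, C)
```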
From the three trained models, we compare the reconstruction error on the test dataset. As summarized in Table 3, our model, which includes both the FE and CU modules, outperforms the other two variants, which were trained in the absence of the FE and CU modules, respectively.
We qualitatively validate the visual fidelity of the performances reconstructed by the three models in Figure 13. We observe that in the absence of the FE module, the model fails to accurately reconstruct the parts of the face with larger deformations (like the mouth area in Figure 13(c)), and that replacing the CU module leads to reconstruction artifacts and discontinuities in the HR surface (Figure 13(d)).
5.5.3 Effects of Different Locality Parameters.
Interpolation neighbors in CU. We explore the effects of using a different number of interpolation neighbors for defining the local neighbors set \(\mathcal {N}_j\) in Section 3.2. For this experiment, we train our model using the same training dataset and hyperparameters for 500 epochs but vary the number of interpolation neighbors as 1, 3, 5, 10, and 20. We fix \(k=5\) for the \(k\)-NN graph in the FE module for these experiments. We plot the mean surface reconstruction error on the test dataset to study the effect of varying the number of interpolation neighbors on reconstruction accuracy.
As shown in the plot in Figure 14(a), we observe that using a higher number of interpolation neighbors achieves a lower mean reconstruction error on unseen performances (shown in red). However, the tradeoff is a linearly increasing time consumption for each inference (shown in blue).
Number of neighbors \(k\) in FE. We conduct another experiment to study the effect of varying the number of neighbors \(k\) used in constructing the \(k\)-NN graph in the EdgeConv layer of the FE module. We train our model for 500 epochs while varying \(k\) from 1 to 10 in each experiment and evaluate the mean surface reconstruction error on the test dataset. We fix the number of interpolation neighbors in the CU module to 10 for these experiments. As shown in the plot in Figure 14(b), we find that using \(k=4\) or \(5\) gives the minimum reconstruction error (shown in red) without a large tradeoff in inference time (shown in blue).
5.5.4 Correlations Learned in FE Module.
We visualize heatmaps of the feature similarities learned by the EdgeConv layer in the second FE network submodule. This can reveal the correlations among different parts of the face learned from data. As outlined in Section 3.1, we encourage the first submodule to learn local spatial correlations by constructing its \(k\)-NN graph based on geodesic distances, and the second submodule to learn (potentially global) feature correlations in its learned feature space.
Figure 15 shows the learned similarities for four selected frames, where the red point in each image denotes a queried point, and similar colors and shades represent higher similarities. We observe that the FE module has captured correlations among different parts of the face, such as the right part of the chin being correlated with the left part of the mouth (third image from the left).
6 Conclusion
We have proposed a data-driven deep neural network framework which, using as input an LR simulation of facial expression, enhances its detail and visual fidelity to levels commensurate with those of a much more expensive HR simulation. The combined performance of the low-resolution simulator and the upsampling module is efficient enough to yield 18.46 FPS end-to-end, with the potential for true real-time 28.04 FPS end-to-end at a modest sacrifice of accuracy. We demonstrate that our SR framework is able to convincingly bridge the visual quality gap between the real-time LR and offline HR simulations, even in instances where the two simulations have substantial differences due to discretization, modeling, and resolution disparities. Our SR network successfully upsamples even deformations that go beyond the parametric poses exemplified in the training set (triggered by muscle action and bone motion), to include dynamics, external forces, and collision objects and constraints. Finally, we observe that our framework can approximate a degree of collision response purely via generalization from the training data. Our code is available at https://github.com/hjoonpark/3d-sim-super-res.git
6.1 Limitations and Future Work
We have adopted a number of design choices that may consciously limit the scope of our work. We have chosen the output of our upsampling module to be the surface of the face model, rather than a description that includes the interior of the HR target simulation mesh. The same output is also purely geometric, as opposed to physical quantities such as volumetric strain tensor fields or action potentials (e.g., in the style of Srinivasan et al. [2021] and Yang et al. [2022]), which might have been useful for an extra simulation pass at the HR level to incorporate additional effects. Both choices are made to reduce the dependency of our system on any internal traits of the simulation engine used to produce the HR training data, requiring only surfaces at HR for training (these could even have originated from performance acquisition, as opposed to simulation), and to stay as close to the real-time regime as possible.
Our SR approach strives to recreate physical behaviors as exemplified in the HR component of the training set; however, the degree to which such physical traits are conveyed is limited by how large and representative our training set is, and is not enforced via explicit physics-based simulation at the HR output. For example, traits such as volume preservation, strain limits, or contact/collision behavior are only approximated to the degree that the network can learn them from data, while a full-fledged simulator could provide stronger guarantees. Specifically, if the LR simulation does not employ collision handling and the HR simulator used for training does, it would be very challenging to resolve behaviors where the exact result of contact resolution is history dependent and admits multiple solutions. A typical example would be a facial motion that brings the lips into deep collision at LR; at HR, any of several results is possible, including the lips being pressed together or sliding under one another in either order, and the network does not have the benefit of history dependence or friction to naturally lead to one of these scenarios.
In future work, we wish to further investigate possibilities for boosting our method’s efficacy of collision handling, by tuning the training loss to more directly emphasize collision avoidance (rather than just matching the target provided), and possibly augment the LR simulation with cheap approximations to collisions (e.g., using proxy geometry and repulsive forces to create a “soft” collision response) to help disambiguate collision scenarios where multiple solutions are admissible. We would also investigate adding a temporal element to our prediction; this could be beneficial both as a way to enhance temporal consistency of our animation, and perhaps as a pathway to adding dynamic effects to the resulting animation (even if the LR simulation was overdamped or quasistatic). Lastly, our method is trained on the facial model of a single identity, overfitting on a specific face mesh. Extending our proposed simulation SR framework to accommodate multiple identities is also an interesting direction for future work.