1 Introduction
Physics-based simulation is widely used to drive animations of both human bodies and faces. However, in order to obtain the highest levels of visual quality and realism, traditional simulation pipelines based on anatomic first principles resort to costly design choices. Detailed specifications of geometry and materials are essential, including the muscle and tendon shapes and attachment; bone geometry and motion; and constitutive properties of soft tissue and skin. Collision and frictional contact are ubiquitous in faces, and the resolution of such effects is dependent on mesh detail and the sophistication of detection and response algorithms. Finally, recreating intricate local shapes to match performance details from real actors may impose further directability demands on the simulation pipeline. Such feature demands in conjunction with the sheer geometric mesh resolution necessary for detailed facial expressions often place reference-quality face simulation well beyond the cost that would allow for real-time performance.
This article explores an alternative approach to achieving faithful and accurate facial animation at a much reduced execution cost, ideally as close as possible to real-time. Our method (Figure 1) seeks to convincingly approximate a full, high-resolution (HR) 3D simulation by combining a simulator that uses lower resolution and model simplifications with a deep neural network that boosts the resolution, detail, and accuracy of this coarse simulated deformation. Our simulation super-resolution (SR) module is trained on a dataset of coordinated performances crafted using the high- and low-resolution (LR) face simulators, and generalizes to novel performances by boosting the output of the LR simulator to the quality anticipated from its HR counterpart.
We aspire to create the best preconditions for the success of such an SR module by focusing our attention on types of physics-based simulation where it may be possible to craft animations from the LR and HR simulators that have strong semantic correspondence on a frame-by-frame basis. In other words, we look for types of simulation where it might be possible to infer, at some level of abstraction, what the fine-resolution simulation would want to do by observing what the LR simulator was able to do. Face simulation is a good exemplar of this concept; regardless of resolution, the same core drivers of deformation are present in both cases: the action of muscles, and the kinematic state of skeletal bones and other collision objects. This allows us to create a training set by simply dialing in the same control parameters for these driving factors in both the LR and HR models. Hence, we can hope that an SR neural network can learn this semantic correspondence and generalize it to unseen performances.
We highlight that even “semantically corresponding” simulated poses from the respective simulators described above can be quite different. In particular, the LR result can deviate significantly from the mere downsampling of the HR simulation, with discrepancies extending beyond high-frequency details. There are at least three core causes of such discrepancy: First, and most obvious, the reduced mesh resolution of the coarser simulation will be unable to resolve fine geometric features such as localized folds, wrinkles, and bulges that the fine-resolution mesh would capture. Second, the fact that governing physics and topology have to be represented using a coarser discretization may create bulk deviations from the expected behavior of the continuous medium. For example, the action of thin muscles might have to be dissipated over larger elements, reducing the crispness of their action. Fine topological features like the corners of the lips may be under-resolved, especially if at lower resolution we opt for an embedding simulation mesh that does not conform to the model boundary. Non-conforming embedded simulation offers well-conditioned elements and improved convergence that is attractive for real-time performance, but it also leads to a crude first-order approximation of the material volume for elements on the model boundary, leading to artificial stiffness and resistance to bending. The third and final contributor to bulk discrepancy between resolutions could be conscious design choices for the sake of interactive performance; for example, we may choose to perform elaborate contact/collision processing in our reference-quality simulation but forego collision processing altogether in the LR simulator (as in our examples). Thus, our SR module must account for much more than localized high-frequency deformation details and should compensate for all factors (mesh resolution, discretization non-convergence, and physical simplifications) of bulk differences between the two simulation resolutions.
Our objective is to build a framework capable of producing high-accuracy animations without incurring the cost of simulations on HR meshes. We achieve this by training a deep neural network to act as an SR upsampler of simulations performed on a coarser 3D mesh. In practice, this allows for real-time simulations of facial animations that preserve many of the qualities associated with much slower HR simulations.
We simulate a coarse LR face mesh with significantly fewer mesh elements, allowing for real-time simulation, and reconstruct the HR details learned from data. Our upsampling module accounts for high-frequency details, bulk differences between resolutions, and responses to dynamics and external forces, and can also approximate a degree of collision response even if collision handling is omitted from the LR simulator. Our end-to-end animation attains near-realtime performance at 18.46 FPS, combining a 30.06 FPS simulation with 47.82 FPS upsampling. We also emphasize that true real-time end-to-end animation (i.e., 24 or more FPS) is attainable by scaling down to coarser representations at a modest sacrifice of upsampling accuracy (discussed further in Section 5.5.1).
Previous efforts to accelerate physics-based simulations of deforming elastic bodies have focused on building faster numerical methods [Hauth and Etzmuss 2001; Kharevych et al. 2006; Stern and Grinspun 2009; Su et al. 2013], employing alternative constraint-based formulations such as Position Based Dynamics [Müller et al. 2007; Macklin et al. 2016; Bender et al. 2013] and its variants [Bouaziz et al. 2014; Liu et al. 2013; Stam 2009], and other techniques such as adaptively computing higher resolutions only when needed [Bergou et al. 2007]. However, given the real-time performance afforded by regular, embedded models for LR simulations and the fast inferencing time of deep models, our framework can reconstruct HR facial expressions faster and with reduced developmental effort.
We extend the concept of SR to the domain of physics-based simulation, contrasting with most prior applications of this process to purely geometric 3D models without regard to the fact that the data originated from simulation. We summarize our core contributions as follows:
—We demonstrate a neural network-based pipeline that can convincingly approximate an HR facial simulation, using as input a real-time LR approximate simulation and a fast inference step that performs the resolution boost. We show that this pipeline can robustly compensate for discrepancies between the two simulation resolutions extending beyond localized high-frequency deformation details.
—We identify the opportunity to create a training set for our SR module with a high degree of semantic correspondence between LR and HR simulation frames, by giving the two simulators the same anatomical controls of muscle activations and bone kinematics.
—We demonstrate near-realtime performance of the end-to-end pipeline, and a robust ability to generalize to expressions not in the training set. We can even demonstrate this ability on deformations that extend beyond the parametric space used in the simulations that generated the training set (e.g., dynamics, external forces, collisions, or constraints not present in the training data).
3 Method
In this section, we present the specific design choices for our model architecture, aimed at learning to map from an LR volumetric mesh to an HR surface mesh depicting the same facial expression (Figure 2). The input LR volumetric mesh contains 15,872 vertices and is derived from regular body-centered cubic (BCC) lattices for real-time simulation, leveraging their sparse and regular vertex distribution at a compromise in accuracy and visual fidelity (Figure 4(c)). On the other hand, the target HR mesh contains 35,637 vertices and is a triangular mesh conforming to a denser volumetric mesh capable of producing fine deformation details, but at a significantly slower simulation speed (Figure 4(b)). More information about the data generation is outlined in Section 4.
We represent our input and output as a set of 3D displacement vectors from a rest pose stacked in an arbitrary yet consistent order. We divide our pipeline into three modules for (1) feature encoding (FE), (2) coordinate-based upsampling (CU), and (3) surface reconstruction. The hyperparameters are specified in Appendix A.1.
3.1 FE Network
The FE network computes a feature embedding for each input vector. We first concatenate each input displacement vector with a positional encoding \(\in \mathbb {R}^{32}\) using sine and cosine functions as done in Transformers [Vaswani et al. 2017]. Then, the concatenated input \(\in \mathbb {R}^{D_0}\) (in our implementation, \(D_0=35\)) goes through the submodules of the FE network.
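For illustration, the sketch below builds a Transformer-style sine/cosine encoding and concatenates it with the per-vertex displacements to form the 35-dimensional FE input. The quantity being encoded (here the vertex index in the fixed stacking order) and the frequency schedule are assumptions; only the use of Transformer-style sinusoidal encodings is stated above.

```python
import numpy as np

def sinusoidal_encoding(indices, dim=32):
    """Transformer-style sine/cosine encoding (Vaswani et al. 2017).
    `indices` holds each vertex's position in the fixed stacking order;
    encoding the index (rather than, e.g., rest coordinates) is an assumption."""
    pe = np.zeros((len(indices), dim), dtype=np.float32)
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = indices[:, None] * freq[None, :]                      # (N, dim/2)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Concatenate with the 3D displacement vectors: D_0 = 3 + 32 = 35 per vertex.
N = 15872                                                 # LR volumetric mesh vertices
displacements = np.zeros((N, 3), dtype=np.float32)        # placeholder input frame
fe_input = np.concatenate([displacements, sinusoidal_encoding(np.arange(N))], axis=1)
```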
While deformations in the human face are primarily attributed to the activation and motion of the underlying muscles and bones, respectively, they can also result from deformations in other parts of the face (e.g., a wide smile can cause the skin around the eyes to fold); therefore, the localized per-vertex deformation information needs to be shared with other vertices. For this reason, we model the submodules of the FE network with edge convolution layers, dubbed EdgeConv, introduced in DGCNN [Wang et al. 2019], which aggregate neighborhood information in feature space rather than coordinate space by dynamically constructing a \(k\)-NN graph in each layer.
We initialize the first \(k\)-NN graph of the network using geodesic distances based on the edge information of the LR mesh in the rest pose. The subsequent graphs are constructed on the fly in their learned feature spaces. The motivation is to encourage capturing local spatial correlations in the first submodule and potentially global feature correlations in the subsequent submodules (discussed more in Section 5.5.4).
We apply max and average pooling on the intermediate outputs from EdgeConv to extract global features. They are repeated and concatenated with the outputs from EdgeConv and the preceding input encoding feature, which are then passed through a shared fully connected network. We repeat the submodule \(S=2\) times with the intermediate outputs from one module passed as input to the next. The output of the last submodule is concatenated with all of the previous \(S\) intermediate features (including the position-encoded input) to construct the final encoded feature. Specifically, denoting the output of the \(s^{th}\) submodule for the \(i^{th}\) LR mesh vertex as \(\mathbf {z}^L_i \in \mathbb {R}^{D_s}\), the final encoded output has the dimension of \(\mathbf {z}^L_i\in \mathbb {R}^{\sum _{s=0}^SD_s}\). In our implementation, we used \(S=2\) with \(D_1=64\) and \(D_2=128\).
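A minimal PyTorch sketch of one FE submodule is shown below. The EdgeConv operator follows Wang et al. [2019]; the MLP widths, activations, and exact concatenation order are assumptions beyond what is stated above, and the first submodule would receive the precomputed geodesic neighbor indices instead of building its graph in feature space.

```python
import torch
import torch.nn as nn

def knn_graph(feats, k):
    """Indices of the k nearest neighbors of every vertex in feature space."""
    dists = torch.cdist(feats, feats)                          # (N, N)
    return dists.topk(k + 1, largest=False).indices[:, 1:]     # drop self, (N, k)

class EdgeConv(nn.Module):
    """DGCNN-style edge convolution: an MLP applied to (x_i, x_j - x_i) for each
    neighbor j, followed by max-aggregation over the neighbors."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.LeakyReLU(0.2))

    def forward(self, x, nbr_idx):
        center = x.unsqueeze(1).expand(-1, nbr_idx.shape[1], -1)   # (N, k, C)
        nbrs = x[nbr_idx]                                          # (N, k, C)
        edge = torch.cat([center, nbrs - center], dim=-1)          # (N, k, 2C)
        return self.mlp(edge).max(dim=1).values                    # (N, out_dim)

class FESubmodule(nn.Module):
    """One FE submodule: EdgeConv, global max/avg pooling, shared FC. The exact
    MLP structure is an assumption; widths D_1 = 64, D_2 = 128 follow the text."""
    def __init__(self, in_dim, out_dim, k=5):
        super().__init__()
        self.k = k
        self.conv = EdgeConv(in_dim, out_dim)
        # local EdgeConv feature + two pooled global copies + incoming feature
        self.fc = nn.Sequential(nn.Linear(3 * out_dim + in_dim, out_dim), nn.LeakyReLU(0.2))

    def forward(self, x, nbr_idx=None):
        if nbr_idx is None:                     # build the k-NN graph on the fly
            nbr_idx = knn_graph(x, self.k)      # in the learned feature space
        local = self.conv(x, nbr_idx)           # (N, out_dim)
        g_max = local.max(dim=0, keepdim=True).values.expand_as(local)
        g_avg = local.mean(dim=0, keepdim=True).expand_as(local)
        return self.fc(torch.cat([local, g_max, g_avg, x], dim=-1))
```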
3.2 Coordinate-Based Upsampling Network
The upsampling network takes as input a set of encoded per-vertex features from the LR mesh and outputs per-vertex features for the HR surface. To generalize over arbitrary and non-integer upsampling ratios, we propose to formulate the upsampling operation as a continuous local interpolation of the input features.
Formally, let the set of encoded features contributing to the upsampled \(j^{th}\) feature be \(\lbrace \mathbf {z}^L_i\rbrace _{i\in \mathcal {N}_j}\), where \(\mathbf {z}^L_i\) denotes the encoded \(i^{th}\) LR mesh feature and \(\mathcal {N}_j\) denotes the set of local interpolation neighbors for the \(j^{th}\) feature. Then, the upsampling operation can be expressed as
\[ \mathbf {z}^H_j = \sum _{i\in \mathcal {N}_j} w_{ij}\, \mathbf {z}^L_i, \]
where \(w_{ij}\) indicates the contribution of the \(i^{th}\) LR mesh feature to the \(j^{th}\) HR mesh feature. Different modeling options can be explored for defining the local neighbors set \(\mathcal {N}_j\) (e.g., the number and criteria of neighbors) and computing the interpolation weight \(w_{ij}\) (e.g., inverse distance weighting (IDW), RBF), which we describe next.
Neighborhood locality. We define the local neighbors set \(\mathcal {N}_j\) as the indices of the \(k\) nearest LR mesh vertices from the \(j^{th}\) HR mesh vertex in terms of geodesic distance (illustrated by the blue point cloud at the center-bottom of Figure 2). Since the LR and HR vertices do not live on the same surface, we first map the LR vertices \(\lbrace \mathbf {x}^L_i\rbrace\) to the HR vertices (we temporarily denote the resulting mapped vertices as \(\lbrace \mathbf {x}^{\prime L}_i\rbrace\)) using the linear assignment algorithm [Crouse 2016], which finds the optimal one-to-one mapping between the LR and HR vertices by minimizing the total Euclidean mapping distance. Then, we use Dijkstra's algorithm to find the \(k\) nearest mapped vertices \(\lbrace \mathbf {x}^{\prime L}_i\rbrace\) (which correspond directly to the original LR vertices \(\lbrace \mathbf {x}^L_i\rbrace\)) for every HR vertex, using the edges of the HR surface mesh as paths (Figure 3). The local neighbor information is pre-computed offline once. In this work, we use \(k=20\) and additionally explore the effects of different values of \(k\) in Section 5.5.
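The offline neighbor precomputation can be sketched with SciPy as follows. The function and argument names are illustrative, and the dense cost matrix shown here is only practical at reduced vertex counts (a sparse or chunked assignment would be used at full scale); this is a sketch of the idea, not necessarily the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def precompute_neighbors(x_lr, x_hr, hr_edges, k=20):
    """For every HR surface vertex, return the k geodesically nearest LR vertices.
    x_lr: (N, 3) and x_hr: (M, 3) rest-pose positions; hr_edges: (E, 2) HR surface edges."""
    # 1. One-to-one map of LR vertices onto HR vertices (linear assignment,
    #    minimizing total Euclidean distance).
    cost = np.linalg.norm(x_lr[:, None, :] - x_hr[None, :, :], axis=-1)   # (N, M)
    lr_rows, hr_cols = linear_sum_assignment(cost)    # each LR vertex -> one HR vertex

    # 2. Geodesic distances along HR surface edges from the mapped LR vertices.
    i, j = hr_edges[:, 0], hr_edges[:, 1]
    w = np.linalg.norm(x_hr[i] - x_hr[j], axis=-1)
    num_hr = x_hr.shape[0]
    graph = csr_matrix((np.r_[w, w], (np.r_[i, j], np.r_[j, i])), shape=(num_hr, num_hr))
    dist = dijkstra(graph, directed=False, indices=hr_cols)               # (N, M)

    # 3. Keep, per HR vertex, the k nearest LR vertices by geodesic distance.
    order = np.argsort(dist, axis=0)[:k, :]            # (k, M) indices into lr_rows
    return lr_rows[order].T                            # (M, k) original LR indices
```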
Weighting function. The weighting function \(w^{\prime }_{ij}=f_\theta (\mathbf {u}_{ij})\) outputs the interpolation weight \(w^{\prime }_{ij}\in \mathbb {R}\) for the \(i^{th}\) LR mesh vertex neighboring the \(j^{th}\) HR mesh vertex, given some input vector \(\mathbf {u}_{ij}\).
Conceptually, the HR surface mesh can be thought of as a discretization of a continuous and smooth limit surface, i.e., its vertices are approximations of points sampled from that continuous surface. Thus, one could sample an infinite number of continuously varying features from any point on this surface. For this reason, we model \(f_\theta\) as a trainable coordinate-based MLP, employing SIREN [Sitzmann et al. 2020] for its strength in modeling continuous (and differentiable) functions.
As the input to \(f_\theta (\mathbf {u}_{ij})\), we provide the spatial information using a concatenated vector of the coordinates of the HR and LR mesh vertices (\(\mathbf {x}^H_j, \mathbf {x}^L_i \in \mathbb {R}^3\), respectively) and their mutual Euclidean distance, written as
\[ \mathbf {u}_{ij} = \left[ \mathbf {x}^H_j,\; \mathbf {x}^L_i,\; \Vert \mathbf {x}^H_j - \mathbf {x}^L_i \Vert _2 \right] \in \mathbb {R}^7. \]
Then, we normalize the output weight \(w^{\prime }_{ij}\) across the local neighbors \(\mathcal {N}_j\) using the softmax function \(\sigma _{j}\) and obtain the final interpolation weight \(w_{ij}\), expressed as
\[ w_{ij} = \sigma _j\left(w^{\prime }_{ij}\right) = \frac{\exp \left(w^{\prime }_{ij}\right)}{\sum _{i^{\prime }\in \mathcal {N}_j}\exp \left(w^{\prime }_{i^{\prime }j}\right)} \]
for \(j=1, \ldots , M\) and \(i\in \mathcal {N}_j\).
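A minimal sketch of the CU module follows, assuming a small SIREN MLP for \(f_\theta\); the hidden width, depth, and initialization details are not specified above and are chosen for illustration.

```python
import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    """Sine-activated layer (Sitzmann et al. 2020); w0 and the init scheme follow
    the SIREN paper and are not necessarily the authors' settings."""
    def __init__(self, in_dim, out_dim, w0=30.0, first=False):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.w0 = w0
        bound = 1.0 / in_dim if first else (6.0 / in_dim) ** 0.5 / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

class CoordinateUpsampler(nn.Module):
    """Coordinate-based upsampling: a SIREN MLP scores each (HR vertex, LR neighbor)
    pair from u_ij = [x_j^H, x_i^L, ||x_j^H - x_i^L||]; the scores are softmax-normalized
    over the k neighbors and used to blend the encoded LR features."""
    def __init__(self, hidden=128):
        super().__init__()
        self.f_theta = nn.Sequential(SirenLayer(7, hidden, first=True),
                                     SirenLayer(hidden, hidden),
                                     nn.Linear(hidden, 1))

    def forward(self, z_lr, x_lr, x_hr, neighbors):
        # z_lr: (N, C) encoded LR features, x_lr: (N, 3), x_hr: (M, 3),
        # neighbors: (M, k) precomputed LR indices per HR vertex.
        nbr_pos = x_lr[neighbors]                                  # (M, k, 3)
        hr_pos = x_hr[:, None, :].expand_as(nbr_pos)               # (M, k, 3)
        dist = (hr_pos - nbr_pos).norm(dim=-1, keepdim=True)       # (M, k, 1)
        u = torch.cat([hr_pos, nbr_pos, dist], dim=-1)             # (M, k, 7)
        w = torch.softmax(self.f_theta(u), dim=1)                  # (M, k, 1)
        return (w * z_lr[neighbors]).sum(dim=1)                    # (M, C)
```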
3.3 Surface Reconstruction Network
The surface reconstruction network predicts the per-vertex displacements \(\Delta \mathbf {x}_j^H\) from the upsampled features \(\mathbf {z}_j^H\). Since \(\mathbf {z}_j^H\) implicitly inherits the coordinate information \(\mathbf {x}_j^H\) from the upsampling network, and to reconstruct fine deformation details on the HR surface, we also model the surface reconstruction network using SIREN [Sitzmann et al. 2020] to exploit its ability to model high-frequency signals from coordinate information. As the last step, the predicted displacements are added to the HR mesh in its rest pose to reconstruct the final deformed HR surface.
We also note that we use a minimal modeling technique for the surface reconstruction network, not only to reduce the computational overhead of processing a relatively large number of HR mesh vertices (\(\gt\)36k) but also because we assume that all the information needed for fine-detailed surface reconstruction is already encoded in the LR mesh features.
3.4 Loss Function
We minimize the reconstruction loss \(\mathcal {L}_{recon}\) between the predicted and ground-truth per-vertex deformations of the HR surface mesh, denoted \(\Delta \hat{\mathbf {x}}_j^H\) and \(\Delta \mathbf {x}_j^H\), respectively:
\[ \mathcal {L}_{recon} = \frac{1}{M}\sum _{j=1}^{M}\left\Vert \Delta \hat{\mathbf {x}}_j^H - \Delta \mathbf {x}_j^H\right\Vert _2^2. \]
Moreover, we introduce the loss term \(\mathcal {L}_{fn}\) for local smoothness, which encourages the face normals of triangles on the predicted and target HR surface meshes (denoted \(\hat{\mathbf {n}}_k\) and \(\mathbf {n}_k\), respectively) to be equivalent in terms of cosine similarity:
\[ \mathcal {L}_{fn} = \frac{1}{F}\sum _{k=1}^{F}\left(1 - \frac{\hat{\mathbf {n}}_k \cdot \mathbf {n}_k}{\Vert \hat{\mathbf {n}}_k\Vert \, \Vert \mathbf {n}_k\Vert }\right), \]
where \(F\) is the number of triangles on the HR surface mesh.
We also include the regularization term \(\mathcal {L}_{reg}\) to encourage the encoded intermediate features \(\lbrace \lbrace \bar{\mathbf {z}}_{s, i}\rbrace _{i=1}^N\rbrace _{s=1}^S\) (Figure 2) to center around 0, encouraging their prior to follow a multivariate normal distribution [Park et al. 2019; Chabra et al. 2020]:
\[ \mathcal {L}_{reg} = \frac{1}{SN}\sum _{s=1}^{S}\sum _{i=1}^{N}\left\Vert \bar{\mathbf {z}}_{s, i}\right\Vert _2^2. \]
We find that the face normal loss improves the visual fidelity of the reconstructed face and the regularization term helps prevent overfitting.
The final loss function \(\mathcal {L}\) is written as
\[ \mathcal {L} = \mathcal {L}_{recon} + \alpha \, \mathcal {L}_{fn} + \beta \, \mathcal {L}_{reg}, \]
where \(\alpha\) and \(\beta\) are the scalar weight terms whose values are reported in Table 4 of the Appendix.
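A compact sketch of the training loss follows; the exact norms and normalizations of each term are assumptions consistent with the equations above.

```python
import torch
import torch.nn.functional as F

def face_normals(verts, faces):
    """Unnormalized per-triangle normals of a surface mesh."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    return torch.cross(v1 - v0, v2 - v0, dim=-1)

def total_loss(pred_disp, gt_disp, rest_verts, faces, z_bars, alpha, beta):
    # Reconstruction: per-vertex displacement error (mean squared L2 assumed).
    l_recon = ((pred_disp - gt_disp) ** 2).sum(dim=-1).mean()

    # Face-normal term: cosine similarity between predicted and target normals.
    n_pred = face_normals(rest_verts + pred_disp, faces)
    n_gt = face_normals(rest_verts + gt_disp, faces)
    l_fn = (1.0 - F.cosine_similarity(n_pred, n_gt, dim=-1)).mean()

    # Regularization: pull the intermediate encoded features toward zero.
    l_reg = sum((z ** 2).mean() for z in z_bars)

    return l_recon + alpha * l_fn + beta * l_reg
```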
5 Experiments and Evaluation
We report performance metrics in terms of reconstruction speed (Section 5.1) as well as quantitative and qualitative reconstruction errors (Section 5.2). We use the unseen performances in the test set to evaluate the generalization capacity of the trained model. We also evaluate our framework's ability to generalize to unseen dynamics and forces (Section 5.3). Additionally, we present experimental results on using blendshape inputs as a substitute for the LR physics-based simulator in generating the input LR tetrahedral mesh (Section 5.4).
We also conduct ablation experiments. In Section 5.5.1, we explore the tradeoffs in the reconstruction performance of our model when trained using the coarser LR volumetric mesh capable of attaining true real-time end-to-end animation at 28.04 FPS, as compared to our recommended near-realtime setting at 18.46 FPS. In Section 5.5.2, we explore how the submodules of our framework, namely the FE and CU modules, contribute to the reconstruction accuracy, and, in Section 5.5.3, we evaluate the effects of using different interpolation neighbors \(\mathcal {N}_j\) for the CU network and different neighbors \(k\) for the \(k\)-NN graph of the FE network. Then, in Section 5.5.4, we qualitatively evaluate the correlations among different parts of the face learned by the EdgeConv layers in the FE submodules. In addition, we investigate our framework's capability to approximate self-collisions between the upper and lower lips in Section A.3, and we conduct ablation experiments to assess the impact of incorporating higher degrees of wrinkle details on the target surface mesh in Section A.5.
5.1 Near-Realtime High-Resolution Facial Animations
Simulation speed. The average time to simulate the HR conforming simulation with 1,944,549 tetrahedral elements is 6.22s per frame, or a frame rate of 0.16 FPS. Conversely, the average time to simulate the LR embedding mesh with 73,128 tetrahedral elements is 0.033s, corresponding to 30.06 FPS, i.e., 188\(\times\) faster than the HR simulation. These simulation times are recorded on a workstation with a single GeForce RTX 4090 GPU.
SR inference speed. To approximate the HR surface from the LR simulation, we need to infer the HR displacements from our model. The computational overhead of our model inference on a single GeForce RTX 4090 GPU is 0.0209s per frame, corresponding to 47.82 FPS for inference alone.
End-to-end speed and additional performance boosting. Consequently, our simulation SR framework takes a total of 0.054s per frame, or 18.46 FPS, which implies that we achieve a speedup of 115\(\times\) relative to the HR simulation that takes 6.22s per frame (0.16 FPS). We emphasize that there are multiple ways to bridge the gap from near-realtime, e.g., 18.46 FPS, to true real-time, i.e., 24 or more FPS.
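To make the accounting explicit, the end-to-end figures above follow from simply summing the two serial stages (all numbers as reported in this section):
\[ t_{\text{total}} = t_{\text{sim}} + t_{\text{SR}} = \tfrac{1}{30.06} + \tfrac{1}{47.82} \approx 0.0333 + 0.0209 \approx 0.054\,\text{s} \;\Rightarrow\; 18.46\ \text{FPS}, \qquad \text{speedup} \approx \tfrac{6.22}{0.054} \approx 115\times . \]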
First and foremost, using a coarser LR simulation mesh can easily attain true real-time end-to-end animation, given tolerance for a minor tradeoff in the reconstruction quality that our current LR mesh enjoys (we explore this tradeoff in Section 5.5.1). Similarly, we can achieve faster inference by using fewer interpolation neighbors in the CU module, at a cost in overall reconstruction accuracy (see Section 5.5), as we identify the bottleneck of inference to be the neighborhood-information-gathering step in the CU module. Alternatively, while adhering to a strict bar for permissible reconstruction quality, we could pipeline the LR simulation and inference steps on a two-GPU workstation; in such a setup, we could achieve an end-to-end rate of 30.06 FPS after tolerating a single frame of latency. Finally, we could move away from the inference library (we use ONNX Runtime for PyTorch) and implement custom GPU inference kernels that speed up computation.
5.2 Generalization to Unseen Facial Expressions
Using the simulation data generated as described in Section 4, we select the amazement and pain sequences for training (435 frames) and test on the anger and fear sequences (445 frames), ensuring that the test set contains unseen performances. We use the trained model to infer the HR face surface from unseen LR volumetric mesh performances in the test set.
Quantitative evaluation. As we have access to the HR simulations of the test data, we can readily compute the reconstruction error in terms of per-point Euclidean distance between the reconstructed and the target (reference) mesh, whose dimensions are \(179.8\times 257.3\times 164.5\) mm (Figure 4). We also set up other commonly used reconstruction methods to serve as comparisons for our method. We train a \(\beta\)-VAE [Higgins et al. 2016] on the same dataset to serve as a baseline generative neural framework. We implement two commonly used surface reconstruction methods, RBF- and moving least-squares (MLS)-based, as representative global and local methods, respectively, where we employ the Gaussian function for the RBF. Lastly, we compare with the Deep Detail Enhancement (DDE) framework [Zhang et al. 2021] as a representative state-of-the-art SR framework for 3D garment surfaces, which uses normal maps to synthesize plausible wrinkle details on coarse geometry. The formulations for RBF and MLS along with details on the \(\beta\)-VAE and DDE can be found in Sections A.2.1, A.2.2, A.2.3, and A.2.4, respectively.
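For reference, a generic per-frame Gaussian-RBF baseline of the kind described above can be sketched as follows; the SciPy interpolator and kernel width are illustrative stand-ins, and the formulation actually used in the comparison is the one given in Appendix A.2.1.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def rbf_upsample(x_lr_rest, d_lr, x_hr_rest, epsilon=50.0):
    """Gaussian-RBF baseline (sketch): fit an interpolant to the LR per-vertex
    displacements at the rest-pose LR positions, then evaluate it at the HR
    rest-pose positions. `epsilon` (kernel width) is an illustrative value."""
    rbf = RBFInterpolator(x_lr_rest, d_lr, kernel="gaussian", epsilon=epsilon)
    return rbf(x_hr_rest)          # (M, 3) HR displacement estimates

# Per frame: hr_verts = x_hr_rest + rbf_upsample(x_lr_rest, d_lr, x_hr_rest)
```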
Our method outperformed the others and robustly achieved the lowest mean reconstruction error per frame (\(\lt\)0.59mm). We plot the frame-wise mean reconstruction errors of all methods to validate that ours has the least error for every test performance in Figure 6. The evaluation results are summarized in Table 1.
Qualitative evaluation. In Figure 7, we evaluate the visual fidelity of the inferred face mesh by visualizing the reconstructed HR surfaces and heatmaps of the corresponding reconstruction errors for all methods. Our method infers the target facial expression from the input LR volumetric mesh more faithfully than the other methods, allowing us to preserve both the expression and the subtle deformation details that would otherwise be compromised by using the LR simulation.
5.3 Generalization Beyond Parametric Space
We test the ability of our framework to handle deformations that extend beyond the parametric space used in simulations. To evaluate, we simulate the LR simulation mesh with unseen dynamics and external forces, respectively, and qualitatively evaluate the inference accuracy.
5.3.1 Unseen Dynamics.
To evaluate our model’s capability in generalizing to non-quasi-static simulations, we simulate the dynamics of the LR simulation mesh using a semi-implicit backward Euler scheme. This allows us to model ballistic effects that are not present in our training dataset which was simulated under the quasi-static assumption. We further exaggerate the ballistic effects in the simulation by shaking the head back and forth in conjunction with the muscle contractions and jaw motion.
We compare the reconstructed surface inferred from the input mesh with unseen dynamics (middle row of Figure 8(b)) against the reference surface conforming to the quasi-static simulation mesh (middle row of Figure 8(a)). Also, we visualize heatmaps showing the average facial deformations across the training data (top row of Figure 8(c)) and the deformation differences between the predicted and reference surfaces (middle row of Figure 8(c)). We highlight that although the nose shows little or no deformation throughout the training data (hence appearing as a dark blue region in the first heatmap), our model is capable of inferring its deformation from the unseen input (appearing as a lighter blue region in the second heatmap).
Similarly, we visualize the dynamic simulations (with yaw rotation motions of the head) and their reconstructions as a time sequence in Figure 9, along with heatmaps (Figure 9(e)–(f)) showing the deformation differences between the quasi-static/dynamic simulation meshes (Figure 9(a)/(b)), and between the reference conforming quasi-static surface (Figure 9(c)) and the reconstructed surface inferred from the dynamic LR simulation mesh (Figure 9(d)), respectively. Regions with distinctive facial deformations in the inferred faces (Figure 9(e)) are in line with the deformed regions of the input simulation meshes (Figure 9(f)), implying generalization beyond the quasi-static simulation data.
5.3.2 Unseen Forces.
We craft two quasi-static simulation examples with external forces applied to the rest-pose mesh (Figure 8(d)). In the first example (Figure 8(e)), we apply a spring force pulling the side of the lips; this force can also be interpreted as a candy cane pulling on one side of the lips. In the second example (Figure 8(f)), we collide the LR simulation mesh with a sphere, pushing the cheek inward. The LR performances, reconciled by the simulator, are given as input to our framework. The predictions indicate that our framework is able to handle inputs with deformations not seen in the training performances. Moreover, for side-by-side comparisons, we visualize the surface mesh embedded in the LR simulation mesh in Appendix A.4.
5.4 Experiments with Blendshape Inputs
Employing an LR physics-based simulator to produce the input mesh is computationally affordable and absorbs much of the nonlinearity in mapping from the simulation parameters (e.g., muscle activations) to the input mesh. Moreover, incorporating dynamics or external forces into the input mesh is a straightforward application of the physics-based simulator, providing an inherent advantage to its usage. Additionally, our SR framework can produce the intended facial expressions of the HR surface mesh from its semantically corresponding LR input while compensating for topological discrepancies, and can extrapolate to unseen physical effects after being trained only on purely quasi-static simulations.
In this section, we further investigate whether our SR framework can still predict the intended facial expressions from a non-physics-based LR input animated using blendshapes. Specifically, we conduct two experiments employing the blendshape system as a replacement for the LR physics-based simulator. First, we construct volumetric blendshapes of our LR input mesh and generate the training dataset using a blendshape animator instead of the physics simulator. We also go a step further and use the low-dimensional blendshape weights to approximate the HR facial performances by training a decoder-style neural network with around \(628\times\) more trainable parameters than our method. The architecture of this neural network is specified in Appendix A.2.5. We highlight that in both approaches, incorporating dynamics or external forces into the input mesh presents significant challenges compared to the straightforward application of the LR physics-based simulator, which inherently confers an advantage to its use.
In the following subsections, we describe our blendshape system setup used for constructing the volumetric blendshapes and weights for producing facial performances. Then, we provide the evaluation results of the two approaches.
5.4.1 Construction of Low-Resolution Tetrahedral Mesh Blendshapes.
For each blendshape in the blendshape muscle rig constructed in Section 4.2, we set its weight to 1.0 and zero out the remaining weights in order to obtain the kinematic muscle deformation corresponding to solely that blendshape. Then, we run the quasi-static solver to obtain the muscle-driven deformation of the LR tetrahedral mesh, which is then stored as the corresponding LR tetrahedral mesh blendshape.
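A minimal sketch of this construction, together with the standard linear blendshape combination assumed for animating the tetrahedral mesh, is given below; quasi_static_solve is a hypothetical stand-in for the quasi-static solver described above.

```python
import numpy as np

def build_tet_blendshapes(rest_tet_verts, num_blendshapes, quasi_static_solve):
    """Construct LR tetrahedral-mesh blendshapes: activate one rig blendshape at a
    time and store the solver's displacement from the rest pose."""
    deltas = []
    for b in range(num_blendshapes):
        weights = np.zeros(num_blendshapes)
        weights[b] = 1.0
        deformed = quasi_static_solve(weights)            # (N, 3) tet vertex positions
        deltas.append(deformed - rest_tet_verts)
    return np.stack(deltas)                               # (B, N, 3)

def animate_tet_mesh(rest_tet_verts, deltas, weights):
    """Volumetric blendshape animation: a linear combination of the stored
    per-blendshape displacements (jaw transform handled separately)."""
    return rest_tet_verts + np.tensordot(weights, deltas, axes=1)
```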
Volumetric blendshape animation as input. In the first scenario, we use the tetrahedral mesh animated using the blendshape weights constructed in Section 4.2 as input, as a replacement for the LR physics-based simulator. We then re-initialize and train our existing neural network (Section 3) to learn to predict the corresponding HR surface mesh.
Blendshape weights as lower-dimensional input. In the second scenario, we directly use the blendshape weights of the facial performances as inputs, bypassing the use of the simulator. To achieve this, we construct a fully connected neural network with ample capacity (443,840,125 trainable parameters) to learn the mapping from the 38-dimensional blendshape weight vector (comprising 31 blendshape weights and a 7-dimensional vector for the rigid transformation of the jaw: a quaternion and a translation vector) to the HR surface mesh.
5.4.2 Evaluation Results.
We infer the HR surface mesh on the test dataset and plot the frame-wise errors for both methods and for ours utilizing the LR physics-based simulator. We overlay the plots in Figure 6 to highlight the overall differences. As shown in Figure 10 and detailed in Table 2, using the blendshape weights as inputs (in blue) yields the largest reconstruction error compared to the other two methods (in red and green). We explain the larger error by noting that the neural network, despite having 628\(\times\) more learnable parameters than our method, must learn the blendshapes and produce accurate jaw transformations, tasks that the blendshape animator can easily perform.
On the other hand, using the input tetrahedral mesh produced by the blendshape animator (in green) leads to marginally higher error when compared to using the LR physics-based simulator (in red). This finding aligns with our expectations, given that the physics-based simulator can generate an input mesh that more faithfully adheres to the target surface mesh, accommodating the highly nonlinear and intricate nature of the physics-based simulations.
Notably, relying on blendshape weights as inputs often leads to difficulties in generalizing to unseen jaw transformations. This is clearly observed in the close-up side view of the mouth in the third row of Figure 11(d), where the red background highlights the reconstruction difference from the target mesh (Figure 11(a)). Employing the blendshape animator helps to mitigate this issue by generating the LR tetrahedral mesh with accurate jaw motions, as depicted in Figure 11(c). Nevertheless, using the LR physics-based simulator demonstrates superior performance in faithfully predicting the target facial deformations, particularly evident in the close-up front views of the mouth in the second rows of Figure 11(a)–(c).
5.5 Additional Experiments
In this section, we compare the quality of reconstructed faces inferred by our model trained using the original LR simulation mesh with 73k elements (Figure 4(c)) and another one trained using a coarser LR simulation mesh with 34k elements (Figure 4(d)). The coarser mesh attains true real-time end-to-end animation at 28.04 FPS (67.79 FPS simulation and 47.82 FPS inference) on the same hardware setup.
Furthermore, we evaluate the contributions of our FE (Section 3.1) and CU (Section 3.2) modules. We explore the effects of the key parameters in each of the two modules, namely the number of neighbors \(k\) in the FE module and the number of interpolation neighbors in the upsampling module. Additionally, we qualitatively validate the correlations among different parts of the face learned by our FE network.
5.5.1 Comparison with Coarser Low-Resolution Simulation Mesh.
For training, we use the same hyperparameters as for the training on the original LR simulation mesh. Following the same procedure as in Section 5.2, we evaluate the surface reconstruction errors on the unseen facial expressions in the test dataset.
As shown in the error plot of Figure 12, using the coarser LR mesh expectedly attains slightly larger reconstruction errors across most frames compared to the original mesh. We observe increased artifacts in the inferred surfaces, especially around the mouth region, in Figure 12(a)-(b). We highlight that, in practice, true real-time end-to-end animation is easily attainable if we tolerate a minute deterioration of the reconstruction quality, which could become unnoticeable to human eyes under different rendering techniques, such as using a texture map as opposed to plain diffuse rendering. However, we choose to adhere to the current resolution for the robustness of generalization beyond the parametric space used in the simulation (e.g., unseen dynamics and external forces), given that true real-time animation is also attainable, in practice, if we tolerate one frame of latency.
5.5.2 Contributions of FE and Coordinate-Based Upsampling Modules.
We evaluate the contributions of the FE and CU modules by excluding them (one at a time). We compare the predictions on test performances.
Specifically, we train three different models using the same dataset and hyperparameters for the same number of epochs (1000). The first model includes both the FE and CU modules (our proposed framework). The second model excludes the FE module and feeds the output of the positional encoding directly to the CU module. In the third model, we reintroduce the FE module and exclude the CU module. To replace the CU module, we opt for a different, standard upsampling method (with a fixed upsampling ratio) that uses the transposed convolution operation widely adopted in upsampling images for SR [Yang et al. 2019]. To mimic the transposed convolution operator, we find the 20 nearest LR mesh vertices from each HR mesh vertex in terms of Euclidean distance (the same number as in our neighbor interpolation in the CU module). We then compute weighted sums of the 20 LR mesh features for every HR mesh vertex. For a fair comparison, we learn these weights, similar to the weights learned in our CU module.
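A sketch of this ablation baseline is shown below; assigning one learned weight per (HR vertex, neighbor) pair is an assumption, as the exact parameterization of the learned weights is not detailed above.

```python
import torch
import torch.nn as nn

class FixedRatioUpsampler(nn.Module):
    """Ablation baseline mimicking a transposed convolution: each HR vertex blends
    its 20 Euclidean-nearest LR features with directly learned weights, with no
    coordinate-based weighting network."""
    def __init__(self, num_hr_vertices, k=20):
        super().__init__()
        self.weights = nn.Parameter(torch.full((num_hr_vertices, k, 1), 1.0 / k))

    def forward(self, z_lr, neighbors):
        # z_lr: (N, C) LR features, neighbors: (M, k) precomputed LR indices.
        return (self.weights * z_lr[neighbors]).sum(dim=1)   # (M, C)
```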
From the three trained models, we compare the reconstruction error on the test dataset. As summarized in Table 3, our model, which includes both the FE and CU modules, outperforms the other two variants, which were trained in the absence of the FE and CU modules, respectively.
We qualitatively validate the visual fidelity of the performances reconstructed by the three models in Figure 13. We observe that in the absence of the FE module, the model fails to accurately reconstruct the parts of the face with larger deformations (like the mouth area in Figure 13(c)), and that replacing the CU module leads to reconstruction artifacts and discontinuities in the HR surface (Figure 13(d)).
5.5.3 Effects of Different Locality Parameters.
Interpolation neighbors in CU. We explore the effects of using a different number of interpolation neighbors for defining the local neighbors set \(\mathcal {N}_j\) in Section 3.2. For this experiment, we train our model using the same training dataset and hyperparameters for 500 epochs but vary the number of interpolation neighbors as 1, 3, 5, 10, and 20. We fix \(k=5\) for the \(k\)-NN graph in the FE module for these experiments. We plot the mean surface reconstruction error on the test dataset to study the effect of varying the number of interpolation neighbors on reconstruction accuracy.
As shown in the plot in Figure 14(a), we observe that using a higher number of interpolation neighbors achieves a lower mean reconstruction error on unseen performances (shown in red). However, the tradeoff is a linearly increasing time consumption for each inference (shown in blue).
Number of neighbors \(k\) in FE. We conduct another experiment to study the effect of varying the number of neighbors \(k\) used in constructing the \(k\)-NN graph in the EdgeConv layer of the FE module. We train our model for 500 epochs while varying \(k\) from 1 to 10 in each experiment and evaluate the mean surface reconstruction error on the test dataset. We fix the number of interpolation neighbors in the CU module to 10 for these experiments. As shown in the plot in Figure 14(b), we find that using \(k=4\) or \(5\) gives the minimum reconstruction error (shown in red) without a large tradeoff in inference time (shown in blue).
5.5.4 Correlations Learned in FE Module.
We visualize heatmaps of the feature similarities learned by the EdgeConv layer in the second FE network submodule. This can reveal the correlations among different parts of the face learned from data. As outlined in Section 3.1, we encourage the first submodule to learn local spatial correlations by constructing its \(k\)-NN graph based on geodesic distances, and the second submodule to learn (potentially global) feature correlations in its learned feature space.
Figure 15 shows the learned similarities for four selected frames, where the red point in each image denotes a queried point, and similar colors and shades represent higher similarities. We observe that the FE module has captured correlations among different parts of the face, such as the right part of the chin being correlated with the left part of the mouth (third image from the left).
6 Conclusion
We have proposed a data-driven deep neural network framework which, using as input an LR simulation of facial expression, enhances its detail and visual fidelity to levels commensurate with those of a much more expensive HR simulation. The combined performance of the low-resolution simulator and the upsampling module is efficient enough to yield 18.46 FPS end-to-end, with the potential for true real-time 28.04 FPS end-to-end at a modest sacrifice of accuracy. We demonstrate that our SR framework is able to convincingly bridge the visual quality gap between the real-time LR and offline HR simulations, even in instances where the two simulations have substantial differences due to discretization, modeling, and resolution disparities. Our SR network successfully upsamples even deformations that go beyond the parametric poses exemplified in the training set (triggered by muscle action and bone motion), to include dynamics, external forces, and collision objects and constraints. Finally, we observe that our framework can approximate a degree of collision response purely via generalization from the training data. Our code is available at https://github.com/hjoonpark/3d-sim-super-res.git
6.1 Limitations and Future Work
We have adopted a number of design choices that may consciously limit the scope of our work. We have chosen the output of our upsampling module to be the surface of the face model, rather than a description that includes the interior of the HR target simulation mesh. The same output is also purely geometric, as opposed to physical quantities such as volumetric strain tensor fields or action potentials (e.g., in the style of Srinivasan et al. [2021] and Yang et al. [2022]), which might have been useful for an extra simulation pass at the HR level to incorporate additional effects. Both choices are made to reduce the dependency of our system on any internal traits of the simulation engine used to produce the HR training data, requiring only surfaces at HR for training (these could even have originated from performance acquisition, as opposed to simulation), and to stay as close to the real-time regime as possible.
Our SR approach strives to recreate physical behaviors as exemplified in the HR component of the training set; however, the degree to which such physical traits are conveyed is limited by how large and representative our training set is, and is not enforced via explicit physics-based simulation at the HR output. For example, traits such as volume preservation, strain limits, or contact/collision behavior are only approximated to the degree that the network can learn them from data, while a full-fledged simulator could provide stronger guarantees. Specifically, if the LR simulation does not employ collision handling and the HR simulator used for training does, it would be very challenging to resolve behaviors where the exact result of contact resolution is history dependent and admits multiple solutions. A typical example would be a facial motion that brings the lips into deep collision at LR; at HR, any of several results is possible, including the lips being pressed together or sliding under one another in either order, and the network does not have the benefit of history dependence or friction to naturally lead to one of these scenarios.
In future work, we wish to further investigate possibilities for boosting our method’s efficacy of collision handling, by tuning the training loss to more directly emphasize collision avoidance (rather than just matching the target provided), and possibly augment the LR simulation with cheap approximations to collisions (e.g., using proxy geometry and repulsive forces to create a “soft” collision response) to help disambiguate collision scenarios where multiple solutions are admissible. We would also investigate adding a temporal element to our prediction; this could be beneficial both as a way to enhance temporal consistency of our animation, and perhaps as a pathway to adding dynamic effects to the resulting animation (even if the LR simulation was overdamped or quasistatic). Lastly, our method is trained on the facial model of a single identity, overfitting on a specific face mesh. Extending our proposed simulation SR framework to accommodate multiple identities is also an interesting direction for future work.