Audio-Driven Facial Animation with Deep Learning: A Survey
Figure 1. The development of audio-driven animation.
Figure 2. Scope of this survey.
Figure 3. Graphical illustration of deep learning-based generation of audio-driven 2D video and 3D facial animation.
Figure 4. Graphical illustration of landmark-based methods.
Abstract
1. Introduction
1.1. The Development of Audio-Driven Animation
1.2. The Scope of This Survey
1.3. Differences with Existing Surveys
2. Two-Dimensional Video and Three-Dimensional Facial Animation Generation Methods
2.1. Pipeline for Audio-Driven Facial Animation
2.2. Two-Dimensional Video Generation
2.2.1. Landmark-Based Methods
2.2.2. Encoder–Decoder Methods
2.2.3. Three-Dimensional Model-Based Methods
2.3. Three-Dimensional Face Animation
2.3.1. Transformer-Based Methods
2.3.2. Advanced Generative Network-Based Methods
3. Comparison
3.1. Datasets
3.1.1. Audio–Visual Speech Data for Face Animation
- GRID: provides sentence-level audio–visual data ideal for animating lip movements in speech-driven face animation.
- TCD-TIMIT: contains synchronized audio–visual recordings for audio-driven speech animation tasks.
- VOCASET: specifically built for 3D facial mesh generation from speech audio, making it highly suitable for audio-driven face animation.
- LRS: a sentence-level continuous speech dataset for speech-driven facial animation, particularly for lip movement.
- LRS2-BBC: offers continuous speech data from BBC shows, useful for high-quality lip sync and facial motion generation.
- LRS3-TED: provides diverse speech data from TED talks, valuable for training models on varied speakers and expressions in audio-driven face animation.
- MultiTalk: a multilingual audiovisual dataset designed to enhance 3D talking head generation across multiple languages.
3.1.2. Emotional Expression and Speech Synthesis
- CREMA-D: a multimodal emotional dataset useful for generating facial animations that incorporate emotional expressions along with speech.
- MSP-IMPROV: ideal for creating expressive facial animations that respond to both emotional and speech input.
- RAVDESS: combines emotional speech and facial expressions, facilitating models that generate emotional facial animations from audio.
- MEAD: a large-scale dataset of emotional talking faces, perfect for generating facial expressions and lip syncing with emotional variations in speech.
3.1.3. High-Resolution and Detailed Facial Data for 3D Animation
- HDTF: a high-definition talking-face video dataset useful for creating fine-grained facial animations driven by speech.
- Multiface: captures facial landmarks and expressions, valuable for generating precise facial animations synchronized with audio.
- MMFace4D: a 4D facial dataset that can be used for generating dynamic 3D facial expressions driven by speech audio.
3.1.4. Multimodal Data with Speech for Face Animation
- MODALITY: a multimodal interaction dataset that can be adapted for audio-driven facial animation, particularly for tasks involving synchronized speech and expressions.
3.1.5. General and Speaker Recognition Datasets with Potential for Face Animation
- VoxCeleb: a large-scale audiovisual dataset that can be adapted for generating personalized facial animations from speaker-specific audio.
Laboratory Dataset vs. Wild Dataset
Challenges and Considerations
3.2. Evaluation of Audio-Driven Facial Animation Methods
- (A) Quantitative Metrics
3.2.1. Pixel Wise Metrics (Direct Comparison of Pixel Values)
- MSE (Mean Squared Error): Measures the average squared difference between the predicted and ground truth facial animations. Lower MSE values indicate better accuracy in reproducing facial movements.
- PSNR (Peak Signal-to-Noise Ratio): Evaluates the quality of generated facial animations by comparing them to reference animations. Higher PSNR values indicate better visual fidelity and less distortion.
- LMD (Landmark Distance Error): LMD quantifies the accuracy of lip movement generation by calculating the distance between predicted and ground-truth lip landmark positions during animation. This metric is central to assessing the fidelity of generated lip movements, ensuring that they closely match the intended lip shapes.
- EAR (Eye Aspect Ratio): EAR is a quantitative measure of eye openness based on the geometric relationships between facial landmarks around the eyes. It is primarily used in computer vision and facial analysis to detect eye blinks and monitor eye movement (a computation sketch covering these pixel-wise metrics follows this list).
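To make the pixel-wise definitions above concrete, the following is a minimal NumPy sketch, assuming frames are arrays with values in [0, 255] and that lip and eye landmarks have already been detected; the array shapes, the 255 peak value, and the six-landmark eye layout (following the eye-blink formulation of [168]) are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared error between a predicted and a ground-truth frame."""
    return float(np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2))

def psnr(pred: np.ndarray, gt: np.ndarray, max_i: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    err = mse(pred, gt)
    return float("inf") if err == 0 else 10.0 * np.log10(max_i ** 2 / err)

def lmd(pred_lm: np.ndarray, gt_lm: np.ndarray) -> float:
    """Landmark distance: mean Euclidean distance between predicted and
    ground-truth lip landmarks of shape (T, K, 2) for T frames, K landmarks."""
    return float(np.mean(np.linalg.norm(pred_lm - gt_lm, axis=-1)))

def ear(eye: np.ndarray) -> float:
    """Eye aspect ratio from six eye landmarks p1..p6 (shape (6, 2)):
    EAR = (|p2-p6| + |p3-p5|) / (2 |p1-p4|); it approaches 0 when the eye closes."""
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return float((v1 + v2) / (2.0 * h))
```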
3.2.2. Perception-Based Metrics (Visual Quality and Realism)
- SSIM (Structural Similarity Index): Assesses the similarity between generated and real facial animations by comparing structural details such as luminance, contrast, and texture. SSIM provides a more perceptually relevant measure of quality than MSE or PSNR.
- CPBD (Cumulative Probability of Blur Detection): CPBD assesses the sharpness and clarity of images. By leveraging a probabilistic model of edge sharpness, it provides a measure of image quality that aligns well with human visual perception.
- LPIPS (Learned Perceptual Image Patch Similarity): LPIPS evaluates the perceptual similarity between images, focusing on how humans perceive differences in visual content. Unlike traditional metrics such as MSE or PSNR, which rely solely on pixel wise comparisons, LPIPS leverages deep features to assess image quality in a manner that aligns more closely with human perception.
- FID (Fréchet Inception Distance): FID measures the distance between the distributions of generated images from a Generative Adversarial Network (GAN) and real images, based on feature representations extracted from a pre-trained Inception network. It is commonly used to evaluate the quality of GANs, particularly under a two-time-scale update rule, as it helps determine convergence towards a local Nash equilibrium in the training process.
- LRA (Lip-Reading Accuracy): Conventional metrics such as PSNR, SSIM, and LMD are inadequate for evaluating the correctness of generated lip movements. To better assess lip synchronization, LRA is computed with a state-of-the-art deep lip-reading model trained on real speech videos, which has proved effective at providing a more precise evaluation of lip synchronization quality.
- WER (Word Error Rate): WER is calculated by comparing the predicted words generated from the audio input against a reference transcription of the spoken content. Specifically, WER quantifies the number of errors by assessing the minimum number of word insertions, substitutions, and deletions required to align the predicted transcription with the ground truth.
- ESD (Emotion Similarity Distance): ESD is a metric designed to quantify the similarity of emotional features extracted from video data. ESD is grounded in the concept of cosine similarity, which measures the cosine of the angle between two non-zero vectors in a high-dimensional space.
- FDD (Upper-Face Dynamics Deviation): FDD measures how well the variation in upper-face dynamics (the changes in motion over time) of the generated animation matches that of the ground truth. Upper-face movements are only loosely tied to speech and are shaped by emotion, intention, and personal speaking style, so they exhibit subtle and complex dynamics; FDD assesses whether the generated motion captures this complexity. A short computation sketch for FID, WER, and ESD is given after this list.
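The sketch below illustrates, under stated assumptions, how three of these scores can be computed once the required features are available: FID from the Gaussian statistics of real and generated feature sets (the Inception feature extraction itself is assumed to have been done beforehand, following [166]), WER via the standard edit-distance recurrence, and ESD as a cosine similarity between emotion feature vectors. Function names and array shapes are illustrative, not a prescribed implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Fréchet Inception Distance between two feature sets of shape (N, D),
    e.g., Inception activations of real and generated frames."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def wer(reference, hypothesis):
    """Word error rate: minimum insertions/deletions/substitutions needed to
    turn the hypothesis word list into the reference, divided by reference length."""
    n, m = len(reference), len(hypothesis)
    d = np.zeros((n + 1, m + 1), dtype=np.int64)
    d[:, 0] = np.arange(n + 1)
    d[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1, j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return float(d[n, m]) / max(n, 1)

def esd(emb_a, emb_b):
    """Emotion similarity as cosine similarity of two emotion feature vectors."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))
```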
3.2.3. Vertex Wise Metrics (Geometric Comparisons at Mesh Level)
- LVE (Lip Vertex Error): LVE measures the difference between the ground-truth and predicted positions of lip vertices in a 3D facial model during animation or synchronization tasks (see the sketch below).
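A common convention in recent 3D talking-head work is to take the largest lip-vertex deviation in each frame and average it over the sequence; the sketch below follows that reading, with the vertex array shapes and the lip_idx index list as assumptions for illustration.

```python
import numpy as np

def lve(pred_verts: np.ndarray, gt_verts: np.ndarray, lip_idx: np.ndarray) -> float:
    """Lip vertex error: per frame, take the largest L2 deviation among the lip
    vertices, then average over frames.
    pred_verts, gt_verts: (T, V, 3) vertex sequences; lip_idx: indices of lip vertices."""
    diff = pred_verts[:, lip_idx, :] - gt_verts[:, lip_idx, :]   # (T, L, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)                   # (T, L)
    return float(per_vertex.max(axis=1).mean())
```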
3.2.4. Temporal Wise Metrics (Time-Based Evaluation)
- BA (Beat Align Score): a metric commonly used to evaluate the temporal alignment between audio signals (such as speech or music) and the corresponding visual motion.
- LSE-D (Lip Sync Error—Distance): LSE-D evaluates lip synchronization by measuring the distance-based error between predicted lip movements and the ground truth over time. It focuses on the degree of mismatch in lip movement positions relative to the audio, assessing how well the generated animation aligns with the timing of speech.
- LSE-C (Lip Sync Error—Confidence): LSE-C also assesses lip synchronization but focuses on the confidence of the synchronization rather than positional differences alone. It evaluates how confidently the system can predict lip movements from the input audio, capturing the reliability of the synchronization across the animation sequence (a sketch of both LSE scores follows this list).
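Both LSE scores are normally read off a pretrained SyncNet-style audio-visual model; the sketch below only illustrates the final scoring step, assuming per-window audio and video embeddings have already been extracted. Defining the confidence as the gap between the median and the minimum distance over temporal offsets follows the SyncNet convention, while the embedding source, window length, and offset range here are assumptions.

```python
import numpy as np

def lse_scores(audio_emb: np.ndarray, video_emb: np.ndarray, max_offset: int = 15):
    """Sketch of LSE-D / LSE-C given SyncNet-style embeddings of shape (T, D).
    LSE-D: mean embedding distance at zero offset (lower is better).
    LSE-C: median-minus-minimum distance over temporal offsets (higher is better)."""
    T = min(len(audio_emb), len(video_emb))
    audio_emb, video_emb = audio_emb[:T], video_emb[:T]

    def mean_dist(offset: int) -> float:
        # Shift the audio track by `offset` windows relative to the video track.
        if offset >= 0:
            a, v = audio_emb[offset:T], video_emb[: T - offset]
        else:
            a, v = audio_emb[: T + offset], video_emb[-offset:T]
        return float(np.mean(np.linalg.norm(a - v, axis=-1)))

    dists = np.array([mean_dist(o) for o in range(-max_offset, max_offset + 1)])
    lse_d = mean_dist(0)
    lse_c = float(np.median(dists) - dists.min())
    return lse_d, lse_c
```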
- (B) Qualitative Metrics
- (I) Visual Realism and Naturalness: evaluates how visually realistic and natural the generated facial animations appear to human observers.
- Human Subjective Evaluation: Human evaluators are asked to rate the realism, naturalness, and smoothness of the generated animations. This is often carried out through surveys, where participants watch animated clips and rate them on various criteria (e.g., a 1 to 5 scale); an aggregation sketch for such ratings is given at the end of this subsection.
- Expert Evaluation: Professionals in animation, gaming, or film industries may be asked to evaluate the quality of the facial animations based on their experience. This often focuses on the accuracy of facial expressions, lip sync, and overall animation quality.
- Comparative Realism: evaluators compare the generated face animations with real video recordings to assess how closely the animated faces resemble human behavior.
- (II) Lip Synchronization Accuracy: evaluates how well the lip movements of the animated face are synchronized with the audio input.
- Visual Lip Sync Test: Evaluators watch the animated face and determine whether the lip movements are in sync with the speech audio. The key criteria include whether the lip movements match the speech phonemes and whether the transitions between visemes are smooth.
- A/B Comparison: human evaluators are shown side-by-side comparisons of the generated animation and the ground truth (real video) and asked which one has better lip sync accuracy or if they are indistinguishable.
- (III) Expression Realism and Emotional Consistency: focuses on how well the animated face conveys emotions and expressions that are consistent with the content of the audio.
- Expression Realism Rating: Human evaluators rate how realistic and appropriate the facial expressions are based on the context of the spoken words. They may be asked whether the expressions match the emotional tone of the speech (e.g., happiness, sadness, surprise).
- Emotional Consistency Test: Evaluators assess whether the generated facial expressions are consistent with the emotion implied by the audio (e.g., whether happy speech leads to a smiling face). They can also evaluate how smoothly emotional transitions occur during the animation.
- Contextual Appropriateness: evaluators judge if the facial expressions are contextually appropriate, i.e., whether the animation expresses the right emotion or expression for the specific dialogue or speech content.
- (IV) Temporal Smoothness and Continuity: assesses whether the generated animations are temporally coherent and visually smooth over time.
- Smoothness Evaluation: Evaluators focus on how smoothly the facial movements transition from one frame to the next, particularly in areas like the mouth, eyes, and eyebrows. Jerky or unnatural transitions can lead to lower ratings.
- Temporal Coherence: human evaluators examine the overall temporal continuity of the animation, checking for any glitches, jitter, or sudden changes that disrupt the natural flow of expressions and movements.
- Emotion Transition Smoothness: when emotions change throughout the speech (e.g., neutral to happy), evaluators assess how naturally the facial expressions transition from one emotion to another.
- (V) Overall Perceptual Quality: combines several aspects (lip sync, realism, expressions) to give a general assessment of the perceived quality of the animation.
- Immersiveness and Engagement: Evaluators rate how engaging and immersive the animation feels. In applications like gaming or virtual assistants, a higher sense of immersion means the animated faces feel more believable and human-like.
- Consistency with Personality or Identity: Evaluators assess whether the generated face animations are consistent with the identity of the character or speaker. This is particularly important for applications where maintaining a character’s distinct personality through facial expressions is crucial.
- (VI) User Experience and Acceptance Tests: these tests focus on how end-users perceive the system in real-world applications, especially in interactive environments like virtual assistants or video games.
- User Interaction Feedback: end-users interact with the system and provide qualitative feedback on how well the animated face corresponds to the speech, its responsiveness, and whether they find the system engaging and easy to use.
- Task-Based Evaluation: in applications like virtual tour guides or digital assistants, users are asked to complete tasks while interacting with the animated face and then rate their overall experience, focusing on the responsiveness and believability of the facial animations.
- Naturalness in Social Interaction: Evaluators are asked how natural the face animations are during conversations or social interactions. This is especially relevant in digital human or virtual assistant applications, where natural interaction is crucial.
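Subjective studies such as those described above are usually summarized as a mean opinion score (MOS) with a confidence interval per method. The snippet below is a small aggregation sketch assuming a ratings matrix of shape (num_raters, num_methods) on the 1 to 5 scale mentioned earlier; the layout and the normal-approximation interval are illustrative assumptions, not a standardized protocol.

```python
import numpy as np

def mean_opinion_scores(ratings: np.ndarray):
    """Aggregate subjective ratings of shape (num_raters, num_methods) into a
    mean opinion score and an approximate 95% confidence interval per method."""
    mos = ratings.mean(axis=0)
    sem = ratings.std(axis=0, ddof=1) / np.sqrt(ratings.shape[0])
    ci95 = 1.96 * sem                      # normal approximation
    return mos, ci95

# Example: 20 raters scoring 3 methods on a 1-5 scale (random stand-in data).
rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(20, 3)).astype(float)
mos, ci = mean_opinion_scores(scores)
print([f"{m:.2f} ± {c:.2f}" for m, c in zip(mos, ci)])
```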
3.3. Results
3.3.1. Lip Synchronization
3.3.2. Motion Diversity
3.3.3. Image Quality
3.3.4. Impact of Language on Animation Quality
4. Conclusions and Future Directions
4.1. Nonverbal Elements and Silence
4.2. Synchronization and Temporal Consistency
4.3. Generalization and Customization
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chen, T. Audiovisual Speech Processing. IEEE Signal Process. Mag. 2001, 18, 9–21. [Google Scholar] [CrossRef]
- Seymour, M.; Evans, C.; Libreri, K. Meet Mike: Epic avatars. In ACM SIGGRAPH 2017 VR Village; ACM: Los Angeles, CA, USA, 2017; pp. 1–2. [Google Scholar]
- Charalambous, C.; Yumak, Z.; Van Der Stappen, A.F. Audio-driven Emotional Speech Animation for Interactive Virtual Characters. Comput. Animat. Virtual 2019, 30, e1892. [Google Scholar] [CrossRef]
- Xu, M.; Duan, L.Y.; Cai, J.; Chia, L.T.; Xu, C.; Tian, Q. HMM-Based Audio Keyword Generation. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2004; pp. 566–574. [Google Scholar] [CrossRef]
- Deng, L.; O’Shaughnessy, D. Speech Processing: A Dynamic and Optimization-Oriented Approach; Signal Processing and Communications; Marcel Dekker: New York, NY, USA, 2003. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; ISCA: Shanghai, China, 2020; pp. 5036–5040. [Google Scholar] [CrossRef]
- Han, W.; Zhang, Z.; Zhang, Y.; Yu, J.; Chiu, C.-C.; Qin, J.; Gulati, A.; Pang, R.; Wu, Y. ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; ISCA: Shanghai, China, 2020; pp. 3610–3614. [Google Scholar] [CrossRef]
- Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv 2022, arXiv:2006.04558. [Google Scholar]
- Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Chan, W.; Park, D.; Lee, C.; Zhang, Y.; Le, Q.; Norouzi, M. SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network. arXiv 2021, arXiv:2104.02133. [Google Scholar]
- Chen, S.; Liu, S.; Zhou, L.; Liu, Y.; Tan, X.; Li, J.; Zhao, S.; Qian, Y.; Wei, F. VALL-E 2: Neural Codec Language Models Are Human Parity Zero-Shot Text to Speech Synthesizers. arXiv 2024, arXiv:2406.05370. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf (accessed on 21 October 2024).
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Blanz, V.; Vetter, T. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques—SIGGRAPH ’99, Los Angeles, CA, USA, 8–13 August 1999; pp. 187–194. [Google Scholar] [CrossRef]
- Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 2022, 65, 99–106. [Google Scholar] [CrossRef]
- Magnenat Thalmann, N.; Thalmann, D. Models and Techniques in Computer Animation; Springer: Tokyo, Japan, 2013. [Google Scholar]
- Fisher, C.G. Confusions Among Visually Perceived Consonants. J. Speech Hear. Res. 1968, 11, 796–804. [Google Scholar] [CrossRef] [PubMed]
- Brand, M. Voice Puppetry. In Proceedings of the 26th Annual Conference on Computer graphics and Interactive Techniques—SIGGRAPH ’99, Los Angeles, CA, USA, 8–13 August 1999; pp. 21–28. [Google Scholar] [CrossRef]
- Anderson, R.; Stenger, B.; Wan, V.; Cipolla, R. Expressive Visual Text-to-Speech Using Active Appearance Models. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; IEEE: Portland, OR, USA, 2013; pp. 3382–3389. [Google Scholar] [CrossRef]
- Wang, L.; Soong, F.K. HMM Trajectory-Guided Sample Selection for Photo-Realistic Talking Head. Multimed. Tools Appl. 2015, 74, 9849–9869. [Google Scholar] [CrossRef]
- Deena, S.; Galata, A. Speech-Driven Facial Animation Using a Shared Gaussian Process Latent Variable Model. In Advances in Visual Computing; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5875, pp. 89–100. [Google Scholar] [CrossRef]
- Deena, S.; Hou, S.; Galata, A. Visual Speech Synthesis Using a Variable-Order Switching Shared Gaussian Process Dynamical Model. IEEE Trans. Multimed. 2013, 15, 1755–1768. [Google Scholar] [CrossRef]
- Schabus, D.; Pucher, M.; Hofer, G. Joint Audiovisual Hidden Semi-Markov Model-Based Speech Synthesis. IEEE J. Sel. Top. Signal Process. 2014, 8, 336–347. [Google Scholar] [CrossRef]
- Fan, B.; Xie, L.; Yang, S.; Wang, L.; Soong, F.K. A Deep Bidirectional LSTM Approach for Video-Realistic Talking Head. Multimed. Tools Appl. 2016, 75, 5287–5309. [Google Scholar] [CrossRef]
- Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph. 2017, 36, 1–13. [Google Scholar] [CrossRef]
- Karras, T.; Aila, T.; Laine, S.; Herva, A.; Lehtinen, J. Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion. ACM Trans. Graph. 2017, 36, 1–12. [Google Scholar] [CrossRef]
- Kammoun, A.; Slama, R.; Tabia, H.; Ouni, T.; Abid, M. Generative Adversarial Networks for Face Generation: A Survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
- Kandwal, S.; Nehra, V. A Survey of Text-to-Image Diffusion Models in Generative AI. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; IEEE: Noida, India, 2024; pp. 73–78. [Google Scholar] [CrossRef]
- Liu, S. Audio-Driven Talking Face Generation: A Review. J. Audio Eng. Soc. 2023, 71, 408–419. [Google Scholar] [CrossRef]
- Tolosana, R.; Vera-Rodriguez, R.; Fierrez, J.; Morales, A.; Ortega-Garcia, J. Deepfakes and beyond: A Survey of Face Manipulation and Fake Detection. Inf. Fusion 2020, 64, 131–148. [Google Scholar] [CrossRef]
- Mirsky, Y.; Lee, W. The Creation and Detection of Deepfakes: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
- Zhen, R.; Song, W.; He, Q.; Cao, J.; Shi, L.; Luo, J. Human-Computer Interaction System: A Survey of Talking-Head Generation. Electronics 2023, 12, 218. [Google Scholar] [CrossRef]
- Sha, T.; Zhang, W.; Shen, T.; Li, Z.; Mei, T. Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
- Gowda, S.N.; Pandey, D.; Gowda, S.N. From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications. arXiv 2023, arXiv:2308.16041. [Google Scholar]
- Meng, M.; Zhao, Y.; Zhang, B.; Zhu, Y.; Shi, W.; Wen, M.; Fan, Z. A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing. arXiv 2024, arXiv:2406.10553. [Google Scholar]
- Jalalifar, S.A.; Hasani, H.; Aghajan, H. Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks. arXiv 2018, arXiv:1803.07461. [Google Scholar]
- Chen, L.; Maddox, R.K.; Duan, Z.; Xu, C. Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Long Beach, CA, USA, 2019; pp. 7824–7833. [Google Scholar] [CrossRef]
- Das, D.; Biswas, S.; Sinha, S.; Bhowmick, B. Speech-Driven Facial Animation Using Cascaded GANs for Learning of Motion and Texture. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Volume 12375, pp. 408–424. [Google Scholar] [CrossRef]
- Lu, Y.; Chai, J.; Cao, X. Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation. ACM Trans. Graph. 2021, 40, 1–17. [Google Scholar] [CrossRef]
- Zhou, Y.; Han, X.; Shechtman, E.; Echevarria, J.; Kalogerakis, E.; Li, D. MakeItTalk: Speaker-Aware Talking-Head Animation. ACM Trans. Graph. 2020, 39, 1–15. [Google Scholar] [CrossRef]
- Chen, Z.; Cao, J.; Chen, Z.; Li, Y.; Ma, C. EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions. arXiv 2024, arXiv:2407.08136. [Google Scholar]
- Tan, J.; Cheng, X.; Xiong, L.; Zhu, L.; Li, X.; Wu, X.; Gong, K.; Li, M.; Cai, Y. Landmark-Guided Diffusion Model for High-Fidelity and Temporally Coherent Talking Head Generation. arXiv 2024, arXiv:2408.01732. [Google Scholar]
- Zhong, W.; Lin, J.; Chen, P.; Lin, L.; Li, G. High-Fidelity and Lip-Synced Talking Face Synthesis via Landmark-Based Diffusion Model. arXiv 2024, arXiv:2408.05416. [Google Scholar]
- Jamaludin, A.; Chung, J.S.; Zisserman, A. You Said That?: Synthesising Talking Faces from Audio. Int. J. Comput. Vis. 2019, 127, 1767–1779. [Google Scholar] [CrossRef]
- Chung, J.S.; Jamaludin, A.; Zisserman, A. You Said That? arXiv 2017, arXiv:1705.02966. [Google Scholar]
- Vougioukas, K.; Petridis, S.; Pantic, M. Realistic Speech-Driven Facial Animation with GANs. Int. J. Comput. Vis. 2020, 128, 1398–1413. [Google Scholar] [CrossRef]
- Wiles, O.; Koepke, A.S.; Zisserman, A. X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Volume 11217, pp. 690–706. [Google Scholar] [CrossRef]
- Chen, L.; Li, Z.; Maddox, R.K.; Duan, Z.; Xu, C. Lip Movements Generation at a Glance. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Volume 11211, pp. 538–553. [Google Scholar] [CrossRef]
- Fan, B.; Wang, L.; Soong, F.K.; Xie, L. Photo-Real Talking Head with Deep Bidirectional LSTM. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; IEEE: South Brisbane, QLD, Australia, 2015; pp. 4884–4888. [Google Scholar] [CrossRef]
- Pham, H.X.; Cheung, S.; Pavlovic, V. Speech-Driven 3D Facial Animation with Implicit Emotional Awareness: A Deep Learning Approach. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 2328–2336. [Google Scholar] [CrossRef]
- Vougioukas, K.; Petridis, S.; Pantic, M. End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019; Volume 887, pp. 37–40. [Google Scholar]
- Song, Y.; Zhu, J.; Li, D.; Wang, A.; Qi, H. Talking Face Generation by Conditional Recurrent Adversarial Network. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; International Joint Conferences on Artificial Intelligence Organization: Macao, China, 2019; pp. 919–925. [Google Scholar] [CrossRef]
- Zhu, H.; Huang, H.; Li, Y.; Zheng, A.; He, R. Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; International Joint Conferences on Artificial Intelligence Organization: Yokohama, Japan, 2020; pp. 2362–2368. [Google Scholar] [CrossRef]
- Prajwal, K.R.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C.V. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; ACM: Seattle, WA, USA, 2020; pp. 484–492. [Google Scholar] [CrossRef]
- Kumar, N.; Goel, S.; Narang, A.; Hasan, M. Robust One Shot Audio to Video Generation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 3334–3343. [Google Scholar] [CrossRef]
- Yaman, D.; Eyiokur, F.I.; Bärmann, L.; Aktı, S.; Ekenel, H.K.; Waibel, A. Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation. arXiv 2024, arXiv:2405.04327. [Google Scholar]
- Shen, S.; Zhao, W.; Meng, Z.; Li, W.; Zhu, Z.; Zhou, J.; Lu, J. DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Vancouver, BC, Canada, 2023; pp. 1982–1991. [Google Scholar] [CrossRef]
- Zhao, D.; Shi, J.; Li, W.; Wang, S.; Xu, S.; Pan, Z. Controllable Talking Face Generation by Implicit Facial Keypoints Editing. arXiv 2024, arXiv:2406.02880. [Google Scholar]
- Yin, F.; Zhang, Y.; Cun, X.; Cao, M.; Fan, Y.; Wang, X.; Bai, Q.; Wu, B.; Wang, J.; Yang, Y. StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-Trained StyleGAN. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Volume 13677, pp. 85–101. [Google Scholar] [CrossRef]
- Liu, T.; Chen, F.; Fan, S.; Du, C.; Chen, Q.; Chen, X.; Yu, K. AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding. arXiv 2024, arXiv:2405.03121. [Google Scholar]
- Wang, S.; Li, L.; Ding, Y.; Yu, X. One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning. AAAI 2022, 36, 2531–2539. [Google Scholar] [CrossRef]
- Yao, Z.; Cheng, X.; Huang, Z. FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model. arXiv 2024, arXiv:2408.09384. [Google Scholar]
- Lin, G.; Jiang, J.; Liang, C.; Zhong, T.; Yang, J.; Zheng, Y. CyberHost: Taming Audio-Driven Avatar Diffusion Model with Region Codebook Attention. arXiv 2024, arXiv:2409.01876. [Google Scholar]
- Zeng, D.; Liu, H.; Lin, H.; Ge, S. Talking Face Generation with Expression-Tailored Generative Adversarial Network. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; ACM: Seattle, WA, USA, 2020; pp. 1716–1724. [Google Scholar] [CrossRef]
- Zhou, H.; Sun, Y.; Wu, W.; Loy, C.C.; Wang, X.; Liu, Z. Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 4174–4184. [Google Scholar] [CrossRef]
- Eskimez, S.E.; Zhang, Y.; Duan, Z. Speech Driven Talking Face Generation From a Single Image and an Emotion Condition. IEEE Trans. Multimed. 2022, 24, 3480–3490. [Google Scholar] [CrossRef]
- Mittal, G.; Wang, B. Animating Face Using Disentangled Audio Representations. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; IEEE: Snowmass Village, CO, USA, 2020; pp. 3279–3287. [Google Scholar] [CrossRef]
- Ji, X.; Zhou, H.; Wang, K.; Wu, W.; Loy, C.C.; Cao, X.; Xu, F. Audio-Driven Emotional Video Portraits. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 14075–14084. [Google Scholar] [CrossRef]
- Zhang, W.; Cun, X.; Wang, X.; Zhang, Y.; Shen, X.; Guo, Y.; Shan, Y.; Wang, F. SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. arXiv 2023, arXiv:2211.12194. [Google Scholar]
- Yi, R.; Ye, Z.; Zhang, J.; Bao, H.; Liu, Y.-J. Audio-Driven Talking Face Video Generation with Learning-Based Personalized Head Pose. arXiv 2020, arXiv:2002.10137. [Google Scholar]
- Zhang, C.; Zhao, Y.; Huang, Y.; Zeng, M.; Ni, S.; Budagavi, M.; Guo, X. FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Montreal, QC, Canada, 2021; pp. 3847–3856. [Google Scholar] [CrossRef]
- Zhang, Z.; Li, L.; Ding, Y.; Fan, C. Flow-Guided One-Shot Talking Face Generation with a High-Resolution Audio-Visual Dataset. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 3660–3669. [Google Scholar] [CrossRef]
- Ma, Y.; Zhang, S.; Wang, J.; Wang, X.; Zhang, Y.; Deng, Z. DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models. arXiv 2023, arXiv:2312.09767. [Google Scholar]
- Lahiri, A.; Kwatra, V.; Frueh, C.; Lewis, J.; Bregler, C. LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video Using Pose and Lighting Normalization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 2754–2763. [Google Scholar] [CrossRef]
- Liang, J.; Lu, F. Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation. arXiv 2024, arXiv:2406.07895. [Google Scholar]
- Zhou, H.; Liu, Y.; Liu, Z.; Luo, P.; Wang, X. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. AAAI 2019, 33, 9299–9306. [Google Scholar] [CrossRef]
- Taylor, S.; Kim, T.; Yue, Y.; Mahler, M.; Krahe, J.; Rodriguez, A.G.; Hodgins, J.; Matthews, I. A Deep Learning Approach for Generalized Speech Animation. ACM Trans. Graph. 2017, 36, 1–11. [Google Scholar] [CrossRef]
- Pham, H.X.; Wang, Y.; Pavlovic, V. End-to-End Learning for 3D Facial Animation from Speech. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; ACM: Boulder, CO, USA, 2018; pp. 361–365. [Google Scholar] [CrossRef]
- Zhou, Y.; Xu, Z.; Landreth, C.; Kalogerakis, E.; Maji, S.; Singh, K. Visemenet: Audio-Driven Animator-Centric Speech Animation. ACM Trans. Graph. 2018, 37, 1–10. [Google Scholar] [CrossRef]
- Sadoughi, N.; Busso, C. Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks. IEEE Trans. Affect. Comput. 2021, 12, 1031–1044. [Google Scholar] [CrossRef]
- Cudeiro, D.; Bolkart, T.; Laidlaw, C.; Ranjan, A.; Black, M.J. Capture, Learning, and Synthesis of 3D Speaking Styles. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Long Beach, CA, USA, 2019; pp. 10093–10103. [Google Scholar] [CrossRef]
- Richard, A.; Lea, C.; Ma, S.; Gall, J.; La Torre, F.D.; Sheikh, Y. Audio- and Gaze-Driven Facial Animation of Codec Avatars. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; IEEE: Waikoloa, HI, USA, 2021; pp. 41–50. [Google Scholar] [CrossRef]
- Song, L.; Wu, W.; Qian, C.; He, R.; Loy, C.C. Everybody’s Talkin’: Let Me Talk as You Want. IEEE Trans. Inform. Forensic Secur. 2022, 17, 585–598. [Google Scholar] [CrossRef]
- Richard, A.; Zollhofer, M.; Wen, Y.; De La Torre, F.; Sheikh, Y. MeshTalk: 3D Face Animation from Speech Using Cross-Modality Disentanglement. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Montreal, QC, Canada, 2021; pp. 1153–1162. [Google Scholar] [CrossRef]
- Fan, Y.; Lin, Z.; Saito, J.; Wang, W.; Komura, T. Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation. Proc. ACM Comput. Graph. Interact. Tech. 2022, 5, 1–15. [Google Scholar] [CrossRef]
- Abdelaziz, A.H.; Theobald, B.-J.; Dixon, P.; Knothe, R.; Apostoloff, N.; Kajareker, S. Modality Dropout for Improved Performance-Driven Talking Faces. arXiv 2020, arXiv:2005.13616. [Google Scholar]
- Chen, L.; Cui, G.; Liu, C.; Li, Z.; Kou, Z.; Xu, Y.; Xu, C. Talking-Head Generation with Rhythmic Head Motion. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Volume 12354, pp. 35–51. [Google Scholar] [CrossRef]
- Huang, D.-Y.; Chandra, E.; Yang, X.; Zhou, Y.; Ming, H.; Lin, W.; Dong, M.; Li, H. Visual Speech Emotion Conversion Using Deep Learning for 3D Talking Head. In Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data, Seoul, Republic of Korea, 26 October 2018; ACM: Seoul, Republic of Korea, 2018; pp. 7–13. [Google Scholar] [CrossRef]
- Wang, Q.; Fan, Z.; Xia, S. 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head. arXiv 2021, arXiv:2104.12051. [Google Scholar]
- Thambiraja, B.; Habibie, I.; Aliakbarian, S.; Cosker, D.; Theobalt, C.; Thies, J. Imitator: Personalized Speech-Driven 3D Facial Animation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; IEEE: Paris, France, 2023; pp. 20564–20574. [Google Scholar] [CrossRef]
- Fan, Y.; Lin, Z.; Saito, J.; Wang, W.; Komura, T. FaceFormer: Speech-Driven 3D Facial Animation with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New Orleans, LA, USA, 2022; pp. 18749–18758. [Google Scholar] [CrossRef]
- Lu, L.; Zhang, T.; Liu, Y.; Chu, X.; Li, Y. Audio-Driven 3D Facial Animation from In-the-Wild Videos. arXiv 2023, arXiv:2306.11541. [Google Scholar]
- Chai, Y.; Shao, T.; Weng, Y.; Zhou, K. Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement. IEEE Trans. Visual. Comput. Graphics 2024, 30, 1803–1820. [Google Scholar] [CrossRef]
- Xing, J.; Xia, M.; Zhang, Y.; Cun, X.; Wang, J.; Wong, T.-T. CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Vancouver, BC, Canada, 2023; pp. 12780–12790. [Google Scholar] [CrossRef]
- Peng, Z.; Wu, H.; Song, Z.; Xu, H.; Zhu, X.; He, J.; Liu, H.; Fan, Z. EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; IEEE: Paris, France, 2023; pp. 20630–20640. [Google Scholar] [CrossRef]
- Daněček, R.; Chhatre, K.; Tripathi, S.; Wen, Y.; Black, M.J.; Bolkart, T. Emotional Speech-Driven Animation with Content-Emotion Disentanglement. In Proceedings of the SIGGRAPH Asia 2023 Conference Papers, Sydney, NSW, Australia, 12–15 December 2023; pp. 1–13. [Google Scholar] [CrossRef]
- Han, T.; Gui, S.; Huang, Y.; Li, B.; Liu, L.; Zhou, B.; Jiang, N.; Lu, Q.; Zhi, R.; Liang, Y.; et al. PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-Modal Features. arXiv 2023, arXiv:2312.02781. [Google Scholar]
- Sun, M.; Xu, C.; Jiang, X.; Liu, Y.; Sun, B.; Huang, R. Beyond Talking—Generating Holistic 3D Human Dyadic Motion for Communication. arXiv 2024, arXiv:2403.19467. [Google Scholar]
- He, S.; He, H.; Yang, S.; Wu, X.; Xia, P.; Yin, B.; Liu, C.; Dai, L.; Xu, C. Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; IEEE: Paris, France, 2023; pp. 14146–14156. [Google Scholar] [CrossRef]
- Liang, X.; Zhuang, W.; Wang, T.; Geng, G.; Geng, G.; Xia, H.; Xia, S. CSTalk: Correlation Supervised Speech-Driven 3D Emotional Facial Animation Generation. arXiv 2024, arXiv:2404.18604. [Google Scholar]
- Lin, Y.; Peng, L.; Hu, J.; Li, X.; Kang, W.; Lei, S.; Wu, X.; Xu, H. EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention. arXiv 2024, arXiv:2408.11518. [Google Scholar]
- Jafari, F.; Berretti, S.; Basu, A. JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model. arXiv 2024, arXiv:2408.01627. [Google Scholar]
- Zhuang, Y.; Cheng, B.; Cheng, Y.; Jin, Y.; Liu, R.; Li, C.; Cheng, X.; Liao, J.; Lin, J. Learn2Talk: 3D Talking Face Learns from 2D Talking Face. arXiv 2024, arXiv:2404.12888. [Google Scholar] [CrossRef] [PubMed]
- Ji, X.; Lin, C.; Ding, Z.; Tai, Y.; Zhu, J.; Hu, X.; Luo, D.; Ge, Y.; Wang, C. RealTalk: Real-Time and Realistic Audio-Driven Face Generation with 3D Facial Prior-Guided Identity Alignment Network. arXiv 2024, arXiv:2406.18284. [Google Scholar]
- Peng, Z.; Luo, Y.; Shi, Y.; Xu, H.; Zhu, X.; Liu, H.; He, J.; Fan, Z. SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; ACM: Ottawa, ON, Canada, 2023; pp. 5292–5301. [Google Scholar] [CrossRef]
- Fan, X.; Li, J.; Lin, Z.; Xiao, W.; Yang, L. UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model. arXiv 2024, arXiv:2408.00762. [Google Scholar]
- Chu, Z.; Guo, K.; Xing, X.; Lan, Y.; Cai, B.; Xu, X. CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation. arXiv 2023, arXiv:2310.11295. [Google Scholar] [CrossRef]
- Thambiraja, B.; Aliakbarian, S.; Cosker, D.; Thies, J. 3DiFACE: Diffusion-Based Speech-Driven 3D Facial Animation and Editing. arXiv 2023, arXiv:2312.00870. [Google Scholar]
- Xu, Z.; Zhang, J.; Liew, J.H.; Zhang, W.; Bai, S.; Feng, J.; Shou, M.Z. PV3D: A 3D Generative Model for Portrait Video Generation. arXiv 2022, arXiv:2212.06384. [Google Scholar]
- Stan, S.; Haque, K.I.; Yumak, Z. FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion. In Proceedings of the ACM SIGGRAPH Conference on Motion, Interaction and Games, Rennes, France, 15–17 November 2023; ACM: Rennes, France, 2023; pp. 1–11. [Google Scholar] [CrossRef]
- Chen, P.; Wei, X.; Lu, M.; Zhu, Y.; Yao, N.; Xiao, X.; Chen, H. DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser. arXiv 2023, arXiv:2311.16565. [Google Scholar]
- Papantoniou, F.P.; Filntisis, P.P.; Maragos, P.; Roussos, A. Neural Emotion Director: Speech-Preserving Semantic Control of Facial Expressions in “in-the-Wild” Videos. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New Orleans, LA, USA, 2022; pp. 18759–18768. [Google Scholar] [CrossRef]
- Ma, Z.; Zhu, X.; Qi, G.; Qian, C.; Zhang, Z.; Lei, Z. DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer. arXiv 2024, arXiv:2402.05712. [Google Scholar]
- Aneja, S.; Thies, J.; Dai, A.; Nießner, M. FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models. arXiv 2024, arXiv:2312.08459. [Google Scholar]
- Lin, Y.; Fan, Z.; Xiong, L.; Peng, L.; Li, X.; Kang, W.; Wu, X.; Lei, S.; Xu, H. GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer. arXiv 2024, arXiv:2408.01826. [Google Scholar]
- Xu, Z.; Gong, S.; Tang, J.; Liang, L.; Huang, Y.; Li, H.; Huang, S. KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding. arXiv 2024, arXiv:2409.01113. [Google Scholar]
- Zhao, Q.; Long, P.; Zhang, Q.; Qin, D.; Liang, H.; Zhang, L.; Zhang, Y.; Yu, J.; Xu, L. Media2Face: Co-Speech Facial Animation Generation with Multi-Modality Guidance. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ’24, Denver, CO, USA, 27 July–1 August 2024; ACM: Denver, CO, USA, 2024; pp. 1–13. [Google Scholar] [CrossRef]
- Kim, G.; Seo, K.; Cha, S.; Noh, J. NeRFFaceSpeech: One-Shot Audio-Driven 3D Talking Head Synthesis via Generative Prior. arXiv 2024, arXiv:2405.05749. [Google Scholar]
- Alghamdi, N.; Maddock, S.; Marxer, R.; Barker, J.; Brown, G.J. A Corpus of Audio-Visual Lombard Speech with Frontal and Profile Views. J. Acoust. Soc. Am. 2018, 143, EL523–EL529. [Google Scholar] [CrossRef] [PubMed]
- The Sheffield Audio-Visual Lombard Grid Corpus. Available online: https://spandh.dcs.shef.ac.uk/avlombard/ (accessed on 21 October 2024).
- Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. Available online: https://github.com/CheyneyComputerScience/CREMA-D (accessed on 21 October 2024). [CrossRef]
- Fanelli, G.; Gall, J.; Romsdorfer, H.; Weise, T.; Van Gool, L. A 3-D Audio-Visual Corpus of Affective Communication. IEEE Trans. Multimed. 2010, 12, 591–598. [Google Scholar] [CrossRef]
- 3-D Audio-Visual Corpus EULA. Available online: https://data.vision.ee.ethz.ch/cvl/datasets/B3DAC2/CorpusEULA.pdf (accessed on 21 October 2024).
- Harte, N.; Gillen, E. TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech. IEEE Trans. Multimed. 2015, 17, 603–615. [Google Scholar] [CrossRef]
- TCD-TIMIT Corpus. Available online: https://sigmedia.tcd.ie (accessed on 21 October 2024).
- Czyzewski, A.; Kostek, B.; Bratoszewski, P.; Kotus, J.; Szykulski, M. An Audio-Visual Corpus for Multimodal Automatic Speech Recognition. J. Intell. Inf. Syst. 2017, 49, 167–192. [Google Scholar] [CrossRef]
- Jachimski, D.; Czyzewski, A.; Ciszewski, T. A Comparative Study of English Viseme Recognition Methods and Algorithms. Multimed Tools Appl. 2018, 77, 16495–16532. [Google Scholar] [CrossRef]
- Kawaler, M.; Czyżewski, A. Database of Speech and Facial Expressions Recorded with Optimized Face Motion Capture Settings. J. Intell. Inf. Syst. 2019, 53, 381–404. [Google Scholar] [CrossRef]
- MODALITY Corpus. Available online: http://www.modality-corpus.org/ (accessed on 21 October 2024).
- Busso, C.; Parthasarathy, S.; Burmania, A.; AbdelWahab, M.; Sadoughi, N.; Provost, E.M. MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception. IEEE Trans. Affect. Comput. 2017, 8, 67–80. Available online: https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Improv.html (accessed on 21 October 2024). [CrossRef]
- Chung, J.S.; Zisserman, A. Lip Reading in the Wild. In Computer Vision—ACCV 2016; Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2017; Volume 10112, pp. 87–103. [Google Scholar] [CrossRef]
- Lip Reading in the Wild dataset. Available online: https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html (accessed on 21 October 2024).
- Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Lip Reading Sentences in the Wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 3444–3453. [Google Scholar] [CrossRef]
- Lip Reading Sentences Dataset. Available online: https://www.robots.ox.ac.uk/~vgg/data/lip_reading/ (accessed on 21 October 2024).
- Son, J.S.; Zisserman, A. Lip Reading in Profile. In Proceedings of the British Machine Vision Conference 2017, London, UK, 4–7 September 2017; British Machine Vision Association: London, UK, 2017; p. 155. [Google Scholar] [CrossRef]
- Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep Audio-Visual Speech Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8717–8727. [Google Scholar] [CrossRef] [PubMed]
- The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset. Available online: https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html (accessed on 21 October 2024).
- Afouras, T.; Chung, J.S.; Zisserman, A. LRS3-TED: A Large-Scale Dataset for Visual Speech Recognition. arXiv 2018, arXiv:1809.00496. [Google Scholar]
- Lip Reading Sentences 3. Available online: https://mmai.io/datasets/lip_reading/ (accessed on 21 October 2024).
- Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; ISCA: Stockholm, Sweden, 2017; pp. 2616–2620. [Google Scholar] [CrossRef]
- Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. Voxceleb: Large-Scale Speaker Verification in the Wild. Comput. Speech Lang. 2020, 60, 101027. [Google Scholar] [CrossRef]
- Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; ISCA: Hyderabad, India, 2018; pp. 1086–1090. [Google Scholar] [CrossRef]
- VoxCeleb Dataset. Available online: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/ (accessed on 21 October 2024).
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
- RAVDESS dataset. Available online: https://zenodo.org/records/1188976#.YFZuJ0j7SL8 (accessed on 21 October 2024).
- Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 527–536. [Google Scholar] [CrossRef]
- MELD: Multimodal EmotionLines Dataset. Available online: https://affective-meld.github.io/ (accessed on 21 October 2024).
- Vocaset Project. Available online: https://voca.is.tue.mpg.de (accessed on 21 October 2024).
- HDTF Dataset. Available online: https://github.com/MRzzm/HDTF (accessed on 21 October 2024).
- Wang, K.; Wu, Q.; Song, L.; Yang, Z.; Wu, W.; Qian, C.; He, R.; Qiao, Y.; Loy, C.C. MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Available online: https://wywu.github.io/projects/MEAD/MEAD.html (accessed on 21 October 2024).
- Zhu, H.; Wu, W.; Zhu, W.; Jiang, L.; Tang, S.; Zhang, L.; Liu, Z.; Loy, C.C. CelebV-HQ: A Large-Scale Video Facial Attributes Dataset. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Volume 13667, pp. 650–667. Available online: https://celebv-hq.github.io (accessed on 21 October 2024). [CrossRef]
- Wuu, C.; Zheng, N.; Ardisson, S.; Bali, R.; Belko, D.; Brockmeyer, E.; Evans, L.; Godisart, T.; Ha, H.; Huang, X.; et al. Multiface: A Dataset for Neural Face Rendering. arXiv 2022, arXiv:2207.11243. [Google Scholar]
- Multiface Dataset. Available online: https://github.com/facebookresearch/multiface (accessed on 21 October 2024).
- Wu, H.; Jia, J.; Xing, J.; Xu, H.; Wang, X.; Wang, J. MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D Face Animation. arXiv 2023, arXiv:2303.09797. [Google Scholar]
- MMFace4D Dataset. Available online: https://wuhaozhe.github.io/mmface4d (accessed on 21 October 2024).
- Sung-Bin, K.; Chae-Yeon, L.; Son, G.; Hyun-Bin, O.; Ju, J.; Nam, S.; Oh, T.-H. MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset. arXiv 2024, arXiv:2406.14272. [Google Scholar]
- MultiTalk Dataset. Available online: https://arxiv.org/pdf/2406.14272 (accessed on 21 October 2024).
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Narvekar, N.D.; Karam, L.J. A No-Reference Image Blur Metric Based on the Cumulative Probability of Blur Detection (CPBD). IEEE Trans. Image Process. 2011, 20, 2678–2683. [Google Scholar] [CrossRef] [PubMed]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Salt Lake City, UT, USA, 2018; pp. 586–595. [Google Scholar] [CrossRef]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Assael, Y.M.; Shillingford, B.; Whiteson, S.; de Freitas, N. LipNet: End-to-End Sentence-Level Lipreading. arXiv 2016, arXiv:1611.01599. [Google Scholar]
- Soukupova, T.; Cech, J. Eye blink detection using facial landmarks. In Proceedings of the 21st Computer Vision Winter Workshop, Laško, Slovenia, 3–5 February 2016; Volume 2. [Google Scholar]
- Chen, L.; Cui, G.; Kou, Z.; Zheng, H.; Xu, C. What Comprises a Good Talking-Head Video Generation?: A Survey and Benchmark. arXiv 2020, arXiv:2005.03201. [Google Scholar]
- Siyao, L.; Yu, W.; Gu, T.; Lin, C.; Wang, Q.; Qian, C.; Loy, C.C.; Liu, Z. Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New Orleans, LA, USA, 2022; pp. 11040–11049. [Google Scholar] [CrossRef]
Dataset Name | Year | Subjects | Utterances | Environment | Language | Emotions | Emotion Level | Views | Facial Mesh | Link |
---|---|---|---|---|---|---|---|---|---|---|
GRID [124] | 2006 | 54 | 5400 | Lab | English | Neutral | - | 2 views | No | [125] |
CREMA-D [126] | 2014 | 91 | 7442 | Lab | English | 6 | 3 | front | No | [126] |
BIWI [127] | 2014 | 14 | 1109 | Lab | English | 2 | - | - | Yes | [128] |
TCD-TIMIT [129] | 2015 | 62 | 6913 | Lab | English | Neutral | - | 2 views | No | [130] |
MODALITY [131,132,133] | 2015 | 35 | 5880 | Lab | English | Neutral | - | - | No | [134] |
MSP-IMPROV [135] | 2016 | 12 | 8438 | Lab | English | 4 | - | front | No | [135] |
LRW [136] | 2016 | - | ∼539 K | Wild | English | - | - | - | No | [137] |
LRS [138] | 2017 | - | ∼118 k | Wild | English | - | - | - | No | [139] |
MV-LRS [140] | 2018 | - | ∼500 k | Wild | English | - | - | - | No | [139] |
LRS2-BBC [141] | 2018 | ∼62.8 k | ∼144.5 k | Wild | English | - | - | - | No | [142] |
LRS3-TED [143] | 2018 | 9.5 k+ | ∼165 k+ | Wild | English | - | - | - | No | [144] |
VoxCeleb [145,146,147] | 2018 | 7 k+ | 1 M+ | Wild | English | - | - | - | No | [148] |
RAVDESS [149] | 2018 | 24 | 7356 | Lab | English | 8 | 2 | front | No | [150] |
MELD [151] | 2018 | - | 13 k | Wild | English | 6 | - | - | No | [152] |
VOCASET [86] | 2019 | 12 | 480 | Lab | English | - | - | - | Yes | [153] |
HDTF [77] | 2020 | 300+ | 10 k+ | Wild | English | - | - | - | No | [154] |
MEAD [155] | 2020 | 60 | 281.4 k | Lab | English | 8 | 3 | 7 | No | [155] |
CelebV-HQ [156] | 2022 | 15,653 | 35,666 | Wild | English | 8 | - | - | No | [156] |
Multiface [157] | 2022 | 13 | 299 k | Lab | English | 118 | 18 | 150 | Yes | [158] |
MMFace4D [159] | 2023 | 431 | 35,904 | Lab | Chinese | 7 | - | Front | Yes | [160] |
MultiTalk [161] | 2024 | - | 294 k | Wild | 20 Languages | - | - | - | No | [162] |
Metric | Comparison Level | Focus | Typical Range/Values |
---|---|---|---|
MSE | Pixel wise | Pixel wise error | 0 (perfect) to ∞ |
PSNR | Pixel wise | Signal-to-Noise | 20 dB (poor) to 40+ dB (excellent) |
SSIM [163] | Perception | Structural details | −1 (poor) to 1 (perfect) |
CPBD [164] | Perception | Sharpness and clarity | 0 (excellent) to 1 (poor) |
LPIPS [165] | Perception | Perceptual similarity | 0 (perfect) to ∞ |
FID [166] | Perception | Distribution similarity | 0 (perfect) to ∞ |
LMD [53] | Pixel wise | Landmark position error | Varies based on dataset, ideally < 5 pixels |
LRA [57] | Perception | Lip synchronization | 0 (poor) to 1 (perfect) |
WER [167] | Perception | Word errors | 0 (perfect) to 1 (poor) |
EAR [168] | Pixel wise | Openness of the eyes | 0 (closed) to 1 (fully open) |
ESD [169] | Perception | Emotion similarity | 0 (different) to 1 (same) |
LVE [89] | Vertex wise | Lip vertices error | Varies based on dataset, ideally < 2 mm |
FDD [99] | Perception | Facial dynamics deviation | 0 (perfect match) to ∞ |
BA [170] | Temporal wise | Temporal difference | 0 (poor) to 1 (perfect) |
LSE-D [59] | Temporal wise | Lip synchronization | 0 (perfect) to ∞, ideally < 2 |
LSE-C [59] | Temporal wise | Lip synchronization | Varies based on dataset, ideally > 8 |
Method | LVE (mm) | FDD (×10⁻⁵ m) |
---|---|---|
VOCA [86] | 6.7155 | 7.5320 |
MeshTalk [89] | 5.9181 | 5.1025 |
FaceFormer [96] | 4.9847 | 5.0972 |
CodeTalker [99] | 4.7914 | 4.1170 |
FaceDiffuser [115] | 4.2985 | 3.9101 |
Method | LSE-C (Lip Synchronization) | Diversity (Motion Diversity) | Beat Align (Motion Diversity) | FID (Image Quality) | PSNR (Image Quality) | SSIM (Image Quality) |
---|---|---|---|---|---|---|
Wav2Lip [59] | 10.08/8.13 | - | - | 22.67/23.85 | 32.33/35.19 | 0.740/0.653 |
MakeItTalk [45] | 4.89/2.96 | 0.238/0.260 | 0.221/0.252 | 28.96/31.77 | 17.95/21.08 | 0.623/0.529 |
SadTalker [74] | 6.11/4.51 | 0.275/0.319 | 0.296/0.328 | 23.76/24.19 | 35.78/37.90 | 0.746/0.690 |
DiffTalk [62] | 6.06/4.38 | 0.235/0.258 | 0.226/0.253 | 23.99/24.06 | 36.51/36.17 | 0.721/0.686 |
DreamTalk [78] | 6.93/4.76 | 0.236/0.257 | 0.213/0.249 | 24.30/23.61 | 32.82/33.16 | 0.738/0.692 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).