TACR-Net: Editing on Deep Video and Voice Portraits

Published: 17 October 2021

Abstract

Using an arbitrary speech clip to edit the mouth of a portrait in a target video is a novel yet challenging task. Although impressive results have been achieved, existing methods still have three limitations: 1) because acoustic features are not completely decoupled from speaker identity, there is no global mapping from speech to facial features (i.e., landmarks, expression blendshapes); 2) audio-driven talking-face sequences generated by a simple cascade structure usually lack temporal consistency and spatial correlation, which leads to inconsistencies in how details change over time; 3) forgery is always performed at the video level, without considering forgery of the voice, especially synchronization between the converted voice and the mouth. To address these problems, we propose a novel deep learning framework, the Temporal-Refinement Autoregressive-Cascade Rendering Network (TACR-Net), for audio-driven dynamic talking-face editing. TACR-Net encodes facial expression blendshapes from the given acoustic features without requiring separate training for each target video. TACR-Net also incorporates a novel autoregressive cascade generator for video re-rendering. Finally, we transfer in-the-wild speech to the target portrait and obtain a photo-realistic and audio-realistic video.
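
As a rough illustration of the dataflow the abstract describes (acoustic features mapped to expression blendshapes, then frames re-rendered by an autoregressive cascade), the following minimal PyTorch sketch shows the two driving stages only. The module names, feature dimensions, GRU-based audio encoder, and fully connected frame decoder are illustrative assumptions made for exposition; they are not the authors' TACR-Net implementation.

```python
# Hypothetical sketch of an audio-to-blendshape encoder followed by an
# autoregressive frame renderer. All names and sizes are assumptions, not the
# paper's actual architecture.
import torch
import torch.nn as nn

class AudioToBlendshape(nn.Module):
    """Maps per-frame acoustic features to expression blendshape weights."""
    def __init__(self, audio_dim=80, blendshape_dim=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, blendshape_dim)

    def forward(self, audio_feats):            # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h)                    # (B, T, blendshape_dim)

class CascadeRenderer(nn.Module):
    """Autoregressive cascade: each frame is rendered conditioned on the
    previously generated frame to encourage temporal consistency."""
    def __init__(self, blendshape_dim=64, img_ch=3, size=64):
        super().__init__()
        self.size = size
        self.fc = nn.Linear(blendshape_dim + img_ch * size * size,
                            img_ch * size * size)

    def forward(self, blendshapes):            # (B, T, blendshape_dim)
        B, T, _ = blendshapes.shape
        prev = torch.zeros(B, 3 * self.size * self.size)
        frames = []
        for t in range(T):                     # autoregressive over time
            x = torch.cat([blendshapes[:, t], prev], dim=-1)
            prev = torch.tanh(self.fc(x))
            frames.append(prev.view(B, 3, self.size, self.size))
        return torch.stack(frames, dim=1)      # (B, T, 3, H, W)

# Usage with dummy mel-spectrogram-like input for a 10-frame clip.
audio = torch.randn(1, 10, 80)
blend = AudioToBlendshape()(audio)
video = CascadeRenderer()(blend)
print(blend.shape, video.shape)                # (1, 10, 64) and (1, 10, 3, 64, 64)
```

The explicit loop over time steps, feeding each rendered frame back as a condition for the next, is what an "autoregressive cascade" implies here: neighbouring frames are coupled, which is intended to reduce the temporal flicker that arises when frames are rendered independently.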

Supplementary Material

ZIP File (mfp0211aux.zip)
The video file includes a presentation of TACR-Net and dynamic visual results. The video covers: 1. TACR-Net overview; 2. the TEE+ACR+VCN pipeline; 3. comparisons with previous methods; 4. extended work. Each part includes visual and auditory effects, so we recommend turning on audio while watching.

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021

Author Tags

  1. autoregressive cascade structure
  2. talking face editing
  3. temporal-refinement
  4. voice conversion

Qualifiers

  • Research-article

Conference

MM '21
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) Adaptive Super Resolution for One-Shot Talking-Head Generation. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4115-4119. DOI: 10.1109/ICASSP48485.2024.10446837. Online publication date: 14-Apr-2024.
  • (2024) Tri²-plane: Thinking Head Avatar via Feature Pyramid. Computer Vision – ECCV 2024, pp. 1-20. DOI: 10.1007/978-3-031-72920-1_1. Online publication date: 1-Oct-2024.
  • (2023) Context-Aware Talking-Head Video Editing. Proceedings of the 31st ACM International Conference on Multimedia, pp. 7718-7727. DOI: 10.1145/3581783.3611765. Online publication date: 26-Oct-2023.
  • (2023) StableFace: Analyzing and Improving Motion Stability for Talking Face Generation. IEEE Journal of Selected Topics in Signal Processing, 17(6), pp. 1232-1247. DOI: 10.1109/JSTSP.2023.3333552. Online publication date: Nov-2023.
  • (2023) Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20782-20792. DOI: 10.1109/ICCV51070.2023.01905. Online publication date: 1-Oct-2023.
  • (2023) Structure Invariant Transformation for Better Adversarial Transferability. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4584-4596. DOI: 10.1109/ICCV51070.2023.00425. Online publication date: 1-Oct-2023.
  • (2022) Face Forgery Detection via Symmetric Transformer. Proceedings of the 30th ACM International Conference on Multimedia, pp. 4102-4111. DOI: 10.1145/3503161.3547806. Online publication date: 10-Oct-2022.
  • (2022) You Can Even Annotate Text with Voice: Transcription-only-Supervised Text Spotting. Proceedings of the 30th ACM International Conference on Multimedia, pp. 4154-4163. DOI: 10.1145/3503161.3547787. Online publication date: 10-Oct-2022.
  • (2022) Pose-Aware Speech Driven Facial Landmark Animation Pipeline for Automated Dubbing. IEEE Access, 10, pp. 133357-133369. DOI: 10.1109/ACCESS.2022.3231137. Online publication date: 2022.
  • (2022) Adaptive Face Forgery Detection in Cross Domain. Computer Vision – ECCV 2022, pp. 467-484. DOI: 10.1007/978-3-031-19830-4_27. Online publication date: 23-Oct-2022.
