DOI: 10.1145/3474085.3475196
Research article

TACR-Net: Editing on Deep Video and Voice Portraits

Published: 17 October 2021

Abstract

Utilizing an arbitrary speech clip to edit the mouth of a portrait in a target video is a novel yet challenging task. Although impressive results have been achieved, existing methods still have three limitations: 1) because acoustic features are not fully decoupled from speaker identity, there is no global mapping from speech to facial features (i.e., landmarks and expression blendshapes); 2) audio-driven talking-face sequences generated by a simple cascade structure usually lack temporal consistency and spatial correlation, which leads to inconsistent changes in fine details; 3) forgery is performed only at the video level, without forging the voice and, in particular, without synchronizing the converted voice with the mouth. To address these problems, we propose a novel deep learning framework, the Temporal-Refinement Autoregressive-Cascade Rendering Network (TACR-Net), for audio-driven dynamic talking-face editing. TACR-Net encodes facial expression blendshapes from the given acoustic features without being trained separately for each video. TACR-Net also introduces a novel autoregressive cascade generator for video re-rendering. Finally, we transform in-the-wild speech to the target portrait and obtain a photo-realistic and audio-realistic video.
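
To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the two visual stages implied by the abstract: an audio-to-blendshape encoder followed by an autoregressive cascade renderer that conditions each frame on the previously rendered one, which is one way to encourage the temporal consistency the abstract calls for. The voice-conversion stage is omitted. All module names, dimensions, and interfaces below are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of the visual stages described in the abstract.
    # Module names, feature dimensions, and interfaces are assumptions for
    # illustration only; they are not the authors' implementation.
    import torch
    import torch.nn as nn

    class AudioToBlendshape(nn.Module):
        """Maps a sequence of acoustic features to per-frame expression blendshape coefficients."""
        def __init__(self, audio_dim=80, hidden_dim=256, blendshape_dim=64):
            super().__init__()
            self.encoder = nn.GRU(audio_dim, hidden_dim, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden_dim, blendshape_dim)

        def forward(self, audio_feats):                    # (B, T, audio_dim)
            hidden, _ = self.encoder(audio_feats)
            return self.head(hidden)                       # (B, T, blendshape_dim)

    class AutoregressiveCascadeRenderer(nn.Module):
        """Renders frame t from the blendshape code and the previously rendered frame,
        so each output depends autoregressively on its predecessor."""
        def __init__(self, blendshape_dim=64, img_channels=3, size=64):
            super().__init__()
            self.size = size
            self.code_proj = nn.Linear(blendshape_dim, size * size)   # broadcast code to a spatial map
            self.net = nn.Sequential(
                nn.Conv2d(img_channels + 1, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, img_channels, 3, padding=1), nn.Tanh(),
            )

        def forward(self, blendshapes, first_frame):       # (B, T, D), (B, C, size, size)
            frames, prev = [], first_frame
            for t in range(blendshapes.shape[1]):
                code_map = self.code_proj(blendshapes[:, t]).view(-1, 1, self.size, self.size)
                prev = self.net(torch.cat([prev, code_map], dim=1))
                frames.append(prev)
            return torch.stack(frames, dim=1)              # (B, T, C, size, size)

    if __name__ == "__main__":
        audio = torch.randn(1, 25, 80)                     # one second of acoustic features
        first = torch.zeros(1, 3, 64, 64)                  # reference portrait frame
        codes = AudioToBlendshape()(audio)
        video = AutoregressiveCascadeRenderer()(codes, first)
        print(video.shape)                                 # torch.Size([1, 25, 3, 64, 64])
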

Supplementary Material

ZIP File (mfp0211aux.zip)
The video file includes a presentation of TACR-Net and dynamic visual results. It covers: 1. TACR-Net overview; 2. the TEE+ACR+VCN pipeline; 3. comparison with previous methods; 4. extended work. Each part includes visual and auditory effects; we recommend turning on the audio while watching.

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021

Author Tags

  1. autoregressive cascade structure
  2. talking face editing
  3. temporal-refinement
  4. voice conversion

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) Adaptive Super Resolution for One-Shot Talking-Head Generation. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4115-4119. DOI: 10.1109/ICASSP48485.2024.10446837. Online publication date: 14-Apr-2024.
  • (2024) Tri²-plane: Thinking Head Avatar via Feature Pyramid. Computer Vision – ECCV 2024, pages 1-20. DOI: 10.1007/978-3-031-72920-1_1. Online publication date: 1-Oct-2024.
  • (2023) Context-Aware Talking-Head Video Editing. Proceedings of the 31st ACM International Conference on Multimedia, pages 7718-7727. DOI: 10.1145/3581783.3611765. Online publication date: 26-Oct-2023.
  • (2023) StableFace: Analyzing and Improving Motion Stability for Talking Face Generation. IEEE Journal of Selected Topics in Signal Processing, 17(6):1232-1247. DOI: 10.1109/JSTSP.2023.3333552. Online publication date: Nov-2023.
  • (2023) Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20782-20792. DOI: 10.1109/ICCV51070.2023.01905. Online publication date: 1-Oct-2023.
  • (2023) Structure Invariant Transformation for better Adversarial Transferability. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4584-4596. DOI: 10.1109/ICCV51070.2023.00425. Online publication date: 1-Oct-2023.
  • (2022) Face Forgery Detection via Symmetric Transformer. Proceedings of the 30th ACM International Conference on Multimedia, pages 4102-4111. DOI: 10.1145/3503161.3547806. Online publication date: 10-Oct-2022.
  • (2022) You Can even Annotate Text with Voice: Transcription-only-Supervised Text Spotting. Proceedings of the 30th ACM International Conference on Multimedia, pages 4154-4163. DOI: 10.1145/3503161.3547787. Online publication date: 10-Oct-2022.
  • (2022) Pose-Aware Speech Driven Facial Landmark Animation Pipeline for Automated Dubbing. IEEE Access, 10:133357-133369. DOI: 10.1109/ACCESS.2022.3231137. Online publication date: 2022.
  • (2022) Adaptive Face Forgery Detection in Cross Domain. Computer Vision – ECCV 2022, pages 467-484. DOI: 10.1007/978-3-031-19830-4_27. Online publication date: 23-Oct-2022.
