DOI: 10.1145/3577190.3616118

Gesture Motion Graphs for Few-Shot Speech-Driven Gesture Reenactment

Published: 09 October 2023

Abstract

This paper presents the CASIA-GO entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) Challenge 2023. The system was originally designed for few-shot scenarios, such as generating gestures in the style of an arbitrary in-the-wild target speaker from short speech samples. Given a group of reference data comprising gesture sequences, audio, and text, it first constructs a gesture motion graph that captures the soft gesture units and the interframe continuity within the reference speech; when test audio and text are provided, new rhythmic and semantically appropriate gestures are then reenacted by pathfinding over this graph. To simulate a few-shot scenario, we randomly choose one clip from the training data as the reference for each test clip, and we provide compatible results for the subjective evaluations. Although each test clip uses on average only 0.25% of the whole training set, and the whole test set uses only 17.5% of it in total, the system produces valid results and ranks in the top third in the appropriateness-for-agent-speech evaluation.
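
As a rough illustration of the pipeline the abstract describes, the sketch below builds a toy gesture motion graph and reenacts motion by walking it: reference motion is cut into gesture units (nodes), units whose boundary frames are close are connected (edges that preserve interframe continuity), and the output is stitched together one unit per rhythmic unit of the test speech. This is a minimal sketch under stated assumptions, not the paper's method: the fixed-length segmentation, the boundary-distance threshold, and all function names are illustrative, and the actual system uses soft gesture units and rhythm- and semantics-aware pathfinding costs rather than a random walk. As a side note, the two utilization figures in the abstract are mutually consistent: at 0.25% of the training set per test clip, a 17.5% total suggests a test set of roughly 70 clips, assuming little overlap among the chosen reference clips.

import numpy as np

def build_gesture_graph(frames, seg_len=30, eps=0.5):
    # Split a reference motion clip (frames x features) into fixed-length
    # segments, a crude stand-in for the paper's soft gesture units, and
    # add an edge i -> j whenever the last frame of segment i is close to
    # the first frame of segment j, so traversing the edge preserves
    # interframe continuity.
    segs = [frames[s:s + seg_len]
            for s in range(0, len(frames) - seg_len + 1, seg_len)]
    edges = {i: [j for j, b in enumerate(segs)
                 if np.linalg.norm(segs[i][-1] - b[0]) < eps]
             for i in range(len(segs))}
    return segs, edges

def reenact(segs, edges, n_units, seed=0):
    # Walk the graph for n_units steps (e.g. one unit per speech onset
    # interval from librosa.onset.onset_detect) and concatenate the
    # visited segments. The paper scores candidate paths with rhythmic
    # and semantic costs instead of choosing successors at random.
    rng = np.random.default_rng(seed)
    path = [int(rng.integers(len(segs)))]
    for _ in range(n_units - 1):
        succ = edges[path[-1]]
        path.append(int(rng.choice(succ)) if succ
                    else int(rng.integers(len(segs))))
    return np.concatenate([segs[i] for i in path])

# Toy usage: a smooth random walk as 600 frames of 10-D "pose" features.
rng = np.random.default_rng(1)
motion = np.cumsum(rng.normal(0.0, 0.05, (600, 10)), axis=0)
segs, edges = build_gesture_graph(motion)
out = reenact(segs, edges, n_units=5)
print(out.shape)  # (150, 10): five 30-frame units stitched together

In the real system, the random successor choice would be replaced by a search over the graph (e.g. a beam search or shortest-path formulation) that scores candidate transitions against the onset timing of the test audio and the semantics of the transcript.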


Cited By

  • (2024) Gesture Area Coverage to Assess Gesture Expressiveness and Human-Likeness. Companion Proceedings of the 26th International Conference on Multimodal Interaction, 165–169. https://doi.org/10.1145/3686215.3688822 (online publication date: 4 Nov 2024)
  • (2023) The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. Proceedings of the 25th International Conference on Multimodal Interaction, 792–801. https://doi.org/10.1145/3577190.3616120 (online publication date: 9 Oct 2023)



      Published In

      ICMI '23: Proceedings of the 25th International Conference on Multimodal Interaction
      October 2023
      858 pages
ISBN: 9798400700552
DOI: 10.1145/3577190

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. few-shot
      2. motion graph
      3. speech-driven gesture generation

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICMI '23

      Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%


      Article Metrics

• Downloads (Last 12 months): 59
• Downloads (Last 6 weeks): 3

Reflects downloads up to 21 Nov 2024.

