DOI: 10.1145/3472749.3474789

SGToolkit: An Interactive Gesture Authoring Toolkit for Embodied Conversational Agents

Published: 12 October 2021

Abstract

Non-verbal behavior is essential for embodied agents such as social robots, virtual avatars, and digital humans. Existing behavior authoring approaches, including keyframe animation and motion capture, are too expensive to use when there are numerous utterances requiring gestures. Automatic generation methods show promising results, but their output quality is not yet satisfactory, and it is difficult for gesture designers to modify the output as they intend. We introduce a new gesture generation toolkit, named SGToolkit, which produces higher-quality output than automatic methods and is more efficient than manual authoring. For the toolkit, we propose a neural generative model that synthesizes gestures from speech and accommodates fine-level pose controls and coarse-level style controls from users. A user study with 24 participants showed that the toolkit was favored over manual authoring, and the generated gestures were human-like and appropriate to the input speech. SGToolkit is platform-agnostic, and the code is available at https://github.com/ai4r/SGToolkit.
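
The abstract summarizes the model's interface: speech input plus fine-level pose controls and coarse-level style controls. As a purely illustrative sketch (not the actual SGToolkit API; the array shapes, dimensions, and names below are assumptions), the following NumPy snippet shows one way such per-frame conditioning inputs could be packed together before being fed to a generator network.

```python
# Hypothetical sketch of packing control inputs for a speech-to-gesture model.
# NOT the actual SGToolkit API; shapes, names, and control dimensions are assumed.
import numpy as np

N_FRAMES = 60          # assumed gesture clip length (frames)
POSE_DIM = 27          # assumed upper-body pose dimension per frame
STYLE_DIM = 3          # assumed style vector, e.g. (speed, acceleration, handedness)
SPEECH_FEAT_DIM = 32   # assumed per-frame speech feature size

def pack_conditioning(speech_feats, pose_controls, pose_mask, style_controls):
    """Concatenate per-frame speech features with user controls.

    speech_feats:   (N_FRAMES, SPEECH_FEAT_DIM) audio/text features
    pose_controls:  (N_FRAMES, POSE_DIM) desired poses; ignored where mask is 0
    pose_mask:      (N_FRAMES, 1) 1 for frames the user constrained, else 0
    style_controls: (N_FRAMES, STYLE_DIM) coarse style values per frame
    """
    masked_poses = pose_controls * pose_mask  # zero out unconstrained frames
    return np.concatenate(
        [speech_feats, masked_poses, pose_mask, style_controls], axis=1
    )

# Example: constrain only frame 30 to a designer-specified pose, request faster motion.
speech_feats = np.random.randn(N_FRAMES, SPEECH_FEAT_DIM)
pose_controls = np.zeros((N_FRAMES, POSE_DIM))
pose_mask = np.zeros((N_FRAMES, 1))
pose_controls[30] = np.random.randn(POSE_DIM)  # the designer-specified pose
pose_mask[30] = 1.0
style_controls = np.tile([1.5, 0.0, 0.0], (N_FRAMES, 1))  # e.g. higher speed

cond = pack_conditioning(speech_feats, pose_controls, pose_mask, style_controls)
print(cond.shape)  # (60, 32 + 27 + 1 + 3) = (60, 63)
```

In this kind of design, a generator maps the conditioning sequence to output poses frame by frame; the mask lets a designer constrain only a few frames while the rest are filled in from speech. The real toolkit's input representation, dimensions, and controls may differ.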

Supplementary Material

  • VTT File (p826-talk.vtt)
  • VTT File (p826-video_figure.vtt)
  • VTT File (p826-video_preview.vtt)
  • MP4 File (p826-talk.mp4): talk video and captions
  • MP4 File (p826-video_figure.mp4): video figure and captions
  • MP4 File (p826-video_preview.mp4): video preview and captions

Published In

UIST '21: The 34th Annual ACM Symposium on User Interface Software and Technology
October 2021
1357 pages
ISBN:9781450386357
DOI:10.1145/3472749
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2021

Author Tags

  1. gesture authoring
  2. social behavior
  3. speech gestures
  4. toolkit

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

UIST '21

Acceptance Rates

Overall acceptance rate: 561 of 2,567 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 106
  • Downloads (last 6 weeks): 9
Reflects downloads up to 19 Feb 2025.

Cited By

  • (2025) Body Language Between Humans and Machines. In Body Language Communication, 443–476. DOI: 10.1007/978-3-031-70064-4_18. Online publication date: 2-Jan-2025
  • (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1–28. DOI: 10.1145/3656374. Online publication date: 27-Apr-2024
  • (2024) HapticPilot. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1–28. DOI: 10.1145/3631453. Online publication date: 12-Jan-2024
  • (2024) GesGPT: Speech Gesture Synthesis With Text Parsing From ChatGPT. IEEE Robotics and Automation Letters 9(3), 2718–2725. DOI: 10.1109/LRA.2024.3359544. Online publication date: Mar-2024
  • (2023) When Gestures and Words Synchronize: Exploring A Human Lecturer's Multimodal Interaction for the Design of Embodied Pedagogical Agents. Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing, 39–44. DOI: 10.1145/3584931.3607010. Online publication date: 14-Oct-2023
  • (2023) Large-scale Text-to-Image Generation Models for Visual Artists' Creative Works. Proceedings of the 28th International Conference on Intelligent User Interfaces, 919–933. DOI: 10.1145/3581641.3584078. Online publication date: 27-Mar-2023
  • (2023) The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. Proceedings of the 25th International Conference on Multimodal Interaction, 792–801. DOI: 10.1145/3577190.3616120. Online publication date: 9-Oct-2023
  • (2023) Augmented Co-Speech Gesture Generation. Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents, 1–8. DOI: 10.1145/3570945.3607337. Online publication date: 19-Sep-2023
  • (2023) A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Computer Graphics Forum 42(2), 569–596. DOI: 10.1111/cgf.14776. Online publication date: 23-May-2023
  • (2023) ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech. Computer Graphics Forum 42(1), 206–216. DOI: 10.1111/cgf.14734. Online publication date: 19-Feb-2023
