DOI: 10.1145/3472749.3474789

SGToolkit: An Interactive Gesture Authoring Toolkit for Embodied Conversational Agents

Published: 12 October 2021

Abstract

Non-verbal behavior is essential for embodied agents such as social robots, virtual avatars, and digital humans. Existing behavior authoring approaches, including keyframe animation and motion capture, are too expensive to use when there are numerous utterances requiring gestures. Automatic generation methods show promising results, but their output quality is not yet satisfactory, and it is difficult for gesture designers to modify the output as they intend. We introduce a new gesture generation toolkit, named SGToolkit, which produces higher-quality output than automatic methods and is more efficient than manual authoring. For the toolkit, we propose a neural generative model that synthesizes gestures from speech and accommodates fine-level pose controls and coarse-level style controls from users. A user study with 24 participants showed that the toolkit was favored over manual authoring, and the generated gestures were human-like and appropriate to the input speech. SGToolkit is platform-agnostic, and the code is available at https://github.com/ai4r/SGToolkit.
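
The abstract summarizes the model's interface: speech input plus fine-level pose controls and coarse-level style controls. As a purely illustrative sketch (not the actual SGToolkit API; the array shapes, dimensions, and names below are assumptions), the following NumPy snippet shows one way such per-frame conditioning inputs could be packed together before being fed to a generator network.

```python
# Hypothetical sketch of packing control inputs for a speech-to-gesture model.
# NOT the actual SGToolkit API; shapes, names, and control dimensions are assumed.
import numpy as np

N_FRAMES = 60          # assumed gesture clip length (frames)
POSE_DIM = 27          # assumed upper-body pose dimension per frame
STYLE_DIM = 3          # assumed style vector, e.g. (speed, acceleration, handedness)
SPEECH_FEAT_DIM = 32   # assumed per-frame speech feature size

def pack_conditioning(speech_feats, pose_controls, pose_mask, style_controls):
    """Concatenate per-frame speech features with user controls.

    speech_feats:   (N_FRAMES, SPEECH_FEAT_DIM) audio/text features
    pose_controls:  (N_FRAMES, POSE_DIM) desired poses; ignored where mask is 0
    pose_mask:      (N_FRAMES, 1) 1 for frames the user constrained, else 0
    style_controls: (N_FRAMES, STYLE_DIM) coarse style values per frame
    """
    masked_poses = pose_controls * pose_mask  # zero out unconstrained frames
    return np.concatenate(
        [speech_feats, masked_poses, pose_mask, style_controls], axis=1
    )

# Example: constrain only frame 30 to a designer-specified pose, request faster motion.
speech_feats = np.random.randn(N_FRAMES, SPEECH_FEAT_DIM)
pose_controls = np.zeros((N_FRAMES, POSE_DIM))
pose_mask = np.zeros((N_FRAMES, 1))
pose_controls[30] = np.random.randn(POSE_DIM)  # the designer-specified pose
pose_mask[30] = 1.0
style_controls = np.tile([1.5, 0.0, 0.0], (N_FRAMES, 1))  # e.g. higher speed

cond = pack_conditioning(speech_feats, pose_controls, pose_mask, style_controls)
print(cond.shape)  # (60, 32 + 27 + 1 + 3) = (60, 63)
```

In this kind of design, a generator maps the conditioning sequence to output poses frame by frame; the mask lets a designer constrain only a few frames while the rest are filled in from speech. The real toolkit's input representation, dimensions, and controls may differ.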

Supplementary Material

  • VTT File (p826-talk.vtt)
  • VTT File (p826-video_figure.vtt)
  • VTT File (p826-video_preview.vtt)
  • MP4 File (p826-talk.mp4): talk video and captions
  • MP4 File (p826-video_figure.mp4): video figure and captions
  • MP4 File (p826-video_preview.mp4): video preview and captions

Published In

UIST '21: The 34th Annual ACM Symposium on User Interface Software and Technology
October 2021
1357 pages
ISBN:9781450386357
DOI:10.1145/3472749
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2021

Author Tags

  1. gesture authoring
  2. social behavior
  3. speech gestures
  4. toolkit

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

UIST '21

Acceptance Rates

Overall acceptance rate: 561 of 2,567 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 106
  • Downloads (last 6 weeks): 9
Reflects downloads up to 19 Feb 2025.

Cited By

  • (2025) Body Language Between Humans and Machines. In Body Language Communication, 443–476. DOI: 10.1007/978-3-031-70064-4_18. Online publication date: 2-Jan-2025
  • (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1–28. DOI: 10.1145/3656374. Online publication date: 27-Apr-2024
  • (2024) HapticPilot. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1–28. DOI: 10.1145/3631453. Online publication date: 12-Jan-2024
  • (2024) GesGPT: Speech Gesture Synthesis With Text Parsing From ChatGPT. IEEE Robotics and Automation Letters 9(3), 2718–2725. DOI: 10.1109/LRA.2024.3359544. Online publication date: Mar-2024
  • (2023) When Gestures and Words Synchronize: Exploring A Human Lecturer's Multimodal Interaction for the Design of Embodied Pedagogical Agents. Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing, 39–44. DOI: 10.1145/3584931.3607010. Online publication date: 14-Oct-2023
  • (2023) Large-scale Text-to-Image Generation Models for Visual Artists' Creative Works. Proceedings of the 28th International Conference on Intelligent User Interfaces, 919–933. DOI: 10.1145/3581641.3584078. Online publication date: 27-Mar-2023
  • (2023) The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. Proceedings of the 25th International Conference on Multimodal Interaction, 792–801. DOI: 10.1145/3577190.3616120. Online publication date: 9-Oct-2023
  • (2023) Augmented Co-Speech Gesture Generation. Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents, 1–8. DOI: 10.1145/3570945.3607337. Online publication date: 19-Sep-2023
  • (2023) A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Computer Graphics Forum 42(2), 569–596. DOI: 10.1111/cgf.14776. Online publication date: 23-May-2023
  • (2023) ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech. Computer Graphics Forum 42(1), 206–216. DOI: 10.1111/cgf.14734. Online publication date: 19-Feb-2023
