research-article

Public Access

To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations

Authors:

Chaitanya Ahuja,

Louis-Philippe Morency,

Yaser SheikhAuthors Info & Claims

ICMI '19: 2019 International Conference on Multimodal Interaction

Pages 74 - 84

https://doi.org/10.1145/3340555.3353725

Published: 14 October 2019 Publication History

All formats PDF

Abstract

Non verbal behaviours such as gestures, facial expressions, body posture, and para-linguistic cues have been shown to complement or clarify verbal messages. Hence to improve telepresence, in form of an avatar, it is important to model these behaviours, especially in dyadic interactions. Creating such personalized avatars not only requires to model intrapersonal dynamics between a avatar’s speech and their body pose, but it also needs to model interpersonal dynamics with the interlocutor present in the conversation. In this paper, we introduce a neural architecture named Dyadic Residual-Attention Model (DRAM), which integrates intrapersonal (monadic) and interpersonal (dyadic) dynamics using selective attention to generate sequences of body pose conditioned on audio and body pose of the interlocutor and audio of the human operating the avatar. We evaluate our proposed model on dyadic conversational data consisting of pose and audio of both participants, confirming the importance of adaptive attention between monadic and dyadic dynamics when predicting avatar pose. We also conduct a user study to analyze judgments of human observers. Our results confirm that the generated body pose is more natural, models intrapersonal dynamics and interpersonal dynamics better than non-adaptive monadic/dyadic models.

References

[1]

Shailen Agrawal and Michiel van de Panne. 2016. Task-based locomotion. ACM Transactions on Graphics (TOG) 35, 4 (2016), 82.

Digital Library

[2]

Chaitanya Ahuja and Louis-Philippe Morency. 2018. Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling. In AAAI-18. 4996–5003. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17394

[3]

Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition. 3686–3693.

Digital Library

[4]

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271(2018).

[5]

Jeremy N Bailenson, Nick Yee, Dan Merget, and Ralph Schroeder. 2006. The effect of behavioral realism and form realism of real-time avatar faces on verbal disclosure, nonverbal disclosure, emotion recognition, and copresence in dyadic interaction. Presence: Teleoperators and Virtual Environments 15, 4(2006), 359–372.

Digital Library

[6]

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).

Digital Library

[7]

Matthew Brand. 1999. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 21–28.

Digital Library

[8]

Justine Cassell and Kristinn R Thorisson. 1999. The power of a nod and a glance: Envelope vs. emotional feedback in animated conversational agents. Applied Artificial Intelligence 13, 4-5 (1999), 519–538.

[9]

Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2004. Beat: the behavior expression animation toolkit. In Life-Like Characters. Springer, 163–185.

[10]

Yu-Wei Chao, Jimei Yang, Brian L Price, Scott Cohen, and Jia Deng. [n. d.]. Forecasting Human Dynamics from Static Images.

[11]

Chung-Cheng Chiu and Stacy Marsella. 2011. How to train your avatar: A data driven approach to gesture generation. In International Workshop on Intelligent Virtual Agents. Springer, 127–140.

[12]

Chung-Cheng Chiu, Louis-Philippe Morency, and Stacy Marsella. 2015. Predicting co-verbal gestures: a deep and temporal modeling approach. In International Conference on Intelligent Virtual Agents. Springer, 152–166.

[13]

Hang Chu, Daiqing Li, and Sanja Fidler. 2018. A Face-to-Face Neural Conversation Model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7113–7121.

[14]

Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham Mysore, Fredo Durand, and William T. Freeman. 2014. The Visual Microphone: Passive Recovery of Sound from Video. ACM Transactions on Graphics (Proc. SIGGRAPH) 33, 4 (2014), 79:1–79:10.

Digital Library

[15]

Allen T Dittmann. 1972. The body movement-speech rhythm relationship as a cue to speech encoding. Studies in dyadic communication(1972), 135–152.

[16]

Florian Eyben, Klaus R Scherer, Björn W Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y Devillers, Julien Epps, Petri Laukka, Shrikanth S Narayanan, 2016. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing 7, 2 (2016), 190–202.

Digital Library

[17]

Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 835–838.

Digital Library

[18]

Tony Ezzat, Gadi Geiger, and Tomaso Poggio. 2002. Trainable videorealistic speech animation. Vol. 21. ACM.

[19]

Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.

[20]

Ruohan Gao, Rogerio Feris, and Kristen Grauman. 2018. Learning to separate object sounds by watching unlabeled video. arXiv preprint arXiv:1804.01665(2018).

[21]

Alex Graves. 2012. Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks. Springer, 5–13.

[22]

Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, and Taku Komura. 2017. A Recurrent Variational Autoencoder for Human Motion Synthesis. BMVC17 (2017).

[23]

Uri Hadar, TJ Steiner, and F Clifford Rose. 1984. The relationship between head movements and speech dysfluencies. Language and Speech 27, 4 (1984), 333–342.

[24]

Shangchen Han, Beibei Liu, Robert Wang, Yuting Ye, Christopher D Twigg, and Kenrick Kin. 2018. Online optical marker-based hand tracking with deep labels. ACM Transactions on Graphics (TOG) 37, 4 (2018), 166.

Digital Library

[25]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.

Digital Library

[26]

Alejandro Jaimes and Nicu Sebe. 2007. Multimodal human–computer interaction: A survey. Computer vision and image understanding 108, 1-2 (2007), 116–134.

Digital Library

[27]

Stanley E Jones and Curtis D LeBaron. 2002. Research on the relationship between verbal and nonverbal communication: Emerging integrations. Journal of Communication 52, 3 (2002), 499–521.

[28]

Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 94.

Digital Library

[29]

Jina Lee and Stacy Marsella. 2006. Nonverbal behavior generator for embodied conversational agents. In International Workshop on Intelligent Virtual Agents. Springer, 243–255.

Digital Library

[30]

Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG) 37, 4 (2018), 68.

Digital Library

[31]

Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A Quaternion-based Recurrent Model for Human Motion. arXiv preprint arXiv:1805.06485(2018).

[32]

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-guided Deep Reinforcement Learning of Physics-based Character Skills. ACM Trans. Graph. 37, 4, Article 143 (July 2018), 14 pages. https://doi.org/10.1145/3197517.3201311

Digital Library

[33]

Stefan Scherer, Stacy Marsella, Giota Stratou, Yuyu Xu, Fabrizio Morbini, Alesia Egan, Louis-Philippe Morency, 2012. Perception markup language: Towards a standardized representation of perceived nonverbal behaviors. In International Conference on Intelligent Virtual Agents. Springer, 455–463.

Digital Library

[34]

Konrad Schindler, Luc Van Gool, and Beatrice de Gelder. 2008. Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural networks 21, 9 (2008), 1238–1246.

[35]

Tomas Simon, Hanbyul Joo, Iain A Matthews, and Yaser Sheikh. 2017. Hand Keypoint Detection in Single Images Using Multiview Bootstrapping. In CVPR, Vol. 1. 2.

[36]

Namrata Singh and Sarvpal Singh. 2017. Virtual reality: A brief survey. In Information Communication and Embedded Systems (ICICES), 2017 International Conference on. IEEE, 1–6.

[37]

Anthony Steed and Ralph Schroeder. 2015. Collaboration in Immersive and Non-immersive Virtual Environments. In Immersed in Media. Springer, 263–282.

[38]

Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36, 4 (2017), 95.

Digital Library

[39]

Kenta Takeuchi, Dai Hasegawa, Shinichi Shirakawa, Naoshi Kaneko, Hiroshi Sakuta, and Kazuhiko Sumi. 2017. Speech-to-Gesture Generation: A Challenge in Deep Learning Approach with Bi-Directional LSTM. In Proceedings of the 5th International Conference on Human Agent Interaction. ACM, 365–369.

Digital Library

[40]

Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG) 36, 4 (2017), 93.

Digital Library

[41]

Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. [n. d.]. WaveNet: A generative model for raw audio.

[42]

Petra Wagner, Zofia Malisz, and Stefan Kopp. 2014. Gesture and speech in interaction: An overview.

[43]

Nigel Ward and Wataru Tsukahara. 2000. Prosodic features which cue back-channel responses in English and Japanese. Journal of pragmatics 32, 8 (2000), 1177–1207.

[44]

Martin Wöllmer, Moritz Kaiser, Florian Eyben, BjöRn Schuller, and Gerhard Rigoll. 2013. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing 31, 2 (2013), 153–163.

Digital Library

[45]

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923(2018).

[46]

Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The sound of pixels. arXiv preprint arXiv:1804.03160(2018).

Cited By

Wolfert PHenter GBelpaeme T(2024)Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal BehaviourApplied Sciences10.3390/app1404146014:4(1460)Online publication date: 10-Feb-2024
https://doi.org/10.3390/app14041460
Li GSeaborn K(2024)No Joke: An Embodied Conversational Agent Greeting Older Adults with Humour or a Smile Unrelated to Initial AcceptanceExtended Abstracts of the CHI Conference on Human Factors in Computing Systems10.1145/3613905.3650918(1-7)Online publication date: 11-May-2024
https://dl.acm.org/doi/10.1145/3613905.3650918
Ng ERomero JBagautdinov TBai SDarrell TKanazawa ARichard A(2024)From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)10.1109/CVPR52733.2024.00101(1001-1010)Online publication date: 16-Jun-2024
https://doi.org/10.1109/CVPR52733.2024.00101
Show More Cited By

Recommendations

Coalescing Narrative and Dialogue for Grounded Pose Forecasting
ICMI '19: 2019 International Conference on Multimodal Interaction

This research aims to create a data-driven end-to-end model for multimodal forecasting body pose and gestures of virtual avatars. A novel aspect of this research is to coalesce both narrative and dialogue for pose forecasting. In a narrative, language ...
Is chatting with a sophisticated chatbot as good as chatting online or FTF with a stranger?
Abstract
Emotionally-responsive chatbots are marketed as agents with which one can form emotional connections. They can also become weak ties in the outer layers of one's acquaintance network and available for social support. In this experiment,...
Highlights
- Adults had positive reactions after talking to the emotionally responsive Replika chatbot.
An End-to-End Conversational Style Matching Agent
IVA '19: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents

We present an end-to-end voice-based conversational agent that is able to engage in naturalistic multi-turn dialogue and align with the interlocutor's conversational style. The system uses a series of deep neural network components for speech ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

ICMI '19: 2019 International Conference on Multimodal Interaction

October 2019

601 pages

ISBN:9781450368605

DOI:10.1145/3340555

Editors:
Wen Gao
Peking University, China
,
Helen Mei Ling Meng
Chinese University of Hong Kong, China
,
Matthew Turk
Toyota Technological Institute at Chicago, USA
,
Susan R. Fussell
Cornell University, USA
,
Björn Schuller
Imperial College London / University of Augsburg, UK
,
Yale Song
Microsoft Research, USA
,
Kai Yu
Shanghai Jiao Tong University, China

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 October 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

ICMI '19

ICMI '19: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION

October 14 - 18, 2019

Suzhou, China

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

39
Total Citations
View Citations
1,005
Total Downloads

Downloads (Last 12 months)157
Downloads (Last 6 weeks)14

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Wolfert PHenter GBelpaeme T(2024)Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal BehaviourApplied Sciences10.3390/app1404146014:4(1460)Online publication date: 10-Feb-2024
https://doi.org/10.3390/app14041460
Li GSeaborn K(2024)No Joke: An Embodied Conversational Agent Greeting Older Adults with Humour or a Smile Unrelated to Initial AcceptanceExtended Abstracts of the CHI Conference on Human Factors in Computing Systems10.1145/3613905.3650918(1-7)Online publication date: 11-May-2024
https://dl.acm.org/doi/10.1145/3613905.3650918
Ng ERomero JBagautdinov TBai SDarrell TKanazawa ARichard A(2024)From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)10.1109/CVPR52733.2024.00101(1001-1010)Online publication date: 16-Jun-2024
https://doi.org/10.1109/CVPR52733.2024.00101
Ghosh ADabral RGolyanik VTheobalt CSlusallek P(2024)REMOS: 3D Motion-Conditioned Reaction Synthesis for Two-Person InteractionsComputer Vision – ECCV 202410.1007/978-3-031-72764-1_24(418-437)Online publication date: 25-Oct-2024
https://doi.org/10.1007/978-3-031-72764-1_24
Fares MPelachaud CObin N(2023)Zero-shot style transfer for gesture animation driven by text and speech using adversarial disentanglement of multimodal style encodingFrontiers in Artificial Intelligence10.3389/frai.2023.11429976Online publication date: 12-Jun-2023
https://doi.org/10.3389/frai.2023.1142997
Wolfert PHenter GBelpaeme T(2023)“Am I listening?”, Evaluating the Quality of Generated Data-driven Listening MotionCompanion Publication of the 25th International Conference on Multimodal Interaction10.1145/3610661.3617160(6-10)Online publication date: 9-Oct-2023
https://dl.acm.org/doi/10.1145/3610661.3617160
Woo JPelachaud CAchard C(2023)ASAP: Endowing Adaptation Capability to Agent in Human-Agent InteractionProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584081(464-475)Online publication date: 27-Mar-2023
https://dl.acm.org/doi/10.1145/3581641.3584081
Sha TZhang WShen TLi ZMei T(2023)Deep Person Generation: A Survey from the Perspective of Face, Pose, and Cloth SynthesisACM Computing Surveys10.1145/357565655:12(1-37)Online publication date: 28-Mar-2023
https://dl.acm.org/doi/10.1145/3575656
Ishii RMorikawa AEitoku SFukayama ANakamura TLugrin BLatoschik Mvon Mammen SKopp SPécune FPelachaud C(2023)How Far ahead Can Model Predict Gesture Pose from Speech and Spoken Text?Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents10.1145/3570945.3607336(1-3)Online publication date: 19-Sep-2023
https://dl.acm.org/doi/10.1145/3570945.3607336
Nyatsanga SKucherenko TAhuja CHenter GNeff M(2023)A Comprehensive Review of Data‐Driven Co‐Speech Gesture GenerationComputer Graphics Forum10.1111/cgf.1477642:2(569-596)Online publication date: 23-May-2023
https://doi.org/10.1111/cgf.14776
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents