DOI: 10.1145/3340555.3353725
Research Article
Public Access

To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations

Published: 14 October 2019

Abstract

Nonverbal behaviours such as gestures, facial expressions, body posture, and para-linguistic cues have been shown to complement or clarify verbal messages. Hence, to improve telepresence in the form of an avatar, it is important to model these behaviours, especially in dyadic interactions. Creating such personalized avatars requires modelling not only the intrapersonal dynamics between the avatar's speech and body pose, but also the interpersonal dynamics with the interlocutor present in the conversation. In this paper, we introduce a neural architecture named Dyadic Residual-Attention Model (DRAM), which integrates intrapersonal (monadic) and interpersonal (dyadic) dynamics using selective attention to generate sequences of body pose conditioned on the audio and body pose of the interlocutor and the audio of the human operating the avatar. We evaluate our proposed model on dyadic conversational data consisting of pose and audio of both participants, confirming the importance of adaptive attention between monadic and dyadic dynamics when predicting avatar pose. We also conduct a user study to analyze the judgments of human observers. Our results confirm that the generated body pose is more natural and models intrapersonal and interpersonal dynamics better than non-adaptive monadic/dyadic models.
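
As a reading aid, below is a minimal sketch of the kind of architecture the abstract describes: one recurrent stream encodes the operator's audio (monadic dynamics), a second encodes the interlocutor's audio and body pose (dyadic dynamics), and a learned gate mixes the two before a residual pose update. This is not the authors' implementation; the module names, GRU encoders, feature sizes (audio_dim, pose_dim, hidden_dim), and the sigmoid gate are illustrative assumptions.

```python
# Illustrative sketch only (assumed names and sizes), not the published DRAM code.
import torch
import torch.nn as nn


class DyadicResidualAttentionSketch(nn.Module):
    def __init__(self, audio_dim=64, pose_dim=20, hidden_dim=128):
        super().__init__()
        # Monadic stream: audio of the human operating the avatar.
        self.monadic_rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        # Dyadic stream: audio and body pose of the interlocutor.
        self.dyadic_rnn = nn.GRU(audio_dim + pose_dim, hidden_dim, batch_first=True)
        # Selective attention: per-timestep scalar weight between the two streams.
        self.attn = nn.Sequential(nn.Linear(2 * hidden_dim, 1), nn.Sigmoid())
        # Decoder maps the fused features to a pose offset.
        self.decoder = nn.Linear(hidden_dim, pose_dim)

    def forward(self, own_audio, other_audio, other_pose, prev_pose):
        # own_audio:  (B, T, audio_dim)   other_audio: (B, T, audio_dim)
        # other_pose: (B, T, pose_dim)    prev_pose:   (B, T, pose_dim)
        h_mono, _ = self.monadic_rnn(own_audio)
        h_dyad, _ = self.dyadic_rnn(torch.cat([other_audio, other_pose], dim=-1))
        alpha = self.attn(torch.cat([h_mono, h_dyad], dim=-1))   # (B, T, 1)
        fused = alpha * h_mono + (1.0 - alpha) * h_dyad          # adaptive monadic/dyadic mix
        return prev_pose + self.decoder(fused)                   # residual pose forecast


if __name__ == "__main__":
    model = DyadicResidualAttentionSketch()
    B, T = 2, 100
    out = model(torch.randn(B, T, 64), torch.randn(B, T, 64),
                torch.randn(B, T, 20), torch.randn(B, T, 20))
    print(out.shape)  # torch.Size([2, 100, 20])
```

In this sketch, the residual update lets the model fall back on the previous pose and predict only the change, while the per-timestep gate plays the role of the adaptive attention between monadic and dyadic dynamics that the abstract emphasizes.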




Information & Contributors

Published In

ICMI '19: 2019 International Conference on Multimodal Interaction
October 2019
601 pages
ISBN: 9781450368605
DOI: 10.1145/3340555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 October 2019


Author Tags

  1. dyadic interactions
  2. multimodal fusion
  3. pose forecasting

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '19

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 157
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 20 Nov 2024

Citations

Cited By

  • (2024) Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences 14(4), 1460. DOI: 10.3390/app14041460. Online publication date: 10-Feb-2024
  • (2024) No Joke: An Embodied Conversational Agent Greeting Older Adults with Humour or a Smile Unrelated to Initial Acceptance. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1–7. DOI: 10.1145/3613905.3650918. Online publication date: 11-May-2024
  • (2024) From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1001–1010. DOI: 10.1109/CVPR52733.2024.00101. Online publication date: 16-Jun-2024
  • (2024) REMOS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions. Computer Vision – ECCV 2024, 418–437. DOI: 10.1007/978-3-031-72764-1_24. Online publication date: 25-Oct-2024
  • (2023) Zero-shot style transfer for gesture animation driven by text and speech using adversarial disentanglement of multimodal style encoding. Frontiers in Artificial Intelligence 6. DOI: 10.3389/frai.2023.1142997. Online publication date: 12-Jun-2023
  • (2023) "Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion. Companion Publication of the 25th International Conference on Multimodal Interaction, 6–10. DOI: 10.1145/3610661.3617160. Online publication date: 9-Oct-2023
  • (2023) ASAP: Endowing Adaptation Capability to Agent in Human-Agent Interaction. Proceedings of the 28th International Conference on Intelligent User Interfaces, 464–475. DOI: 10.1145/3581641.3584081. Online publication date: 27-Mar-2023
  • (2023) Deep Person Generation: A Survey from the Perspective of Face, Pose, and Cloth Synthesis. ACM Computing Surveys 55(12), 1–37. DOI: 10.1145/3575656. Online publication date: 28-Mar-2023
  • (2023) How Far ahead Can Model Predict Gesture Pose from Speech and Spoken Text? Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents, 1–3. DOI: 10.1145/3570945.3607336. Online publication date: 19-Sep-2023
  • (2023) A Comprehensive Review of Data‐Driven Co‐Speech Gesture Generation. Computer Graphics Forum 42(2), 569–596. DOI: 10.1111/cgf.14776. Online publication date: 23-May-2023
