DOI: 10.1145/3664647.3681487

Towards Emotion-enriched Text-to-Motion Generation via LLM-guided Limb-level Emotion Manipulating

Published: 28 October 2024

Abstract

Existing studies on text-to-motion generation (TMG) routinely focus on the objective alignment of text and motion and largely ignore subjective emotion information, especially limb-level emotion information. With this in mind, this paper proposes a new Emotion-enriched Text-to-Motion Generation (ETMG) task, which aims to generate motions that carry subjective emotion information. Further, this paper argues that injecting emotions into limbs (named intra-limb emotion injection) and ensuring the coordination and coherence of emotional motions after the injection (named inter-limb emotion disturbance) are both important and challenging in this ETMG task. To this end, this paper proposes an LLM-guided Limb-level Emotion Manipulating (L3EM) approach to ETMG. Specifically, this approach designs an LLM-guided intra-limb emotion modeling block to inject emotion into limbs, followed by a graph-structured inter-limb relation modeling block to ensure the coordination and coherence of emotional motions. In particular, this paper constructs a coarse-grained Emotional Text-to-Motion (EmotionalT2M) dataset and a fine-grained Limb-level Emotional Text-to-Motion (Limb-ET2M) dataset to justify the effectiveness of the proposed L3EM approach. Detailed evaluation demonstrates the significant advantage of the L3EM approach over state-of-the-art baselines, which justifies both the importance of limb-level emotion information for ETMG and the effectiveness of L3EM in coherently manipulating such information.
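To make the two-block design concrete, below is a minimal, illustrative PyTorch sketch of the ideas named in the abstract; it is not the authors' implementation. The five-limb body partition, the module names, the feature dimensions, and the single graph-convolution step over a star-shaped limb graph are all assumptions introduced here for illustration.

```python
# Illustrative sketch only, NOT the paper's L3EM implementation.
# Assumed: a five-limb partition, per-limb emotion embeddings (e.g. produced
# under LLM guidance), and one residual graph-convolution step across limbs.
import torch
import torch.nn as nn

NUM_LIMBS = 5  # assumed partition: torso, left/right arm, left/right leg


class IntraLimbEmotionInjection(nn.Module):
    """Fuse a per-limb emotion embedding into each limb's motion features
    (the 'intra-limb emotion injection' idea)."""

    def __init__(self, motion_dim: int, emotion_dim: int):
        super().__init__()
        self.fuse = nn.Linear(motion_dim + emotion_dim, motion_dim)

    def forward(self, limb_feats, limb_emotions):
        # limb_feats:    (batch, limbs, frames, motion_dim)
        # limb_emotions: (batch, limbs, emotion_dim), broadcast over frames
        e = limb_emotions.unsqueeze(2).expand(-1, -1, limb_feats.size(2), -1)
        return self.fuse(torch.cat([limb_feats, e], dim=-1))


class InterLimbRelationModeling(nn.Module):
    """Propagate information across limbs over a fixed limb-adjacency graph,
    so independently edited limbs stay coordinated (the graph-structured
    'inter-limb relation modeling' idea), realized as one GCN-style step."""

    def __init__(self, motion_dim: int, adjacency: torch.Tensor):
        super().__init__()
        # Row-normalized adjacency with self-loops, a common GCN choice.
        a = adjacency + torch.eye(adjacency.size(0))
        self.register_buffer("adj", a / a.sum(dim=1, keepdim=True))
        self.proj = nn.Linear(motion_dim, motion_dim)

    def forward(self, limb_feats):
        # Mix features along the limb axis: (batch, limbs, frames, dim).
        mixed = torch.einsum("ij,bjfd->bifd", self.adj, limb_feats)
        return limb_feats + torch.relu(self.proj(mixed))  # residual update


if __name__ == "__main__":
    batch, frames, motion_dim, emotion_dim = 2, 16, 64, 32
    # Assumed star-shaped skeleton graph: limb 0 (torso) touches all others.
    adjacency = torch.zeros(NUM_LIMBS, NUM_LIMBS)
    adjacency[0, 1:] = adjacency[1:, 0] = 1.0

    inject = IntraLimbEmotionInjection(motion_dim, emotion_dim)
    relate = InterLimbRelationModeling(motion_dim, adjacency)

    feats = torch.randn(batch, NUM_LIMBS, frames, motion_dim)
    emos = torch.randn(batch, NUM_LIMBS, emotion_dim)  # e.g. LLM-derived
    out = relate(inject(feats, emos))
    print(out.shape)  # torch.Size([2, 5, 16, 64])
```

The sketch mirrors the abstract's reasoning: injecting emotion into each limb independently can break whole-body coordination, so an explicit inter-limb message-passing step follows to restore coherence.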


Published In
    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. emotion-enriched text-to-motion
    2. limb-level emotion manipulating
    3. llm-guided diffusion model

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
