DOI: 10.1609/aaai.v37i7.25996

FLAME: free-form language-based motion synthesis & editing

Published: 07 February 2023

Abstract

Text-based motion generation models are drawing a surge of interest for their potential to automate the motion-making process in the game, animation, and robotics industries. In this paper, we propose a diffusion-based motion synthesis and editing model named FLAME. Inspired by the recent success of diffusion models, we integrate diffusion-based generative modeling into the motion domain. FLAME can generate high-fidelity motions well aligned with the given text. It can also edit parts of a motion, both frame-wise and joint-wise, without any fine-tuning. FLAME involves a new transformer-based architecture that we devise to better handle motion data, which proves crucial for managing variable-length motions and attending well to free-form text. In experiments, we show that FLAME achieves state-of-the-art generation performance on three text-motion datasets: HumanML3D, BABEL, and KIT. We also demonstrate that FLAME's editing capability can be extended to other tasks, such as motion prediction and motion in-betweening, which have previously been covered by dedicated models.
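
To make the editing claim concrete, the sketch below illustrates one common way a diffusion model can edit selected frames or joints of a motion without any fine-tuning: during reverse diffusion, the regions to keep are re-imposed from a re-noised copy of the original motion while the masked regions are regenerated under the text condition. This is a minimal inpainting-style sketch, not the authors' released implementation; the denoiser signature, tensor shapes, and q_sample helper are assumptions introduced for illustration.

    import torch

    def q_sample(x0, t, alphas_cumprod):
        # Standard DDPM forward process: diffuse clean data x0 to noise level t.
        a = alphas_cumprod[t]
        return a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)

    def edit_motion(denoiser, motion, mask, text_emb, alphas_cumprod):
        # Regenerate the masked part of `motion` under a text condition.
        # denoiser(x, t, text_emb) -> x at level t-1   (hypothetical signature)
        # motion: (frames, joints, feat) original motion to edit
        # mask:   broadcastable to `motion`; 1 = keep original, 0 = regenerate
        T = alphas_cumprod.shape[0]
        x = torch.randn_like(motion)          # start the edited part from noise
        for t in reversed(range(T)):
            x = denoiser(x, t, text_emb)      # one text-conditioned reverse step
            if t > 0:
                # Overwrite the kept frames/joints with the original motion
                # noised to the current level, so the regenerated part stays
                # consistent with the fixed part.
                x = mask * q_sample(motion, t - 1, alphas_cumprod) + (1 - mask) * x
        return x

Under this recipe, frame-wise editing masks whole time steps and joint-wise editing masks individual joint channels; motion in-betweening then corresponds to masking a span of middle frames, and motion prediction to masking all frames after a prefix, which is how a single editing mechanism can subsume tasks previously handled by dedicated models.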


Information & Contributors

Information

Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN: 978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

Publication History

Published: 07 February 2023

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Cited By

  • (2024) Let's Dance Together! AI Dancers Can Dance to Your Favorite Music and Style. Companion Proceedings of the 26th International Conference on Multimodal Interaction, 88-90. DOI: 10.1145/3686215.3688373. Online publication date: 4-Nov-2024.
  • (2024) MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia, 3266-3274. DOI: 10.1145/3664647.3680684. Online publication date: 28-Oct-2024.
  • (2024) MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion. Proceedings of the 32nd ACM International Conference on Multimedia, 10794-10803. DOI: 10.1145/3664647.3680625. Online publication date: 28-Oct-2024.
  • (2024) ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models. Proceedings of the 32nd ACM International Conference on Multimedia, 273-281. DOI: 10.1145/3664647.3680597. Online publication date: 28-Oct-2024.
  • (2024) Taming Diffusion Probabilistic Models for Character Control. ACM SIGGRAPH 2024 Conference Papers, 1-10. DOI: 10.1145/3641519.3657440. Online publication date: 13-Jul-2024.
  • (2024) CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing. Computer Vision – ECCV 2024, 180-196. DOI: 10.1007/978-3-031-73397-0_11. Online publication date: 29-Sep-2024.
  • (2023) FineMoGen. Proceedings of the 37th International Conference on Neural Information Processing Systems, 13981-13992. DOI: 10.5555/3666122.3666738. Online publication date: 10-Dec-2023.
  • (2023) DreamHuman. Proceedings of the 37th International Conference on Neural Information Processing Systems, 10516-10529. DOI: 10.5555/3666122.3666584. Online publication date: 10-Dec-2023.
  • (2023) The KU-ISPL entry to the GENEA Challenge 2023 - A Diffusion Model for Co-speech Gesture generation. Companion Publication of the 25th International Conference on Multimodal Interaction, 220-227. DOI: 10.1145/3610661.3616551. Online publication date: 9-Oct-2023.
