DOI: 10.1609/aaai.v37i7.25996

FLAME: free-form language-based motion synthesis & editing

Published: 07 February 2023

Abstract

Text-based motion generation models are drawing a surge of interest for their potential to automate the motion-making process in the game, animation, and robotics industries. In this paper, we propose a diffusion-based motion synthesis and editing model named FLAME. Inspired by the recent success of diffusion models, we integrate diffusion-based generative modeling into the motion domain. FLAME can generate high-fidelity motions well aligned with the given text. It can also edit parts of a motion, both frame-wise and joint-wise, without any fine-tuning. FLAME involves a new transformer-based architecture that we devise to better handle motion data, which proves crucial for managing variable-length motions and attending well to free-form text. In experiments, we show that FLAME achieves state-of-the-art generation performance on three text-motion datasets: HumanML3D, BABEL, and KIT. We also demonstrate that FLAME's editing capability can be extended to other tasks, such as motion prediction and motion in-betweening, which have previously been covered by dedicated models.
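
To make the editing claim concrete, the sketch below illustrates one common way a diffusion model can edit selected frames or joints of a motion without any fine-tuning: during reverse diffusion, the regions to keep are re-imposed from a re-noised copy of the original motion while the masked regions are regenerated under the text condition. This is a minimal inpainting-style sketch, not the authors' released implementation; the denoiser signature, tensor shapes, and q_sample helper are assumptions introduced for illustration.

    import torch

    def q_sample(x0, t, alphas_cumprod):
        # Standard DDPM forward process: diffuse clean data x0 to noise level t.
        a = alphas_cumprod[t]
        return a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)

    def edit_motion(denoiser, motion, mask, text_emb, alphas_cumprod):
        # Regenerate the masked part of `motion` under a text condition.
        # denoiser(x, t, text_emb) -> x at level t-1   (hypothetical signature)
        # motion: (frames, joints, feat) original motion to edit
        # mask:   broadcastable to `motion`; 1 = keep original, 0 = regenerate
        T = alphas_cumprod.shape[0]
        x = torch.randn_like(motion)          # start the edited part from noise
        for t in reversed(range(T)):
            x = denoiser(x, t, text_emb)      # one text-conditioned reverse step
            if t > 0:
                # Overwrite the kept frames/joints with the original motion
                # noised to the current level, so the regenerated part stays
                # consistent with the fixed part.
                x = mask * q_sample(motion, t - 1, alphas_cumprod) + (1 - mask) * x
        return x

Under this recipe, frame-wise editing masks whole time steps and joint-wise editing masks individual joint channels; motion in-betweening then corresponds to masking a span of middle frames, and motion prediction to masking all frames after a prefix, which is how a single editing mechanism can subsume tasks previously handled by dedicated models.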


Information & Contributors

Information

Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN: 978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

Publication History

Published: 07 February 2023

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Cited By

  • (2024) Let's Dance Together! AI Dancers Can Dance to Your Favorite Music and Style. Companion Proceedings of the 26th International Conference on Multimodal Interaction, 88-90. DOI: 10.1145/3686215.3688373. Online publication date: 4-Nov-2024.
  • (2024) MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia, 3266-3274. DOI: 10.1145/3664647.3680684. Online publication date: 28-Oct-2024.
  • (2024) MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion. Proceedings of the 32nd ACM International Conference on Multimedia, 10794-10803. DOI: 10.1145/3664647.3680625. Online publication date: 28-Oct-2024.
  • (2024) ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models. Proceedings of the 32nd ACM International Conference on Multimedia, 273-281. DOI: 10.1145/3664647.3680597. Online publication date: 28-Oct-2024.
  • (2024) Taming Diffusion Probabilistic Models for Character Control. ACM SIGGRAPH 2024 Conference Papers, 1-10. DOI: 10.1145/3641519.3657440. Online publication date: 13-Jul-2024.
  • (2024) CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing. Computer Vision – ECCV 2024, 180-196. DOI: 10.1007/978-3-031-73397-0_11. Online publication date: 29-Sep-2024.
  • (2023) FineMoGen. Proceedings of the 37th International Conference on Neural Information Processing Systems, 13981-13992. DOI: 10.5555/3666122.3666738. Online publication date: 10-Dec-2023.
  • (2023) DreamHuman. Proceedings of the 37th International Conference on Neural Information Processing Systems, 10516-10529. DOI: 10.5555/3666122.3666584. Online publication date: 10-Dec-2023.
  • (2023) The KU-ISPL entry to the GENEA Challenge 2023 - A Diffusion Model for Co-speech Gesture generation. Companion Publication of the 25th International Conference on Multimodal Interaction, 220-227. DOI: 10.1145/3610661.3616551. Online publication date: 9-Oct-2023.
