Abstract
Despite recent progress in text-to-video generation, existing studies usually overlook the issue that text controls only the spatial content of synthesized videos, not their temporal motions. To address this challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as an additional input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering that (1) text can only describe motions coarsely (e.g., without specifying the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of the text-to-motion mapping. Empirical evidence suggests that our approach decodes motion-related textual instructions into videos well, covering actions, camera movements, and even the conjuring of new content from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization. The project page is xavierchen34.github.io/LivePhoto-Page.
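The abstract describes four conditioning ingredients: a reference-image input to a frozen text-to-image generator, a temporal motion module, a scalar motion-intensity signal, and per-token text re-weighting. The following is a minimal, self-contained PyTorch sketch of how such signals could be wired into one denoising block. It is an illustration under stated assumptions, not the authors' implementation: the module names (TextReweighting, MotionModule, LivePhotoLikeBlock), the channel/dimension choices, and the 0-10 intensity bucketing are all hypothetical.

```python
# Minimal, self-contained PyTorch sketch (NOT the official LivePhoto code) of the
# conditioning ideas described in the abstract: a reference-image latent as an
# extra input, a temporal "motion module", a scalar motion-intensity signal, and
# per-token re-weighting of the text embedding. All names and sizes below are
# illustrative assumptions.
import torch
import torch.nn as nn


class TextReweighting(nn.Module):
    """Predicts a weight per text token so motion-related words can be emphasized."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, L, dim) -> softmax-normalized weights of shape (B, L, 1).
        w = torch.softmax(self.score(text_emb), dim=1)
        return text_emb * w * text_emb.shape[1]  # rescale so the mean weight is ~1


class MotionModule(nn.Module):
    """Toy temporal self-attention over the frame axis, applied per spatial location."""
    def __init__(self, channels: int = 320, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, F, C, H, W) -> attend across the F (frame) dimension.
        b, f, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        out, _ = self.attn(self.norm(seq), self.norm(seq), self.norm(seq))
        return x + out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)


class LivePhotoLikeBlock(nn.Module):
    """One denoising block fusing noisy frame latents, the reference-image latent,
    a (re-weighted) text embedding, and a bucketized motion intensity in [0, 10]."""
    def __init__(self, latent_ch: int = 4, channels: int = 320, text_dim: int = 768):
        super().__init__()
        # Reference latent is concatenated channel-wise with each noisy frame latent.
        self.in_conv = nn.Conv2d(latent_ch * 2, channels, 3, padding=1)
        self.intensity_emb = nn.Embedding(11, channels)   # 11 intensity buckets
        self.text_proj = nn.Linear(text_dim, channels)
        self.motion = MotionModule(channels)
        self.out_conv = nn.Conv2d(channels, latent_ch, 3, padding=1)

    def forward(self, noisy, ref, text_emb, intensity):
        # noisy: (B, F, 4, H, W); ref: (B, 4, H, W); text_emb: (B, L, 768); intensity: (B,) long
        b, f, c, h, w = noisy.shape
        ref_rep = ref.unsqueeze(1).expand(-1, f, -1, -1, -1)
        x = self.in_conv(torch.cat([noisy, ref_rep], dim=2).reshape(b * f, 2 * c, h, w))
        x = x + self.intensity_emb(intensity).repeat_interleave(f, dim=0)[..., None, None]
        x = x + self.text_proj(text_emb.mean(dim=1)).repeat_interleave(f, dim=0)[..., None, None]
        x = self.motion(x.reshape(b, f, -1, h, w))
        return self.out_conv(x.reshape(b * f, -1, h, w)).reshape(b, f, c, h, w)


if __name__ == "__main__":
    block = LivePhotoLikeBlock()
    noisy = torch.randn(1, 16, 4, 32, 32)              # 16 noisy frame latents
    ref = torch.randn(1, 4, 32, 32)                     # reference-image latent
    text = TextReweighting()(torch.randn(1, 77, 768))   # re-weighted text tokens
    out = block(noisy, ref, text, torch.tensor([7]))    # intensity level 7 of 10
    print(out.shape)  # torch.Size([1, 16, 4, 32, 32])
```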
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 62201484), HKU Startup Fund, and HKU Seed Fund for Basic Research.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, X. et al. (2025). LivePhoto: Real Image Animation with Text-Guided Motion Control. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_27
DOI: https://doi.org/10.1007/978-3-031-72649-1_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72648-4
Online ISBN: 978-3-031-72649-1
eBook Packages: Computer Science, Computer Science (R0)