Abstract
Recent advances in 2D/3D generative models enable the generation of dynamic 3D objects from a single-view video. Existing approaches utilize score distillation sampling to form the dynamic scene as a dynamic NeRF or dense 3D Gaussians. However, these methods struggle to strike a balance among reference view alignment, spatio-temporal consistency, and motion fidelity under single-view conditions due to the implicit nature of NeRF or the intricate dense Gaussian motion prediction. To address these issues, this paper proposes an efficient, sparse-controlled video-to-4D framework named SC4D that decouples motion and appearance to achieve superior video-to-4D generation. Moreover, we introduce Adaptive Gaussian (AG) initialization and Gaussian Alignment (GA) loss to mitigate the shape degeneration issue, ensuring the fidelity of the learned motion and shape. Comprehensive experimental results demonstrate that our method surpasses existing methods in both quality and efficiency. In addition, facilitated by the disentangled modeling of motion and appearance in SC4D, we devise a novel application that seamlessly transfers the learned motion onto a diverse array of 4D entities according to textual descriptions.
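To make the sparse-control idea concrete, below is a minimal sketch of how dense Gaussian motion can be driven by a small set of control points, in the spirit of embedded deformation and SC-GS: each Gaussian center is displaced by a radial-basis-weighted blend of the displacements of its nearest control points. The helper name, the k-nearest-neighbor weighting, and the toy data are illustrative assumptions, not the exact SC4D formulation.

```python
import numpy as np

def interpolate_gaussian_motion(gauss_xyz, ctrl_xyz, ctrl_disp, k=4, sigma=0.1):
    """Propagate sparse control-point displacements to dense Gaussian centers.

    Hypothetical helper: each dense Gaussian is moved by a radial-basis-weighted
    blend of the displacements of its k nearest control points (embedded
    deformation / SC-GS style). The actual SC4D deformation model may differ.
    """
    # Pairwise squared distances: dense Gaussians (N, 3) vs. control points (M, 3).
    d2 = ((gauss_xyz[:, None, :] - ctrl_xyz[None, :, :]) ** 2).sum(-1)  # (N, M)
    # Keep only the k nearest control points per Gaussian.
    nn = np.argsort(d2, axis=1)[:, :k]                                  # (N, k)
    nn_d2 = np.take_along_axis(d2, nn, axis=1)
    # Gaussian RBF weights, normalized per point.
    w = np.exp(-nn_d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-8
    # Blend the control-point displacements onto each Gaussian center.
    disp = (w[..., None] * ctrl_disp[nn]).sum(axis=1)                   # (N, 3)
    return gauss_xyz + disp

# Toy usage: 1000 Gaussians driven by 32 control points at one time step.
rng = np.random.default_rng(0)
gaussians = rng.uniform(-1, 1, size=(1000, 3))
controls = rng.uniform(-1, 1, size=(32, 3))
control_motion = 0.05 * rng.standard_normal((32, 3))
moved = interpolate_gaussian_motion(gaussians, controls, control_motion)
```

Because only the control points carry motion while the dense Gaussians carry appearance, swapping the appearance Gaussians while reusing the control-point trajectories is what enables the motion-transfer application described in the abstract.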
Acknowledgements
This work was supported by the National Science Fund for Distinguished Young Scholars of China (Grant No. 62225603).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, Z., Yu, C., Jiang, Y., Cao, C., Wang, F., Bai, X. (2025). SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_21
DOI: https://doi.org/10.1007/978-3-031-72624-8_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72623-1
Online ISBN: 978-3-031-72624-8
eBook Packages: Computer Science, Computer Science (R0)