SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15071)

Included in the following conference series: European Conference on Computer Vision

Abstract

Recent advances in 2D/3D generative models enable the generation of dynamic 3D objects from a single-view video. Existing approaches utilize score distillation sampling to represent the dynamic scene as a dynamic NeRF or as dense 3D Gaussians. However, these methods struggle to strike a balance among reference view alignment, spatio-temporal consistency, and motion fidelity under single-view conditions, owing to the implicit nature of NeRF or the intricacy of predicting motion for dense Gaussians. To address these issues, this paper proposes SC4D, an efficient sparse-controlled video-to-4D framework that decouples motion and appearance to achieve superior video-to-4D generation. Moreover, we introduce Adaptive Gaussian (AG) initialization and a Gaussian Alignment (GA) loss to mitigate the shape degeneration issue, ensuring the fidelity of the learned motion and shape. Comprehensive experimental results demonstrate that our method surpasses existing methods in both quality and efficiency. In addition, facilitated by SC4D's disentangled modeling of motion and appearance, we devise a novel application that seamlessly transfers the learned motion onto a diverse array of 4D entities according to textual descriptions.
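
To make the sparse-controlled idea concrete, the following is a minimal, hypothetical sketch (not the authors' released code): a small set of control points carries the per-frame motion, and each dense Gaussian center is displaced by a distance-weighted blend of its nearest control points. The function names, the nearest-neighbour weighting, and the purely translational per-frame motion are illustrative assumptions; SC4D's actual formulation additionally models appearance and uses the AG initialization and GA loss mentioned in the abstract.

```python
# Hypothetical illustration of sparse-controlled deformation (NumPy only):
# sparse control points carry per-frame translations, and dense Gaussian
# centers follow them via normalized inverse-distance blending weights.
import numpy as np

def blend_weights(gaussian_xyz, control_xyz, k=4, eps=1e-8):
    """For each Gaussian, indices and soft weights of its k nearest control points."""
    d2 = ((gaussian_xyz[:, None, :] - control_xyz[None, :, :]) ** 2).sum(-1)  # (N, M)
    idx = np.argsort(d2, axis=1)[:, :k]                                       # (N, k)
    nn_d2 = np.take_along_axis(d2, idx, axis=1)
    w = 1.0 / (nn_d2 + eps)
    w /= w.sum(axis=1, keepdims=True)                                         # rows sum to 1
    return idx, w

def deform(gaussian_xyz, control_delta, idx, w):
    """Displace each Gaussian by the blended translation of its control points."""
    nn_delta = control_delta[idx]                       # (N, k, 3)
    return gaussian_xyz + (w[..., None] * nn_delta).sum(axis=1)

# Toy usage: 5000 Gaussian centers driven by 64 control points for one frame.
rng = np.random.default_rng(0)
gaussians = rng.normal(size=(5000, 3))
controls = rng.normal(size=(64, 3))
idx, w = blend_weights(gaussians, controls)
frame_delta = 0.05 * rng.normal(size=(64, 3))           # stand-in for learned motion
deformed = deform(gaussians, frame_delta, idx, w)
```

Under such a decomposition, motion transfer roughly corresponds to reusing the control-point trajectories while swapping in a differently textured set of Gaussians for appearance.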

Notes

  1. Supp.: supplementary file.

Acknowledgements

This work was supported by the National Science Fund for Distinguished Young Scholars of China (Grant No. 62225603).

Author information

Corresponding author

Correspondence to Xiang Bai.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6259 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wu, Z., Yu, C., Jiang, Y., Cao, C., Wang, F., Bai, X. (2025). SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_21

  • DOI: https://doi.org/10.1007/978-3-031-72624-8_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72623-1

  • Online ISBN: 978-3-031-72624-8

  • eBook Packages: Computer Science, Computer Science (R0)
