JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

  • Conference paper in: Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15084)
  • Included in the conference series: ECCV – European Conference on Computer Vision

Abstract

Score Distillation Sampling (SDS) with well-trained 2D diffusion models has shown great promise in text-to-3D generation. However, this paradigm distills the view-agnostic 2D image distribution into the rendering distribution of a 3D representation for each view independently, overlooking coherence across views and yielding 3D inconsistencies in the generated assets. In this work, we propose Joint Score Distillation (JSD), a new paradigm that ensures coherent 3D generation. Specifically, we model the joint image distribution, introducing an energy function that captures the coherence among denoised images from the diffusion model. We then derive joint score distillation over multiple rendered views of the 3D representation, as opposed to the single view used in SDS. In addition, we instantiate three universal view-aware models as energy functions, demonstrating their compatibility with JSD. Empirically, JSD significantly mitigates the 3D inconsistency problem of SDS while maintaining text congruence. Moreover, we introduce a Geometry Fading scheme and a Classifier-Free Guidance (CFG) Switching strategy to enhance generative details. Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation, achieving outstanding results with an 88.5% CLIP R-Precision and a 27.7% CLIP Score. These metrics demonstrate exceptional text congruence, as well as remarkable geometric consistency and texture fidelity.

C. Jiang and Y. Zeng—Equal contribution.

https://jointdreamer.github.io.
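To make the contrast between SDS and JSD concrete, the following is a minimal sketch in LaTeX notation. The SDS gradient is the standard form from DreamFusion (Poole et al., ICLR 2023); the JSD expression is an illustrative reconstruction from the abstract's description (a joint distribution over views coupled by an energy function), not the paper's exact formulation, and the coupling weight \lambda and the sign convention are assumptions.

SDS optimizes a 3D representation \theta one rendered view at a time. Writing x = g(\theta, c) for the rendering from camera c, x_t for its noised version at timestep t, \epsilon_\phi for the pretrained denoiser, and y for the text prompt:

\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon,c}\!\left[ w(t)\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right].

JSD instead couples N simultaneously rendered views x^1, \dots, x^N through an energy function \mathcal{E} that scores their cross-view coherence, so each view's score term picks up an energy-gradient correction:

\nabla_\theta \mathcal{L}_{\mathrm{JSD}} \approx \mathbb{E}_{t,\epsilon}\!\left[ w(t) \sum_{i=1}^{N} \Bigl(\epsilon_\phi(x_t^i; y, t) - \epsilon^i - \lambda\,\nabla_{x_t^i}\mathcal{E}(x_t^1, \dots, x_t^N; y)\Bigr)\,\frac{\partial x^i}{\partial \theta} \right].

Setting N = 1 with \mathcal{E} constant recovers SDS; the three universal view-aware models mentioned in the abstract would each serve as an instantiation of \mathcal{E}.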

Acknowledgment

This research has been made possible by funding support provided to Dit-Yan Yeung by the Research Grants Council of Hong Kong under the Research Impact Fund project R6003-21.

Author information

Corresponding author

Correspondence to Chenhan Jiang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 21936 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jiang, C. et al. (2025). JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_25

  • DOI: https://doi.org/10.1007/978-3-031-73347-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73346-8

  • Online ISBN: 978-3-031-73347-5

  • eBook Packages: Computer Science, Computer Science (R0)
