Abstract
Score Distillation Sampling (SDS) with well-trained 2D diffusion models has shown great promise in text-to-3D generation. However, this paradigm distills a view-agnostic 2D image distribution into the rendering distribution of a 3D representation for each view independently, overlooking coherence across views and yielding 3D inconsistencies in the generated results. In this work, we propose Joint Score Distillation (JSD), a new paradigm that ensures coherent 3D generation. Specifically, we model the joint image distribution by introducing an energy function that captures the coherence among denoised images from the diffusion model. We then derive joint score distillation on multiple rendered views of the 3D representation, as opposed to the single view used in SDS. In addition, we instantiate three universal view-aware models as energy functions, demonstrating their compatibility with JSD. Empirically, JSD significantly mitigates the 3D inconsistency problem of SDS while maintaining text congruence. Moreover, we introduce a Geometry Fading scheme and a Classifier-Free Guidance (CFG) Switching strategy to enhance generative details. Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation, achieving an 88.5% CLIP R-Precision and a 27.7% CLIP Score. These metrics demonstrate exceptional text congruence, as well as remarkable geometric consistency and texture fidelity.
C. Jiang and Y. Zeng—Equal contribution.
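For intuition, the sketch below contrasts a JSD-style update with plain SDS: per-view noise predictions are coupled through the gradient of an energy term before being pushed back through the renderer, whereas SDS would use each view's prediction in isolation. Everything here (the toy one-latent-per-view "3D representation", the untrained stand-in score network, and the view-agreement energy) is an illustrative assumption, not the paper's actual models or formulation.

```python
import torch

N_VIEWS, C, H, W = 4, 3, 32, 32
# Stand-in "3D representation": one learnable latent image per view. A real
# pipeline would hold NeRF/mesh parameters rendered differentiably per camera.
theta = torch.randn(N_VIEWS, C, H, W, requires_grad=True)

def render(params, view_idx):
    # Toy "renderer": selects the view's slice so gradients flow back to theta.
    return params[view_idx]

# Toy stand-in for a frozen, pretrained 2D diffusion noise predictor.
score_net = torch.nn.Conv2d(C, C, kernel_size=3, padding=1)

def coherence_energy(preds):
    # Toy energy: penalizes disagreement between denoised views. In the paper,
    # a view-aware model plays this role; this mean-deviation term is only a
    # placeholder with the same interface.
    mean = preds.mean(dim=0, keepdim=True)
    return ((preds - mean) ** 2).mean()

t = 0.5  # noise level
views = torch.stack([render(theta, i) for i in range(N_VIEWS)])
noise = torch.randn_like(views)
noisy = (1.0 - t) * views + t * noise

with torch.no_grad():
    eps_pred = score_net(noisy)  # per-view noise predictions from the frozen prior

# Gradient of the energy w.r.t. the predictions couples the views together;
# plain SDS would stop at (eps_pred - noise) for each view independently.
eps_req = eps_pred.clone().requires_grad_(True)
energy_grad = torch.autograd.grad(coherence_energy(eps_req), eps_req)[0]

grad = (eps_pred - noise) + energy_grad  # joint gradient: SDS term + coherence coupling
views.backward(gradient=grad)            # push through the "renderer" into theta
print(theta.grad.shape)                  # torch.Size([4, 3, 32, 32])
```

The key design point the sketch isolates is that the coupling enters through a single scalar energy over all denoised views, so any differentiable view-aware model can be swapped in as the energy function without changing the update rule.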
Acknowledgment
This research has been made possible by funding support provided to Dit-Yan Yeung by the Research Grants Council of Hong Kong under the Research Impact Fund project R6003-21.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jiang, C. et al. (2025). JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_25
DOI: https://doi.org/10.1007/978-3-031-73347-5_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73346-8
Online ISBN: 978-3-031-73347-5
eBook Packages: Computer Science