Computer Science > Computer Vision and Pattern Recognition

arXiv:2408.05477 (cs)

[Submitted on 10 Aug 2024 (v1), last revised 20 Aug 2024 (this version, v2)]

Title: Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE

Authors: Yiying Yang, Fukun Yin, Jiayuan Fan, Xin Chen, Wanzhang Li, Gang Yu

Abstract: As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D objects from single or multimodal inputs, contributing to efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input remains challenging because of the difficulty of ensuring consistency across the extrapolated views that models generate. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model that not only ensures realism and diversity through the video generation framework but also uses implicit neural fields combined with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, we first warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. However, these filled images usually fail to maintain view consistency, so we use the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the detail and texture fidelity of generated views, we employ a GAN-based loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods. We show encouraging video examples at this https URL.
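
The warping step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera that translates purely horizontally, a known per-pixel depth map, and a grayscale image stored as nested lists. The function name `forward_warp_horizontal` and its parameters `focal` and `baseline` are hypothetical. Pixels that receive no source value are marked in a mask; these are the "invisible areas" that an inpainting model such as an MAE would then be asked to fill.

```python
from typing import List, Optional, Tuple


def forward_warp_horizontal(
    image: List[List[int]],
    depth: List[List[float]],
    focal: float,
    baseline: float,
) -> Tuple[List[List[Optional[int]]], List[List[bool]]]:
    """Forward-warp `image` to a camera shifted right by `baseline`.

    Assumption (not from the paper): each pixel moves left by the
    disparity focal * baseline / depth, as in a rectified stereo pair.
    Returns the warped image (None where nothing landed) and a hole mask.
    """
    height, width = len(image), len(image[0])
    warped: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    # Z-buffer: when two source pixels land on the same target, keep the nearer one.
    zbuf = [[float("inf")] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            z = depth[y][x]
            disparity = focal * baseline / z
            xt = int(round(x - disparity))
            if 0 <= xt < width and z < zbuf[y][xt]:
                zbuf[y][xt] = z
                warped[y][xt] = image[y][x]

    # True marks pixels with no source value (disocclusions and image borders).
    mask = [[value is None for value in row] for row in warped]
    return warped, mask


if __name__ == "__main__":
    row = [[10, 20, 30, 40, 50]]
    row_depth = [[2.0, 2.0, 2.0, 1.0, 2.0]]  # the pixel with value 40 is nearer
    warped, mask = forward_warp_horizontal(row, row_depth, focal=1.0, baseline=2.0)
    print(warped)  # [[20, 40, None, 50, None]]
    print(mask)    # [[False, False, True, False, True]]
```

In the example, the nearer pixel (value 40) moves farther and overwrites the background pixel it occludes, leaving a hole behind it and another at the right border. Filling such holes independently per view is what tends to break cross-view consistency, which is the problem the abstract says the neural radiance field optimization addresses.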

Comments:	arXiv admin note: text overlap with arXiv:2305.11588 by other authors
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.05477 [cs.CV]
	(or arXiv:2408.05477v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.05477

Submission history

From: Yiying Yang
[v1] Sat, 10 Aug 2024 08:09:57 UTC (27,783 KB)
[v2] Tue, 20 Aug 2024 10:16:00 UTC (41,487 KB)
