Computer Science > Computer Vision and Pattern Recognition

arXiv:2412.10718 (cs)

[Submitted on 14 Dec 2024 (v1), last revised 30 Jun 2025 (this version, v5)]

Title: Grid: Omni Visual Generation

Authors: Cong Wan, Xiangyang Luo, Hao Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Fan Wang, Yuhang He, Yihong Gong

Abstract: Visual generation has witnessed remarkable progress in single-image tasks, yet extending these capabilities to temporal sequences remains challenging. Current approaches either build specialized video models from scratch at enormous computational cost or add separate motion modules to image generators; both require learning temporal dynamics anew. We observe that modern image generation models possess underutilized potential in handling structured layouts with implicit temporal understanding. Building on this insight, we introduce GRID, which reformulates temporal sequences as grid layouts, enabling holistic processing of visual sequences while leveraging existing model capabilities. Through a parallel flow-matching training strategy with coarse-to-fine scheduling, our approach achieves up to 67× faster inference while using <1/1000 of the computational resources compared to specialized models. Extensive experiments demonstrate that GRID not only excels in temporal tasks from Text-to-Video to 3D Editing but also preserves strong performance in image generation, establishing itself as an efficient and versatile omni-solution for visual generation.
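
As a rough illustration of the grid reformulation the abstract describes, the following sketch tiles a short sequence of equally sized frames into one grid image and cuts it back apart. It is only a toy under stated assumptions: the nested-list frame format, the row-major ordering, and the names `frames_to_grid` and `grid_to_frames` are hypothetical and are not taken from the paper or its code, and nothing here covers the flow-matching training itself.

```python
from typing import List

# Hypothetical toy format: one frame is a list of rows of pixel values.
Frame = List[List[int]]


def frames_to_grid(frames: List[Frame], cols: int) -> Frame:
    """Tile equally sized frames, in row-major order, into one grid image."""
    if not frames or len(frames) % cols != 0:
        raise ValueError("frame count must be a positive multiple of cols")
    height = len(frames[0])
    grid: Frame = []
    for start in range(0, len(frames), cols):
        row_of_frames = frames[start:start + cols]
        for y in range(height):
            # Concatenate row y of every frame that sits in this grid row.
            grid.append([px for frame in row_of_frames for px in frame[y]])
    return grid


def grid_to_frames(grid: Frame, rows: int, cols: int) -> List[Frame]:
    """Invert frames_to_grid: cut the grid back into rows * cols frames."""
    height = len(grid) // rows
    width = len(grid[0]) // cols
    frames: List[Frame] = []
    for r in range(rows):
        for c in range(cols):
            frames.append([
                grid[r * height + y][c * width:(c + 1) * width]
                for y in range(height)
            ])
    return frames


# Four 2x2 toy frames, each filled with its own time index.
frames = [[[t, t], [t, t]] for t in range(4)]
grid = frames_to_grid(frames, cols=2)
recovered = grid_to_frames(grid, rows=2, cols=2)
```

In this layout the time axis becomes position in the grid, so a model that handles a single image sees the whole sequence at once. The row-major ordering is an arbitrary choice made for the sketch.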

Comments:	Codes: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.10718 [cs.CV]
	(or arXiv:2412.10718v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.10718

Submission history

From: Cong Wan
[v1] Sat, 14 Dec 2024 07:22:03 UTC (26,772 KB)
[v2] Tue, 17 Dec 2024 05:24:57 UTC (26,772 KB)
[v3] Fri, 10 Jan 2025 07:20:26 UTC (26,774 KB)
[v4] Tue, 21 Jan 2025 04:00:36 UTC (29,394 KB)
[v5] Mon, 30 Jun 2025 10:03:43 UTC (11,147 KB)
