Computer Science > Computation and Language

arXiv:2312.07066 (cs)

[Submitted on 12 Dec 2023]

Title: DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models

Abstract: Recent advances in image and video creation, especially AI-based image synthesis, have led to the production of numerous visual scenes that exhibit a high level of abstractness and diversity. Consequently, Visual Storytelling (VST), a task that involves generating meaningful and coherent narratives from a collection of images, has become even more challenging and is increasingly desired beyond real-world imagery. While existing VST techniques, which typically use autoregressive decoders, have made significant progress, they suffer from low inference speed and are not well-suited for synthetic scenes. To this end, we propose a novel diffusion-based system DiffuVST, which models the generation of a series of visual descriptions as a single conditional denoising process. The stochastic and non-autoregressive nature of DiffuVST at inference time allows it to generate highly diverse narratives more efficiently. In addition, DiffuVST features a unique design with bi-directional text history guidance and multimodal adapter modules, which effectively improve inter-sentence coherence and image-to-text fidelity. Extensive experiments on the story generation task covering four fictional visual-story datasets demonstrate the superiority of DiffuVST over traditional autoregressive models in terms of both text quality and inference speed.

Comments:	EMNLP 2023 Findings
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.07066 [cs.CL]
	(or arXiv:2312.07066v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2312.07066

Submission history

From: Shengguang Wu
[v1] Tue, 12 Dec 2023 08:40:38 UTC (11,578 KB)
