DOI: 10.1145/3595916.3626363
Short paper · Open access

VQ-VDM: Video Diffusion Models with 3D VQGAN

Published: 01 January 2024

Abstract

In recent years, deep generative models have achieved impressive results, such as generating images that are indistinguishable from real ones. In particular, Latent Diffusion Models, a class of image generation models, have had a significant impact on society. Video generation is therefore attracting attention as the next modality. However, video generation is more challenging than image generation: because a video is a sequence of many frames, temporal consistency must be maintained and the computational cost grows substantially. In this study, we propose a video generation model based on diffusion models with a 3D VQGAN, which we call VQ-VDM. The proposed model is about nine times faster than Video Diffusion Models, which generate videos directly in pixel space, since our model generates a latent representation that is decoded into a video by the VQGAN decoder. Moreover, our model generates higher-quality videos than prior video generation methods, except for the state-of-the-art method.
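
The pipeline described above, compressing a video with a 3D VQGAN and running diffusion over the resulting latent, can be illustrated with a minimal PyTorch sketch. The module sizes, the class names VQGAN3D and LatentDenoiser, and the toy cosine noise schedule are assumptions made for exposition only, not the authors' implementation.

    import torch
    import torch.nn as nn

    class VQGAN3D(nn.Module):
        """Toy 3D VQGAN: spatio-temporal encoder, codebook quantizer, decoder."""
        def __init__(self, codebook_size=1024, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv3d(3, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),
                nn.Conv3d(64, latent_dim, kernel_size=4, stride=2, padding=1),
            )
            self.codebook = nn.Embedding(codebook_size, latent_dim)
            self.decoder = nn.Sequential(
                nn.ConvTranspose3d(latent_dim, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),
                nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
            )

        def quantize(self, z):
            # Nearest-codebook-entry lookup (straight-through gradient omitted).
            b, c, t, h, w = z.shape
            flat = z.permute(0, 2, 3, 4, 1).reshape(-1, c)
            idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
            return self.codebook(idx).view(b, t, h, w, c).permute(0, 4, 1, 2, 3)

        def encode(self, video):            # video: (B, 3, T, H, W)
            return self.quantize(self.encoder(video))

        def decode(self, z):
            return self.decoder(z)

    class LatentDenoiser(nn.Module):
        """Stand-in for the diffusion model that operates on VQGAN latents."""
        def __init__(self, latent_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(latent_dim, 128, 3, padding=1), nn.SiLU(),
                nn.Conv3d(128, latent_dim, 3, padding=1),
            )

        def forward(self, z_noisy, t):
            # A real denoiser would also condition on the timestep t; omitted here.
            return self.net(z_noisy)

    # One DDPM-style training step in latent space (noise-prediction objective).
    vqgan, denoiser = VQGAN3D(), LatentDenoiser()
    video = torch.randn(1, 3, 16, 64, 64)          # dummy 16-frame clip
    with torch.no_grad():
        z0 = vqgan.encode(video)                   # diffusion runs in this compressed space
    t = torch.randint(0, 1000, (1,))
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2   # toy cosine schedule
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise
    loss = nn.functional.mse_loss(denoiser(z_t, t), noise)
    # At sampling time, the denoised latent is mapped back to pixels by vqgan.decode(z).

In this sketch the latent is four times smaller than the input along each spatial and temporal axis, so the denoiser processes far fewer elements per step than a pixel-space video diffusion model, which is the intuition behind the reported speed-up.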

Supplementary Material

MP4 File (MMAsia2023_Kaji.mp4)
Videos generated by VQ-VDM

References

[1]
Aidan Clark, Jeff Donahue, and Karen Simonyan. 2019. Adversarial Video Generation on Complex Datasets. arXiv preprint arXiv:1907.06571 (2019).
[2]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proc. of IEEE Computer Vision and Pattern Recognition. 6299–6308.
[3]
Patrick Esser, Robin Rombach, and Björn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proc. of IEEE Computer Vision and Pattern Recognition. 12873–12883.
[4]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Proc. of Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[5]
Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
[6]
Jonathan Ho, Tim Salimans, Alexey A Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022. Video Diffusion Models. In Proc. of Advances in Neural Information Processing Systems.
[7]
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2022. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022).
[8]
Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. 2018. Probabilistic Video Generation using Holistic Attribute Control. In Proc. of European Conference on Computer Vision. 452–467.
[9]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proc. of IEEE Computer Vision and Pattern Recognition. 1725–1732.
[10]
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
[11]
Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. 2021. CCVS: Context-aware Controllable Video Synthesis. Proc. of Advances in Neural Information Processing Systems 34 (2021), 14042–14055.
[12]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proc. of IEEE Computer Vision and Pattern Recognition. 10684–10695.
[13]
Masaki Saito, Eiichi Matsumoto, and Shunta Saito. 2017. Temporal generative adversarial nets with singular value clipping. In Proc. of IEEE International Conference on Computer Vision. 2830–2839.
[14]
Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. 2020. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN. International Journal of Computer Vision 128, 10-11 (2020), 2586–2606.
[15]
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. Proc. of Advances in Neural Information Processing Systems 29 (2016).
[16]
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2018. MoCoGAN: Decomposing Motion and Content for Video Generation. In Proc. of IEEE Computer Vision and Pattern Recognition. 1526–1535.
[17]
Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. 2022. Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. In Proc. of International Conference on Learning Representations.
[18]
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. 2022. Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer. In Proc. of European Conference on Computer Vision.
[19]
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[20]
Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N Metaxas, and Sergey Tulyakov. 2021. A Good Image Generator Is What You Need for High-Resolution Video Synthesis. In Proc. of International Conference on Learning Representations.
[21]
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proc. of IEEE Computer Vision and Pattern Recognition.
[22]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proc. of IEEE International Conference on Computer Vision. 4489–4497.
[23]
Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018).
[24]
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Proc. of Advances in Neural Information Processing Systems 30 (2017).
[25]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Proc. of Advances in Neural Information Processing Systems 30 (2017).
[26]
Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. 2018. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Proc. of IEEE Computer Vision and Pattern Recognition. 2364–2373.
[27]
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. 2021. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157 (2021).
[28]
Yifan Liu, Hao Chen, Yu Chen, Wei Yin, and Chunhua Shen. 2021. Generic Perceptual Loss for Modeling Structured Output Dependencies. In Proc. of IEEE Computer Vision and Pattern Recognition. 5424–5432.

    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
    ISBN: 9798400702051
    DOI: 10.1145/3595916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 January 2024

    Author Tags

    1. 3D VQGAN
    2. diffusion models
    3. video generation

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Funding Sources

    • JSPS KAKENHI

    Conference

    MMAsia '23: ACM Multimedia Asia
    December 6 - 8, 2023
    Tainan, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%
