DOI: 10.1145/3595916.3626363
Short paper · Open access

VQ-VDM: Video Diffusion Models with 3D VQGAN

Published: 01 January 2024

Abstract

In recent years, deep generative models have achieved impressive results, such as generating images that are indistinguishable from real ones. In particular, Latent Diffusion Models, a class of image generation models, have had a significant impact on society. Video generation is therefore attracting attention as the next modality. However, video generation is more challenging than image generation: because a video is a sequence of many frames, temporal consistency must be maintained and the computational cost grows substantially. In this study, we propose a video generation model based on diffusion models with a 3D VQGAN, which we call VQ-VDM. The proposed model is about nine times faster than Video Diffusion Models, which generate videos directly in pixel space, since our model generates a latent representation that is decoded into a video by the VQGAN decoder. Moreover, our model generates higher-quality videos than prior video generation methods, except for the state-of-the-art method.
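
The pipeline described above, compressing a video with a 3D VQGAN and running diffusion over the resulting latent, can be illustrated with a minimal PyTorch sketch. The module sizes, the class names VQGAN3D and LatentDenoiser, and the toy cosine noise schedule are assumptions made for exposition only, not the authors' implementation.

    import torch
    import torch.nn as nn

    class VQGAN3D(nn.Module):
        """Toy 3D VQGAN: spatio-temporal encoder, codebook quantizer, decoder."""
        def __init__(self, codebook_size=1024, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv3d(3, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),
                nn.Conv3d(64, latent_dim, kernel_size=4, stride=2, padding=1),
            )
            self.codebook = nn.Embedding(codebook_size, latent_dim)
            self.decoder = nn.Sequential(
                nn.ConvTranspose3d(latent_dim, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),
                nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
            )

        def quantize(self, z):
            # Nearest-codebook-entry lookup (straight-through gradient omitted).
            b, c, t, h, w = z.shape
            flat = z.permute(0, 2, 3, 4, 1).reshape(-1, c)
            idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
            return self.codebook(idx).view(b, t, h, w, c).permute(0, 4, 1, 2, 3)

        def encode(self, video):            # video: (B, 3, T, H, W)
            return self.quantize(self.encoder(video))

        def decode(self, z):
            return self.decoder(z)

    class LatentDenoiser(nn.Module):
        """Stand-in for the diffusion model that operates on VQGAN latents."""
        def __init__(self, latent_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(latent_dim, 128, 3, padding=1), nn.SiLU(),
                nn.Conv3d(128, latent_dim, 3, padding=1),
            )

        def forward(self, z_noisy, t):
            # A real denoiser would also condition on the timestep t; omitted here.
            return self.net(z_noisy)

    # One DDPM-style training step in latent space (noise-prediction objective).
    vqgan, denoiser = VQGAN3D(), LatentDenoiser()
    video = torch.randn(1, 3, 16, 64, 64)          # dummy 16-frame clip
    with torch.no_grad():
        z0 = vqgan.encode(video)                   # diffusion runs in this compressed space
    t = torch.randint(0, 1000, (1,))
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2   # toy cosine schedule
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise
    loss = nn.functional.mse_loss(denoiser(z_t, t), noise)
    # At sampling time, the denoised latent is mapped back to pixels by vqgan.decode(z).

In this sketch the latent is four times smaller than the input along each spatial and temporal axis, so the denoiser processes far fewer elements per step than a pixel-space video diffusion model, which is the intuition behind the reported speed-up.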

Supplementary Material

MP4 File (MMAsia2023_Kaji.mp4)
Videos generated by VQ-VDM

References

[1]
Aidan Clark, Jeff Donahue, and Karen Simonyan. 2019. Adversarial Video Generation on Complex Datasets. arXiv preprint arXiv:1907.06571 (2019).
[2]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proc. of IEEE Computer Vision and Pattern Recognition. 6299–6308.
[3]
Patrick Esser, Robin Rombach, and Björn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proc. of IEEE Computer Vision and Pattern Recognition. 12873–12883.
[4]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Proc. of Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[5]
Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
[6]
Jonathan Ho, Tim Salimans, Alexey A Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022. Video Diffusion Models. In Proc. of Advances in Neural Information Processing Systems.
[7]
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2022. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022).
[8]
Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. 2018. Probabilistic Video Generation using Holistic Attribute Control. In Proc. of European Conference on Computer Vision. 452–467.
[9]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proc. of IEEE Computer Vision and Pattern Recognition. 1725–1732.
[10]
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
[11]
Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. 2021. CCVS: Context-aware Controllable Video Synthesis. Proc. of Advances in Neural Information Processing Systems 34 (2021), 14042–14055.
[12]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proc. of IEEE Computer Vision and Pattern Recognition. 10684–10695.
[13]
Masaki Saito, Eiichi Matsumoto, and Shunta Saito. 2017. Temporal generative adversarial nets with singular value clipping. In Proc. of IEEE International Conference on Computer Vision. 2830–2839.
[14]
Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. 2020. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN. International Journal of Computer Vision 128, 10-11 (2020), 2586–2606.
[15]
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. Proc. of Advances in Neural Information Processing Systems 29 (2016).
[16]
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2018. MoCoGAN: Decomposing Motion and Content for Video Generation. In Proc. of IEEE Computer Vision and Pattern Recognition. 1526–1535.
[17]
Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. 2022. Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. In Proc. of International Conference on Learning Representations.
[18]
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. 2022. Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer. In Proc. of European Conference on Computer Vision.
[19]
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[20]
Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N Metaxas, and Sergey Tulyakov. 2021. A Good Image Generator Is What You Need for High-Resolution Video Synthesis. In Proc. of International Conference on Learning Representations.
[21]
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proc. of IEEE Computer Vision and Pattern Recognition.
[22]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proc. of IEEE International Conference on Computer Vision. 4489–4497.
[23]
Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018).
[24]
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Proc. of Advances in Neural Information Processing Systems 30 (2017).
[25]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Proc. of Advances in Neural Information Processing Systems 30 (2017).
[26]
Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. 2018. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Proc. of IEEE Computer Vision and Pattern Recognition. 2364–2373.
[27]
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. 2021. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157 (2021).
[28]
Yifan Liu, Hao Chen, Yu Chen, Wei Yin, and Chunhua Shen. 2021. Generic Perceptual Loss for Modeling Structured Output Dependencies. In Proc. of IEEE Computer Vision and Pattern Recognition. 5424–5432.

    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
    ISBN: 9798400702051
    DOI: 10.1145/3595916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 January 2024

    Author Tags

    1. 3D VQGAN
    2. diffusion models
    3. video generation

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Funding Sources

    • JSPS KAKENHI

    Conference

    MMAsia '23: ACM Multimedia Asia
    December 6 - 8, 2023
    Tainan, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%
