SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

This repository is the official implementation of "SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation".

Paper: arXiv
Demo page: Audio Samples
Checkpoints: Hugging Face (currently, only the checkpoints are available)

Contact:

Koichi SAITO: koichi.saito@sony.com

Checkpoints

Download the teacher model's checkpoints and the AudioLDM-s-full checkpoints (used for the VAE and vocoder), and place them in soundctm/ckpt.
SoundCTM checkpoint on AudioCaps (ema=0.999, 30K training iterations)

Inference uses both the AudioLDM-s-full checkpoint (for the VAE decoder and vocoder) and the SoundCTM checkpoint.
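Before running inference, it can help to confirm that the downloaded files are where the scripts will look for them. The following is a minimal sketch, not part of the repository: the helper name missing_checkpoints and the three file names are hypothetical placeholders, and the directory should be adjusted to wherever you placed the checkpoints.

```python
from pathlib import Path


def missing_checkpoints(ckpt_dir, expected_names):
    """Return the expected file names that are not present in ckpt_dir."""
    ckpt_dir = Path(ckpt_dir)
    return [name for name in expected_names if not (ckpt_dir / name).is_file()]


if __name__ == "__main__":
    # Hypothetical file names: replace them with the files you actually downloaded.
    expected = ["teacher_model.pt", "audioldm-s-full.ckpt", "soundctm.pt"]
    # Directory as named in this README; adjust it to your own layout.
    missing = missing_checkpoints("soundctm/ckpt", expected)
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All expected checkpoint files found.")
```

The function only checks that the files exist; it does not validate their contents.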

Prerequisites

Install Docker on your server and build the Docker image:

docker build -t soundctm .

Then run the scripts inside a container started from this image.

Training

Please see ctm_train.sh and ctm_train.py, and modify the folder paths depending on your environment.

Then run bash ctm_train.sh

Inference

Please see ctm_inference.sh and ctm_inference.py, and modify the folder paths depending on your environment.

Then run bash ctm_inference.sh

Numerical evaluation

Please see numerical_evaluation.sh and numerical_evaluation.py, and modify the folder paths depending on your environment.

Then run bash numerical_evaluation.sh

Dataset

Follow the instructions given in the AudioCaps repository for downloading the data. The data locations need to be specified in ctm_train.sh. You can also see some examples in data/train.csv.
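To see how the example file is laid out before pointing ctm_train.sh at your own data, you can print its header and first few rows. This is a small sketch using only Python's standard csv module; it makes no assumption about the column names, and the helper name preview_csv is ours rather than something the repository provides.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=3):
    """Return (column_names, first n_rows rows as dicts) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Reading the first rows also makes DictReader parse the header line.
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows
```

Calling preview_csv("data/train.csv") from the repository root returns the column names together with the first three rows, which you can compare against your own CSV files.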

WandB for logging

The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:

$ wandb login

Alternatively, you can pass an API key through the WANDB_API_KEY environment variable. (You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)

$ export WANDB_API_KEY="12345x6789y..."
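If you rely on the environment variable, a quick check before launching a long training run can catch a key that was never exported to the training process. The sketch below is an illustration only: wandb_key_available is a hypothetical helper, and it covers just the environment-variable route, since credentials saved by wandb login are stored separately and are not visible here.

```python
import os


def wandb_key_available(environ=None):
    """Return True if a non-empty WANDB_API_KEY is present in the environment."""
    # Accept a mapping for testing; default to the real process environment.
    environ = os.environ if environ is None else environ
    return bool(environ.get("WANDB_API_KEY", "").strip())


if __name__ == "__main__":
    if wandb_key_available():
        print("WANDB_API_KEY is set.")
    else:
        print("WANDB_API_KEY is not set; run `wandb login` or export the key.")
```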

Citation

@article{saito2024soundctm,
  title={SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation}, 
  author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
  journal={arXiv preprint arXiv:2405.18503},
  year={2024}
}

Reference

Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contribution.

https://github.com/sony/ctm

https://github.com/declare-lab/tango

https://github.com/haoheliu/AudioLDM

https://github.com/haoheliu/audioldm_eval
