SDMuse: Stochastic Differential Music Editing and Generation via Hybrid Representation

Published: 12 June 2023

Abstract

While deep generative models have empowered music generation, editing an existing musical piece at fine granularity remains a challenging and under-explored problem. In this article, we propose SDMuse, a unified Stochastic Differential Music editing and generation framework, which can not only compose a whole musical piece from scratch, but also modify existing musical pieces in many ways, such as combination, continuation, inpainting, and style transfer. The proposed SDMuse follows a two-stage pipeline to achieve music generation and editing on top of a hybrid representation combining pianoroll and MIDI-event. In particular, SDMuse first generates or edits a pianoroll by iteratively denoising through a stochastic differential equation (SDE) based on a diffusion-model generative prior, and then refines the generated pianoroll and predicts MIDI-event tokens auto-regressively. We evaluate the music generated by our method on the ailabs1k7 pop music dataset in terms of quality and controllability across various music editing and generation tasks. Experimental results demonstrate the effectiveness of the proposed stochastic differential music editing and generation process, as well as of the hybrid representation.
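
The two-stage pipeline described above (SDE-based pianoroll denoising followed by autoregressive MIDI-event decoding) can be illustrated with a minimal sketch. Everything below is a hypothetical placeholder written for illustration only: the toy score function, the noise schedule, and the simplistic event decoder stand in for the trained diffusion model and the autoregressive token decoder, and none of it is the authors' implementation.

    """
    Minimal sketch of a two-stage edit/decode pipeline in the spirit of SDMuse's abstract.
    All names and the toy score function are hypothetical placeholders.
    """
    import numpy as np

    # ----- Stage 1: edit or generate a pianoroll by reverse-SDE denoising -----

    def toy_score(x, sigma):
        """Placeholder score network: the score of a standard-normal data prior
        perturbed with noise level sigma, i.e. -x / (1 + sigma^2).
        A real system would use a trained diffusion model over pianorolls."""
        return -x / (1.0 + sigma ** 2)

    def sde_edit_pianoroll(pianoroll, t0=0.5, n_steps=200,
                           sigma_min=0.01, sigma_max=10.0, seed=0):
        """SDEdit-style editing: perturb the input pianoroll with noise at level t0,
        then integrate the reverse-time VE SDE back to t=0 with Euler-Maruyama steps."""
        rng = np.random.default_rng(seed)
        # geometric noise schedule from sigma(t0) down to sigma(0) ~ sigma_min
        sigmas = sigma_min * (sigma_max / sigma_min) ** np.linspace(t0, 0.0, n_steps)
        x = pianoroll + sigmas[0] * rng.standard_normal(pianoroll.shape)
        for i in range(n_steps - 1):
            sigma, sigma_next = sigmas[i], sigmas[i + 1]
            step = sigma ** 2 - sigma_next ** 2                    # discretized variance step
            x = x + step * toy_score(x, sigma)                     # score-guided denoising drift
            x = x + np.sqrt(step) * rng.standard_normal(x.shape)   # stochastic diffusion term
        return x

    # ----- Stage 2: refine the pianoroll and decode MIDI-event tokens -----

    def pianoroll_to_events(pianoroll, threshold=0.5):
        """Placeholder for the autoregressive MIDI-event decoder: here we simply
        binarize the roll and emit note-on/note-off tokens left to right."""
        events, active = [], set()
        binary = pianoroll > threshold
        for step in range(binary.shape[1]):
            for pitch in range(binary.shape[0]):
                if binary[pitch, step] and pitch not in active:
                    events.append(("note_on", pitch, step)); active.add(pitch)
                elif not binary[pitch, step] and pitch in active:
                    events.append(("note_off", pitch, step)); active.discard(pitch)
        return events

    if __name__ == "__main__":
        roll = np.zeros((128, 64))              # pitches x time steps
        roll[60, 0:16] = 1.0                    # a held middle C to be edited
        edited = sde_edit_pianoroll(roll)       # stage 1: stochastic differential editing
        print(pianoroll_to_events(edited)[:5])  # stage 2: event decoding

In the actual system, the score function would be a trained diffusion model over pianorolls and the second stage would be an autoregressive model predicting MIDI-event tokens; the sketch only mirrors the control flow of the two stages.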


Published In

IEEE Transactions on Multimedia, Volume 26, 2024, 9891 pages

Publisher

IEEE Press
