
Self-attention generative adversarial networks applied to conditional music generation

Multimedia Tools and Applications

Abstract

Audio and music generation in the waveform domain has become feasible due to recent advances in deep learning. Generative Adversarial Networks (GANs) are a class of generative model that has achieved success in areas such as image, video and audio generation. However, realistic audio generation with GANs remains challenging, owing to the specific characteristics inherent to this kind of data. In this paper we propose a GAN model that employs the self-attention mechanism and produces short chunks of music conditioned on the instrument. We compare our model to a baseline and run ablation studies to demonstrate its superiority. We also suggest some applications of the model, particularly in computer-assisted composition.
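The abstract names the model's two key ingredients: a self-attention mechanism inside the GAN and conditioning on the instrument. The PyTorch sketch below is an illustration only, not the authors' implementation: it pairs a SAGAN-style self-attention layer (Zhang et al. 2019) over the time axis of a 1-D feature map with a toy generator that injects an instrument id through a learned embedding. All module names, layer sizes, and shapes are assumptions made for the example.

```python
# Hedged sketch only, NOT the paper's architecture. It illustrates the two
# ideas in the abstract: SAGAN-style self-attention over time, and
# instrument conditioning via a learned embedding. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention1d(nn.Module):
    """Self-attention over the time axis of a (batch, channels, time) map."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        # Zero-initialised residual gate: the layer starts as the identity
        # and learns how much attention to mix in.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x).permute(0, 2, 1)         # (b, t, c//8)
        k = self.key(x)                            # (b, c//8, t)
        attn = F.softmax(torch.bmm(q, k), dim=-1)  # (b, t, t): each step
                                                   # attends to all others
        v = self.value(x)                          # (b, c, t)
        out = torch.bmm(v, attn.transpose(1, 2))   # (b, c, t)
        return self.gamma * out + x


class ConditionalGenerator(nn.Module):
    """Toy generator: latent vector + instrument id -> short raw-audio chunk."""

    def __init__(self, z_dim: int = 128, n_instruments: int = 4):
        super().__init__()
        self.embed = nn.Embedding(n_instruments, z_dim)  # instrument condition
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, 256 * 64),
            nn.Unflatten(1, (256, 64)),
            # Each transposed conv upsamples time 4x (kernel 25, stride 4).
            nn.ConvTranspose1d(256, 128, 25, stride=4, padding=12, output_padding=3),
            nn.ReLU(),
            SelfAttention1d(128),   # long-range temporal structure
            nn.ConvTranspose1d(128, 1, 25, stride=4, padding=12, output_padding=3),
            nn.Tanh(),              # waveform in [-1, 1]
        )

    def forward(self, z: torch.Tensor, instrument: torch.Tensor) -> torch.Tensor:
        h = torch.cat([z, self.embed(instrument)], dim=1)
        return self.net(h)


g = ConditionalGenerator()
chunk = g(torch.randn(2, 128), torch.tensor([0, 3]))  # -> (2, 1, 1024) samples
```

The zero-initialised `gamma` gate is the SAGAN design choice that lets the generator rely on local convolutions early in training and only gradually mix in global attention.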

Notes

  1. https://github.com/asteroid-team/torch-audiomentations

  2. www.ime.unicamp.br/~jbflorindo/samples.zip
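Footnote 1 refers to the torch-audiomentations library, presumably used for waveform data augmentation. A minimal usage sketch follows; the transforms and parameters shown are illustrative, patterned on the library's documented examples, not the paper's actual augmentation settings.

```python
# Minimal sketch of footnote 1's library (torch-audiomentations); the chosen
# transforms and parameters are illustrative, not the paper's pipeline.
import torch
from torch_audiomentations import Compose, Gain, PolarityInversion

augment = Compose(
    transforms=[
        Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5),
        PolarityInversion(p=0.5),
    ]
)

# The library expects float tensors shaped (batch, channels, samples).
audio = torch.randn(4, 1, 16384)
augmented = augment(audio, sample_rate=16000)
```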


Author information

Corresponding author

Correspondence to Pedro Lucas Tomaz Neves.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tomaz Neves, P.L., Fornari, J. & Batista Florindo, J. Self-attention generative adversarial networks applied to conditional music generation. Multimed Tools Appl 81, 24419–24430 (2022). https://doi.org/10.1007/s11042-022-12116-7

