
Self-attention generative adversarial networks applied to conditional music generation

Multimedia Tools and Applications

Abstract

Audio and music generation in the waveform domain has become feasible due to recent advances in deep learning. Generative Adversarial Networks (GANs) are a class of generative model that has achieved success in areas such as image, video and audio generation. However, realistic audio generation with GANs remains challenging, owing to the specific characteristics inherent to this kind of data. In this paper we propose a GAN model that employs the self-attention mechanism and produces short chunks of music conditioned on the instrument. We compare our model to a baseline and run ablation studies to demonstrate its superiority. We also suggest some applications of the model, particularly in computer-assisted composition.
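The abstract names the model's two key ingredients: a self-attention mechanism inside the GAN and conditioning on the instrument. The PyTorch sketch below is an illustration only, not the authors' implementation: it pairs a SAGAN-style self-attention layer (Zhang et al. 2019) over the time axis of a 1-D feature map with a toy generator that injects an instrument id through a learned embedding. All module names, layer sizes, and shapes are assumptions made for the example.

```python
# Hedged sketch only, NOT the paper's architecture. It illustrates the two
# ideas in the abstract: SAGAN-style self-attention over time, and
# instrument conditioning via a learned embedding. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention1d(nn.Module):
    """Self-attention over the time axis of a (batch, channels, time) map."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        # Zero-initialised residual gate: the layer starts as the identity
        # and learns how much attention to mix in.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x).permute(0, 2, 1)         # (b, t, c//8)
        k = self.key(x)                            # (b, c//8, t)
        attn = F.softmax(torch.bmm(q, k), dim=-1)  # (b, t, t): each step
                                                   # attends to all others
        v = self.value(x)                          # (b, c, t)
        out = torch.bmm(v, attn.transpose(1, 2))   # (b, c, t)
        return self.gamma * out + x


class ConditionalGenerator(nn.Module):
    """Toy generator: latent vector + instrument id -> short raw-audio chunk."""

    def __init__(self, z_dim: int = 128, n_instruments: int = 4):
        super().__init__()
        self.embed = nn.Embedding(n_instruments, z_dim)  # instrument condition
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, 256 * 64),
            nn.Unflatten(1, (256, 64)),
            # Each transposed conv upsamples time 4x (kernel 25, stride 4).
            nn.ConvTranspose1d(256, 128, 25, stride=4, padding=12, output_padding=3),
            nn.ReLU(),
            SelfAttention1d(128),   # long-range temporal structure
            nn.ConvTranspose1d(128, 1, 25, stride=4, padding=12, output_padding=3),
            nn.Tanh(),              # waveform in [-1, 1]
        )

    def forward(self, z: torch.Tensor, instrument: torch.Tensor) -> torch.Tensor:
        h = torch.cat([z, self.embed(instrument)], dim=1)
        return self.net(h)


g = ConditionalGenerator()
chunk = g(torch.randn(2, 128), torch.tensor([0, 3]))  # -> (2, 1, 1024) samples
```

The zero-initialised `gamma` gate is the SAGAN design choice that lets the generator rely on local convolutions early in training and only gradually mix in global attention.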

Notes

  1. https://github.com/asteroid-team/torch-audiomentations

  2. www.ime.unicamp.br/~jbflorindo/samples.zip
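Footnote 1 refers to the torch-audiomentations library, presumably used for waveform data augmentation. A minimal usage sketch follows; the transforms and parameters shown are illustrative, patterned on the library's documented examples, not the paper's actual augmentation settings.

```python
# Minimal sketch of footnote 1's library (torch-audiomentations); the chosen
# transforms and parameters are illustrative, not the paper's pipeline.
import torch
from torch_audiomentations import Compose, Gain, PolarityInversion

augment = Compose(
    transforms=[
        Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5),
        PolarityInversion(p=0.5),
    ]
)

# The library expects float tensors shaped (batch, channels, samples).
audio = torch.randn(4, 1, 16384)
augmented = augment(audio, sample_rate=16000)
```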


Author information

Corresponding author

Correspondence to Pedro Lucas Tomaz Neves.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tomaz Neves, P.L., Fornari, J. & Batista Florindo, J. Self-attention generative adversarial networks applied to conditional music generation. Multimed Tools Appl 81, 24419–24430 (2022). https://doi.org/10.1007/s11042-022-12116-7

