Abstract
Audio and music generation in the waveform domain has become feasible thanks to recent advances in deep learning. Generative Adversarial Networks (GANs) are a class of generative model that has achieved success in areas such as image, video and audio generation. However, realistic audio generation with GANs remains a challenge, owing to the specific characteristics inherent to this kind of data. In this paper we propose a GAN model that employs the self-attention mechanism and produces short chunks of music conditioned on instrument. We compare our model to a baseline and conduct ablation studies to demonstrate its advantages. We also suggest some applications of the model, particularly in the area of computer-assisted composition.
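To make the core architectural idea concrete, the following is a minimal sketch of a SAGAN-style self-attention layer adapted to 1-D (waveform) feature maps, of the kind such a generator might interleave with its convolutional layers. This is an illustrative reconstruction in PyTorch, not the authors' exact implementation; the class name, channel-reduction factor, and layer placement are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention1d(nn.Module):
    """SAGAN-style self-attention over the time axis of a 1-D feature map."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions project features into query/key/value spaces.
        self.query = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        # Learned residual weight, initialized to zero so the layer starts
        # as an identity and gradually learns to attend.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        q = self.query(x)                                   # (B, C//8, T)
        k = self.key(x)                                     # (B, C//8, T)
        v = self.value(x)                                   # (B, C, T)
        # Attention map over all pairs of time steps: (B, T, T).
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)
        # Aggregate values according to the attention weights: (B, C, T).
        out = torch.bmm(v, attn.transpose(1, 2))
        return self.gamma * out + x

# Example: a batch of 2 feature maps with 64 channels and 128 time steps.
x = torch.randn(2, 64, 128)
y = SelfAttention1d(64)(x)
print(y.shape)  # same shape as the input: (2, 64, 128)
```

Because `gamma` is initialized to zero, the layer behaves as an identity at the start of training; this is the stabilization trick from the SAGAN paper, which lets the network first rely on local convolutions before incorporating long-range dependencies, a property that matters for audio, where musically relevant structure spans many time steps.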
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Tomaz Neves, P.L., Fornari, J. & Batista Florindo, J. Self-attention generative adversarial networks applied to conditional music generation. Multimed Tools Appl 81, 24419–24430 (2022). https://doi.org/10.1007/s11042-022-12116-7