Abstract
The current wave of deep learning (the hyper-vitamined return of artificial neural networks) applies not only to traditional statistical machine learning tasks such as prediction and classification (e.g., weather prediction and pattern recognition), but has already conquered other areas, such as translation. A growing area of application is the generation of creative content, notably music, the topic of this article. The motivation is to use the capacity of modern deep learning techniques to automatically learn musical styles from arbitrary musical corpora and then to generate musical samples from the estimated distribution, with some degree of control over the generation. This article provides a tutorial on music generation based on deep learning techniques. After a short introduction to the topic illustrated by a recent example, the article analyzes some early works from the late 1980s using artificial neural networks for music generation and shows how their pioneering contributions foreshadowed current techniques. We then introduce a conceptual framework to analyze the various concepts and dimensions involved. Various examples of recent systems are introduced and analyzed to illustrate the variety of concerns and techniques.
Notes
In 2012, an image recognition competition (the ImageNet Large Scale Visual Recognition Challenge) was won by a deep neural network named AlexNet [33] with a stunning margin over the other algorithms, which were using handcrafted features. This striking victory ended the prevalent opinion that neural networks with many hidden layers could not be efficiently trained and started the deep learning wave.
Audio source separation, often referred to as the cocktail party problem, has long been known to be very difficult [5].
Following the model introduced in [3].
Actually, the style is defined in extension by (and learnt from) the various examples of music selected as training examples.
Using various techniques such as Markov models, constraints and rules, and not (yet) deep learning techniques.
The fact that Bach's music is often used for such experiments may not only be because of its wide availability, but also because his music is actually easier to automate, as Bach himself was somewhat of an algorithmic composer. An example is the way he composed chorales by designing and applying (with talent) counterpoint rules to existing melodies.
Two exceptions are introduced in Sect. 9.5.
To be described in Sect. 9.3.
Piano roll format and one-hot encoding are explained in Sect. 6.
This is an arbitrary choice.
A collection of such early papers is [63].
After the early stop of the first wave due to the critique of the limitation of the Perceptron [42].
In that respect, the Time-Windowed model is analogous to a first-order Markov model (considering only the previous state), at the level of a melody measure.
As a way to distinguish a longer note from a repeated note.
Note that the output layer is isomorphic to the context layer.
This is a peculiar characteristic of this architecture, as in recent standard recurrent network architectures the recurrent connexions are encapsulated within the hidden layer (as shown in Sect. 7.2). Todd's argument in [61] is that context units are more interpretable than hidden units: “Since the hidden units typically compute some complicated, often uninterpretable function of their inputs, the memory kept in the context units will likely also be uninterpretable. This is in contrast with [this] design, where, as described earlier, each context unit keeps a memory of its corresponding output unit, which is interpretable.”
In practice, it is a scalar real value, e.g., 0.7, but Todd discusses his experiments with other possible encodings [61].
Actually, as an optimization, Todd proposes later in his description to pass back the target (training) values rather than the output values.
Actually, in his article, Lewis does not detail the exact representation he uses (e.g., whether he uses a one-hot encoding for each note) nor the exact nature of the refinement, i.e., the adjustment of the values.
Systems refers to various proposals (architectures, systems and experiments) about deep learning-based music generation surveyed from the literature.
It is important to highlight that, in this conceptual framework, by strategy we only consider the generation strategy, i.e., the strategy to generate musical content. A strategy for training an architecture could be quite different and is out of direct concern in this classification.
In fact, they correspond to different scientific and technical communities, namely signal processing and knowledge representation.
Indeed, at the level of processing by a deep network architecture, the initial distinction between audio and symbolic representations vanishes, as only numerical values and operations are considered. In this article, we focus on symbolic music representation and generation.
The objective of the Fourier transform (which could be continuous or discrete) is the decomposition of an arbitrary signal into its elementary components (sinusoidal waveforms). As well as compressing the information, its role is fundamental for musical purposes as it reveals the harmonic components of the signal.
Acronym of Musical Instrument Digital Interface.
The volume may also be specified.
Note that the ABC notation has been designed independently of computer music and machine learning concerns.
Actually, in the original mechanical paper piano roll, the distinction is made: two holes are different from a longer single hole. The end of the hole is the encoding of the end of the note.
The name comes from digital circuits, one-hot referring to a group of bits among which the only legal (possible) combinations of values are those with a single high (hot!) (1) bit, all the others being low (0).
The figure also illustrates that a piano roll could be straightforwardly encoded as a sequence of one-hot vectors to construct the input representation of an architecture, as, e.g., shown in Fig. 2.
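As an illustration, the following minimal sketch (with a hypothetical pitch range and example melody, using numpy) shows how a monophonic piano roll may be encoded as a sequence of one-hot vectors:

import numpy as np

# Hypothetical pitch range: one column per possible pitch (e.g., MIDI 60..71, i.e., C4..B4)
PITCH_RANGE = list(range(60, 72))

def melody_to_one_hot(melody):
    # melody: list of MIDI pitch numbers, one per time step
    roll = np.zeros((len(melody), len(PITCH_RANGE)))
    for t, pitch in enumerate(melody):
        roll[t, PITCH_RANGE.index(pitch)] = 1.0  # a single "hot" bit per time step
    return roll

# Example: C4 E4 G4 E4 -> a 4 x 12 matrix, one one-hot row per time step
print(melody_to_one_hot([60, 64, 67, 64]))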
As a reminder from Sect. 5.1, we only consider here generation strategies.
Also named multilayer Perceptron (MLP).
The feedforward architecture and the feedforward strategy are naturally associated, although, as we will see in some of the next sections, other associations are possible.
Note that, as opposed to feedforward strategy (and decoder feedforward strategy, to be introduced in Sect. 9.1.1), iterative and recursive strategies allow the generation of musical content of arbitrary length.
The actual length of the generated melody depends on the number of iterations.
Sampling is the action of generating an element (a sample) from a stochastic model according to a probability distribution.
The chance of sampling a given pitch is its corresponding probability. In the example shown in Fig. 15, G\(\sharp\) has around one chance in two of being selected and A\(\sharp\) one chance in four.
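A minimal sketch of such a sampling step (with hypothetical candidate pitches and probabilities mirroring the proportions mentioned above, using numpy):

import numpy as np

pitches = ['G#', 'A#', 'C', 'D']          # hypothetical candidate pitches
probabilities = [0.5, 0.25, 0.15, 0.1]    # output probability distribution (sums to 1)

# Sample one pitch: G# is drawn about one time in two, A# about one time in four
rng = np.random.default_rng()
next_pitch = rng.choice(pitches, p=probabilities)
print(next_pitch)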
A stacked autoencoder is a hierarchical nesting of autoencoders with decreasing number of hidden layer units, as shown in the right part of Fig. 18.
More precisely, an RNN is nested within the encoder and another RNN within the decoder. Therefore, it is also named an RNN Encoder–Decoder architecture.
Note that we limit here the scope of a pattern to the external enfolding of an existing architecture. Additionally, we could have considered convolutional, autoencoder and even recurrent architectures as an internal architectural pattern.
Such as introduced by Todd in his Sequential architecture conditioned by a plan in Sect. 4.1.
Convolutional architectures are actually an important component of the current success of deep learning, and they recently emerged as an alternative to recurrent architectures that is more efficient to train [3, Section 8.2]. A convolutional architecture is composed of a succession of feature maps and pooling layers [15, Section 9] [3, Section 5.9]. (We could have considered convolutional as an internal architectural pattern, as remarked in a previous footnote.) However, we do not detail convolutional architectures here, because of space limitations and because they are not specific to music generation applications.
Compared to traditional dimension reduction algorithms, such as principal component analysis (PCA), feature extraction is nonlinear, but it does not ensure orthogonality of the dimensions, as shown in Sect. 9.2.2.
See more details, e.g., in [55].
The implementation of the encoder of a VAE actually generates a mean vector and a standard deviation vector [31].
This constraint is implemented by adding a specific term to the cost function to compute the cross-entropy between the distribution of latent variables and the prior distribution.
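As an illustration, here is a minimal sketch of such a cost function (assuming a standard Gaussian prior and the usual closed-form expression of the regularization term, often written as a Kullback–Leibler divergence; the function name and arguments are hypothetical):

import numpy as np

def vae_cost(x, x_reconstructed, mu, log_var):
    # Reconstruction term: how far the decoded output is from the input
    reconstruction = np.sum((x - x_reconstructed) ** 2)
    # Regularization term: pushes the latent distribution N(mu, sigma^2)
    # towards the prior N(0, 1) (closed form for Gaussian distributions)
    regularization = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return reconstruction + regularization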
This attribute vector is computed as the average latent vector for a collection of examples sharing that attribute (characteristic), e.g., high density of notes (see an example in Fig. 21), rapid change, high register, etc.
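A minimal sketch of how such an attribute vector may be computed and applied (the encoder and decoder functions and the example collections are hypothetical):

import numpy as np

def attribute_vector(encoder, examples_with_attribute):
    # Average latent vector over examples sharing the attribute (e.g., high density of notes)
    latents = np.array([encoder(x) for x in examples_with_attribute])
    return latents.mean(axis=0)

# Usage sketch: shift a melody's latent code along the attribute direction,
# then decode to obtain a variation with (more of) that attribute
# z = encoder(melody)
# z_denser = z + attribute_vector(encoder, dense_examples)
# new_melody = decoder(z_denser)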
As pointed out in Sect. 7.2.1, the generation of next note from current note is repeated until a sequence of the desired length is produced.
Actually this architecture is replicated 4 times, one for each voice (4 in a chorale).
The two bottom lines correspond to metadata (fermata and beat information), not detailed here.
The architecture is convolutional (only) on the time dimension, in order to model temporally invariant motives, but not pitch invariant motives which would break the notion of tonality.
Because of space limitations, and because RBMs are not mainstream, we do not detail the characteristics of RBMs here (see, e.g., [15, Section 20.2] or [3, Section 5.7] for details). In a first approximation for this article, we may consider an RBM as analogous to an autoencoder, except for two differences: the input and output layers are merged (and named the visible layer), and the model is stochastic.
This is named structure imposition, following the same basic approach as style transfer [7], except that what is transferred is a high-level structure.
This article is obviously not exhaustive. Interested readers may refer, e.g., to [3] for additional examples and details.
As in the case of a good cook, whose aim is not to simply mix all possible ingredients but to discover original successful combinations.
Like inpainting for the regeneration of missing or deteriorated parts of images.
References
Boulanger-Lewandowski N, Bengio Y, Vincent P (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. In: Proceedings of the 29th international conference on machine learning (ICML-12). Edinburgh, Scotland, pp 1159–1166
Briot JP (2020) Compress to create. MusMat Braz J Music Math IV(1):12–38
Briot JP, Hadjeres G, Pachet FD (2019) Deep learning techniques for music generation. Computational synthesis and creative systems. Springer, Berlin
Briot JP, Pachet F (2020) Music generation by deep learning: challenges and directions. Neural Comput Appl 32(4):981–993
Cherry EC (1953) Some experiments on the recognition of speech, with one and two ears. J Acoust Soc Am 25(5):975–979
Cope D (2000) The algorithmic composer. A-R Editions, Middleton
Dai S, Zhang Z, Xia GG (2018) Music style transfer issues: a position paper. arXiv:1803.06841v1
Eck D, Schmidhuber J (2002) A first look at music composition using LSTM recurrent neural networks. Technical report, IDSIA/USI-SUPSI, Manno, Switzerland. No. IDSIA-07-02
Elgammal A, Liu B, Elhoseiny M, Mazzone M (2017) CAN: creative adversarial networks generating “art” by learning about styles and deviating from style norms. arXiv:1706.07068v1
Emerging Technology from the arXiv: deep learning machine solves the cocktail party problem. MIT Technology Review (2015). https://www.technologyreview.com/s/537101/deep-learning-machine-solves-the-cocktail-party-problem/. Accessed 7 Apr 2020
Fernández JD, Vico F (2013) AI methods in algorithmic composition: a comprehensive survey. J Artif Intell Res 48:513–582
Fiebrink R, Caramiaux B (2016) The machine learning algorithm as creative musical tool. arXiv:1611.00379v1
Gamma E, Helm R, Johnson R, Vlissides J (1995) Design patterns: elements of reusable object-oriented software. Professional computing series. Addison-Wesley, Boston
Gatys LA, Ecker AS, Bethge M (2015) A neural algorithm of artistic style. arXiv:1508.06576v2
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozairy S, Courville A, Bengio Y (2014) Generative adversarial nets. arXiv:1406.2661v1
Google: Celebrating Johann Sebastian Bach (2019). https://www.google.com/doodles/celebrating-johann-sebastian-bach
Graves A (2014) Generating sequences with recurrent neural networks. arXiv:1308.0850v5
Hadjeres G, Nielsen F (2017) Interactive music generation with positional constraints using anticipation-RNN. arXiv:1709.06404v1
Hadjeres G, Nielsen F, Pachet F (2017) GLSR-VAE: geodesic latent space regularization for variational autoencoder architectures. arXiv:1707.04588v1
Hadjeres G, Pachet F, Nielsen F (2017) DeepBach: a steerable model for Bach chorales generation. arXiv:1612.01010v2
Herremans D, Chuan CH, Chew E (2017) A functional taxonomy of music generation systems. ACM Comput Surv 50(5):1–30
Hiller LA, Isaacson LM (1959) Experimental music: composition with an electronic computer. McGraw-Hill, New York
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Huang CZA, Cooijmans T, Dinculescu M, Roberts A, Hawthorne C (2019) Coconet: the ML model behind today’s Bach Doodle. https://magenta.tensorflow.org/coconet. Accessed 7 Apr 2020
Huang CZA, Cooijmans T, Roberts A, Courville A, Eck D (2017) Counterpoint by convolution. In: Hu X, Cunningham SJ, Turnbull D, Duan Z (eds.) Proceedings of the 18th international society for music information retrieval conference (ISMIR 2017). ISMIR, Suzhou, pp 211–218
Huang CZA, Vaswani A, Uszkoreit J, Shazeer N, Hawthorne ISC, Dai AM, Hoffman MD, Dinculescu M, Eck D (2018) Music Transformer: generating music with long-term structure. arXiv:1809.04281v3
Jaques N, Gu S, Turner RE, Eck D (2016) Tuning recurrent neural networks with reinforcement learning. arXiv:1611.02796v3
Keith J (2016) The session. https://thesession.org. Accessed 21 Dec 2016
Kingma DP, Welling M (2014) Auto-encoding variational Bayes. arXiv:1312.6114v10
Koutník J, Greff K, Gomez F, Schmidhuber J (2014) A clockwork RNN. arXiv:1402.3511v1
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information processing systems, NIPS 2012, vol 1. Curran Associates Inc., Lake Tahoe, pp 1097–1105
Lattner S, Grachten M, Widmer G (2018) Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints. J Creat Music Syst. arXiv:1612.04742
Lewis JP (1988) Creation by refinement: a creativity paradigm for gradient descent learning networks. In: IEEE international conference on neural networks, vol II. San Diego, CA, USA, pp 229–233
Lewis JP (1991) Creation by refinement and the problem of algorithmic music composition. In: Todd PM, Loy DG (eds) Music and connectionism. MIT Press, Cambridge, pp 212–228
Mao HH, Shin T, Cottrell GW (2018) DeepJ: style-specific music generation. arXiv:1801.00887v1
Mathieu E, Rainforth T, Siddharth N, Teh YW (2019) Disentangling disentanglement in variational autoencoders. arXiv:1812.02833v3
Matisse N (2019) How YACHT fed their old music to the machine and got a killer new album. ArsTechnica. https://arstechnica.com/gaming/2019/08/yachts-chain-tripping-is-a-new-landmark-for-ai-music-an-album-that-doesnt-suck/. Accessed 7 Apr 2020
McClelland JL, Rumelhart DE, Group PR (1986) Parallel distributed processing–explorations in the microstructure of cognition. Psychological and biological models, vol 2. MIT Press, Cambridge
Mehri S, Kumar K, Gulrajani I, Kumar R, Jain S, Sotelo J, Courville A, Bengio Y (2017) SampleRNN: an unconditional end-to-end neural audio generation model. arXiv:1612.07837v2
Minsky M, Papert S (1969) Perceptrons: an introduction to computational geometry. MIT Press, Cambridge
MMA: MIDI specifications. https://www.midi.org/specifications. Accessed 14 Apr 2017
Mogren O (2016) C-RNN-GAN: continuous recurrent neural networks with adversarial training. arXiv:1611.09904v1
Mordvintsev A, Olah C, Tyka M (2015) Deep dream. https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html. Accessed 7 Apr 2020
Nierhaus G (2009) Algorithmic composition: paradigms of automated music generation. Springer, Berlin
van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) WaveNet: a generative model for raw audio. arXiv:1609.03499v2
Papadopoulos A, Roy P, Pachet F (2016) Assisted lead sheet composition using FlowComposer. In: Rueher M (ed) Principles and practice of constraint programming: 22nd international conference, CP 2016, Toulouse, France, 5-9 Sept 2016, proceedings. Springer, pp 769–785
Papadopoulos G, Wiggins G (1999) AI methods for algorithmic composition: a survey, a critical view and future prospects. In: AISB 1999 symposium on musical creativity, pp 110–117
Pati A, Lerch A, Hadjeres G (2019) Learning to traverse latent spaces for musical score inpainting. arXiv:1907.01164v1
Pons J (2018) Neural networks for music: a journey through its history. https://towardsdatascience.com/neural-networks-for-music-a-journey-through-its-history-91f93c3459fb. Accessed 7 Apr 2020
Ramsundar B, Zadeh RB (2018) TensorFlow for deep learning. O’Reilly Media, Newton
Roberts A, Engel J, Raffel C, Hawthorne C, Eck D (2018) A hierarchical latent vector model for learning long-term structure in music. In: Proceedings of the 35th international conference on machine learning (ICML 2018). ACM, Montréal, PQ, Canada
Roberts A, Engel J, Raffel C, Hawthorne C, Eck D (2018) A hierarchical latent vector model for learning long-term structure in music. arXiv:1803.05428v2
Rocca J (2019) Understanding variational autoencoders (VAEs): building, step by step, the reasoning that leads to VAEs. https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73. Accessed 7 Apr 2020
Rumelhart DE, McClelland JL, Group PR (1986) Parallel distributed processing: explorations in the microstructure of cognition. Foundations, vol 1. MIT Press, Cambridge
Shaw M, Garlan D (1996) Software architecture: perspectives on an emerging discipline. Prentice Hall, Upper Saddle River
Simon I, Oore S (2017) Performance RNN: generating music with expressive timing and dynamics. https://magenta.tensorflow.org/performance-rnn
Sturm BL, Santos JF, Ben-Tal O, Korshunova I (2016) Music transcription modelling and composition using deep learning. In: Proceedings of the 1st conference on computer simulation of musical creativity (CSCM 16). Huddersfield, UK
Sun F, DeepHear – Composing and harmonizing music with neural networks. https://fephsun.github.io/2015/09/01/neural-music.html. Accessed 21 Dec 2017
Todd PM (1989) A connectionist approach to algorithmic composition. Comput Music J 13(4):27–43
Todd PM (1991) A connectionist approach to algorithmic composition. In: Todd PM, Loy DG (eds) Music and connectionism. MIT Press, Cambridge, pp 190–194
Todd PM, Loy DG (1991) Music and connectionism. MIT Press, Cambridge
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. arXiv:1706.03762v5
Walshaw C, abc notation home page. http://abcnotation.com. Accessed 21 Dec 2016
Wild CM (2018) What a disentangled net we weave: representation learning in VAEs (Pt. 1). https://towardsdatascience.com/what-a-disentangled-net-we-weave-representation-learning-in-vaes-pt-1-9e5dbc205bd1. Accessed 7 Apr 2020
Yang LC, Chou SY, Yang YH (2017) MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR 2017). Suzhou, China
Yang R, Wang D, Wang Z, Chen T, Jiang J, Xia G (2019) Deep music analogy via latent representation disentanglement. arXiv:1906.03626v4
Acknowledgements
We thank Gaëtan Hadjeres and François Pachet for their participation in the book [3], which has been a significant initial input for this article; CNRS, Sorbonne Université and UNIRIO for their support; and various participants of our course on the topic (course material is available online at: http://www-desir.lip6.fr/~briot/cours/unirio3/) for their feedback.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix: glossary
-
Activation function The function applied to the weighted sum for each neuron of a given layer. It is usually nonlinear (to introduce nonlinearity, in order to address the linear separability limitation of the Perceptron). Common examples are sigmoid or ReLU. The activation function of the output layer is a specific case (see Output layer activation function).
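As an illustration, minimal numpy definitions of the two activation functions mentioned above:

import numpy as np

def sigmoid(x):
    # Squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: 0 for negative inputs, identity otherwise
    return np.maximum(0.0, x)

# Applied elementwise to the weighted sums of a layer
print(sigmoid(np.array([-1.0, 0.0, 2.0])), relu(np.array([-1.0, 0.0, 2.0])))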
-
Algorithmic composition The use of algorithms and computers to generate music compositions (symbolic form) or music pieces (audio form). Examples of models and algorithms are: grammars, rules, stochastic processes (e.g., Markov chains), evolutionary methods and artificial neural networks.
-
Architecture An (artificial neural network) architecture is the structure of the organization of computational units (neurons), usually grouped in layers, and their weighted connexions. Examples of types of architecture are: feedforward (aka multilayer perceptron), recurrent (RNN), autoencoder and generative adversarial networks (GAN). Architectures process encoded representations (in our case, of musical content).
-
Artificial neural network A family of bio-inspired machine learning algorithms whose model is based on weighted connexions between computing units (neurons). Weights are incrementally adjusted during the training phase in order for the model to fit the data (examples).
-
Attention mechanism A mechanism inspired by the human visual system which focuses, at each time step, on some specific elements of the input sequence. This is modeled by weighted connexions onto the sequence elements (or onto the sequence of hidden units), which are learnt.
-
Autoencoder A specific case of artificial neural network architecture with an output layer mirroring the input layer and with one hidden layer. Autoencoders are good at extracting features.
-
Backpropagation A shorthand for “backpropagation of errors,” it is the algorithm used to compute the gradients of the cost function. The gradients are then used to guide the minimization of the cost function in order to fit the data.
-
Bias The b offset term of a simple linear regression model \(h(x) = b + \theta x\) and, by extension, of a neural network layer.
-
Bias node The node of a neural network layer corresponding to a bias. Its constant value is 1 and is usually notated as \(+1\).
-
Classification A machine learning task about the attribution of an instance to a class (from a set of possible classes). An example is to determine whether the next note is a C\(_4\), a C\(\sharp _4\), etc.
-
Compound architecture An artificial neural network architecture which results from the combination of several architectures. Examples of types of combination are composition, nesting and pattern instantiation.
-
Conditioning architecture The parametrization of an artificial neural network architecture by some conditioning information (e.g., a bass line, a chord progression...) represented via a specific extra input, in order to guide the generation.
-
Connexion A relation between a neuron and another neuron representing a computational flow from the output of the first neuron to an input of the second neuron. A connexion is modulated by a weight which will be adjusted during the training phase.
-
Convolution In mathematics, an operation on two functions sharing the same domain that produces a third function: the integral (or the sum, in the discrete case, such as images made of pixels) of the pointwise multiplication of the two functions as one is shifted across the domain in a reversed (mirrored) way. Inspired both by mathematical convolution and by a model of human vision, it has been adapted to artificial neural networks, where it improves pattern recognition accuracy by exploiting the spatial local correlation present in natural images. The basic principle is to slide a matrix (named a filter, a kernel or a feature detector) over the entire image (seen as the input matrix) and, for each position, to compute the dot product of the filter with the mapped portion of the image, i.e., to sum up all elements of their pointwise multiplication.
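A minimal sketch of the sliding dot-product computation described above (2D, single channel, no padding and no stride; as in most deep learning libraries, the kernel is not flipped, so this is technically a cross-correlation):

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image; at each position, compute the
    # elementwise product with the covered patch and sum the result
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Example: a 3x3 vertical-edge-like filter applied to a 5x5 "image"
print(convolve2d(np.arange(25.0).reshape(5, 5), np.array([[1.0, 0.0, -1.0]] * 3)))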
-
Correlation Any statistical relationship, whether causal or not, between two random variables. Artificial neural networks are good at extracting correlations between variables, for instance, between input variables and output variables and also between input variables.
-
Cost function (aka Loss function) The function used for measuring the distance between the prediction by an artificial neural network architecture (ŷ) and the actual target (true value y). Various cost functions may be used, depending on the task (prediction or classification) and the encoding of the output, e.g., mean squared error, binary cross-entropy and categorical cross-entropy.
-
Counterpoint In musical theory, an approach for the accompaniment of a melody through a set of other melodies (voices). An example is a chorale with 3 voices (alto, tenor and bass) matching a soprano melody. Counterpoint focuses on the horizontal relations between successive notes for each simultaneous melody (voice) and then considers the vertical relations between their progression (e.g., to avoid parallel fifths).
-
Creation by refinement strategy A strategy for generating content based on the incremental modification of a representation to be processed by an artificial neural network architecture.
-
Cross-entropy A function measuring the dissimilarity between two probability distributions. It is used as a cost (loss) function for a classification task to measure the difference between the prediction by an artificial neural network architecture (ŷ) and the actual target (true value y). There are two types of cross-entropy cost functions: binary cross-entropy when the classification is binary and categorical cross-entropy when the classification is multiclass with a single label to be selected.
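Minimal numpy sketches of the two variants (y denotes the true target and y_hat the predicted probability or probability distribution):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # y in {0, 1}, y_hat = predicted probability of class 1
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # y = one-hot target vector, y_hat = predicted probability distribution (softmax output)
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

print(categorical_cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))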
-
Dataset The set of examples used for training an artificial neural network architecture.
-
Decoder The decoding component of an autoencoder which reconstructs the compressed representation (an embedding) from the hidden layer into a representation at the output layer as close as possible to the initial data representation at the input layer.
-
Decoder feedforward strategy A strategy for generating content based on an autoencoder architecture in which values are assigned onto the latent variables of the hidden layer and forwarded into the decoder component of the architecture in order to generate a musical content corresponding to the abstract description inserted.
-
Deep learning (aka Deep neural network) An artificial neural network architecture with a significant number of successive layers.
-
Discriminator The discriminative model component of generative adversarial networks (GAN) which estimates the probability that a sample came from the real data rather than from the generator.
-
Disentanglement The objective of separating different factors governing variability in the data (e.g., in the case of human images, identity of the individual and facial expression, in the case of music, note pitch range and note duration range).
-
Embedding In mathematics, an injective and structure-preserving mapping. Initially used for natural language processing, it is now often used in deep learning as a general term for encoding a given representation into a vector representation.
-
Encoder The encoding component of an autoencoder which transforms the data representation from the input layer into a compressed representation (an embedding) at the hidden layer.
-
Encoding The encoding of a representation consists in the mapping of the representation (composed of a set of variables, e.g., pitch or dynamics) into a set of inputs (also named input nodes or input variables) for the neural network architecture. Examples of encoding strategies are: value encoding, one-hot encoding and many-hot encoding.
-
End-to-end architecture An artificial neural network architecture that directly processes raw data, without any pre-processing, transformation of representation or extraction of features, to produce a final output.
-
Feedforward The basic way for a neural network architecture to process an input, by feeding the input data forward through the successive layers of neurons of the architecture until the output is produced.
-
Feedforward architecture It is the most basic and common type of artificial neural network architecture. It is also named multilayer neural network or multilayer Perceptron (MLP). It is composed of successive layers, with at least one hidden layer.
-
Fourier transform A transformation (which could be continuous or discrete) that decomposes a signal into its elementary components (sinusoidal waveforms). As well as compressing the information, its role is fundamental for musical purposes as it reveals the harmonic components of the signal.
-
Generative adversarial networks (GAN) A compound architecture composed of two component architectures, the generator and the discriminator, which are trained simultaneously with opposed objectives. The generator objective is to generate synthetic samples resembling real data, while the discriminator objective is to detect synthetic samples.
-
Generator The generative model component of generative adversarial networks (GAN) whose objective is to transform a random noise vector into a synthetic (faked) sample which resembles real samples drawn from a distribution of real data.
-
Gradient A partial derivative of the cost function with respect to a weight parameter or a bias.
-
Gradient descent A basic algorithm for training a linear regression model and an artificial neural network. It consists in an incremental update of the weight parameters guided by the gradients of the cost function until reaching a minimum.
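A minimal sketch of this update loop for the simple linear regression model \(h(x) = b + \theta x\) with a mean squared error cost (hypothetical data):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                      # hypothetical data to fit
theta, b, learning_rate = 0.0, 0.0, 0.05

for step in range(1000):
    y_hat = b + theta * x
    error = y_hat - y
    # Gradients of the mean squared error cost with respect to theta and b
    grad_theta = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Incremental update in the opposite direction of the gradients
    theta -= learning_rate * grad_theta
    b -= learning_rate * grad_b

print(theta, b)   # should approach 2.0 and 1.0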
-
Harmony In musical theory, a system for organizing simultaneous notes. Harmony focuses on the vertical relations between simultaneous notes, as objects on their own (chords), and then considers the horizontal relations between them (e.g., harmonic cadences).
-
Hidden layer Any neuron layer located between the input layer and the output layer of a neural network architecture.
-
Hold The information that a note is held, i.e., that its duration is extended over an additional time step (as opposed to the note being repeated).
-
Input layer The first layer of a neural network architecture. It is an interface consisting in a set of nodes without internal computation.
-
Iterative feedforward strategy A strategy for generating content by generating its successive time slices.
-
Latent variable In statistics, a variable which is not directly observed. In deep learning architectures, a variable within a hidden layer. By sampling or assigning values to latent variables, one may control the generation, e.g., in the case of a variational autoencoder.
-
Layer A component of a neural network architecture composed of a set of neurons.
-
Linear regression Regression for an assumed linear relationship between a scalar variable and one or several explanatory variable(s).
-
Linear separability The ability to separate by a line or a hyperplane the elements of two different classes represented in a Euclidean space.
-
Long short-term memory (LSTM) A type of recurrent neural network architecture with capacity for learning long-term correlations and not suffering from the vanishing or exploding gradient problem during the training phase. The idea is to secure information in memory cells protected from the standard data flow of the recurrent network. Decisions about writing to, reading from and forgetting the values of cells are performed by the opening or closing of gates and are expressed at a distinct control level, while being learnt during the training process.
-
Markov chain A stochastic model describing a sequence of possible states, where the probability of moving from the current state to a next state does not depend on previous states.
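A minimal sketch of generating a sequence with a Markov chain (the three pitch classes and the transition probabilities are hypothetical):

import numpy as np

states = ['C', 'E', 'G']
# Hypothetical transition matrix: transitions[i][j] = probability of moving from state i to state j
transitions = np.array([[0.1, 0.6, 0.3],
                        [0.4, 0.2, 0.4],
                        [0.5, 0.3, 0.2]])

rng = np.random.default_rng()
current = 0                                  # start on C
sequence = [states[current]]
for _ in range(7):
    # The next state depends only on the current state, not on previous ones
    current = rng.choice(len(states), p=transitions[current])
    sequence.append(states[current])
print(sequence)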
-
Musical instrument digital interface (MIDI) A technical standard that describes a protocol, a digital interface and connectors for interoperability between various electronic musical instruments, software and devices.
-
Multilayer perceptron (MLP) A feedforward neural architecture composed of successive layers, with at least one hidden layer. Also named Feedforward architecture.
-
Multivoice (aka multitrack) The abbreviation of multivoice polyphony, that is, a set of sequences of notes intended for more than one voice or instrument.
-
Neuron The atomic processing element (unit) of an artificial neural network architecture, inspired by the biological model of a neuron. A neuron has several input connexions, each one with an associated weight, and one output. A neuron will compute the weighted sum of all its input values and then apply its associated activation function in order to compute its output value. Weights will be adjusted during the training phase of the neural network architecture.
-
Node The atomic structural element of an artificial neural network architecture. A node could be a processing unit (a neuron) or a simple interface element for a value, e.g., in the case of the input layer or a bias node.
-
Objective The nature and the destination of the musical content to be generated by a neural network architecture. Examples of objectives are: a monophonic melody to be played by a human flutist and a polyphonic accompaniment played by a synthesizer.
-
One-hot encoding Strategy used to encode a categorical variable (e.g., a note pitch) as a vector having as its length the number of possible values (e.g., from C\(_4\) to B\(_4\)). A given element (e.g., a note pitch) is represented with a corresponding 1 with all other elements being 0. The name comes from digital circuits, one-hot referring to a group of bits among which the only legal (possible) combinations of values are those with a single high (hot) (1) bit, all the others being low (0).
-
Output layer The last layer of a neural network architecture.
-
Output layer activation function The activation function of the output layer, which is usually: identity for a prediction task, sigmoid for a binary classification task and softmax for a multiclass single-label classification task.
-
Parameter The parameters of an artificial neural network architecture are the weights associated with each connexion between neurons as well as the biases associated with each layer.
-
Perceptron One of the first artificial neural network architectures, created by Rosenblatt in 1957. It had no hidden layer and suffered from the linear separability limitation.
-
Piano roll Representation of a melody (monophonic or polyphonic) inspired by automated pianos. Each “perforation” represents note control information used to trigger a given note. The length of the perforation corresponds to the duration of the note. In the other dimension, the localization (height) of the perforation corresponds to its pitch.
-
Pitch class The name of the corresponding note (e.g., C) independently of the octave position.
-
Polyphony The abbreviation of a single-voice polyphony, that is, a sequence of notes for a single instrument (e.g., a guitar or a piano) with possibly simultaneous notes.
-
Pooling For a convolutional architecture, a data dimensionality reduction operation (by max, average or sum) for each feature map produced by a convolutional stage, while retaining significant information. Pooling brings the important property of the invariance to small transformations, distortions and translations in the input image.
-
Prediction See regression.
-
Recurrent connexion A connexion from an output of a node to its input. By extension, the recurrent connexion of a layer fully connects the outputs of all its nodes to all inputs of all its nodes. This is the basis of a recurrent neural network (RNN) architecture.
-
Recurrent neural network (RNN) A type of artificial neural network architecture with recurrent connexions and memory. It is used to learn sequences.
-
Recursive feedforward strategy A special case of iterative feedforward strategy where the current output is used as the next input.
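A minimal sketch of this strategy, where each generated note is fed back as the next input (the model and seed are hypothetical):

def generate_recursively(model, seed_note, length):
    # model(note) is assumed to return the predicted next note (e.g., after
    # sampling from the output probability distribution); both names are hypothetical
    notes = [seed_note]
    for _ in range(length - 1):
        next_note = model(notes[-1])   # the current output...
        notes.append(next_note)        # ...becomes the next input
    return notes

# Usage sketch: generate_recursively(trained_model, 'C4', 16)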
-
Regression In statistics, regression is an approach for modeling the relationship between a scalar variable and one or several explanatory variable(s).
-
Reinforcement learning An area of machine learning concerned with an agent making successive decisions about an action in an environment while receiving a reward (reinforcement signal) after each action. The objective for the agent is to find the best policy maximizing its cumulated rewards.
-
Reinforcement strategy A strategy for content generation by modeling generation of successive notes as a reinforcement learning problem while using an RNN as a reference for the modeling of the reward. Therefore, one may introduce arbitrary control objectives (e.g., adherence to current tonality, maximum number of repetitions, etc.) as additional reward terms.
-
ReLU The rectified linear unit function, which may be used as a hidden layer nonlinear activation function, especially in the case of convolutional architectures.
-
Representation The nature and format of the information (data) used to train and to generate musical content. Examples of types of representation are: waveform signal, spectrum, piano roll and MIDI.
-
Requirement One of the qualities that may be desired for music generation. Examples are: content variability, incrementality, originality and structure.
-
Rest The information about the absence of a note (silence) during one (or more) time step(s).
-
Restricted Boltzmann machine (RBM) A specific type of artificial neural network that can learn a probability distribution over its set of inputs. It is stochastic, has no distinction between input and output and uses a specific learning algorithm.
-
Sampling The action of producing an item (a sample) according to a given probability distribution over the possible values. As more and more samples are generated, their distribution should more closely approximate the given distribution.
-
Sampling strategy A strategy for generating content where variables of a content representation are incrementally instantiated and refined according to a target probability distribution which has been previously learnt.
-
Seed-based generation An approach to generate arbitrary content (e.g., a long melody) with a minimal (seed) information (e.g., a first note).
-
Sigmoid Also named the logistic function, it is used as an output layer activation function for binary classification tasks and it may also be used as a hidden layer nonlinear activation function.
-
Single-step feedforward strategy A strategy for generating content where a feedforward architecture processes in a single processing step a global temporal scope representation which includes all time slices.
-
Softmax Generalization of the sigmoid (logistic) function to the case of multiple classes. Used as an output activation function for multiclass single-label classification.
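A minimal numpy sketch (with the usual subtraction of the maximum for numerical stability):

import numpy as np

def softmax(z):
    # Exponentiate (after shifting by the maximum for numerical stability)
    # and normalize, so that the outputs are positive and sum to 1
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # e.g., a probability for each pitch class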
-
Spectrum The representation of a sound in terms of the amount of vibration at each individual frequency (i.e., as a function of frequency). It is computed by a Fourier transform which decomposes the original signal into its elementary (harmonic) components (sinusoidal waveforms).
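As an illustration, a minimal numpy sketch showing how a discrete Fourier transform reveals the harmonic components of a simple signal (the sampling rate and component frequencies are hypothetical):

import numpy as np

sample_rate = 8000                            # hypothetical sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / sample_rate)
# A signal made of two sinusoidal components: 440 Hz (A4) and its octave 880 Hz
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.rfft(signal))
frequencies = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
# The two largest peaks appear at (approximately) 880 Hz and 440 Hz
print(frequencies[np.argsort(spectrum)[-2:]])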
-
Stacked autoencoder A set of hierarchically nested autoencoders with decreasing numbers of hidden layer units.
-
Strategy The way the architecture will process representations in order to generate the objective while matching desired requirements. Examples of types of strategy are: single-step feedforward, iterative feedforward and decoder feedforward.
-
Style transfer The technique for capturing a style (e.g., of a given painting, by capturing the correlations between neurons for each layer) and applying it onto another content.
-
Time slice The time interval considered as an atomic portion (grain) of the temporal representation used by an artificial neural network architecture.
-
Time step The atomic increment of time considered by an artificial neural network architecture.
-
Turing test Initially codified in 1950 by Alan Turing and named by him the “imitation game,” the “Turing test” is a test of the ability for a machine to exhibit intelligent behavior equivalent to (and more precisely, indistinguishable from) the behavior of a human. In his imaginary experimental setting, Turing proposed the test to be a natural language conversation between a human (the evaluator) and a hidden actor (another human or a machine). If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test.
-
Unit See neuron.
-
Variational autoencoder (VAE) An autoencoder with the added constraint that the encoded representation (its latent variables) follows some prior probability distribution, usually a Gaussian distribution. The variational autoencoder is therefore able to learn a “smooth” latent space mapping to realistic examples which provides interesting ways to control the variation in the generation.
-
Value encoding The direct encoding of a numerical value as a scalar.
-
Vanishing or exploding gradient problem A known problem when training a recurrent neural network, caused by the difficulty of estimating gradients: in backpropagation through time, recurrence brings repetitive multiplications which could excessively amplify or shrink effects (and numerical errors). The long short-term memory (LSTM) architecture solved the problem.
-
Waveform The raw representation of a signal as the evolution of its amplitude in time.
-
Weight A numerical parameter associated with a connexion between a node (neuron or not) and a unit (neuron). A neuron will compute the weighted sum of the activations of its connexions and then apply its associated activation function. Weights will be adjusted during the training phase.
About this article
Cite this article
Briot, JP. From artificial neural networks to deep learning for music generation: history, concepts and trends. Neural Comput & Applic 33, 39–65 (2021). https://doi.org/10.1007/s00521-020-05399-0