Abstract
Generative AI models for music and the arts in general are increasingly complex and hard to understand. The field of explainable AI (XAI) seeks to make complex and opaque AI models such as neural networks more understandable to people. One approach to making generative AI models more understandable is to impose a small number of semantically meaningful attributes on them. This paper contributes a systematic examination of the impact that different combinations of variational auto-encoder models (measureVAE and adversarialVAE), latent space configurations (from 4 to 256 latent dimensions), and training datasets (Irish folk, Turkish folk, classical, and pop) have on music generation performance when 2 or 4 meaningful musical attributes are imposed on the generative model. To date, there have been no systematic comparisons of such models at this level of combinatorial detail. Our findings show that measureVAE has better reconstruction performance than adversarialVAE, which in turn has better musical attribute independence. Results demonstrate that measureVAE can generate music across genres with interpretable musical dimensions of control, and that it performs best with low-complexity music such as pop and rock. We recommend a 32- or 64-dimensional latent space as optimal when 4 dimensions are regularised and measureVAE is used to generate music across genres. Our results are the first detailed comparisons of configurations of state-of-the-art generative AI models for music, and can be used to help select and configure AI models, musical features, and datasets for more understandable music generation.
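The attribute regularisation underpinning the compared models follows Pati and Lerch's attribute-based regularisation of VAE latent spaces (cited in the references below): each chosen musical attribute is bound to one latent dimension by penalising mismatches between the ordering of latent values and the ordering of attribute values across a training batch. The PyTorch sketch below illustrates the idea only; `model`, `compute_attributes`, the loss weights, and the regularised dimension indices are hypothetical placeholders, not the configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def attribute_regularisation_loss(z, attributes, reg_dims, delta=10.0):
    """Encourage each regularised latent dimension to order a batch of
    examples the same way the corresponding musical attribute does.

    z:          (batch, latent_dim) latent codes from the encoder
    attributes: (batch, len(reg_dims)) attribute values (e.g., note density)
    reg_dims:   indices of the latent dimensions to regularise
    delta:      sharpness of the tanh surrogate for the sign function
    """
    loss = z.new_zeros(())
    for k, dim in enumerate(reg_dims):
        z_d = z[:, dim]
        a_k = attributes[:, k]
        # Pairwise differences across the batch for this dimension/attribute.
        dz = z_d.unsqueeze(0) - z_d.unsqueeze(1)
        da = a_k.unsqueeze(0) - a_k.unsqueeze(1)
        # Penalise latent orderings that disagree with attribute orderings.
        loss = loss + F.l1_loss(torch.tanh(delta * dz), torch.sign(da))
    return loss

def training_step(model, batch, compute_attributes, beta=1e-3, gamma=1.0):
    # `model` is assumed to return token logits plus the usual VAE outputs;
    # `batch` is a (batch, seq_len) LongTensor of note tokens.
    logits, mu, logvar, z = model(batch)
    recon = F.cross_entropy(logits.transpose(1, 2), batch)
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    ar = attribute_regularisation_loss(z, compute_attributes(batch),
                                       reg_dims=[0, 1, 2, 3])  # 4 attributes
    return recon + beta * kl + gamma * ar
```

After training, an attribute can be adjusted by moving along its regularised dimension before decoding, which is what makes the latent space interpretable in the XAI sense discussed above.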
References
B. L. Sturm, O. Ben-Tal, Ú. Monaghan, N. Collins, D. Herremans, E. Chew, G. Hadjeres, E. Deruty, F. Pachet. Machine learning research that matters for music creation: A case study. Journal of New Music Research, vol. 48, no. 1, pp. 36–55, 2019. DOI: https://doi.org/10.1080/09298215.2018.1515233.
D. Herremans, C. H. Chuan, E. Chew. A functional taxonomy of music generation systems. ACM Computing Surveys, vol. 50, no. 5, Article number 69, 2018. DOI: https://doi.org/10.1145/3108242.
F. Carnovalini, A. Rodà. Computational creativity and music generation systems: An introduction to the state of the art. Frontiers in Artificial Intelligence, vol. 3, Article number 14, 2020. DOI: https://doi.org/10.3389/frai.2020.00014.
P. M. Todd. A connectionist approach to algorithmic composition. Computer Music Journal, vol. 13, no. 4, pp. 27–43, 1989. DOI: https://doi.org/10.2307/3679551.
D. Eck, J. Schmidhuber. A First Look at Music Composition Using LSTM Recurrent Neural Networks, Technical Report No. IDSIA-07-02, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Manno, Switzerland, 2002.
J. P. Briot, G. Hadjeres, F. D. Pachet. Deep learning techniques for music generation – A survey, [Online], Available: https://arxiv.org/abs/1709.01620, 2017.
G. Hadjeres, F. Pachet, F. Nielsen. DeepBach: A steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, vol. 70, pp. 1362–1371, 2017.
H. Y. Zhu, Q. Liu, N. J. Yuan, C. Qin, J. W. Li, K. Zhang, G. Zhou, F. R. Wei, Y. C. Xu, E. H. Chen. XiaoIce band: A melody and arrangement generation framework for pop music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, pp. 2837–2846, 2018. DOI: https://doi.org/10.1145/3219819.3220105.
C. Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, D. Eck. Music transformer: Generating music with long-term structure. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
D. Gunning. Explainable Artificial Intelligence (XAI). DARPA/I2O Proposers Day, [Online], Available: https://www.darpa.mil/attachments/XAIIndustryDay_Final.pptx, 2016.
A. Pati, A. Lerch. Attribute-based regularization of latent spaces for variational auto-encoders. Neural Computing and Applications, vol. 33, no. 9, pp. 4429–4444, 2021. DOI: https://doi.org/10.1007/s00521-020-05270-2.
R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, vol. 51, no. 5, Article number 93, 2019. DOI: https://doi.org/10.1145/3236009.
G. Ciatto, M. I. Schumacher, A. Omicini, D. Calvaresi. Agent-based explanations in AI: Towards an abstract framework. In Proceedings of the 2nd International Workshop on Explainable, Transparent Autonomous Agents and Multi-Agent Systems, Auckland, New Zealand, pp. 3–20, 2020. DOI: https://doi.org/10.1007/978-3-030-51924-7_1.
Q. V. Liao, D. Gruen, S. Miller. Questioning the AI: Informing design practices for explainable AI user experiences. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, USA, pp. 1–15, 2020. DOI: https://doi.org/10.1145/3313831.3376590.
M. T. Ribeiro, S. Singh, C. Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, pp. 1135–1144, 2016. DOI: https://doi.org/10.1145/2939672.2939778.
G. Quellec, H. Al Hajj, M. Lamard, P. H. Conze, P. Massin, B. Cochener. ExplAIn: Explanatory artificial intelligence for diabetic retinopathy diagnosis. Medical Image Analysis, vol. 72, Article number 102118, 2021. DOI: https://doi.org/10.1016/j.media.2021.102118.
N. Du, J. Haspiel, Q. N. Zhang, D. Tilbury, A. K. Pradhan, X. J. Yang, L. P. Robert Jr. Look who's talking now: Implications of AV's explanations on driver's trust, AV preference, anxiety and mental workload. Transportation Research Part C: Emerging Technologies, vol. 104, pp. 428–442, 2019. DOI: https://doi.org/10.1016/j.trc.2019.05.025.
Y. Shen, S. D. J. Jiang, Y. L. Chen, E. Yang, X. L. Jin, Y. L. Fan, K. Driggs-Campbell. To explain or not to explain: A study on the necessity of explanations for autonomous vehicles, [Online], Available: https://arxiv.org/abs/2006.11684, 2020.
N. Bryan-Kinns, B. Banar, C. Ford, C. N. Reed, Y. X. Zhang, S. Colton, J. Armitage. Exploring XAI for the arts: Explaining latent space in generative music. In Proceedings of the 1st Workshop on eXplainable AI Approaches for Debugging and Diagnosis, 2021.
G. Vigliensoni, L. McCallum, R. Fiebrink. Creating latent spaces for modern music genre rhythms using minimal training data. In Proceedings of the 11th International Conference on Computational Creativity, Coimbra, Portugal, pp. 259–262, 2020.
J. McCormack, T. Gifford, P. Hutchings, M. T. L. Rodriguez, M. Yee-King, M. d’Inverno. In a silent way: Communication between AI and improvising musicians beyond sound. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Glasgow, UK, Article number 38, 2019. DOI: https://doi.org/10.1145/3290605.3300268.
P. Pasquier, A. Eigenfeldt, O. Bown, S. Dubnov. An introduction to musical metacreation. Computers in Entertainment, vol. 14, no. 2, Article number 2, 2016. DOI: https://doi.org/10.1145/2930672.
G. Widmer. Getting closer to the essence of music: The Con espressione manifesto. ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 2, Article number 19, 2017. DOI: https://doi.org/10.1145/2899004.
J. P. Briot, F. Pachet. Deep learning for music generation: Challenges and directions. Neural Computing and Applications, vol. 32, no. 4, pp. 981–993, 2020. DOI: https://doi.org/10.1007/s00521-018-3813-6.
F. Colombo, A. Seeholzer, S. P. Muscinelli, J. Brea, W. Gerstner. Algorithmic composition of melodies with deep recurrent neural networks. In Proceedings of the 1st Conference on Computer Simulation of Musical Creativity, Huddersfield, UK, 2016. DOI: https://doi.org/10.13140/RG.2.1.2436.5683.
B. L. Sturm, J. F. Santos, O. Ben-Tal, I. Korshunova. Music transcription modelling and composition using deep learning, [Online], Available: https://arxiv.org/abs/1604.08723, 2016.
A. Pati, A. Lerch, G. Hadjeres. Learning to traverse latent spaces for musical score inpainting. In Proceedings of the 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, pp. 343–351, 2019.
A. Roberts, J. H. Engel, C. Raffel, C. Hawthorne, D. Eck. A hierarchical latent vector model for learning long-term structure in music. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, vol. 80, pp. 4361–4370, 2018.
E. S. Koh, S. Dubnov, D. Wright. Rethinking recurrent latent variable model for music composition. In Proceedings of the 20th International Workshop on Multimedia Signal Processing, Vancouver, Canada, pp. 1–6, 2018. DOI: https://doi.org/10.1109/MMSP.2018.8547061.
C. Ames. The Markov process as a compositional model: A survey and tutorial. Leonardo, vol. 22, no. 2, pp. 175–187, 1989. DOI: https://doi.org/10.2307/1575226.
R. Whorley, R. Laney. Generating subjects for pieces in the style of Bach’s two-part inventions. In Proceedings of the Joint Conference on AI Music Creativity, Stockholm, Sweden, 2020.
L. Kawai, P. Esling, T. Harada. Attributes-aware deep music transformation. In Proceedings of the 21st International Society for Music Information Retrieval Conference, Montreal, Canada, pp. 670–677, 2020.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial networks. Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020. DOI: https://doi.org/10.1145/3422622.
D. P. Kingma, M. Welling. Auto-encoding variational Bayes, [Online], Available: https://arxiv.org/abs/1312.6114, 2013.
R. H. Yang, D. S. Wang, Z. Y. Wang, T. Y. Chen, J. Y. Jiang, G. Xia. Deep music analogy via latent representation disentanglement. In Proceedings of the 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, pp. 596–603, 2019.
Z. Y. Wang, Y. Y. Zhang, Y. X. Zhang, J. Y. Jiang, R. H. Yang, G. Xia, J. B. Zhao. PIANOTREE VAE: Structured representation learning for polyphonic music. In Proceedings of the 21st International Society for Music Information Retrieval Conference, Montreal, Canada, pp. 368–375, 2020.
R. Q. Wei, C. Garcia, A. El-Sayed, V. Peterson, A. Mahmood. Variations in variational autoencoders – A comparative evaluation. IEEE Access, vol. 8, pp. 153651–153670, 2020. DOI: https://doi.org/10.1109/ACCESS.2020.3018151.
R. Louie, A. Cohen, C. Z. A. Huang, M. Terry, C. J. Cai. Cococo: AI-steering tools for music novices co-creating with generative models. In Proceedings of the Joint Workshops on Human-AI Co-creation with Generative Models and User-aware Conversational Agents, co-located with the 25th International Conference on Intelligent User Interfaces, Cagliari, Italy, 2020.
N. J. W. Thelle, P. Pasquier. Spire muse: A virtual musical partner for creative brainstorming. In Proceedings of the 21st International Conference on New Interfaces for Musical Expression, Shanghai, China, 2021. DOI: https://doi.org/10.21428/92fbeb44.84c0b364.
T. Murray-Browne, P. Tigas. Latent mappings: Generating open-ended expressive mappings using variational autoencoders. In Proceedings of the 21st International Conference on New Interfaces for Musical Expression, Shanghai, China, 2021. DOI: https://doi.org/10.21428/92fbeb44.9d4bcd4b.
A. K. Gillette, T. H. Chang. ALGORITHMS: Assessing Latent Space Dimension by Delaunay Loss, Technical Report LLNL-CONF-814930, Lawrence Livermore National Laboratory, Livermore, USA, 2020.
A. Pati, A. Lerch. Latent space regularization for explicit control of musical attributes. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, 2019.
G. Hadjeres, F. Nielsen, F. Pachet. GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures. In Proceedings of the IEEE Symposium Series on Computational Intelligence, Honolulu, USA, pp. 1–7, 2017. DOI: https://doi.org/10.1109/SSCI.2017.8280895.
G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, M. Ranzato. Fader networks: Manipulating images by sliding attributes. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 5969–5978, 2017.
H. H. Tan, D. Herremans. Music FaderNets: Controllable music generation based on high-level features via low-level feature modelling. In Proceedings of the 21st International Society for Music Information Retrieval Conference, Montreal, Canada, pp. 109–116, 2020.
B. Banar, S. Colton. A systematic evaluation of GPT-2-based music generation. In Proceedings of the 11th International Conference on Artificial Intelligence in Music, Sound, Art and Design, Madrid, Spain, pp. 19–35, 2022. DOI: https://doi.org/10.1007/978-3-031-03789-4_2.
M. K. Karaosmanoglu. A Turkish makam music symbolic database for music information retrieval: SymbTr. In Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, pp. 223–228, 2012.
G. Dzhambazov, A. Srinivasamurthy, S. Şentürk, X. Serra. On the use of note onsets for improved lyrics-to-audio alignment in Turkish makam music. In Proceedings of the 17th International Society for Music Information Retrieval Conference, New York, USA, pp. 716–722, 2016.
C. Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching, Ph. D. dissertation, Columbia University, USA, 2016.
C. McKay, I. Fujinaga. jSymbolic: A feature extractor for MIDI files. In Proceedings of the International Computer Music Conference, New Orleans, USA, 2006.
G. T. Toussaint. A mathematical analysis of African, Brazilian, and Cuban clave rhythms. In Proceedings of the BRIDGES: Mathematical Connections in Art, Music, and Science, Towson, USA, pp. 157–168, 2002.
D. P. Kingma, J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015.
L. Myers, M. J. Sirois. Spearman correlation coefficients, differences between. Encyclopedia of Statistical Sciences, S. Kotz, C. B. Read, N. Balakrishnan, B. Vidakovic, N. L. Johnson, Eds., Hoboken: John Wiley & Sons, Inc., 2006. DOI: https://doi.org/10.1002/0471667196.ess5050.pub2.
D. P. Kingma, M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada, 2014.
T. Adel, Z. Ghahramani, A. Weller. Discovering interpretable representations for both deep generative and discriminative models. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 50–59, 2018.
O. O. Garibay, B. Winslow, S. Andolina, M. Antona, A. Bodenschatz, C. Coursaris, G. Falco, S. M. Fiore, I. Garibay, K. Grieman, J. C. Havens, M. Jirotka, H. Kacorri, W. Karwowski, J. Kider, J. Konstan, S. Koon, M. Lopez-Gonzalez, I. Maifeld-Carucci, S. McGregor, G. Salvendy, B. Shneiderman, C. Stephanidis, C. Strobel, C. Ten Holter, W. Xu. Six human-centered artificial intelligence grand challenges. International Journal of Human–Computer Interaction, vol. 39, no. 3, pp. 391–437, 2023. DOI: https://doi.org/10.1080/10447318.2022.2153320.
K. Chen, G. Xia, S. Dubnov. Continuous melody generation via disentangled short-term representations and structural conditions. In Proceedings of the 14th International Conference on Semantic Computing, San Diego, USA, pp. 128–135, 2020. DOI: https://doi.org/10.1109/ICSC.2020.00025.
D. Y. Liu, L. Wu, H. F. Zhao, F. Boussaid, M. Bennamoun, X. H. Xie. Jacobian norm with selective input gradient regularization for improved and interpretable adversarial defense, [Online], Available: https://arxiv.org/abs/2207.13036, 2022.
B. Banar, N. Bryan-Kinns, S. Colton. A tool for generating controllable variations of musical themes using variational autoencoders with latent space regularisation. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2023.
Acknowledgements
This work was supported by the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, UK (EP/S022694/1), Queen Mary University of London, UK, and the Carleton College Career Center, USA. Open Access funding was provided by Queen Mary University of London.
Author information
Contributions
Bryan-Kinns instigated this work, led the research, supervised the student projects, and led the data analysis and writing. Zhao and Zhang contributed equally to the implementation of AI models, data collection and analysis in this work. Banar developed the original implementation in [19] which formed the basis for this work, contributed to the supervision of the student projects, contributed to the data analysis, and led the technical writing in this paper.
Ethics declarations
The authors declare that they have no conflicts of interest related to this work.
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Nick Bryan-Kinns received the B. Sc. degree in computer science and the M. Sc. degree in human-computer interaction from King's College London, UK, in 1993 and 1994, respectively, and the Ph. D. degree in human-computer interaction from Queen Mary and Westfield College, University of London, UK, in 1998. He is a professor of creative computing at the Creative Computing Institute, University of the Arts London, UK. He is a Fellow of the Royal Society of Arts, Fellow of the British Computer Society, Turing Fellow at the Alan Turing Institute, Senior Member of the Association of Computing Machinery (ACM), and Chartered Engineer.
His research interests include explainable AI, interaction design, mutual engagement, interactive art and cross-cultural design.
E-mail: n.bryankinns@arts.ac.uk (Corresponding author)
ORCID iD: 0000-0002-1382-2914
Bingyuan Zhang received the B. Sc. degree in software engineering from Anhui Normal University, China, in 2020. She is a master student in artificial intelligence in the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK.
Her research interests include the interpretability of music generation models, anime face generation models, and machine learning.
Songyan Zhao received the B. Sc. degree in computer science and mathematics from Carleton College, USA, in 2023. He is a master student in computer science at the UCLA Samueli School of Engineering, USA.
His research interests include machine learning and AI music composition.
Berker Banar received the B. Sc. degree in electrical and electronics engineering from Bilkent University, Turkey, in 2016, and the M. Sc. degree in electronic production and design from Berklee College of Music, USA, in 2019. He is currently a Ph. D. researcher in computer science at the AI and Music CDT, Queen Mary University of London, UK, and an enrichment student at the Alan Turing Institute, UK. He has previously held research internships at Sony, Bose, and Northwestern University.
His research interests include machine learning, deep learning, optimisation, generative modelling, music generation and computational creativity.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bryan-Kinns, N., Zhang, B., Zhao, S. et al. Exploring Variational Auto-encoder Architectures, Configurations, and Datasets for Generative Music Explainable AI. Mach. Intell. Res. 21, 29–45 (2024). https://doi.org/10.1007/s11633-023-1457-1