DOI: 10.1145/3549737.3549754

Transformer-Based Music Language Modelling and Transcription

Published: 09 September 2022

Abstract

Automatic Music Transcription (AMT) is the process of converting an audio signal into some form of music notation. This challenging task requires significant prior knowledge and an understanding of music language. In this paper, we examine Transformer-based approaches to AMT on piano recordings that work by learning music language representations. We propose a new Music Language Modelling (MusicLM) pre-training approach for Transformers, based on an appropriately defined transcription error-correction task, which enables transfer learning across a variety of musical tasks. Furthermore, we propose a novel AMT model that exploits a BERT Transformer for the MusicLM problem, demonstrating the potential of transfer learning from Natural Language to MusicLM. We apply the Transformer to a Masked MusicLM task and obtain musically coherent results. We also replace the RNNs used in current AMT models with pre-trained BERT-based Transformers, achieving improvements in AUC.



Published In

SETN '22: Proceedings of the 12th Hellenic Conference on Artificial Intelligence
September 2022, 450 pages
ISBN: 9781450395977
DOI: 10.1145/3549737

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. automatic music transcription
  2. deep learning
  3. music language modelling
  4. transformers

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SETN 2022
