DOI: 10.1145/3549737.3549754

Transformer-Based Music Language Modelling and Transcription

Published: 09 September 2022

Abstract

Automatic Music Transcription (AMT) is the process of converting an audio signal into some form of music notation. This challenging task requires significant prior knowledge and an understanding of music language. In this paper, we examine Transformer-based approaches to AMT on piano recordings that work by learning music language representations. We propose a new Music Language Modelling (MusicLM) pre-training approach for Transformers, based on an appropriately defined transcription error-correction task, which enables transfer learning across a variety of musical tasks. Furthermore, we propose a novel AMT model that exploits a BERT Transformer for the MusicLM problem, demonstrating the potential of transfer learning from Natural Language to MusicLM. We apply the Transformer to a Masked MusicLM task and obtain musically coherent results. We also replace the RNNs used in current AMT models with pre-trained BERT-based Transformers, achieving improvements in AUC.



Published In

SETN '22: Proceedings of the 12th Hellenic Conference on Artificial Intelligence
September 2022, 450 pages
ISBN: 9781450395977
DOI: 10.1145/3549737

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. automatic music transcription
  2. deep learning
  3. music language modelling
  4. transformers

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SETN 2022
