
DOI: 10.1145/3474085.3475405
Research article · Open access

ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data

Published: 17 October 2021

Abstract

Most current supervised automatic music transcription (AMT) models lack the ability to generalize: they have trouble transcribing real-world music recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which addresses this issue by leveraging the large amount of unlabelled music recordings available. ReconVAT combines a reconstruction loss with virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting on the string-part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% on the note-wise and note-with-offset-wise metrics respectively, which translates into improvements of 22.2% and 62.5% over the supervised baseline model. Our framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications where new data becomes available constantly.
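The note-wise F1 metric reported above can be sketched in a few lines. This is not the paper's evaluation code; it is a minimal illustration of the standard mir_eval-style note-wise matching (a predicted note counts as correct when its pitch matches a reference note and its onset falls within a ±50 ms tolerance), and `note_f1` is a hypothetical helper name introduced here for illustration.

```python
def note_f1(ref_notes, est_notes, onset_tol=0.05):
    """Note-wise F1 sketch. ref_notes/est_notes: lists of
    (onset_seconds, midi_pitch) tuples; onset_tol is the +/-50 ms window."""
    matched = set()  # indices of reference notes already claimed
    tp = 0
    for onset, pitch in est_notes:
        for i, (r_onset, r_pitch) in enumerate(ref_notes):
            if i in matched:
                continue
            # correct if pitch matches and onset is within tolerance
            if r_pitch == pitch and abs(r_onset - onset) <= onset_tol:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(est_notes) if est_notes else 0.0
    recall = tp / len(ref_notes) if ref_notes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with a reference of two notes and a prediction that hits only the first one within tolerance, both precision and recall are 0.5, giving an F1 of 0.5. The note-with-offset-wise metric used in the paper additionally requires the note's offset (end time) to match within a tolerance, which is why its scores are systematically lower.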

Supplementary Material

ZIP File (mfp1388aux.zip)
Demo page is available at:
MP4 File (MM21-mfp1388.mp4)
Presentation video for the paper titled ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data





Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. audio processing
  2. automatic music transcription
  3. music information retrieval
  4. semi-supervised training
  5. virtual adversarial training


Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) Annotation-Free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10688348
  • (2024) Harmonic Frequency-Separable Transformer for Instrument-Agnostic Music Transcription. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10688217
  • (2024) Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription. ICASSP 2024, 1291-1295. DOI: 10.1109/ICASSP48485.2024.10446141
  • (2024) Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model. Expert Systems with Applications, Vol. 249, 123640. DOI: 10.1016/j.eswa.2024.123640
  • (2023) High-Quality and Reproducible Automatic Drum Transcription from Crowdsourced Data. Signals, Vol. 4, 4, 768-787. DOI: 10.3390/signals4040042
  • (2023) A U-Net Based Architecture for Automatic Music Transcription. 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), 1-6. DOI: 10.1109/MLSP55844.2023.10285985
  • (2023) MFAE: Masked frame-level autoencoder with hybrid-supervision for low-resource music transcription. 2023 IEEE International Conference on Multimedia and Expo (ICME), 1109-1114. DOI: 10.1109/ICME55011.2023.00194
  • (2023) Multitrack Music Transcription with a Time-Frequency Perceiver. ICASSP 2023, 1-5. DOI: 10.1109/ICASSP49357.2023.10096688
  • (2023) Diffroll: Diffusion-Based Generative Music Transcription with Unsupervised Pretraining Capability. ICASSP 2023, 1-5. DOI: 10.1109/ICASSP49357.2023.10095935
  • (2023) Note and Playing Technique Transcription of Electric Guitar Solos in Real-World Music Performance. ICASSP 2023, 1-5. DOI: 10.1109/ICASSP49357.2023.10095225
