
DOI: 10.1145/3474085.3475405
Research article · Open access

ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data

Published: 17 October 2021

Abstract

Most current supervised automatic music transcription (AMT) models lack the ability to generalize: they have trouble transcribing real-world music recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which addresses this issue by leveraging the large amount of unlabelled music recordings available. ReconVAT combines a reconstruction loss with virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting on the string-part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% on the note-wise and note-with-offset-wise metrics respectively, which translates into improvements of 22.2% and 62.5% over the supervised baseline model. Our framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications where new data becomes available constantly.
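The note-wise F1 metric reported above can be sketched in a few lines. This is not the paper's evaluation code; it is a minimal illustration of the standard mir_eval-style note-wise matching (a predicted note counts as correct when its pitch matches a reference note and its onset falls within a ±50 ms tolerance), and `note_f1` is a hypothetical helper name introduced here for illustration.

```python
def note_f1(ref_notes, est_notes, onset_tol=0.05):
    """Note-wise F1 sketch. ref_notes/est_notes: lists of
    (onset_seconds, midi_pitch) tuples; onset_tol is the +/-50 ms window."""
    matched = set()  # indices of reference notes already claimed
    tp = 0
    for onset, pitch in est_notes:
        for i, (r_onset, r_pitch) in enumerate(ref_notes):
            if i in matched:
                continue
            # correct if pitch matches and onset is within tolerance
            if r_pitch == pitch and abs(r_onset - onset) <= onset_tol:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(est_notes) if est_notes else 0.0
    recall = tp / len(ref_notes) if ref_notes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with a reference of two notes and a prediction that hits only the first one within tolerance, both precision and recall are 0.5, giving an F1 of 0.5. The note-with-offset-wise metric used in the paper additionally requires the note's offset (end time) to match within a tolerance, which is why its scores are systematically lower.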

Supplementary Material

ZIP File (mfp1388aux.zip)
Demo page is available at:
MP4 File (MM21-mfp1388.mp4)
Presentation video for the paper titled ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data





Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. audio processing
  2. automatic music transcription
  3. music information retrieval
  4. semi-supervised training
  5. virtual adversarial training


Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) Annotation-Free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10688348
  • (2024) Harmonic Frequency-Separable Transformer for Instrument-Agnostic Music Transcription. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10688217
  • (2024) Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription. ICASSP 2024, 1291-1295. DOI: 10.1109/ICASSP48485.2024.10446141
  • (2024) Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model. Expert Systems with Applications, Vol. 249, 123640. DOI: 10.1016/j.eswa.2024.123640
  • (2023) High-Quality and Reproducible Automatic Drum Transcription from Crowdsourced Data. Signals, Vol. 4, 4, 768-787. DOI: 10.3390/signals4040042
  • (2023) A U-Net Based Architecture for Automatic Music Transcription. 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), 1-6. DOI: 10.1109/MLSP55844.2023.10285985
  • (2023) MFAE: Masked frame-level autoencoder with hybrid-supervision for low-resource music transcription. 2023 IEEE International Conference on Multimedia and Expo (ICME), 1109-1114. DOI: 10.1109/ICME55011.2023.00194
  • (2023) Multitrack Music Transcription with a Time-Frequency Perceiver. ICASSP 2023, 1-5. DOI: 10.1109/ICASSP49357.2023.10096688
  • (2023) Diffroll: Diffusion-Based Generative Music Transcription with Unsupervised Pretraining Capability. ICASSP 2023, 1-5. DOI: 10.1109/ICASSP49357.2023.10095935
  • (2023) Note and Playing Technique Transcription of Electric Guitar Solos in Real-World Music Performance. ICASSP 2023, 1-5. DOI: 10.1109/ICASSP49357.2023.10095225
