Semi-supervised Many-to-many Music Timbre Transfer

Published: 01 September 2021 · DOI: 10.1145/3460426.3463590

Abstract

This work presents a music timbre transfer model that aims to transfer the style of a music clip while preserving its semantic content. Compared to existing music timbre transfer models, our model can achieve many-to-many timbre transfer between different instruments. The proposed method is based on an autoencoder framework, which comprises two pretrained encoders trained in a supervised manner and one decoder trained in an unsupervised manner. To learn more representative features for the encoders, we produced a parallel dataset, called MI-Para, synthesized from MIDI files and digital audio workstations (DAWs). Both objective and subjective evaluation results show the effectiveness of the proposed framework. To broaden the application scenario, we also demonstrate that our model can achieve style transfer when trained in a semi-supervised manner with a smaller parallel dataset.
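
As a concrete illustration of the two-encoder, one-decoder design described above, the sketch below wires up a minimal many-to-many transfer in PyTorch: content features are taken from a source clip and recombined with a global timbre code taken from any target clip. All module names, layer sizes, and the mel-spectrogram input format are illustrative assumptions, not the paper's actual architecture, losses, or training procedure.

```python
# Minimal sketch of a many-to-many timbre-transfer autoencoder.
# Assumes mel-spectrogram inputs of shape (batch, n_mels, time); all
# modules and dimensions are hypothetical, not the paper's architecture.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes the semantic (pitch/rhythm) content of a clip, frame by frame."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):                  # (B, n_mels, T)
        return self.net(mel)                 # (B, dim, T)

class TimbreEncoder(nn.Module):
    """Summarizes the instrument timbre of a clip as one global embedding."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),         # average over time -> global code
        )

    def forward(self, mel):                  # (B, n_mels, T)
        return self.net(mel).squeeze(-1)     # (B, dim)

class Decoder(nn.Module):
    """Reconstructs a mel-spectrogram from content frames plus a timbre code."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, content, timbre):
        # Broadcast the global timbre code along time and fuse with content.
        timbre = timbre.unsqueeze(-1).expand(-1, -1, content.size(-1))
        return self.net(torch.cat([content, timbre], dim=1))

# Many-to-many transfer: content from any source clip, timbre from any target.
content_enc, timbre_enc, dec = ContentEncoder(), TimbreEncoder(), Decoder()
src = torch.randn(1, 80, 256)                 # e.g. a piano clip (mel-spectrogram)
tgt = torch.randn(1, 80, 256)                 # e.g. a guitar clip
out = dec(content_enc(src), timbre_enc(tgt))  # piano content with guitar timbre
print(out.shape)                              # torch.Size([1, 80, 256])
```

In the paper's setting, the two encoders are pretrained with supervision and the decoder is trained without labels; the sketch above only shows how their outputs would be combined at transfer time.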


Cited By

  • (2024) Make a song curative: A spatio-temporal therapeutic music transfer model for anxiety reduction. Expert Systems with Applications, 240 (122161). DOI: 10.1016/j.eswa.2023.122161. Online publication date: Apr-2024.
  • (2023) A Novel Probabilistic Diffusion Model Based on the Weak Selection Mimicry Theory for the Generation of Hypnotic Songs. Mathematics, 11(15), 3345. DOI: 10.3390/math11153345. Online publication date: 30-Jul-2023.
  • (2023) Transplayer: Timbre Style Transfer with Flexible Timbre Control. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10096233. Online publication date: 4-Jun-2023.

    Published In

    ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
    August 2021
    715 pages
    ISBN: 9781450384636
    DOI: 10.1145/3460426

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. autoencoder
    2. deep neural networks
    3. music timbre transfer

    Qualifiers

    • Short-paper

    Conference

    ICMR '21

    Acceptance Rates

    Overall Acceptance Rate: 254 of 830 submissions, 31%
