Semi-supervised Many-to-many Music Timbre Transfer

Published: 01 September 2021 · DOI: 10.1145/3460426.3463590

Abstract

This work presents a music timbre transfer model that aims to transfer the style of a music clip while preserving its semantic content. Compared to existing music timbre transfer models, our model can achieve many-to-many timbre transfer between different instruments. The proposed method is based on an autoencoder framework, which comprises two pretrained encoders trained in a supervised manner and one decoder trained in an unsupervised manner. To learn more representative features for the encoders, we produced a parallel dataset, called MI-Para, synthesized from MIDI files and digital audio workstations (DAWs). Both objective and subjective evaluation results show the effectiveness of the proposed framework. To broaden the application scenario, we also demonstrate that our model can achieve style transfer when trained in a semi-supervised manner with a smaller parallel dataset.
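
As a concrete illustration of the two-encoder, one-decoder design described above, the sketch below wires up a minimal many-to-many transfer in PyTorch: content features are taken from a source clip and recombined with a global timbre code taken from any target clip. All module names, layer sizes, and the mel-spectrogram input format are illustrative assumptions, not the paper's actual architecture, losses, or training procedure.

```python
# Minimal sketch of a many-to-many timbre-transfer autoencoder.
# Assumes mel-spectrogram inputs of shape (batch, n_mels, time); all
# modules and dimensions are hypothetical, not the paper's architecture.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes the semantic (pitch/rhythm) content of a clip, frame by frame."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):                  # (B, n_mels, T)
        return self.net(mel)                 # (B, dim, T)

class TimbreEncoder(nn.Module):
    """Summarizes the instrument timbre of a clip as one global embedding."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),         # average over time -> global code
        )

    def forward(self, mel):                  # (B, n_mels, T)
        return self.net(mel).squeeze(-1)     # (B, dim)

class Decoder(nn.Module):
    """Reconstructs a mel-spectrogram from content frames plus a timbre code."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, content, timbre):
        # Broadcast the global timbre code along time and fuse with content.
        timbre = timbre.unsqueeze(-1).expand(-1, -1, content.size(-1))
        return self.net(torch.cat([content, timbre], dim=1))

# Many-to-many transfer: content from any source clip, timbre from any target.
content_enc, timbre_enc, dec = ContentEncoder(), TimbreEncoder(), Decoder()
src = torch.randn(1, 80, 256)                 # e.g. a piano clip (mel-spectrogram)
tgt = torch.randn(1, 80, 256)                 # e.g. a guitar clip
out = dec(content_enc(src), timbre_enc(tgt))  # piano content with guitar timbre
print(out.shape)                              # torch.Size([1, 80, 256])
```

In the paper's setting, the two encoders are pretrained with supervision and the decoder is trained without labels; the sketch above only shows how their outputs would be combined at transfer time.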


Cited By

  • (2024) Make a song curative: A spatio-temporal therapeutic music transfer model for anxiety reduction. Expert Systems with Applications, 240 (122161). DOI: 10.1016/j.eswa.2023.122161. Online publication date: Apr-2024.
  • (2023) A Novel Probabilistic Diffusion Model Based on the Weak Selection Mimicry Theory for the Generation of Hypnotic Songs. Mathematics, 11(15), 3345. DOI: 10.3390/math11153345. Online publication date: 30-Jul-2023.
  • (2023) Transplayer: Timbre Style Transfer with Flexible Timbre Control. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10096233. Online publication date: 4-Jun-2023.

    Published In

    ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
    August 2021
    715 pages
    ISBN: 9781450384636
    DOI: 10.1145/3460426

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. autoencoder
    2. deep neural networks
    3. music timbre transfer

    Qualifiers

    • Short-paper

    Conference

    ICMR '21

    Acceptance Rates

    Overall Acceptance Rate: 254 of 830 submissions, 31%
