Abstract
Research on neural style transfer and domain translation has demonstrated the ability of deep learning algorithms to manipulate images according to their artistic style. The idea of image translation has been applied to music-style transfer and to timbre transfer between musical instrument recordings; however, the results have not been ideal. The task of instrument timbre transfer generally depends on extracting a separable, manipulable representation of instrument timbre. Because the distinction between a musical note and its timbre is often unclear, samples generated by current timbre-transfer models usually contain irrelevant waveforms. Here, we propose a timbre-transfer method for musical instrument sounds that converts one instrument's sound into another's while preserving note information (duration, pitch, rhythm, etc.). A multichannel attention-guided mechanism enables timbre transfer between spectrograms and guides the generator to capture the most distinguishable components (the harmonic components) during translation. The proposed model also uses a Markov discriminator to optimize the generator, enabling it to accurately learn a spectrogram's higher-order features. Experimental results demonstrate that the proposed instrument timbre-transfer model effectively captures the harmonic components of the target domain and produces explicit high-frequency details.
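To make the two mechanisms named in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: an attention-guided blending head that mixes candidate spectrograms with learned masks (so unattended note content can pass through from the input), and a Markov (patch-based) discriminator that scores local spectrogram patches. The class names, layer sizes, and mask count are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGuidedBlend(nn.Module):
    """Blend n candidate spectrograms with n+1 attention masks.

    The extra (background) mask keeps the input spectrogram, so note
    information (pitch, duration, rhythm) can be preserved while attended
    regions receive the target timbre. Hypothetical sketch, not the paper's code.
    """
    def __init__(self, feat_channels: int, n_masks: int = 4):
        super().__init__()
        self.n_masks = n_masks
        # Content branch: one single-channel candidate spectrogram per foreground mask.
        self.content = nn.Conv2d(feat_channels, n_masks, kernel_size=7, padding=3)
        # Attention branch: n_masks foreground masks plus 1 background mask.
        self.attention = nn.Conv2d(feat_channels, n_masks + 1, kernel_size=1)

    def forward(self, feats: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        masks = F.softmax(self.attention(feats), dim=1)      # (B, n+1, H, W)
        contents = torch.tanh(self.content(feats))            # (B, n,   H, W)
        fg = (masks[:, :self.n_masks] * contents).sum(1, keepdim=True)
        bg = masks[:, self.n_masks:] * x                       # pass input through
        return fg + bg


class PatchDiscriminator(nn.Module):
    """Markov (PatchGAN-style) discriminator: classifies overlapping patches
    of a spectrogram as real/fake rather than producing one global score."""
    def __init__(self, in_channels: int = 1, base: int = 64):
        super().__init__()
        layers, ch = [], in_channels
        for out in (base, base * 2, base * 4):
            layers += [nn.Conv2d(ch, out, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out
        layers += [nn.Conv2d(ch, 1, kernel_size=4, stride=1, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.net(spec)  # (B, 1, h, w) map of per-patch real/fake logits
```

Because each output logit of `PatchDiscriminator` depends only on a local receptive field, the adversarial loss penalizes local texture statistics of the spectrogram, which is one common way to encourage sharper high-frequency (harmonic) detail in the generator's output.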