DOI: 10.1145/3377325.3377539
Short paper

Interactive deep singing-voice separation based on human-in-the-loop adaptation

Published: 17 March 2020

Abstract

This paper presents a deep-learning-based interactive system that separates the singing voice from an input polyphonic music signal. Although deep neural networks have been successful at singing-voice separation, no existing approach based on them allows user interaction to improve the separation quality. We present a framework that lets a user interactively fine-tune the deep neural model at run time to adapt it to the target song. This is enabled by designing unified networks consisting of two U-Net architectures operating on frequency-spectrogram representations: one estimates the spectrogram mask used to extract the singing-voice spectrogram from the input polyphonic spectrogram; the other estimates the fundamental frequency (F0) of the singing voice. Although it is not easy for the user to edit the mask directly, he or she can iteratively correct errors in parts of the visualized F0 trajectory through simple interactions. Our unified networks leverage the user-corrected F0 to improve the rest of the F0 trajectory through model adaptation, which results in better separation quality. We validated this approach in a simulation experiment showing that F0 correction can improve the quality of singing-voice separation, and we conducted a pilot user study in which an expert musician used our system to produce a high-quality singing-voice separation result.
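To make the adaptation mechanism concrete, here is a minimal PyTorch sketch of the idea, not the authors' implementation: a unified model with a spectrogram-mask branch and an F0 branch over a shared encoder, fine-tuned at run time on only those frames whose F0 the user has corrected. All names here (UnifiedSeparator, adapt_to_song, N_PITCH_BINS) are hypothetical, the encoder sharing is an assumed simplification of how the two networks are unified, and the paper's two full U-Nets are replaced by small convolution stacks for brevity.

    # Hypothetical sketch of human-in-the-loop adaptation for joint
    # mask estimation and F0 estimation (not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    N_PITCH_BINS = 361  # assumed F0 quantisation (e.g. ~20 cents/bin)

    class UnifiedSeparator(nn.Module):
        def __init__(self, n_freq=513):
            super().__init__()
            # Shared front end over the magnitude spectrogram (B, 1, F, T);
            # the paper uses full U-Nets here.
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            )
            # Branch 1: soft vocal mask in [0, 1], same shape as the input.
            self.mask_head = nn.Sequential(
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
            )
            # Branch 2: per-frame F0 logits over quantised pitch bins.
            self.f0_head = nn.Conv2d(16, N_PITCH_BINS, (n_freq, 1))

        def forward(self, spec):
            h = self.encoder(spec)
            mask = self.mask_head(h)                # (B, 1, F, T)
            f0_logits = self.f0_head(h).squeeze(2)  # (B, bins, T)
            return mask, f0_logits

    def adapt_to_song(model, spec, corrected_f0, corrected_frames,
                      steps=50, lr=1e-4):
        """Fine-tune on one song using only user-corrected F0 frames.

        corrected_f0: (T,) long tensor of pitch-bin indices;
        corrected_frames: long tensor of frame indices the user fixed.
        Because the mask branch shares the encoder, F0 supervision can
        also improve the separation mask.
        """
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            _, f0_logits = model(spec)
            # Cross-entropy restricted to the corrected frames.
            loss = F.cross_entropy(
                f0_logits[0, :, corrected_frames].T,   # (K, bins)
                corrected_f0[corrected_frames])        # (K,)
            loss.backward()
            opt.step()
        return model

In use, spec would be the mixture magnitude spectrogram shaped (1, 1, n_freq, T). After each round of user correction, adapt_to_song is rerun; multiplying the updated mask by the mixture spectrogram and inverting the STFT with the mixture phase then resynthesises the improved vocal estimate.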


Cited By

  • Efficient Human-in-the-loop System for Guiding DNNs Attention. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI '23), 294–306. DOI: 10.1145/3581641.3584074. Online publication date: 27 March 2023.


    Published In

    IUI '20: Proceedings of the 25th International Conference on Intelligent User Interfaces
    March 2020, 607 pages
    ISBN: 9781450371186
    DOI: 10.1145/3377325

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 17 March 2020


    Author Tags

    1. F0 estimation
    2. deep learning
    3. human-in-the-loop model adaptation
    4. singing-voice separation

    Qualifiers

    • Short-paper

    Funding Sources

    • JST ACCEL

    Conference

    IUI '20

    Acceptance Rates

    Overall Acceptance Rate 746 of 2,811 submissions, 27%
