DOI: 10.1145/3377325.3377539
Short paper

Interactive deep singing-voice separation based on human-in-the-loop adaptation

Published: 17 March 2020

Abstract

This paper presents a deep-learning-based interactive system that separates the singing voice from an input polyphonic music signal. Although deep neural networks have been successful at singing-voice separation, no existing approach based on them allows user interaction to improve the separation quality. We present a framework that lets a user interactively fine-tune the deep neural model at run time to adapt it to the target song. This is enabled by designing unified networks consisting of two U-Net architectures operating on frequency-spectrogram representations: one estimates the spectrogram mask used to extract the singing-voice spectrogram from the input polyphonic spectrogram; the other estimates the fundamental frequency (F0) of the singing voice. Although it is not easy for the user to edit the mask directly, he or she can iteratively correct errors in parts of the visualized F0 trajectory through simple interactions. Our unified networks leverage the user-corrected F0 to improve the rest of the F0 trajectory through model adaptation, which results in better separation quality. We validated this approach in a simulation experiment showing that F0 correction can improve the quality of singing-voice separation, and we conducted a pilot user study in which an expert musician used our system to produce a high-quality singing-voice separation result.
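To make the adaptation mechanism concrete, here is a minimal PyTorch sketch of the idea, not the authors' implementation: a unified model with a spectrogram-mask branch and an F0 branch over a shared encoder, fine-tuned at run time on only those frames whose F0 the user has corrected. All names here (UnifiedSeparator, adapt_to_song, N_PITCH_BINS) are hypothetical, the encoder sharing is an assumed simplification of how the two networks are unified, and the paper's two full U-Nets are replaced by small convolution stacks for brevity.

    # Hypothetical sketch of human-in-the-loop adaptation for joint
    # mask estimation and F0 estimation (not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    N_PITCH_BINS = 361  # assumed F0 quantisation (e.g. ~20 cents/bin)

    class UnifiedSeparator(nn.Module):
        def __init__(self, n_freq=513):
            super().__init__()
            # Shared front end over the magnitude spectrogram (B, 1, F, T);
            # the paper uses full U-Nets here.
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            )
            # Branch 1: soft vocal mask in [0, 1], same shape as the input.
            self.mask_head = nn.Sequential(
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
            )
            # Branch 2: per-frame F0 logits over quantised pitch bins.
            self.f0_head = nn.Conv2d(16, N_PITCH_BINS, (n_freq, 1))

        def forward(self, spec):
            h = self.encoder(spec)
            mask = self.mask_head(h)                # (B, 1, F, T)
            f0_logits = self.f0_head(h).squeeze(2)  # (B, bins, T)
            return mask, f0_logits

    def adapt_to_song(model, spec, corrected_f0, corrected_frames,
                      steps=50, lr=1e-4):
        """Fine-tune on one song using only user-corrected F0 frames.

        corrected_f0: (T,) long tensor of pitch-bin indices;
        corrected_frames: long tensor of frame indices the user fixed.
        Because the mask branch shares the encoder, F0 supervision can
        also improve the separation mask.
        """
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            _, f0_logits = model(spec)
            # Cross-entropy restricted to the corrected frames.
            loss = F.cross_entropy(
                f0_logits[0, :, corrected_frames].T,   # (K, bins)
                corrected_f0[corrected_frames])        # (K,)
            loss.backward()
            opt.step()
        return model

In use, spec would be the mixture magnitude spectrogram shaped (1, 1, n_freq, T). After each round of user correction, adapt_to_song is rerun; multiplying the updated mask by the mixture spectrogram and inverting the STFT with the mixture phase then resynthesises the improved vocal estimate.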


Cited By

  • Efficient Human-in-the-loop System for Guiding DNNs Attention. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI '23), 294–306. DOI: 10.1145/3581641.3584074. Online publication date: 27 March 2023.


    Published In

    IUI '20: Proceedings of the 25th International Conference on Intelligent User Interfaces
    March 2020, 607 pages
    ISBN: 9781450371186
    DOI: 10.1145/3377325

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 17 March 2020


    Author Tags

    1. F0 estimation
    2. deep learning
    3. human-in-the-loop model adaptation
    4. singing-voice separation

    Qualifiers

    • Short-paper

    Funding Sources

    • JST ACCEL

    Conference

    IUI '20

    Acceptance Rates

    Overall Acceptance Rate 746 of 2,811 submissions, 27%
