DOI: 10.1109/ICASSP.2017.7952158
Improving music source separation based on deep neural networks through data augmentation and network blending

Published: 05 March 2017

Abstract

This paper deals with the separation of music into individual instrument tracks, which is known to be a challenging problem. We describe two different deep neural network architectures for this task, a feed-forward and a recurrent one, and show that each of them yields state-of-the-art results on the SiSEC DSD100 dataset. For the recurrent network, we use data augmentation during training and show that even simple separation networks are prone to overfitting if no data augmentation is used. Furthermore, we propose a blending of the two neural network systems in which we linearly combine their raw outputs and then perform multi-channel Wiener filter post-processing. This blending scheme yields the best results reported to date on the SiSEC DSD100 dataset.
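
The abstract describes the blending step only at a high level. The sketch below is a minimal illustration, not the authors' implementation: it linearly combines the raw source estimates of two separation systems and then applies a simple Wiener-style post-filter. The helper names, the blend weight alpha = 0.5, and the single-channel simplification of the multi-channel Wiener filter used in the paper are assumptions made here for illustration.

```python
# Illustrative sketch of the blending idea from the abstract (assumptions, not
# the paper's code): combine the raw magnitude estimates of a feed-forward and
# a recurrent separator, then refine them with a Wiener-style post-filter.
import numpy as np

def blend_estimates(est_ffn, est_rnn, alpha=0.5):
    """Linearly combine the raw source estimates of two networks.

    est_ffn, est_rnn : dicts mapping source name -> magnitude spectrogram
                       (arrays of identical shape, e.g. freq bins x frames).
    alpha            : blend weight for the feed-forward system (assumed value).
    """
    return {src: alpha * est_ffn[src] + (1.0 - alpha) * est_rnn[src]
            for src in est_ffn}

def wiener_post_filter(blended, mix_spec, eps=1e-12):
    """Single-channel Wiener-style post-filter used here as a stand-in for the
    multi-channel Wiener filter of the paper: each source keeps the share of
    the mixture proportional to its blended power estimate."""
    power = {src: np.abs(s) ** 2 for src, s in blended.items()}
    total = sum(power.values()) + eps
    return {src: (p / total) * mix_spec for src, p in power.items()}

# Toy usage with random arrays standing in for STFT magnitudes of a mixture.
rng = np.random.default_rng(0)
shape = (1025, 200)                      # freq bins x time frames
mix = rng.random(shape)
sources = ("vocals", "bass", "drums", "other")
ffn = {s: rng.random(shape) for s in sources}
rnn = {s: rng.random(shape) for s in sources}
refined = wiener_post_filter(blend_estimates(ffn, rnn, alpha=0.5), mix)
```

In the paper the post-processing operates as a multi-channel Wiener filter on the stereo mixture; the single-channel ratio mask above is only a shortened stand-in for that step.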


Cited By

  • (2022) "Perception-Aware Attack," Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 905–919. DOI: 10.1145/3548606.3559350. Online publication date: 7 Nov 2022.
  • (2019) "Deep Learning Approach for Singer Voice Classification of Vietnamese Popular Music," Proceedings of the 10th International Symposium on Information and Communication Technology, pp. 255–260. DOI: 10.1145/3368926.3369700. Online publication date: 4 Dec 2019.

Published In

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, 6527 pages.
Publisher: IEEE Press
