DOI: 10.1109/ICASSP.2017.7952158
Improving music source separation based on deep neural networks through data augmentation and network blending

Published: 05 March 2017

Abstract

This paper deals with the separation of music into individual instrument tracks, which is known to be a challenging problem. We describe two different deep neural network architectures for this task, a feed-forward and a recurrent one, and show that each of them yields state-of-the-art results on the SiSEC DSD100 dataset. For the recurrent network, we use data augmentation during training and show that even simple separation networks are prone to overfitting if no data augmentation is used. Furthermore, we propose a blending of the two neural network systems in which we linearly combine their raw outputs and then perform multi-channel Wiener filter post-processing. This blending scheme yields the best results reported to date on the SiSEC DSD100 dataset.
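
The abstract describes the blending step only at a high level. The sketch below is a minimal illustration, not the authors' implementation: it linearly combines the raw source estimates of two separation systems and then applies a simple Wiener-style post-filter. The helper names, the blend weight alpha = 0.5, and the single-channel simplification of the multi-channel Wiener filter used in the paper are assumptions made here for illustration.

```python
# Illustrative sketch of the blending idea from the abstract (assumptions, not
# the paper's code): combine the raw magnitude estimates of a feed-forward and
# a recurrent separator, then refine them with a Wiener-style post-filter.
import numpy as np

def blend_estimates(est_ffn, est_rnn, alpha=0.5):
    """Linearly combine the raw source estimates of two networks.

    est_ffn, est_rnn : dicts mapping source name -> magnitude spectrogram
                       (arrays of identical shape, e.g. freq bins x frames).
    alpha            : blend weight for the feed-forward system (assumed value).
    """
    return {src: alpha * est_ffn[src] + (1.0 - alpha) * est_rnn[src]
            for src in est_ffn}

def wiener_post_filter(blended, mix_spec, eps=1e-12):
    """Single-channel Wiener-style post-filter used here as a stand-in for the
    multi-channel Wiener filter of the paper: each source keeps the share of
    the mixture proportional to its blended power estimate."""
    power = {src: np.abs(s) ** 2 for src, s in blended.items()}
    total = sum(power.values()) + eps
    return {src: (p / total) * mix_spec for src, p in power.items()}

# Toy usage with random arrays standing in for STFT magnitudes of a mixture.
rng = np.random.default_rng(0)
shape = (1025, 200)                      # freq bins x time frames
mix = rng.random(shape)
sources = ("vocals", "bass", "drums", "other")
ffn = {s: rng.random(shape) for s in sources}
rnn = {s: rng.random(shape) for s in sources}
refined = wiener_post_filter(blend_estimates(ffn, rnn, alpha=0.5), mix)
```

In the paper the post-processing operates as a multi-channel Wiener filter on the stereo mixture; the single-channel ratio mask above is only a shortened stand-in for that step.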


Cited By

  • (2022) "Perception-Aware Attack," Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 905–919. DOI: 10.1145/3548606.3559350. Online publication date: 7 Nov 2022.
  • (2019) "Deep Learning Approach for Singer Voice Classification of Vietnamese Popular Music," Proceedings of the 10th International Symposium on Information and Communication Technology, pp. 255–260. DOI: 10.1145/3368926.3369700. Online publication date: 4 Dec 2019.

Published In

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, 6527 pages.
Publisher: IEEE Press
