Past review, current progress, and challenges ahead on the cocktail party problem

Yan-min Qian (ORCID: orcid.org/0000-0002-0314-3790)¹, Chao Weng¹, Xuan-kai Chang², Shuai Wang² & Dong Yu¹

An Erratum to this article was published on 01 March 2019

An Erratum to this article was published on 01 April 2018

This article has been updated

Abstract

The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed over the last two decades to attack this problem. We focus our discussion on the speech separation problem, given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models; the conventional multi-channel techniques such as beamforming and multi-channel blind source separation; and the newly developed deep learning-based techniques such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting the information in the microphone array, the acoustic training set, and the language itself, using more powerful models together with better optimization objectives and techniques, will be the approach to solving the cocktail party problem.

Change history

11 June 2018
In the original version of this article, the affiliations are incorrect. The correct affiliations are given above.
The corresponding author’s E-mail address should be yanminqian@sjtu.edu.cn.
19 April 2019
In the original version of this article, the result of DPCL++ (Isik et al., 2016) given in Section 5.6 (Fig. 7) is incorrect. As reported in Isik et al. (2016), the SDR improvement was 10.3 dB rather than 9.4 dB. The best performance reported in Isik et al. (2016) was 10.8 dB, obtained with a more complicated architecture.

References

Abdel-Hamid O, Mohamed A, Jiang H, et al., 2014. Convolutional neural networks for speech recognition. Annual Conf of Int Speech Communication Association, p.1533–1545.
Anguera X, Wooters C, Hernando J, 2007. Acoustic beamforming for speaker diarization of meetings. IEEE Trans Audio Speech Lang Process, 15(7):2011–2022. https://doi.org/10.1109/TASL.2007.902460
Applebaum S, 1976. Adaptive arrays. IEEE Trans Antennas Propag, 24(9):585–598. https://doi.org/10.1109/TAP.1976.1141417
Barker J, Ma N, Coy A, et al., 2010. Speech fragment decoding techniques for simultaneous speaker identification and speech recognition. Comput Speech Lang, 24(1):94–111. https://doi.org/10.1016/j.csl.2008.05.003
Behnke S, 2003. Discovering hierarchical speech features using convolutional non-negative matrix factorization. Int Joint Conf on Neural Networks, p.2758–2763. https://doi.org/10.1109/IJCNN.2003.1224004
Bello RWJ, 2010. Identifying repeated patterns in music using sparse convolutive non-negative matrix factorization. 11th Int Society for Music Information Retrieval Conf, p.123–128.
Benesty J, Chen J, Huang Y, et al., 2007. On microphone-array beamforming from a MIMO acoustic signal processing perspective. IEEE Trans Audio Speech Lang Process, 15(3):1053–1065. https://doi.org/10.1109/TASL.2006.885251
Benesty J, Chen J, Huang Y, 2008. Automatic Speech Recognition: a Deep Learning Approach. Springer Berlin Heidelberg, New York, USA.
Bi M, Qian Y, Yu K, 2015. Very deep convolutional neural networks for LVCSR. 16th Annual Conf of Int Speech Communication Association, p.3259–3263.
Bregman AS, 1990. Auditory scene analysis. In: Smelzer NJ, Bates PB (Eds.), International Encyclopedia of the Social and Behavioral Sciences. Elsevier, Amsterdam.
Brown GJ, Cooke M, 1994. Computational auditory scene analysis. Comput Speech Lang, 8(4):297–336. https://doi.org/10.1006/csla.1994.1016
Capon J, 1969. High resolution frequency-wavenumber spectrum analysis. Proc IEEE, 57:1408–1418. https://doi.org/10.1109/PROC.1969.7278
Carter GC, Nuttall AH, Cable PG, 1973. The smoothed coherence transform. Proc IEEE, 61:1497–1498. https://doi.org/10.1109/PROC.1973.9300
Chang X, Qian Y, Yu D, 2018. Adaptive permutation invariant training with auxiliary information for monaural multi-talker speech recognition. Int Conf on Acoustics, Speech, and Signal Processing, in press.
Chen J, Benesty J, Huang Y, 2006. Time delay estimation in room acoustic environments: an overview. EURASIP J Adv Signal Process, 2006:026503. https://doi.org/10.1155/ASP/2006/26503
Chen N, Qian Y, Yu K, 2015. Multi-task learning for text-dependent speaker verification. Annual Conf of Int Speech Communication Association, p.185–189.
Chen Z, 2017. Single Channel Auditory Source Separation with Neural Network. PhD Thesis, Columbia University, New York, USA.
Chen Z, Ellis DP, 2013. Speech enhancement by sparse, low-rank, and dictionary spectrogram decomposition. Workshop on Applications of Signal Processing to Audio and Acoustics, p.1–4. https://doi.org/10.1109/WASPAA.2013.6701883
Chen Z, McFee B, Ellis DP, 2014. Speech enhancement by low-rank and convolutive dictionary spectrogram decomposition. Annual Conf of Int Speech Communication Association, p.2833–2837.
Chen Z, Li J, Xiao X, et al., 2017a. Cracking the cocktail party problem by multi-beam deep attractor network. IEEE Workshop on Automatic Speech Recognition and Understanding, p.437–444.
Chen Z, Luo Y, Mesgarani N, 2017b. Deep attractor network for single-microphone speaker separation. Int Conf on Acoustics, Speech, and Signal Processing, p.246–250. https://doi.org/10.1109/ICASSP.2017.7952155
Chen Z, Droppo J, Li J, et al., 2017c. Progressive joint modeling in unsupervised single-channel overlapped speech recognition. http://arxiv.org/abs/1707.07048
Cherry EC, 1953. Some experiments on the recognition of speech, with one and with two ears. J Acoust Soc Am, 25(5):975–979. https://doi.org/10.1121/1.1907229
Cooke M, Hershey JR, Rennie SJ, 2010. Monaural speech separation and recognition challenge. Comput Speech Lang, 24(1):1–15. https://doi.org/10.1016/j.csl.2009.02.006
Dehak N, Kenny PJ, Dehak R, et al., 2011. Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process, 19(4):788–798. https://doi.org/10.1109/TASL.2010.2064307
Doclo S, Moonen M, 2003. Design of far-field and near-field broadband beamformers using eigenfilters. IEEE Signal Process Lett, 83(12):2641–2673. https://doi.org/10.1016/j.sigpro.2003.07.005
Drude L, Haeb-Umbach R, 2017. Tight integration of spatial and spectral features for BSS with deep clustering embeddings. Annual Conf of Int Speech Communication Association, p.2650–2654.
Du J, Tu Y, Xu Y, et al., 2014. Speech separation of a target speaker based on deep neural networks. Int Conf on Signal Processing, p.473–477. https://doi.org/10.1109/ICOSP.2014.7015050
Ellis DPW, 1996. Prediction-Driven Computational Auditory Scene Analysis. PhD Thesis, Massachusetts Institute of Technology, Cambridge, USA.
Ephraim Y, Malah D, 1985. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Audio Speech Lang Process, 33(2):443–445. https://doi.org/10.1109/TASSP.1985.1164550
Erdogan H, Hershey JR, Watanabe S, et al., 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. Int Conf on Acoustics Speech and Signal Processing, p.708–712. https://doi.org/10.1109/ICASSP.2015.7178061
Erdogan H, Hershey J, Watanabe S, et al., 2016. Improved MVDR beamforming using single-channel mask prediction networks. Annual Conf of Int Speech Communication Association, p.1981–1985.
Erdogan H, Hershey JR, Watanabe S, et al., 2017. Deep recurrent networks for separation and recognition of single-channel speech in nonstationary background audio. New Era for Robust Speech Recognition, p.165–186. https://doi.org/10.1007/978-3-319-64680-0_7
Fischer S, Simmer KU, 1996. Beamforming microphone arrays for speech acquisition in noisy environments. Speech Commun, 20(3-4):215–227. https://doi.org/10.1016/S0167-6393(96)00054-4
Frost OL, 1972. An algorithm for linearly constrained adaptive array processing. Proc IEEE, 60(8):926–935. https://doi.org/10.1109/PROC.1972.8817
Gannot S, Burshtein D, Weinstein E, 2001. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans Signal Process, 49(8):1614–1626. https://doi.org/10.1109/78.934132
Gannot S, Burshtein D, Weinstein E, 2004. Analysis of the power spectral deviation of the general transfer function GSC. IEEE Trans Signal Process, 52(4):1115–1120. https://doi.org/10.1109/TSP.2004.823487
Ghahramani Z, Jordan MI, 1996. Factorial hidden Markov models. NIPS, p.472–478.
Hassab JC, Boucher RE, 1981. Performance of the generalized cross correlator in the presence of a strong spectral peak in the signal. IEEE Trans Audio Speech Lang Process, 29(3):549–555. https://doi.org/10.1109/TASSP.1981.1163613
Hershey JR, Kristjansson T, Rennie S, et al., 2007. Single channel speech separation using factorial dynamics. NIPS, p.593–600.
Hershey JR, Rennie SJ, Olsen PA, et al., 2010. Super-human multi-talker speech recognition: a graphical modeling approach. Comput Speech Lang, 24(1):45–66. https://doi.org/10.1016/j.csl.2008.11.001
Hershey JR, Chen Z, Le Roux J, et al., 2016. Deep clustering: discriminative embeddings for segmentation and separation. Int Conf on Acoustics Speech and Signal Processing, p.31–35. https://doi.org/10.1109/ICASSP.2016.7471631
Heymann J, Drude L, Haeb-Umbach R, 2017. A generic neural acoustic beamforming architecture for robust multi-channel speech processing. Comput Speech Lang, 46(C):374–385. https://doi.org/10.1016/j.csl.2016.11.007
Hinton G, Deng L, Yu D, et al., 2012. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag, 29(6):82–97. https://doi.org/10.1109/MSP.2012.2205597
Hoyer PO, 2004. Non-negative matrix factorization with sparseness constraints. J Mach Learn Res, 5:1457–1469.
Hu G, Wang D, 2004. Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans Neur Netw, 15(5):1135–1150. https://doi.org/10.1109/TNN.2004.832812
Hu G, Wang D, 2008. Segregation of unvoiced speech from nonspeech interference. J Acoust Soc Am, 124(2):1306–1319. https://doi.org/10.1121/1.2939132
Hu G, Wang D, 2010. A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans Audio Speech Lang Process, 18(8):2067–2079. https://doi.org/10.1109/TASL.2010.2041110
Hu K, Wang D, 2013. An unsupervised approach to cochannel speech separation. IEEE Trans Audio Speech Lang Process, 21(1):122–131. https://doi.org/10.1109/TASL.2012.2215591
Hu Y, Loizou PC, 2007. Subjective comparison and evaluation of speech enhancement algorithms. Speech Commun, 49(7):588–601. https://doi.org/10.1016/j.specom.2006.12.006
Hu Y, Loizou PC, 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans Audio Speech Lang Process, 16(1):229–238. https://doi.org/10.1109/TASL.2007.911054
Huang Z, Wang S, Qian Y, 2018. Joint i-vector with end-to-end system for short duration text-independent speaker verification. Int Conf on Acoustics, Speech, and Signal Processing, in press.
Hyvarinen A, Karhunen J, Oja E, 2001. Independent Component Analysis. John Wiley & Sons, Inc, New York, USA.
Isik Y, Roux JL, Chen Z, et al., 2016. Single-channel multi-speaker separation using deep clustering. Annual Conf of Int Speech Communication Association, p.545–549. https://doi.org/10.21437/Interspeech.2016-1176
Kellermann W, 1997. Strategies for combining acoustic echo cancellation and adaptive beamforming microphone arrays. Int Conf on Acoustics Speech and Signal Processing, p.219–222. https://doi.org/10.1109/ICASSP.1997.599608
Kim T, Attias HT, Lee SY, et al., 2006. Blind source separation exploiting higher-order frequency dependencies. IEEE Trans Audio Speech Lang Process, 15(4):70–79. https://doi.org/10.1109/TASL.2006.872618
Kjems U, Boldt JB, Pedersen MS, et al., 2009. Role of mask pattern in intelligibility of ideal binary-masked noisy speech. J Acoust Soc Am, 126(3):1415–1426. https://doi.org/10.1121/1.3179673
Knapp CK, Carter GC, 1976. The generalized correlation method for estimation of time delay. IEEE Trans Audio Speech Lang Process, 24(4):320–327. https://doi.org/10.1109/TASSP.1976.1162830
Kolbæk M, Yu D, Tan ZH, et al., 2017a. Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training. IEEE Int Workshop on Machine Learning for Signal Processing. http://arxiv.org/abs/1708.09588
Kolbæk M, Yu D, Tan ZH, et al., 2017b. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE Trans Audio Speech Lang Process, 25(10):1901–1913. https://doi.org/10.1109/TASLP.2017.2726762
Kristjansson T, Hershey J, Olsen P, et al., 2006. Superhuman multi-talker speech recognition: the IBM 2006 speech separation challenge system. Int Conf on Spoken Language Processing, Paper 1775-Mon1WeS.7.
Kuhl PK, 1991. Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Percept Psychol, 50(2):93–107. https://doi.org/10.3758/BF03212211
Larcher A, Lee KA, Ma B, et al., 2014. Text-dependent speaker verification: classifiers, databases and RSR2015. Speech Commun, 60:56–77. https://doi.org/10.1016/j.specom.2014.03.001
Lee DD, Seung HS, 2001. Algorithms for non-negative matrix factorization. NIPS, p.556–562.
Lee TW, 1998. Independent Component Analysis—Theory and Applications. Kluwer Academic Publishers, Boston, USA.
Lei Y, Scheffer N, Ferrer L, et al., 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network. Int Conf on Acoustics Speech and Signal Processing, p.1695–1699. https://doi.org/10.1109/ICASSP.2014.6853887
Li P, Guan Y, Wang S, et al., 2010. Monaural speech separation based on MAXVQ and CASA for robust speech recognition. Comput Speech Lang, 24(1):30–44. https://doi.org/10.1016/j.csl.2008.05.005
Liu Y, Qian Y, Chen N, et al., 2015. Deep feature for text-dependent speaker verification. Speech Commun, 73:1–13. https://doi.org/10.1016/j.specom.2015.07.003
Lovekin JM, Yantorno RE, Krishnamachari KR, et al., 2001. Developing usable speech criteria for speaker identification technology. Int Conf on Acoustics Speech and Signal Processing, p.421–424. https://doi.org/10.1109/ICASSP.2001.940857
Mandel MI, Weiss RJ, Ellis DPW, 2010. Model-based expectation maximization source separation and localization. IEEE Trans Audio Speech Lang Process, 18(2):382–394. https://doi.org/10.1109/TASL.2009.2029711
McDermott JH, 2009. The cocktail party problem. Curr Biol, 19(22):R1024–R1027.
Mesgarani N, Chang EF, 2012. Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397):233–236. https://doi.org/10.1038/nature11020
Mowlaee P, Saeidi R, Tan ZH, et al., 2010. Joint single-channel speech separation and speaker identification. Int Conf on Acoustics Speech and Signal Processing, p.4430–4433. https://doi.org/10.1109/ICASSP.2010.5495619
Mowlaee P, Saeidi R, Christensen MG, et al., 2012. A joint approach for single-channel speaker identification and speech separation. IEEE Trans Audio Speech Lang Process, 20(9):2586–2601. https://doi.org/10.1109/TASL.2012.2208627
Narayanan A, Wang D, 2013. Ideal ratio mask estimation using deep neural networks for robust speech recognition. Int Conf on Acoustics Speech and Signal Processing, p.7092–7096. https://doi.org/10.1109/ICASSP.2013.6639038
Ono N, 2011. Stable and fast update rules for independent vector analysis based on auxiliary function technique. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. https://doi.org/10.1109/ASPAA.2011.6082320
Peddinti V, Povey D, Khudanpur S, 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. Annual Conf of Int Speech Communication Association, p.3214–3218.
Pedersen MS, Larsen J, Kjems U, et al., 2007. A Survey of Convolutive Blind Source Separation Methods. Springer Press, New York, USA.
Qian YM, Bi M, Tan T, et al., 2016. Very deep convolutional neural networks for noise robust speech recognition. IEEE Trans Audio Speech Lang Process, 24(12):2263–2276. https://doi.org/10.1109/TASLP.2016.2602884
Qian YM, Chang XK, Yu D, 2017. Single-channel multi-talker speech recognition with permutation invariant training. http://arxiv.org/abs/1707.06527
Qian YM, Tan T, Hu H, et al., 2018. Noise robust speech recognition on Aurora4 by humans and machines. Int Conf on Acoustics, Speech, and Signal Processing, in press.
Raj B, Virtanen T, Chaudhuri S, et al., 2010. Non-negative matrix factorization based compensation of music for automatic speech recognition. Annual Conf of Int Speech Communication Association, p.717–720.
Rennie SJ, Hershey JR, Olsen PA, 2010. Single-channel multitalker speech recognition. IEEE Signal Process Mag, 27(6):66–80. https://doi.org/10.1109/MSP.2010.938081
Reynolds DA, Quatieri TF, Dunn RB, 2000. Speaker verification using adapted gaussian mixture models. Dig Signal Process, 10(1-3):19–41. https://doi.org/10.1006/dspr.1999.0361
Rix AW, Beerends JG, Hollier MP, et al., 2001. Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs. Int Conf on Acoustics, Speech, and Signal Processing, p.749–752. https://doi.org/10.1109/ICASSP.2001.941023
Roth P, 1971. Effective measurements using digital signal analysis. IEEE Spectr, 8(4):62–70.
Sainath TN, Mohamed A, Kingsbury B, et al., 2013. Deep convolutional neural networks for LVCSR. Int Conf on Acoustics Speech and Signal Processing, p.8614–8618.
Sainath TN, Vinyals O, Senior A, et al., 2015. Convolutional, long short-term memory, fully connected deep neural networks. Int Conf on Acoustics Speech and Signal Processing, p.4580–4584. https://doi.org/10.1109/ICASSP.2015.7178838
Sawada H, Araki S, Makino S, 2007. A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, p.139–142. https://doi.org/10.1109/ASPAA.2007.4393012
Schmidt MN, Olsson RK, 2006. Single-channel speech separation using sparse non-negative matrix factorization. Annual Conf of Int Speech Communication Association, Paper 1652-Thu2FoP.10.
Schuller B, Weninger F, Wöllmer M, et al., 2010. Nonnegative matrix factorization as noise-robust feature extractor for speech recognition. Int Conf on Acoustics Speech and Signal Processing, p.4562–4565. https://doi.org/10.1109/ICASSP.2010.5495567
Sercu T, Puhrsch C, Kingsbury B, et al., 2016. Very deep multilingual convolutional neural networks for LVCSR. Int Conf on Acoustics Speech and Signal Processing, p.4955–4959. https://doi.org/10.1109/ICASSP.2016.7472620
Shao Y, Wang D, 2003. Co-channel speaker identification using usable speech extraction based on multi-pitch tracking. Int Conf on Acoustics, Speech, and Signal Processing, p.205–208. https://doi.org/10.1109/ICASSP.2003.1202330
Shao Y, Wang D, 2006. Model-based sequential organization in cochannel speech. IEEE Trans Audio Speech Lang Process, 14(1):289–298. https://doi.org/10.1109/TSA.2005.854106
Souden M, Benesty J, Affes S, 2010. On optimal frequency-domain multichannel linear filtering for noise reduction. IEEE Trans Signal Process, 18(2):260–276. https://doi.org/10.1109/TASL.2009.2025790
Souden M, Araki S, Kinoshita K, et al., 2013. A multichannel MMSE-based framework for speech source separation and noise reduction. IEEE Trans Signal Process, 21(9):1913–1928. https://doi.org/10.1109/TASL.2013.2263137
Sydow C, 1994. Broadband beamforming for a microphone array. J Acoust Soc Am, 96(8):845–849. https://doi.org/10.1121/1.410323
Taal CH, Hendriks RC, Heusdens R, et al., 2010. A short-time objective intelligibility measure for time-frequency weighted noisy speech. Int Conf on Acoustics Speech and Signal Processing, p.4214–4217. https://doi.org/10.1109/ICASSP.2010.5495701
Tan T, Qian Y, Yu D, 2018. Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition. Int Conf on Acoustics, Speech, and Signal Processing, in press.
Tu Y, Du J, Xu Y, et al., 2014a. Deep neural network based speech separation for robust speech recognition. Int Conf on Signal Processing, p.532–536. https://doi.org/10.1109/ICOSP.2014.7015061
Tu Y, Du J, Xu Y, et al., 2014b. Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers. Int Symp on Chinese Spoken Language Processing, p.250–254. https://doi.org/10.1109/ISCSLP.2014.6936615
Variani E, Lei X, McDermott E, et al., 2014. Deep neural networks for small footprint text-dependent speaker verification. Int Conf on Acoustics Speech and Signal Processing, p.4052–4056. https://doi.org/10.1109/ICASSP.2014.6854363
Vincent E, Gribonval R, Févotte C, 2006. Performance measurement in blind audio source separation. IEEE Trans Audio Speech Lang Process, 14(4):1462–1469. https://doi.org/10.1109/TSA.2005.858005
Virtanen T, 2006. Speech recognition using factorial hidden Markov models for separation in the feature space. Annual Conf of Int Speech Communication Association, Paper 1850-Mon1WeS.5.
Virtanen T, 2007. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans Audio Speech Lang Process, 15(3):1066–1074. https://doi.org/10.1109/TASL.2006.885253
Wang D, 2005. On ideal binary mask as the computational goal of auditory scene analysis. In: Divenyi P (Ed.), Speech Separation by Humans and Machines. Springer, Boston, USA, p.181–197. https://doi.org/10.1007/0-387-22794-6_12
Wang D, Brown GJ, 2006. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, New York, USA.
Wang S, Qian Y, Yu K, 2018. Focal KL-divergence based dilated convolutional neural networks for co-channel speaker identification. Int Conf on Acoustics, Speech, and Signal Processing, in press.
Wang Y, Narayanan A, Wang D, 2014. On training targets for supervised speech separation. IEEE Trans Audio Speech Lang Process, 22(12):1849–1858. https://doi.org/10.1109/TASLP.2014.2352935
Weng C, Yu D, Seltzer ML, et al., 2015. Deep neural networks for single-channel multi-talker speech recognition. IEEE Trans Audio Speech Lang Process, 23(10):1670–1679. https://doi.org/10.1109/TASLP.2015.2444659
Weninger F, Erdogan H, Watanabe S, et al., 2015. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Int Conf on Latent Variable Analysis and Signal Separation, p.91–99. https://doi.org/10.1007/978-3-319-22482-4_11
Xiao X, Zhao SK, Jones DL, et al., 2017. On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition. Int Conf on Acoustics Speech and Signal Processing, p.3246–3250. https://doi.org/10.1109/ICASSP.2017.7952756
Xiong W, Droppo J, Huang X, et al., 2016. Achieving human parity in conversational speech recognition. http://arxiv.org/abs/1610.05256
Xu Y, Du J, Dai LR, et al., 2014. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process Lett, 21(1):65–68. https://doi.org/10.1109/LSP.2013.2291240
Yilmaz O, Rickard S, 2004. Blind separation of speech mixtures via time-frequency masking. IEEE Trans Signal Process, 52(7):1830–1847. https://doi.org/10.1109/TSP.2004.828896
Yu D, Deng L, 2014. Automatic Speech Recognition: a Deep Learning Approach. Springer, New York, USA.
Yu D, Li JY, 2017. Recent progresses in deep learning based acoustic models. IEEE/CAA J Automat Sin, 4(3):396–409. https://doi.org/10.1109/JAS.2017.7510508
Yu D, Xiong W, Droppo J, et al., 2016. Deep convolutional neural networks with layer-wise context expansion and attention. Annual Conf of Int Speech Communication Association, p.17–21. https://doi.org/10.21437/Interspeech.2016-251
Yu D, Chang X, Qian Y, 2017a. Recognizing multi-talker speech with permutation invariant training. Annual Conf of Int Speech Communication Association, p.2456–2460.
Yu D, Kolbæk M, Tan ZH, et al., 2017b. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. Int Conf on Acoustics, Speech and Signal Processing, p.241–245. https://doi.org/10.1109/ICASSP.2017.7952154
Yu F, Koltun V, 2015. Multi-scale context aggregation by dilated convolutions. http://arxiv.org/abs/1511.07122
Zhang C, Koishida K, 2017. End-to-end text-independent speaker verification with triplet loss on short utterances. Annual Conf of Int Speech Communication Association, p.1487–1491. https://doi.org/10.21437/Interspeech.2017-1608
Zhang L, Chen Z, Zheng M, et al., 2011. Robust nonnegative matrix factorization. Front. Electr. Electron. Eng. China, 6(2):192–200. https://doi.org/10.1007/s11460-011-0128-0
Zhao X, Wang Y, Wang D, 2015a. Cochannel speaker identification in anechoic and reverberant conditions. IEEE Trans Audio Speech Lang Process, 23(11):1727–1736. https://doi.org/10.1109/TASLP.2015.2447284
Zhao X, Wang Y, Wang D, 2015b. Deep neural networks for cochannel speaker identification. Int Conf on Acoustics, Speech and Signal Processing, p.4824–4828. https://doi.org/10.1109/ICASSP.2015.7178887
Zhou Y, Qian Y, 2018. Robust mask estimation by integrating neural network-based and clustering-based approaches for adaptive acoustic beamforming. Int Conf on Acoustics, Speech, and Signal Processing, in press.
Zmolikova K, Delcroix M, Kinoshita K, et al., 2017. Speaker-aware neural network based beamformer for speaker extraction in speech mixtures. Annual Conf of Int Speech Communication Association, p.2655–2659. https://doi.org/10.21437/Interspeech.2017-667

Author information

Authors and Affiliations

Tencent AI Lab, Tencent, Bellevue, 98004, USA
Yan-min Qian, Chao Weng & Dong Yu
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Xuan-kai Chang & Shuai Wang

Corresponding author

Correspondence to Yan-min Qian.

Additional information

Project supported by the Tencent and Shanghai Jiao Tong University Joint Project

About this article

Cite this article

Qian, Ym., Weng, C., Chang, Xk. et al. Past review, current progress, and challenges ahead on the cocktail party problem. Frontiers Inf Technol Electronic Eng 19, 40–63 (2018). https://doi.org/10.1631/FITEE.1700814

Received: 08 December 2017
Revised: 17 January 2018
Published: 25 January 2018
Issue Date: January 2018
DOI: https://doi.org/10.1631/FITEE.1700814

Keywords

CLC number

TP391.4
