Dr. Yong XU
Bio: 

I am currently a Principal Researcher (supervised by Dr. Dong Yu) at Tencent America (AI Lab), Bellevue, WA, USA. I previously worked at the University of Surrey, UK, as a Research Fellow (supervised by Prof. Mark D. Plumbley and Prof. Wenwu Wang) for two years. I earned my Ph.D. degree from the University of Science and Technology of China (USTC) in 2015 and studied under a joint Ph.D. program at the Georgia Institute of Technology (Georgia Tech, USA) during 2014-2015. My Ph.D. supervisors were Prof. Chin-Hui Lee (Georgia Tech, USA), Prof. Jun Du (USTC) and Prof. Li-Rong Dai (USTC). I won the 1st prize in the DCASE 2017 challenge for "Large-scale weakly supervised sound event detection for smart cars". I have two ESI highly cited IEEE journal papers. I received the 2018 IEEE SPS Best Paper Award for my work on deep learning based speech enhancement. I am an elected member of the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC) for 2023-2025. I was ranked among the World's Top 2% Scientists in 2022 by Stanford University. I am an IEEE Senior Member. I am currently leading a multilingual ASR team to support GPT-4o-like product development based on LLMs.

My Google Scholar: https://scholar.google.com/citations?user=nCmKPM4AAAAJ&hl=en (6600+ citations, h-index=39, i10-index=61)

Email: yong.xu.ustc@gmail.com

Publications:

Journal papers:

[1] Multi-channel Multi-frame ADL-MVDR for Target Speech Separation 

Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, D. S. Williamson, D. Yu, accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing, Nov. 2021

[2] Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network

Ke Tan, Yong Xu, Shixiong Zhang, Meng Yu, Dong Yu, IEEE Journal of Selected Topics in Signal Processing, 2020

[3] Multi-modal Multi-channel Target Speech Separation,

Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, Dong Yu, IEEE Journal of Selected Topics in Signal Processing, 2020

[4] Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley, accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing, July 2020

[5] Weakly Labelled AudioSet Tagging with Attention Neural Networks

Qiuqiang Kong, Changsong Yu, Yong Xu* (corresponding author), Turab Iqbal, Wenwu Wang, Mark D. Plumbley, accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

[6] Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

Yong Xu, Qiang Huang, Wenwu Wang, Peter Foster, Siddharth Sigtia, Philip J. B. Jackson, Mark D. Plumbley, accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing, July 2016

[7] A Regression Approach to Speech Enhancement Based on Deep Neural Networks. [2018 IEEE SPS Best Paper Award] [citations: 1000+] [ESI highly cited paper]

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015

[8] Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data

Qiuqiang Kong*, Yong Xu* (equal contribution), Iwona Sobieraj, Wenwu Wang, Mark D. Plumbley, accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

[9] An Experimental Study on Speech Enhancement Based on Deep Neural Networks. [citations: 800+] [ESI highly cited papers]

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, January 2014

[10] Hierarchical deep neural network for multivariate regression

Jun Du and Yong Xu, Pattern Recognition, vol. 63, pp. 149-157, March 2017

[11] Joint training of DNNs by incorporating an explicit dereverberation structure for distant speech recognition

Tian Gao, Jun Du, Yong Xu, Cong Liu, Li-Rong Dai, Chin-Hui Lee, EURASIP Journal on Advances in Signal Processing, 2016

[12] Auxiliary Features from Laser-Doppler Vibrometer Sensor for Deep Neural Network Based Robust Speech Recognition

Lei Sun, Jun Du, Zhipeng Xie, Yong Xu, Journal of Signal Processing Systems, Springer, 2017


Conference papers:

[62] Advancing Multi-Talker ASR Performance with Large Language Models

Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu*, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu, SLT2024

[61] SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays

Yiwen Shao, Yong Xu, Sanjeev Khudanpur, Dong Yu, SLT2024

[60] Multi-Channel Multi-Speaker ASR Using Target Speaker Solo Segments

Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Dong Yu, Daniel Povey, Sanjeev Khudanpur, Interspeech2024

[59] LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey, Interspeech2024

[58] SpatialCodec: Neural Spatial Speech Coding

Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu, accepted to ICASSP2024

[57] uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu, accepted to ICASSP2024

[56] NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network for Acoustic Echo Cancellation and Speech Enhancement

Meng Yu, Yong Xu, Chunlei Zhang, Shi-Xiong Zhang, Dong Yu, accepted to ASRU2023

[55] Zoneformer: On-device Neural Beamformer For In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation

Yong Xu, Vinay Kothapally, Meng Yu, Shi-Xiong Zhang, Dong Yu, accepted to Interspeech2023 (Dublin, Ireland)

[54] Deep Neural Mel-Subband Beamformer for In-car Speech Separation 

V Kothapally, Yong Xu, M Yu, SX Zhang, D Yu, accepted to ICASSP2023

[53] EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shixiong Zhang, Yong Xu, SLT2022

[52] Joint AEC and Beamforming with Double-Talk Detection using RNN-Transformer

V Kothapally, Yong Xu, M Yu, SX Zhang, D Yu, accepted to Interspeech2022

[51] Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter   

Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, Yong Xu and Wenwu Wang, accepted to Interspeech2022

[50] Audio-Visual Tracking of Multiple Speakers via a PMBM Filter 

Jinzheng Zhao, Peipei Wu, Xubo Liu, Yong Xu, Lyudmila Mihaylova, Simon Godsill, Wenwu Wang, ICASSP2022

[49] Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation 

Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, Dong Yu, Interspeech2021

[48] MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation 

Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu, Interspeech2021

[47] TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu, Interspeech2021

[46] MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment

Meng Yu, Chunlei Zhang, Yong Xu, Shixiong Zhang, Dong Yu, Interspeech2021

[45] WPD++: an improved neural beamformer for simultaneous speech separation and dereverberation

Zhaoheng Ni, Yong Xu, Meng Yu, Bo Wu, Shixiong Zhang, Dong Yu, Michael I Mandel , accepted to SLT2021

[44] Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising

Jianming Liu, Meng Yu, Yong Xu, Chao Weng, Shi-Xiong Zhang, Lianwu Chen, Dong Yu, accepted to SLT2021

[43] ADL-MVDR: All deep learning MVDR beamformer for target speech separation, [PDF] [Demo]

Zhuohuang Zhang, Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Dong Yu, accepted to ICASSP2021

[42] Neural Spatio-Temporal Beamformer for Target Speech Separation, [PDF] [Demo]

Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu, accepted to Interspeech2020

[41] Audio-visual Multi-channel Recognition of Overlapped Speech

Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng, accepted to Interspeech2020

[40] Far-Field Location Guided Target Speech Extraction using End-to-End Speech Recognition Objectives

Aswin Shanmugam Subramanian, Chao Weng, Meng Yu, Shi-Xiong Zhang, Yong Xu, Shinji Watanabe and Dong Yu, ICASSP2020

[39] Enhancing End-To-End Multi-channel Speech Separation via Spatial Feature Learning,

Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu, ICASSP2020

[38] Self-supervised learning for audio-visual speaker diarization, 

Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang, ICASSP2020

[37] Time Domain Audio Visual Speech Separation,

Jian Wu, Yong Xu, Shi-Xiong Zhang, Lian-Wu Chen, Meng Yu, Lei Xie, Dong Yu, ASRU2019

[36] Improved Speaker-Dependent Separation for CHiME-5 Challenge,

Jian Wu, Yong Xu, Shi-Xiong Zhang, Lian-Wu Chen, Meng Yu, Lei Xie, Dong Yu, Interspeech2019

[35] Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information

Rongzhi Gu, Lianwu Chen, Shi-Xiong Zhang, Jimeng Zheng, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu, Interspeech2019

[34] A comprehensive study of speech separation: spectrogram vs waveform separation

Fahimeh Bahmaninezhad, Jian Wu, Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Interspeech2019

[33] Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks,

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark Plumbley, Philip Jackson, accepted to IJCAI2019 (accept rate=18%)

[32] Joint training of complex ratio mask based beamformer and acoustic model for noise robust ASR,

Yong Xu, Chao Weng, Like Hui, Jianming Liu, Meng Yu, Dan Su, Dong Yu, accepted to ICASSP2019

[31] Acoustic scene generation with conditional sampleRNN,

Qiuqiang Kong, Yong Xu, Turab Iqbal, Yin Cao, Wenwu Wang, Mark Plumbley, accepted to ICASSP2019

[30] An attention-based neural network approach for single channel speech enhancement,

Xiang Hao, Changhao Shan, Yong Xu, Sining Sun, Lei Xie, accepted to ICASSP2019

[29] Large-scale weakly supervised audio classification using gated convolutional neural network,  [pdf] [Rank 1st system in DCASE2017 challenge]

Yong Xu, Qiuqiang Kong, Wenwu Wang and Mark D. Plumbley, accepted to ICASSP2018

[28] A joint separation-classification model for sound event detection of weakly labelled data

Qiuqiang Kong*, Yong Xu* (equal contribution), Wenwu Wang and Mark D. Plumbley, accepted to ICASSP2018

[27] Audio Set classification with attention model: A probabilistic perspective

Qiuqiang Kong*, Yong Xu* (equal contribution), Wenwu Wang and Mark D. Plumbley, accepted to ICASSP2018

[26] Iterative deep neural networks for speaker-independent binaural blind speech separation

Qingju Liu, Yong Xu, Philip Coleman, Philip Jackson, Wenwu Wang, accepted to ICASSP2018

[25] Intelligent signal processing mechanisms for nuanced anomaly detection in action audio-visual data streams

Josef Kittler, Ioannis Kaloskampis, Cemre Zor, Yong Xu*, Yulia Hicks and Wenwu Wang, accepted to ICASSP2018

[24] Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging, 

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang and Mark D. Plumbley, accepted to Interspeech2017

[23] Joint Detection and Classification Convolutional Neural Network (JDC-CNN) on Weakly Labelled Bird Audio Data (BAD)

Qiuqiang Kong, Yong Xu, Mark D. Plumbley, accepted to EUSIPCO2017

[22] Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang and Mark D. Plumbley, IJCNN2017

[21] A joint detection-classification model for audio tagging of weakly labelled data

Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark Plumbley, ICASSP2017

[20] Fast Tagging of Natural Sounds Using Marginal Co-regularization

Qiang Huang, Yong Xu, Philip J. B. Jackson, Wenwu Wang, Mark D. Plumbley, ICASSP2017

[19] Binaural and Log-Power Spectra Features with Deep Neural Networks for Speech-Noise Separation

Alfredo Zermini, Qingju Liu, Yong Xu, Mark D. Plumbley, Dave Betts, Wenwu Wang, MMSP2017

[18] Deep neural network based audio source separation

A. Zermini, Y. Yu, Yong Xu, W. Wang and M. D. Plumbley, 11th IMA International Conference on Mathematics in Signal Processing, 2016

[17] Fully DNN-based Multi-label regression for audio tagging.

Yong Xu, Qiang Huang, Wenwu Wang, Philip J B Jackson, Mark D Plumbley, accepted by DCASE2016 workshop, July 2016

[16] Hierarchical learning for DNN-based acoustic scene classification

Yong Xu, Qiang Huang, Wenwu Wang, Mark D. Plumbley, accepted by DCASE2016 workshop, July 2016

[15] Deep Neural Network for Robust Speech Recognition With Auxiliary Features From Laser-Doppler Vibrometer Sensor

Zhi-Peng Xie, Jun Du, Ian Vince McLoughlin, Yong Xu, Feng Ma, Haikun Wang, ISCSLP2016

[14] Multi-objective learning and Mask-based Post-processing for Deep Neural Network based Speech Enhancement.

Yong Xu, Jun Du, Zhen Huang, Li-Rong Dai, Chin-Hui Lee, accepted, Interspeech2015, Dresden, Germany 

[13] DNN-Based Speech Bandwidth Expansion and Its Application to Adding High Frequency Missing Features for Automatic Speech Recognition of Narrowband Speech.

Kehuang Li, Zhen Huang, Yong Xu and Chin-Hui Lee, accepted, Interspeech2015, Dresden, Germany

[12] Dynamic Noise Aware Training for Speech Enhancement Based on Deep Neural Networks.

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, Interspeech2014, Singapore

[11] Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments. (Best paper candidate)

Tian Gao, Jun Du, Yong Xu, Cong Liu, Li-Rong Dai, Chin-Hui Lee, accepted, LVA/ICA 2015, Liberec, Czech Republic

[10] Robust Speech Recognition with Speech Enhanced Deep Neural Networks

Jun Du, Qing Wang, Tian Gao, Yong Xu, Li-Rong Dai and Chin-Hui Lee, Interspeech2014, Singapore

[9] Cross-language Transfer Learning for Deep Neural Network Based Speech Enhancement

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, ISCSLP2014, Singapore

[8] Speech Separation Based on Improved Deep Neural Networks with Dual Outputs of Speech Features for both Target and Interfering Speakers

Yanhui Tu, Jun Du, Yong Xu, Lirong Dai and Chin-Hui Lee, ISCSLP2014, Singapore

[7] Speech separation of a target speaker based on deep neural networks.

Jun Du, Yanhui Tu, Yong Xu, Li-Rong Dai and Chin-Hui Lee, pp. 532-536, ICSP2014, Hangzhou, China

[6] Deep neural network based speech separation for robust speech recognition.

Yanhui Tu, Jun Du, Yong Xu, Lirong Dai and Chin-Hui Lee, pp. 532-536, Hangzhou, China

[5] Global Variance Equalization for Improving Deep Neural Network Based Speech Enhancement.

Yong Xu, Jun Du, Li-Rong Dai and Chin-Hui Lee, ChinaSIP2014, Xi'an, China

[4] Spoken Term Detection for OOV Terms Based on Phone Fragment.

Yong Xu, Wu Guo, Shan Su and Li-Rong Dai, ICALIP2012, Shanghai, China

[3] Improved Spoken Term Detection by Template-based Confidence Measure.

Shan Su, Wu Guo, Yong Xu and Li-Rong Dai, ICALIP2012, Shanghai, China

[2] A hybrid fragment / syllable-based system for improved OOV term detection.

Yong Xu, Wu Guo and Li-Rong Dai, ISCSLP2012, Hong Kong

[1] Spoken term detection for OOV terms based on tri-phone confusion matrix.

Yong Xu, Wu Guo and Li-Rong Dai, ISCSLP2012, Hong Kong


Patents:

[1] Multi-tap minimum variance distortionless response beamformer with neural networks for target speech separation 

Yong Xu, M Yu, SX Zhang, C Weng, J Liu, D Yu - US Patent App. 16/926,138, 2022

[2] All-deep-learning unified speech front end (ADL-UFE) system for joint speech separation, beamforming, denoising, dereverberation and AEC

Yong Xu, M Yu, SX Zhang, D Yu - US patent application filed, 2022

[3] All-deep-learning MVDR for speech separation

Yong Xu, M Yu, SX Zhang, D Yu - US Patent App. 17/038,498, 2022

[4] Multi-modal framework for multi-channel target speech separation

SX Zhang, Y Xu, M Yu, D Yu - US Patent App. 16/901,487, 2021

[5] Inter-channel feature extraction method, audio separation method and apparatus, and computing device

Rongzhi Gu, S Zhang, L Chen, Y Xu, M Yu, D Su, D. Yu. US Patent App. 17/401,125, 2021

[6] Speech separation method and system, US patent, US 20160189730A1

Jun Du, Yong Xu, Yanhui Tu, Li-Rong Dai, Zhiguo Wang, Yu Hu, Qingfeng Liu, June 2016



Research and Work Experience:

Tencent America (AI Lab), Bellevue, WA, USA    Principal Research Scientist and Tech Lead   2023 – present

I lead a multilingual ASR team supporting GPT-4o-like product development based on large language models (LLMs).

Tencent America (AI Lab), Bellevue, WA, USA    Principal Research Scientist   2021 – 2023

Multi-channel speech enhancement, separation, dereverberation and recognition; I proposed the ADL-MVDR and RNN beamformers.

Tencent America (AI Lab), Bellevue, WA, USA    Senior Research Scientist   2018 – 2021  

Multi-modality speech enhancement/separation/de-reverberation/speech recognition

University of Surrey, Guildford, UK    Full-time Research Fellow    2016 – 2018 

Deep learning (DNN/CNN/LSTM, attention, reinforcement learning, generative adversarial networks, etc.) based environmental sound classification and analysis.

iFLYTEK, China       Research Scientist     2015 – 2016

Worked on far-field speech recognition for smart speakers.

Georgia Institute of Technology, USA    Visiting Student     2014 – 2015

Deep neural network based speech enhancement applied to automatic speech recognition (ASR); advised by Prof. Chin-Hui Lee.

Bosch Research Center, CA, USA    Short Internship    Sept. 2014 – Oct. 2014

Deep neural network based speech enhancement applied to automatic speech recognition (ASR). Supervised by Dr. Pongtep Angkititrakul and Dr. Fuliang Weng.

National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China (USTC), China    Ph.D. student    Jul. 2012 – Jun. 2015

DNN-based speech enhancement, co-supervised by Prof. Chin-Hui Lee (Georgia Tech).

I developed a Large Vocabulary Continuous Speech Recognition (LVCSR) system trained on a 2,300-hour English speech database and built a baseline for OOV term detection. MLE, DT, and Tandem systems were built.

National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China (USTC), Hefei, China     Graduate student       Sept. 2010 – Jul. 2012

Worked on Spoken Term Detection (STD) for Out-Of-Vocabulary (OOV) words; I used a tri-phone confusion matrix and a hybrid fragment/syllable system to improve OOV term detection performance.

National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China (USTC), Hefei, China       Undergraduate student        Mar. 2010 – Jul. 2010

I completed my undergraduate thesis project on room acoustic impulse responses.