
Cheng et al., 2021 - Google Patents

Improving multimodal speech enhancement by incorporating self-supervised and curriculum learning


Document ID
8048735950973480338
Author
Cheng Y
He M
Yu J
Feng R
Publication year
2021
Publication venue
ICASSP 2021 (IEEE International Conference on Acoustics, Speech and Signal Processing)

External Links

Snippet

Speech enhancement in realistic scenarios still presents many challenges, such as complex background signals and data limitations. In this paper, we present a co-attention based framework that incorporates self-supervised and curriculum learning to derive the target …
Continue reading at ieeexplore.ieee.org
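The snippet only names the approach; the full text sits behind the IEEE link. As a rough, generic illustration of cross-modal co-attention (not the authors' exact module — the function and variable names below are hypothetical), a symmetric scaled dot-product attention in which the audio and visual streams each attend to the other could be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(audio, visual):
    """Symmetric cross-modal attention: each modality attends to the other.

    audio:  (Ta, d) audio feature sequence
    visual: (Tv, d) visual feature sequence
    Returns context features of shapes (Ta, d) and (Tv, d).
    """
    d = audio.shape[-1]
    # Affinity between every audio frame and every video frame.
    affinity = audio @ visual.T / np.sqrt(d)          # (Ta, Tv)
    audio_ctx = softmax(affinity, axis=1) @ visual    # audio attends to visual
    visual_ctx = softmax(affinity.T, axis=1) @ audio  # visual attends to audio
    return audio_ctx, visual_ctx

rng = np.random.default_rng(0)
a, v = rng.normal(size=(100, 64)), rng.normal(size=(25, 64))
a_ctx, v_ctx = co_attention(a, v)
print(a_ctx.shape, v_ctx.shape)  # (100, 64) (25, 64)
```

A real system would learn query/key/value projections per modality and fuse the attended contexts back into the enhancement network; this sketch only shows the affinity-and-softmax core that the term "co-attention" usually denotes.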

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/62 Methods or arrangements for recognition using electronic means
    • G06K 9/6217 Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/00221 Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/00496 Recognising patterns in signals and combinations thereof
    • G06K 9/00536 Classification; Matching
    • G06K 9/00543 Classification; Matching by matching peak patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 17/30781 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 17/30784 Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
    • G06F 17/30799 Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre, using low-level visual features of the video content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search

Similar Documents

Joze et al. MMTM: Multimodal transfer module for CNN fusion
Lu et al. Audio–visual deep clustering for speech separation
Noulas et al. Multimodal speaker diarization
Barzelay et al. Harmony in motion
Lu et al. Listen and look: Audio–visual matching assisted speech source separation
Gogate et al. DNN driven speaker independent audio-visual mask estimation for speech separation
Phan et al. Spatio-temporal attention pooling for audio scene classification
Li et al. Deep audio-visual speech separation with attention mechanism
Zhu et al. Visually guided sound source separation and localization using self-supervised motion representations
Parekh et al. Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
Zhu et al. Visually guided sound source separation using cascaded opponent filter network
Xiong et al. Look&listen: Multi-modal correlation learning for active speaker detection and speech enhancement
Shi et al. Listen, Think and Listen Again: Capturing Top-down Auditory Attention for Speaker-independent Speech Separation.
Montesinos et al. Vovit: Low latency graph-based audio-visual voice separation transformer
Oya et al. Do we need sound for sound source localization?
Cheng et al. Improving multimodal speech enhancement by incorporating self-supervised and curriculum learning
Zhu et al. V-slowfast network for efficient visual sound separation
Ahmad et al. Speech enhancement for multimodal speaker diarization system
Rahman et al. Weakly-supervised audio-visual sound source detection and separation
Li et al. IIANet: An Intra-and Inter-Modality Attention Network for Audio-Visual Speech Separation
Zhang et al. Audio-visual speech separation with adversarially disentangled visual representation
Luo et al. Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments.
Min et al. Cross-modal attention consistency for video-audio unsupervised learning
Xiong et al. Audio-visual speech separation based on joint feature representation with cross-modal attention
CN116978399A (en) Cross-modal voice separation method and system without visual information during test