Cheng et al., 2021 - Google Patents
Improving multimodal speech enhancement by incorporating self-supervised and curriculum learning (Cheng et al., 2021)
- Document ID
- 8048735950973480338
- Author
- Cheng Y
- He M
- Yu J
- Feng R
- Publication year
- 2021
- Publication venue
- ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Snippet
Speech enhancement in realistic scenarios still faces many challenges, such as complex background signals and data limitations. In this paper, we present a co-attention based framework that incorporates self-supervised and curriculum learning to derive the target …
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00496—Recognising patterns in signals and combinations thereof
- G06K9/00536—Classification; Matching
- G06K9/00543—Classification; Matching by matching peak patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
- G06F17/30799—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre using low-level visual features of the video content
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
Similar Documents
Publication | Title
---|---
Joze et al. | MMTM: Multimodal transfer module for CNN fusion
Lu et al. | Audio–visual deep clustering for speech separation
Noulas et al. | Multimodal speaker diarization
Barzelay et al. | Harmony in motion
Lu et al. | Listen and look: Audio–visual matching assisted speech source separation
Gogate et al. | DNN driven speaker independent audio-visual mask estimation for speech separation
Phan et al. | Spatio-temporal attention pooling for audio scene classification
Li et al. | Deep audio-visual speech separation with attention mechanism
Zhu et al. | Visually guided sound source separation and localization using self-supervised motion representations
Parekh et al. | Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
Zhu et al. | Visually guided sound source separation using cascaded opponent filter network
Xiong et al. | Look&listen: Multi-modal correlation learning for active speaker detection and speech enhancement
Shi et al. | Listen, Think and Listen Again: Capturing Top-down Auditory Attention for Speaker-independent Speech Separation
Montesinos et al. | VoViT: Low latency graph-based audio-visual voice separation transformer
Oya et al. | Do we need sound for sound source localization?
Cheng et al. | Improving multimodal speech enhancement by incorporating self-supervised and curriculum learning
Zhu et al. | V-SlowFast network for efficient visual sound separation
Ahmad et al. | Speech enhancement for multimodal speaker diarization system
Rahman et al. | Weakly-supervised audio-visual sound source detection and separation
Li et al. | IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
Zhang et al. | Audio-visual speech separation with adversarially disentangled visual representation
Luo et al. | Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments
Min et al. | Cross-modal attention consistency for video-audio unsupervised learning
Xiong et al. | Audio-visual speech separation based on joint feature representation with cross-modal attention
CN116978399A | Cross-modal voice separation method and system without visual information during test