
Cheng et al., 2021 - Google Patents

Improving multimodal speech enhancement by incorporating self-supervised and curriculum learning


Document ID
8048735950973480338
Author
Cheng Y
He M
Yu J
Feng R
Publication year
2021
Publication venue
ICASSP 2021 (IEEE International Conference on Acoustics, Speech and Signal Processing)

External Links

Snippet

Speech enhancement in realistic scenarios still presents many challenges, such as complex background signals and data limitations. In this paper, we present a co-attention based framework that incorporates self-supervised and curriculum learning to derive the target …
Continue reading at ieeexplore.ieee.org
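The snippet only names the approach; the full text sits behind the IEEE link. As a rough, generic illustration of cross-modal co-attention (not the authors' exact module — the function and variable names below are hypothetical), a symmetric scaled dot-product attention in which the audio and visual streams each attend to the other could be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(audio, visual):
    """Symmetric cross-modal attention: each modality attends to the other.

    audio:  (Ta, d) audio feature sequence
    visual: (Tv, d) visual feature sequence
    Returns context features of shapes (Ta, d) and (Tv, d).
    """
    d = audio.shape[-1]
    # Affinity between every audio frame and every video frame.
    affinity = audio @ visual.T / np.sqrt(d)          # (Ta, Tv)
    audio_ctx = softmax(affinity, axis=1) @ visual    # audio attends to visual
    visual_ctx = softmax(affinity.T, axis=1) @ audio  # visual attends to audio
    return audio_ctx, visual_ctx

rng = np.random.default_rng(0)
a, v = rng.normal(size=(100, 64)), rng.normal(size=(25, 64))
a_ctx, v_ctx = co_attention(a, v)
print(a_ctx.shape, v_ctx.shape)  # (100, 64) (25, 64)
```

A real system would learn query/key/value projections per modality and fuse the attended contexts back into the enhancement network; this sketch only shows the affinity-and-softmax core that the term "co-attention" usually denotes.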

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/62 Methods or arrangements for recognition using electronic means
    • G06K 9/6217 Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/00221 Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K 9/00496 Recognising patterns in signals and combinations thereof
    • G06K 9/00536 Classification; Matching
    • G06K 9/00543 Classification; Matching by matching peak patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 17/30781 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 17/30784 Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
    • G06F 17/30799 Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre, using low-level visual features of the video content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search

Similar Documents

Joze et al. MMTM: Multimodal transfer module for CNN fusion
Lu et al. Audio–visual deep clustering for speech separation
Noulas et al. Multimodal speaker diarization
Barzelay et al. Harmony in motion
Lu et al. Listen and look: Audio–visual matching assisted speech source separation
Gogate et al. DNN driven speaker independent audio-visual mask estimation for speech separation
Phan et al. Spatio-temporal attention pooling for audio scene classification
Li et al. Deep audio-visual speech separation with attention mechanism
Zhu et al. Visually guided sound source separation and localization using self-supervised motion representations
Parekh et al. Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
Zhu et al. Visually guided sound source separation using cascaded opponent filter network
Xiong et al. Look&listen: Multi-modal correlation learning for active speaker detection and speech enhancement
Shi et al. Listen, Think and Listen Again: Capturing Top-down Auditory Attention for Speaker-independent Speech Separation.
Montesinos et al. Vovit: Low latency graph-based audio-visual voice separation transformer
Oya et al. Do we need sound for sound source localization?
Cheng et al. Improving multimodal speech enhancement by incorporating self-supervised and curriculum learning
Zhu et al. V-slowfast network for efficient visual sound separation
Ahmad et al. Speech enhancement for multimodal speaker diarization system
Rahman et al. Weakly-supervised audio-visual sound source detection and separation
Li et al. IIANet: An Intra-and Inter-Modality Attention Network for Audio-Visual Speech Separation
Zhang et al. Audio-visual speech separation with adversarially disentangled visual representation
Luo et al. Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments.
Min et al. Cross-modal attention consistency for video-audio unsupervised learning
Xiong et al. Audio-visual speech separation based on joint feature representation with cross-modal attention
CN116978399A (en) Cross-modal voice separation method and system without visual information during test