Li et al., 2024 - Google Patents

Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network

Li et al., 2024

View PDF

Document ID: 15524722439250285271
Author: Li N; Wang L; Ge M; Unoki M; Li S; Dang J
Publication year: 2024
Publication venue: Speech Communication

External Links

Cited by

Snippet

Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel- frequency cepstral coefficients, to deep neural networks often leads to degraded VAD …

Continue reading at papers.ssrn.com (PDF) (other versions)

230000000694 effects 0 title abstract description 36

Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. hidden Markov models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the analysis technique using neural networks
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signal, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signal, using source filter models or psychoacoustic analysis using predictive techniques
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computer systems based on biological models
- G06N3/02—Computer systems based on biological models using neural network models

Similar Documents

Publication	Publication Date	Title
Wang et al.	2018	Supervised speech separation based on deep learning: An overview
Luo et al.	2019	Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
Qian et al.	2017	Speech Enhancement Using Bayesian Wavenet.
Hussain et al.	2017	Experimental study on extreme learning machine applications for speech enhancement
Pandey et al.	2022	Self-attending RNN for speech enhancement to improve cross-corpus generalization
Yu et al.	2020	Speech enhancement based on denoising autoencoder with multi-branched encoders
Xiang et al.	2020	A parallel-data-free speech enhancement method using multi-objective learning cycle-consistent generative adversarial network
Cui et al.	2020	Speech enhancement based on simple recurrent unit network
Borgström et al.	2020	Speech enhancement via attention masking network (SEAMNET): An end-to-end system for joint suppression of noise and reverberation
Samui et al.	2019	Time–frequency masking based supervised speech enhancement framework using fuzzy deep belief network
Hussain et al.	2019	Ensemble hierarchical extreme learning machine for speech dereverberation
Li et al.	2024	Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network
Cui et al.	2021	Multi-objective based multi-channel speech enhancement with BiLSTM network
Du et al.	2020	A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement
Le et al.	2022	Inference skipping for more efficient real-time speech enhancement with parallel RNNs
Hou et al.	2020	Multi-task learning for end-to-end noise-robust bandwidth extension
Khattak et al.	2022	Regularized sparse features for noisy speech enhancement using deep neural networks
Agrawal et al.	2023	A review on speech separation in cocktail party environment: challenges and approaches
Girirajan et al.	2023	Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
Wang	2015	Supervised speech separation using deep neural networks
Nguyen et al.	2016	Feature adaptation using linear spectro-temporal transform for robust speech recognition
Peracha et al.	2023	Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network
Sivapatham et al.	2022	Gammatone filter bank-deep neural network-based monaural speech enhancement for unseen conditions
Cornell et al.	2023	Implicit acoustic echo cancellation for keyword spotting and device-directed speech detection
Bawa et al.	2023	Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions