- research-article, October 2024
The effects of informational and energetic/modulation masking on the efficiency and ease of speech communication across the lifespan
Highlights:
- In more naturalistic everyday settings, communication efficiency gradually improves from childhood to adulthood irrespective of the listening condition (easy vs. challenging).
- Even moderate levels of background speech affect ...
Children and older adults have greater difficulty understanding speech when there are other voices in the background (informational masking, IM) than when the interference is a steady-state noise with a similar spectral profile but is not speech (...
- research-article, May 2024
Automatic classification of neurological voice disorders using wavelet scattering features
Abstract: Neurological voice disorders are caused by problems in the nervous system as it interacts with the larynx. In this paper, we propose to use wavelet scattering transform (WST)-based features in automatic classification of neurological voice ...
Highlights:
- The WST-based features are utilized in the multi-class classification of NVDs.
- NCA feature selection was applied to reduce feature dimensionality and improve performance (a minimal sketch of this pipeline follows below).
- Support vector machine and feedforward neural network are used ...
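The pipeline named in these highlights (feature extraction, NCA-based dimensionality reduction, SVM classification) can be illustrated with scikit-learn. This is a minimal sketch, assuming synthetic placeholder features in place of real WST coefficients; the feature dimension, class count, and hyperparameters are illustrative, not the paper's.

```python
# Minimal sketch of the classification stage: NCA-based dimensionality
# reduction followed by an SVM, using scikit-learn. The random feature matrix
# stands in for wavelet scattering coefficients; dimensions and class count
# are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))       # placeholder WST feature vectors
y = rng.integers(0, 3, size=300)      # placeholder labels: 3 disorder classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=20, random_state=0),  # NCA feature reduction
    SVC(kernel="rbf", C=1.0),                                         # SVM classifier
)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```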
- research-article, May 2024
AVID: A speech database for machine learning studies on vocal intensity
Abstract: Vocal intensity, which is typically quantified with the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded ...
Highlights:
- An open database called AVID is launched to support research on vocal intensity.
- AVID includes calibrated recordings of speech produced in four intensity categories.
- Both speech and electroglottography signals are provided by AVID.
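The calibration procedure described in the abstract amounts to anchoring the level of a speech recording to a tone of known SPL recorded with the same gain. A minimal sketch, assuming synthetic signals and a 94 dB reference tone; the helper names are illustrative.

```python
# Minimal sketch of SPL calibration against a reference tone: the calibration
# tone's known SPL (e.g., 94 dB) anchors the level of speech recorded with the
# same gain settings. Array contents here are synthetic placeholders.
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a signal."""
    x = np.asarray(x, dtype=np.float64)
    return np.sqrt(np.mean(x * x))

def speech_spl(speech, calibration_tone, tone_spl_db=94.0):
    """Estimate the SPL of `speech` given a calibration tone of known SPL."""
    return tone_spl_db + 20.0 * np.log10(rms(speech) / rms(calibration_tone))

# Synthetic illustration: a 1 kHz "calibration tone" and white-noise "speech".
fs = 16000
t = np.arange(fs) / fs
tone = 0.1 * np.sin(2 * np.pi * 1000 * t)
speech = 0.05 * np.random.default_rng(0).normal(size=fs)

print(f"estimated speech SPL: {speech_spl(speech, tone):.1f} dB")
```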
- research-article, May 2024
The impact of non-native English speakers’ phonological and prosodic features on automatic speech recognition accuracy
Highlights:
- Higher speech intensity and lower speech rates improve automatic speech recognition accuracy.
- Arab ESL teachers and students give more attention to pronunciation errors that do not affect intelligibility.
- Arabic-influenced ESL ...
The present study examines the impact of Arab speakers’ phonological and prosodic features on the accuracy of automatic speech recognition (ASR) of non-native English speech. The authors first investigated the perceptions of 30 Egyptian ESL ...
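The two speaker-side factors named in the first highlight, speech intensity and speech rate, can be measured straightforwardly from a recording and its transcript. A minimal sketch with placeholder data: dBFS (level relative to full scale) stands in for calibrated intensity, and words per second for the rate measure; both choices are assumptions, not the study's exact procedure.

```python
# Minimal sketch of the two factors named in the first highlight: overall
# speech intensity (dB relative to full scale here) and speech rate (words per
# second). The waveform and transcript below are placeholders.
import numpy as np

def intensity_dbfs(signal):
    """Mean intensity as dB relative to full scale (assumes samples in [-1, 1])."""
    rms = np.sqrt(np.mean(signal ** 2))
    return 20.0 * np.log10(rms + 1e-12)

def speech_rate_wps(transcript, duration_s):
    """Speech rate in words per second for a known utterance duration."""
    return len(transcript.split()) / duration_s

fs = 16000
utterance = 0.1 * np.random.default_rng(1).normal(size=3 * fs)  # placeholder 3 s recording
transcript = "the quick brown fox jumps over the lazy dog"

print(f"intensity: {intensity_dbfs(utterance):.1f} dBFS")
print(f"speech rate: {speech_rate_wps(transcript, len(utterance) / fs):.1f} words/s")
```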
- research-article, May 2024
LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild
Abstract: Researchers have shown a growing interest in audio-driven talking head generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper ...
Highlights:
- We propose a generic method, LPIPS-AttnWav2Lip, for talking head generation.
- We use residual CBAM blocks to improve the accuracy of lip synchronization.
- Semantic alignment module: FFC expands the receptive field; AdaIN aligns audio-visual ... (a generic AdaIN sketch follows below)
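AdaIN, named in the last highlight as the audio-visual alignment operation, is a general technique: normalize content features per channel, then reapply the conditioning signal's statistics. A generic NumPy sketch, not the paper's module; the shapes and the audio/visual roles of the arrays are assumptions.

```python
# Generic sketch of adaptive instance normalization (AdaIN): content features
# are normalized per channel and then rescaled with the conditioning (here,
# audio-derived) statistics. Shapes are illustrative; the paper's module may differ.
import numpy as np

def adain(content, style, eps=1e-5):
    """content, style: arrays of shape (channels, height, width)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

rng = np.random.default_rng(0)
visual_feat = rng.normal(size=(64, 24, 24))   # placeholder visual feature map
audio_feat = rng.normal(size=(64, 24, 24))    # placeholder audio-conditioned map
aligned = adain(visual_feat, audio_feat)
print(aligned.shape)
```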
- research-article, May 2024
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network
Abstract: Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads ...
Highlights:
- We simulate the auditory perception process and obtain the modulation encoder.
- AMME is proposed to control the gain of the modulation encoder.
- We develop a VAD approach based on the auditory attention mechanism.
- More robust ...
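For context on the task itself (frame-level speech/non-speech decisions), here is a classical frame-energy VAD baseline. This is explicitly not the auditory-inspired deep model described above; the frame length and threshold are arbitrary assumptions.

```python
# For context only: a classical frame-energy VAD baseline. This is NOT the
# auditory-inspired deep model described in the article; it simply illustrates
# frame-level speech/non-speech decisions. Frame length and threshold are
# arbitrary assumptions.
import numpy as np

def energy_vad(signal, fs, frame_ms=25, threshold_db=-35.0):
    """Return one boolean speech/non-speech decision per frame."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > threshold_db

fs = 16000
rng = np.random.default_rng(0)
noise = 0.005 * rng.normal(size=fs)        # 1 s of low-level noise
speech_like = 0.2 * rng.normal(size=fs)    # 1 s of louder activity
decisions = energy_vad(np.concatenate([noise, speech_like]), fs)
print(decisions.astype(int))
```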
- research-article, February 2024
Coarse-to-fine speech separation method in the time-frequency domain
Highlights:
- Monaural speech separation is conducted in the time-frequency domain using a coarse-to-fine approach.
- Rough separation occurs in the coarse phase, while precise extraction takes place in the refining phase to mitigate distortions from ...
Although time-domain speech separation methods have exhibited outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in reverberant scenarios. Compared to the time-domain methods, the speech separation ...
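Time-frequency-domain separation of the kind described here follows a common skeleton: STFT of the mixture, a per-source mask, inverse STFT. A minimal sketch using an oracle ideal ratio mask computed from known synthetic sources; the oracle mask merely stands in for the paper's coarse-to-fine estimator.

```python
# Generic sketch of time-frequency-domain separation: transform the mixture
# with an STFT, apply a per-source mask, and reconstruct with the inverse STFT.
# The ideal ratio mask below is an oracle computed from the known sources and
# stands in for the learned coarse-to-fine estimator.
import numpy as np
from scipy.signal import stft, istft

fs = 8000
rng = np.random.default_rng(0)
s1 = rng.normal(size=2 * fs)          # placeholder source 1
s2 = rng.normal(size=2 * fs)          # placeholder source 2
mixture = s1 + s2

_, _, S1 = stft(s1, fs, nperseg=256)
_, _, S2 = stft(s2, fs, nperseg=256)
_, _, M = stft(mixture, fs, nperseg=256)

irm = np.abs(S1) / (np.abs(S1) + np.abs(S2) + 1e-8)   # oracle ideal ratio mask
_, s1_hat = istft(irm * M, fs, nperseg=256)

err = np.mean((s1[: len(s1_hat)] - s1_hat[: len(s1)]) ** 2) / np.mean(s1 ** 2)
print("reconstruction error (dB):", 10 * np.log10(err))
```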
- research-article, February 2024
Dual-model self-regularization and fusion for domain adaptation of robust speaker verification
Abstract: Learning robust representations of speaker identity is a key challenge in speaker verification, as it results in good generalization for many real-world speaker verification scenarios with domain or intra-speaker variations. In this study, we aim ...
Highlights:
- An ECAPA-TDNN-based dual-model architecture is proposed.
- Self-supervised regularization is performed using dual-model intermediate embeddings.
- All models are jointly trained with time-dependent regularization loss.
- Speaker ...
- research-article, February 2024
The Role of Auditory and Visual Cues in the Perception of Mandarin Emotional Speech in Male Drug Addicts
Highlights:
- Fills a research gap in speech perception among drug addicts.
- Reveals a disorder or deficit in multi-modal emotional speech processing in drug addicts.
- Suggests that visual cues, such as facial ...
Evidence from previous neurological studies has revealed that drugs can cause severe damage to the human brain structure, leading to significant cognitive disorders in emotion processing, such as psychotic-like symptoms (e.g., speech illusion: ...
- research-article, February 2024
Graph attention-based deep embedded clustering for speaker diarization
Abstract: Deep speaker embedding extraction models have recently served as the cornerstone for modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (namely, speaker features) do not effectively leverage ...
Highlights:
- A graph is constructed from speaker embeddings to exploit the local structural information among embeddings (a minimal construction sketch follows below).
- Multi-layer graph attention networks are employed as an encoder module to learn latent speaker embeddings.
- Multi-objective ...
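The graph-construction step in the first highlight can be illustrated by connecting each speaker embedding to its nearest neighbours under cosine similarity, yielding an adjacency matrix a graph attention encoder could consume. A minimal sketch with placeholder embeddings; the value of k and the embedding dimension are assumptions, and the GAT encoder itself is not shown.

```python
# Minimal sketch of graph construction from speaker embeddings: connect each
# embedding to its k nearest neighbours under cosine similarity and symmetrize.
# Embedding values and k are illustrative placeholders.
import numpy as np

def knn_cosine_graph(embeddings, k=5):
    """Build a symmetric k-nearest-neighbour adjacency from row embeddings."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)            # no self-loops from the kNN step
    n = sim.shape[0]
    adj = np.zeros((n, n))
    for i in range(n):
        neighbours = np.argpartition(sim[i], -k)[-k:]
        adj[i, neighbours] = 1.0
    return np.maximum(adj, adj.T)             # symmetrize

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 192))                # placeholder speaker embeddings
A = knn_cosine_graph(X, k=5)
print(A.shape, int(A.sum()), "edges (directed count)")
```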
- research-article, February 2024
Comparing Levenshtein distance and dynamic time warping in predicting listeners’ judgments of accent distance
Highlights:
- Holistic acoustic and segmental differences contribute to perception of accent distance.
- Segmental differences contribute more than holistic acoustic ones to accent distance.
- Speakers’ native language influences the importance of ...
Listeners attend to variation in segmental and prosodic cues when judging accent strength. The relative contributions of these cues to perceptions of accentedness in English remain open for investigation, although objective accent distance ...
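Both distance measures compared in this article are standard algorithms. Compact reference implementations follow, with Levenshtein distance applied to segment (phone) strings and DTW to frame-wise acoustic features; the toy inputs are placeholders, not the study's materials.

```python
# Compact reference implementations of the two distances compared in this
# article: Levenshtein distance over segment strings and dynamic time warping
# over frame-wise acoustic features. Toy inputs are placeholders.
import numpy as np

def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(a), len(b)]

def dtw(x, y):
    """DTW cost between two (frames, dims) feature sequences, Euclidean local cost."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

print(levenshtein(list("kæt"), list("kat")))                      # segment-level distance
rng = np.random.default_rng(0)
print(dtw(rng.normal(size=(50, 13)), rng.normal(size=(60, 13))))  # frame-level distance
```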
- research-article, February 2024
Chirplet transform based time frequency analysis of speech signal for automated speech emotion recognition
Abstract: The recognition of emotion from the speech signal has gained popularity because of its many applications in fields such as medicine, online marketing, online search engines, education, criminal ...
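The chirplet transform in the title analyzes a signal with Gaussian-windowed linear chirps. A generic sketch of a single chirplet atom and its projection onto a speech frame follows; all parameter values are illustrative and this is not the paper's specific transform configuration.

```python
# Generic sketch of a single chirplet atom (a Gaussian-windowed linear chirp)
# and its projection onto a speech frame, illustrating the kind of
# time-frequency analysis named in the title. Parameters are placeholders.
import numpy as np

def chirplet(t, t_c, f_c, chirp_rate, sigma):
    """Complex chirplet atom centred at time t_c and frequency f_c."""
    tau = t - t_c
    window = np.exp(-0.5 * (tau / sigma) ** 2)
    phase = 2 * np.pi * (f_c * tau + 0.5 * chirp_rate * tau ** 2)
    return window * np.exp(1j * phase)

fs = 16000
t = np.arange(int(0.05 * fs)) / fs                       # 50 ms analysis frame
frame = np.random.default_rng(0).normal(size=t.size)     # placeholder speech frame

atom = chirplet(t, t_c=0.025, f_c=500.0, chirp_rate=4000.0, sigma=0.005)
coefficient = np.vdot(atom, frame) / np.sqrt(np.sum(np.abs(atom) ** 2))
print("chirplet coefficient magnitude:", abs(coefficient))
```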
- research-article, February 2024
CAST: Context-association architecture with simulated long-utterance training for mandarin speech recognition
Abstract: End-to-end (E2E) models are widely used because they significantly improve the performance of automatic speech recognition (ASR). However, owing to the limitations of existing hardware computing devices, previous studies have mainly focused on short ...
Highlights:
- To address the challenge of long-form speech recognition, we propose a novel Context-Association Architecture with Simulated Long-utterance Training (termed CAST), which consists of a Context-Association RNN-Transducer (CARNN-T) and a ...
- research-article, October 2023
Determining spectral stability in vowels: A comparison and assessment of different metrics
Highlights:
- Different metrics for spectral stability identification in vowels are discussed.
- A new metric is introduced.
- The different metrics are assessed both on synthesized and natural speech.
- Higher-dimensional metrics capture spectral ...
This study investigated the performance of several metrics used to evaluate spectral stability in vowels. Four metrics suggested in the literature and a newly developed one were tested and compared to the traditional method of associating the ...
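One simple member of the family of metrics discussed here is frame-to-frame spectral change, whose minimum marks the most spectrally stable portion of a vowel. A generic sketch on a synthetic periodic signal; it is not necessarily one of the paper's metrics, and the frame settings are assumptions.

```python
# Illustration of one simple spectral-stability metric: frame-to-frame change
# between log-magnitude spectra, minimized to locate the most stable portion
# of a vowel. This is a generic example, not necessarily a metric from the
# paper; the synthetic "vowel" and frame settings are placeholders.
import numpy as np

def spectral_change(signal, fs, frame_ms=25, hop_ms=10):
    """Euclidean distance between consecutive log-magnitude spectra."""
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    starts = range(0, len(signal) - frame, hop)
    spectra = [np.log(np.abs(np.fft.rfft(signal[s:s + frame] * np.hanning(frame))) + 1e-9)
               for s in starts]
    return np.array([np.linalg.norm(b - a) for a, b in zip(spectra[:-1], spectra[1:])])

fs = 16000
t = np.arange(int(0.3 * fs)) / fs
vowel = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)  # crude periodic "vowel"
change = spectral_change(vowel, fs)
print("most stable frame index:", int(np.argmin(change)))
```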
- research-article, October 2023
Acoustic properties of non-native clear speech: Korean speakers of English
Highlights:
- Non-native clear speech is acoustically distinct from casual speech.
- The nature of modifications is the same in native and non-native clear speech.
- The magnitude of modifications is different in native and non-native clear speech. ...
The present study examined the acoustic properties of clear speech produced by non-native speakers of English (L1 Korean), in comparison to native clear speech. L1 Korean speakers of English (N=30) and native speakers of English (N=20) read an ...
- review-article, October 2023
Speech emotion recognition approaches: A systematic review
Abstract: The speech emotion recognition (SER) field has been active since it became a crucial feature of advanced human–computer interaction (HCI), and it is used in a wide range of real-life applications. In recent years, numerous SER systems have been covered by ...
Highlights:
- The speech emotion recognition (SER) field became crucial in advanced human–computer interaction (HCI).
- Numerous SER systems have been proposed by researchers using Machine Learning (ML) and Deep Learning (DL).
- This survey aims to ...
- research-article, October 2023
DNN controlled adaptive front-end for replay attack detection systems
Highlights:
- Conventional methods fall short in detecting replay spoofing attacks effectively.
- Auditory-based dynamic filters can detect artefacts in high-quality replayed signals.
- Deep neural networks can adaptively learn filter traits based ...
Developing robust countermeasures to protect automatic speaker verification systems against replay spoofing attacks is a well-recognized challenge. Current approaches to spoofing detection are generally based on a fixed front-end, typically a ...
- research-article, October 2023
Model predictive PESQ-ANFIS/FUZZY C-MEANS for image-based speech signal evaluation
- Eder Pereira Neves,
- Marco Aparecido Queiroz Duarte,
- Jozue Vieira Filho,
- Caio Cesar Enside de Abreu,
- Bruno Rodrigues de Oliveira
Abstract: This paper presents a new method to evaluate the quality of speech signals through images generated from a psychoacoustic model, estimating PESQ (ITU-T P.862) values using a first-order fuzzy Sugeno approach implemented in the Adaptive Neuro-Fuzzy ...
Highlights:
- Extraction of speech signal factors using image processing techniques.
- Signal image extracted from a psychoacoustic model.
- Non-intrusive measurement based on PESQ values trained by ANFIS.
- Configuration of ANFIS with fuzzy c-...
- research-article, February 2023
Shared and task-specific phase coding characteristics of gamma- and theta-bands in speech perception and covert speech
Speech Communication (SPCO), Volume 147, Issue C, Pages 63–73, https://doi.org/10.1016/j.specom.2023.01.007
Abstract: Covert speech is the mental imagery of speaking. This task has gained increasing attention to understand the nature of thought and produce decoding methods for brain–computer interfaces. Building on previous work, we sought to ...
Highlights:
- Understanding speech-related temporal encoding is useful for brain–computer interface training (a generic band-phase extraction sketch follows below).
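The phase-coding analyses described here rest on extracting band-limited instantaneous phase. A generic sketch using a band-pass filter and the Hilbert transform on a synthetic signal; the theta band edges (4–8 Hz), sampling rate, and signal are assumptions, not the study's parameters.

```python
# Generic sketch of extracting theta-band (here 4–8 Hz, an assumption)
# instantaneous phase via band-pass filtering and the Hilbert transform:
# the quantity underlying phase-coding analyses. The synthetic signal is a
# placeholder for real recordings.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_phase(x, fs, low=4.0, high=8.0, order=4):
    """Instantaneous phase of x within the given band (radians)."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    analytic = hilbert(filtfilt(b, a, x))
    return np.angle(analytic)

fs = 250                                               # placeholder sampling rate
t = np.arange(10 * fs) / fs
signal = np.sin(2 * np.pi * 6 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)

phase = band_phase(signal, fs)
print("phase range:", phase.min(), "to", phase.max())
```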