Search Results (2)

Search Parameters:
Keywords = mel-filterbank energy features

13 pages, 456 KiB  
Article
Robust Detection of Background Acoustic Scene in the Presence of Foreground Speech
by Siyuan Song, Yanjue Song and Nilesh Madhu
Appl. Sci. 2024, 14(2), 609; https://doi.org/10.3390/app14020609 - 10 Jan 2024
Viewed by 958
Abstract
The characterising sound required by an Acoustic Scene Classification (ASC) system is contained in the ambient signal. In practice, however, this is often distorted by, e.g., the foreground speech of speakers in the surroundings. Previously, based on the iVector framework, we proposed different strategies to improve the classification accuracy when foreground speech is present. In this paper, we extend these methods to deep-learning (DL)-based ASC systems to improve their robustness to foreground speech. ResNet models are proposed as the baseline, in combination with multi-condition training at different signal-to-background ratios (SBRs). For further robustness, we first investigate noise-floor-based Mel-filterbank energies (NF-MFBE) as the input feature of the ResNet model. Next, speech presence information obtained from a speech enhancement (SE) system is incorporated within the ASC framework. As the speech presence information is time-frequency specific, it allows the network to learn to distinguish better between background signal regions and foreground speech. While the proposed modifications improve the performance of ASC systems when foreground speech is dominant, performance is slightly worse in scenarios with low-level or absent foreground speech. Therefore, as a last consideration, ensemble methods are introduced to integrate the classification scores from different models in a weighted manner. The experimental study systematically validates the contribution of each proposed modification and shows that, for the final system with the proposed input features and meta-learner, the classification accuracy improves at all tested SBRs. For an SBR of 20 dB in particular, absolute improvements of up to 9% are obtained.
(This article belongs to the Section Acoustics and Vibrations)
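The abstract does not detail how the noise-floor-based Mel-filterbank energies (NF-MFBE) are computed. The sketch below is one plausible, minimal interpretation, assuming a simple smoothed running-minimum noise-floor tracker and the standard Mel filterbank from librosa; the function name nf_mfbe and all parameter values (frame length, hop, number of bands, smoothing constant) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of noise-floor-based Mel-filterbank energies (NF-MFBE).
# The smoothing constant, frame/hop sizes, and the running-minimum noise-floor
# tracker are assumptions for illustration; they are not taken from the paper.
import numpy as np
import librosa

def nf_mfbe(wav, sr=16000, n_fft=512, hop=256, n_mels=40, alpha=0.9):
    """Return log Mel-band energies of an estimated noise floor (background)."""
    # Power spectrogram: shape (1 + n_fft // 2, n_frames)
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop)) ** 2

    # Crude noise-floor estimate: smoothed running minimum over time,
    # standing in for a proper minimum-statistics tracker.
    floor = np.empty_like(spec)
    floor[:, 0] = spec[:, 0]
    for t in range(1, spec.shape[1]):
        smoothed = alpha * floor[:, t - 1] + (1 - alpha) * spec[:, t]
        floor[:, t] = np.minimum(smoothed, spec[:, t])

    # Project the noise-floor spectrogram onto a Mel filterbank and compress.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ floor + 1e-10)
```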
Figures
Figure 1: ResNet architecture: C1 = 20 and C2 = 40 in the small-footprint model; C1 = 80 and C2 = 160 in the large-footprint model.
Figure 2: Double Conv Block.
Figure 3: Performance of both the small- and large-footprint ResNet baselines and the proposed best systems.
18 pages, 10771 KiB  
Article
A Deep Neural Network Model for Speaker Identification
by Feng Ye and Jun Yang
Appl. Sci. 2021, 11(8), 3603; https://doi.org/10.3390/app11083603 - 16 Apr 2021
Cited by 70 | Viewed by 10318
Abstract
Speaker identification is a classification task which aims to identify a subject from given time-series sequential data. Since the speech signal is a continuous one-dimensional time series, most current research methods are based on convolutional neural networks (CNNs) or recurrent neural networks (RNNs). These methods perform well in many tasks, but there has been little attempt to combine the two network models for the speaker identification task. The spectrogram of a speech signal contains the spatial features of the voiceprint (which correspond to the voice spectrum), and a CNN is effective for spatial feature extraction (which corresponds to modeling the spectral correlations in acoustic features). At the same time, the speech signal is a time series, and a deep RNN can represent long utterances better than shallow networks. Considering the advantage of the gated recurrent unit (GRU) over the traditional RNN in the segmentation of sequence data, we decided to use stacked GRU layers in our model for frame-level feature extraction. In this paper, we propose a deep neural network (DNN) model based on a two-dimensional convolutional neural network (2-D CNN) and a GRU for speaker identification. In the network model design, the convolutional layer is used for voiceprint feature extraction and reduces dimensionality in both the time and frequency domains, allowing for faster GRU layer computation. In addition, the stacked GRU recurrent layers can learn a speaker's acoustic features. During this research, we tried various neural network structures, including a 2-D CNN, a deep RNN, and a deep LSTM. The above network models were evaluated on the Aishell-1 speech dataset. The experimental results showed that our proposed DNN model, which we call deep GRU, achieved a high recognition accuracy of 98.96%. The results also demonstrate the effectiveness of the proposed deep GRU network model compared with the other models for speaker identification. With further optimization, this method could be applied to other research tasks similar to speaker identification.
(This article belongs to the Special Issue Deep Learning for Signal Processing Applications)
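The abstract describes the network only at a high level. Below is a minimal sketch of the general 2-D CNN, stacked GRU, fully connected, softmax layout in PyTorch; the channel counts, GRU width and depth, assumed input shape, and number of speakers are placeholder assumptions, not the configuration used in the paper.

```python
# Illustrative 2-D CNN + stacked-GRU speaker-identification model in PyTorch.
# Channel counts, GRU width/depth, and the assumed input shape
# (batch, 1, n_frames, n_freq_bins) are placeholders, not the paper's values.
import torch
import torch.nn as nn

class DeepGRU(nn.Module):
    def __init__(self, n_freq=128, n_speakers=340):
        super().__init__()
        # Convolutional front end: extracts local time-frequency patterns and
        # halves both time and frequency resolution (faster GRU computation).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        # Stacked GRU layers model the frame-level temporal dynamics.
        self.gru = nn.GRU(input_size=32 * (n_freq // 2), hidden_size=256,
                          num_layers=3, batch_first=True)
        # Classifier over speaker identities (softmax applied via the loss).
        self.fc = nn.Linear(256, n_speakers)

    def forward(self, x):                      # x: (batch, 1, frames, freq)
        h = self.cnn(x)                        # (batch, 32, frames/2, freq/2)
        h = h.permute(0, 2, 1, 3).flatten(2)   # (batch, frames/2, 32*freq/2)
        out, _ = self.gru(h)                   # (batch, frames/2, 256)
        return self.fc(out[:, -1])             # logits per speaker

# Example forward pass on a dummy spectrogram batch.
logits = DeepGRU()(torch.randn(4, 1, 200, 128))   # -> shape (4, 340)
```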
Figures
Figure 1: Conventional recurrent neural network (RNN) model architecture. The input is a time series; x_t denotes the input at time t, the middle box denotes the hidden state, and y_t denotes the output at time t.
Figure 2: Long short-term memory (LSTM) architecture (a single memory block is shown for clarity).
Figure 3: Gated recurrent unit (GRU) architecture (a single memory block is shown for clarity).
Figure 4: The structure of the deep GRU network, comprising voice-signal preprocessing, a two-dimensional convolutional layer, stacked GRU layers, a fully connected layer, and a softmax layer.
Figure 5: The architecture of the deep GRU network model, showing the input and the specific parameters of the other layers; this framework is also the basis of the code implementation.
Figure 6: Spectrogram of speech. The intensity of a given frequency component at a given time is expressed by color depth: darker means smaller, brighter means larger.
Figure 7: The data preprocessing process, whose purpose is to obtain a spectrogram of the speech signal for further analysis.
Figure 8: Loss and accuracy of the deep GRU network model on the Aishell-1 dataset (left: loss vs. training epochs; right: accuracy vs. training epochs).
Figure 9: Loss and accuracy of the 2-D convolutional neural network (CNN) model on the Aishell-1 dataset (left: loss vs. training epochs; right: accuracy vs. training epochs).
Figure 10: Performance comparison of the proposed deep GRU network model with other network models (left: original Aishell-1 dataset; right: Aishell-1 dataset with Gaussian white noise).