Article
Hybrid Dilated and Recursive Recurrent Convolution Network
for Time-Domain Speech Enhancement
Zhendong Song 1 , Yupeng Ma 2, *, Fang Tan 1 and Xiaoyi Feng 1
1 School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China;
secszd@163.com (Z.S.); bjtanfang@163.com (F.T.); fengxiao@nwpu.edu.cn (X.F.)
2 College of Computer and Cyber Security, Hebei Normal University, Shijiazhuang 050024, China
* Correspondence: mayupenga@mail.nwpu.edu.cn
Abstract: In this paper, we propose a fully convolutional neural network based on recursive recurrent
convolution for monaural speech enhancement in the time domain. The proposed network is an
encoder-decoder structure using a series of hybrid dilated modules (HDM). The encoder creates
low-dimensional features of a noisy input frame. In the HDM, dilated convolution is used to
expand the receptive field of the network model, while standard convolution compensates for
the local information that dilated convolution under-utilizes. The decoder reconstructs the
enhanced frames. The recursive recurrent convolutional network uses a GRU to avoid the large
number of training parameters and the complex structure of conventional networks.
State-of-the-art results are achieved on two commonly used speech datasets.
Keywords: speech enhancement; time domain; hybrid dilated convolution; recurrent convolution
1. Introduction
Speech enhancement refers to the technology of removing or attenuating noise from a noisy
speech signal and extracting the useful speech signal. Speech enhancement technology is
widely used in automatic speech recognition, speech communication systems, and hearing
aids. Traditional monaural speech enhancement methods include spectral subtraction [1],
Wiener filtering [2], and the subspace algorithm [3].
In the past few years, supervised methods based on deep learning have become the
mainstream for speech enhancement. In these supervised methods, time-frequency (T-F)
features of noisy speech are extracted first, and the T-F features of clean speech are
extracted to represent the target. Training targets can be divided into two types: one is
masking-based, such as the ideal binary mask (IBM) [4] and the ideal ratio mask (IRM) [5];
the other is spectral mapping-based, such as the log power spectrum feature used in [6].
Methods based on the T-F domain [5,7,8] have certain limitations. Firstly, these methods
require preprocessing, which increases complexity. Secondly, these methods usually ignore
the phase information of clean speech and use the noisy signal phase to reconstruct the
time-domain signal. Previous studies have shown that phase is very important for improving
speech quality, especially at low SNR [9].
For the above reasons, researchers have proposed a variety of time-domain speech
enhancement networks [10-12]. Fu et al. [13] proposed a fully convolutional network and
showed that it is more suitable for time-domain speech enhancement than a fully connected
network. Then, a text-to-speech model named WaveNet [14] directly synthesized the raw
waveform. Rethage et al. [15] proposed an improved WaveNet model for speech enhancement,
which uses residual connections and one-dimensional dilated convolution. After that,
Pandey et al. [16] combined a temporal convolution module with an encoder-decoder for
time-domain speech enhancement, in which the temporal convolution module also uses
one-dimensional dilated convolution.
One-dimensional dilated convolution can enlarge the receptive field of the network
model. However, when one-dimensional dilated convolution is used for time-domain speech
enhancement, there is a problem: local information cannot be fully utilized. The reason is
that when the dilation rate is greater than one, the holes in the dilated convolution
kernel cause local information to be lost. Therefore, a hybrid dilated convolution module
is proposed that combines dilated convolution with conventional convolution.
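To make the idea concrete, the following is a minimal PyTorch sketch of such a hybrid
module, assuming the dilated and standard branches are fused by summation with a residual
connection; the exact layer ordering, fusion rule, and channel counts of the paper's HDM
are not given in this excerpt and may differ.

import torch
import torch.nn as nn

class HybridDilatedModule(nn.Module):
    """Sketch of a hybrid dilated module (HDM): a dilated branch for a large
    receptive field plus a standard branch that recovers the local samples the
    dilation holes skip. Fusion by summation is an assumption."""
    def __init__(self, channels, kernel_size=11, dilation=1):
        super().__init__()
        pad_d = (kernel_size - 1) // 2 * dilation
        pad_s = (kernel_size - 1) // 2
        self.dilated = nn.Conv1d(channels, channels, kernel_size,
                                 padding=pad_d, dilation=dilation)
        self.standard = nn.Conv1d(channels, channels, kernel_size,
                                  padding=pad_s)
        self.act = nn.PReLU()

    def forward(self, x):
        # Residual connection keeps gradients stable across stacked HDMs.
        return self.act(self.dilated(x) + self.standard(x)) + x

# Six stacked HDMs with exponentially growing dilation, as described later.
hdms = nn.Sequential(*[HybridDilatedModule(64, dilation=2 ** i)
                       for i in range(6)])
out = hdms(torch.randn(1, 64, 16000))  # (batch, channels, samples)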
The end-to-end speech enhancement algorithm directly processes the original waveform of
the speech, avoiding the low computational efficiency and the "phase inconsistency"
problem of time-frequency-domain speech enhancement algorithms, and it also achieves a
better enhancement effect. However, whether end-to-end or not, these models have a large
number of trainable parameters. Recently, recursive learning mechanisms have been applied
to a variety of tasks, such as single-image de-raining [17] and image super-resolution [18].
The principle of recursive learning is to train the network recursively using the same
network parameters, similar to a mathematical iteration in which the overall mapping of
the network model is carried out in several stages. Thus, through recursive learning, the
network parameters can be reused at each stage, and the network can be made deeper without
additional parameters. Inspired by recursive learning, this paper proposes a speech
enhancement algorithm that combines a hybrid dilated convolution module with recursive
learning. The contributions of this article can be summarized as follows:
1. A hybrid dilated convolution module (HDM) is proposed, which consists of dilated
convolution and conventional convolution;
2. A recursive recurrent speech enhancement network (RRSENet) is proposed, which uses a GRU
module to avoid the large number of training parameters and the complex structure of
conventional networks;
3. Extensive experiments are performed on a dataset synthesized from the TIMIT corpus and
NOISEX92, and the proposed model achieves excellent results.
The remainder of this paper is structured as follows. Section 2 describes related work on
speech enhancement, RNNs, and dilated convolution. Section 3 formulates the problem and
presents the proposed architecture. Section 4 presents the experimental settings, results,
and analysis. Conclusions are drawn in Section 5.
2. Related Work
2.1. Speech Enhancement
Spectral subtraction, Wiener filtering, and the subspace algorithm are the three most
classic traditional monaural speech enhancement methods. Spectral subtraction methods
[1,19-22] first obtain the noise spectrum by estimating and updating it over non-speech
segments. The enhanced speech spectrum is then estimated through a subtraction operation,
and finally the speech spectrum is converted back into a waveform. Although spectral
subtraction is relatively simple, it suffers from problems such as speech distortion and
musical noise, and it is mainly suited to stationary noise; its ability to suppress
non-stationary noise is relatively poor.
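For reference, the core rule shared by these methods can be written in its textbook
power-spectral form (a standard formulation, not the exact variant of any one of [1,19-22]):

\[ |\hat{X}(\omega)|^{2} = \max\!\left( |Y(\omega)|^{2} - |\hat{N}(\omega)|^{2},\, 0 \right), \]

where \(Y(\omega)\) is the noisy spectrum and \(\hat{N}(\omega)\) is the noise estimate from
non-speech segments; the enhanced waveform is resynthesized with the noisy phase. The
\(\max(\cdot, 0)\) floor is precisely what creates the isolated spectral peaks perceived as
musical noise.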
The Wiener filtering algorithm [2,23-26] originated during the Second World War; it was
proposed by the mathematician Norbert Wiener to solve the military anti-aircraft fire
control problem and is mainly used to extract a desired clean signal from a noisy
observation. The Wiener filtering algorithm has a history of nearly 80 years, and its ideas
have undergone many variations over decades of development. Its essence is to extract the
signal from the noise using the minimum mean square error between the estimate and the true
signal as the optimality criterion. The Wiener filter is therefore the optimal filter in
the statistical sense, i.e., the optimal linear estimator of the waveform. However, it
handles non-stationary noise only moderately well and can cause speech distortion.
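For reference, the frequency-domain gain that realizes this MMSE criterion is the classic
Wiener gain:

\[ H(\omega) = \frac{P_x(\omega)}{P_x(\omega) + P_n(\omega)} = \frac{\xi(\omega)}{1 + \xi(\omega)}, \qquad \hat{X}(\omega) = H(\omega)\, Y(\omega), \]

where \(P_x\) and \(P_n\) are the power spectral densities of the clean speech and the
noise, and \(\xi = P_x / P_n\) is the a priori SNR. The gain approaches 1 where speech
dominates and 0 where noise dominates, which is why errors in estimating \(P_n\) under
non-stationary noise translate directly into distortion.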
Appl. Sci. 2022, 12, 3461 3 of 15
The principle of the subspace algorithms is to decompose the vector space of the observed
signal into a signal subspace and a noise subspace, and to estimate the clean speech by
eliminating the noise subspace while retaining the signal subspace. The decomposition
applies a Karhunen-Loeve transform (KLT) to the noisy speech signal, sets the KLT
coefficients attributed to noise to 0, and obtains the enhanced speech through the inverse
KLT. The subspace method generally does not cause speech distortion and preserves the
quality of the enhanced speech. Its disadvantage is that it removes less noise, so the
overall enhancement effect is limited.
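A compact NumPy sketch of this KLT-based idea follows, under simplifying assumptions
(white noise with a known variance estimate, eigendecomposition of the frame sample
covariance); practical subspace enhancers add gain shaping and more careful covariance
estimation.

import numpy as np

def subspace_enhance(frames, noise_var):
    """KLT-style subspace enhancement sketch.
    frames: (n_frames, frame_len) array of noisy speech frames.
    noise_var: estimated white-noise variance (assumed known here).
    Zeroes KLT coefficients whose eigenvalues fall at or below the noise
    floor, i.e., discards the noise subspace."""
    cov = frames.T @ frames / frames.shape[0]   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # KLT basis (columns)
    keep = eigvals > noise_var                  # signal subspace mask
    coeffs = frames @ eigvecs                   # forward KLT
    coeffs[:, ~keep] = 0.0                      # drop noise subspace
    return coeffs @ eigvecs.T                   # inverse KLT

# Example: 200 noisy frames of 256 samples each.
noisy = np.random.randn(200, 256)
enhanced = subspace_enhance(noisy, noise_var=1.0)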
Compared with traditional speech enhancement algorithms, algorithms based on deep learning
[27-31] have achieved clearly better performance. The development of deep neural networks
(DNNs) has driven rapid progress in speech enhancement research, and researchers have
proposed many DNN models for the problem [28]. Because non-end-to-end algorithms suffer
from computational inefficiency and phase inconsistency, researchers have also performed
speech enhancement directly in the time domain. Most end-to-end algorithms use
one-dimensional dilated convolution to improve the network's extraction of contextual
information from the raw speech waveform, which improves the enhancement effect of the
network model to a certain extent.
3. Model Description
3.1. Problem Formulation
Given a single-microphone noisy signal y(t), the target of single-channel speech
enhancement is to estimate the clean speech x(t). This paper focuses on the additive
relationship between the target speech and noise. Therefore, the noisy signal y(t) can be
defined as

y(t) = x(t) + n(t) (1)

where y(t) is the time-domain noisy signal at time t, x(t) is the time-domain clean signal
at time t, and n(t) is the time-domain noise signal at time t.
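Equation (1) takes the noise at its recorded level; when a mixture at a prescribed SNR is
required, as in the datasets of Section 4.1, the noise is first scaled by a factor
\(\alpha\) (introduced here only for illustration), so that \(y(t) = x(t) + \alpha\, n(t)\)
with

\[ \mathrm{SNR} = 10 \log_{10} \frac{\sum_t x^2(t)}{\alpha^2 \sum_t n^2(t)} \quad\Longrightarrow\quad \alpha = \sqrt{\frac{\sum_t x^2(t)}{\sum_t n^2(t)}}\; 10^{-\mathrm{SNR}/20}. \]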
PReLU [39], and the filter size is 11. The dilation rate of the 6 HDMs grows exponentially
and is set to 1, 2, 4, 8, 16, and 32, respectively. The decoder is a mirror image of the
encoder, with four layers of one-dimensional deconvolution [40], in which skip connections
link each encoding layer with its corresponding decoding layer, compensating for the
features lost in the encoding process. In the decoder, the activation function of the first
three layers is PReLU, and that of the last layer is Tanh. The filter size of each layer
is 11.
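Under the stated settings (filter size 11, PReLU/Tanh activations, six HDMs with dilation
rates 1-32, four mirrored decoder layers with skip connections), a hedged sketch of the
overall skeleton could look as follows. The stride, channel width, and additive
skip-connection arithmetic are illustrative assumptions not stated in the text, and
HybridDilatedModule refers to the sketch given in the Introduction.

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Sketch of the encoder / HDM-stack / decoder skeleton: four conv layers
    down, six HDMs with dilation 1..32, four deconv layers up with skips.
    Stride 2 and 64 channels are illustrative assumptions."""
    def __init__(self, ch=64, k=11):
        super().__init__()
        self.enc = nn.ModuleList()
        c_in = 1
        for _ in range(4):
            self.enc.append(nn.Sequential(
                nn.Conv1d(c_in, ch, k, stride=2, padding=k // 2),
                nn.PReLU()))
            c_in = ch
        self.hdms = nn.Sequential(*[HybridDilatedModule(ch, k, 2 ** i)
                                    for i in range(6)])
        self.dec = nn.ModuleList()
        for i in range(4):
            last = i == 3
            self.dec.append(nn.Sequential(
                nn.ConvTranspose1d(ch, 1 if last else ch, k, stride=2,
                                   padding=k // 2, output_padding=1),
                nn.Tanh() if last else nn.PReLU()))

    def forward(self, x):                # x: (batch, 1, samples), even length
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)              # saved for the mirrored decoder
        x = self.hdms(x)
        for layer, s in zip(self.dec, reversed(skips)):
            x = layer(x + s)             # skip connection restores details
        return x

net = EncoderDecoder()
frame = net(torch.randn(1, 1, 16000))    # enhanced frame, same length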
Figure 3. (a) RLSENet: recursive learning speech enhancement network, (b) RLBlock: recursive
learning block.
Figure 4. (a) RRSENet: recursive recurrent speech enhancement network, (b) RRBlock: recursive
recurrent block.
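A hedged sketch of the recursive recurrent idea behind Figure 4: one shared block is
unrolled T times, and a GRU carries a hidden state between stages, so recursion depth adds
no new parameters. The feature shapes and the exact point where the GRU is spliced in are
assumptions, not the paper's exact wiring.

import torch
import torch.nn as nn

class RecursiveRecurrentNet(nn.Module):
    """RRSENet-style recursion sketch: the same block (shared weights) is
    applied at every stage, with a GRU passing state between stages."""
    def __init__(self, feat=64, stages=4):
        super().__init__()
        self.stages = stages
        self.block = nn.Sequential(          # shared per-stage mapping
            nn.Conv1d(feat, feat, 11, padding=5), nn.PReLU())
        self.gru = nn.GRU(feat, feat, batch_first=True)

    def forward(self, x):                    # x: (batch, feat, frames)
        h = None                             # GRU state bridges the stages
        for _ in range(self.stages):         # weights reused at each pass
            x = self.block(x)
            seq, h = self.gru(x.transpose(1, 2), h)
            x = seq.transpose(1, 2)
        return x

net = RecursiveRecurrentNet(stages=4)        # t = 4, as in Section 4
y = net(torch.randn(2, 64, 100))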
4. Experiments
4.1. Datasets
In the experiments, the clean corpus comes from the TIMIT corpus [41], which includes 630
speakers of 8 major dialects of American English, each reading 10 utterances. All sentences
in the TIMIT corpus are sampled at 16 kHz, with 4620 utterances in the training set and
1680 utterances in the test set, for a total of 6300 utterances. Then, 1000, 200, and 100
clean utterances are randomly selected for training, validation, and testing, respectively.
The training and validation datasets are mixed at SNR levels ranging from -5 dB to 10 dB at
1 dB intervals, while the testing datasets are mixed under -5 dB and -2 dB conditions.
For training and validation, we used two noise datasets. One is a noise library recorded in
the laboratory of Prof. Wang at Ohio State University, which contains 100 sounds of
different durations at a sampling rate of 16 kHz; all 100 kinds of noise, including
machine, water, wind, and crying noises, were used. The other is NOISEX92 [42], with 15
noises, each 235 s long, at a sampling frequency of 19.98 kHz; all 15 kinds of noise,
including truck, machine gun, and factory noises, were used. Another five
types of noises from NOISEX92, including babble, f16, factory2, m109, and white, were
chosen to test the network generalization capacity.
The dataset is constructed using the following steps. First, the noise is downsampled to
16 kHz and concatenated into one long noise signal. Second, a clean utterance is randomly
selected, and a noise segment of the same length is taken. Finally, during each mixing
process, the cutting point is randomly generated, and the segment is mixed with the clean
utterance under one SNR condition. As a result, 10,000, 2000, and 400 noisy-clean utterance
pairs in total are created for training, validation, and testing, respectively.
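The mixing step can be made concrete with a short NumPy sketch of the procedure just
described (random cut point, noise scaled to the target SNR); the function name and the
stand-in signals are illustrative.

import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=np.random.default_rng()):
    """Cut a random noise segment of the same length as `clean` and scale
    it so the mixture sits at `snr_db` dB, per Equation (1)."""
    start = rng.integers(0, len(noise) - len(clean))  # random cut point
    seg = noise[start:start + len(clean)]
    # Scale factor: alpha = sqrt(Ex / En) * 10^(-SNR / 20).
    alpha = np.sqrt(np.sum(clean ** 2) / np.sum(seg ** 2))
    alpha *= 10.0 ** (-snr_db / 20.0)
    return clean + alpha * seg

# Training mixtures: SNRs drawn from -5..10 dB in 1 dB steps.
clean = np.random.randn(16000)          # stand-in for a TIMIT utterance
noise = np.random.randn(16000 * 60)     # stand-in for the long noise signal
snr = np.random.default_rng().integers(-5, 11)
noisy = mix_at_snr(clean, noise, snr)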
Table 1 compares the network models under the PESQ and STOI evaluation metrics. Among them,
"HDMNet" is the speech enhancement network that uses only the HDM module. The traditional
LogMMSE method is the least effective, indicating that it has difficulty handling
non-stationary noise. The temporal convolution module used in TCNN considers only
historical information and relies on one-dimensional dilated convolution with its inherent
defects; in contrast, the hybrid dilated module proposed in this paper outperforms the
traditional convolution block by making full use of the information of neighboring points
of the speech waveform without losing local information, improving the enhancement
performance of the model. Compared with RLSENet, RRSENet adds a GRU module, which makes its
results better than those of RLSENet. In general, compared with the other network models,
the RRSENet model proposed in this paper performs best. For example, in a low-SNR (-5 dB)
noise environment, the speech enhanced by RRSENet improves PESQ and STOI over the
unprocessed noisy speech by 0.693 and 16.92%, respectively. AECNN and TCNN, the latter
using the temporal convolution module, are slightly inferior, which indicates that the
RRSENet model is more suitable for end-to-end speech enhancement tasks.
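For reference, both metrics can be computed with the open-source pesq and pystoi Python
packages; this is an assumption about tooling, since the paper does not state which
implementation it used.

import numpy as np
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

fs = 16000
ref = np.random.randn(fs * 3)                 # stand-in for clean speech
deg = ref + 0.1 * np.random.randn(fs * 3)     # stand-in for enhanced speech

pesq_score = pesq(fs, ref, deg, 'wb')           # wideband PESQ, roughly [-0.5, 4.5]
stoi_score = stoi(ref, deg, fs, extended=False)  # STOI in [0, 1]
print(f"PESQ {pesq_score:.3f}  STOI {100 * stoi_score:.2f}%")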
Table 1. Experimental results of different network models under seen noise conditions for PESQ
and STOI. BOLD indicates the best result for each case.
From the results in Table 2, it can be seen that at a low signal-to-noise ratio of -5 dB,
the speech enhanced by RRSENet improves STOI by 4.5% and 4.98% over the speech enhanced by
TCNN and AECNN, respectively. This matters because, at low signal-to-noise ratios,
listeners care more about the intelligibility of speech, which is exactly what the STOI
index measures; the result shows that RRSENet can improve speech intelligibility at low
signal-to-noise ratios. In addition, the "Avg." of RRSENet under both the PESQ and STOI
indicators is higher than in the other four comparative experiments. Therefore, the speech
enhanced by RRSENet attains the best quality and intelligibility, which also shows that
RRSENet generalizes better on the mismatched noise test set [48].
Table 2. Experimental results of different network models under unseen noise conditions for PESQ
and STOI. BOLD indicates the best result for each case.
The network architecture of this experiment keeps the whole encoder and decoder structure
unchanged; only the modules are compared, and the GRU module of recursive learning is not
used. The network that uses HDMs is defined as HDMNet. The network that uses full-dilated
convolution modules is defined as the full-dilated convolution module network (FDMNet),
where the dilation rates of its dilated convolutions are consistent with those in HDMNet.
The network that uses full-conventional convolution modules is defined as the
full-conventional convolution module network (FCMNet).
Table 3 shows the PESQ and STOI results for the three different modules under seen noise
conditions, and Table 4 shows the corresponding results under unseen noise conditions. The
average PESQ value measures the average speech quality over the different signal-to-noise
ratios, the average STOI value measures the average speech intelligibility over the
different signal-to-noise ratios, and "Avg." denotes the average over the test conditions
for each evaluation metric.
Table 3. Experimental results of different modules under seen noise conditions for PESQ and STOI.
BOLD indicates the best result for each case.
Table 4. Experimental results of different modules under unseen noise conditions for PESQ and STOI.
BOLD indicates the best result for each case.
From the results in Tables 3 and 4, it can be seen that HDMNet obtains the best results,
followed by FDMNet, with FCMNet the worst. The experiments show that: (1) dilated
convolution is very effective in end-to-end speech enhancement tasks, greatly improving the
enhancement effect of the network model; and (2) the hybrid dilated convolution module
improves the evaluation indices over the full-dilated convolution module, which shows that
the HDM makes full use of the information of adjacent points of the speech waveform without
losing the local feature information of the speech, improving the enhancement effect of the
model.
Table 5. Experimental results of RLSENet and RRSENet under seen noise conditions. BOLD indicates
the best result for each case.
Table 6 shows the PESQ and STOI values of the speech enhanced by the RLSENet and RRSENet
models. The PESQ value is the mean speech quality over the different signal-to-noise ratios
under mismatched noise, and the STOI value is the mean speech intelligibility over the same
conditions. The results in Table 6 show that RRSENet achieves the best results on both PESQ
and STOI when t = 4. As the number of recursions increases, the results of RRSENet remain
better than those of RLSENet, mainly due to the added GRU module. Combining the two
experiments, increasing the number of recursions helps only up to a point; in RRSENet,
t = 4 obtains the best results.
Table 6. Experimental results of RLSENet and RRSENet under unseen noise conditions. BOLD
indicates the best result for each case.
Figure 7. Spectrograms of real-world speech enhancement. The first row is the spectrogram
of real-world noise, the second row is the spectrogram of real-world speech, and the third
row is the spectrogram of the enhanced speech.
5. Conclusions
In this paper, a speech enhancement algorithm based on recursive learning with a hybrid
dilated convolution module was proposed. The hybrid dilated convolution module (HDM)
addresses the insufficient utilization of local information in one-dimensional dilated
convolution: through the HDM, the receptive field is enlarged, the context information is
fully utilized, and the speech enhancement performance of the model is improved. A
recursive recurrent training model was also proposed, which avoids the large number of
training parameters and the complex structure of a conventional network; we improved speech
enhancement quality while reducing the training parameters. The experimental results showed
that RRSENet performs better than the other two baseline time-domain models. Future
research includes exploring the RRSENet model for other speech processing tasks, such as
de-reverberation and speaker separation.
Author Contributions: Methodology, Z.S. and Y.M.; software, Y.M.; validation, F.T.; investigation,
X.F.; writing—original draft preparation, Y.M.; writing—review and editing, Z.S.; visualization, F.T.;
supervision, X.F.; project administration, X.F.; funding acquisition, X.F. All authors have read and
agreed to the published version of the manuscript.
Funding: This work was partly supported by the Key Research and Development Program of
Shaanxi (Program Nos. 2021ZDLGY15-01, 2021ZDLGY09-04, 2021GY-004 and 2020GY-050), the
International Science and Technology Cooperation Research Project of Shenzhen
(GJHZ20200731095204013), and the National Natural Science Foundation of China (Grant
No. 61772419).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are openly available in [41,42].
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979,
27, 113–120. [CrossRef]
2. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE
Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121. [CrossRef]
3. Ephraim, Y.; Van Trees, H.L. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 1995,
3, 251–266. [CrossRef]
4. Wang, D. On ideal binary mask as the computational goal of auditory scene analysis. In Speech Separation by Humans and Machines;
Springer: Berlin, Germany, 2005; pp. 181–197.
5. Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang.
Process. 2014, 22, 1849–1858. [CrossRef]
6. Xu, Y.; Du, J.; Dai, L.R.; Lee, C.H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM
Trans. Audio Speech Lang. Process. 2014, 23, 7–19. [CrossRef]
7. Tan, K.; Wang, D. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement. In Proceedings of the
Interspeech, Hyderabad, India, 2–6 September 2018; pp. 3229–3233.
8. Zhao, H.; Zarar, S.; Tashev, I.; Lee, C.H. Convolutional-recurrent neural networks for speech enhancement. In Proceedings of the
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018;
pp. 2401–2405.
9. Paliwal, K.; Wójcicki, K.; Shannon, B. The importance of phase in speech enhancement. Speech Commun. 2011, 53, 465–494.
[CrossRef]
10. Pascual, S.; Bonafonte, A.; Serra, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452.
11. Kolbæk, M.; Tan, Z.H.; Jensen, S.H.; Jensen, J. On loss functions for supervised monaural time-domain speech enhancement.
IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 825–838. [CrossRef]
12. Abdulbaqi, J.; Gu, Y.; Chen, S.; Marsic, I. Residual Recurrent Neural Network for Speech Enhancement. In Proceedings of
the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain,
4–8 May 2020; pp. 6659–6663.
13. Fu, S.W.; Tsao, Y.; Lu, X.; Kawai, H. Raw waveform-based speech enhancement by fully convolutional networks. In Proceed-
ings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC),
Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 006–012.
14. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K.
Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
15. Rethage, D.; Pons, J.; Serra, X. A wavenet for speech denoising. In Proceedings of the 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5069–5073.
16. Pandey, A.; Wang, D. TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain. In
Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Brighton, UK, 12–17 May 2019; pp. 6875–6879.
17. Ren, D.; Zuo, W.; Hu, Q.; Zhu, P.; Meng, D. Progressive Image Deraining Networks: A Better and Simpler Baseline. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
18. Tai, Y.; Yang, J.; Liu, X. Image Super-Resolution via Deep Recursive Residual Network. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
19. Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing (ICASSP’79), Washington, DC, USA, 2–4 April 1979.
20. Sim, B.L.; Tong, Y.C.; Chang, J.S.; Tan, C.T. A parametric formulation of the generalized spectral subtraction method. IEEE Trans.
Speech Audio Process. 1998, 6, 328–337. [CrossRef]
21. Kamath, S.D.; Loizou, P.C. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.
In Proceedings of the ICASSP international Conference on Acoustics Speech and Signal Processing, Orlando, FL, USA,
13–17 May 2002; Volume 4.
22. Sovka, P.; Pollak, P.; Kybic, J. Extended Spectral Subtraction. In Proceedings of the 8th European Signal Processing Conference,
Trieste, Italy, 10–13 September 1996.
23. Cohen, I. Relaxed statistical model for speech enhancement and a priori SNR estimation. IEEE Trans. Speech Audio Process. 2005,
13, 870–881. [CrossRef]
24. Cohen, I.; Berdugo, B. Speech enhancement for non-stationary noise environments. Signal Process. 2001, 81, 2403–2418. [CrossRef]
25. Hasan, M.K.; Salahuddin, S.; Khan, M.R. A modified a priori SNR for speech enhancement using spectral subtraction rules. IEEE
Signal Process. Lett. 2004, 11, 450–453. [CrossRef]
26. Cappé, O. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio
Process. 1994, 2, 345–349. [CrossRef]
27. Li, A.; Zheng, C.; Cheng, L.; Peng, R.; Li, X. A Time-domain Monaural Speech Enhancement with Feedback Learning. In
Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA
ASC), Auckland, New Zealand, 7–10 December 2020; pp. 769–774.
28. Yuliani, A.R.; Amri, M.F.; Suryawati, E.; Ramdan, A.; Pardede, H.F. Speech Enhancement Using Deep Learning Methods: A
Review. J. Elektron. Dan Telekomun. 2021, 21, 19–26. [CrossRef]
29. Yan, Z.; Xu, B.; Giri, R.; Tao, Z. Perceptually Guided Speech Enhancement Using Deep Neural Networks. In Proceedings of the
ICASSP 2018—2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada,
15–20 April 2018.
30. Karjol, P.; Ajay Kumar, M.; Ghosh, P.K. Speech Enhancement Using Multiple Deep Neural Networks. In Proceedings of the 2018
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018;
pp. 5049–5052.
31. Xu, Y.; Du, J.; Huang, Z.; Dai, L.R.; Lee, C.H. Multi-objective learning and mask-based post-processing for deep neural network
based speech enhancement. arXiv 2017, arXiv:1703.07172.
32. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2015, arXiv:1409.2329.
33. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
34. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations
using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
35. Gao, T.; Du, J.; Dai, L.R.; Lee, C.H. Densely Connected Progressive Learning for LSTM-Based Speech Enhancement. In
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada,
15–20 April 2018; pp. 5054–5058.
36. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
37. Ye, S.; Hu, X.; Xu, X. Tdcgan: Temporal Dilated Convolutional Generative Adversarial Network for End-to-end Speech
Enhancement. arXiv 2020, arXiv:2008.07787.
38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
39. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In
Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
40. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic Segmentation. In Proceedings of the IEEE International
Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
41. Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus
CD-ROM. NIST Speech Disc 1-1.1. NASA STI/Recon Technical Report N. 1993, Volume 93, p. 27403. Available online:
https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir4930.pdf (accessed on 8 March 2022).
42. Varga, A.; Steeneken, H.J. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study
the effect of additive noise on speech recognition systems. Speech Commun. 1993, 12, 247–251. [CrossRef]
43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
44. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans.
Acoust. Speech Signal Process. 1985, 33, 443–445. [CrossRef]
45. Pandey, A.; Wang, D. A New Framework for CNN-Based Speech Enhancement in the Time Domain. IEEE/ACM Trans. Audio
Speech Lang. Process. 2019, 27, 1179–1188. [CrossRef]
46. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual Evaluation of Speech Quality (PESQ): A New Method for Speech
Quality Assessment of Telephone Networks and Codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP’01), Salt Lake City, UT, USA, 7–11 May 2001.
47. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy
Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [CrossRef]
48. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.