Sensors 2021, 21, 6258

Article
Accent Recognition with Hybrid Phonetic Features
Zhan Zhang, Yuehai Wang * and Jianyi Yang

Department of Information and Electronic Engineering, Zhejiang University, Hangzhou 310007, China;
zhan_zhang@zju.edu.cn (Z.Z.); yangjy@zju.edu.cn (J.Y.)
* Correspondence: wyuehai@zju.edu.cn

Abstract: The performance of voice-controlled systems is usually influenced by accented speech. To make these systems more robust, frontend accent recognition (AR) technologies have received increased attention in recent years. As accent is a high-level abstract feature that has a profound relationship with language knowledge, AR is more challenging than other language-agnostic audio classification tasks. In this paper, we use an auxiliary automatic speech recognition (ASR) task to extract language-related phonetic features. Furthermore, we propose a hybrid structure that incorporates the embeddings of both a fixed acoustic model and a trainable acoustic model, making the language-related acoustic feature more robust. We conduct several experiments on the AESRC dataset. The results demonstrate that our approach can obtain an 8.02% relative improvement compared with the Transformer baseline, showing the merits of the proposed method.

Keywords: accent recognition; audio classification; accented English speech recognition


Citation: Zhang, Z.; Wang, Y.; Yang, J. Accent Recognition with Hybrid Phonetic Features. Sensors 2021, 21, 6258. https://doi.org/10.3390/s21186258

Academic Editors: Steven Guan and Yinong Chen

Received: 8 August 2021; Accepted: 14 September 2021; Published: 18 September 2021

1. Introduction

With the rapid growth of voice-controlled systems, speech-related technologies are becoming part of our daily life. However, the variability of speech poses a serious challenge to these technologies. Among the many factors that influence speech variability, accent is a typical one that causes degradation in recognition accuracy [1–3].

Accent is a diverse pronouncing behavior within a given language, which can be influenced by social environment, education, residential zone, and so on. As analyzed in [4], the English-speaking population comprises not only about 380 million native speakers but also close to 740 million non-native speakers. Influenced by their native languages, these speakers may have a very wide variety of accents.

To analyze the accent attribute in the collected speech and make the whole voice-controlled system more generalized, accent recognition (AR) or accent classification technologies can be applied to customize the downstream subsystems. Thus, AR technologies have received increased attention in recent years.

From the point of view of deep learning, as accent is an utterance-level attribute, AR is also a classification task that converts an audio sequence into a certain class. In this respect, audio classification tasks, including acoustic scene classification, speaker recognition, and AR, can share similar ideas on network structures. However, AR is a more challenging task.

Generally, acoustic scene classification or speaker recognition can be finished using certain low-level discriminative features. For example, speaker recognition can be completed by recognizing the unique timbre (such as the frequency) of the speaker, which is unrelated to the language they speak. Thus, both acoustic scene classification and speaker recognition can be language-agnostic tasks.

In contrast, we generally realize someone has a certain accent when hearing that a specific pronunciation is different from the standard one. Therefore, for the AR task, to judge whether someone has a different accent, knowledge of that language is needed. The discriminative feature of the accented pronunciation is also more subtle. As a result, we think that AR differs from acoustic scene classification or speaker recognition because AR requires a more fine-grained and language-related feature, which is more difficult to obtain compared to the language-agnostic tasks. For our method, we use a CNN-based ASR frontend to encode the language-related information. Then we append extra self-attention layers and an aggregation layer to capture the time-sequence relevance and conduct classification. We apply the ASR loss as an auxiliary multitask learning (MTL) [5] target to extract language-related acoustic features. To make the acoustic features more robust, we fuse the embeddings of a trainable acoustic model that is trained on the accented dataset and a fixed acoustic model trained on the standard dataset for a hybrid aggregation.
In this paper, we first review the related work in Section 2. Next, we describe the proposed method in detail in Section 3. We then conduct extensive experiments to compare model architectures and analyze different training methods in Section 4. Finally, we give our conclusion in Section 5.
The contributions of this paper are summarized as follows.
• We propose a novel hybrid structure (code available at https://github.com/zju-zhan-zhang/Hybrid-AR, accessed on 7 September 2021) to solve the AR task. Our method adopts an ASR backbone and fuses language-related phonetic features to perform the AR task along with ASR MTL.
• We investigate the relationship between ASR and AR. Specifically, we find that without the auxiliary ASR task, AR may easily overfit to the speaker label in the trainset. The proposed ASR MTL alleviates this phenomenon.
• We conduct extensive experiments on the AESRC dataset [6]. The results show that the recognition accuracy of the proposed method is 8.02% better, in relative terms, than the Transformer-based baseline. Moreover, we test the robustness by corrupting the text annotations. Compared with the baseline, the phonetic acoustic features extracted by the hybrid acoustic models make the AR model more robust.

2. Related Works
For the audio classification task, on the one hand, the input features are of vital
importance. Ref. [7] ensembles different channel features for a more robust classification.
Ref. [8] explores how the classification knowledge is perceived in CNN and proposes to
boost the performance using enhanced log-Mel input features. Moreover, Ref. [9] fuses
different types of spectrogram to make use of their different characteristics on the time-
frequency domain. Ref. [10] integrates both the speech attribute features and the acoustic
cues to better capture foreign accents.
On the other hand, for the network structures, conventional methods including [11,12] adopt Hidden Markov Models (HMMs) to model the temporal features. Ref. [13] applies the Fisher Vector (FV) and Vector of Locally Aggregated Descriptors (VLAD) methods to the language identification task. Ref. [14] gives a thorough study of the encoding structures and loss functions for the language recognition task.
Besides the aforementioned works that focus on the recognition task alone, there are
other works that apply AR as an auxiliary task to boost the performance of the other main
tasks. Refs. [15,16] explore the relationship between accent and the ASR task, and show that
the accent recognition task can lead to a more robust performance of the main ASR task.
Although research in this area has shown that AR can be an auxiliary task to boost ASR performance, the effects of ASR as the auxiliary task for AR have not been closely examined. How to combine AR with the ASR task for a better language-related feature remains to be investigated. This gap inspires us to design our novel hybrid AR model combined with ASR MTL, which will be discussed in the next section.

3. Proposed Method
In this section, we start from a baseline that directly models the relationship between the input spectrum and the accent label. This baseline is constructed by the acoustic model and the attention-based aggregation model. However, such a baseline does not make full use of the spoken words. In the next subsection, we further analyze the relationship between the accent label and the speaker information, and propose our ASR MTL method, which incorporates the extra text information to prevent the model from overfitting to the speaker label. Finally, we analyze the different learning targets of AR and ASR and propose to use hybrid acoustic models for a more robust prediction. The whole model structure is shown in Figure 1.

Figure 1. Proposed hybrid structure for accent recognition. We list the dataset used for each acoustic model (AM) in brackets and use gray to indicate that the fixed acoustic model does not participate in the AR training process. The auxiliary branch plotted with dashed lines (the green block) is used only during training.

3.1. Network Structure


To summarize, the acoustic scene classification, speaker recognition, and AR tasks discussed in Section 1 all map a sequence to a certain label. As the AR task needs more fine-grained features, we choose to adopt an ASR-based backbone to extract high-level features for each frame, and then aggregate them for the final accent prediction.
In this paper, we use a CNN-based acoustic model called Jasper [17] to extract deep phonetic features from the input 40-dim FBank spectrum. Jasper is built from multiple Jasper blocks, and each Jasper block is further composed of multiple CNN-based subblocks (1D Conv, BatchNorm, ReLU, dropout) with residual connections. We use the Jasper 5×3 version (5 Jasper blocks, each with 3 subblocks) for our experiments.
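As a concrete illustration, the following is a minimal PyTorch sketch of a Jasper-style block under the structure described above (1D Conv, BatchNorm, ReLU, dropout subblocks with a residual connection); the channel width, kernel size, and dropout rate are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """One subblock: 1D Conv -> BatchNorm -> ReLU -> Dropout."""
    def __init__(self, channels=256, kernel_size=11, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, channels, time)
        return self.drop(self.act(self.bn(self.conv(x))))

class JasperBlock(nn.Module):
    """One Jasper block: several subblocks plus a residual connection."""
    def __init__(self, channels=256, n_sub=3):
        super().__init__()
        self.subs = nn.Sequential(*[JasperSubBlock(channels) for _ in range(n_sub)])

    def forward(self, x):
        return self.subs(x) + x                # residual connection around the block
```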
The attention mechanism can be a good choice to aggregate the extracted acoustic features, and attention-based aggregation methods have been successfully applied to the accent classification task, including [13] and the self-attention pooling layer [14]. However, a single self-attention pooling layer appended to the acoustic model may not be enough. As we also perform an auxiliary ASR task (discussed in the next subsection) with the acoustic model, the acoustic model still needs to adapt from the ASR task to the AR task. In other words, a single pooling layer may be too shallow to transfer between the tasks.
For our method, we tend to keep the aggregation model independent from the acoustic model so that the ASR loss applied to the acoustic model can be more efficient. Thus, we choose to stack multi-head self-attention layers for a high-level extraction of the accent attribute for each frame. This extractor is, in fact, the encoder part of the Transformer [18]. The attention mechanism finds the similarity between the query matrix and the key matrix, and uses the similarity score to obtain the output from the value matrix. In our experiments, the attention mechanism is applied in the self-attention form, where the query, key, and value matrices are the same (the input of the self-attention layer). The self-attention layer finds the relationship between acoustic features across different time steps, and outputs the processed ones.
Formally, the attention mechanism can be described as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad (1)$$
where $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$ is the dot-product similarity between the query matrix $Q$ and the key matrix $K$ (scaled by $\frac{1}{\sqrt{d_k}}$), which constructs the weight of the value matrix $V$. The multi-head attention (MHA) is described as
$$\mathrm{MHA}(Q, K, V) = \mathrm{concat}(\mathrm{hd}_1, \ldots, \mathrm{hd}_h)W^{O}, \qquad (2)$$
where $\mathrm{hd}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$, $h$ is the head number, and $W^{Q}$, $W^{K}$, $W^{V}$, $W^{O}$ are the projection matrices for query, key, value, and output, correspondingly. For the self-attention case, $Q = K = V = x$, where $x$ is the input feature of each attention layer.
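The following is a small PyTorch sketch of Equations (1) and (2); the tensor sizes are illustrative, and torch.nn.MultiheadAttention is used as a stand-in for the multi-head form rather than the paper's own implementation.

```python
import math
import torch

def attention(q, k, v):
    # Equation (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return weights @ v

x = torch.randn(2, 100, 256)          # (batch, time, d_attn); self-attention: Q = K = V = x
out_single = attention(x, x, x)       # single-head case of Equation (1)

mha = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
out_multi, _ = mha(x, x, x)           # multi-head form of Equation (2)
```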
Compared with the self-attention pooling layer, multiple attention heads help the model to capture different representation subspaces at different positions of the accent. After the stacked multi-head self-attention layers, we use a statistics pooling layer [19] to compute the mean and standard deviation across the time dimension of the acoustic sequence. Finally, we use a fully connected layer to project the aggregated features into the final accent prediction.
For the aggregation model, we set the attention dimension $d_{attn} = 256$, the feed-forward dimension $d_{ff} = 1024$, the head number $h = 4$, and the number of stacked multi-head attention layers $n_{layers} = 3$.
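A minimal sketch of the aggregation model with these hyperparameters is given below (stacked multi-head self-attention layers, statistics pooling, and a linear classifier); positional encoding is omitted, and the layer names, the input projection, and the pooled feature size are our own assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class AggregationModel(nn.Module):
    def __init__(self, d_emb=1024, d_attn=256, d_ff=1024, n_heads=4,
                 n_layers=3, n_accents=8):
        super().__init__()
        self.proj = nn.Conv1d(d_emb, d_attn, kernel_size=1)   # reshape embedding
        layer = nn.TransformerEncoderLayer(d_model=d_attn, nhead=n_heads,
                                           dim_feedforward=d_ff,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fc = nn.Linear(2 * d_attn, n_accents)             # mean + std pooled feature

    def forward(self, a_asr):                                  # (B, T, d_emb)
        x = self.proj(a_asr.transpose(1, 2)).transpose(1, 2)   # (B, T, d_attn)
        x = self.encoder(x)                                    # stacked self-attention layers
        # Statistics pooling: mean and standard deviation over the time dimension
        stats = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
        return self.fc(stats)                                  # (B, n_accents)
```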

3.2. Multitask Learning


As discussed in Section 1, AR requires a more fine-grained feature compared with acoustic scene classification and speaker recognition. Meanwhile, acoustic scene classification and speaker recognition can be language-agnostic tasks, which are easier than the language-related AR task. Consequently, the AR task may degrade to the speaker recognition task if we do not set extra constraints on the learning targets.
Formally, if we denote the model parameters as $\theta$, the acoustic feature for classification as $A_c$, and the accent label as $c$, the predicted accent class $\hat{c}$ can be described as
$$\hat{c} = \arg\max_{c} P(c \mid A_c; \theta), \quad c \in \{c_1, \ldots, c_n\}, \qquad (3)$$
where $n$ is the number of accents in the dataset.

We use the cross-entropy (CE) loss as the classification loss function between the predicted accent class $\hat{c}$ and the ground-truth label $c$,
$$l_c = \mathrm{CE}(c, \hat{c}). \qquad (4)$$

However, we should note that for the training dataset, the accent attribute is labelled for each speaker. In other words, each accent label $c_j$ can be inferred from the speaker label $s_i$.
As a result, if we only apply the cross-entropy loss for accent classification, a possible solution for the network is to simply memorize each speaker $s$ in the training dataset and learn to predict the speaker label $\hat{s}'$,
$$\hat{s}' = \arg\max_{s} P(s \mid A_c; \theta), \quad s \in \{s_1, \ldots, s_m\}, \qquad (5)$$
and map it to the accent label $\hat{c}'$, where $m$ is the total number of speakers in the training dataset.
In other words, the network will give up learning the accent label and overfit to the speaker label on the trainset, as discriminating the speaker is easier than recognizing the accent. For unseen speakers, the AR performance will then be severely degraded. We demonstrate this phenomenon in Section 4.3.
We assume that accent recognition using speaker-invariant features is more accurate than recognition via the speaker identity, so we force the network to learn language-related information. On the one hand, the text transcription is independent of the speaker information, and ASR MTL is suitable for this purpose. On the other hand, ASR MTL can offer complementary information for the main AR task. The ASR task forces the model to learn language-related information, and the fine-grained phonetic feature can contribute to accent recognition. We build another branch of the proposed model in Figure 1 to perform the ASR task during training. The auxiliary ASR branch tries to predict the pronounced phonemes $\hat{w}$ based on the ASR acoustic feature $A_{asr}$ and the parameters of the ASR acoustic model $\theta_{asr}$,
$$\hat{w} = \arg\max_{w} P(w \mid A_{asr}; \theta_{asr}). \qquad (6)$$
We convert the text from word level to phoneme level using a grapheme-to-phoneme (G2P) tool (https://github.com/Kyubyong/g2p, accessed on 30 July 2021) and use CTC [20] as the ASR loss function,
$$l_{asr} = \mathrm{CTC}(w, \hat{w}). \qquad (7)$$
For the proposed model, the feature for accent classification, $A_c$, is aggregated from the feature for ASR, $A_{asr}$,
$$A_c = \mathrm{AggregationModel}(A_{asr}). \qquad (8)$$
We use the parameter $\lambda$ to balance the weight between the ASR task and the AR task,
$$l = l_c + \lambda \, l_{asr}. \qquad (9)$$
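A hedged sketch of the joint objective in Equations (4), (7), and (9) is shown below; the variable names and tensor shapes (following PyTorch's CTCLoss convention) are our own assumptions.

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
ctc_loss = nn.CTCLoss(blank=0)        # one output index reserved for the <BLANK> symbol
lam = 0.1                             # ASR weight, lambda in Equation (9)

def mtl_loss(accent_logits, accent_labels,
             phone_logits, phone_targets, input_lens, target_lens):
    """accent_logits: (B, n_accents); phone_logits: (B, T, n_phonemes + 1)."""
    l_c = ce_loss(accent_logits, accent_labels)                           # Equation (4)
    log_probs = phone_logits.log_softmax(-1).transpose(0, 1)              # (T, B, C) for CTCLoss
    l_asr = ctc_loss(log_probs, phone_targets, input_lens, target_lens)   # Equation (7)
    return l_c + lam * l_asr                                              # Equation (9)
```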

3.3. Hybrid Phonetic Features


Although the AR performance can be boosted by the auxiliary ASR task, we should note that there is a difference between the ASR target and the AR target.
Given a certain speech, ASR allows mispronunciations caused by accents and corrects them to the target text. For example, non-native speakers with different accents may pronounce the target word "AE P AH L" (APPLE) as "AE P OW L" or "AE P AO L". We denote these pronounced phonemes as $\mathrm{AH}_{OW}$ or $\mathrm{AH}_{AO}$. There is no need for ASR models to distinguish between different accents, so ASR models will ignore them and still map both $\mathrm{AH}_{OW}$ and $\mathrm{AH}_{AO}$ accented speech into the non-accented transcription of this dataset, AH. In other words, ASR models will not work hard to explore the differences between different accents.
On the contrary, AR needs to find the differences between $\mathrm{AH}_{OW}$ and $\mathrm{AH}_{AO}$ with reference to the original AH. This conflict inspires us to use another acoustic model trained on the non-accented dataset as a fixed reference.
For the proposed method, two Jasper acoustic models are used. The first one is trained on the non-accented dataset (Librispeech). The second one is first trained on the non-accented dataset and then trained on the accented dataset (AESRC; these two datasets will be introduced later). We pretrain these two acoustic models with the CTC loss and keep the model with the lowest validation phone error rate (PER) for further experiments. We freeze the weight parameters of the non-accented acoustic model (AM) and call it the "fixed AM" (denoted as $AM_f$). The accented one is further fine-tuned together with the aggregation model using Equation (9), and we call it the "trainable AM" (denoted as $AM_t$). As illustrated in Figure 1, we merge the reference phonetic embedding of the fixed AM into the trainable one with a fusion block.
In this paper, we consider three different fusion blocks as illustrated in Figure 2.

Figure 2. Fusion blocks with different structures. (a) Adding-based fusion. (b) Concatenation-based fusion. (c) Concatenation-ChannelAttention-based fusion. In (b,c), a Conv1D layer with kernel size 1 is adopted to cast the concatenated channels $2d_{emb}$ back to $d_{emb}$.

First, a linear projection is applied to the embeddings,
$$A'_{asr} = \mathrm{linear}(A_{asr}), \qquad (10)$$
$$A^{R\prime}_{asr} = \mathrm{linear}\left(A^{R}_{asr}\right). \qquad (11)$$

Adding-based fusion. As illustrated in Figure 2a, an add function is applied to merge the processed embeddings,
$$A^{M}_{asr} = A'_{asr} + A^{R\prime}_{asr}. \qquad (12)$$
However, simply adding them may lose information from the raw data.

Concatenation-based fusion. Instead, we stack the embeddings on the channel dimension and use a Conv1D (kernel size 1) to halve the channels of the concatenated embedding $A^{C}_{asr}$. This structure is illustrated in Figure 2b. The process can be described as
$$A^{C}_{asr} = \mathrm{ChannelCat}\left(A'_{asr}, A^{R\prime}_{asr}\right), \qquad (13)$$
$$A^{M}_{asr} = \mathrm{Conv}\left(A^{C}_{asr}\right), \qquad (14)$$
where $A^{C}_{asr} \in \mathbb{R}^{T \times 2d_{emb}}$ and $A^{M}_{asr} \in \mathbb{R}^{T \times d_{emb}}$.

Concatenation-ChannelAttention-based fusion. Furthermore, we adopt channel attention to better control the importance of the reference embedding, as illustrated in Figure 2c. The channel attention is obtained by the squeeze-excite block [21,22]. Based on Equation (13), the concatenated feature $A^{C}_{asr}$ goes through a global MaxPooling and a global AveragePooling on the time domain,
$$C = \mathrm{TimeMax}\left(A^{C}_{asr}\right) + \mathrm{TimeAve}\left(A^{C}_{asr}\right), \qquad (15)$$
where $C \in \mathbb{R}^{1 \times 2d_{emb}}$. $C$ is further squeezed and excited,
$$C_{squeeze} = \mathrm{ReLU}(\mathrm{linear}(C)), \qquad (16)$$
$$C_{excite} = \mathrm{ReLU}(\mathrm{linear}(C_{squeeze})), \qquad (17)$$
where $C_{squeeze} \in \mathbb{R}^{1 \times \frac{2d_{emb}}{r}}$ and $C_{excite} \in \mathbb{R}^{1 \times 2d_{emb}}$. We choose $r = 16$.
The channel attention $CA$ is obtained by the sigmoid activation of $C_{excite}$,
$$CA = \mathrm{Sigmoid}(C_{excite}), \qquad (18)$$
and used as the scaling factor of $A^{C}_{asr}$ on the channel domain,
$$A^{M}_{asr} = \mathrm{Conv}\left(CA \cdot A^{C}_{asr}\right). \qquad (19)$$
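A minimal PyTorch sketch of the Concatenation-ChannelAttention fusion block described by Equations (10)-(19) follows; module and variable names are our own, and the linear layer sizes simply mirror the dimensions stated above.

```python
import torch
import torch.nn as nn

class CatChannelAttentionFusion(nn.Module):
    def __init__(self, d_emb=1024, r=16):
        super().__init__()
        self.lin_t = nn.Linear(d_emb, d_emb)                    # Eq. (10), trainable branch
        self.lin_r = nn.Linear(d_emb, d_emb)                    # Eq. (11), reference branch
        self.squeeze = nn.Linear(2 * d_emb, 2 * d_emb // r)     # Eq. (16)
        self.excite = nn.Linear(2 * d_emb // r, 2 * d_emb)      # Eq. (17)
        self.conv = nn.Conv1d(2 * d_emb, d_emb, kernel_size=1)  # cast 2*d_emb back to d_emb

    def forward(self, a_t, a_r):                                # both (B, T, d_emb)
        a_cat = torch.cat([self.lin_t(a_t), self.lin_r(a_r)], dim=-1)     # Eq. (13)
        c = a_cat.max(dim=1).values + a_cat.mean(dim=1)                   # Eq. (15), time pooling
        c = torch.relu(self.excite(torch.relu(self.squeeze(c))))          # Eqs. (16)-(17)
        ca = torch.sigmoid(c)                                             # Eq. (18)
        scaled = a_cat * ca.unsqueeze(1)                                  # scale channels by CA
        return self.conv(scaled.transpose(1, 2)).transpose(1, 2)          # Eq. (19), (B, T, d_emb)

# Example: fuse trainable and reference embeddings for a 100-frame utterance
fusion = CatChannelAttentionFusion()
a_m = fusion(torch.randn(2, 100, 1024), torch.randn(2, 100, 1024))        # (2, 100, 1024)
```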

Finally, for these three fusion methods, we aggregate the merged ASR feature $A^{M}_{asr}$ instead of the original $A_{asr}$ in Equation (8) and optimize the whole model with the aforementioned MTL loss (Equation (9)) for AR classification. We use 39 English phonemes plus ⟨BLANK⟩ for CTC classification, and there are 8 accents included in the AESRC dataset. The detailed model description is summarized in Table 1.

Table 1. Output summary of the proposed model. $T_{in}$ is the length of the input sequence. Please note that the acoustic model will downsample the feature sequence by 2 on the time domain, i.e., $T_{in} = 2T$.

Module | Output Shape | Output Description
FBank | $T_{in} \times d_{fbank}$ | Input feature, $d_{fbank} = 40$
$AM_t$ | $T \times d_{emb}$ | Trainable embedding, $A_{asr}$, $d_{emb} = 1024$
Linear | $T \times d_{phoneme}$ | Phoneme prediction, $\hat{w}$, $d_{phoneme} = 40$
$AM_f$ | $T \times d_{emb}$ | Reference embedding, $A^{R}_{asr}$
Fusion Block | $T \times d_{emb}$ | Merged embedding, $A^{M}_{asr}$
Conv1D (kernel size = 1) | $T \times d_{attn}$ | Reshaped embedding, $d_{attn} = 256$
Self-Attention Aggregation | $1 \times d_{attn}$ | Aggregated feature, $A_c$
Linear | $1 \times d_{accent}$ | Predicted accent class, $\hat{c}$, $d_{accent} = 8$

4. Experiments
In this section, we first give a detailed description of the dataset and the experiment environment. We show the AR results in the following subsection. Next, we conduct a speaker recognition test to demonstrate that the auxiliary ASR task is of vital importance to keep the AR task from overfitting to the speaker recognition task. Finally, we explore the relationship between the given transcription for the ASR task and the performance of the AR task to test the robustness.

4.1. Dataset and Environment


We use two datasets for training in our experiments. The first one is Librispeech [23], which does not contain accent labels. This dataset consists of approximately 1000 h of 16 kHz read English speech. The second one is AESRC [6], which is composed of 8 different accents: Chinese (CN), Indian (IN), Japanese (JP), Korean (KR), American (US), British (UK), Portuguese (PT), and Russian (RU). Each accent is made up of 20 h of speech data collected from around 60 speakers on average. The speech recordings are collected in relatively quiet environments with cellphones and provided in Microsoft wav format (16 kHz, 16 bit, mono).
Utterances read by certain specific speakers are kept for the test set.
We use Pytorch [24] to build our systems. We use the 40-dim FBank spectrum features extracted by the Kaldi toolkit [25] as the input. Furthermore, we apply SpecAug [26] to perform data augmentation. We conduct the experiments on a server with an Intel Xeon E5-2680 CPU and 4 NVIDIA P100 GPUs. We use the Adam optimizer (lr = $10^{-4}$) for both the ASR pretraining and the AR task. The ASR pretraining task on Librispeech takes about 3.5 days to converge, while adapting to the merged Librispeech plus AESRC takes about another 2 days. For the accent recognition task, we train each version for about one day. All experiments are conducted using the same hardware.
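As a rough illustration of this setup, the snippet below computes 40-dim Kaldi-style FBank features and applies SpecAugment-style time/frequency masking with torchaudio, then builds the Adam optimizer; the masking parameters and the placeholder model are assumptions, not the exact configuration used in the paper.

```python
import torch
import torchaudio

wav = torch.randn(1, 16000)                                  # stand-in for a 1 s, 16 kHz mono recording
fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=40,
                                          sample_frequency=16000.0)   # (T, 40)

spec_aug = torch.nn.Sequential(                              # SpecAugment-style masking (illustrative params)
    torchaudio.transforms.FrequencyMasking(freq_mask_param=8),
    torchaudio.transforms.TimeMasking(time_mask_param=30),
)
feats = spec_aug(fbank.T.unsqueeze(0)).squeeze(0).T          # the masks expect (..., freq, time)

model = torch.nn.Linear(40, 8)                               # placeholder for the AM + aggregation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # used for both ASR pretraining and AR
```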

4.2. Accent Recognition Test


To better demonstrate the training process, we show different training methods in
Figure 3.

Figure 3. Different training methods. (a) Directly train the whole model for the AR task. (b) Pretrain the acoustic model for ASR and then train the whole model for AR. (c) Pretrain the acoustic model for ASR and then train the whole model for AR and ASR MTL (weight = λ). (d) The proposed hybrid method. Based on (c), the non-accented embedding is merged for reference.

We show the accent recognition results of the different models on the validation set in Table 2. The baseline systems use the Transformer encoder and a statistics pooling layer to perform the AR task. In Table 2, they are denoted as Trans XL, where X represents the number of encoder layers.

Table 2. Accuracy comparison. The vanilla versions (Id 0–2, 4) are demonstrated in Figure 3a, the ASR-initialized versions (Id 3, 5) in Figure 3b, the ASR MTL versions (Id 6–8) in Figure 3c, and the fusion-based versions (Id 9–11) in Figure 3d (Figure 2a–c, correspondingly).

Id | Model | US | UK | CN | IN | JP | KR | PT | RU | Ave
Baseline
0 | Trans 3L | 45.7 | 70.0 | 56.2 | 83.5 | 48.5 | 45.0 | 57.2 | 30.0 | 54.1
1 | Trans 6L | 30.6 | 74.5 | 50.9 | 75.2 | 44.0 | 43.7 | 65.7 | 34.0 | 52.2
2 | Trans 12L | 21.2 | 85.0 | 38.2 | 66.1 | 42.7 | 26.0 | 51.8 | 49.6 | 47.8
3 | Trans 12L (ASR) | 60.2 | 93.9 | 67.0 | 97.0 | 73.2 | 55.6 | 85.5 | 75.7 | 76.1
Proposed
4 | Vanilla | 45.7 | 81.0 | 40.1 | 79.1 | 45.7 | 37.8 | 84.5 | 63.6 | 60.0
5 | ASR | 40.3 | 93.7 | 75.0 | 97.3 | 76.3 | 52.1 | 88.3 | 76.0 | 75.1
6 | MTL, λ = 0.1 | 68.6 | 91.8 | 86.9 | 99.1 | 71.2 | 56.6 | 89.5 | 76.0 | 79.9
7 | MTL, λ = 0.2 | 57.8 | 92.0 | 79.5 | 98.5 | 70.6 | 71.3 | 84.7 | 66.2 | 77.5
8 | MTL, λ = 0.3 | 60.9 | 85.3 | 85.1 | 98.9 | 66.1 | 69.6 | 80.8 | 63.2 | 76.0
9 | Add, λ = 0.1 | 50.4 | 95.0 | 83.3 | 99.4 | 72.1 | 73.5 | 92.9 | 77.0 | 80.6
10 | Cat, λ = 0.1 | 61.8 | 88.9 | 89.6 | 98.9 | 71.9 | 66.0 | 95.1 | 74.5 | 80.8
11 | CatCA, λ = 0.1 | 63.1 | 93.3 | 88.9 | 98.3 | 73.9 | 66.3 | 95.3 | 73.7 | 82.2

As we can see from Table 2, a bigger baseline model (from Id 0 to Id 2) leads to overfitting and a degraded classification accuracy. However, as we can observe from both the baseline models and our models, this overfitting phenomenon can be alleviated by the ASR task. ASR pretraining on Librispeech and AESRC (Id 3, Id 5) greatly improves the AR accuracy compared with the models without pretraining (Id 2, Id 4, correspondingly). Furthermore, the proposed MTL versions (Id 6–8) are better than ASR initialization alone. We obtain the best result of the MTL-based models by setting the ASR weight to λ = 0.1. By merging the outputs of the fixed AM and the trainable AM, the proposed hybrid methods (Id 9–11) show better performance compared with the MTL version. The Concatenation-ChannelAttention-based fusion method (Id 11) shows the best performance among these models, which results in an 8.02% relative improvement over the best ASR-initialized baseline (Id 3). Please note that the performance on US data is relatively low. We think this is because the accented dataset assumes that the data collected in each country belongs to one type of accent; thus, the accent label is decided by the country the speaker belongs to, rather than their native language. As the speakers in the US are diverse, the performance for US can be degraded.

4.3. Speaker Recognition Test


As discussed above, the model with ASR MTL is better than the one without ASR pretraining (Id 6 vs. Id 4 in Table 2). To validate our assumption in Section 3.2 that directly training the network for AR may overfit to the speaker label on the trainset, and that ASR MTL can help the model learn a speaker-invariant feature and thus alleviate this phenomenon, we compare the speaker recognition performance of these two models.
To probe whether the output of these two models contains speaker information, we freeze the weight parameters of these two models and replace the final linear layer of the aggregation model to perform the speaker recognition task. We only train this linear layer to see how the original feature extracted by each model correlates with the speaker information. We train the linear layer with the cross-entropy loss and the Adam optimizer (lr = $10^{-4}$). We should note that for the original AR experiments, the training and validation datasets are split by speaker. To perform the speaker recognition task, we did another split by utterance, so a certain speaker appears in both the training set and the validation set.
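A minimal sketch of this probing setup is given below: all trained parameters are frozen and only a new linear head is optimized to predict the speaker label. The function name, the number of speakers, and the feature size (taken as $d_{attn}$ per Table 1) are placeholders.

```python
import torch
import torch.nn as nn

def build_speaker_probe(trained_model, d_attn=256, n_speakers=100):
    # Freeze every parameter of the trained AR model
    for p in trained_model.parameters():
        p.requires_grad = False
    # New linear head on top of the frozen aggregated feature (replaces the accent classifier)
    probe = nn.Linear(d_attn, n_speakers)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    return probe, optimizer, criterion
```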
We show the training process in Figure 4a and the validation process in Figure 4b.

Figure 4. Speaker recognition test. (a) Cross-entropy loss curve for the speaker label while training the models with/without the ASR task. (b) Speaker recognition accuracy curve on the validation dataset while training the models with/without the ASR task.

The model without ASR pretraining has a lower training cross-entropy loss for the speaker label and a much higher speaker recognition accuracy compared to the ASR MTL version. Such a phenomenon suggests that adding the ASR task indeed helps the model to learn a speaker-invariant feature. We plot the ROC curves of the two compared models in Figure 5. As shown in Figure 5b, the speaker-invariant feature greatly improves the Area Under Curve (AUC) score (the averaged AUC score is 0.967) compared to Figure 5a (the averaged AUC score is 0.895), suggesting that this feature is more powerful for solving the AR task. We also plot the embeddings of the predictions on the validation dataset in Figure 6 using t-SNE [27]. The embeddings learned by the MTL version lie in a more reasonable subspace.

Figure 5. ROC curve of each accent. (a) The baseline version without the auxiliary ASR task. The averaged Area Under Curve (AUC) score is 0.895. (b) The proposed MTL version. The averaged Area Under Curve (AUC) score is 0.967.
Figure 6. 2D embeddings of the accent features. (a) The baseline version without the auxiliary ASR task (accuracy 60.0%). (b) The proposed MTL version (accuracy 79.9%).

4.4. Robustness Test


As noted in Section 3.3, if the ASR dataset is constructed from non-native speakers with accents, the transcription of certain phonemes may be labelled as the text that the speakers are asked to read, rather than what is actually pronounced. In other words, pronunciations with different accents are merged and all labelled as the same text. If the model is dominated by the auxiliary ASR task, it will be hard for the model to find the uniqueness of different accents. As a result, the performance of the AR task may be degraded. In this subsection, we set up experiments to validate the performance of the proposed hybrid model under different ASR situations, as shown in Table 3.
To simulate the transcription confusion introduced by accents, we randomly map a certain phoneme w to the upper-level group it belongs to (denoted as G(w)), according to a probability p sampled from a uniform distribution U(0, 1). The hierarchy of phonemes in the English language is shown in Figure 7 and is commonly used by Refs. [28–30].

Figure 7. The hierarchy of phonemes in the English language (top level: Obstruent and Sonorants; groups: Aspirate, Fricative, Affricate, Stop, Nasal, Vowel, Semivowel, Liquid).

Formally, this process can be denoted as follows,
$$w' = \begin{cases} G(w) & \text{if } p < \theta \\ w & \text{otherwise} \end{cases}, \qquad p \sim U(0, 1). \qquad (20)$$
As θ increases, more pronunciations are not labelled as the original phonemes but are corrupted into the upper group. We use the group index for classification instead of the original phoneme index.
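A small sketch of this corruption process (Equation (20)) is shown below; the phoneme-to-group mapping is abbreviated to a few entries following Figure 7 and is purely illustrative.

```python
import random

PHONE_GROUPS = {                 # partial mapping following Figure 7 (illustrative)
    "S": "Fricative", "Z": "Fricative", "F": "Fricative",
    "B": "Stop", "D": "Stop", "G": "Stop",
    "AA": "Vowel", "AE": "Vowel", "IY": "Vowel",
    "M": "Nasal", "N": "Nasal",
    "L": "Liquid", "R": "Liquid",
}

def corrupt(phonemes, theta):
    """Replace each phoneme by its group index with probability theta (Equation (20))."""
    out = []
    for w in phonemes:
        p = random.uniform(0.0, 1.0)             # p ~ U(0, 1)
        out.append(PHONE_GROUPS.get(w, w) if p < theta else w)
    return out

print(corrupt(["AE", "P", "AH", "L"], theta=0.5))   # e.g., ['Vowel', 'P', 'AH', 'L']
```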
Meanwhile, to test the effect of the phonetic information on AR, we also test an extreme situation in which the text transcriptions are randomly generated. We still use the MTL version (Figure 3c) for this experiment, but the trainable acoustic model $AM_t$ is not pretrained on AESRC or Librispeech. Under this situation, the whole model learns wrong phonetic information from the accented speech. We also test the proposed hybrid version (Figure 3d, $AM_t$ also not pretrained) for comparison.

Table 3. Accuracy for the robustness test under different transcription situations. For example, "50% ↓" means that θ = 0.5 (i.e., 50% of the transcriptions are corrupted). All models use λ = 0.1. For the hybrid version, we use the channel-attention-based fusion (Figure 2c). Please note that Id 12 and 13 are in fact the same as Id 6 and 11 in Table 2, respectively.

Trans | Id | Model | US | UK | CN | IN | JP | KR | PT | RU | Ave
0% ↓ | 12 | MTL | 68.6 | 91.8 | 86.9 | 99.1 | 71.2 | 56.6 | 89.5 | 76.0 | 79.9
0% ↓ | 13 | Hybrid | 63.1 | 93.3 | 88.9 | 98.3 | 73.9 | 66.3 | 95.3 | 73.7 | 82.2
50% ↓ | 14 | MTL | 54.5 | 93.7 | 91.9 | 99.6 | 62.6 | 67.2 | 79.6 | 66.2 | 76.8
50% ↓ | 15 | Hybrid | 50.1 | 91.8 | 88.3 | 98.6 | 72.9 | 63.7 | 94.6 | 73.3 | 79.3
100% ↓ | 16 | MTL | 49.0 | 93.6 | 86.6 | 99.5 | 56.6 | 58.4 | 76.7 | 77.1 | 74.7
100% ↓ | 17 | Hybrid | 52.6 | 90.9 | 78.6 | 98.3 | 72.7 | 59.6 | 97.5 | 68.9 | 77.5
Random | 18 | MTL | 28.8 | 81.9 | 33.6 | 73.3 | 39.1 | 34.9 | 73.3 | 45.1 | 51.5
Random | 19 | Hybrid | 46.6 | 73.5 | 61.8 | 88.6 | 49.7 | 43.0 | 86.6 | 68.1 | 64.8

As the random transcription is not helpful for the AR task, the channel-attention
mechanism learns to pay more attention to the reference embedding in Figure 8b compared
to the normal situation in Figure 8a.

Figure 8. Channel-attention weight ($CA \in \mathbb{R}^{1 \times 2d_{emb}}$) under different training conditions. When the transcription becomes unreliable, the channel-attention block pays more attention to the reference embedding. For the model trained with normal transcription (a), the summed attention weight for the reference embedding divided by the summed attention weight for the trainable embedding is $\left(\sum_{c=d_{emb}}^{2d_{emb}-1} CA_c\right) / \left(\sum_{c=0}^{d_{emb}-1} CA_c\right) = 1.02$. In contrast, for the model trained with randomly generated transcription (b), $\left(\sum_{c=d_{emb}}^{2d_{emb}-1} CA_c\right) / \left(\sum_{c=0}^{d_{emb}-1} CA_c\right) = 1.63$.

5. Conclusions
In this paper, we propose that the AR model may overfit to the speaker recognition task, as using the speaker information in the trainset to distinguish the accent label is easier than using language-related information. To validate this assumption, we probe the speaker-related information of the baseline and the proposed MTL version by freezing the model and replacing the last linear layer for an additional speaker recognition training. The results show that the auxiliary ASR task can force the model to extract speaker-invariant and language-related features, and this auxiliary task leads to better AR performance. Furthermore, the hybrid structure is designed to fuse the embeddings of two acoustic models. With the proposed Concatenation-ChannelAttention-based fusion, the model can choose to pay more attention to the standard reference embedding if the accented transcription is not reliable. Thus, the extracted features can be more robust. The proposed method achieves a relative improvement of 8.02% over the best Transformer-based baseline, showing the merits of our method.

However, despite the improved performance, the training process of the proposed method is quite complex, as it requires two acoustic models to be trained separately. Meanwhile, as the accent recognition model is generally applied as a frontend to customize the downstream system, we expect the model to have fewer parameters. Therefore, it is worthwhile to explore a more efficient accent prediction structure in the future.

Author Contributions: Investigation, Y.W.; Methodology, Z.Z.; Resources, J.Y.; Software, Z.Z.; Super-
vision, Y.W. and J.Y.; Validation, Y.W.; Writing original draft, Z.Z. All authors have read and agreed
to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Chu, X.; Combs, E.; Wang, A.; Picheny, M. Accented Speech Recognition Inspired by Human Perception. 2021. Available online:
https://arxiv.org/pdf/2104.04627.pdf (accessed on 30 July 2021) .
2. Huang, C.; Chen, T.; Li, S.; Chang, E.; Zhou, J. Analysis of speaker variability. In Proceedings of the 7th European Conference on
Speech Communication and Technology, Aalborg, Denmark, 3–7 September 2001; pp. 1377–1380.
3. Levis, J.; Barriuso, T. Nonnative speakers’ pronunciation errors in spoken and read English. In Proceedings of the 3rd
Pronunciation in Second Language Learning and Teaching (PSLLTP), Ames, IA, USA, 16–17 September 2011; pp. 187–194.
4. Viglino, T.; Motlicek, P.; Cernak, M. End-to-end accented speech recognition. In Proceedings of the 20th Annual Conference of
the International Speech Communication Association, Graz, Austria, 15–19 September 2019 ; pp. 2140–2144. [CrossRef]
5. Crawshaw, M. Multi-Task Learning with Deep Neural Networks: A Survey. 2020. Available online: https://arxiv.org/abs/2009.09796 (accessed on 29 July 2021).
6. Shi, X.; Yu, F.; Lu, Y.; Liang, Y.; Feng, Q.; Wang, D.; Qian, Y.; Xie, L. The accented english speech recognition challenge 2020: Open
datasets, tracks, baselines, results and methods. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech
and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6918–6922.
7. Naranjo-Alcazar, J.; Perez-Castanos, S.; Zuccarello, P.; Cobos, M. CNN Depth Analysis with Different Channel Inputs for Acoustic Scene Classification. 2019. Available online: https://arxiv.org/abs/1906.04591 (accessed on 28 July 2021).
8. Wu, Y.; Lee, T. Enhancing Sound Texture in CNN-based Acoustic Scene Classification. In Proceedings of the 2019 IEEE
International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 815–819. [CrossRef]
9. Zheng, W.; Mo, Z.; Xing, X.; Zhao, G. CNNs-Based Acoustic Scene Classification Using Multi-Spectrogram Fusion and Label
Expansions. 2018. Available online: https://arxiv.org/abs/1809.01543 (accessed on 30 July 2021).
10. Behravan, H.; Hautamaki, V.; Siniscalchi, S.M.; Kinnunen, T.; Lee, C.H. Introducing attribute features to foreign accent recognition.
In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 4–9 May
2014; pp. 5332–5336.
11. Mannepalli, K.; Sastry, P.N.; Suman, M. MFCC-GMM based accent recognition system for Telugu speech signals. Int. J. Speech
Technol. 2016, 19, 87–93. [CrossRef]
12. Behravan, H.; Hautamaki, V.; Siniscalchi, S.M.; Kinnunen, T.; Lee, C.H. I-Vector Modeling of Speech Attributes for Automatic
Foreign Accent Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 29–41. [CrossRef]
13. Chen, J.; Cai, W.; Cai, D.; Cai, Z.; Zhong, H.; Li, M. End-to-end language identification using NetFV and NetVLAD. In Proceedings
of the 2018 11th International Symposium on Chinese Spoken Language Processing, Taipei, Taiwan, 26–29 November 2018;
pp. 319–323. [CrossRef]
14. Cai, W.; Chen, J.; Li, M. Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition
System. In Proceedings of the Odyssey 2018 The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France,
26–29 June 2018; pp. 74–81.
15. Najafian, M.; Russell, M. Automatic accent identification as an analytical tool for accent robust automatic speech recognition.
Speech Commun. 2020, 122, 44–55. [CrossRef]
16. Weninger, F.; Sun, Y.; Park, J.; Willett, D.; Zhan, P. Deep learning based Mandarin accent identification for accent robust ASR. In
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria,
15–19 September 2019; pp. 510–514. [CrossRef]

17. Li, J.; Lavrukhin, V.; Ginsburg, B.; Leary, R.; Kuchaiev, O.; Cohen, J.M.; Nguyen, H.; Gadde, R.T. Jasper: An End-to-End
Convolutional Neural Acoustic Model. In Proceedings of the Annual Conference of the International Speech Communication
Association, Graz, Austria, 15–19 September 2019; pp. 71–75.
18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In
Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December
2017; pp. 5998–6008.
19. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust dnn embeddings for speaker recognition. In
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April
2018; pp. 5329–5333.
20. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence
data with recurrent neural networks. ACM Int. Conf. Proc. Ser. 2006, 148, 369–376. [CrossRef]
21. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Piscataway, NJ, USA, 18–22 June 2018; pp. 7132–7141. [CrossRef]
22. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European conference
on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
23. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, QLD, Australia,
19–24 April 2015; pp. 5206–5210. [CrossRef]
24. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch:
An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
25. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.;
et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and
Understanding, Big Island, HI, USA, 11–15 December 2011.
26. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation
Method for Automatic Speech Recognition. In Proceedings of the Annual Conference of the International Speech Communication
Association, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [CrossRef]
27. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
28. Hamooni, H.; Mueen, A. Dual-Domain Hierarchical Classification of Phonetic Time Series. In Proceedings of the 2014 IEEE
International Conference on Data Mining, Shenzhen, China, 14–17 December 2014; pp. 160–169. [CrossRef]
29. Dekel, O.; Keshet, J.; Singer, Y. An Online Algorithm for Hierarchical Phoneme Classification. In Proceedings of the International
Workshop Machine Learning for Multimodal Interaction, Edinburgh, UK, 11–13 July 2005; pp. 146–158. [CrossRef]
30. Lee, K.F.; Hon, H.W. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Acoust. Speech Signal
Process. 1989, 37, 1641–1648. [CrossRef]
