1 Introduction

In a world increasingly connected through globalization and digital communication, people seek to learn new languages to enhance cross-cultural communication and international collaboration. Arabic is one of the most widespread languages in the world, with more than 422 million speakers [1]. In addition to communication, learning Arabic is essential for every Muslim, since it is the language of the Holy Quran and Sunnah, the primary sources of Islam. Therefore, Arabic can be considered a second language (L2) for every non-Arabic-speaking Muslim. This has led to an increasing worldwide need for learning Arabic as a second language and a growing demand for teachers of non-native Arabic learners.

Computer-assisted pronunciation training (CAPT) systems have been developed to address the high demand for L2 learning and the shortage of learning resources. As an integral part of CAPT systems, mispronunciation detection and diagnosis (MDD) was introduced to help non-native language learners improve their pronunciation by detecting pronunciation errors and providing diagnostic feedback. Substantial research effort has been devoted to the development of MDD across languages such as English, Dutch, and Mandarin Chinese. Although Arabic is one of the most widely spoken languages in the world, existing tools and resources for mispronunciation detection and diagnosis in Arabic are limited and focus on conventional machine learning techniques rather than state-of-the-art deep learning methods.

Earlier MDD systems were implemented using scoring-based methods with confidence measures originally proposed for automatic speech recognition (ASR). One of the most popular scoring-based methods is goodness of pronunciation (GOP), which is computed from the log-posterior probability obtained through forced alignment [2]. GOP achieved strong results in mispronunciation detection but failed to provide detailed feedback to learners [3]. To provide effective feedback, the extended recognition network (ERN) was introduced for MDD; it incorporates common pronunciation error patterns into the lexicon along with the canonical transcription of each prompted word [4]. ERN-based models use handcrafted and data-driven rules to detect errors and error types. However, it is practically difficult to encode sufficient phonological rules in the lexicon for the various L1-L2 language pairs [5]. In addition, handling too many phonological rules may degrade ASR accuracy, leading to low MDD performance.
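
For reference, a commonly cited formulation of the GOP score (following the original definition; the exact notation in [2] may differ slightly) for a phone p, where O^(p) denotes the acoustic observations aligned to p over NF(p) frames, is

$$GOP(p) = \frac{1}{NF(p)}\left|\log \frac{p\left(O^{(p)} \mid p\right)}{\max_{q \in Q} p\left(O^{(p)} \mid q\right)}\right|$$

where Q is the phone set; a phone is typically flagged as mispronounced when its GOP score falls below a phone-dependent threshold.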

Recently, end-to-end architectures have achieved promising performance in ASR systems as well as in MDD. End-to-end systems leverage the advances in deep neural networks (DNNs) by learning complex and abstract representations directly from speech waveforms. Various MDD models have been proposed based on end-to-end speech recognition, including CNN-RNN-CTC [6], which takes acoustic features as input without requiring phonemic information; hence, no forced alignment, whose errors can degrade MDD performance, is needed. The CTC-Attention model [7] is another CTC-based method that extends the original L2 phone set to include both categorical and non-categorical errors. A further improvement is the sentence-dependent end-to-end model for MDD (SED-MDD) [8], which integrates both linguistic and acoustic features and implicitly learns phonological rules directly from the phonological annotations and transcriptions in the training data.

The Transformer [9] is a modern deep learning architecture that relies mainly on the self-attention mechanism and has strong representation capabilities. The high performance achieved with Transformers in various natural language processing (NLP) tasks and ASR systems has attracted researchers to adopt this approach for MDD. Wu et al. [10] introduced two Transformer-based models for MDD: the first is based on an encoder-decoder architecture, and the second is based on Wav2Vec 2.0. Both models achieved strong results in phone recognition and MDD, with diagnostic accuracies of 91.39% and 90.05%, respectively, and phone accuracies of 91.31% and 94.03% for the first and second models, respectively.

The promising results achieved with Transformer-based techniques for MDD motivated us to investigate these techniques to detect pronunciation errors at the phoneme level and provide feedback to non-native Arabic speakers in continuous speech. In this study, several Transformer-based models are fine-tuned for the MDD task, including Wav2Vec2.0 [11], HuBERT [12], and Whisper [13]. All of these models integrate acoustic and linguistic features during training to learn phonological rules directly from the phonological transcriptions.

The main objectives of this study are as follows: (1) to collect speech data and construct a dataset for native and non-native Arabic speakers, (2) to implement a framework to detect mispronunciations in Arabic speech and provide pinpoint feedback to learners using Transformer-based models, and (3) to contribute to the improvement of e-learning by developing techniques that help second language (L2) learners to enhance their language skills without the need for qualified teachers.

This study explored innovative methods for enhancing pronunciation accuracy among non-native Arabic speakers. In light of this objective, the research question guiding this investigation is:

How do Transformer-based techniques influence the ability of Arabic MDD systems to detect pronunciation errors among non-native Arabic speakers?

The remainder of this paper is organized as follows. Section 2 presents background information on the concepts related to this study. Section 3 reviews the existing literature in our research area and discusses its limitations to identify the need for the current study. Section 4 provides an overview of the methodological framework and performance evaluation criteria. Section 5 describes the experimental studies, and Sect. 6 discusses the obtained results. Section 7 presents a human perceptual test to assess the ability of humans to detect mispronunciation errors in the L2-KSU dataset and compares their evaluation with the automatic verification. Finally, Sect. 8 presents the conclusion along with the limitations and future work.

2 Background

2.1 Mispronunciation detection and diagnosis (MDD)

Mispronunciation Detection and Diagnosis (MDD) is a Computer-Aided Pronunciation Training (CAPT) technology that enhances self-directed language learning with individualized feedback and accessibility [14]. The main goal of MDD systems is to facilitate second language (L2) learning through two major tasks: mispronunciation detection, which detects pronunciation errors in the learner’s articulation, and mispronunciation diagnosis, which identifies the type of each error and generates corrective feedback. MDD systems have been implemented using ASR technologies that transcribe the speech input and assess pronunciation accuracy [14]. Additionally, MDD can be based on segmental and suprasegmental features [15]. Segmental features involve phonemes and words, while suprasegmental features include richer information structures, such as pitch accent [16, 17], lexical stress [18], and intonation [19]. The development of MDD has been investigated through different methods of detecting the phone-level mispronunciation patterns of L2 learners using ASR-related techniques. The earliest method was pronunciation scoring (e.g., GOP), which computes a likelihood-based score comparing the canonical phone sequence of a given text with the phones recognized by the acoustic model to detect pronunciation errors [20]. Recent studies on MDD have focused on end-to-end ASR models with deep learning techniques owing to their superior performance compared with scoring-based methods [6, 20]. The basic structure of end-to-end MDD combines two models into a unified one: an acoustic model for recognizing spoken phonemes and a label classifier for detecting mispronunciations.

2.2 Transformer

The Transformer [9] is a prominent deep learning architecture that has been widely applied in different areas, such as NLP and ASR. The success of attention in sequence-to-sequence tasks led to the proposal of the Transformer, which applies attention directly to the input without the need for recurrent connections in the network. Attention is a deep learning technique that was proposed to address issues in convolutional and recurrent networks, such as the long-term dependencies that are difficult for recurrent networks to handle using gradient descent [21]. The attention mechanism addresses this issue by focusing on the relevant parts of the input sequence for each element of the output sequence, thereby improving prediction accuracy. It considers all time steps that contribute to the output, although this may increase the amount of computation for long sequences. The Transformer minimizes the sequential dependencies of the network by applying “multi-head” attention directly to the input embeddings [22]. Three types of attention are used in the Transformer in terms of queries and key-value pairs: self-attention in the encoder, masked self-attention in the decoder, and cross-attention, in which the queries are projected from the output of the decoder. The original Transformer architecture (the vanilla Transformer) is a sequence-to-sequence model consisting of an encoder and a decoder, each of which is a stack of L identical blocks [9]. As shown in Fig. 1, each encoder block consists of a multi-head self-attention module and a position-wise feed-forward network (FFN). The decoder blocks insert cross-attention modules between the multi-head self-attention modules and the position-wise FFNs [23]. Furthermore, the Transformer architecture can be used in three different configurations: encoder-decoder, which is used in sequence-to-sequence models; encoder only, which is usually applied to classification and sequence labeling problems; and decoder only, which can be applied to sequence generation, such as language modeling [23].

Fig. 1
figure 1

The Transformer Encoder-Decoder Architecture [9]
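
To make the attention mechanism described above concrete, the following minimal PyTorch sketch implements scaled dot-product attention and a multi-head attention module; the dimensions and the usage example are illustrative and do not correspond to any specific model in this study.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, heads, tgt_len, src_len)
    if mask is not None:                                  # e.g., masked self-attention in the decoder
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

class MultiHeadAttention(torch.nn.Module):
    """Minimal multi-head attention: project, split into heads, attend, and re-merge."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.wq = torch.nn.Linear(d_model, d_model)
        self.wk = torch.nn.Linear(d_model, d_model)
        self.wv = torch.nn.Linear(d_model, d_model)
        self.wo = torch.nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        b = query.size(0)
        def split(x, proj):   # (b, len, d_model) -> (b, heads, len, d_head)
            return proj(x).view(b, -1, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(query, self.wq), split(key, self.wk), split(value, self.wv)
        out = scaled_dot_product_attention(q, k, v, mask)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_head)
        return self.wo(out)

# Self-attention: query, key, and value all come from the same sequence.
x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 10, 512])
```

In cross-attention, the same module is called with the decoder states as the query and the encoder outputs as the key and value.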

2.3 Acoustic characteristics of Arabic sounds

The phonological classification of Arabic sounds is divided into two main parts: consonants and vowels [24]. The difference between consonants and vowels is based on the movements of the articulators in the vocal tract. The production of consonants requires movements of parts of the vocal tract, whereas vowels are produced with less movement and an open vocal tract [25]. The Arabic language consists of 36 sounds, including three short vowels (/a/ َ, /i/ ِ, /u/ ُ), three long vowels (/aː/ ا, /iː/ ي, and /uː/ و), two diphthongs (i.e., sounds formed by combining two vowels), and twenty-eight consonants [24]. All Arabic consonants, long vowels, and short vowels are presented in Tables 1 and 2 with their corresponding International Phonetic Alphabet (IPA) symbols.

Table 1 Arabic Consonants with IPA Symbols
Table 2 Short Vowels and Long Vowels in Arabic

Consonants are classified according to three criteria: airstream mechanism, place of articulation, and manner of articulation. Airstream mechanisms concern the initiation and direction of the airflow used to produce sounds; the major mechanisms are pulmonic, velaric, and glottalic, which are further classified as egressive or ingressive [25]. The place of articulation refers to the position of the articulators in the vocal tract and the relationships between these articulators, starting from the lips and ending at the larynx [24, 25]. Arabic sounds, such as bilabial, uvular, and pharyngeal sounds, are described according to where in the vocal tract they are produced. Table 3 presents the places of articulation employed in Arabic with their corresponding sounds [24]. In addition, Table 3 describes the active articulators responsible for movements and dynamic adjustments during speech production and the passive articulators that provide a stable, fixed surface against the active articulators [26]. Furthermore, one Arabic sound, (/w/ و), has more than one place of articulation and can be described as both a labial and a velar sound [24].

Table 3 The Major Places of Articulation in the Arabic Language

3 Related works

In this section, we review the literature related to our research area. We begin by exploring prior studies on automatic speech recognition, specifically focusing on Arabic. Subsequently, we investigate the field of mispronunciation detection and diagnosis as it applies to non-Arabic languages, representing the latest approaches and techniques. Furthermore, we extend our analysis of MDD to include studies applied to Arabic, identifying the key challenges and advancements. By synthesizing these diverse related works, we identify gaps, discuss the limitations of the related studies, and draw inspiration for our contributions and experiments.

3.1 Arabic automatic speech recognition (Arabic ASR)

The development of Arabic automatic speech recognition is a multi-disciplinary process that involves machine learning, linguistics, and speech signal processing, and it has been addressed by a number of researchers [27]. In addition, Arabic ASR systems have been applied in various linguistic fields, including phonology [28], morphology [29], semantics [30], and syntactic features [31].

Because one of our goals in this research is to enhance e-learning using ASR systems, this section reviews Arabic ASR systems that leverage linguistic principles to enhance learning and memorization processes for different types of texts, including the Quran and other Arabic texts.

A recent study [32] investigated DNN-based models for recognizing classical Arabic speech by considering the diacritics in the transcription texts. Three models were developed: Time Delay Neural Network (TDNN)-CTC, Recurrent Neural Network (RNN)-CTC, and Transformer, which were trained using a dataset consisting of 100 h of Quran recordings. The RNN-CTC model obtained the best results, with a Word Error Rate (WER) of 19.43% and a Character Error Rate (CER) of 3.5%. Additionally, Hadwan et al. [33] proposed an end-to-end Transformer-based model for recognizing Quran verses with diacritized transcriptions. The model employed a Mel filter bank with 40-dimensional acoustic features and integrated a CNN as an encoder front-end for subsampling. The model was trained and evaluated on a new dataset comprising 10 h of Quran verses recited by 60 reciters. The evaluation achieved a CER of 1.98% and a WER of 6.16%, demonstrating its effectiveness in verse recognition.

Other studies have utilized speech recognition APIs to develop ASR systems. These systems were evaluated by comparing the similarity between the transcribed text and the original text. Marie-Sainte et al. [34] proposed an Arabic ASR system called “Samee’a”, developed to enhance the learning and memorization of any kind of Arabic text. The speech recognition process was built on the Google Cloud Speech Recognition API [35], and the similarity scores were obtained using the Jaro-Winkler distance algorithm. The experiment was evaluated using the average similarity obtained on text files of different sizes, and the results showed that the average similarities for the small, medium, and large files were 95%, 90%, and 87%, respectively. In addition, Gerhana et al. [36] proposed an Arabic ASR system that helps users memorize Juz 30 of the Qur’an using the Google Speech API. The system converts speech into text, and the similarity between the converted text and the original text is measured using the Jaro-Winkler distance algorithm. The system was evaluated by comparing the text produced by the ASR system with the original Quranic text, and it achieved an accuracy of 90%.
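
For illustration, the Jaro-Winkler similarity used in these systems can be computed with the generic, self-contained sketch below; this is not the implementation of [34] or [36], and the example strings are arbitrary.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matching characters, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                                   # find matches within the window
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not m2[j] and s2[j] == c:
                m1[i], m2[j] = True, True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i, matched in enumerate(m1):                             # count out-of-order matches
        if matched:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Similarity between a recognized transcript and the reference text (classic toy example).
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))   # 0.961
```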

3.2 Mispronunciation detection and diagnosis for non-Arabic languages

Extensive research has explored mispronunciation detection and diagnosis (MDD) in various languages, including English, Mandarin, and Dutch. As discussed in the Introduction, the improvement of MDD systems has been achieved through three approaches: (1) pronunciation-scoring approaches, such as GOP [37, 38]; (2) approaches based on phonological rules, such as ERN [39]; and (3) approaches based on end-to-end ASR systems.

The third type of MDD approach was introduced to overcome the limitation of ERNs, which may not efficiently cover all mispronunciation possibilities from all language learners. When the recognition network covers too many mispronunciations, the accuracy of selecting the correct alternative in the acoustic model may drop [6]. Various studies have investigated neural end-to-end architectures that have been applied effectively to ASR, which aim to convert a sequence of input acoustic features into a sequence of words or graphemes [40]. End-to-end systems have achieved high performance in ASR because of the efficiency of neural networks in modeling the context and history of speech and text sequences [41].

One of the earliest end-to-end MDD systems was proposed by Leung et al. [6]; it was implemented using CNN, RNN, and CTC components and applied a free-phone recognition approach. They developed end-to-end speech recognition using a CNN-RNN-CTC model to improve the MDD task, where the CNN was applied to reduce the phone error rate and improve performance in noisy environments. The experiment was conducted using the TIMIT corpus as the native corpus and CU-CHLOE, a corpus of Chinese speakers learning English, as the non-native corpus. The model was evaluated on phone recognition performance and the MDD task. The phone recognition results were reported as the correct phone rate, insertion rate, deletion rate, and substitution rate, with best results of 87.93%, 2.15%, 2.90%, and 7.01%, respectively. The MDD performance outperformed the other baseline models, with an F-measure of 74.62%.

Yan et al. [7] proposed an E2E-based MDD model consisting of a hybrid CTC-Attention model. To overcome the gap in detecting non-categorical pronunciation errors in phoneme-based approaches (e.g., ERN), they expanded the original L2 phone set with their corresponding anti-phone set to include more mispronunciation possibilities and improve mispronunciation detection and diagnosis performance. Another study [42] used acoustic and linguistic features to develop a sentence-dependent end-to-end model of MDD.

In modern deep-learning stacks, Transformers have been increasingly used in ASR and have achieved promising performance [10, 43]. For MDD, two Transformer-based models were proposed in [10]. The first model (T-1) follows the encoder-decoder architecture, and the second model (T-2) is based on Wav2Vec 2.0. Experiments were conducted using CU-CHLOE, a corpus of Chinese speakers learning English, as the non-native corpus. The TIMIT corpus together with the training set of the CU-CHLOE corpus was used to train models T-1 and T-2, and the models were evaluated on phone recognition and MDD performance. In the phone recognition evaluation, models T-1 and T-2 achieved phone accuracies of 91.31% and 94.03%, respectively. In terms of MDD evaluation, the diagnostic accuracy of T-1 was 91.39%, while that of T-2 was 90.05%. Both models outperformed other state-of-the-art MDD models (CNN-RNN-CTC and AGPM). In [44], the authors developed an MDD model through two-stage experiments. First, they evaluated the feasibility of the pre-trained Wav2Vec2.0 model against conventional methods and investigated the impact of the quantity and type of training data on fine-tuning Wav2Vec2.0 for MDD. Second, building on the optimal configuration from the first stage, they examined the influence of various textual modulation gates on MDD performance. Gue et al. [45] proposed an MDD framework for Mandarin based on Squeezeformer (a Transformer-based model for automatic speech recognition), a dual encoder, multi-modal features, and a secondary decoding mechanism. The model efficiently integrates multi-modal information to enhance the representation capacity of the speech features; it can therefore capture the inherent aspects of a speaker's pronunciation, leading to improved accuracy in phoneme recognition and higher quality in detecting and diagnosing mispronunciations. This model achieved high performance, with a PER of 1.50% and an F1-score of 79.43%. Recently, Das et al. [46] proposed a multi-task learning MDD approach that combines MDD and text-to-speech (TTS) tasks in parallel. The proposed system takes speech signals as input, extracts latent features using a Wav2Vec model as a feature extractor, and combines them with the input text to predict the phone sequence. A text-to-speech system is then deployed to reconstruct the speech from the phone sequence along with the identification of the target speaker. They achieved an MDD accuracy of 85.4%, with a False Rejection Rate (FRR) of 7.2% and a False Acceptance Rate (FAR) of 47.0%. Moreover, Wan et al. [47] deployed a meta-learning approach that utilizes existing knowledge for swift adaptation to new tasks and minimizes the model's reliance on extensive labeled data. The study employed a model-agnostic meta-learning (MAML) method for MDD to mitigate the challenges posed by the limited data of L2 learners. They conducted various few-shot fine-tuning experiments, achieving a best PER of 40.89%. Although this result is relatively high and suboptimal compared with previous studies, this is likely due to the limited training data. In contrast, the FAR and FRR improved when fine-tuning with less data: the best FAR was 8.4% and the best FRR was 29.7%. Furthermore, a comparative study conducted by Soundarya et al. [48] evaluated various MDD models, including Transformer, anti-phone modeling, accent modulation, APL embeddings, E2E MDD, non-autoregressive MDD, and hybrid CTC-ATT. The Transformer-based model showed the highest F-measure of 60.50% and a detection accuracy of 88.2%. In contrast, the CTC-ATT model achieved the second-highest F-measure (56%), with detection and diagnosis accuracies of 75.45% and 74.96%, respectively. Lastly, Lounic et al. [49] conducted a systematic literature review of available work on MDD to identify and evaluate the current approaches to using deep learning techniques for MDD tasks. They reviewed 53 papers published between 2015 and 2023 and concluded that the majority of the papers focused on DNN-based models and that the most utilized corpora were in English and Mandarin. Table 4 summarizes the studies that investigated end-to-end architectures for MDD systems, presenting the implemented techniques and performance results.

Table 4 MDD Systems for Non-Arabic Languages

3.3 Mispronunciation detection and diagnosis for the Arabic language

Despite the increasing improvement in MDD research across languages, the Arabic language has received little attention in this area. In this section, we review previous studies on Arabic MDD using traditional and deep-learning methods.

Nazir et al. [50] proposed three models for detecting the mispronunciation of Arabic phonemes using different techniques: a handcrafted features model, a CNN features model, and a transfer learning model. The transfer learning-based method achieved the highest accuracy of 92.2%, whereas the handcrafted features model and CNN features model obtained accuracies of 82% and 91.7%, respectively. Akhtar et al. [51] proposed a feature-based model for mispronunciation detection of Arabic words by extracting deep features using different layers of a Convolutional Neural Network (CNN). They constructed a dataset consisting of Arabic words spoken by different Pakistani speakers learning Arabic. The proposed model achieved an accuracy of 93.30% in detecting the mispronunciation of Arabic words compared with other feature extraction methods (i.e., learning-based models and traditional hand-crafted features).

Several studies have leveraged deep learning techniques to enhance the accuracy of MDD in Arabic speech. Ziafat et al. [52] proposed a mispronunciation detection model for the Arabic alphabet using deep learning techniques. The proposed model is divided into two processes: (a) Arabic alphabet classification, which trains the model to recognize a letter, and (b) Arabic alphabet pronunciation, which trains the model to evaluate the quality of pronunciation. They used the Mel spectrogram for feature extraction of the Arabic alphabet and employed a deep convolutional neural network (DCNN), AlexNet with transfer learning, and bidirectional long short-term memory (BiLSTM) for classification. The accuracy of alphabet classification was 95.95%, 98.41%, and 88.32% for DCNN, AlexNet, and BiLSTM, respectively. For alphabet pronunciation classification, the accuracies of DCNN, AlexNet, and BiLSTM were 97.88%, 99.14%, and 77.71%, respectively. In addition, Algabri et al. [53] investigated end-to-end deep learning techniques to develop an MDD system for non-native Arabic speech. The proposed system uses a multi-label object detector to recognize the phoneme sequence along with the articulatory features (AF) of the spoken utterance. The MDD task was accomplished by recognizing the phoneme sequence, whereas the AF enabled mispronunciation correction at the articulatory level. The model was trained using the Arabic-CAPT and Arabic-CAPT-S corpora constructed in their study. For phoneme recognition, the system achieved a 3.83% PER, a 46.42% improvement over the baseline (CNN-RNN-CTC), while the MDD performance reached a 70.53% F1 score. For mispronunciation detection, Ahmad et al. [54] proposed a framework for two classification tasks: mispronunciation detection and speaker gender identification. The speech dataset was collected from native and non-native Arabic speakers of diverse nationalities and languages. They used MFCCs for feature extraction and an LSTM model for classification. The evaluation results showed that mispronunciation detection achieved an average accuracy of 81.52%, whereas incorporating gender recognition with mispronunciation detection yielded an accuracy of 83.52%. The study indicated that Arabic mispronunciation patterns might not be gender specific. Furthermore, Calik et al. [55] introduced an ensemble model that detects mispronunciations of Arabic phonemes. They employed MFCC and Mel spectrogram techniques for feature extraction and several classification methods, such as SVM, K-NN, Decision Tree, Naïve Bayes, and Random Forest. Additionally, they utilized ensemble learning techniques, such as voting, bagging, boosting, and stacking, to enhance the overall accuracy of the classification models. The dataset was collected from 11 speakers who recorded the Arabic letters, with each letter recorded individually. The results showed that the voting ensemble method with the Mel spectrogram feature extractor achieved the highest accuracy of 95.9%. Recently, El Kheir et al. [56] proposed an L1-aware multilingual MDD framework, an E2E-MDD model designed to accommodate speakers from different L1 backgrounds. The proposed model supports three target languages: English, Arabic, and Mandarin. The framework consists of a primary MDD network and an auxiliary network implemented to encode an L1-L2 aware representation. The MDD network performs phoneme recognition to detect the phoneme sequence, and the auxiliary network encodes the speaker-invariant representation through supervised classification of the speaker's L1 and L2 languages. Table 5 presents a summary of the aforementioned studies, based on the implemented techniques and performance results.

Table 5 MDD Systems for the Arabic Language

3.4 Discussion

In view of the aforementioned studies, despite the rapid improvements in MDD for different languages through several deep learning techniques and end-to-end models, we found little research effort on the Arabic language. Most existing studies on MDD in Arabic employ classifier-based systems that extract phoneme features and feed them into different classifiers, such as KNN, SVM, and CNN. Deep learning techniques, such as CNN, DCNN, and AlexNet, are used in the feature extraction process to extract deep features with transfer learning [52]. Furthermore, these studies focused solely on mispronunciation detection without diagnosing the errors, that is, analyzing the type of each detected error. Analyzing pronunciation errors is crucial because it allows a detailed examination of the type of each detected error at the phoneme level. This level of analysis provides valuable insights into the pronunciation errors made by language learners, which can improve the effectiveness of language learning tools.

End-to-end ASR systems have been effectively applied to many MDD systems for non-Arabic languages using different deep learning models to enhance the performance of phone recognition and MDD tasks. Recently, self-supervised and semi-supervised learning techniques have gained traction in speech recognition by learning from unlabeled speech data and fine-tuning the model on labeled data [10]. Self-supervised Transformers have achieved good results in downstream tasks for both ASR [10, 43] and MDD [8]. This paper introduces an Arabic MDD framework to detect pronunciation errors made by non-native speakers at the phoneme level and to determine the type of each error using Transformer-based techniques. Additionally, this paper focuses on read speech from many non-native Arabic speakers. Furthermore, we conducted a human perceptual test to evaluate and compare our results.

4 Methodology

In this section, we provide an overview of the study’s methodology. First, we describe the datasets used, followed by a detailed description of the methodology and framework structure. In addition, we elucidate the evaluation metrics employed to gauge the effectiveness of our proposed framework.

4.1 Dataset

In this study, we mainly used our own dataset, namely the L2-KSU dataset, to train the MDD system. The L2-KSU dataset consists of 6 h and 6 min of audio recordings (4086 audio files) along with their labeled transcriptions, including the canonical text and the actual transcriptions with pronunciation errors. The data were collected from 80 speakers who recorded 4077 utterances. There was an even split of 40 native and 40 non-native speakers, with an equal distribution of male and female participants for each speaker type. The read sentences included both Quranic text and other MSA sentences. In selecting the sentences, we focused on phonemes that are hard for non-native speakers to pronounce, such as the sounds (/ʕ/ ع and /ħ/ ح), so that the model could learn more from the speakers' pronunciation errors. The selected sentences are presented in Table 6. The audio files are in WAV format with a 16 kHz mono sampling rate. Table 7 presents an overall description of the L2-KSU dataset. This dataset is available upon request.

Table 6 Selected Sentences for The L2-KSU Dataset
Table 7 L2-KSU Dataset Description

For the training process, we split the dataset into training and testing sets. Following [57], data from native speakers were used only in the training set, whereas data from non-native speakers were employed in both the training and testing splits. In addition, the splitting of the L2-KSU dataset was based on speakers: 60 speakers were selected for the training split, and the remaining 20 were chosen for the testing split. Only non-native speakers’ data were selected for the test set to evaluate the system’s ability to detect pronunciation errors from non-native speakers. Furthermore, different speakers may exhibit variations in their pronunciations, language backgrounds, and speech patterns. By splitting the data based on speakers, we ensured that the system remained unbiased toward specific speakers, enhancing its robustness and generalizability through evaluation on unseen speakers. Table 8 presents the details of the data division.

Table 8 Details of the L2-KSU Dataset Setup
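
A minimal sketch of this speaker-level split is shown below; it assumes a hypothetical list of (audio_path, speaker_id, is_native) records for the L2-KSU utterances and is not the exact script used in this study.

```python
import random

def split_by_speaker(records, n_test_speakers=20, seed=42):
    """Hold out entire non-native speakers for testing so that no speaker
    appears in both splits (avoiding speaker leakage)."""
    non_native = sorted({spk for _, spk, is_native in records if not is_native})
    rng = random.Random(seed)
    test_speakers = set(rng.sample(non_native, n_test_speakers))
    train = [r for r in records if r[1] not in test_speakers]   # natives + remaining non-natives
    test = [r for r in records if r[1] in test_speakers]        # non-native speakers only
    return train, test

# records = [("spk01_utt001.wav", "spk01", True), ...]  # hypothetical metadata
```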

4.2 Methodology overview

In this study, we developed an Arabic mispronunciation detection and diagnosis framework. We fine-tuned several Transformer-based pre-trained models, based on Wav2vec2.0 [11], HuBERT [12], and Whisper [13] using our constructed dataset. These models represent advanced speech recognition technology designed to enhance accuracy and efficiency across multiple languages, including Arabic, by leveraging state-of-the-art techniques in deep learning and speech processing. The following subsections provide an overview of each model.

4.2.1 Wav2Vec2.0

Wav2Vec2.0 is a self-supervised learning framework that learns speech representations from unlabeled speech data [11]. It represents a significant advancement in the field of speech processing, leveraging self-supervised learning techniques to learn powerful speech representations from large amounts of unlabeled data. The model design focuses on robustness, scalability, and efficiency, so that it performs effectively in various applications, including transcription and emotion recognition. Wav2Vec 2.0 combines a convolutional neural network (CNN) for feature extraction with a Transformer-based architecture for context modeling, achieving state-of-the-art speech recognition by pre-training on large unlabeled datasets and fine-tuning on task-specific data. It has shown remarkable success in various speech-related tasks and has significantly advanced the state of the art in speech recognition across languages [11]. Wav2Vec2.0 has been released as several pre-trained models with different parameter counts and training data. The base model, ‘Wav2vec2-base-960h’, has approximately 94M parameters and was trained on 960 h of the LibriSpeech corpus of native English speech. In addition, a large-scale multilingual pre-trained model, ‘Wav2Vec2-XLS-R-300M’ [58], was pre-trained with up to 300M parameters on 436k hours of unannotated speech data from various corpora covering 128 languages, including Arabic. Figure 2 shows the architecture of this self-supervised cross-lingual representation learning model.

Fig. 2
figure 2

The architecture of the Wav2Vec2.0/XLS-R model: self-supervised cross-lingual representation learning [58]
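
As a hedged illustration of how such a checkpoint can be prepared for phoneme-level CTC fine-tuning with the Transformers library, the sketch below loads the XLS-R model with a custom phoneme vocabulary; the vocab.json file and all specific settings are assumptions for illustration, and HuBERT checkpoints can be loaded analogously with HubertForCTC.

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# vocab.json is a hypothetical file mapping each phoneme symbol of the annotation
# scheme (plus [PAD] and [UNK]) to an integer id.
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",           # multilingual XLS-R checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),      # size of the phoneme vocabulary
)
model.freeze_feature_encoder()  # keep the CNN feature encoder frozen during fine-tuning
```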

4.2.2 HuBERT

The Hidden-Unit BERT (HuBERT) [12] is a speech representation model based on self-supervised learning. HuBERT addresses the following three main challenges in speech signal processing: (1) the presence of multiple sound units in each input utterance, (2) the absence of a lexicon of input sound units during the pre-training phase, unlike NLP applications in which words or word units are available, and (3) the variable boundaries between sound units without explicit segmentation. To overcome these challenges, HuBERT builds on the BERT model by employing masked continuous speech features to predict pre-determined cluster assignments. A predictive loss is applied to force the model to learn high-level representations of the unmasked inputs, allowing accurate inference of the masked targets. Therefore, the HuBERT model learns both an acoustic and a language model from continuous inputs. HuBERT was pre-trained on the standard LibriSpeech 960 h and the Libri-Light 60k hours to produce three model sizes: base (90M parameters), large (300M), and x-large (1B). Figure 3 shows the architecture of the HuBERT model, which predicts the hidden cluster assignments of masked frames produced through one or more iterations of k-means clustering [12].

Fig. 3
figure 3

The architecture of HuBERT model [12]

4.2.3 Whisper

Whisper is a state-of-the-art ASR system developed by OpenAI [13]. It is a Transformer-based encoder-decoder model; Fig. 4 shows its architecture. The input audio is segmented into 30-s chunks, which are converted into log-Mel spectrograms and fed into the encoder. The decoder is trained to predict the transcription text with special tokens that guide the single model to perform different tasks, such as language identification and multilingual speech translation. Whisper was pre-trained as different models with different parameter sizes and target languages: tiny with 39M parameters, base with 74M, small with 244M, medium with 769M, and large, large-v2, and large-v3 with 1550M parameters. The models were trained on 680k hours of labeled speech data, with either English-only or multilingual data, to perform two different tasks: speech recognition and speech translation. The English-only models were trained on the speech recognition task, whereas the multilingual models were trained on both speech recognition and speech translation to predict transcriptions in different languages [13].

Fig. 4
figure 4

The architecture of the Whisper model, implemented as an encoder-decoder Transformer [13]
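
As a hedged sketch of how such a model is used for inference with the Transformers library, the code below transcribes a 16 kHz Arabic recording with the small checkpoint; the audio file name is a placeholder, and the fine-tuned models in this study additionally target phoneme-level transcriptions.

```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# "sample.wav" is a placeholder; the audio is loaded as a 16 kHz mono waveform.
waveform, _ = librosa.load("sample.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")   # log-Mel features

# Force Arabic transcription (rather than translation) via decoder prompt tokens.
forced_ids = processor.get_decoder_prompt_ids(language="arabic", task="transcribe")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```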

4.3 Methodological framework

In this section, we provide an overview of our methodological framework for implementing an MDD system in the Arabic language.

Figures 5 and 6 show the methodological framework of this study, based on fine-tuning Wav2Vec2.0 [11] and HuBERT [12] in Fig. 5 and Whisper [13] in Fig. 6. The framework is divided into two phases: the first phase represents the implementation of the ASR model, and the second phase shows the mispronunciation detection and diagnosis processing. Starting from phase 1 in all models, the input is the raw data, which include audio and textual data. The audio data are pre-processed by setting the sampling rate to 16 kHz and converting the raw waveform of the speech signal to a float array. Additionally, the annotated data are mapped into phoneme-based transcriptions that include the pronunciation errors. In Fig. 5, the speech features are extracted through a CNN-based feature encoder and a Transformer-based context network that transform the raw audio waveforms into contextualized representations. The Wav2Vec2.0 and HuBERT models are then trained using the Connectionist Temporal Classification (CTC) loss function. For speech recognition decoding, a CTC tokenizer is applied to decode the predicted output into a phoneme-based transcription. For the Whisper model in Fig. 6, the feature extractor converts the raw audio inputs into a log-Mel spectrogram. The Transformer encoder then processes the spectrogram to form a sequence of encoder hidden states. Lastly, the decoder predicts text tokens autoregressively, conditioned on both the preceding tokens and the encoder hidden states.

Fig. 5
figure 5

Methodological Framework for the Arabic MDD based on Wav2Vec2.0 and HuBERT ASR models

Fig. 6
figure 6

Methodological Framework for the Arabic MDD based on Whisper ASR model

As illustrated in phase 2 of Figs. 5 and 6, and in Fig. 7, the mispronounced phonemes are detected from the transcription text by calculating the similarity between the detected transcription, the labeled transcription, and the canonical text using the Levenshtein distance algorithm [59].
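
A minimal sketch of this alignment step is shown below: a standard Levenshtein (edit-distance) alignment between the canonical phoneme sequence and a recognized or human-labeled sequence, whose backtrace yields the match, substitution, deletion, and insertion decisions. It is illustrative code, not the exact implementation of [59].

```python
def align_phonemes(canonical, observed):
    """Levenshtein alignment between a canonical phoneme sequence and an observed
    (recognized or human-labeled) sequence; returns the list of edit operations."""
    n, m = len(canonical), len(observed)
    dp = [[0] * (m + 1) for _ in range(n + 1)]            # dp[i][j]: distance of prefixes
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == observed[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + cost)        # match / substitution
    ops, i, j = [], n, m                                   # backtrace the operations
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if canonical[i - 1] == observed[j - 1] else 1):
            ops.append(("match" if canonical[i - 1] == observed[j - 1] else "substitution",
                        canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, observed[j - 1]))
            j -= 1
    return list(reversed(ops))

# Example: canonical /ħ a l iː m/ pronounced as /h a l iː m/ -> one substitution (ħ -> h).
print(align_phonemes(["ħ", "a", "l", "iː", "m"], ["h", "a", "l", "iː", "m"]))
```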

Fig. 7
figure 7

The evaluation hierarchical structure for mispronunciation detection and diagnosis [14]

5 Experimental studies

In this section, we present the experimental details of the study, including data preprocessing and model fine-tuning. Additionally, we present the results and discussion, highlighting the performance evaluation based on phoneme recognition and MDD.

5.1 Data preprocessing

During dataset development, the audio recording files were normalized by removing background noise and silence using the PRAAT software tool [60]. Figure 8 shows the spectrogram of an audio file before and after the normalization process. Figure 8 (a) shows the spectrogram with background noise, which appears as continuous, random signals in the low-frequency range, and Fig. 8 (b) shows the spectrogram after noise reduction, with the noise-related patterns eliminated from the signal. This noise reduction enhances the clarity of the audio signals, making the speech more distinguishable in the spectrogram. In addition, some audio files were stereo recordings containing two channels; to ensure compatibility with the speech recognition models, we converted them to mono.

Fig. 8
figure 8

Spectrogram of an audio file before and after noise removal. a represents a spectrogram with background noise appearing as random signals in the low-frequency range. b shows the spectrogram after removing the noise
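
A hedged sketch of the channel and sampling-rate normalization step using torchaudio is given below; the file names are placeholders, and the noise and silence removal itself was performed in PRAAT rather than in code.

```python
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("input.wav")            # shape: (channels, samples)
if waveform.shape[0] > 1:                              # stereo -> mono by averaging channels
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:                                        # resample to 16 kHz
    waveform = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("output.wav", waveform, 16000)
```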

The annotated textual data were processed by mapping the canonical text to the transcribed text, including the pronunciation errors. The mapping of pronunciation errors was based on the error type as follows: for deletion errors, the unpronounced phoneme was removed from the labeled text; for insertion errors, the inserted phoneme was added; and for substitution errors, the canonical phoneme was replaced with the pronounced phoneme.

5.2 Model fine-tuning

For the training process, all models were trained on Tesla T4 GPUs with 54 GB of memory. We utilized PyTorch [61] (version 2.0.1+cu118) to perform GPU-accelerated training, which is essential for deep learning tasks. In addition, we used the Transformers library [62], which provides convenient APIs and tools for downloading and fine-tuning state-of-the-art pre-trained models.

The fine-tuning process was applied to various Transformer-based models, Wav2Vec2.0 [11], Whisper [13], and HuBERT [12], with different parameters and training data. In our experiments, we fine-tuned the multilingual Wav2Vec2.0 model ‘Wav2Vec2-XLS-R-300M’ [58] and the HuBERT-large and HuBERT-xlarge models. When training the x-large model, we encountered issues related to memory and processing capacity because of its approximately one billion parameters. To address this, we applied parameter-efficient fine-tuning (PEFT) techniques. The PEFT approach reduces the number of trainable parameters of large-scale models, which significantly decreases computational and storage costs [63]. We trained the HuBERT-xlarge model using two PEFT-based methods: Int8 matrix multiplication for Transformers at scale (LLM.int8) [64] and low-rank adaptation of large language models (LoRA) [65]. LLM.int8 was applied to reduce the precision of the floating-point data types and to decrease the memory used to store the model weights. The LoRA approach freezes the weights of the pre-trained model and integrates trainable rank-decomposition matrices into every layer of the Transformer architecture, reducing the number of trainable parameters required for the different downstream tasks. After applying these methods, the number of trainable parameters in the HuBERT-xlarge model was reduced from 964 million to 7 million. To make the fine-tuning of the large and x-large models comparable, we also applied PEFT to the large model, reducing its parameters from 315 million to 3 million. Therefore, the HuBERT-large model was trained both in its original form and in the optimized (parameter-reduced) configuration, whereas the HuBERT-xlarge model was trained only after optimization.
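
A hedged sketch of this parameter-reduction step using the peft library and the 8-bit loading integration in Transformers is shown below; the checkpoint name, LoRA rank, and target module names are illustrative assumptions, not the exact configuration used in this study.

```python
from transformers import HubertForCTC
from peft import LoraConfig, get_peft_model

# Load the frozen backbone in 8-bit (LLM.int8) to reduce the memory needed for its weights;
# the checkpoint name is a placeholder for the HuBERT-xlarge model.
model = HubertForCTC.from_pretrained("facebook/hubert-xlarge-ll60k", load_in_8bit=True)

# LoRA: keep the pre-trained weights frozen and inject small trainable rank-decomposition
# matrices into the attention projections (assumed module names).
lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # reports only a few million trainable parameters
```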

Since the HuBERT models closely follow the architecture of the Wav2Vec2.0 models, we selected similar configurations and hyperparameter values for both. Following [66], we tuned several hyperparameters for training the Wav2Vec2.0 and HuBERT models, including the batch size, learning rate, number of epochs, and warm-up ratio. In our experiments, we selected a batch size of two, a learning rate of 1e-4, and 20 epochs. Smaller batch sizes are common when fine-tuning large models because they require less GPU memory and can prevent memory issues. We trained the models for 20 epochs to allow them to better learn the patterns in the data, and the lower learning rate ensures the stability of the training process.
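
The corresponding training configuration can be sketched as follows; the output directory and warm-up ratio are placeholders, and the resulting arguments would be passed to transformers.Trainer together with the model, the processed datasets, and a CTC padding data collator.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-l2ksu-mdd",   # hypothetical output path
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    num_train_epochs=20,
    warmup_ratio=0.1,                       # assumed warm-up value
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,                              # mixed-precision training on GPU
)
```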

For the Whisper experiments, we trained the tiny, base, small, medium, large-v2, and large-v3 models, which have varying numbers of parameters. As presented in Table 9, the Whisper models contain a large number of parameters, ranging from 39M to 1.55B. Such large models strain the GPU and other computational resources during training. To address this, we applied PEFT techniques to reduce the number of trainable parameters and train the models with less computational resources and time. In our study, we reduced the trainable parameters of the Whisper models to only 1% to 1.5% of all parameters. For example, the parameters of the large model were reduced from 1550M to 15M. Hence, computational efficiency was enhanced, training time was reduced, and the models became more scalable. The large-v2 and large-v3 Whisper models share the same architecture and training parameters, with two differences: the large-v3 model uses 128 Mel frequency bins as input instead of 80, and it was trained on 1 million hours of weakly labeled data and 4 million hours of pseudo-labeled audio, compared with 680k hours of labeled data for large-v2 [13]. The architectural and training parameters of each fine-tuned model used in this study are listed in Table 9, along with the time required to train each model. The training time increased with model size. The Wav2Vec2.0 and HuBERT models required similar training times (5–5.30 h) because they share the same architecture; the HuBERT-large model took 5.30 h of training when utilizing 315M parameters and 3.3 h when its parameters were reduced to 3.1M. Table 10 lists the fine-tuned hyperparameters for each model.

Table 9 The Architecture Parameters of Wav2Vec2.0 [11], HuBERT [12], And Whisper [13] Models Utilized In This Study
Table 10 Hyperparameter Settings for Fine-Tuning Wav2Vec2.0, HuBERT, And Whisper

5.3 Model evaluation

The models were evaluated on the test set to assess their performance on the phone recognition and MDD tasks.

The evaluation measure for phone recognition is the phoneme error rate (PER), which is commonly used to evaluate the performance of ASR systems. PER was calculated using the following equation [67]:

$$PER = \frac{I + D + S}{N}$$
(1)

where S, D, and I refer to the numbers of substitutions, deletions, and insertions in the recognized phone sequence, and N represents the number of annotated labels.
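
For example, by representing phoneme sequences as space-separated tokens, PER can be computed with a standard word-error-rate implementation such as the jiwer package; the phoneme strings below are purely illustrative.

```python
import jiwer

reference = "ħ a l iː m"           # canonical/annotated phonemes (N = 5)
hypothesis = "h a l iː m aː"       # recognized phonemes: 1 substitution + 1 insertion
per = jiwer.wer(reference, hypothesis)
print(f"PER = {per:.2%}")          # (I + D + S) / N = (1 + 0 + 1) / 5 = 40.00%
```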

The performance on the MDD task was evaluated using the evaluation structure proposed in [14] to measure mispronunciation detection and diagnosis. As shown in Fig. 7, correctly pronounced phones were divided into true acceptances (TA) and false rejections (FR), whereas mispronounced phones were divided into false acceptances (FA) and true rejections (TR). In addition, the TR cases were further categorized into correct diagnoses (CD) and diagnosis errors (DE). The metrics used to evaluate mispronunciation detection and diagnosis were the false rejection rate (FRR), false acceptance rate (FAR), and diagnosis error rate (DER).

TA indicates that both human-transcribed phones and predicted phones are similar to canonical pronunciations. TR denotes that the transcribed phone and the predicted phone are not identical to the canonical pronunciation. FA indicates that the predicted phone is the same as the canonical phone, but the transcribed phone is different. FR indicates that the transcribed phone and canonical pronunciation are similar, but the predicted phone is different. Correct diagnosis (CD) and diagnosis error (DE) rates were used to evaluate mispronunciation diagnoses.

The evaluation of the accuracies of mispronunciation detection and mispronunciation diagnosis was calculated as follows:

$$Detection\ Accuracy = \frac{TA + TR}{TA + FR + FA + TR}$$
(2)
$$Diagnosis\ Accuracy = \frac{CD}{CD + DE} = 1 - DER$$
(3)
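
The sketch below illustrates how these counts and accuracies can be derived from position-aligned (canonical, human-annotated, model-predicted) phone triples; it is a simplified illustration that ignores insertions and deletions, not the evaluation script used in this study.

```python
def mdd_metrics(triples):
    """triples: aligned (canonical, human_transcribed, model_predicted) phones."""
    TA = TR = FA = FR = CD = DE = 0
    for canon, human, pred in triples:
        human_correct = (human == canon)
        pred_correct = (pred == canon)
        if human_correct and pred_correct:
            TA += 1                       # true acceptance
        elif human_correct and not pred_correct:
            FR += 1                       # false rejection
        elif not human_correct and pred_correct:
            FA += 1                       # false acceptance
        else:
            TR += 1                       # true rejection: mispronunciation detected
            if pred == human:
                CD += 1                   # correct diagnosis of the actual error
            else:
                DE += 1                   # diagnosis error
    detection_acc = (TA + TR) / (TA + FR + FA + TR)
    diagnosis_acc = CD / (CD + DE) if (CD + DE) else 0.0
    return detection_acc, diagnosis_acc

# Example: canonical /ħ/ mispronounced as /h/ and predicted as /h/ (TR + CD).
print(mdd_metrics([("ħ", "h", "h"), ("a", "a", "a"), ("l", "l", "l")]))
```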

6 Results and discussion

In this section, we delve into the findings obtained through the evaluation of the various models and examine their performance on the phoneme recognition and mispronunciation detection and diagnosis tasks. Through these evaluation results, we aim to investigate the effectiveness of the Transformer-based models Wav2Vec2.0, Whisper, and HuBERT.

6.1 Performance in the phoneme recognition

The phoneme recognition evaluation was based on the phoneme error rate (PER). Table 11 shows the evaluation performance of each model. The Wav2Vec2.0, HuBERT-large, and Whisper models achieved promising results, with PERs ranging from 3.1% to 3.9%, and outperformed the baseline CNN-RNN-CTC model, which achieved a PER of 13.3%. These results demonstrate the effectiveness of Transformer-based models in accurately transcribing Arabic phonemes compared with the CNN-RNN-CTC baseline.

Table 11 Evaluation Results for Wav2Vec2.0, HuBERT, and Whisper Models in Terms of Phone Recognition

All Transformer-based models yielded low PERs except the HuBERT-large and HuBERT-xlarge models, which achieved suboptimal results after the optimization and reduction of their parameters. The parameter reduction in the HuBERT models compromised their performance in recognizing Arabic phonemes, yielding PERs of 79% and 40% for the large and x-large models, respectively. However, the HuBERT-large model without parameter reduction (300M) achieved results comparable to the large Whisper and Wav2Vec2.0 models, with a 3.6% PER compared to 3.1% and 3.2% for the Whisper-large and Wav2Vec2.0-XLS-R models, respectively. When comparing HuBERT with Whisper after optimization and parameter reduction, we found that the Whisper models achieved much better results despite having fewer trainable parameters than the HuBERT models. The base, tiny, and small Whisper models outperformed the HuBERT models with smaller parameter sizes. One of the main reasons is that Whisper was trained on multilingual data, including Arabic, whereas HuBERT was trained only on English data.

The overall results in Table 11 indicate that the Whisper models (small, medium, and large) achieved the best phone recognition results because they were trained on multi-lingual data, which enhanced the models by recognizing different language and pronunciation patterns. In addition, Whisper utilizes a unique generative inference procedure, sequentially inferring on 30-s audio windows. The encoder embeds the audio window, which is then mapped to a text sequence predicted by the decoder using the encoder outputs as a context vector. This process is repeated by advancing the audio window based on the last predicted timestamp token [13]. This feature allows Whisper to maintain a dynamic understanding of the evolving context within each window to generate more contextually informed and accurate predictions. The results of the Whisper models gradually improved with an increase in model size, and we observed that the medium and large models achieved the best results with similar PER of 3.12% and 3.18% for the medium and large-v2 models, respectively. Furthermore, the training parameter sizes of Whisper models are smaller than those of Wav2Vec 2.0 and the original HuBERT-large models. However, when comparing the model size with the achieved results, the Whisper models excel in their lightweight size and high performance. This indicates that the Whisper models achieve a balance between model performance and efficiency in terms of parameter size. This efficiency is crucial for practical applications where resource usage is a concern. Additionally, the consistent outperformance of the Whisper models across different sizes indicates the robustness of this approach.

Figure 9 shows the confusion matrix of recognized and misrecognized phonemes for the Whisper-medium model on the test set, which contains 21,511 phonemes belonging to 813 utterances. The confusion matrix shows the effectiveness of the model in phoneme recognition, with a much higher number of correctly recognized phonemes than misrecognized ones. The most frequent misrecognitions occurred between the long vowels (/ا/ aː), (/ي/ iː), and (/و/ uː) and other phonemes, since the short vowels (/َ/ a), (/ِ/ i), and (/ُ/ u) are not included in the annotated data, and the model may misrecognize a short vowel as a long vowel.

Fig. 9
figure 9

Confusion matrix of the phoneme recognition task by the Whisper model

To assess performance on the training and validation sets, Fig. 10 illustrates the training and validation loss during the training of each model: Wav2Vec-XLS-R, HuBERT-large, and Whisper-medium. For all models, the training and validation losses decreased and stabilized, indicating that the models generalize well to unseen data.

Fig. 10
figure 10

Performance of the training and validation loss during the training process for the Wav2Vec-XLS-R, HuBERT-large, and Whisper-large models

6.2 MDD performance evaluation

The evaluation of MDD performance was based on the hierarchical evaluation structure proposed in [14]. Table 12 shows the MDD performance results obtained in this study. The large models of Wav2Vec2.0, HuBERT, and Whisper demonstrated comparable precision scores of 92.8% for Wav2Vec2.0, 94.9% for HuBERT-large, and 96.5% for Whisper-large-v2; the Whisper-large-v3 model also achieved a high precision of 96.2%. These scores illustrate the models' ability to correctly identify true-positive cases. The HuBERT-large and Whisper models achieved comparable recall scores, and the highest recall of 91.0%, obtained by the Whisper-small model, signifies its capacity to identify the vast majority of actual mispronunciations. This higher recall led to a higher F-measure, reflecting overall accuracy: the Whisper-small model outperformed the other models with an F-measure of 93.5%, despite having fewer training parameters than the Wav2Vec2.0 and HuBERT models. In contrast, the baseline CNN-RNN-CTC model achieved a lower F-measure of 80.0%.

Table 12 Evaluation Results for Wav2Vec2-XLS-R, HuBERT-Large, and Whisper Models, and the Baseline CNN-RNN-CTC in Terms of the MDD Task

The false rejection rates (FRR) of all our models were similar, ranging from 6.2% for Whisper-large-v3 to 12.2% for Whisper-medium, indicating that the Whisper-large models work well in detecting correct pronunciations. The best false acceptance rate (FAR) of 8.9% was achieved by Whisper-small, while HuBERT-large and Wav2Vec2.0 achieved similar FARs of 13.0% and 12.6%, respectively. The FAR reflects the model's proficiency in modeling mispronunciations. In terms of detection and diagnosis accuracy, the best detection accuracy of 91.3% was achieved by Whisper-small, outperforming Wav2Vec2.0, HuBERT-large, and the baseline CNN-RNN-CTC by 3.8%, 2.7%, and 17.5%, respectively. Similarly, the Whisper-small model achieved improvements of 4.4%, 1.9%, and 24.6% in diagnosis accuracy over Wav2Vec2.0, HuBERT-large, and CNN-RNN-CTC, respectively.

The overall results verify the efficiency of the Whisper models in accurately classifying mispronunciations and recognizing Arabic phonemes. Although the performances of the Whisper, Wav2Vec2.0, and HuBERT models were comparable in phoneme recognition, in terms of MDD performance the Whisper models outperformed Wav2Vec2.0 and HuBERT, with the HuBERT results closest to those of Whisper. We assume that the large and diverse datasets used to train Whisper enable this model to better identify various types of pronunciations: Whisper was trained on 680k hours of labeled data collected from the Web, covering a wide range of accents, dialects, speaking styles, and background sounds, whereas the Wav2Vec2-XLS-R model was trained on 436k hours of unlabeled data from different benchmark datasets.

6.3 Pronunciation errors analysis

This section analyzes the errors detected by the model, which provides valuable insight into the pronunciation errors produced by language learners and speakers. The Whisper-small model achieved the best MDD evaluation results among the evaluated models in terms of detection and diagnosis accuracy. Based on these results, we extracted the pronunciation errors detected by the Whisper-small model and analyzed them from the perspective of Arabic linguistics. This analysis demonstrates the accuracy and effectiveness of the system, aligning with linguistic descriptions and providing valuable insights into its performance in detecting pronunciation errors in the Arabic language.

The testing set consisted of 813 utterances from non-native speakers, comprising 21,511 phonemes; the detected phoneme errors occurred in 510 of these utterances, comprising 13,660 phonemes. Table 13 presents the pronunciation errors of all types produced by both male and female speakers, with a total of 1750 pronunciation errors across all speakers. The number of errors was similar between males and females, with slightly more errors produced by male speakers.

Table 13 Statistical Summary of Pronunciation Errors Produced by Non-Native Speakers and Detected by the Whisper-Small Model

Figure 11 illustrates that substitution errors had the highest occurrence among non-native speakers, followed by deletion errors and then insertion errors. Substitution errors occur when non-native speakers replace a target phoneme with a similar one that they are more familiar with. In particular, speakers tended to transfer phonetic patterns from their native language to the target language, indicating the effect of first-language transfer and the importance of addressing it to improve speaking proficiency.
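The three error categories discussed here can be recovered by aligning each transcribed phoneme sequence against its canonical counterpart. The sketch below is a simplified Levenshtein-style alignment written for illustration; it is not the alignment procedure used in this study, and the example phoneme sequences are hypothetical.

```python
def align_errors(canonical, observed):
    """Count insertion, deletion, and substitution errors between two phoneme
    sequences using a standard Levenshtein dynamic program with backtrace."""
    n, m = len(canonical), len(observed)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    errors = {"substitution": [], "deletion": [], "insertion": []}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1]):
            if canonical[i - 1] != observed[j - 1]:
                errors["substitution"].append((canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors["deletion"].append(canonical[i - 1])  # canonical phone not produced
            i -= 1
        else:
            errors["insertion"].append(observed[j - 1])  # extra phone produced
            j -= 1
    return errors

# Hypothetical example: /ðˤ/ substituted with /z/ and one vowel deleted.
print(align_errors(["ðˤ", "a", "l", "a", "m"], ["z", "a", "l", "m"]))
```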

Fig. 11 Error count based on the insertion, deletion, and substitution errors

Figure 12 shows the confusion matrix for the most frequent substitution errors obtained from non-native speakers. Each cell in the matrix represents the frequency of substitution errors for a specific phoneme pair, where the x-axis represents the canonical phoneme and the y-axis represents the substituted phoneme. The results in the confusion matrix are consistent with the annotated errors in the dataset, as well as with the errors most commonly produced by non-Arabic speakers. These errors include frequent substitutions of the pharyngeal sounds (/ʕ/ ع and /ħ/ ح), with (/ʕ/ ع) replaced by (/ʔ/ أ) and (/ħ/ ح) replaced by (/h/ ه). From an Arabic linguistic perspective, these pharyngeal sounds present significant pronunciation difficulties for non-native Arabic learners. These difficulties can be attributed to articulation in the pharyngeal region, which is inactive in many of the world's languages [68]. In addition, the emphatic sound (/dˤ/ ض) was frequently substituted with its plain counterpart (/d/ د). This substitution often occurs among non-Arabic speakers because the sound (/dˤ/ ض) exists only in Arabic and requires specific tongue placement, demanding more effort than its plain counterpart [69, 70]. Furthermore, the interdental sounds (/ðˤ/ ظ) and (/ð/ ذ) were frequently substituted with the sound (/z/ ز). The non-native speakers in our dataset were from West and Central Africa, where the most common first languages are Wolof and Yoruba. According to [70, 71], the sounds (/ðˤ/ ظ) and (/ð/ ذ) are not present in Wolof or Yoruba, making their pronunciation more challenging. Finally, the sound (/ɣ/ غ) was commonly substituted with the sound (/q/ ق), both of which are produced at the uvular place of articulation. According to [71], the consonant (/ɣ/ غ) in Arabic has an equivalent in Yoruba, a hard /g/ as in "morning". This sound does not exist in MSA or Classical Arabic but is present in some dialects; in our dataset, we mapped it to (/q/ ق), which explains the frequent substitution between (/ɣ/ غ) and (/q/ ق).
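Tallying the substituted pairs produced by such an alignment yields the kind of matrix shown in Fig. 12. A minimal sketch, assuming a list of hypothetical (canonical, substituted) phoneme pairs rather than the actual model output:

```python
from collections import Counter

def substitution_matrix(pairs):
    """Build a nested dict of counts: rows are substituted phonemes (y-axis),
    columns are canonical phonemes (x-axis), mirroring the layout of Fig. 12."""
    counts = Counter(pairs)
    canonical = sorted({c for c, _ in counts})
    substituted = sorted({s for _, s in counts})
    return {s: {c: counts.get((c, s), 0) for c in canonical} for s in substituted}

# Hypothetical pairs reflecting the error patterns discussed above.
pairs = [("ʕ", "ʔ"), ("ʕ", "ʔ"), ("ħ", "h"), ("dˤ", "d"), ("ðˤ", "z"), ("ð", "z"), ("ɣ", "q")]
for substituted, row in substitution_matrix(pairs).items():
    print(substituted, row)
```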

Fig. 12 Confusion matrix for the most frequent substitution errors detected by the Whisper-small model

Through this analysis, the system demonstrated its effectiveness by identifying pronunciation errors that linguists agree are common among non-Arabic speakers, especially speakers of Yoruba and Wolof.

The deletion and insertion errors identified by the Whisper-small model are shown in Figs. 13 and 14, respectively. The most frequent deletion error occurred on the nunation sound, which is represented by a diacritical mark (ً) and may not be easily distinguished by non-Arabic speakers. According to previous linguistic studies, nunation is one of the sounds that non-native Arabic speakers find difficult both to perceive and to pronounce [72]. Figure 14 illustrates the insertion errors obtained from non-native speakers, which were the least frequent error type identified by the model.

Fig. 13 Deletion errors produced by non-native Arabic speakers and detected by the Whisper-small model

Fig. 14 Insertion errors produced by non-native Arabic speakers and detected by the Whisper-small model

6.4 Discussion

To demonstrate the efficiency of our models in phoneme recognition and MDD performance, we conducted a comparative analysis using datasets from prior literature. We selected the L2-ARCTIC [73] and Speechocean762 [74] datasets, which were designed for non-native English speakers and are widely employed in MDD research. For the Arabic language, we could not find any public dataset of non-native speakers containing annotations of pronunciation errors. Table 14 presents a comparative analysis across the different datasets using three models, Wav2Vec2-XLS-R, HuBERT-large, and Whisper-small, under the same experimental settings as our study. On the L2-ARCTIC dataset, we achieved good results in terms of PER, F1 score, and detection accuracy, with the Whisper-small model showing the best F1 and detection accuracy across all datasets; however, the diagnosis accuracy was suboptimal. This dataset comprises non-native speakers with diverse native languages such as Arabic, Mandarin, Hindi, Korean, Spanish, and Vietnamese. These variations in native languages can lead to different speaking styles and uncommon pronunciation errors, which may challenge the model's ability to accurately diagnose pronunciation errors. The Speechocean762 dataset yielded promising results in terms of PER, but the F-measure was suboptimal. This could be due to the low occurrence of pronunciation errors in the dataset, leading to a strong imbalance between phonemes for which the transcribed and canonical forms agree and those for which they disagree (i.e., pronunciation errors). Our L2-KSU dataset achieved the best evaluation results with the Whisper-small model in terms of PER and diagnosis accuracy.

Table 14 Evaluation Results of Our Models Using Different Datasets

To contextualize our results within existing state-of-the-art work in the field, we compared our findings against both Arabic and non-Arabic MDD studies. Algabri et al. [53] obtained a 70.42% F-measure and 84.01% diagnosis accuracy with a model based on CNN-RNN-CTC. In addition, El Kheir et al. [56] introduced an L1-aware multilingual MDD framework based on a Transformer architecture that achieved a 6.13% PER and a 78.42% F-measure. For non-Arabic data, Wu et al. [8] explored both a Transformer encoder-decoder architecture and the Wav2Vec2.0 model for non-native English speakers; the Wav2Vec2.0 model achieved an F-measure of 80.98% and a detection accuracy of 90.05%. Additionally, Peng et al. [44] fine-tuned the Wav2Vec2-XLS-R model on the L2-ARCTIC and TIMIT datasets and obtained an F-measure of 60.44%, a PER of 14.68%, and a DER of 29.28%. Furthermore, Shen et al. [75] utilized the WavLM-large pre-trained model for Mandarin, achieving a PER of 4.16% and an F-measure of 39.5%.

The research question of our study concerns the ability of Transformer-based techniques to detect pronunciation errors among non-native Arabic speakers. Our findings, compared against the baseline CNN-RNN-CTC model, demonstrate the effectiveness of Transformer-based techniques in addressing mispronunciation challenges in the context of non-native Arabic speakers. In particular, the Whisper models stand out with competitive performance in terms of both phoneme recognition and MDD evaluation.

7 Human perceptual test

A human perceptual test in speech processing studies refers to the assessment and evaluation of speech data by human participants [76]. This perceptual experiment was conducted to assess the ability of humans to detect mispronunciation errors in the L2-KSU dataset and to compare their evaluations with automatic verification. In this section, we describe the design of the experiment, including the selection of reviewers, the design of the evaluation form, and the scoring criteria used to assess the human verification results. In addition, we present and discuss the evaluation results of human verification in comparison with automatic verification, along with the levels of inter-rater agreement among the participants.

7.1 Reviewer selection

The selection criteria were based on gender and educational background. Two males and two females were selected to participate in the human perceptual test. All participants were native Arabic speakers with different educational backgrounds. The first male participant held a bachelor's degree in Arabic, and the second held a bachelor's degree in Shari'ah (i.e., Islamic law) and Islamic studies and is a Hafiz of the Holy Quran (i.e., has completely memorized the Quran). The third participant was a female master's student in the sciences of the Holy Quran and also a Hafiz of the Holy Quran. The fourth participant was a female with a bachelor's degree in English language and literature. This diversity of education and academic focus was chosen to determine whether a reviewer's educational background had any effect on the accuracy of Arabic mispronunciation perception. All participants were asked to sign a participation consent form before evaluating the provided data.

7.2 Evaluation form design

We selected 200 samples from the non-native speakers of the L2-KSU dataset, with an equal distribution of males (100 samples) and females (100 samples). Each sample was presented in a separate document and categorized according to gender. The samples were provided as WAV files, each accompanied by a canonical transcription. The files were given random names, and their presentation order was shuffled to minimize bias and order effects. The participants were tasked with identifying three key aspects of each sample: the error type, the word containing the error, and the canonical phoneme together with the mispronounced phoneme. Participants were required to determine whether each pronunciation error was a deletion, substitution, or insertion, to identify the specific words in which mispronunciation occurred, and then to pinpoint the canonical phoneme and the corresponding mispronounced phoneme for each identified error. Figure 15 shows an example of a sample from the evaluation form before and after a participant completed it. Instructions were given to the participants outlining the structure of the form, the process of accessing the WAV files, and the definition of each error type with examples to enhance understanding. Additionally, participants were allowed to replay the current audio file if necessary but were prevented from replaying past utterances once they moved to a subsequent file, to avoid comparison across multiple utterances of the same speaker [76].
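The anonymization and shuffling of the evaluation samples can be reproduced with a short script. The sketch below is illustrative only; the directory names and the samples.csv mapping file are hypothetical and not artifacts of this study.

```python
import csv
import random
import shutil
import uuid
from pathlib import Path

def anonymize_and_shuffle(src_dir: str, out_dir: str, seed: int = 13) -> None:
    """Copy WAV samples under random names and record a shuffled presentation order."""
    random.seed(seed)
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wavs = sorted(src.glob("*.wav"))
    random.shuffle(wavs)  # shuffled presentation order to minimize bias and order effects
    with open(out / "samples.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["presentation_order", "anonymized_name", "original_file"])
        for order, wav in enumerate(wavs, start=1):
            anon = f"{uuid.uuid4().hex[:8]}.wav"  # random, non-identifying file name
            shutil.copy2(wav, out / anon)
            writer.writerow([order, anon, wav.name])  # mapping kept private to the experimenters

# Example with hypothetical paths:
# anonymize_and_shuffle("l2ksu_nonnative_subset", "perceptual_test_files")
```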

Fig. 15 Sample of the form before and after a participant filled it out with the error type, the word in which the mispronunciation occurred, and the true phonemes with mispronounced phonemes

7.3 Scoring criteria

After collecting the evaluation forms from the reviewers, we extracted and analyzed the evaluation results to obtain the accuracy of human verification and compared it with automatic verification. As shown in Fig. 16, we compiled the following key components for each audio sample: the file name, the pronunciation errors identified by each reviewer, and the frequency of each pronunciation error. Subsequently, we computed the average frequency of each pronunciation error across all reviewers to quantify the agreement among reviewers on specific errors. An error was adopted only when it was identified by more than two reviewers. Based on these aggregated scores, we built a phoneme-based transcription for comparison with the target annotation and canonical text.
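The aggregation step can be viewed as a majority vote over the reviewers' annotations. The snippet below is an illustrative reading of this procedure, not the exact script used in the study; it assumes each reviewer's annotation of an utterance is a set of hypothetical (word, canonical phoneme, observed phoneme, error type) tuples.

```python
from collections import Counter

def aggregate_reviews(reviews, min_reviewers=3):
    """Keep an error only if more than two reviewers (here, at least three) reported it.

    `reviews` holds one entry per reviewer; each entry is a set of
    (word, canonical_phoneme, observed_phoneme, error_type) tuples for one sample.
    """
    counts = Counter(err for reviewer in reviews for err in reviewer)
    return {err: n for err, n in counts.items() if n >= min_reviewers}

# Hypothetical annotations from four reviewers of one utterance.
r1 = {("الصالحات", "sˤ", "s", "substitution")}
r2 = {("الصالحات", "sˤ", "s", "substitution"), ("ءامنوا", "ʔ", None, "deletion")}
r3 = {("الصالحات", "sˤ", "s", "substitution")}
r4 = set()
print(aggregate_reviews([r1, r2, r3, r4]))  # only the substitution survives the vote
```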

Fig. 16 Human evaluation result for an audio sample of the sentence "وَعَدَ ٱللَّهُ ٱلَّذِينَ ءَامَنُواْ وَعَمِلُواْ ٱلصَّالِحَاتِ" in the L2-KSU dataset

7.4 Human evaluation results

To compare the results of the human and automatic evaluations, we followed the same evaluation structure used for the model evaluation in Sect. 3-D [14]. We compared the human-transcribed phones with the target phones and canonical phones to measure the mispronunciation detection and diagnosis performance of the human perceptual test. The evaluation is based on the data subset assessed by the reviewers, which consists of 200 utterances from non-native speakers, evenly distributed between male and female speakers. Table 15 shows the error counts assigned by each reviewer.

Table 15 Statistical Summary of Error Count Detected by Individual Reviewers for Both Male and Female Speakers

Table 16 presents the diagnosis accuracy and detection accuracy of the human perceptual test in comparison with the evaluation of the Whisper-small model. We selected the Whisper-small model because it achieved the best evaluation results on the L2-KSU dataset for the MDD task. The results show that automatic verification outperforms human verification in both mispronunciation detection and diagnosis.

Table 16 Evaluation Results for Arabic MDD Based on Human Verification and Automated Verification

Detecting pronunciation errors is a sensitive task that varies from one individual to another and depends on the listener's concentration and linguistic background. In addition, some Arabic sounds share characteristics with other sounds. For example, the sounds (/s/ س) and (/sˤ/ ص) share the alveo-dental place of articulation as well as features such as softness (frication), voicelessness, and whistling (sibilance). These two sounds are pronounced similarly in some instances, leading some Arabic linguists to consider their interchange acceptable in specific contexts. In our experiment, we observed diversity in the participants' perception of this interchange, with some considering it a pronunciation error while others overlooked it. Similarly, the sounds (/ðˤ/ ظ) and (/ð/ ذ) share the same place of articulation and features such as voicing and softness, and some listeners found it challenging to distinguish between them in certain cases. Finally, some speakers produced sentences with multiple pronunciation errors, which made it difficult for listeners to identify and pinpoint all of them. From these results, it is evident that automated verification outperforms human verification in detecting and diagnosing Arabic pronunciation errors. Enhancing these techniques and increasing the amount of training data are therefore crucial for improving their quality and enabling Arabic language learners to benefit from them.

7.5 Inter-rater agreement

In this study, the evaluation of inter-rater agreement aimed to assess the consistency among the four reviewers tasked with detecting pronunciation errors in a subset of the L2-KSU dataset comprising 200 utterances from non-native Arabic speakers. The reviewers were selected based on their gender and educational background to ensure a comprehensive assessment. To quantify inter-rater agreement, we employed Cohen's kappa [77], a common statistic for inter-rater or intra-rater reliability testing. The kappa value ranges from −1 to +1, where values less than 0 indicate no agreement, 0.01–0.20 none to slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement [78]. Cohen's kappa quantifies the level of agreement between two raters, each classifying N items into C mutually exclusive categories. The formula for calculating K is as follows [77]:

$$K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
(4)

where Pr(a) refers to the observed agreement among raters, and Pr(e) refers to the hypothetical probability of chance agreement based on the marginal probabilities of the categories assigned by each rater. Furthermore, we generalized this measure to assess the overall agreement among all raters by using Fleiss' kappa, a statistic that extends Cohen's kappa to measure the reliability of agreement among more than two observers [79].
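A minimal implementation of Eq. (4), together with its Fleiss generalization, is sketched below for illustration; the per-phoneme judgements in the example are hypothetical. Established libraries such as scikit-learn (cohen_kappa_score) and statsmodels (fleiss_kappa) provide equivalent, more thoroughly tested routines.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same N items (Eq. 4)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n         # observed agreement Pr(a)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)  # chance agreement Pr(e)
    return (p_a - p_e) / (1 - p_e)

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i][c] = number of raters assigning item i to category c."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    p_j = [sum(row[c] for row in ratings) / (n_items * n_raters) for c in range(n_categories)]
    p_i = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1)) for row in ratings]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical per-phoneme judgements (1 = mispronounced, 0 = correct) from two reviewers:
print(cohens_kappa([1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]))
# Hypothetical 4-rater counts over three items and two categories (correct, mispronounced):
print(fleiss_kappa([[3, 1], [0, 4], [2, 2]]))
```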

The inter-rater agreements of the four reviewers are shown in Table 17. The first two pairings indicated a moderate level of agreement between the male and female reviewers. Similarly, the agreement between the reviewers with a Quranic education background was 44.6%, corresponding to a moderate level of agreement. The reviewers with a linguistic background obtained an agreement value of 60.0%, the highest among the pairings in this study, which still corresponds to a moderate level of agreement; this relatively high value can be attributed to their shared linguistic and phonetic backgrounds. Additionally, the overall agreement among all four reviewers, measured with Fleiss' kappa, was 50.9%, suggesting a moderate degree of concordance in their assessments.

Table 17 The Inter-Rater Agreement for Pair Reviewers Using Cohen’s Kappa and the Overall Agreement Among All Reviewers Using Fleiss’ Kappa

8 Conclusion

In this study, we investigated automatic speech recognition for the MDD task in the Arabic language, focusing on both native and non-native speakers. We also reviewed previous research on ASR and advanced methodologies, as well as MDD studies for Arabic and other languages. Owing to the limited resources of Arabic speech datasets containing non-native speakers, we constructed a speech dataset that specifically captures the differences in speech patterns between native and non-native Arabic speakers. The non-native participants came from countries in Central and West Africa and spoke closely related languages and dialects. Furthermore, we annotated the speech data at the word and phoneme levels, including pronunciation errors. Finally, we used the constructed L2-KSU dataset to demonstrate the effectiveness of state-of-the-art Transformer-based speech recognition models, namely Wav2Vec2.0 [11], HuBERT [12], and Whisper [13], against the recurrent CNN-RNN-CTC baseline in the context of the MDD task. Each model was trained to detect pronunciation errors in Arabic phonemes, and the detected errors were analyzed by categorizing them into insertion, deletion, and substitution errors. Our findings showed promising results in terms of both phoneme recognition performance and MDD. The phoneme error rate (PER) reached 3.1%, 3.2%, and 3.6% for Whisper, Wav2Vec2.0, and HuBERT, respectively, significantly outperforming the baseline rate of 13.2% and highlighting the accuracy and efficiency of Transformer-based models in recognizing phonemes. For the MDD task, the Whisper model achieved the best diagnosis and detection accuracy of 80.0% and 91.3%, compared with 55.4% and 73.8% for CNN-RNN-CTC. These findings indicate the efficiency of Transformer-based techniques in enhancing the detection of pronunciation errors in non-native Arabic speech. Since the nationalities of the non-native speakers in our dataset are closely related, we cannot generalize these findings to all non-native Arabic speakers; in the future, we aim to expand the dataset to include participants from other nationalities with different pronunciation patterns. In addition, we conducted a human perceptual test to assess the ability of humans to detect mispronunciation errors in the L2-KSU dataset and compared their evaluation with automatic verification. The evaluation results indicated that automatic verification outperformed human verification, with a detection accuracy of 97.0% and a diagnosis accuracy of 80.4%, whereas human perception achieved 90.5% detection accuracy and 66.0% diagnosis accuracy.

In future work, we aim to increase the volume of the L2-KSU dataset and to target other geographical regions in order to study differences in pronunciation patterns. Additionally, we aim to provide a real-time implementation of the proposed model to assess the feasibility of deploying Transformer-based models in real-time applications.