1 Introduction

In a world increasingly connected through globalization and digital communication, people seek to learn new languages to enhance cross-cultural communication and international collaboration. Arabic is one of the most widespread languages in the world, with more than 422 million speakers [1]. In addition to communication, learning Arabic is essential for every Muslim, since it is the language of the Holy Quran and Sunnah, the primary sources of Islam. Therefore, Arabic can be considered a second language (L2) for every non-Arabic-speaking Muslim. This has led to an increasing worldwide need for learning Arabic as a second language and a growing demand for teachers of non-native Arabic learners.

Computer-assisted pronunciation training (CAPT) systems have been developed to address the high demand for L2 learning and the shortage of learning resources. As an integral part of CAPT systems, mispronunciation detection and diagnosis (MDD) was introduced to help non-native language learners improve their pronunciation by detecting pronunciation errors and providing diagnostic feedback. Substantial research effort has been devoted to the development of MDD across languages such as English, Dutch, and Mandarin Chinese. Although Arabic is one of the most widely spoken languages in the world, existing tools and resources for mispronunciation detection and diagnosis in Arabic are limited and focus on conventional machine learning techniques rather than state-of-the-art deep learning methods.

Earlier MDD systems were implemented using scoring-based methods with confidence measures originally proposed for automatic speech recognition (ASR). One of the most popular scoring-based methods is goodness of pronunciation (GOP), which is computed from the log-posterior probability obtained through forced alignment [2]. GOP achieved strong results in mispronunciation detection but failed to provide detailed feedback to learners [3]. To provide effective feedback, the extended recognition network (ERN) was introduced for MDD; it incorporates common pronunciation error patterns into the lexicon along with the canonical transcription of each prompted word [4]. ERN-based models use handcrafted and data-driven rules to detect errors and error types. However, it is practically difficult to encode sufficient phonological rules in the lexicon for the various L1-L2 language pairs [5]. In addition, handling too many phonological rules may degrade ASR accuracy, leading to low MDD performance.
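
For reference, a commonly cited formulation of the GOP score (following the original definition; the exact notation in [2] may differ slightly) for a phone p, where O^(p) denotes the acoustic observations aligned to p over NF(p) frames, is

$$GOP(p) = \frac{1}{NF(p)}\left|\log \frac{p\left(O^{(p)} \mid p\right)}{\max_{q \in Q} p\left(O^{(p)} \mid q\right)}\right|$$

where Q is the phone set; a phone is typically flagged as mispronounced when its GOP score falls below a phone-dependent threshold.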

Recently, end-to-end architectures have achieved promising performance in ASR systems as well as in MDD. End-to-end systems leverage the advances in deep neural networks (DNNs) by learning complex and abstract representations directly from speech waveforms. Various MDD models have been proposed based on end-to-end speech recognition, including CNN-RNN-CTC [6], which takes acoustic features as input without requiring phonemic information; hence, no forced alignment, whose errors can degrade MDD performance, is needed. The CTC-Attention model [7] is another CTC-based method that extends the original L2 phone set to include both categorical and non-categorical errors. A further improvement is the sentence-dependent end-to-end model for MDD (SED-MDD) [8], which integrates both linguistic and acoustic features and implicitly learns phonological rules directly from the phonological annotations and transcriptions in the training data.

The Transformer [9] is a modern deep learning architecture that relies mainly on the self-attention mechanism and has strong representation capabilities. The high performance achieved with Transformers in various natural language processing (NLP) tasks and ASR systems has attracted researchers to adopt this approach for MDD. Wu et al. [10] introduced two Transformer-based models for MDD: the first is based on an encoder-decoder architecture, and the second is based on Wav2Vec 2.0. Both models achieved strong results in phone recognition and MDD, with diagnostic accuracies of 91.39% and 90.05%, respectively, and phone accuracies of 91.31% and 94.03% for the first and second models, respectively.

The promising results achieved with Transformer-based techniques for MDD motivated us to investigate these techniques to detect pronunciation errors at the phoneme level and provide feedback to non-native Arabic speakers in continuous speech. In this study, several Transformer-based models are fine-tuned for the MDD task, including Wav2Vec2.0 [11], HuBERT [12], and Whisper [13]. All of these models integrate acoustic and linguistic features during training to learn phonological rules directly from the phonological transcriptions.

The main objectives of this study are as follows: (1) to collect speech data and construct a dataset for native and non-native Arabic speakers, (2) to implement a framework to detect mispronunciations in Arabic speech and provide pinpoint feedback to learners using Transformer-based models, and (3) to contribute to the improvement of e-learning by developing techniques that help second language (L2) learners to enhance their language skills without the need for qualified teachers.

This study explored innovative methods for enhancing pronunciation accuracy among non-native Arabic speakers. In light of this objective, the research question guiding this investigation is:

How do Transformer-based techniques influence the ability of Arabic MDD systems to detect pronunciation errors among non-native Arabic speakers?

The remainder of this paper is organized as follows. Section 2 presents background information on the concepts related to this study. Section 3 reviews the existing literature in our research area and discusses its limitations to identify the need for the current study. Section 4 provides an overview of the methodological framework and performance evaluation criteria. Section 5 describes the experimental studies, and Sect. 6 discusses the obtained results. Section 7 presents a human perceptual test to assess the ability of humans to detect mispronunciation errors in the L2-KSU dataset and compares their evaluation with the automatic verification. Finally, Sect. 8 presents the conclusion along with the limitations and future work.

2 Background

2.1 Mispronunciation detection and diagnosis (MDD)

Mispronunciation Detection and Diagnosis (MDD) is a Computer-Aided Pronunciation Training (CAPT) technology that enhances self-directed language learning with individualized feedback and accessibility [14]. The main goal of MDD systems is to facilitate second language (L2) learning through two major tasks: mispronunciation detection, which detects pronunciation errors in the learner’s articulation, and mispronunciation diagnosis, which identifies the type of each error and generates corrective feedback. MDD systems have been implemented using ASR technologies that transcribe the speech input and assess pronunciation accuracy [14]. Additionally, MDD can be based on segmental and suprasegmental features [15]. Segmental features involve phonemes and words, while suprasegmental features include richer information structures, such as pitch accent [16, 17], lexical stress [18], and intonation [19]. The development of MDD has been investigated through different methods of detecting the phone-level mispronunciation patterns of L2 learners using ASR-related techniques. The earliest method was pronunciation scoring (e.g., GOP), which computes a likelihood-based score comparing the canonical phone sequence of a given text with the phones recognized by the acoustic model to detect pronunciation errors [20]. Recent studies on MDD have focused on end-to-end ASR models with deep learning techniques owing to their superior performance compared with scoring-based methods [6, 20]. The basic structure of end-to-end MDD combines two models into a unified one: an acoustic model for recognizing spoken phonemes and a label classifier for detecting mispronunciations.

2.2 Transformer

The Transformer [9] is a prominent deep learning architecture that has been widely applied in different areas, such as NLP and ASR. The success of attention in sequence-to-sequence tasks led to the proposal of the Transformer, which applies attention directly to the input without the need for recurrent connections in the network. Attention is a deep learning technique that was proposed to address issues in convolutional and recurrent networks, such as the long-term dependencies that are difficult for recurrent networks to handle using gradient descent [21]. The attention mechanism addresses this issue by focusing on the relevant parts of the input sequence for each element of the output sequence, thereby improving prediction accuracy. It considers all time steps that contribute to the output, although this may increase the amount of computation for long sequences. The Transformer minimizes the sequential dependencies of the network by applying “multi-head” attention directly to the input embeddings [22]. Three types of attention are used in the Transformer in terms of queries and key-value pairs: self-attention in the encoder, masked self-attention in the decoder, and cross-attention, in which the queries are projected from the output of the decoder. The original Transformer architecture (the vanilla Transformer) is a sequence-to-sequence model consisting of an encoder and a decoder, each of which is a stack of L identical blocks [9]. As shown in Fig. 1, each encoder block consists of a multi-head self-attention module and a position-wise feed-forward network (FFN). The decoder blocks insert cross-attention modules between the multi-head self-attention modules and the position-wise FFNs [23]. Furthermore, the Transformer architecture can be used in three different configurations: encoder-decoder, which is used in sequence-to-sequence models; encoder only, which is usually applied to classification and sequence labeling problems; and decoder only, which can be applied to sequence generation, such as language modeling [23].

Fig. 1
figure 1

The Transformer Encoder-Decoder Architecture [9]
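
To make the attention mechanism described above concrete, the following minimal PyTorch sketch implements scaled dot-product attention and a multi-head attention module; the dimensions and the usage example are illustrative and do not correspond to any specific model in this study.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, heads, tgt_len, src_len)
    if mask is not None:                                  # e.g., masked self-attention in the decoder
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

class MultiHeadAttention(torch.nn.Module):
    """Minimal multi-head attention: project, split into heads, attend, and re-merge."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.wq = torch.nn.Linear(d_model, d_model)
        self.wk = torch.nn.Linear(d_model, d_model)
        self.wv = torch.nn.Linear(d_model, d_model)
        self.wo = torch.nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        b = query.size(0)
        def split(x, proj):   # (b, len, d_model) -> (b, heads, len, d_head)
            return proj(x).view(b, -1, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(query, self.wq), split(key, self.wk), split(value, self.wv)
        out = scaled_dot_product_attention(q, k, v, mask)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_head)
        return self.wo(out)

# Self-attention: query, key, and value all come from the same sequence.
x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 10, 512])
```

In cross-attention, the same module is called with the decoder states as the query and the encoder outputs as the key and value.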

2.3 Acoustic characteristics of Arabic sounds

The phonological classification of Arabic sounds is divided into two main parts: consonants and vowels [24]. The difference between consonants and vowels is based on the movements of the articulators in the vocal tract. The production of consonants requires movements of parts of the vocal tract, whereas vowels are produced with less movement and an open vocal tract [25]. The Arabic language consists of 36 sounds, including three short vowels (/a/ َ, /i/ ِ, /u/ ُ), three long vowels (/aː/ ا, /iː/ ي, and /uː/ و), two diphthongs (i.e., sounds formed by combining two vowels), and twenty-eight consonants [24]. All Arabic consonants, long vowels, and short vowels are presented in Tables 1 and 2 with their corresponding International Phonetic Alphabet (IPA) symbols.

Table 1 Arabic Consonants with IPA Symbols
Table 2 Short Vowels and Long Vowels in Arabic

Consonants are classified according to three criteria: airstream mechanism, place of articulation, and manner of articulation. Airstream mechanisms concern the initiation and direction of the airflow used to produce sounds; the major mechanisms are pulmonic, velaric, and glottalic, which are further classified as egressive or ingressive [25]. The place of articulation refers to the position of the articulators in the vocal tract and the relationships between these articulators, starting from the lips and ending at the larynx [24, 25]. Arabic sounds, such as bilabial, uvular, and pharyngeal sounds, are described according to where in the vocal tract they are produced. Table 3 presents the places of articulation employed in Arabic with their corresponding sounds [24]. In addition, Table 3 describes the active articulators responsible for movements and dynamic adjustments during speech production and the passive articulators that provide a stable, fixed surface against the active articulators [26]. Furthermore, one Arabic sound, (/w/ و), has more than one place of articulation and can be described as both a labial and a velar sound [24].

Table 3 The Major Places of Articulation in the Arabic Language

3 Related works

In this section, we review the literature related to our research area. We begin by exploring prior studies on automatic speech recognition, specifically focusing on Arabic. Subsequently, we investigate the field of mispronunciation detection and diagnosis as it applies to non-Arabic languages, representing the latest approaches and techniques. Furthermore, we extend our analysis of MDD to include studies applied to Arabic, identifying the key challenges and advancements. By synthesizing these diverse related works, we identify gaps, discuss the limitations of the related studies, and draw inspiration for our contributions and experiments.

3.1 Arabic automatic speech recognition (Arabic ASR)

The development of Arabic automatic speech recognition is a multi-disciplinary process that involves machine learning, linguistics, and speech signal processing, and it has been addressed by a number of researchers [27]. In addition, Arabic ASR systems have been applied in various linguistic fields, including phonology [28], morphology [29], semantics [30], and syntactic features [31].

Because one of our goals in this research is to enhance e-learning using ASR systems, this section reviews Arabic ASR systems that leverage linguistic principles to enhance learning and memorization processes for different types of texts, including the Quran and other Arabic texts.

A recent study [32] investigated DNN-based models for recognizing classical Arabic speech by considering the diacritics in the transcription texts. Three models were developed: Time Delay Neural Network (TDNN)-CTC, Recurrent Neural Network (RNN)-CTC, and Transformer, which were trained using a dataset consisting of 100 h of Quran recordings. The RNN-CTC model obtained the best results, with a Word Error Rate (WER) of 19.43% and a Character Error Rate (CER) of 3.5%. Additionally, Hadwan et al. [33] proposed an end-to-end Transformer-based model for recognizing Quran verses with diacritized transcriptions. The model employed a Mel filter bank with 40-dimensional acoustic features and integrated a CNN as an encoder front-end for subsampling. The model was trained and evaluated on a new dataset comprising 10 h of Quran verses recited by 60 reciters. The evaluation achieved a CER of 1.98% and a WER of 6.16%, demonstrating its effectiveness in verse recognition.

Other studies have utilized speech recognition APIs to develop ASR systems. These systems were evaluated by comparing the similarity between the transcribed text and the original text. Marie-Sainte et al. [34] proposed an Arabic ASR system called “Samee’a”, developed to enhance the learning and memorization of any kind of Arabic text. The speech recognition process was built on the Google Cloud Speech Recognition API [35], and the similarity scores were obtained using the Jaro-Winkler distance algorithm. The experiment was evaluated using the average similarity obtained on text files of different sizes, and the results showed that the average similarities for the small, medium, and large files were 95%, 90%, and 87%, respectively. In addition, Gerhana et al. [36] proposed an Arabic ASR system that helps users memorize Juz 30 of the Qur’an using the Google Speech API. The system converts speech into text, and the similarity between the converted text and the original text is measured using the Jaro-Winkler distance algorithm. The system was evaluated by comparing the text produced by the ASR system with the original Quranic text, and it achieved an accuracy of 90%.
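
For illustration, the Jaro-Winkler similarity used in these systems can be computed with the generic, self-contained sketch below; this is not the implementation of [34] or [36], and the example strings are arbitrary.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matching characters, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                                   # find matches within the window
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not m2[j] and s2[j] == c:
                m1[i], m2[j] = True, True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i, matched in enumerate(m1):                             # count out-of-order matches
        if matched:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Similarity between a recognized transcript and the reference text (classic toy example).
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))   # 0.961
```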

3.2 Mispronunciation detection and diagnosis for non-Arabic languages

Extensive research has explored mispronunciation detection and diagnosis (MDD) in various languages, including English, Mandarin, and Dutch. As discussed in the Introduction, the improvement of MDD systems has been achieved through three approaches: (1) pronunciation-scoring approaches, such as GOP [37, 38]; (2) approaches based on phonological rules, such as ERN [39]; and (3) approaches based on end-to-end ASR systems.

The third type of MDD approach was introduced to overcome the limitation of ERNs, which may not efficiently cover all mispronunciation possibilities from all language learners. When the recognition network covers too many mispronunciations, the accuracy of selecting the correct alternative in the acoustic model may drop [6]. Various studies have investigated neural end-to-end architectures that have been applied effectively to ASR, which aim to convert a sequence of input acoustic features into a sequence of words or graphemes [40]. End-to-end systems have achieved high performance in ASR because of the efficiency of neural networks in modeling the context and history of speech and text sequences [41].

One of the earliest end-to-end MDD systems was proposed by Leung et al. [6]; it was implemented using CNN, RNN, and CTC components and applied a free-phone recognition approach. They developed end-to-end speech recognition using a CNN-RNN-CTC model to improve the MDD task, where the CNN was applied to reduce the phone error rate and improve performance in noisy environments. The experiment was conducted using the TIMIT corpus as the native corpus and CU-CHLOE, a corpus of Chinese speakers learning English, as the non-native corpus. The model was evaluated on phone recognition performance and the MDD task. The phone recognition results were reported as the correct phone rate, insertion rate, deletion rate, and substitution rate, with best results of 87.93%, 2.15%, 2.90%, and 7.01%, respectively. The MDD performance outperformed the other baseline models, with an F-measure of 74.62%.

Yan et al. [7] proposed an E2E-based MDD model consisting of a hybrid CTC-Attention model. To overcome the gap in detecting non-categorical pronunciation errors in phoneme-based approaches (e.g., ERN), they expanded the original L2 phone set with their corresponding anti-phone set to include more mispronunciation possibilities and improve mispronunciation detection and diagnosis performance. Another study [42] used acoustic and linguistic features to develop a sentence-dependent end-to-end model of MDD.

In modern deep-learning stacks, Transformers have been increasingly used in ASR and have achieved promising performance [10, 43]. For MDD, two Transformer-based models were proposed in [10]. The first model (T-1) follows the encoder-decoder architecture, and the second model (T-2) is based on Wav2Vec 2.0. Experiments were conducted using CU-CHLOE, a corpus of Chinese speakers learning English, as the non-native corpus. The TIMIT corpus together with the training set of the CU-CHLOE corpus was used to train models T-1 and T-2, and the models were evaluated on phone recognition and MDD performance. In the phone recognition evaluation, models T-1 and T-2 achieved phone accuracies of 91.31% and 94.03%, respectively. In terms of MDD evaluation, the diagnostic accuracy of T-1 was 91.39%, while that of T-2 was 90.05%. Both models outperformed other state-of-the-art MDD models (CNN-RNN-CTC and AGPM). In [44], the authors developed an MDD model through two-stage experiments. First, they evaluated the feasibility of the pre-trained Wav2Vec2.0 model against conventional methods and investigated the impact of the quantity and type of training data on fine-tuning Wav2Vec2.0 for MDD. Second, building on the optimal configuration from the first stage, they examined the influence of various textual modulation gates on MDD performance. Gue et al. [45] proposed an MDD framework for Mandarin based on Squeezeformer (a Transformer-based model for automatic speech recognition), a dual encoder, multi-modal features, and a secondary decoding mechanism. The model efficiently integrates multi-modal information to enhance the representation capacity of the speech features; it can therefore capture the inherent aspects of a speaker's pronunciation, leading to improved accuracy in phoneme recognition and higher quality in detecting and diagnosing mispronunciations. This model achieved high performance, with a PER of 1.50% and an F1-score of 79.43%. Recently, Das et al. [46] proposed a multi-task learning MDD approach that combines MDD and text-to-speech (TTS) tasks in parallel. The proposed system takes speech signals as input, extracts latent features using a Wav2Vec model as a feature extractor, and combines them with the input text to predict the phone sequence. A text-to-speech system is then deployed to reconstruct the speech from the phone sequence along with the identification of the target speaker. They achieved an MDD accuracy of 85.4%, with a False Rejection Rate (FRR) of 7.2% and a False Acceptance Rate (FAR) of 47.0%. Moreover, Wan et al. [47] deployed a meta-learning approach that utilizes existing knowledge for swift adaptation to new tasks and minimizes the model's reliance on extensive labeled data. The study employed a model-agnostic meta-learning (MAML) method for MDD to mitigate the challenges posed by the limited data of L2 learners. They conducted various few-shot fine-tuning experiments, achieving a best PER of 40.89%. Although this result is relatively high and suboptimal compared with previous studies, this is likely due to the limited training data. In contrast, the FAR and FRR improved when fine-tuning with less data: the best FAR was 8.4% and the best FRR was 29.7%. Furthermore, a comparative study conducted by Soundarya et al. [48] evaluated various MDD models, including Transformer, anti-phone modeling, accent modulation, APL embeddings, E2E MDD, non-autoregressive MDD, and hybrid CTC-ATT. The Transformer-based model showed the highest F-measure of 60.50% and a detection accuracy of 88.2%. In contrast, the CTC-ATT model achieved the second-highest F-measure (56%), with detection and diagnosis accuracies of 75.45% and 74.96%, respectively. Lastly, Lounic et al. [49] conducted a systematic literature review of available work on MDD to identify and evaluate the current approaches to using deep learning techniques for MDD tasks. They reviewed 53 papers published between 2015 and 2023 and concluded that the majority of the papers focused on DNN-based models and that the most utilized corpora were in English and Mandarin. Table 4 summarizes the studies that investigated end-to-end architectures for MDD systems, presenting the implemented techniques and performance results.

Table 4 MDD Systems for Non-Arabic Languages

3.3 Mispronunciation detection and diagnosis for the Arabic language

Despite the increasing improvement in MDD research across languages, the Arabic language has received little attention in this area. In this section, we review previous studies on Arabic MDD using traditional and deep-learning methods.

Nazir et al. [50] proposed three models for detecting the mispronunciation of Arabic phonemes using different techniques: a handcrafted features model, a CNN features model, and a transfer learning model. The transfer learning-based method achieved the highest accuracy of 92.2%, whereas the handcrafted features model and CNN features model obtained accuracies of 82% and 91.7%, respectively. Akhtar et al. [51] proposed a feature-based model for mispronunciation detection of Arabic words by extracting deep features using different layers of a Convolutional Neural Network (CNN). They constructed a dataset consisting of Arabic words spoken by different Pakistani speakers learning Arabic. The proposed model achieved an accuracy of 93.30% in detecting the mispronunciation of Arabic words compared with other feature extraction methods (i.e., learning-based models and traditional hand-crafted features).

Several studies have leveraged deep learning techniques to enhance the accuracy of MDD in Arabic speech. Ziafat et al. [52] proposed a mispronunciation detection model for the Arabic alphabet using deep learning techniques. The proposed model is divided into two processes: (a) Arabic alphabet classification, which trains the model to recognize a letter, and (b) Arabic alphabet pronunciation, which trains the model to evaluate the quality of pronunciation. They used the Mel spectrogram for feature extraction of the Arabic alphabet and employed a deep convolutional neural network (DCNN), AlexNet with transfer learning, and bidirectional long short-term memory (BiLSTM) for classification. The accuracy of alphabet classification was 95.95%, 98.41%, and 88.32% for DCNN, AlexNet, and BiLSTM, respectively. For alphabet pronunciation classification, the accuracies of DCNN, AlexNet, and BiLSTM were 97.88%, 99.14%, and 77.71%, respectively. In addition, Algabri et al. [53] investigated end-to-end deep learning techniques to develop an MDD system for non-native Arabic speech. The proposed system uses a multi-label object detector to recognize the phoneme sequence along with the articulatory features (AF) of the spoken utterance. The MDD task was accomplished by recognizing the phoneme sequence, whereas the AF enabled mispronunciation correction at the articulatory level. The model was trained using the Arabic-CAPT and Arabic-CAPT-S corpora constructed in their study. For phoneme recognition, the system achieved a 3.83% PER, a 46.42% improvement over the baseline (CNN-RNN-CTC), while the MDD performance reached a 70.53% F1 score. For mispronunciation detection, Ahmad et al. [54] proposed a framework for two classification tasks: mispronunciation detection and speaker gender identification. The speech dataset was collected from native and non-native Arabic speakers of diverse nationalities and languages. They used MFCCs for feature extraction and an LSTM model for classification. The evaluation results showed that mispronunciation detection achieved an average accuracy of 81.52%, whereas incorporating gender recognition with mispronunciation detection yielded an accuracy of 83.52%. The study indicated that Arabic mispronunciation patterns might not be gender specific. Furthermore, Calik et al. [55] introduced an ensemble model that detects mispronunciations of Arabic phonemes. They employed MFCC and Mel spectrogram techniques for feature extraction and several classification methods, such as SVM, K-NN, Decision Tree, Naïve Bayes, and Random Forest. Additionally, they utilized ensemble learning techniques, such as voting, bagging, boosting, and stacking, to enhance the overall accuracy of the classification models. The dataset was collected from 11 speakers who recorded the Arabic letters, with each letter recorded individually. The results showed that the voting ensemble method with the Mel spectrogram feature extractor achieved the highest accuracy of 95.9%. Recently, El Kheir et al. [56] proposed an L1-aware multilingual MDD framework, an E2E-MDD model designed to accommodate speakers from different L1 backgrounds. The proposed model supports three target languages: English, Arabic, and Mandarin. The framework consists of a primary MDD network and an auxiliary network implemented to encode an L1-L2 aware representation. The MDD network performs phoneme recognition to detect the phoneme sequence, and the auxiliary network encodes the speaker-invariant representation through supervised classification of the speaker's L1 and L2 languages. Table 5 presents a summary of the aforementioned studies, based on the implemented techniques and performance results.

Table 5 MDD Systems for the Arabic Language

3.4 Discussion

In view of the aforementioned studies, despite the rapid improvements in MDD for different languages through several deep learning techniques and end-to-end models, we found little research effort on the Arabic language. Most existing studies on MDD in Arabic employ classifier-based systems that extract phoneme features and feed them into different classifiers, such as KNN, SVM, and CNN. Deep learning techniques, such as CNN, DCNN, and AlexNet, are used in the feature extraction process to extract deep features with transfer learning [52]. Furthermore, these studies focused solely on mispronunciation detection without diagnosing the errors, that is, analyzing the type of each detected error. Analyzing pronunciation errors is crucial because it allows a detailed examination of the type of each detected error at the phoneme level. This level of analysis provides valuable insights into the pronunciation errors made by language learners, which can improve the effectiveness of language learning tools.

End-to-end ASR systems have been effectively applied to many MDD systems for non-Arabic languages using different deep learning models to enhance the performance of phone recognition and MDD tasks. Recently, self-supervised and semi-supervised learning techniques have gained traction in speech recognition by learning from unlabeled speech data and fine-tuning the model on labeled data [10]. Self-supervised Transformers have achieved good results in downstream tasks for both ASR [10, 43] and MDD [8]. This paper introduces an Arabic MDD framework to detect pronunciation errors made by non-native speakers at the phoneme level and to determine the type of each error using Transformer-based techniques. Additionally, this paper focuses on read speech from many non-native Arabic speakers. Furthermore, we conducted a human perceptual test to evaluate and compare our results.

4 Methodology

In this section, we provide an overview of the study’s methodology. First, we describe the datasets used, followed by a detailed description of the methodology and framework structure. In addition, we elucidate the evaluation metrics employed to gauge the effectiveness of our proposed framework.

4.1 Dataset

In this study, we mainly used our own dataset, namely the L2-KSU dataset, to train the MDD system. The L2-KSU dataset consists of 6 h and 6 min of audio recordings (4086 audio files) along with their labeled transcriptions, including the canonical text and the actual transcriptions with pronunciation errors. The data were collected from 80 speakers who recorded 4077 utterances. There was an even split of 40 native and 40 non-native speakers, with an equal distribution of male and female participants for each speaker type. The read sentences included both Quranic text and other MSA sentences. In selecting the sentences, we focused on phonemes that are hard for non-native speakers to pronounce, such as the sounds (/ʕ/ ع and /ħ/ ح), so that the model could learn more from the speakers' pronunciation errors. The selected sentences are presented in Table 6. The audio files are in WAV format with a 16 kHz mono sampling rate. Table 7 presents an overall description of the L2-KSU dataset. This dataset is available upon request.

Table 6 Selected Sentences for The L2-KSU Dataset
Table 7 L2-KSU Dataset Description

For the training process, we split the dataset into training and testing sets. Following [57], data from native speakers were used only in the training set, whereas data from non-native speakers were employed in both the training and testing splits. In addition, the splitting of the L2-KSU dataset was based on speakers: 60 speakers were selected for the training split, and the remaining 20 were chosen for the testing split. Only non-native speakers’ data were selected for the test set to evaluate the system’s ability to detect pronunciation errors from non-native speakers. Furthermore, different speakers may exhibit variations in their pronunciations, language backgrounds, and speech patterns. By splitting the data based on speakers, we ensured that the system remained unbiased toward specific speakers, enhancing its robustness and generalizability through evaluation on unseen speakers. Table 8 presents the details of the data division.

Table 8 Details of the L2-KSU Dataset Setup
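
A minimal sketch of this speaker-level split is shown below; it assumes a hypothetical list of (audio_path, speaker_id, is_native) records for the L2-KSU utterances and is not the exact script used in this study.

```python
import random

def split_by_speaker(records, n_test_speakers=20, seed=42):
    """Hold out entire non-native speakers for testing so that no speaker
    appears in both splits (avoiding speaker leakage)."""
    non_native = sorted({spk for _, spk, is_native in records if not is_native})
    rng = random.Random(seed)
    test_speakers = set(rng.sample(non_native, n_test_speakers))
    train = [r for r in records if r[1] not in test_speakers]   # natives + remaining non-natives
    test = [r for r in records if r[1] in test_speakers]        # non-native speakers only
    return train, test

# records = [("spk01_utt001.wav", "spk01", True), ...]  # hypothetical metadata
```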

4.2 Methodology overview

In this study, we developed an Arabic mispronunciation detection and diagnosis framework. We fine-tuned several Transformer-based pre-trained models, based on Wav2vec2.0 [11], HuBERT [12], and Whisper [13] using our constructed dataset. These models represent advanced speech recognition technology designed to enhance accuracy and efficiency across multiple languages, including Arabic, by leveraging state-of-the-art techniques in deep learning and speech processing. The following subsections provide an overview of each model.

4.2.1 Wav2Vec2.0

Wav2Vec2.0 is a self-supervised learning framework that learns speech representations from unlabeled speech data [11]. It represents a significant advancement in the field of speech processing, leveraging self-supervised learning techniques to learn powerful speech representations from large amounts of unlabeled data. The model design focuses on robustness, scalability, and efficiency, so that it performs effectively in various applications, including transcription and emotion recognition. Wav2Vec 2.0 combines a convolutional neural network (CNN) for feature extraction with a Transformer-based architecture for context modeling, achieving state-of-the-art speech recognition by pre-training on large unlabeled datasets and fine-tuning on task-specific data. It has shown remarkable success in various speech-related tasks and has significantly advanced the state of the art in speech recognition across languages [11]. Wav2Vec2.0 has been released as several pre-trained models with different parameter counts and training data. The base model, ‘Wav2vec2-base-960h’, has approximately 94M parameters and was trained on 960 h of the LibriSpeech corpus of native English speech. In addition, a large-scale multilingual pre-trained model, ‘Wav2Vec2-XLS-R-300M’ [58], was pre-trained with up to 300M parameters on 436k hours of unannotated speech data from various corpora covering 128 languages, including Arabic. Figure 2 shows the architecture of this self-supervised cross-lingual representation learning model.

Fig. 2
figure 2

The architecture of the Wav2Vec2.0/XLS-R model: self-supervised cross-lingual representation learning [58]
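
As a hedged illustration of how such a checkpoint can be prepared for phoneme-level CTC fine-tuning with the Transformers library, the sketch below loads the XLS-R model with a custom phoneme vocabulary; the vocab.json file and all specific settings are assumptions for illustration, and HuBERT checkpoints can be loaded analogously with HubertForCTC.

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# vocab.json is a hypothetical file mapping each phoneme symbol of the annotation
# scheme (plus [PAD] and [UNK]) to an integer id.
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",           # multilingual XLS-R checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),      # size of the phoneme vocabulary
)
model.freeze_feature_encoder()  # keep the CNN feature encoder frozen during fine-tuning
```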

4.2.2 HuBERT

The Hidden-Unit BERT (HuBERT) [12] is a speech representation model based on self-supervised learning. HuBERT addresses the following three main challenges in speech signal processing: (1) the presence of multiple sound units in each input utterance, (2) the absence of a lexicon of input sound units during the pre-training phase, unlike NLP applications in which words or word units are available, and (3) the variable boundaries between sound units without explicit segmentation. To overcome these challenges, HuBERT builds on the BERT model by employing masked continuous speech features to predict pre-determined cluster assignments. A predictive loss is applied to force the model to learn high-level representations of the unmasked inputs, allowing accurate inference of the masked targets. Therefore, the HuBERT model learns both an acoustic and a language model from continuous inputs. HuBERT was pre-trained on the standard LibriSpeech 960 h and the Libri-Light 60k hours to produce three model sizes: base (90M parameters), large (300M), and x-large (1B). Figure 3 shows the architecture of the HuBERT model, which predicts the hidden cluster assignments of masked frames produced through one or more iterations of k-means clustering [12].

Fig. 3
figure 3

The architecture of HuBERT model [12]

4.2.3 Whisper

Whisper is a state-of-the-art ASR system developed by OpenAI [13]. It is a Transformer-based encoder-decoder model; Fig. 4 shows its architecture. The input audio is segmented into 30-s chunks, which are converted into log-Mel spectrograms and fed into the encoder. The decoder is trained to predict the transcription text with special tokens that guide the single model to perform different tasks, such as language identification and multilingual speech translation. Whisper was pre-trained as different models with different parameter sizes and target languages: tiny with 39M parameters, base with 74M, small with 244M, medium with 769M, and large, large-v2, and large-v3 with 1550M parameters. The models were trained on 680k hours of labeled speech data, with either English-only or multilingual data, to perform two different tasks: speech recognition and speech translation. The English-only models were trained on the speech recognition task, whereas the multilingual models were trained on both speech recognition and speech translation to predict transcriptions in different languages [13].

Fig. 4
figure 4

The architecture of the Whisper model, implemented as an encoder-decoder Transformer [13]
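
As a hedged sketch of how such a model is used for inference with the Transformers library, the code below transcribes a 16 kHz Arabic recording with the small checkpoint; the audio file name is a placeholder, and the fine-tuned models in this study additionally target phoneme-level transcriptions.

```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# "sample.wav" is a placeholder; the audio is loaded as a 16 kHz mono waveform.
waveform, _ = librosa.load("sample.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")   # log-Mel features

# Force Arabic transcription (rather than translation) via decoder prompt tokens.
forced_ids = processor.get_decoder_prompt_ids(language="arabic", task="transcribe")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```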

4.3 Methodological framework

In this section, we provide an overview of our methodological framework for implementing an MDD system in the Arabic language.

Figures 5 and 6 show the methodological framework of this study, based on fine-tuning Wav2Vec2.0 [11] and HuBERT [12] in Fig. 5 and Whisper [13] in Fig. 6. The framework is divided into two phases: the first phase represents the implementation of the ASR model, and the second phase shows the mispronunciation detection and diagnosis processing. Starting from phase 1 in all models, the input is the raw data, which include audio and textual data. The audio data are pre-processed by setting the sampling rate to 16 kHz and converting the raw waveform of the speech signal to a float array. Additionally, the annotated data are mapped into phoneme-based transcriptions that include the pronunciation errors. In Fig. 5, the speech features are extracted through a CNN-based feature encoder and a Transformer-based context network that transform the raw audio waveforms into contextualized representations. The Wav2Vec2.0 and HuBERT models are then trained using the Connectionist Temporal Classification (CTC) loss function. For speech recognition decoding, a CTC tokenizer is applied to decode the predicted output into a phoneme-based transcription. For the Whisper model in Fig. 6, the feature extractor converts the raw audio inputs into a log-Mel spectrogram. The Transformer encoder then processes the spectrogram to form a sequence of encoder hidden states. Lastly, the decoder predicts text tokens autoregressively, conditioned on both the preceding tokens and the encoder hidden states.

Fig. 5
figure 5

Methodological Framework for the Arabic MDD based on Wav2Vec2.0 and HuBERT ASR models

Fig. 6
figure 6

Methodological Framework for the Arabic MDD based on Whisper ASR model

As illustrated in phase 2 of Figs. 5 and 6, and in Fig. 7, the mispronounced phonemes are detected from the transcription text by calculating the similarity between the detected transcription, the labeled transcription, and the canonical text using the Levenshtein distance algorithm [59].
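
A minimal sketch of this alignment step is shown below: a standard Levenshtein (edit-distance) alignment between the canonical phoneme sequence and a recognized or human-labeled sequence, whose backtrace yields the match, substitution, deletion, and insertion decisions. It is illustrative code, not the exact implementation of [59].

```python
def align_phonemes(canonical, observed):
    """Levenshtein alignment between a canonical phoneme sequence and an observed
    (recognized or human-labeled) sequence; returns the list of edit operations."""
    n, m = len(canonical), len(observed)
    dp = [[0] * (m + 1) for _ in range(n + 1)]            # dp[i][j]: distance of prefixes
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == observed[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + cost)        # match / substitution
    ops, i, j = [], n, m                                   # backtrace the operations
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if canonical[i - 1] == observed[j - 1] else 1):
            ops.append(("match" if canonical[i - 1] == observed[j - 1] else "substitution",
                        canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, observed[j - 1]))
            j -= 1
    return list(reversed(ops))

# Example: canonical /ħ a l iː m/ pronounced as /h a l iː m/ -> one substitution (ħ -> h).
print(align_phonemes(["ħ", "a", "l", "iː", "m"], ["h", "a", "l", "iː", "m"]))
```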

Fig. 7
figure 7

The evaluation hierarchical structure for mispronunciation detection and diagnosis [14]

5 Experimental studies

In this section, we present the experimental details of the study, including data preprocessing and model fine-tuning. Additionally, we present the results and discussion, highlighting the performance evaluation based on phoneme recognition and MDD.

5.1 Data preprocessing

During dataset development, the audio recording files were normalized by removing background noise and silence using the PRAAT software tool [60]. Figure 8 shows the spectrogram of an audio file before and after the normalization process. Figure 8 (a) shows the spectrogram with background noise, which appears as continuous, random signals in the low-frequency range, and Fig. 8 (b) shows the spectrogram after noise reduction, with the noise-related patterns eliminated from the signal. This noise reduction enhances the clarity of the audio signals, making the speech more distinguishable in the spectrogram. In addition, some audio files were stereo recordings containing two channels; to ensure compatibility with the speech recognition models, we converted them to mono.

Fig. 8
figure 8

Spectrogram of an audio file before and after noise removal. a represents a spectrogram with background noise appearing as random signals in the low-frequency range. b shows the spectrogram after removing the noise
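
A hedged sketch of the channel and sampling-rate normalization step using torchaudio is given below; the file names are placeholders, and the noise and silence removal itself was performed in PRAAT rather than in code.

```python
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("input.wav")            # shape: (channels, samples)
if waveform.shape[0] > 1:                              # stereo -> mono by averaging channels
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:                                        # resample to 16 kHz
    waveform = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("output.wav", waveform, 16000)
```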

The annotated textual data were processed by mapping the canonical text to the transcribed text, including the pronunciation errors. The mapping of pronunciation errors was based on the error type as follows: for deletion errors, the unpronounced phoneme was removed from the labeled text; for insertion errors, the inserted phoneme was added; and for substitution errors, the canonical phoneme was replaced with the pronounced phoneme.

5.2 Model fine-tuning

For the training process, all models were trained on Tesla T4 GPUs with 54 GB of memory. We utilized PyTorch [61] (version 2.0.1+cu118) to perform GPU-accelerated training, which is essential for deep learning tasks. In addition, we used the Transformers library [62], which provides convenient APIs and tools for downloading and fine-tuning state-of-the-art pre-trained models.

The fine-tuning process was applied to various Transformer-based models, Wav2Vec2.0 [11], Whisper [13], and HuBERT [12], with different parameters and training data. In our experiments, we fine-tuned the multilingual Wav2Vec2.0 model ‘Wav2Vec2-XLS-R-300M’ [58] and the HuBERT-large and HuBERT-xlarge models. When training the x-large model, we encountered issues related to memory and processing capacity because of its approximately one billion parameters. To address this, we applied parameter-efficient fine-tuning (PEFT) techniques. The PEFT approach reduces the number of trainable parameters of large-scale models, which significantly decreases computational and storage costs [63]. We trained the HuBERT-xlarge model using two PEFT-based methods: Int8 matrix multiplication for Transformers at scale (LLM.int8) [64] and low-rank adaptation of large language models (LoRA) [65]. LLM.int8 was applied to reduce the precision of the floating-point data types and to decrease the memory used to store the model weights. The LoRA approach freezes the weights of the pre-trained model and integrates trainable rank-decomposition matrices into every layer of the Transformer architecture, reducing the number of trainable parameters required for the different downstream tasks. After applying these methods, the number of trainable parameters in the HuBERT-xlarge model was reduced from 964 million to 7 million. To make the fine-tuning of the large and x-large models comparable, we also applied PEFT to the large model, reducing its parameters from 315 million to 3 million. Therefore, the HuBERT-large model was trained both in its original form and in the optimized (parameter-reduced) configuration, whereas the HuBERT-xlarge model was trained only after optimization.
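
A hedged sketch of this parameter-reduction step using the peft library and the 8-bit loading integration in Transformers is shown below; the checkpoint name, LoRA rank, and target module names are illustrative assumptions, not the exact configuration used in this study.

```python
from transformers import HubertForCTC
from peft import LoraConfig, get_peft_model

# Load the frozen backbone in 8-bit (LLM.int8) to reduce the memory needed for its weights;
# the checkpoint name is a placeholder for the HuBERT-xlarge model.
model = HubertForCTC.from_pretrained("facebook/hubert-xlarge-ll60k", load_in_8bit=True)

# LoRA: keep the pre-trained weights frozen and inject small trainable rank-decomposition
# matrices into the attention projections (assumed module names).
lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # reports only a few million trainable parameters
```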

Since the HuBERT models closely follow the architecture of the Wav2Vec2.0 models, we selected similar configurations and hyperparameter values for both. Following [66], we tuned several hyperparameters for training the Wav2Vec2.0 and HuBERT models, including the batch size, learning rate, number of epochs, and warm-up ratio. In our experiments, we selected a batch size of two, a learning rate of 1e-4, and 20 epochs. Smaller batch sizes are common when fine-tuning large models because they require less GPU memory and can prevent memory issues. We trained the models for 20 epochs to allow them to better learn the patterns in the data, and the lower learning rate ensures the stability of the training process.
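
The corresponding training configuration can be sketched as follows; the output directory and warm-up ratio are placeholders, and the resulting arguments would be passed to transformers.Trainer together with the model, the processed datasets, and a CTC padding data collator.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-l2ksu-mdd",   # hypothetical output path
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    num_train_epochs=20,
    warmup_ratio=0.1,                       # assumed warm-up value
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,                              # mixed-precision training on GPU
)
```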

For the Whisper experiments, we trained the tiny, base, small, medium, large-v2, and large-v3 models, which have varying numbers of parameters. As presented in Table 9, the Whisper models contain a large number of parameters, ranging from 39M to 1.55B. Such large models strain the GPU and other computational resources during training. To address this, we applied PEFT techniques to reduce the number of trainable parameters and train the models with less computational resources and time. In our study, we reduced the trainable parameters of the Whisper models to only 1% to 1.5% of all parameters. For example, the parameters of the large model were reduced from 1550M to 15M. Hence, computational efficiency was enhanced, training time was reduced, and the models became more scalable. The large-v2 and large-v3 Whisper models share the same architecture and training parameters, with two differences: the large-v3 model uses 128 Mel frequency bins as input instead of 80, and it was trained on 1 million hours of weakly labeled data and 4 million hours of pseudo-labeled audio, compared with 680k hours of labeled data for large-v2 [13]. The architectural and training parameters of each fine-tuned model used in this study are listed in Table 9, along with the time required to train each model. The training time increased with model size. The Wav2Vec2.0 and HuBERT models required similar training times (5–5.30 h) because they share the same architecture; the HuBERT-large model took 5.30 h of training when utilizing 315M parameters and 3.3 h when its parameters were reduced to 3.1M. Table 10 lists the fine-tuned hyperparameters for each model.

Table 9 The Architecture Parameters of Wav2Vec2.0 [11], HuBERT [12], And Whisper [13] Models Utilized In This Study
Table 10 Hyperparameter Settings for Fine-Tuning Wav2Vec2.0, HuBERT, And Whisper

5.3 Model evaluation

The models were evaluated on the test set to assess their performance on the phone recognition and MDD tasks.

The evaluation measure for phone recognition is the phoneme error rate (PER), which is commonly used to evaluate the performance of ASR systems. PER was calculated using the following equation [67]:

$$PER = \frac{I + D + S}{N}$$
(1)

where S, D, and I refer to the numbers of substitutions, deletions, and insertions in the recognized phone sequence, and N represents the number of annotated labels.
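
For example, by representing phoneme sequences as space-separated tokens, PER can be computed with a standard word-error-rate implementation such as the jiwer package; the phoneme strings below are purely illustrative.

```python
import jiwer

reference = "ħ a l iː m"           # canonical/annotated phonemes (N = 5)
hypothesis = "h a l iː m aː"       # recognized phonemes: 1 substitution + 1 insertion
per = jiwer.wer(reference, hypothesis)
print(f"PER = {per:.2%}")          # (I + D + S) / N = (1 + 0 + 1) / 5 = 40.00%
```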

The performance on the MDD task was evaluated using the evaluation structure proposed in [14] to measure mispronunciation detection and diagnosis. As shown in Fig. 7, correctly pronounced phones were divided into true acceptances (TA) and false rejections (FR), whereas mispronounced phones were divided into false acceptances (FA) and true rejections (TR). In addition, the TR cases were further categorized into correct diagnoses (CD) and diagnosis errors (DE). The metrics used to evaluate mispronunciation detection and diagnosis were the false rejection rate (FRR), false acceptance rate (FAR), and diagnosis error rate (DER).

TA indicates that both human-transcribed phones and predicted phones are similar to canonical pronunciations. TR denotes that the transcribed phone and the predicted phone are not identical to the canonical pronunciation. FA indicates that the predicted phone is the same as the canonical phone, but the transcribed phone is different. FR indicates that the transcribed phone and canonical pronunciation are similar, but the predicted phone is different. Correct diagnosis (CD) and diagnosis error (DE) rates were used to evaluate mispronunciation diagnoses.

The evaluation of the accuracies of mispronunciation detection and mispronunciation diagnosis was calculated as follows:

$$Detection\ Accuracy = \frac{TA + TR}{TA + FR + FA + TR}$$
(2)
$$Diagnosis\ Accuracy = \frac{CD}{CD + DE} = 1 - DER$$
(3)
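
The sketch below illustrates how these counts and accuracies can be derived from position-aligned (canonical, human-annotated, model-predicted) phone triples; it is a simplified illustration that ignores insertions and deletions, not the evaluation script used in this study.

```python
def mdd_metrics(triples):
    """triples: aligned (canonical, human_transcribed, model_predicted) phones."""
    TA = TR = FA = FR = CD = DE = 0
    for canon, human, pred in triples:
        human_correct = (human == canon)
        pred_correct = (pred == canon)
        if human_correct and pred_correct:
            TA += 1                       # true acceptance
        elif human_correct and not pred_correct:
            FR += 1                       # false rejection
        elif not human_correct and pred_correct:
            FA += 1                       # false acceptance
        else:
            TR += 1                       # true rejection: mispronunciation detected
            if pred == human:
                CD += 1                   # correct diagnosis of the actual error
            else:
                DE += 1                   # diagnosis error
    detection_acc = (TA + TR) / (TA + FR + FA + TR)
    diagnosis_acc = CD / (CD + DE) if (CD + DE) else 0.0
    return detection_acc, diagnosis_acc

# Example: canonical /ħ/ mispronounced as /h/ and predicted as /h/ (TR + CD).
print(mdd_metrics([("ħ", "h", "h"), ("a", "a", "a"), ("l", "l", "l")]))
```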

6 Results and discussion

In this section, we delve into the findings obtained through the evaluation of the various models and examine their performance on the phoneme recognition and mispronunciation detection and diagnosis tasks. Through these evaluation results, we aim to investigate the effectiveness of the Transformer-based models Wav2Vec2.0, Whisper, and HuBERT.

6.1 Performance in the phoneme recognition

The phoneme recognition evaluation was based on the phoneme error rate (PER). Table 11 shows the evaluation performance of each model. The Wav2Vec2.0, HuBERT-large, and Whisper models achieved promising results, with PERs ranging from 3.1% to 3.9%, and outperformed the baseline CNN-RNN-CTC model, which achieved a PER of 13.3%. These results demonstrate the effectiveness of Transformer-based models in accurately transcribing Arabic phonemes compared with the CNN-RNN-CTC baseline.

Table 11 Evaluation Results for Wav2Vec2.0, HuBERT, and Whisper Models in Terms of Phone Recognition

All Transformer-based models yielded low PERs except the HuBERT-large and HuBERT-xlarge models, which achieved suboptimal results after the optimization and reduction of their parameters. The parameter reduction in the HuBERT models compromised their performance in recognizing Arabic phonemes, yielding PERs of 79% and 40% for the large and x-large models, respectively. However, the HuBERT-large model without parameter reduction (300M) achieved results comparable to the large Whisper and Wav2Vec2.0 models, with a 3.6% PER compared to 3.1% and 3.2% for the Whisper-large and Wav2Vec2.0-XLS-R models, respectively. When comparing HuBERT with Whisper after optimization and parameter reduction, we found that the Whisper models achieved much better results despite having fewer trainable parameters than the HuBERT models. The base, tiny, and small Whisper models outperformed the HuBERT models with smaller parameter sizes. One of the main reasons is that Whisper was trained on multilingual data, including Arabic, whereas HuBERT was trained only on English data.

The overall results in Table 11 indicate that the Whisper models (small, medium, and large) achieved the best phone recognition results because they were trained on multi-lingual data, which enhanced the models by recognizing different language and pronunciation patterns. In addition, Whisper utilizes a unique generative inference procedure, sequentially inferring on 30-s audio windows. The encoder embeds the audio window, which is then mapped to a text sequence predicted by the decoder using the encoder outputs as a context vector. This process is repeated by advancing the audio window based on the last predicted timestamp token [13]. This feature allows Whisper to maintain a dynamic understanding of the evolving context within each window to generate more contextually informed and accurate predictions. The results of the Whisper models gradually improved with an increase in model size, and we observed that the medium and large models achieved the best results with similar PER of 3.12% and 3.18% for the medium and large-v2 models, respectively. Furthermore, the training parameter sizes of Whisper models are smaller than those of Wav2Vec 2.0 and the original HuBERT-large models. However, when comparing the model size with the achieved results, the Whisper models excel in their lightweight size and high performance. This indicates that the Whisper models achieve a balance between model performance and efficiency in terms of parameter size. This efficiency is crucial for practical applications where resource usage is a concern. Additionally, the consistent outperformance of the Whisper models across different sizes indicates the robustness of this approach.

Figure 9 shows the confusion matrix of recognized and misrecognized phonemes for the Whisper-medium model on the test set, which contains 21,511 phonemes belonging to 813 utterances. The confusion matrix shows the effectiveness of the model in phoneme recognition, with a much higher number of correctly recognized phonemes than misrecognized ones. The most frequent misrecognitions occurred between the long vowels (/ا/ aː), (/ي/ iː), and (/و/ uː) and other phonemes, since the short vowels (/َ/ a), (/ِ/ i), and (/ُ/ u) are not included in the annotated data, and the model may misrecognize a short vowel as a long vowel.

Fig. 9
figure 9

Confusion matrix of the phoneme recognition task by the Whisper model

To assess performance on the training and validation sets, Fig. 10 illustrates the training and validation loss during the training of each model: Wav2Vec-XLS-R, HuBERT-large, and Whisper-medium. For all models, the training and validation losses decreased and stabilized, indicating that the models generalize well to unseen data.

Fig. 10
figure 10

Performance of the training and validation loss during the training process for the Wav2Vec-XLS-R, HuBERT-large, and Whisper-large models

6.2 MDD performance evaluation

The evaluation of MDD performance was based on the hierarchical evaluation structure proposed in [14]. Table 12 shows the MDD performance results obtained in this study. The large models of Wav2Vec2.0, HuBERT, and Whisper demonstrated comparable precision scores of 92.8% for Wav2Vec2.0, 94.9% for HuBERT-large, and 96.5% for Whisper-large-v2; the Whisper-large-v3 model also achieved a high precision of 96.2%. These scores illustrate the models' ability to correctly identify true-positive cases. The HuBERT-large and Whisper models achieved comparable recall scores, and the highest recall of 91.0%, obtained by the Whisper-small model, signifies its capacity to identify the vast majority of actual mispronunciations. This higher recall led to a higher F-measure, reflecting overall accuracy: the Whisper-small model outperformed the other models with an F-measure of 93.5%, despite having fewer training parameters than the Wav2Vec2.0 and HuBERT models. In contrast, the baseline CNN-RNN-CTC model achieved a lower F-measure of 80.0%.

Table 12 Evaluation Results for Wav2Vec2-XLS-R, HuBERT-Large, and Whisper Models, and the Baseline CNN-RNN-CTC in Terms of the MDD Task

The false rejection rates (FRR) of all our models were similar, ranging from 6.2% for Whisper-large-v3 to 12.2% for Whisper-medium, indicating that the Whisper-large models work well in detecting correct pronunciations. The best false acceptance rate (FAR) of 8.9% was achieved by Whisper-small, while HuBERT-large and Wav2Vec2.0 achieved similar FARs of 13.0% and 12.6%, respectively. The FAR reflects the model's proficiency in modeling mispronunciations. In terms of detection and diagnosis accuracy, the best detection accuracy of 91.3% was achieved by Whisper-small, outperforming Wav2Vec2.0, HuBERT-large, and the baseline CNN-RNN-CTC by 3.8%, 2.7%, and 17.5%, respectively. Similarly, the Whisper-small model achieved improvements of 4.4%, 1.9%, and 24.6% in diagnosis accuracy over Wav2Vec2.0, HuBERT-large, and CNN-RNN-CTC, respectively.

The overall results verify the efficiency of the Whisper models in accurately classifying mispronunciations and recognizing Arabic phonemes. Although the performances of the Whisper, Wav2Vec2.0, and HuBERT models were comparable in phoneme recognition, in terms of MDD performance the Whisper models outperformed Wav2Vec2.0 and HuBERT, with the HuBERT results closest to those of Whisper. We assume that the large and diverse datasets used to train Whisper enable this model to better identify various types of pronunciations: Whisper was trained on 680k hours of labeled data collected from the Web, covering a wide range of accents, dialects, speaking styles, and background sounds, whereas the Wav2Vec2-XLS-R model was trained on 436k hours of unlabeled data from different benchmark datasets.

6.3 Pronunciation errors analysis

This section analyzes the errors detected by the model, which provides valuable insight into the pronunciation errors produced by language learners and speakers. The Whisper-small model achieved the best MDD evaluation results among the evaluated models in terms of detection and diagnosis accuracy. Based on these results, we extracted the pronunciation errors detected by the Whisper-small model and analyzed them from the perspective of Arabic linguistics. This analysis demonstrates the accuracy and effectiveness of the system, aligning with linguistic descriptions and providing valuable insights into its performance in detecting pronunciation errors in the Arabic language.

The testing set consisted of 813 utterances from non-native speakers, comprising 21,511 phonemes; the detected phoneme errors occurred in 510 of these utterances, comprising 13,660 phonemes. Table 13 presents the pronunciation errors of all types produced by both male and female speakers, with a total of 1750 pronunciation errors across all speakers. The number of errors was similar between males and females, with slightly more errors produced by male speakers.

Table 13 Statistical Summary of Pronunciation Errors Produced by Non-Native Speakers and Detected by the Whisper-Small Model

Figure 11 illustrates that substitution errors had the highest occurrence among non-native speakers, followed by deletion errors and then insertion errors. Substitution errors occur when non-native speakers replace a target phoneme with a similar one that they are more familiar with. In particular, speakers tended to transfer phonetic patterns from their native language to the target language, indicating the effect of first-language transfer and the importance of addressing it to improve speaking proficiency.
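The three error categories discussed here can be recovered by aligning each transcribed phoneme sequence against its canonical counterpart. The sketch below is a simplified Levenshtein-style alignment written for illustration; it is not the alignment procedure used in this study, and the example phoneme sequences are hypothetical.

```python
def align_errors(canonical, observed):
    """Count insertion, deletion, and substitution errors between two phoneme
    sequences using a standard Levenshtein dynamic program with backtrace."""
    n, m = len(canonical), len(observed)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    errors = {"substitution": [], "deletion": [], "insertion": []}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1]):
            if canonical[i - 1] != observed[j - 1]:
                errors["substitution"].append((canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors["deletion"].append(canonical[i - 1])  # canonical phone not produced
            i -= 1
        else:
            errors["insertion"].append(observed[j - 1])  # extra phone produced
            j -= 1
    return errors

# Hypothetical example: /ðˤ/ substituted with /z/ and one vowel deleted.
print(align_errors(["ðˤ", "a", "l", "a", "m"], ["z", "a", "l", "m"]))
```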

Fig. 11 Error count based on the insertion, deletion, and substitution errors

Figure 12 shows the confusion matrix for the most frequent substitution errors obtained from non-native speakers. Each cell in the matrix represents the frequency of substitution errors for a specific phoneme pair, where the x-axis represents the canonical phoneme and the y-axis represents the substituted phoneme. The results in the confusion matrix are consistent with the annotated errors in the dataset, as well as with the errors most commonly produced by non-Arabic speakers. These errors include frequent substitutions of the pharyngeal sounds (/ʕ/ ع and /ħ/ ح), with (/ʕ/ ع) replaced by (/ʔ/ أ) and (/ħ/ ح) replaced by (/h/ ه). From an Arabic linguistic perspective, these pharyngeal sounds present significant pronunciation difficulties for non-native Arabic learners. These difficulties can be attributed to articulation in the pharyngeal region, which is inactive in many of the world's languages [68]. In addition, the emphatic sound (/dˤ/ ض) was frequently substituted with its plain counterpart (/d/ د). This substitution often occurs among non-Arabic speakers because the sound (/dˤ/ ض) exists only in Arabic and requires specific tongue placement, demanding more effort than its plain counterpart [69, 70]. Furthermore, the interdental sounds (/ðˤ/ ظ) and (/ð/ ذ) were frequently substituted with the sound (/z/ ز). The non-native speakers in our dataset were from West and Central Africa, where the most common first languages are Wolof and Yoruba. According to [70, 71], the sounds (/ðˤ/ ظ) and (/ð/ ذ) are not present in Wolof or Yoruba, making their pronunciation more challenging. Finally, the sound (/ɣ/ غ) was commonly substituted with the sound (/q/ ق), both of which are produced at the uvular place of articulation. According to [71], the consonant (/ɣ/ غ) in Arabic has an equivalent in Yoruba, a hard /g/ as in "morning". This sound does not exist in MSA or Classical Arabic but is present in some dialects; in our dataset, we mapped it to (/q/ ق), which explains the frequent substitution between (/ɣ/ غ) and (/q/ ق).
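Tallying the substituted pairs produced by such an alignment yields the kind of matrix shown in Fig. 12. A minimal sketch, assuming a list of hypothetical (canonical, substituted) phoneme pairs rather than the actual model output:

```python
from collections import Counter

def substitution_matrix(pairs):
    """Build a nested dict of counts: rows are substituted phonemes (y-axis),
    columns are canonical phonemes (x-axis), mirroring the layout of Fig. 12."""
    counts = Counter(pairs)
    canonical = sorted({c for c, _ in counts})
    substituted = sorted({s for _, s in counts})
    return {s: {c: counts.get((c, s), 0) for c in canonical} for s in substituted}

# Hypothetical pairs reflecting the error patterns discussed above.
pairs = [("ʕ", "ʔ"), ("ʕ", "ʔ"), ("ħ", "h"), ("dˤ", "d"), ("ðˤ", "z"), ("ð", "z"), ("ɣ", "q")]
for substituted, row in substitution_matrix(pairs).items():
    print(substituted, row)
```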

Fig. 12 Confusion matrix for the most frequent substitution errors detected by the Whisper-small model

Through this analysis, the system demonstrated its effectiveness by identifying pronunciation errors that linguists agree are common among non-Arabic speakers, especially speakers of Yoruba and Wolof.

The deletion and insertion errors identified by the Whisper-small model are shown in Figs. 13 and 14, respectively. The most frequent deletion error occurred on the nunation sound, which is represented by a diacritical mark (ً) and may not be easily distinguished by non-Arabic speakers. According to previous linguistic studies, nunation is one of the sounds that non-native Arabic speakers find difficult both to perceive and to pronounce [72]. Figure 14 illustrates the insertion errors obtained from non-native speakers, which were the least frequent error type identified by the model.

Fig. 13 Deletion errors produced by non-native Arabic speakers and detected by the Whisper-small model

Fig. 14 Insertion errors produced by non-native Arabic speakers and detected by the Whisper-small model

6.4 Discussion

To demonstrate the efficiency of our models in phoneme recognition and MDD performance, we conducted a comparative analysis using datasets from prior literature. We selected the L2-ARCTIC [73] and Speechocean762 [74] datasets, which were designed for non-native English speakers and are widely employed in MDD research. For the Arabic language, we could not find any public dataset of non-native speakers containing annotations of pronunciation errors. Table 14 presents a comparative analysis across the different datasets using three models, Wav2Vec2-XLS-R, HuBERT-large, and Whisper-small, under the same experimental settings as our study. On the L2-ARCTIC dataset, we achieved good results in terms of PER, F1 score, and detection accuracy, with the Whisper-small model showing the best F1 and detection accuracy across all datasets; however, the diagnosis accuracy was suboptimal. This dataset comprises non-native speakers with diverse native languages such as Arabic, Mandarin, Hindi, Korean, Spanish, and Vietnamese. These variations in native languages can lead to different speaking styles and uncommon pronunciation errors, which may challenge the model's ability to accurately diagnose pronunciation errors. The Speechocean762 dataset yielded promising results in terms of PER, but the F-measure was suboptimal. This could be due to the low occurrence of pronunciation errors in the dataset, leading to a strong imbalance between phonemes for which the transcribed and canonical forms agree and those for which they disagree (i.e., pronunciation errors). Our L2-KSU dataset achieved the best evaluation results with the Whisper-small model in terms of PER and diagnosis accuracy.

Table 14 Evaluation Results of Our Models Using Different Datasets

To contextualize our results within existing state-of-the-art work in the field, we compared our findings against both Arabic and non-Arabic MDD studies. Algabri et al. [53] obtained a 70.42% F-measure and 84.01% diagnosis accuracy with a model based on CNN-RNN-CTC. In addition, El Kheir et al. [56] introduced an L1-aware multilingual MDD framework based on a Transformer architecture that achieved a 6.13% PER and a 78.42% F-measure. For non-Arabic data, Wu et al. [8] explored both a Transformer encoder-decoder architecture and the Wav2Vec2.0 model for non-native English speakers; the Wav2Vec2.0 model achieved an F-measure of 80.98% and a detection accuracy of 90.05%. Additionally, Peng et al. [44] fine-tuned the Wav2Vec2-XLS-R model on the L2-ARCTIC and TIMIT datasets and obtained an F-measure of 60.44%, a PER of 14.68%, and a DER of 29.28%. Furthermore, Shen et al. [75] utilized the WavLM-large pre-trained model for Mandarin, achieving a PER of 4.16% and an F-measure of 39.5%.

The research question of our study concerns the ability of Transformer-based techniques to detect pronunciation errors among non-native Arabic speakers. Our findings, compared against the baseline CNN-RNN-CTC model, demonstrate the effectiveness of Transformer-based techniques in addressing mispronunciation challenges in the context of non-native Arabic speakers. In particular, the Whisper models stand out with competitive performance in terms of both phoneme recognition and MDD evaluation.

7 Human perceptual test

A human perceptual test in speech processing studies refers to the assessment and evaluation of speech data by human participants [76]. This perceptual experiment was conducted to assess the ability of humans to detect mispronunciation errors in the L2-KSU dataset and to compare their evaluations with automatic verification. In this section, we describe the design of the experiment, including the selection of reviewers, the design of the evaluation form, and the scoring criteria used to assess the human verification results. In addition, we present and discuss the evaluation results of human verification in comparison with automatic verification, along with the levels of inter-rater agreement among the participants.

7.1 Reviewer selection

The selection criteria were based on gender and educational background. Two males and two females were selected to participate in the human perceptual test. All participants were native Arabic speakers with different educational backgrounds. The first male participant held a bachelor's degree in Arabic, and the second held a bachelor's degree in Shari'ah (i.e., Islamic law) and Islamic studies and is a Hafiz of the Holy Quran (i.e., has completely memorized the Quran). The third participant was a female master's student in the sciences of the Holy Quran and also a Hafiz of the Holy Quran. The fourth participant was a female with a bachelor's degree in English language and literature. This diversity of education and academic focus was chosen to determine whether a reviewer's educational background had any effect on the accuracy of Arabic mispronunciation perception. All participants were asked to sign a participation consent form before evaluating the provided data.

7.2 Evaluation form design

We selected 200 samples from the non-native speakers of the L2-KSU dataset, with an equal distribution of males (100 samples) and females (100 samples). Each sample was presented in a separate document and categorized according to gender. The samples were provided as WAV files, each accompanied by a canonical transcription. The files were given random names, and their presentation order was shuffled to minimize bias and order effects. The participants were tasked with identifying three key aspects of each sample: the error type, the word containing the error, and the canonical phoneme together with the mispronounced phoneme. Participants were required to determine whether each pronunciation error was a deletion, substitution, or insertion, to identify the specific words in which mispronunciation occurred, and then to pinpoint the canonical phoneme and the corresponding mispronounced phoneme for each identified error. Figure 15 shows an example of a sample from the evaluation form before and after a participant completed it. Instructions were given to the participants outlining the structure of the form, the process of accessing the WAV files, and the definition of each error type with examples to enhance understanding. Additionally, participants were allowed to replay the current audio file if necessary but were prevented from replaying past utterances once they moved to a subsequent file, to avoid comparison across multiple utterances of the same speaker [76].
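The anonymization and shuffling of the evaluation samples can be reproduced with a short script. The sketch below is illustrative only; the directory names and the samples.csv mapping file are hypothetical and not artifacts of this study.

```python
import csv
import random
import shutil
import uuid
from pathlib import Path

def anonymize_and_shuffle(src_dir: str, out_dir: str, seed: int = 13) -> None:
    """Copy WAV samples under random names and record a shuffled presentation order."""
    random.seed(seed)
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wavs = sorted(src.glob("*.wav"))
    random.shuffle(wavs)  # shuffled presentation order to minimize bias and order effects
    with open(out / "samples.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["presentation_order", "anonymized_name", "original_file"])
        for order, wav in enumerate(wavs, start=1):
            anon = f"{uuid.uuid4().hex[:8]}.wav"  # random, non-identifying file name
            shutil.copy2(wav, out / anon)
            writer.writerow([order, anon, wav.name])  # mapping kept private to the experimenters

# Example with hypothetical paths:
# anonymize_and_shuffle("l2ksu_nonnative_subset", "perceptual_test_files")
```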

Fig. 15 Sample of the form before and after a participant filled it out with the error type, the word in which the mispronunciation occurred, and the true phonemes with mispronounced phonemes

7.3 Scoring criteria

After collecting the evaluation forms from the reviewers, we extracted and analyzed the evaluation results to obtain the accuracy of human verification and compared it with automatic verification. As shown in Fig. 16, we compiled the following key components for each audio sample: the file name, the pronunciation errors identified by each reviewer, and the frequency of each pronunciation error. Subsequently, we computed the average frequency of each pronunciation error across all reviewers to quantify the agreement among reviewers on specific errors. An error was adopted only when it was identified by more than two reviewers. Based on these aggregated scores, we built a phoneme-based transcription for comparison with the target annotation and canonical text.
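The aggregation step can be viewed as a majority vote over the reviewers' annotations. The snippet below is an illustrative reading of this procedure, not the exact script used in the study; it assumes each reviewer's annotation of an utterance is a set of hypothetical (word, canonical phoneme, observed phoneme, error type) tuples.

```python
from collections import Counter

def aggregate_reviews(reviews, min_reviewers=3):
    """Keep an error only if more than two reviewers (here, at least three) reported it.

    `reviews` holds one entry per reviewer; each entry is a set of
    (word, canonical_phoneme, observed_phoneme, error_type) tuples for one sample.
    """
    counts = Counter(err for reviewer in reviews for err in reviewer)
    return {err: n for err, n in counts.items() if n >= min_reviewers}

# Hypothetical annotations from four reviewers of one utterance.
r1 = {("الصالحات", "sˤ", "s", "substitution")}
r2 = {("الصالحات", "sˤ", "s", "substitution"), ("ءامنوا", "ʔ", None, "deletion")}
r3 = {("الصالحات", "sˤ", "s", "substitution")}
r4 = set()
print(aggregate_reviews([r1, r2, r3, r4]))  # only the substitution survives the vote
```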

Fig. 16 Human evaluation result for an audio sample of the sentence "وَعَدَ ٱللَّهُ ٱلَّذِينَ ءَامَنُواْ وَعَمِلُواْ ٱلصَّالِحَاتِ" in the L2-KSU dataset

7.4 Human evaluation results

To compare the results of the human and automatic evaluations, we followed the same evaluation structure used for the model evaluation in Sect. 3-D [14]. We compared the human-transcribed phones with the target phones and canonical phones to measure the mispronunciation detection and diagnosis performance of the human perceptual test. The evaluation is based on the data subset assessed by the reviewers, which consists of 200 utterances from non-native speakers, evenly distributed between male and female speakers. Table 15 shows the error counts assigned by each reviewer.

Table 15 Statistical Summary of Error Count Detected by Individual Reviewers for Both Male and Female Speakers

Table 16 presents the diagnosis accuracy and detection accuracy of the human perceptual test in comparison with the evaluation of the Whisper-small model. We selected the Whisper-small model because it achieved the best evaluation results on the L2-KSU dataset for the MDD task. The results show that automatic verification outperforms human verification in both mispronunciation detection and diagnosis.

Table 16 Evaluation Results for Arabic MDD Based on Human Verification and Automated Verification

Detecting pronunciation errors is a sensitive task that varies from one individual to another and depends on the listener's concentration and linguistic background. In addition, some Arabic sounds share characteristics with other sounds. For example, the sounds (/s/ س) and (/sˤ/ ص) share the alveo-dental place of articulation as well as features such as softness (frication), voicelessness, and whistling (sibilance). These two sounds are pronounced similarly in some instances, leading some Arabic linguists to consider their interchange acceptable in specific contexts. In our experiment, we observed diversity in the participants' perception of this interchange, with some considering it a pronunciation error while others overlooked it. Similarly, the sounds (/ðˤ/ ظ) and (/ð/ ذ) share the same place of articulation and features such as voicing and softness, and some listeners found it challenging to distinguish between them in certain cases. Finally, some speakers produced sentences with multiple pronunciation errors, which made it difficult for listeners to identify and pinpoint all of them. From these results, it is evident that automated verification outperforms human verification in detecting and diagnosing Arabic pronunciation errors. Enhancing these techniques and increasing the amount of training data are therefore crucial for improving their quality and enabling Arabic language learners to benefit from them.

7.5 Inter-rater agreement

In this study, the evaluation of inter-rater agreement aimed to assess the consistency among the four reviewers tasked with detecting pronunciation errors in a subset of the L2-KSU dataset comprising 200 utterances from non-native Arabic speakers. The reviewers were selected based on their gender and educational background to ensure a comprehensive assessment. To quantify inter-rater agreement, we employed Cohen's kappa [77], a common statistic for inter-rater or intra-rater reliability testing. The kappa value ranges from −1 to +1, where values less than 0 indicate no agreement, 0.01–0.20 none to slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement [78]. Cohen's kappa quantifies the level of agreement between two raters, each classifying N items into C mutually exclusive categories. The formula for calculating K is as follows [77]:

$$K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
(4)

where Pr(a) refers to the observed agreement among raters, and Pr(e) refers to the hypothetical probability of chance agreement based on the marginal probabilities of the categories assigned by each rater. Furthermore, we generalized this measure to assess the overall agreement among all raters by using Fleiss' kappa, a statistic that extends Cohen's kappa to measure the reliability of agreement among more than two observers [79].
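A minimal implementation of Eq. (4), together with its Fleiss generalization, is sketched below for illustration; the per-phoneme judgements in the example are hypothetical. Established libraries such as scikit-learn (cohen_kappa_score) and statsmodels (fleiss_kappa) provide equivalent, more thoroughly tested routines.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same N items (Eq. 4)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n         # observed agreement Pr(a)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)  # chance agreement Pr(e)
    return (p_a - p_e) / (1 - p_e)

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i][c] = number of raters assigning item i to category c."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    p_j = [sum(row[c] for row in ratings) / (n_items * n_raters) for c in range(n_categories)]
    p_i = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1)) for row in ratings]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical per-phoneme judgements (1 = mispronounced, 0 = correct) from two reviewers:
print(cohens_kappa([1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]))
# Hypothetical 4-rater counts over three items and two categories (correct, mispronounced):
print(fleiss_kappa([[3, 1], [0, 4], [2, 2]]))
```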

The inter-rater agreements of the four reviewers are shown in Table 17. The first two pairings indicated a moderate level of agreement between the male and female reviewers. Similarly, the agreement between the reviewers with a Quranic education background was 44.6%, corresponding to a moderate level of agreement. The reviewers with a linguistic background obtained an agreement value of 60.0%, the highest among the pairings in this study, which still corresponds to a moderate level of agreement; this relatively high value can be attributed to their shared linguistic and phonetic backgrounds. Additionally, the overall agreement among all four reviewers, measured with Fleiss' kappa, was 50.9%, suggesting a moderate degree of concordance in their assessments.

Table 17 The Inter-Rater Agreement for Pair Reviewers Using Cohen’s Kappa and the Overall Agreement Among All Reviewers Using Fleiss’ Kappa

8 Conclusion

In this study, we investigated automatic speech recognition for the MDD task in the Arabic language, focusing on both native and non-native speakers. We also reviewed previous research on ASR and advanced methodologies, as well as MDD studies for Arabic and other languages. Owing to the limited resources of Arabic speech datasets containing non-native speakers, we constructed a speech dataset that specifically captures the differences in speech patterns between native and non-native Arabic speakers. The non-native participants came from countries in Central and West Africa and spoke closely related languages and dialects. Furthermore, we annotated the speech data at the word and phoneme levels, including pronunciation errors. Finally, we used the constructed L2-KSU dataset to demonstrate the effectiveness of state-of-the-art Transformer-based speech recognition models, namely Wav2Vec2.0 [11], HuBERT [12], and Whisper [13], against the recurrent CNN-RNN-CTC baseline in the context of the MDD task. Each model was trained to detect pronunciation errors in Arabic phonemes, and the detected errors were analyzed by categorizing them into insertion, deletion, and substitution errors. Our findings showed promising results in terms of both phoneme recognition performance and MDD. The phoneme error rate (PER) reached 3.1%, 3.2%, and 3.6% for Whisper, Wav2Vec2.0, and HuBERT, respectively, significantly outperforming the baseline rate of 13.2% and highlighting the accuracy and efficiency of Transformer-based models in recognizing phonemes. For the MDD task, the Whisper model achieved the best diagnosis and detection accuracy of 80.0% and 91.3%, compared with 55.4% and 73.8% for CNN-RNN-CTC. These findings indicate the efficiency of Transformer-based techniques in enhancing the detection of pronunciation errors in non-native Arabic speech. Since the nationalities of the non-native speakers in our dataset are closely related, we cannot generalize these findings to all non-native Arabic speakers; in the future, we aim to expand the dataset to include participants from other nationalities with different pronunciation patterns. In addition, we conducted a human perceptual test to assess the ability of humans to detect mispronunciation errors in the L2-KSU dataset and compared their evaluation with automatic verification. The evaluation results indicated that automatic verification outperformed human verification, with a detection accuracy of 97.0% and a diagnosis accuracy of 80.4%, whereas human perception achieved 90.5% detection accuracy and 66.0% diagnosis accuracy.

In future work, we aim to increase the volume of the L2-KSU dataset and to target other geographical regions in order to study differences in pronunciation patterns. Additionally, we aim to provide a real-time implementation of the proposed model to assess the feasibility of deploying Transformer-based models in real-time applications.