

Chinese Grammatical Error Correction Using Pre-trained Models and Pseudo Data

Published: 10 March 2023

Abstract

In recent studies, pre-trained models and pseudo data have been key factors in improving the performance of the English grammatical error correction (GEC) task. However, few studies have examined the role of pre-trained models and pseudo data in the Chinese GEC task. Therefore, we develop Chinese GEC models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART, and then incorporate these models with pseudo data to determine the best configuration for the Chinese GEC task. On the natural language processing and Chinese computing (NLPCC) 2018 GEC shared task test set, all our single models outperform the ensemble models developed by the top team of the shared task. Chinese BART achieves an \(\mathrm{F_{0.5}}\) score of 37.15, which is a state-of-the-art result. We then combine our Chinese GEC models with three kinds of pseudo data: Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation). We find that most models can benefit from pseudo data, and that BART+Lang-8 (MaskGEC) is the ideal setting in terms of accuracy and training efficiency. The experimental results demonstrate the effectiveness of pre-trained models and pseudo data on the Chinese GEC task and provide an easily reproducible and adaptable baseline for future work. Finally, we annotate the error types of the development data; the results show that word-level errors dominate all error types and that word selection errors must be addressed even when using pre-trained models and pseudo data. Our code is available at https://github.com/wang136906578/BERT-encoder-ChineseGEC.

1 Introduction

Grammatical error correction (GEC) is the task of correcting a variety of grammatical errors in text, typically written by non-native speakers. To date, many encoder–decoder (EncDec) models have been proposed for GEC and have achieved human-parity performance, particularly on several benchmark datasets for English [3, 9]. The key factors in this performance improvement are the use of pre-trained models [11, 12, 23] and pseudo data [14, 34]. EncDec models require a large amount of training data, but far less data is available for GEC than for machine translation (MT).
In contrast to the rapid progress of research on English GEC, few studies on Chinese GEC, where available data are even more limited than in English, have investigated methodologies for incorporating pre-trained models and pseudo data into GEC models. For pre-trained models, a limited number of studies have used BERT [1, 6, 16], although new Chinese pre-trained models are developed and released continuously [28, 29]. Moreover, Wang et al. [33] is one of the few studies that incorporated pseudo data into Chinese GEC models. Although they combined rule-based and backtranslation methods to generate pseudo data for Chinese GEC, the data used for generation are not public; hence, it is difficult to analyze the contribution of the pseudo data to the final performance. It has also been reported that suitable settings for pseudo data utilization in GEC vary depending on the language [20], suggesting that the best practices in English GEC cannot be directly applied to Chinese GEC.
This study comprehensively investigates methodologies for utilizing pre-trained models and pseudo data in Chinese GEC and provides the Chinese GEC community with an improved understanding of the incorporation of pre-trained models and pseudo data. Through extensive experiments with three large-scale pre-trained models (Chinese BERT [4], Chinese T5 [29], Chinese BART [28]), and three types of pseudo data (Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation)), we show that BART offers the best performance, and BART+Lang-8 (MaskGEC) is the ideal setting in terms of accuracy and training efficiency. Additionally, we annotate the error types of the development data; the results show that word-level errors dominate all error types, and word selection errors must be addressed even when incorporating pre-trained models and pseudo data.

2 Related Work

2.1 English GEC Using Pre-trained Models and Pseudo Data

For English GEC tasks, BERT [5] is primarily used as the pre-trained model to improve performance. Additionally, large-scale pseudo data have been shown to contribute to accuracy. In this subsection, we summarize the works that incorporate BERT and pseudo data into their correction models.
Pre-trained model as a feature: Kaneko et al. [10] first fine-tuned BERT on a learner corpus and then employed the word probabilities provided by BERT as re-ranking features. Using BERT for re-ranking features, they obtained an improvement of approximately 0.7 points in the \(\mathrm{F_{0.5}}\) score. By contrast, Kaneko et al. [11] first fine-tuned BERT on a grammatical error diagnosis task and then incorporated the fine-tuned BERT into the correction model by using a fusion method. They showed the effectiveness of BERT on the English GEC task and achieved comparatively high scores.
Pre-trained model in a pipeline: Kantor et al. [12] used BERT to solve the GEC task by iteratively querying BERT as a black-box language model. They added a [MASK] token into source sentences and predicted the word represented by the [MASK] token. If the word probability predicted by BERT exceeded a threshold, the word was output as a correction candidate. Using BERT, they obtained an improvement of 0.27 points in the \(\mathrm{F_{0.5}}\) score. Omelianchuk et al. [23] treated the GEC problem as a sequence editing problem. They used a BERT-based pre-trained model to predict the edit operations for an erroneous sentence, and the predicted edit operations were then used to correct the erroneous sentence.
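To make the mask-and-predict idea concrete, the following sketch masks each position in turn and proposes the masked language model's top prediction as a correction only when its probability exceeds a threshold. It is an illustration of the general approach rather than the system of Kantor et al. [12]; the checkpoint name and the threshold value are our own assumptions.

```python
# Illustrative mask-and-predict correction with a masked LM (not the exact
# system of Kantor et al. [12]); checkpoint and threshold are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def propose_corrections(sentence, threshold=0.9):
    """Mask each token, query the LM, and keep only confident replacements."""
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    proposals = []
    for i in range(1, input_ids.size(0) - 1):        # skip [CLS] and [SEP]
        masked_ids = input_ids.clone()
        original_id = masked_ids[i].item()
        masked_ids[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked_ids.unsqueeze(0)).logits[0, i]
        probs = torch.softmax(logits, dim=-1)
        prob, idx = probs.max(dim=-1)
        # Propose a replacement only if the LM is confident and the token changes.
        if prob.item() > threshold and idx.item() != original_id:
            proposals.append((i,
                              tokenizer.convert_ids_to_tokens(original_id),
                              tokenizer.convert_ids_to_tokens(idx.item()),
                              round(prob.item(), 3)))
    return proposals

print(propose_corrections("He go to school every day ."))
```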
Generating Pseudo Data for GEC: Xie et al. [34] proposed a method for generating pseudo data based on backtranslation. They first trained a backtranslation model on GEC data and then applied it to a clean monolingual corpus to acquire pseudo data. They also adopted noising methods to generate diverse pseudo data. Kiyono et al. [14] conducted experiments focusing on three aspects: generation methods, selection of the seed corpus, and optimization settings for English GEC pseudo data. They showed how pseudo data should be generated and used for the English GEC task and achieved state-of-the-art performance at the time of publication.

2.2 Chinese GEC

In this subsection, we first describe the NLPCC 2018 Chinese GEC dataset and then provide details about several methods that have been evaluated on this dataset.
Given the success of the shared tasks on English GEC at the Conference on Natural Language Learning (CoNLL) [21, 22], a Chinese GEC shared task was introduced at NLPCC 2018. In this task, approximately one million sentences from the language learning website Lang-8 were used as training data, and two thousand sentences from the PKU Chinese Learner Corpus [38] were used as test data. The supervised GEC models trained on this dataset fall into two groups: simple models and complex models. Simple models are easy to understand and use but are less effective, whereas complex models achieve high accuracy but are hard to maintain. Our approaches, using pre-trained models and pseudo data, offer the best of both worlds; they are simple and effective.
Simple models: Ren et al. [26] utilized a convolutional neural network (CNN), similar to that of Chollampatt and Ng [3]. However, because the structure of the CNN differs from that of BERT, it cannot be initialized with the weights learned by BERT. Zhao and Wang [39] proposed a dynamic masking method that replaces tokens in the source sentences of the NLPCC 2018 GEC shared task training data with other tokens (e.g., the [PAD] token). They achieved comparatively high scores on the shared task without using any extra knowledge. This is a data augmentation method that can be combined with methods that utilize pre-trained models.
Complex models: Fu et al. [7] combined a 5-gram language model-based spell checker with subword-level and character-level encoder–decoder models using Transformer to obtain five types of outputs. Then, they re-ranked these outputs using the language model. Although they reported high performance, several models were required, and their method of combining these models was complex. Hinson et al. [8] proposed a heterogeneous approach for Chinese GEC. In their method, an erroneous sentence is corrected using a spell checker model, a sequence editing model, and a sequence-to-sequence model in multiple rounds. Before their work, only sequence-to-sequence models were used for recycle generation in Chinese GEC [24]. They also used an automatic annotator to annotate four error types and evaluated model performance on these error types. However, they only used character-level edit operations as the error types, which may not adequately reflect the nature of Chinese grammatical errors. Chen et al. [2] proposed a method comprising two parts: a sequence tagging error detection model and a sequence-to-sequence error correction model. The sequence tagging model identifies the erroneous text spans in the source sentence, and the detected text spans are then fed into the sequence-to-sequence model, which corrects them. Their method performs comparably to conventional sequence-to-sequence methods at less than 50% of the inference time cost. Sun et al. [30] proposed a shallow aggressive decoding method to improve the online inference speed of GEC models. Their approach offers a 12.0\(\times\) online inference speedup over the baseline model on the Chinese GEC task.

3 Chinese Pre-trained Models

We adopt three pre-trained models to construct our Chinese GEC models: Chinese BERT built by Cui et al. [4], Chinese T5 built by Su [29], and Chinese BART built by Shao et al. [28]. These pre-trained models were originally proposed by Devlin et al. [5], Raffel et al. [25], and Lewis et al. [15], respectively. The main details and differences among the three pre-trained models are summarized in Table 1. All the details pertain to the Chinese variants [4, 28, 29] rather than the original models [5, 15, 25]. From the table, we can observe that the three pre-trained models differ in the following aspects:
Table 1.
            BERT                       T5                         BART
Arch.       Transformer Encoder,       Full Transformer,          Full Transformer,
            12-layer, 768-hidden,      12-layer, 768-hidden,      12-layer, 1024-hidden,
            12-head                    12-head                    16-head
Param.      110M                       275M                       406M
Tok.        Character                  Word/Character             Character
Vocab.      21,128                     50,000                     21,128
Mask        Whole Word Masking         -                          Token Infilling
Task        Masked Language Model      Summarization              Denoising Auto-Encoding
Data        5.4B Tokens                30 GB                      200 GB
Table 1. Summary of Pre-trained Models used in our Study
The architecture, number of parameters, tokenization, vocabulary size, masking strategy, pre-training task, and pre-training data size are presented.
Architecture and Number of Parameters: BERT has the fewest parameters because it uses only a Transformer encoder. Note that we initialize the encoder side of a full Transformer with BERT in the following experiments; hence, the total number of parameters is larger than 110M. T5 and BART adopt the full Transformer architecture, and BART has the largest number of parameters because it has the largest hidden size and number of attention heads.
Tokenization: The tokenization of BERT and BART is character-based: all Chinese strings are divided into characters. The tokenization of T5 is word/character-based; it uses a vocabulary that contains the 50,000 most frequent words, and any out-of-vocabulary word is divided into characters.
Masking Strategy: BERT adopts a masking method called whole word masking (WWM). In WWM, when a Chinese character is masked, other Chinese characters that belong to the same word are also masked. BART adopts a masking strategy called token infilling, in which the whole word is replaced by a single [MASK] token when a character is masked. T5 does not employ a masking strategy.
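To make the difference concrete, the toy sketch below (our own illustration, not the actual pre-training code) shows how the word 天气 in the character sequence 天 气 预 报 is masked under each strategy.

```python
# Toy illustration of whole word masking (BERT) versus token infilling (BART).
chars = ["天", "气", "预", "报"]   # character-level tokens
word_spans = [(0, 2), (2, 4)]      # word boundaries: 天气, 预报

def whole_word_mask(tokens, span):
    """WWM: every character of the chosen word gets its own [MASK]."""
    start, end = span
    return tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:]

def token_infilling(tokens, span):
    """Token infilling: the whole span is replaced by a single [MASK]."""
    start, end = span
    return tokens[:start] + ["[MASK]"] + tokens[end:]

print(whole_word_mask(chars, word_spans[0]))   # ['[MASK]', '[MASK]', '预', '报']
print(token_infilling(chars, word_spans[0]))   # ['[MASK]', '预', '报']
```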
Pre-training Task: For BERT, Cui et al. [4] removed the next sentence prediction task and used only the masked language model task, following Liu et al. [17]. T5 adopts summarization as the pre-training task, following Zhang et al. [37]; in this task, the input is a document and the output is its summary. BART employs a pre-training task called denoising autoencoding (DAE), in which the model reconstructs the original document from a corrupted input.
Pre-training Data: BERT uses Chinese Wikipedia (0.4B tokens) and an extended corpus (5.0B tokens) consisting of Baidu Baike (a Chinese encyclopedia) and question-and-answer data during pre-training. T5 uses 30 GB of pre-training data collected from the internet. BART uses 200 GB of text from Chinese Wikipedia and part of WuDaoCorpora [36].

4 Generating Pseudo Data for Chinese GEC

Although the pseudo data generation method for English GEC has been extensively studied [14], few studies have examined the effect of pseudo data on the performance of a seq2seq model for Chinese GEC. Wang et al. [33] combined both rule-based and backtranslation methods to generate the pseudo data. However, they mixed the pseudo data generated by the two methods and used non-public data for generation; hence, it is difficult to analyze the contribution of pseudo data to the final performance. Therefore, we conducted a series of thorough experiments to investigate the effects of pseudo data generated via rule-based and backtranslation methods when they are combined with the pre-trained models.

4.1 Rule-based Method (MaskGEC)

We used a rule-based method called MaskGEC [39] to generate the rule-based pseudo data. The pseudo data are generated by replacing tokens in the original sentence. There are four token replacement strategies: (1) the selected token is substituted with a padding symbol; (2) the selected token is substituted with a random token from the vocabulary; (3) the selected token is substituted with a token from the vocabulary sampled according to frequency; and (4) the selected token is substituted with a homophone sampled according to frequency. For every sentence in the corpus used to generate pseudo data, one of the four strategies is randomly applied to the sentence; each character in the sentence is selected as a substitution candidate with probability \(\delta\). To help readers understand this algorithm, pseudocode is included in the supplementary materials, and a simplified sketch is given below. Dynamic and static masking strategies are also possible. In the dynamic masking strategy, the pseudo data are regenerated in every epoch; hence, each training instance may be seen with a different mask in different epochs. In the static masking strategy, the pseudo data are generated only once; hence, each training instance remains unchanged across epochs. We adopt the static masking strategy in this work for simplicity, because our experimental results show no apparent difference between the two strategies.
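The sketch below shows a minimal static MaskGEC-style noiser. The vocabulary, frequency counts, and homophone table are assumed inputs, and the sketch is an illustration of the procedure described above rather than the implementation of Zhao and Wang [39].

```python
# Minimal sketch of static MaskGEC-style noising; vocab, freqs, and homophones
# are assumed inputs, not part of the original implementation.
import random

PAD = "[PAD]"

def noise_sentence(chars, vocab, freqs, homophones, delta=0.1):
    """Return a noised copy of `chars` using one randomly chosen strategy."""
    strategy = random.choice(["pad", "random", "frequency", "homophone"])
    tokens, weights = zip(*freqs.items())            # unigram frequencies
    noised = []
    for ch in chars:
        if random.random() >= delta:                 # keep with probability 1 - delta
            noised.append(ch)
        elif strategy == "pad":
            noised.append(PAD)
        elif strategy == "random":
            noised.append(random.choice(vocab))
        elif strategy == "frequency":
            noised.append(random.choices(tokens, weights=weights, k=1)[0])
        else:                                        # homophone, sampled by frequency
            candidates = homophones.get(ch, [ch])
            cand_weights = [freqs.get(c, 1) for c in candidates]
            noised.append(random.choices(candidates, weights=cand_weights, k=1)[0])
    return noised

def build_static_pseudo_data(corpus, vocab, freqs, homophones, delta=0.1):
    """Static masking: generate the noised source once and reuse it every epoch."""
    return [(noise_sentence(list(sent), vocab, freqs, homophones, delta), list(sent))
            for sent in corpus]
```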

4.2 Backtranslation Method

Backtranslation was originally proposed by Sennrich et al. [27] to generate pseudo data for the machine translation task. In the GEC setting, the input of the backtranslation model is a correct sentence, and the output is an erroneous sentence. Following Xie et al. [34], we first train a backtranslation model using the Chinese GEC training data. Then, we apply the backtranslation model to a seed corpus to generate the pseudo data. For inference, we adopt the random noising method of Xie et al. [34] to generate diverse noise. Every hypothesis is penalized by adding \(r\beta_{\mathit{random}}\) to its score, where \(r\) is drawn uniformly from the interval [0, 1] and \(\beta_{\mathit{random}}\) is a hyper-parameter greater than or equal to 0. A sufficiently large \(\beta_{\mathit{random}}\) results in a random shuffling of the ranks of the hypotheses, whereas \(\beta_{\mathit{random}}=0\) is identical to standard backtranslation. We set \(\beta_{\mathit{random}}=6\) following Kiyono et al. [14].
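The noised selection step can be sketched as follows. We assume the hypotheses come from an arbitrary seq2seq beam search and that scores are costs where lower is better (e.g., negative log-probabilities); this convention is our own assumption for the illustration.

```python
# Sketch of noised hypothesis selection for backtranslation; the cost convention
# (lower is better) is an assumption made for this illustration.
import random

def noisy_select(hypotheses, beta_random=6.0):
    """hypotheses: list of (sentence, cost) pairs from beam search.
    Each cost is penalized by r * beta_random with r ~ U[0, 1];
    beta_random = 0 recovers standard backtranslation."""
    noised = [(sentence, cost + random.random() * beta_random)
              for sentence, cost in hypotheses]
    return min(noised, key=lambda pair: pair[1])[0]
```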

5 Experiments

To investigate methodologies for utilizing pre-trained models and pseudo data in Chinese GEC, we design our experiments by using three large-scale pre-trained models (described in Section 3) and two pseudo data generation methods (described in Section 4).

5.1 Data

We train and evaluate our models using the data provided by the NLPCC 2018 GEC shared task. We first segment all sentences into characters because the Chinese pre-trained models that we use are character-based. The training data consist of 1.2 million sentence pairs extracted from the language learning website Lang-8. Because the NLPCC 2018 GEC shared task does not provide development data, we randomly extracted 5,000 sentences from the training data as development data, following Ren et al. [26]. The test data consist of 2,000 sentences extracted from the PKU Chinese Learner Corpus. According to Zhao et al. [38], the annotation guidelines follow the minimum edit distance principle [19], which selects the edit operations that minimize the edit distance from the original sentence.
Following Zhao and Wang [39], we use the source side of the NLPCC 2018 GEC training data to generate pseudo data. Before generation, we used a tokenization script from the BERT project to tokenize the Chinese texts into characters while keeping non-Chinese tokens unchanged. We set the substitution probability \(\delta =0.1\), which achieves the best perplexity on the development data (the perplexity for each \(\delta\) is depicted in Figure 2 of the supplementary materials). This differs from Zhao and Wang [39], who set \(\delta =0.3\), which achieved the best \(\mathrm{F_{0.5}}\) score on the test set. We name the generated pseudo data Lang-8 (MaskGEC) in the remainder of this article. Note that we did not perform backtranslation for Lang-8 because there are only a few unannotated Chinese learners' sentences in the Lang-8 corpus.
We also utilize Chinese Wikipedia as a seed corpus to generate pseudo data. We download the preprocessed Chinese Wikipedia data from nlp_chinese_corpus and use tools from Chinese-wikipedia-corpus-creator to split the documents into sentences. We acquire approximately nine million sentences after these steps. Then, we apply the rule-based and backtranslation methods to those Wikipedia sentences, treating the generated noisy sentences as erroneous sentences and the original Wikipedia sentences as correct sentences. We call the pseudo data generated by the rule-based method Wiki (MaskGEC) and the pseudo data generated by the backtranslation method Wiki (Backtranslation).

5.2 Model

We used Transformer as our baseline model. Transformer offers excellent performance in sequence-to-sequence tasks, such as machine translation, and has been widely adopted in recent studies on English GEC [9, 14].
A BERT-based pre-trained model only uses the encoder of Transformer; therefore, it cannot be directly applied to sequence-to-sequence tasks that require both an encoder and a decoder, such as GEC. Hence, we initialized the encoder of Transformer with the parameters of Chinese BERT; the decoder is initialized randomly. Finally, we train the initialized model on Chinese GEC data.
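The following sketch illustrates this initialization, assuming the Hugging Face transformers library and the publicly released hfl/chinese-bert-wwm-ext checkpoint of Cui et al. [4]; the authors' actual implementation (BERT-encoder-ChineseGEC) may differ in detail.

```python
# Sketch: pre-trained BERT encoder + randomly initialized decoder.
# Library and checkpoint are assumptions; not the exact implementation.
from transformers import BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel

encoder = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")   # pre-trained encoder

decoder_config = BertConfig.from_pretrained("hfl/chinese-bert-wwm-ext")
decoder_config.is_decoder = True             # causal self-attention
decoder_config.add_cross_attention = True    # attend to encoder outputs
decoder = BertLMHeadModel(decoder_config)    # randomly initialized decoder

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
# Before generation, decoder_start_token_id and pad_token_id must also be set
# on model.config; the model is then fine-tuned on character-level GEC pairs.
```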
As for Chinese T5 and Chinese BART, because they are both encoder–decoder architectures, we could fine-tune them on the Chinese GEC dataset.
Finally, we have the following models trained on different data:
Baseline: A plain Transformer model that is initialized randomly without using a pre-trained model. This model is trained on the original Lang-8 data.
BERT-encoder, T5, BART: The models fine-tuned on the original Lang-8 data.
Baseline, BERT-encoder, T5, BART + Lang-8 (MaskGEC): The models are fine-tuned on Lang-8 (MaskGEC) pseudo data.
Baseline, BERT-encoder, T5, BART + Wiki (MaskGEC): The models are first warmed up on Wiki (MaskGEC) pseudo data until convergence and then fine-tuned on Lang-8 (MaskGEC) pseudo data. We adopt this setting to avoid a mismatch in the appearance of the [MASK] token between the two fine-tuning steps.
Baseline, BERT-encoder, T5, BART + Wiki (Backtranslation): The models are first warmed up on Wiki (Backtranslation) pseudo data until convergence and then fine-tuned on the original Lang-8 data.
We implement the Baseline, BERT-encoder, T5, and BART models based on the following projects, respectively: awesome-transformer, BERT-encoder-ChineseGEC, t5-pegasus-chinese, and CPT. Readers can refer to these projects and the supplementary materials for more implementation details.
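For readers who prefer a single self-contained example, the condensed sketch below fine-tunes a Chinese BART checkpoint on GEC sentence pairs with the Hugging Face transformers library. The fnlp/bart-base-chinese checkpoint and all hyper-parameters are illustrative assumptions; the actual training setups follow the projects listed above.

```python
# Condensed fine-tuning sketch; checkpoint and hyper-parameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")  # char-level vocab
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def collate(batch):
    """batch: list of (erroneous_sentence, corrected_sentence) string pairs."""
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    enc["labels"] = labels
    return enc

def finetune(pairs, epochs=3, batch_size=32):
    loader = DataLoader(pairs, batch_size=batch_size, shuffle=True, collate_fn=collate)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```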

5.3 Evaluation

Because the evaluation is performed on word units, we strip all delimiters from the system output sentences and segment the sentences using the pkunlp toolkit provided by the NLPCC 2018 GEC shared task.
Following the setup of the NLPCC 2018 GEC shared task, the evaluation is conducted using MaxMatch (\(M^2\)). The MaxMatch algorithm computes the phrase-level edits between the source sentence and the system output and then finds the overlaps between the system edits and the gold edits.
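For illustration, once the system edits and gold edits have been extracted, precision, recall, and \(\mathrm{F_{0.5}}\) follow from their overlap as in the toy sketch below; the real \(M^2\) scorer additionally searches for the system edit set that maximizes this overlap, which we do not reproduce here.

```python
# Toy sketch of P/R/F0.5 from edit overlap (the edit extraction itself is omitted).
def f_beta(system_edits, gold_edits, beta=0.5):
    """Each edit is a (start, end, replacement) tuple over word positions."""
    system_set, gold_set = set(system_edits), set(gold_edits)
    tp = len(system_set & gold_set)
    p = tp / len(system_set) if system_set else 1.0
    r = tp / len(gold_set) if gold_set else 1.0
    if p + r == 0:
        return 0.0, p, r
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return f, p, r

# Example: one of two gold edits is reproduced, plus one spurious system edit.
gold = [(2, 3, "特别"), (5, 6, "的")]
system = [(2, 3, "特别"), (7, 8, "了")]
print(f_beta(system, gold))   # (0.5, 0.5, 0.5)
```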

5.4 Evaluation Results

Table 2 summarizes the experimental results of our models. We run the single models three times and report the average score. For comparison, we also cite the results of recent works [8, 39] as well as those of the models developed by two teams [7, 26] in the NLPCC 2018 GEC shared task.
Table 2.
Original Data            P      R      F0.5     Lang-8 (MaskGEC)     P      R      F0.5
Baseline                 37.78  16.99  30.23    Baseline             36.46  23.03  32.66
BERT-encoder             39.78  20.84  33.66    BERT-encoder         36.02  24.10  32.77
T5                       41.61  20.22  34.34    T5                   39.07  24.10  34.73
BART                     39.50  30.01  37.15    BART                 41.08  32.18  38.93
Hinson et al. [8]        36.79  27.82  34.56    Zhao and Wang [39]
Fu et al. [7]            35.24  18.64  29.91    - dynamic masking    44.36  22.18  36.97
Ren et al. [26]          41.73  13.08  29.02    - static masking     43.73  21.71  36.35
Ren et al. [26] (4-ens)  47.63  12.56  30.57

Wiki (Backtrans.)        P      R      F0.5     Wiki (MaskGEC)       P      R      F0.5
Baseline                 36.28  21.14  31.74    Baseline             34.03  21.18  30.33
BERT-encoder             37.66  22.76  33.29    BERT-encoder         35.23  24.50  32.39
T5                       42.61  20.56  35.08    T5                   39.35  24.84  35.23
BART                     40.32  30.68  37.94    BART                 41.41  31.79  39.05
Table 2. Experimental Results of the NLPCC 2018 GEC Shared Task
The upper rows of the first table show the results of our models, and the lower rows show the results of previous works; the left half reports models trained on the original data and the right half reports models trained on the Lang-8 (MaskGEC) pseudo data. The second table presents the results of our models trained on the Wiki (Backtranslation) and Wiki (MaskGEC) pseudo data.
For the models trained on the original data, all our models that use a pre-trained model outperform the baseline model and the models of the two teams in the NLPCC 2018 GEC shared task, indicating the effectiveness of adopting pre-trained models. Moreover, the BART model yields an \(\mathrm{F_{0.5}}\) score roughly seven points higher than that of the baseline model and achieves the best result among all models. This result indicates the effectiveness of BART owing to its larger number of parameters, larger pre-training data, and a pre-training task that is better suited to GEC.
Regarding the two recent related works, Zhao and Wang [39] adequately balanced precision and recall, achieving a comparatively high \(\mathrm{F_{0.5}}\) score, and Hinson et al. [8] achieved a comparatively high recall. However, our BART model still exceeds both results. This finding is in agreement with Katsumata and Komachi [13], who showed that BART is a simple but strong baseline for English, German, and Czech (see footnote 13).
When the models are combined with pseudo data, almost all models (except BERT-encoder) benefit from the Lang-8 (MaskGEC) pseudo data. These results confirm the effectiveness of the approach of Zhao and Wang [39], which uses only Lang-8 to generate pseudo data with a rule-based method. Comparing models trained on Lang-8 (MaskGEC) and Wiki (MaskGEC), we find no significant differences, although the latter uses 10\(\times\) more pseudo data. Additionally, comparing Wiki (MaskGEC) and Wiki (Backtranslation) gives a mixed result: MaskGEC is better for T5 and BART, whereas backtranslation is better for the baseline and BERT-encoder. Considering the training cost and final performance, incorporating the pre-trained models with the Lang-8 (MaskGEC) pseudo data is the ideal setting at the present stage.

6 Analysis

To thoroughly understand the performance of models in the Chinese GEC setting, we conduct qualitative and quantitative analysis.

6.1 System Output

Table 3 presents sample outputs of the models trained in our experiments.
Table 3.
Table 3. Source Sentence, Gold Edit, and Output from Baseline, BERT, T5, and BART Models
In the first example, the spelling error 持别 is accurately corrected to 特别 (which means especially) by all the pre-trained models, whereas it is not corrected by the baseline model. This example shows the effectiveness of pre-trained models for handling easy errors.
In the second example, the models are required to change the word order (没有说服, which means didn't convince), delete the redundant word 对 (to), and insert the missing word 的 (of). T5 and BART perform well in this case; their outputs are nearly the same as the gold correction, except that the missing word is not inserted. This may be because T5 and BART have a pre-trained decoder that makes the output more fluent.
In the third example, the output of T5 differs from the others: it copies the original source sentence and appends its correction after the source sentence. This is because the Lang-8 training data contain noise in which some native speakers copy the original erroneous sentence and append their corrections or comments after it [18], and T5, being pre-trained on a summarization task, is sensitive to length changes and tends to pick up this pattern. Compared with T5, BART gives an ideal correction that is almost the same as the gold correction. Considering that BART is pre-trained with a DAE task, which is similar to the GEC task, we conclude that pre-trained models whose pre-training task resembles the downstream task should be preferred.
In the last example, we observe that BART rewrites the sentence, and its output is more fluent than the gold correction. The meaning of BART's output is "Happiness is a good medicine that can cure millions of diseases." Compared with "mood can be effective for diseases" in the gold correction, "good medicine can cure diseases" is more fluent. Moreover, 情绪 (mood) appears twice in the gold correction, which makes the whole sentence verbose. However, this type of fluent edit may hurt precision because the gold correction follows the principle of minimum edit distance [38]. This motivates us to propose and evaluate models on a new Chinese GEC dataset, such as that of Wang et al. [32], which can evaluate fluency edits appropriately.

6.2 Error Type

To understand the error distribution of Chinese GEC, we annotated 66 sentences of the development data and obtained 100 errors (one sentence may contain more than one error). We referred to the annotation of the HSK learner corpus and adopted the eight most frequent categories of errors in the corpus: B, CC, CD, CQ, CJwo, CJ+, CJ-, and CJetc. B denotes character-level errors, which are primarily spelling errors. CC, CD, and CQ are word-level errors: word selection, redundant word, and missing word errors, respectively. CJ denotes sentence-level errors, which contain several complex errors: CJwo denotes word order errors, CJ+ and CJ- denote redundant and missing sentence constituent errors, and CJetc contains other sentence-level errors such as the wrong usage of 有 (there is) and 是 (is). Several examples are presented in Table 4. Based on the number of errors, it is evident that word-level errors (CC, CQ, and CD) are the most frequent.
Table 4.
Error Type                               Frequency  Examples
B (spell)                                8          关主{关注} 天气 预报 。 (Pay attention to the weather forecast.)
CC (word selection)                      28         古书店{旧书店} , 买 了 十 本 书 。 (Bought ten books at a second-hand bookstore.)
CD (redundant word)                      8          我 很 喜欢 {NONE} 读 小说 。 (I like to read novels.)
CQ (missing word)                        24         在 上海 我 总是 住 NONE{} 一家 特定 NONE{} 酒店 。 (I always stay in the same hotel in Shanghai.)
CJwo (word order)                        10         我 决定 学习 努力{努力 学习} 。 (I decided to study hard.)
CJ+ (redundant constituents)             3          去年 我 到 克拉克夫 来了{NONE} 读书 。 (I went to study in Krakow last year.)
CJ- (missing constituents)               9          NONE{打算} 在 夏天 好好学 汉语 。 (I plan to study Chinese hard in the summer.)
CJetc (other sentence-level errors)      10         {今年} 23 岁 。 (I am 23 years old.)
Table 4. Examples of Each Error Type
The underlined tokens are detected errors that should be replaced with the tokens in braces.
Figure 1 presents the correction results of the three pre-trained models for each error type. We report the recall score here for simplicity; it reflects the proportion of gold edits that are reproduced by the systems. The results indicate that BART offers the best performance among the three pre-trained models on every error type. This is consistent with the evaluation results presented in Section 5.4 and shows the effectiveness of BART. All the systems achieve a comparatively high score on B (spelling errors) and CJetc (other sentence-level errors), showing that these two error types are relatively easy for the systems. By contrast, considering that the systems perform poorly on CC (word selection errors), which is the most frequent error type, we conclude that CC is the most crucial error type and must be addressed in future work. We also observe that T5 performs better than BERT on CQ (missing word errors) and CJwo (word order errors). This may be because T5 has a pre-trained decoder and thus can handle errors related to word insertion and word order more effectively.
Fig. 1. Recall score of three pre-trained models on each error type.

7 Conclusion

In this study, we developed Chinese grammatical error correction (GEC) models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART. Among these models, Chinese BART achieved state-of-the-art results. The experimental results demonstrated the usefulness of pre-trained models on the Chinese GEC task. We combined the pre-trained model with pseudo data and found that the BART+Lang-8 (MaskGEC) was the ideal setting in terms of accuracy and training efficiency. Additionally, the error type analysis showed that word selection errors remain to be addressed.
The majority of the methods proposed in the NLPCC 2018 GEC shared task are simply based on methods for English GEC; however, Chinese GEC has its own characteristics. For example, spelling errors primarily arise from glyph and pronunciation similarity, and sentence-level errors often depend on word position. Therefore, we plan to study and improve the Chinese GEC system while considering these characteristics, for example, by incorporating glyph and pronunciation information into the system [35] or by adopting neural models whose positional embeddings capture word order more effectively [31].

Acknowledgments

We would like to thank all editors and reviewers for their constructive comments and kind help.

Footnotes

13. Note that they did not conduct an experiment with Chinese, and they reported that the pre-trained model alone (without any pseudo data) did not yield satisfactory performance for Russian.

Supplementary Material

tallip-22-0263-File002 (tallip-22-0263-file002.zip)
Supplementary material

References

[1]
Yongchang Cao, Liang He, Robert Ridley, and Xinyu Dai. 2020. Integrating BERT and score-based feature gates for Chinese grammatical error diagnosis. In NLPTEA. 49–56.
[2]
Mengyun Chen, Tao Ge, Xingxing Zhang, Furu Wei, and Ming Zhou. 2020. Improving the efficiency of grammatical error correction with erroneous span detection and correction. In EMNLP. 7162–7169.
[3]
Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In AAAI. 5755–5762.
[4]
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Findings of EMNLP.
[5]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. 4171–4186.
[6]
Meiyuan Fang, Kai Fu, Jiping Wang, Yang Liu, Jin Huang, and Yitao Duan. 2020. A hybrid system for NLPTEA-2020 CGED shared task. In NLPTEA. 67–77.
[7]
Kai Fu, Jin Huang, and Yitao Duan. 2018. Youdao’s winning solution to the NLPCC-2018 task 2 challenge: A neural machine translation approach to Chinese grammatical error correction. In NLPCC. 341–350.
[8]
Charles Hinson, Hen-Hsen Huang, and Hsin-Hsi Chen. 2020. Heterogeneous recycle generation for Chinese grammatical error correction. In COLING. 2191–2201.
[9]
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In NAACL-HLT. 595–606.
[10]
Masahiro Kaneko, Kengo Hotate, Satoru Katsumata, and Mamoru Komachi. 2019. TMU transformer system using BERT for re-ranking at BEA 2019 grammatical error correction on restricted track. In BEA. 207–212.
[11]
Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2020. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In ACL. 4248–4254.
[12]
Yoav Kantor, Yoav Katz, Leshem Choshen, Edo Cohen-Karlik, Naftali Liberman, Assaf Toledo, Amir Menczel, and Noam Slonim. 2019. Learning to combine grammatical error corrections. In BEA. 139–148.
[13]
Satoru Katsumata and Mamoru Komachi. 2020. Stronger baselines for grammatical error correction using a pretrained encoder-decoder model. In AACL-IJCNLP. 827–832.
[14]
Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, and Kentaro Inui. 2019. An empirical study of incorporating pseudo data into grammatical error correction. In EMNLP-IJCNLP. 1236–1242.
[15]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL. 7871–7880.
[16]
Deng Liang, Chen Zheng, Lei Guo, Xin Cui, Xiuzhang Xiong, Hengqiao Rong, and Jinpeng Dong. 2020. BERT enhanced neural machine translation and sequence tagging model for Chinese grammatical error diagnosis. In NLPTEA. 57–66.
[17]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv abs/1907.11692 (2019), 13 pages.
[18]
Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In IJCNLP. 147–155.
[19]
Ryo Nagata and Keisuke Sakaguchi. 2016. Phrase structure annotation and parsing for learner English. In ACL. 1837–1847.
[20]
Jakub Náplava and Milan Straka. 2019. Grammatical error correction in low-resource scenarios. In W-NUT. 346–356.
[21]
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL. 1–14.
[22]
Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In CoNLL. 1–12.
[23]
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In BEA. 163–170.
[24]
Zhaoquan Qiu and Youli Qu. 2019. A two-stage model for Chinese grammatical error correction. IEEE Access 7 (2019), 146772–146777.
[25]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, 140 (2020), 1–67.
[26]
Hongkai Ren, Liner Yang, and Endong Xun. 2018. A sequence to sequence learning for Chinese grammatical error correction. In NLPCC. 401–410.
[27]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In ACL. 86–96.
[28]
Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, and Xipeng Qiu. 2021. CPT: A pre-trained unbalanced transformer for both Chinese language understanding and generation. arXiv abs/2109.05729 (2021), 9 pages.
[29]
Jianlin Su. 2021. T5 PEGASUS - ZhuiyiAI. Technical Report. https://github.com/ZhuiyiTechnology/t5-pegasus.
[30]
Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. 2021. Instantaneous grammatical error correction with shallow aggressive decoding. In ACL-IJCNLP. 5937–5947.
[31]
Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In ICLR. 15 pages.
[32]
Yingying Wang, Cunliang Kong, Liner Yang, Yijun Wang, Xiaorong Lu, Renfen Hu, Shan He, Zhenghao Liu, Yuxiang Chen, Erhong Yang, and Maosong Sun. 2021. YACLC: A Chinese learner corpus with multidimensional annotation. arXiv abs/2112.15043 (2021).
[33]
Yi Wang, Ruibin Yuan, Yan'gen Luo, Yufang Qin, NianYong Zhu, Peng Cheng, and Lihuan Wang. 2020. Chinese grammatical error correction based on hybrid models with data augmentation. In BEA. 78–86.
[34]
Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Jurafsky. 2018. Noising and denoising natural language: Diverse backtranslation for grammar correction. In NAACL-HLT. 619–628.
[35]
Heng-Da Xu, Zhongli Li, Qingyu Zhou, Chao Li, Zizhen Wang, Yunbo Cao, Heyan Huang, and Xian-Ling Mao. 2021. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking. In Findings of ACL. 13 pages.
[36]
Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. 2021. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open 2 (2021), 65–68.
[37]
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In ICML. 12 pages.
[38]
Yuanyuan Zhao, Nan Jiang, Weiwei Sun, and Xiaojun Wan. 2018. Overview of the NLPCC 2018 shared task: Grammatical error correction. In NLPCC. 439–445.
[39]
Zewei Zhao and Houfeng Wang. 2020. MaskGEC: Improving neural grammatical error correction via dynamic masking. In AAAI. 1226–1233.
