

Chinese Grammatical Error Correction Using Pre-trained Models and Pseudo Data

Published: 10 March 2023

Abstract

In recent studies, pre-trained models and pseudo data have been key factors in improving the performance of the English grammatical error correction (GEC) task. However, few studies have examined the role of pre-trained models and pseudo data in the Chinese GEC task. Therefore, we develop Chinese GEC models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART, and then incorporate these models with pseudo data to determine the best configuration for the Chinese GEC task. On the natural language processing and Chinese computing (NLPCC) 2018 GEC shared task test set, all our single models outperform the ensemble models developed by the top team of the shared task. Chinese BART achieves an \(\mathrm{F_{0.5}}\) score of 37.15, which is a state-of-the-art result. We then combine our Chinese GEC models with three kinds of pseudo data: Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation). We find that most models can benefit from pseudo data, and that BART+Lang-8 (MaskGEC) is the ideal setting in terms of accuracy and training efficiency. The experimental results demonstrate the effectiveness of pre-trained models and pseudo data on the Chinese GEC task and provide an easily reproducible and adaptable baseline for future work. Finally, we annotate the error types of the development data; the results show that word-level errors dominate all error types and that word selection errors must be addressed even when using pre-trained models and pseudo data. Our code is available at https://github.com/wang136906578/BERT-encoder-ChineseGEC.

1 Introduction

Grammatical error correction (GEC) is the task of correcting a variety of grammatical errors in text, typically written by non-native speakers. To date, many encoder–decoder (EncDec) models have been proposed for GEC and have achieved human-parity performance, particularly on several benchmark datasets for English [3, 9]. The key factors in this performance improvement are the use of pre-trained models [11, 12, 23] and pseudo data [14, 34]. EncDec models require a large amount of training data, but far less data is available for GEC than for machine translation (MT).
In contrast to the rapid progress of research on English GEC, few studies on Chinese GEC, where available data are even more limited than in English, have investigated methodologies for incorporating pre-trained models and pseudo data into GEC models. For pre-trained models, a limited number of studies have used BERT [1, 6, 16], although new Chinese pre-trained models are developed and released continuously [28, 29]. Moreover, Wang et al. [33] is one of the few studies that incorporated pseudo data into Chinese GEC models. Although they combined rule-based and backtranslation methods to generate pseudo data for Chinese GEC, the data used for generation are not public; hence, it is difficult to analyze the contribution of the pseudo data to the final performance. It has also been reported that suitable settings for pseudo data utilization in GEC vary depending on the language [20], suggesting that the best practices in English GEC cannot be directly applied to Chinese GEC.
This study comprehensively investigates methodologies for utilizing pre-trained models and pseudo data in Chinese GEC and provides the Chinese GEC community with an improved understanding of the incorporation of pre-trained models and pseudo data. Through extensive experiments with three large-scale pre-trained models (Chinese BERT [4], Chinese T5 [29], Chinese BART [28]), and three types of pseudo data (Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation)), we show that BART offers the best performance, and BART+Lang-8 (MaskGEC) is the ideal setting in terms of accuracy and training efficiency. Additionally, we annotate the error types of the development data; the results show that word-level errors dominate all error types, and word selection errors must be addressed even when incorporating pre-trained models and pseudo data.

2 Related Work

2.1 English GEC Using Pre-trained Models and Pseudo Data

For English GEC tasks, BERT [5] is primarily used as the pre-trained model to improve performance. Additionally, large-scale pseudo data have been shown to contribute to accuracy. In this subsection, we summarize the works that incorporate BERT and pseudo data into their correction models.
Pre-trained model as a feature: Kaneko et al. [10] first fine-tuned BERT on a learner corpus and then employed the word probabilities provided by BERT as re-ranking features. Using BERT for re-ranking features, they obtained an improvement of approximately 0.7 points in the \(\mathrm{F_{0.5}}\) score. By contrast, Kaneko et al. [11] first fine-tuned BERT on a grammatical error diagnosis task and then incorporated the fine-tuned BERT into the correction model by using a fusion method. They showed the effectiveness of BERT on the English GEC task and achieved comparatively high scores.
Pre-trained model in a pipeline: Kantor et al. [12] used BERT to solve the GEC task by iteratively querying BERT as a black-box language model. They added a [MASK] token into source sentences and predicted the word represented by the [MASK] token. If the word probability predicted by BERT exceeded a threshold, the word was output as a correction candidate. Using BERT, they obtained an improvement of 0.27 points in the \(\mathrm{F_{0.5}}\) score. Omelianchuk et al. [23] treated the GEC problem as a sequence editing problem. They used a BERT-based pre-trained model to predict the edit operations for an erroneous sentence, and the predicted edit operations were then used to correct the erroneous sentence.
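To make the mask-and-predict idea concrete, the following sketch masks each position in turn and proposes the masked language model's top prediction as a correction only when its probability exceeds a threshold. It is an illustration of the general approach rather than the system of Kantor et al. [12]; the checkpoint name and the threshold value are our own assumptions.

```python
# Illustrative mask-and-predict correction with a masked LM (not the exact
# system of Kantor et al. [12]); checkpoint and threshold are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def propose_corrections(sentence, threshold=0.9):
    """Mask each token, query the LM, and keep only confident replacements."""
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    proposals = []
    for i in range(1, input_ids.size(0) - 1):        # skip [CLS] and [SEP]
        masked_ids = input_ids.clone()
        original_id = masked_ids[i].item()
        masked_ids[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked_ids.unsqueeze(0)).logits[0, i]
        probs = torch.softmax(logits, dim=-1)
        prob, idx = probs.max(dim=-1)
        # Propose a replacement only if the LM is confident and the token changes.
        if prob.item() > threshold and idx.item() != original_id:
            proposals.append((i,
                              tokenizer.convert_ids_to_tokens(original_id),
                              tokenizer.convert_ids_to_tokens(idx.item()),
                              round(prob.item(), 3)))
    return proposals

print(propose_corrections("He go to school every day ."))
```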
Generating Pseudo Data for GEC: Xie et al. [34] proposed a method for generating pseudo data based on backtranslation. They first trained a backtranslation model on GEC data and then applied it to a clean monolingual corpus to acquire pseudo data. They also adopted noising methods to generate diverse pseudo data. Kiyono et al. [14] conducted experiments focusing on three aspects: generation methods, selection of the seed corpus, and optimization settings for English GEC pseudo data. They showed how pseudo data should be generated and used for the English GEC task and achieved state-of-the-art performance at the time of publication.

2.2 Chinese GEC

In this subsection, we first describe the NLPCC 2018 Chinese GEC dataset and then provide details about several methods that have been evaluated on this dataset.
Given the success of the shared tasks on English GEC at the Conference on Natural Language Learning (CoNLL) [21, 22], a Chinese GEC shared task was introduced at NLPCC 2018. In this task, approximately one million sentences from the language learning website Lang-8 were used as training data, and two thousand sentences from the PKU Chinese Learner Corpus [38] were used as test data. The supervised GEC models trained on this dataset fall into two groups: simple models and complex models. Simple models are easy to understand and use but are less effective, whereas complex models achieve high accuracy but are hard to maintain. Our approaches, using pre-trained models and pseudo data, offer the best of both worlds; they are simple and effective.
Simple models: Ren et al. [26] utilized a convolutional neural network (CNN), similar to that of Chollampatt and Ng [3]. However, because the structure of the CNN differs from that of BERT, it cannot be initialized with the weights learned by BERT. Zhao and Wang [39] proposed a dynamic masking method that replaces tokens in the source sentences of the NLPCC 2018 GEC shared task training data with other tokens (e.g., the [PAD] token). They achieved comparatively high scores on the shared task without using any extra knowledge. This is a data augmentation method that can be combined with methods that utilize pre-trained models.
Complex models: Fu et al. [7] combined a 5-gram language model-based spell checker with subword-level and character-level encoder–decoder models using Transformer to obtain five types of outputs. Then, they re-ranked these outputs using the language model. Although they reported high performance, several models were required, and their method of combining these models was complex. Hinson et al. [8] proposed a heterogeneous approach for Chinese GEC. In their method, an erroneous sentence is corrected using a spell checker model, a sequence editing model, and a sequence-to-sequence model in multiple rounds. Before their work, only sequence-to-sequence models were used for recycle generation in Chinese GEC [24]. They also used an automatic annotator to annotate four error types and evaluated model performance on these error types. However, they only used character-level edit operations as the error types, which may not adequately reflect the nature of Chinese grammatical errors. Chen et al. [2] proposed a method comprising two parts: a sequence tagging error detection model and a sequence-to-sequence error correction model. The sequence tagging model identifies the erroneous text spans in the source sentence, and the detected text spans are then fed into the sequence-to-sequence model, which corrects them. Their method performs comparably to conventional sequence-to-sequence methods at less than 50% of the inference time cost. Sun et al. [30] proposed a shallow aggressive decoding method to improve the online inference speed of GEC models. Their approach offers a 12.0\(\times\) online inference speedup over the baseline model on the Chinese GEC task.

3 Chinese Pre-trained Models

We adopt three pre-trained models to construct our Chinese GEC models: Chinese BERT built by Cui et al. [4], Chinese T5 built by Su [29], and Chinese BART built by Shao et al. [28]. These pre-trained models were originally proposed by Devlin et al. [5], Raffel et al. [25], and Lewis et al. [15], respectively. The main details and differences among the three pre-trained models are summarized in Table 1. All the details pertain to the Chinese variants [4, 28, 29] rather than the original models [5, 15, 25]. From the table, we can observe that the three pre-trained models differ in the following aspects:
Table 1.
            BERT                       T5                         BART
Arch.       Transformer Encoder,       Full Transformer,          Full Transformer,
            12-layer, 768-hidden,      12-layer, 768-hidden,      12-layer, 1024-hidden,
            12-head                    12-head                    16-head
Param.      110M                       275M                       406M
Tok.        Character                  Word/Character             Character
Vocab.      21,128                     50,000                     21,128
Mask        Whole Word Masking         -                          Token Infilling
Task        Masked Language Model      Summarization              Denoising Auto-Encoding
Data        5.4B Tokens                30 GB                      200 GB
Table 1. Summary of Pre-trained Models used in our Study
The architecture, number of parameters, tokenization, vocabulary size, masking strategy, pre-training task, and pre-training data size are presented.
Architecture and Number of Parameters: BERT has the fewest parameters because it uses only a Transformer encoder. Note that we initialize the encoder side of a full Transformer with BERT in the following experiments; hence, the total number of parameters is larger than 110M. T5 and BART adopt the full Transformer architecture, and BART has the largest number of parameters because it has the largest hidden size and number of attention heads.
Tokenization: The tokenization of BERT and BART is character-based: all Chinese strings are divided into characters. The tokenization of T5 is word/character-based; it uses a vocabulary that contains the 50,000 most frequent words, and any out-of-vocabulary word is divided into characters.
Masking Strategy: BERT adopts a masking method called whole word masking (WWM). In WWM, when a Chinese character is masked, other Chinese characters that belong to the same word are also masked. BART adopts a masking strategy called token infilling, in which the whole word is replaced by a single [MASK] token when a character is masked. T5 does not employ a masking strategy.
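To make the difference concrete, the toy sketch below (our own illustration, not the actual pre-training code) shows how the word 天气 in the character sequence 天 气 预 报 is masked under each strategy.

```python
# Toy illustration of whole word masking (BERT) versus token infilling (BART).
chars = ["天", "气", "预", "报"]   # character-level tokens
word_spans = [(0, 2), (2, 4)]      # word boundaries: 天气, 预报

def whole_word_mask(tokens, span):
    """WWM: every character of the chosen word gets its own [MASK]."""
    start, end = span
    return tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:]

def token_infilling(tokens, span):
    """Token infilling: the whole span is replaced by a single [MASK]."""
    start, end = span
    return tokens[:start] + ["[MASK]"] + tokens[end:]

print(whole_word_mask(chars, word_spans[0]))   # ['[MASK]', '[MASK]', '预', '报']
print(token_infilling(chars, word_spans[0]))   # ['[MASK]', '预', '报']
```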
Pre-training Task: For BERT, Cui et al. [4] removed the next sentence prediction task and used only the masked language model task, following Liu et al. [17]. T5 adopts summarization as the pre-training task, following Zhang et al. [37]; in this task, the input is a document and the output is its summary. BART employs a pre-training task called denoising autoencoding (DAE), in which the model reconstructs the original document from a corrupted input.
Pre-training Data: BERT uses Chinese Wikipedia (0.4B tokens) and an extended corpus (5.0B tokens) consisting of Baidu Baike (a Chinese encyclopedia) and question-and-answer data during pre-training. T5 uses 30 GB of pre-training data collected from the internet. BART uses 200 GB of text from Chinese Wikipedia and part of WuDaoCorpora [36].

4 Generating Pseudo Data for Chinese GEC

Although the pseudo data generation method for English GEC has been extensively studied [14], few studies have examined the effect of pseudo data on the performance of a seq2seq model for Chinese GEC. Wang et al. [33] combined both rule-based and backtranslation methods to generate the pseudo data. However, they mixed the pseudo data generated by the two methods and used non-public data for generation; hence, it is difficult to analyze the contribution of pseudo data to the final performance. Therefore, we conducted a series of thorough experiments to investigate the effects of pseudo data generated via rule-based and backtranslation methods when they are combined with the pre-trained models.

4.1 Rule-based Method (MaskGEC)

We used a rule-based method called MaskGEC [39] to generate the rule-based pseudo data. The pseudo data are generated by replacing tokens in the original sentence. There are four token replacement strategies: (1) the selected token is substituted with a padding symbol; (2) the selected token is substituted with a random token from the vocabulary; (3) the selected token is substituted with a token from the vocabulary sampled according to frequency; and (4) the selected token is substituted with a homophone sampled according to frequency. For every sentence in the corpus used to generate pseudo data, one of the four strategies is randomly applied to the sentence; each character in the sentence is selected as a substitution candidate with probability \(\delta\). To help readers understand this algorithm, pseudocode is included in the supplementary materials, and a simplified sketch is given below. Dynamic and static masking strategies are also possible. In the dynamic masking strategy, the pseudo data are regenerated in every epoch; hence, each training instance may be seen with a different mask in different epochs. In the static masking strategy, the pseudo data are generated only once; hence, each training instance remains unchanged across epochs. We adopt the static masking strategy in this work for simplicity, because our experimental results show no apparent difference between the two strategies.
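The sketch below shows a minimal static MaskGEC-style noiser. The vocabulary, frequency counts, and homophone table are assumed inputs, and the sketch is an illustration of the procedure described above rather than the implementation of Zhao and Wang [39].

```python
# Minimal sketch of static MaskGEC-style noising; vocab, freqs, and homophones
# are assumed inputs, not part of the original implementation.
import random

PAD = "[PAD]"

def noise_sentence(chars, vocab, freqs, homophones, delta=0.1):
    """Return a noised copy of `chars` using one randomly chosen strategy."""
    strategy = random.choice(["pad", "random", "frequency", "homophone"])
    tokens, weights = zip(*freqs.items())            # unigram frequencies
    noised = []
    for ch in chars:
        if random.random() >= delta:                 # keep with probability 1 - delta
            noised.append(ch)
        elif strategy == "pad":
            noised.append(PAD)
        elif strategy == "random":
            noised.append(random.choice(vocab))
        elif strategy == "frequency":
            noised.append(random.choices(tokens, weights=weights, k=1)[0])
        else:                                        # homophone, sampled by frequency
            candidates = homophones.get(ch, [ch])
            cand_weights = [freqs.get(c, 1) for c in candidates]
            noised.append(random.choices(candidates, weights=cand_weights, k=1)[0])
    return noised

def build_static_pseudo_data(corpus, vocab, freqs, homophones, delta=0.1):
    """Static masking: generate the noised source once and reuse it every epoch."""
    return [(noise_sentence(list(sent), vocab, freqs, homophones, delta), list(sent))
            for sent in corpus]
```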

4.2 Backtranslation Method

Backtranslation was originally proposed by Sennrich et al. [27] to generate pseudo data for the machine translation task. In the GEC setting, the input of the backtranslation model is a correct sentence, and the output is an erroneous sentence. Following Xie et al. [34], we first train a backtranslation model using the Chinese GEC training data. Then, we apply the backtranslation model to a seed corpus to generate the pseudo data. For inference, we adopt the random noising method of Xie et al. [34] to generate diverse noise. Every hypothesis is penalized by adding \(r\beta_{\mathit{random}}\) to its score, where \(r\) is drawn uniformly from the interval [0, 1] and \(\beta_{\mathit{random}}\) is a hyper-parameter greater than or equal to 0. A sufficiently large \(\beta_{\mathit{random}}\) results in a random shuffling of the ranks of the hypotheses, whereas \(\beta_{\mathit{random}}=0\) is identical to standard backtranslation. We set \(\beta_{\mathit{random}}=6\) following Kiyono et al. [14].
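The noised selection step can be sketched as follows. We assume the hypotheses come from an arbitrary seq2seq beam search and that scores are costs where lower is better (e.g., negative log-probabilities); this convention is our own assumption for the illustration.

```python
# Sketch of noised hypothesis selection for backtranslation; the cost convention
# (lower is better) is an assumption made for this illustration.
import random

def noisy_select(hypotheses, beta_random=6.0):
    """hypotheses: list of (sentence, cost) pairs from beam search.
    Each cost is penalized by r * beta_random with r ~ U[0, 1];
    beta_random = 0 recovers standard backtranslation."""
    noised = [(sentence, cost + random.random() * beta_random)
              for sentence, cost in hypotheses]
    return min(noised, key=lambda pair: pair[1])[0]
```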

5 Experiments

To investigate methodologies for utilizing pre-trained models and pseudo data in Chinese GEC, we design our experiments by using three large-scale pre-trained models (described in Section 3) and two pseudo data generation methods (described in Section 4).

5.1 Data

We train and evaluate our models using the data provided by the NLPCC 2018 GEC shared task. We first segment all sentences into characters because the Chinese pre-trained models that we use are character-based. The training data consist of 1.2 million sentence pairs extracted from the language learning website Lang-8. Because the NLPCC 2018 GEC shared task does not provide development data, we randomly extracted 5,000 sentences from the training data as development data, following Ren et al. [26]. The test data consist of 2,000 sentences extracted from the PKU Chinese Learner Corpus. According to Zhao et al. [38], the annotation guidelines follow the minimum edit distance principle [19], which selects the edit operations that minimize the edit distance from the original sentence.
Following Zhao and Wang [39], we use the source side of the NLPCC 2018 GEC training data to generate pseudo data. Before generation, we used a tokenization script from the BERT project to tokenize the Chinese texts into characters while keeping non-Chinese tokens unchanged. We set the substitution probability \(\delta =0.1\), which achieves the best perplexity on the development data (the perplexity for each \(\delta\) is depicted in Figure 2 of the supplementary materials). This differs from Zhao and Wang [39], who set \(\delta =0.3\), which achieved the best \(\mathrm{F_{0.5}}\) score on the test set. We name the generated pseudo data Lang-8 (MaskGEC) in the remainder of this article. Note that we did not perform backtranslation for Lang-8 because there are only a few unannotated Chinese learners' sentences in the Lang-8 corpus.
We also utilize Chinese Wikipedia as a seed corpus to generate pseudo data. We download the preprocessed Chinese Wikipedia data from nlp_chinese_corpus and use tools from Chinese-wikipedia-corpus-creator to split the documents into sentences. We acquire approximately nine million sentences after these steps. Then, we apply the rule-based and backtranslation methods to those Wikipedia sentences, treating the generated noisy sentences as erroneous sentences and the original Wikipedia sentences as correct sentences. We call the pseudo data generated by the rule-based method Wiki (MaskGEC) and the pseudo data generated by the backtranslation method Wiki (Backtranslation).

5.2 Model

We used Transformer as our baseline model. Transformer offers excellent performance in sequence-to-sequence tasks, such as machine translation, and has been widely adopted in recent studies on English GEC [9, 14].
A BERT-based pre-trained model only uses the encoder of Transformer; therefore, it cannot be directly applied to sequence-to-sequence tasks that require both an encoder and a decoder, such as GEC. Hence, we initialized the encoder of Transformer with the parameters of Chinese BERT; the decoder is initialized randomly. Finally, we train the initialized model on Chinese GEC data.
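The following sketch illustrates this initialization, assuming the Hugging Face transformers library and the publicly released hfl/chinese-bert-wwm-ext checkpoint of Cui et al. [4]; the authors' actual implementation (BERT-encoder-ChineseGEC) may differ in detail.

```python
# Sketch: pre-trained BERT encoder + randomly initialized decoder.
# Library and checkpoint are assumptions; not the exact implementation.
from transformers import BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel

encoder = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")   # pre-trained encoder

decoder_config = BertConfig.from_pretrained("hfl/chinese-bert-wwm-ext")
decoder_config.is_decoder = True             # causal self-attention
decoder_config.add_cross_attention = True    # attend to encoder outputs
decoder = BertLMHeadModel(decoder_config)    # randomly initialized decoder

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
# Before generation, decoder_start_token_id and pad_token_id must also be set
# on model.config; the model is then fine-tuned on character-level GEC pairs.
```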
As for Chinese T5 and Chinese BART, because they are both encoder–decoder architectures, we could fine-tune them on the Chinese GEC dataset.
Finally, we have the following models trained on different data:
Baseline: A plain Transformer model that is initialized randomly without using a pre-trained model. This model is trained on the original Lang-8 data.
BERT-encoder, T5, BART: The models fine-tuned on the original Lang-8 data.
Baseline, BERT-encoder, T5, BART + Lang-8 (MaskGEC): The models are fine-tuned on Lang-8 (MaskGEC) pseudo data.
Baseline, BERT-encoder, T5, BART + Wiki (MaskGEC): The models are first warmed up on Wiki (MaskGEC) pseudo data until convergence and then fine-tuned on Lang-8 (MaskGEC) pseudo data. We adopt this setting to avoid a mismatch in the appearance of the [MASK] token between the two fine-tuning steps.
Baseline, BERT-encoder, T5, BART + Wiki (Backtranslation): The models are first warmed up on Wiki (Backtranslation) pseudo data until convergence and then fine-tuned on the original Lang-8 data.
We implement the Baseline, BERT-encoder, T5, and BART models based on the following projects, respectively: awesome-transformer, BERT-encoder-ChineseGEC, t5-pegasus-chinese, and CPT. Readers can refer to these projects and the supplementary materials for more implementation details.
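For readers who prefer a single self-contained example, the condensed sketch below fine-tunes a Chinese BART checkpoint on GEC sentence pairs with the Hugging Face transformers library. The fnlp/bart-base-chinese checkpoint and all hyper-parameters are illustrative assumptions; the actual training setups follow the projects listed above.

```python
# Condensed fine-tuning sketch; checkpoint and hyper-parameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")  # char-level vocab
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def collate(batch):
    """batch: list of (erroneous_sentence, corrected_sentence) string pairs."""
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    enc["labels"] = labels
    return enc

def finetune(pairs, epochs=3, batch_size=32):
    loader = DataLoader(pairs, batch_size=batch_size, shuffle=True, collate_fn=collate)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```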

5.3 Evaluation

Because the evaluation is performed on word units, we strip all delimiters from the system output sentences and segment the sentences using the pkunlp toolkit provided by the NLPCC 2018 GEC shared task.
Following the setup of the NLPCC 2018 GEC shared task, the evaluation is conducted using MaxMatch (\(M^2\)). The MaxMatch algorithm computes the phrase-level edits between the source sentence and the system output and then finds the overlaps between the system edits and the gold edits.
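For illustration, once the system edits and gold edits have been extracted, precision, recall, and \(\mathrm{F_{0.5}}\) follow from their overlap as in the toy sketch below; the real \(M^2\) scorer additionally searches for the system edit set that maximizes this overlap, which we do not reproduce here.

```python
# Toy sketch of P/R/F0.5 from edit overlap (the edit extraction itself is omitted).
def f_beta(system_edits, gold_edits, beta=0.5):
    """Each edit is a (start, end, replacement) tuple over word positions."""
    system_set, gold_set = set(system_edits), set(gold_edits)
    tp = len(system_set & gold_set)
    p = tp / len(system_set) if system_set else 1.0
    r = tp / len(gold_set) if gold_set else 1.0
    if p + r == 0:
        return 0.0, p, r
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return f, p, r

# Example: one of two gold edits is reproduced, plus one spurious system edit.
gold = [(2, 3, "特别"), (5, 6, "的")]
system = [(2, 3, "特别"), (7, 8, "了")]
print(f_beta(system, gold))   # (0.5, 0.5, 0.5)
```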

5.4 Evaluation Results

Table 2 summarizes the experimental results of our models. We run the single models three times and report the average score. For comparison, we also cite the results of recent works [8, 39] as well as those of the models developed by two teams [7, 26] in the NLPCC 2018 GEC shared task.
Table 2.
Original Data            P      R      F0.5     Lang-8 (MaskGEC)     P      R      F0.5
Baseline                 37.78  16.99  30.23    Baseline             36.46  23.03  32.66
BERT-encoder             39.78  20.84  33.66    BERT-encoder         36.02  24.10  32.77
T5                       41.61  20.22  34.34    T5                   39.07  24.10  34.73
BART                     39.50  30.01  37.15    BART                 41.08  32.18  38.93
Hinson et al. [8]        36.79  27.82  34.56    Zhao and Wang [39]
Fu et al. [7]            35.24  18.64  29.91    - dynamic masking    44.36  22.18  36.97
Ren et al. [26]          41.73  13.08  29.02    - static masking     43.73  21.71  36.35
Ren et al. [26] (4-ens)  47.63  12.56  30.57

Wiki (Backtrans.)        P      R      F0.5     Wiki (MaskGEC)       P      R      F0.5
Baseline                 36.28  21.14  31.74    Baseline             34.03  21.18  30.33
BERT-encoder             37.66  22.76  33.29    BERT-encoder         35.23  24.50  32.39
T5                       42.61  20.56  35.08    T5                   39.35  24.84  35.23
BART                     40.32  30.68  37.94    BART                 41.41  31.79  39.05
Table 2. Experimental Results of the NLPCC 2018 GEC Shared Task
The upper rows of the first table show the results of our models, and the lower rows show the results of previous works; the left half reports models trained on the original data and the right half reports models trained on the Lang-8 (MaskGEC) pseudo data. The second table presents the results of our models trained on the Wiki (Backtranslation) and Wiki (MaskGEC) pseudo data.
For the models trained on the original data, all our models that use a pre-trained model outperform the baseline model and the models of the two teams in the NLPCC 2018 GEC shared task, indicating the effectiveness of adopting pre-trained models. Moreover, the BART model yields an \(\mathrm{F_{0.5}}\) score roughly seven points higher than that of the baseline model and achieves the best result among all models. This result indicates the effectiveness of BART owing to its larger number of parameters, larger pre-training data, and a pre-training task that is better suited to GEC.
Regarding the two recent related works, Zhao and Wang [39] adequately balanced precision and recall, achieving a comparatively high \(\mathrm{F_{0.5}}\) score, and Hinson et al. [8] achieved a comparatively high recall. However, our BART model still exceeds both results. This finding is in agreement with Katsumata and Komachi [13], who showed that BART is a simple but strong baseline for English, German, and Czech (see footnote 13).
When the models are combined with pseudo data, almost all models (except BERT-encoder) benefit from the Lang-8 (MaskGEC) pseudo data. These results confirm the effectiveness of the approach of Zhao and Wang [39], which uses only Lang-8 to generate pseudo data with a rule-based method. Comparing models trained on Lang-8 (MaskGEC) and Wiki (MaskGEC), we find no significant differences, although the latter uses 10\(\times\) more pseudo data. Additionally, comparing Wiki (MaskGEC) and Wiki (Backtranslation) gives a mixed result: MaskGEC is better for T5 and BART, whereas backtranslation is better for the baseline and BERT-encoder. Considering the training cost and final performance, incorporating the pre-trained models with the Lang-8 (MaskGEC) pseudo data is the ideal setting at the present stage.

6 Analysis

To thoroughly understand the performance of models in the Chinese GEC setting, we conduct qualitative and quantitative analysis.

6.1 System Output

Table 3 presents sample outputs of the models trained in our experiments.
Table 3.
Table 3. Source Sentence, Gold Edit, and Output from Baseline, BERT, T5, and BART Models
In the first example, the spelling error 持别 is accurately corrected to 特别 (which means especially) by all the pre-trained models, whereas it is not corrected by the baseline model. This example shows the effectiveness of pre-trained models for handling easy errors.
In the second example, the models are required to change the word order (没有说服, which means didn't convince), delete the redundant word 对 (to), and insert the missing word 的 (of). T5 and BART perform well in this case; their outputs are nearly the same as the gold correction, except that the missing word is not inserted. This may be because T5 and BART have a pre-trained decoder that makes the output more fluent.
In the third example, the output of T5 differs from the others: it copies the original source sentence and appends its correction after the source sentence. This is because the Lang-8 training data contain noise in which some native speakers copy the original erroneous sentence and append their corrections or comments after it [18], and T5, being pre-trained on a summarization task, is sensitive to length changes and tends to pick up this pattern. Compared with T5, BART gives an ideal correction that is almost the same as the gold correction. Considering that BART is pre-trained with a DAE task, which is similar to the GEC task, we conclude that pre-trained models whose pre-training task resembles the downstream task should be preferred.
In the last example, we observe that BART rewrites the sentence, and its output is more fluent than the gold correction. The meaning of BART's output is "Happiness is a good medicine that can cure millions of diseases." Compared with "mood can be effective for diseases" in the gold correction, "good medicine can cure diseases" is more fluent. Moreover, 情绪 (mood) appears twice in the gold correction, which makes the whole sentence verbose. However, this type of fluent edit may hurt precision because the gold correction follows the principle of minimum edit distance [38]. This motivates us to propose and evaluate models on a new Chinese GEC dataset, such as that of Wang et al. [32], which can evaluate fluency edits appropriately.

6.2 Error Type

To understand the error distribution of Chinese GEC, we annotated 66 sentences of the development data and obtained 100 errors (one sentence may contain more than one error). We referred to the annotation of the HSK learner corpus and adopted the eight most frequent categories of errors in the corpus: B, CC, CD, CQ, CJwo, CJ+, CJ-, and CJetc. B denotes character-level errors, which are primarily spelling errors. CC, CD, and CQ are word-level errors: word selection, redundant word, and missing word errors, respectively. CJ denotes sentence-level errors, which contain several complex errors: CJwo denotes word order errors, CJ+ and CJ- denote redundant and missing sentence constituent errors, and CJetc contains other sentence-level errors such as the wrong usage of 有 (there is) and 是 (is). Several examples are presented in Table 4. Based on the number of errors, it is evident that word-level errors (CC, CQ, and CD) are the most frequent.
Table 4.
Error Type                               Frequency  Examples
B (spell)                                8          关主{关注} 天气 预报 。 (Pay attention to the weather forecast.)
CC (word selection)                      28         古书店{旧书店} , 买 了 十 本 书 。 (Bought ten books at a second-hand bookstore.)
CD (redundant word)                      8          我 很 喜欢 {NONE} 读 小说 。 (I like to read novels.)
CQ (missing word)                        24         在 上海 我 总是 住 NONE{} 一家 特定 NONE{} 酒店 。 (I always stay in the same hotel in Shanghai.)
CJwo (word order)                        10         我 决定 学习 努力{努力 学习} 。 (I decided to study hard.)
CJ+ (redundant constituents)             3          去年 我 到 克拉克夫 来了{NONE} 读书 。 (I went to study in Krakow last year.)
CJ- (missing constituents)               9          NONE{打算} 在 夏天 好好学 汉语 。 (I plan to study Chinese hard in the summer.)
CJetc (other sentence-level errors)      10         {今年} 23 岁 。 (I am 23 years old.)
Table 4. Examples of Each Error Type
The underlined tokens are detected errors that should be replaced with the tokens in braces.
Figure 1 presents the correction results of the three pre-trained models for each error type. We report the recall score here for simplicity; it reflects the proportion of gold edits that are reproduced by the systems. The results indicate that BART offers the best performance among the three pre-trained models on every error type. This is consistent with the evaluation results presented in Section 5.4 and shows the effectiveness of BART. All the systems achieve a comparatively high score on B (spelling errors) and CJetc (other sentence-level errors), showing that these two error types are relatively easy for the systems. By contrast, considering that the systems perform poorly on CC (word selection errors), which is the most frequent error type, we conclude that CC is the most crucial error type and must be addressed in future work. We also observe that T5 performs better than BERT on CQ (missing word errors) and CJwo (word order errors). This may be because T5 has a pre-trained decoder and thus can handle errors related to word insertion and word order more effectively.
Fig. 1. Recall score of three pre-trained models on each error type.

7 Conclusion

In this study, we developed Chinese grammatical error correction (GEC) models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART. Among these models, Chinese BART achieved state-of-the-art results. The experimental results demonstrated the usefulness of pre-trained models on the Chinese GEC task. We combined the pre-trained model with pseudo data and found that the BART+Lang-8 (MaskGEC) was the ideal setting in terms of accuracy and training efficiency. Additionally, the error type analysis showed that word selection errors remain to be addressed.
The majority of the methods proposed in the NLPCC 2018 GEC shared task are simply based on methods for English GEC; however, Chinese GEC has its own characteristics. For example, spelling errors primarily arise from glyph and pronunciation similarity, and sentence-level errors often depend on word position. Therefore, we plan to study and improve the Chinese GEC system while considering these characteristics, for example, by incorporating glyph and pronunciation information into the system [35] or by adopting neural models whose positional embeddings capture word order more effectively [31].

Acknowledgments

We would like to thank all editors and reviewers for their constructive comments and kind help.

Footnotes

13. Note that they did not conduct an experiment with Chinese, and they reported that the pre-trained model alone (without any pseudo data) did not yield satisfactory performance for Russian.

Supplementary Material

tallip-22-0263-File002 (tallip-22-0263-file002.zip)
Supplementary material

References

[1]
Yongchang Cao, Liang He, Robert Ridley, and Xinyu Dai. 2020. Integrating BERT and score-based feature gates for Chinese grammatical error diagnosis. In NLPTEA. 49–56.
[2]
Mengyun Chen, Tao Ge, Xingxing Zhang, Furu Wei, and Ming Zhou. 2020. Improving the efficiency of grammatical error correction with erroneous span detection and correction. In EMNLP. 7162–7169.
[3]
Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In AAAI. 5755–5762.
[4]
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Findings of EMNLP.
[5]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. 4171–4186.
[6]
Meiyuan Fang, Kai Fu, Jiping Wang, Yang Liu, Jin Huang, and Yitao Duan. 2020. A hybrid system for NLPTEA-2020 CGED shared task. In NLPTEA. 67–77.
[7]
Kai Fu, Jin Huang, and Yitao Duan. 2018. Youdao’s winning solution to the NLPCC-2018 task 2 challenge: A neural machine translation approach to Chinese grammatical error correction. In NLPCC. 341–350.
[8]
Charles Hinson, Hen-Hsen Huang, and Hsin-Hsi Chen. 2020. Heterogeneous recycle generation for Chinese grammatical error correction. In COLING. 2191–2201.
[9]
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In NAACL-HLT. 595–606.
[10]
Masahiro Kaneko, Kengo Hotate, Satoru Katsumata, and Mamoru Komachi. 2019. TMU transformer system using BERT for re-ranking at BEA 2019 grammatical error correction on restricted track. In BEA. 207–212.
[11]
Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2020. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In ACL. 4248–4254.
[12]
Yoav Kantor, Yoav Katz, Leshem Choshen, Edo Cohen-Karlik, Naftali Liberman, Assaf Toledo, Amir Menczel, and Noam Slonim. 2019. Learning to combine grammatical error corrections. In BEA. 139–148.
[13]
Satoru Katsumata and Mamoru Komachi. 2020. Stronger baselines for grammatical error correction using a pretrained encoder-decoder model. In AACL-IJCNLP. 827–832.
[14]
Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, and Kentaro Inui. 2019. An empirical study of incorporating pseudo data into grammatical error correction. In EMNLP-IJCNLP. 1236–1242.
[15]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL. 7871–7880.
[16]
Deng Liang, Chen Zheng, Lei Guo, Xin Cui, Xiuzhang Xiong, Hengqiao Rong, and Jinpeng Dong. 2020. BERT enhanced neural machine translation and sequence tagging model for Chinese grammatical error diagnosis. In NLPTEA. 57–66.
[17]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv abs/1907.11692 (2019), 13 pages.
[18]
Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In IJCNLP. 147–155.
[19]
Ryo Nagata and Keisuke Sakaguchi. 2016. Phrase structure annotation and parsing for learner English. In ACL. 1837–1847.
[20]
Jakub Náplava and Milan Straka. 2019. Grammatical error correction in low-resource scenarios. In W-NUT. 346–356.
[21]
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL. 1–14.
[22]
Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In CoNLL. 1–12.
[23]
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In BEA. 163–170.
[24]
Zhaoquan Qiu and Youli Qu. 2019. A two-stage model for Chinese grammatical error correction. IEEE Access 7 (2019), 146772–146777.
[25]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, 140 (2020), 1–67.
[26]
Hongkai Ren, Liner Yang, and Endong Xun. 2018. A sequence to sequence learning for Chinese grammatical error correction. In NLPCC. 401–410.
[27]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In ACL. 86–96.
[28]
Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, and Xipeng Qiu. 2021. CPT: A pre-trained unbalanced transformer for both Chinese language understanding and generation. arXiv abs/2109.05729 (2021), 9 pages.
[29]
Jianlin Su. 2021. T5 PEGASUS - ZhuiyiAI. Technical Report. https://github.com/ZhuiyiTechnology/t5-pegasus.
[30]
Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. 2021. Instantaneous grammatical error correction with shallow aggressive decoding. In ACL-IJCNLP. 5937–5947.
[31]
Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In ICLR. 15 pages.
[32]
Yingying Wang, Cunliang Kong, Liner Yang, Yijun Wang, Xiaorong Lu, Renfen Hu, Shan He, Zhenghao Liu, Yuxiang Chen, Erhong Yang, and Maosong Sun. 2021. YACLC: A Chinese learner corpus with multidimensional annotation. arXiv abs/2112.15043 (2021).
[33]
Yi Wang, Ruibin Yuan, Yan'gen Luo, Yufang Qin, NianYong Zhu, Peng Cheng, and Lihuan Wang. 2020. Chinese grammatical error correction based on hybrid models with data augmentation. In BEA. 78–86.
[34]
Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Jurafsky. 2018. Noising and denoising natural language: Diverse backtranslation for grammar correction. In NAACL-HLT. 619–628.
[35]
Heng-Da Xu, Zhongli Li, Qingyu Zhou, Chao Li, Zizhen Wang, Yunbo Cao, Heyan Huang, and Xian-Ling Mao. 2021. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking. In Findings of ACL. 13 pages.
[36]
Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. 2021. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open 2 (2021), 65–68.
[37]
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In ICML. 12 pages.
[38]
Yuanyuan Zhao, Nan Jiang, Weiwei Sun, and Xiaojun Wan. 2018. Overview of the NLPCC 2018 shared task: Grammatical error correction. In NLPCC. 439–445.
[39]
Zewei Zhao and Houfeng Wang. 2020. MaskGEC: Improving neural grammatical error correction via dynamic masking. In AAAI. 1226–1233.
