Abstract
Chinese Spelling Check (CSC) is an important task in Chinese language processing. A main challenge in applying supervised learning to CSC is the shortage of high-quality annotated corpora for building models. This paper proposes new approaches that automatically build CSC corpora based on the input method. We build two corpora: the p-corpus, used to check errors in text produced by the Pinyin input method, and the v-corpus, used to check errors in text produced by the voice input method. The p-corpus is constructed using two methods: one based on conversion between Chinese characters and their sounds, and the other based on Automatic Speech Recognition (ASR). The v-corpus is constructed using ASR. We use misspelled sentences from real language use as the test set. Experimental results show that our corpora yield better checking performance than the benchmark corpus.
Notes
- 1.
Pinyin is the annotation of Chinese pronunciation. https://en.wikipedia.org/wiki/Pinyin.
- 2.
Chinese tones range from 1 to 4.
- 3.
According to [6], sound edit distance 1 covers about 90% of spelling errors, and sound edit distance 2 accounts for almost all of the remaining spelling errors. We therefore treat two characters whose sound edit distance is 1 or 2 as similar characters. For example, “震” (zhen4, “shock”) and “正” (zheng4, “positive”) have a sound edit distance of 1, so they are similar characters.
- 4.
According to statistics, there are 11 groups of fuzzy sounds in Chinese characters: z-zh, c-ch, s-sh, l-n, f-h, r-l, an-ang, en-eng, in-ing, ian-iang, uan-uang.
- 5.
https://catalog.ldc.upenn.edu/LDC2011T13. The articles in this corpus have undergone a rigorous editing process and are considered to be entirely correct.
- 6.
http://www.openslr.org/resources/33/data_aishell. This speech corpus was transcribed by professional voice proofreaders and passed strict quality inspection. The accuracy of AISHELL is above 95%.
- 7.
The word segmentation tool used in this paper is jieba. https://github.com/fxsjy/jieba.
- 8.
It can extract the sounds of the Chinese characters. https://github.com/mozillazg/python-pinyin.
- 9.
It can convert the sounds into Chinese characters. https://github.com/letiantian/Pinyin2Hanzi.
- 10.
The score is calculated based on the HMM principle. In general, the more commonly used a word is, the higher its score. https://github.com/letiantian/Pinyin2Hanzi.
- 11.
- 12.
A speech recognition kit. https://github.com/kaldi-asr/kaldi.
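The similarity rule in note 3 and the fuzzy-sound groups in note 4 can be illustrated with a minimal sketch. It assumes that "sound edit distance" means the plain Levenshtein distance between toned pinyin strings such as zhen4 and zheng4; the paper may define it differently. All function names and the initials table below are hypothetical, introduced only for illustration, and the fuzzy variants are not checked against the set of valid Mandarin syllables.

```python
from typing import Set, Tuple

# The 11 fuzzy-sound groups from note 4, split into initials and finals.
FUZZY_INITIALS = [("z", "zh"), ("c", "ch"), ("s", "sh"),
                  ("l", "n"), ("f", "h"), ("r", "l")]
FUZZY_FINALS = [("an", "ang"), ("en", "eng"), ("in", "ing"),
                ("ian", "iang"), ("uan", "uang")]

# Pinyin initials; two-letter initials come first so they match before z/c/s.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def sound_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two toned pinyin strings (assumption)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_similar(a: str, b: str, max_distance: int = 2) -> bool:
    """Note 3: sounds at edit distance 1 or 2 count as similar.

    Distance 0 (identical sounds) is excluded here and would be
    handled separately as homophones.
    """
    return 1 <= sound_edit_distance(a, b) <= max_distance


def split_syllable(syllable: str) -> Tuple[str, str, str]:
    """Split toned pinyin into (initial, final, tone digit)."""
    tone = syllable[-1] if syllable and syllable[-1].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # zero-initial syllable such as "an1"


def fuzzy_variants(syllable: str) -> Set[str]:
    """Swap the initial or the final with its fuzzy partner from note 4."""
    initial, final, tone = split_syllable(syllable)
    variants = set()
    for a, b in FUZZY_INITIALS:
        if initial == a:
            variants.add(b + final + tone)
        elif initial == b:
            variants.add(a + final + tone)
    for a, b in FUZZY_FINALS:
        if final == a:
            variants.add(initial + b + tone)
        elif final == b:
            variants.add(initial + a + tone)
    return variants


if __name__ == "__main__":
    print(sound_edit_distance("zhen4", "zheng4"))
    print(is_similar("zhen4", "zheng4"))
    print(sorted(fuzzy_variants("zhen4")))
```

In this sketch the example pair from note 3, zhen4 and zheng4, differs by one inserted letter, and zheng4 also appears among the fuzzy variants of zhen4 through the en-eng group.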
References
Amodei, D., et al.: End to end speech recognition in English and Mandarin (2016)
Bu, H., Du, J., Na, X., Wu, B., Zheng, H.: AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline. In: 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. IEEE (2017)
Chang, T.H., Chen, H.C., Tseng, Y.H., Zheng, J.L.: Automatic detection and correction for Chinese misspelled words using phonological and orthographic similarities. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pp. 97–101 (2013)
Chen, Y.Z., Wu, S.H., Yang, P.C., Ku, T., Chen, G.D.: Improve the detection of improperly used Chinese characters in students’ essays with error model. Int. J. Continuing Eng. Educ. Life Long Learn. 21(1), 103–116 (2011)
Chen, Z., Lee, K.F.: A new statistical approach to Chinese pinyin input. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (2000)
Hsieh, Y.M., Bai, M.H., Huang, S.L., Chen, K.J.: Correcting Chinese spelling errors with word lattice decoding. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 14(4), 18 (2015)
Liu, C.L., Lai, M.H., Tien, K.W., Chuang, Y.H., Wu, S.H., Lee, C.Y.: Visually and phonologically similar characters in incorrect Chinese words: analyses, identification, and applications. ACM Trans. Asian Lang. Inf. Process. (TALIP) 10(2), 10 (2011)
Liu, Y., Zan, H., Zhong, M., Ma, H.: Detecting simultaneously Chinese grammar errors based on a BiLSTM-CRF model. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 188–193 (2018)
Povey, D., et al.: The Kaldi speech recognition toolkit. IEEE Signal Processing Society, Technical report (2011)
Sak, H., Senior, A., Rao, K., Beaufays, F.: Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947 (2015)
Wang, D., Fung, G.P.C., Debosschere, M., Dong, S., Zhu, J., Wong, K.F.: A new benchmark and evaluation schema for Chinese typo detection and correction. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Wang, D., Song, Y., Li, J., Han, J., Zhang, H.: A hybrid approach to automatic corpus generation for Chinese spelling check. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2517–2527 (2018)
Wu, S.H., Liu, C.L., Lee, L.H.: Chinese spelling check evaluation at SIGHAN bake-off 2013. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pp. 35–42 (2013)
Yang, S., Zhao, H., Wang, X., Lu, B.L.: Spell checking for Chinese. In: LREC, pp. 730–736 (2012)
Yongwei, Z., Qinan, H., Fang, L., Yueguo, G.: CMMC-BDRC solution to the NLP-TEA-2018 Chinese grammatical error diagnosis task. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 180–187 (2018)
Yu, J., Li, Z.: Chinese spelling error detection and correction based on language model, pronunciation, and shape. In: Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 220–223 (2014)
Yu, L.C., Lee, L.H., Tseng, Y.H., Chen, H.H.: Overview of SIGHAN 2014 bake-off for Chinese spelling check. In: Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 126–132 (2014)
Zheng, Y., Li, C., Sun, M.: CHIME: an efficient error-tolerant Chinese pinyin input method. In: Twenty-Second International Joint Conference on Artificial Intelligence (2011)
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61672040), Beijing Urban Governance Research Center and the North China University of Technology Startup Fund. The corresponding author is Hao Wang.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Duan, J., Pan, L., Wang, H., Zhang, M., Wu, M. (2019). Automatically Build Corpora for Chinese Spelling Check Based on the Input Method. In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11838. Springer, Cham. https://doi.org/10.1007/978-3-030-32233-5_37
DOI: https://doi.org/10.1007/978-3-030-32233-5_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32232-8
Online ISBN: 978-3-030-32233-5
eBook Packages: Computer Science (R0)