research-article

A Hybrid Model for Chinese Spelling Check

Authors:

Zhongye Jia

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 16, Issue 3

Article No.: 21, Pages 1 - 22

https://doi.org/10.1145/3047405

Published: 30 March 2017

Abstract

Spelling check is more challenging for Chinese than for many other languages. This article presents a hybrid model for Chinese spelling check. The model consists of three components: one graph-based model for generic errors and two independently trained models for specific errors. In the graph model, a directed acyclic graph is generated for each sentence, and a single-source shortest-path algorithm is run on the graph to detect and correct general spelling errors at the same time. Before that, two types of errors over function words (characters) are handled by conditional random fields: confusion between “在” (at; pinyin: zai) and “再” (again, more, then; pinyin: zai), and confusion among “的” (of; pinyin: de), “地” (-ly, adverb-forming particle; pinyin: de), and “得” (so that, have to; pinyin: de). Finally, a rule-based model is used to resolve pronoun confusion between “她” (she; pinyin: ta) and “他” (he; pinyin: ta), along with some other common collocation errors. The proposed model is evaluated on the standard datasets released by the SIGHAN Bake-off shared tasks and gives state-of-the-art results.
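To make the graph component concrete, the following is a minimal Python sketch of shortest-path correction over a sentence lattice. It is an illustration under stated assumptions, not the article's implementation: the abstract does not say how edges are built or weighted, so the toy vocabulary costs (`VOCAB_COST`), the confusion set (`CONFUSION`), the penalties, and the function names are all hypothetical. Nodes are character positions, each edge is a candidate word covering a span, and because the positions are already in topological order, the single-source shortest path reduces to one left-to-right pass.

```python
import math

# Hypothetical toy resources (assumptions, not taken from the article).
# Lower cost means a more plausible word, e.g. a negative log-probability.
VOCAB_COST = {"我": 2.0, "在": 2.0, "再": 2.5, "家": 2.5, "再见": 3.0, "现在": 3.0}
CONFUSION = {"在": ["再"], "再": ["在"]}   # similar-sounding characters
SUBSTITUTION_PENALTY = 1.5                # extra cost for changing a character
UNKNOWN_CHAR_COST = 10.0                  # fallback so the graph stays connected
MAX_WORD_LEN = 2


def candidates(span):
    """Yield (word, cost) edges for a span: the span itself, plus every
    in-vocabulary variant with exactly one character swapped via CONFUSION."""
    if span in VOCAB_COST:
        yield span, VOCAB_COST[span]
    elif len(span) == 1:
        yield span, UNKNOWN_CHAR_COST
    for k, ch in enumerate(span):
        for alt in CONFUSION.get(ch, []):
            word = span[:k] + alt + span[k + 1:]
            if word in VOCAB_COST:
                yield word, VOCAB_COST[word] + SUBSTITUTION_PENALTY


def correct(sentence):
    """Return the lowest-cost reading of the sentence over the lattice."""
    n = len(sentence)
    best = [math.inf] * (n + 1)   # best[j]: cheapest cost to reach position j
    back = [None] * (n + 1)       # back[j]: (previous position, word used)
    best[0] = 0.0
    for i in range(n):            # positions are already topologically sorted
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            for word, cost in candidates(sentence[i:j]):
                if best[i] + cost < best[j]:
                    best[j] = best[i] + cost
                    back[j] = (i, word)
    words = []
    j = n
    while j > 0:                  # follow back-pointers from the end node
        i, word = back[j]
        words.append(word)
        j = i
    return "".join(reversed(words))


print(correct("现再"))    # 现在
print(correct("我在家"))  # 我在家
```

Detection and correction happen in the same pass: a position counts as an error exactly when the cheapest path goes through a substituted edge. A realistic system would replace the toy costs with language-model scores and a much larger confusion set.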

Index Terms

A Hybrid Model for Chinese Spelling Check
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 16, Issue 3

September 2017

167 pages

ISSN: 2375-4699

EISSN: 2375-4702

DOI: 10.1145/3041821

Editor:
Nianwen Xue
Brandeis University, Waltham, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 March 2017

Accepted: 01 January 2017

Revised: 01 November 2016

Received: 01 July 2016

Published in TALLIP Volume 16, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Major Basic Research Program of Shanghai Science and Technology Committee
National Basic Research Program of China
National Natural Science Foundation of China
Cai Yuanpei Program
Art and Science Interdisciplinary Funds of Shanghai Jiao Tong University
Key Project of the National Society Science Foundation of China
