A Hybrid Model for Chinese Spelling Check

Published: 30 March 2017

Abstract

Spelling check is more challenging for Chinese than for other languages. This article presents a hybrid model for Chinese spelling check. The hybrid model consists of three components: one graph-based model for generic errors and two independently trained models for specific errors. In the graph model, a directed acyclic graph is generated for each sentence, and a single-source shortest-path algorithm is run on the graph to detect and correct general spelling errors at the same time. Prior to that, two types of errors over function words (characters) are first resolved by conditional random fields: the confusion between “在” (at; pinyin: zai) and “再” (again, more, then; pinyin: zai), and the confusion among “的” (of; pinyin: de), “地” (-ly, adverb-forming particle; pinyin: de), and “得” (so that, have to; pinyin: de). Finally, a rule-based model is used to resolve confusion in pronoun usage between “她” (she; pinyin: ta) and “他” (he; pinyin: ta), as well as some other common collocation errors. The proposed model is evaluated on the standard datasets released by the SIGHAN Bake-off shared tasks and gives state-of-the-art results.
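
To make the graph component more concrete, the following is a minimal sketch of how a shortest path over a sentence graph can detect and correct an error in one pass. It is not the authors' implementation: the function name `correct`, the toy lexicon costs (standing in for language-model scores such as negative log-probabilities), the confusion set, and the `sub_penalty` and `oov_cost` parameters are all assumptions made for illustration. Nodes are character positions, each edge spans one candidate word (possibly with confusable characters substituted), and edges only point forward, so relaxing nodes from left to right solves the single-source shortest-path problem on the resulting directed acyclic graph.

```python
from typing import Dict, List, Set, Tuple


def correct(sentence: str,
            lexicon: Dict[str, float],
            confusion: Dict[str, Set[str]],
            sub_penalty: float = 1.0,
            oov_cost: float = 10.0) -> str:
    """Return the cheapest rewriting of `sentence` (toy sketch, assumed costs)."""
    n = len(sentence)
    inf = float("inf")
    # best[i] is the cost of the cheapest path from node 0 to node i.
    best: List[float] = [inf] * (n + 1)
    # back[j] stores (previous node, word emitted on the edge into j).
    back: List[Tuple[int, str]] = [(-1, "")] * (n + 1)
    best[0] = 0.0

    # Positions 0..n are already in topological order, so a single
    # left-to-right relaxation pass finds the shortest path in the DAG.
    for i in range(n):
        # Fallback edge: keep the character unchanged at a high cost,
        # which guarantees that the end node is always reachable.
        edges: List[Tuple[int, str, float]] = [(i + 1, sentence[i], oov_cost)]
        for word, cost in lexicon.items():
            j = i + len(word)
            if j > n:
                continue
            subs = 0
            matched = True
            for src, tgt in zip(sentence[i:j], word):
                if src == tgt:
                    continue
                if tgt in confusion.get(src, set()):
                    subs += 1  # replace src with a confusable character
                else:
                    matched = False
                    break
            if matched:
                edges.append((j, word, cost + sub_penalty * subs))
        for j, word, cost in edges:
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, word)

    # Walk the back-pointers from the end node to recover the output.
    words: List[str] = []
    node = n
    while node > 0:
        node, word = back[node]
        words.append(word)
    return "".join(reversed(words))


# Hypothetical toy data: lower cost means a more plausible word.
LEXICON = {"我": 2.0, "在": 2.0, "再": 2.0, "家": 2.0, "在家": 2.5, "再见": 2.5}
CONFUSION = {"再": {"在"}, "在": {"再"}}

print(correct("我再家", LEXICON, CONFUSION))
```

With these toy costs, the path through the two-character word “在家” (one substitution) is cheaper than the path that keeps “再” as a separate word, so the misspelled “我再家” is rewritten as “我在家”, while an already correct sentence is returned unchanged. Detection and correction happen together because the positions where the chosen path substitutes a character are exactly the detected errors.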


    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 16, Issue 3
    September 2017
    167 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3041821

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 March 2017
    Accepted: 01 January 2017
    Revised: 01 November 2016
    Received: 01 July 2016
    Published in TALLIP Volume 16, Issue 3

    Author Tags

    1. Chinese spelling check
    2. conditional random field
    3. graph model
    4. hybrid model
    5. rule-based model

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Major Basic Research Program of Shanghai Science and Technology Committee
    • National Basic Research Program of China
    • National Natural Science Foundation of China
    • Cai Yuanpei Program
    • Art and Science Interdisciplinary Funds of Shanghai Jiao Tong University
    • Key Project of the National Society Science Foundation of China
