Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

An Unsupervised Domain-Adaptive Framework for Chinese Spelling Checking

Published: 21 November 2024 Publication History

Abstract

Chinese Spelling Check (CSC) is a meaningful task in the area of natural language processing, which aims at detecting spelling errors in Chinese texts and then correcting these errors. Current typical CSC models have shown impressive performance in general datasets with the help of pretrained language models such as BERT, but they suffer great performance loss in downstream tasks with domain-specific terms because they are primarily trained on general corpora. To verify the cross-domain adaptation ability of these models, we build three new datasets with abundant domain-specific terms on financial, medical, and legal domains and conduct empirical investigations on them in the corresponding domain-specific test datasets to verify the cross-domain adaptation ability. In response to the poor performance of the existing models, we propose a framework named uChecker, which utilizes an unsupervised method in spelling error detection and correction. Experimental results prove that uChecker can perform well in domain-specific test datasets while not losing its performance in the general domain.

References

[1]
Haithem Afli, Zhengwei Qiu, Andy Way, and Páraic Sheridan. 2016. Using SMT for OCR error correction of historical texts. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC ’16). 962–966. https://aclanthology.org/L16-1153
[2]
Zuyi Bao, Chen Li, and Rui Wang. 2020. Chunk-based Chinese spelling check with global optimization. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, 2031–2040.
[3]
Kuan-Yu Chen, Hung-Shin Lee, Chung-Han Lee, Hsin-Min Wang, and Hsin-Hsi Chen. 2013. A study of language modeling for Chinese spelling check. In Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing. 79–83. https://aclanthology.org/W13-4414
[4]
Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, and Yuan Qi. 2020. SpellGCN: Incorporating phonological and visual similarities into language models for Chinese spelling check. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL ’20). 871–881.
[5]
Hsun-Wen Chiu, Jian-Cheng Wu, and Jason S. Chang. 2013. Chinese spelling checker based on statistical machine translation. In Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing. 49–53. https://aclanthology.org/W13-4408
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[7]
Shichao Dong, Gabriel Pui Cheong Fung, Binyang Li, Baolin Peng, Ming Liao, Jia Zhu, and Kam-Fai Wong. 2016. ACE: Automatic colloquialism, typographical and orthographic errors detection for Chinese language. In Proceedings of the 26th International Conference on Computational Linguistics: System Demonstrations(COLING ’16). 194–197. https://aclanthology.org/C16-2041
[8]
Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. 2010. A large scale ranker-based system for search query spelling correction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING ’10). 358–366. https://aclanthology.org/C10-1041
[9]
Zhao Guo, Yuan Ni, Keqiang Wang, Wei Zhu, and Guotong Xie. 2021. Global attention decoder for Chinese spelling error correction. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, 1410–1428.
[10]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’16). 770–778.
[11]
Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, and Junhui Liu. 2019. FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm. In Proceedings of the 5th Workshop on Noisy User-Generated Text (W-NUT@EMNLP ’19). 160–169.
[12]
Li Huang, Junjie Li, Weiwei Jiang, Zhiyu Zhang, Minchuan Chen, Shaojun Wang, and Jing Xiao. 2021. PHMOSpell: Phonological and morphological knowledge guided Chinese spelling check. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Volume 1 (Long Papers). 5958–5967.
[13]
Zhongye Jia, Peilu Wang, and Hai Zhao. 2013. Graph model for Chinese spell checking. In Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing. 88–92. https://aclanthology.org/W13-4416
[14]
C.-L. Liu, M.-H. Lai, K.-W. Tien, Y.-H. Chuang, S.-H. Wu, and C.-Y. Lee. 2011. Visually and phonologically similar characters in incorrect Chinese words: Analyses, identification, and applications. ACM Transactions on Asian and Low-Resource Language Information Processing 10, 2 (2011), Article 10, 39 pages.
[15]
Chao-Lin Liu, Min-Hua Lai, Yi-Hsuan Chuang, and Chia-Ying Lee. 2010. Visually and phonologically similar characters in incorrect simplified Chinese words. In COLING 2010: Posters. Coling 2010 Organizing Committee, Beijing, China, 739–747. https://aclanthology.org/C10-2085
[16]
Shulin Liu, Shengkang Song, Tianchi Yue, Tao Yang, Huihui Cai, TingHao Yu, and Shengli Sun. 2022. CRASpell: A contextual typo robust approach to improve Chinese spelling correction. In Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, 3008–3018.
[17]
Shulin Liu, Tao Yang, Tianchi Yue, Feng Zhang, and Di Wang. 2021. PLOME: Pre-training with misspelled knowledge for Chinese spelling correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Volume 1 (Long Papers) (ACL/IJCNLP ’21). 2991–3000.
[18]
Xiaodong Liu, Fei Cheng, Yanyan Luo, Kevin Duh, and Yuji Matsumoto. 2013. A hybrid Chinese spelling correction using language model and statistical machine translation with reranking. In Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing (SIGHAN@IJCNLP ’13). 54–58.
[19]
Qi Lv, Ziqiang Cao, Lei Geng, Chunhui Ai, Xu Yan, and Guohong Fu. 2022. General and domain adaptive Chinese spelling check with error consistent pretraining. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 4 (Sept. 2022), Article 124, 18 pages.
[20]
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 2 (1989), 257–286.
[21]
Shuming Shi, Enbo Zhao, Duyu Tang, Yan Wang, Piji Li, Wei Bi, Haiyun Jiang, Guoping Huang, Leyang Cui, Xinting Huang, Cong Zhou, Yong Dai, and Dongyang Ma. 2022. Effidit: Your AI writing assistant. CoRR abs/2208.01815 (2022).
[22]
Jingkuan Song, Zhao Guo, Lianli Gao, Wu Liu, Dongxiang Zhang, and Heng Tao Shen. 2017. Hierarchical LSTM with adjusted temporal attention for video captioning. arXiv preprint arXiv:1706.01231 (2017).
[23]
Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 Bake-off for Chinese spelling check. In Proceedings of the 8th SIGHAN Workshop on Chinese Language Processing. 32–37.
[24]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017), 1–11.
[25]
Baoxin Wang, Wanxiang Che, Dayong Wu, Shijin Wang, Guoping Hu, and Ting Liu. 2021. Dynamic connected networks for Chinese spelling check. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Chenqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, 2437–2446.
[26]
Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2517–2527.
[27]
Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2517–2527.
[28]
Dingmin Wang, Yi Tay, and Li Zhong. 2019. Confusionset-guided pointer networks for Chinese spelling check. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Volume 1 (Long Papers) (ACL ’19). 5780–5785.
[29]
Shih-Hung Wu, Yong-Zhi Chen, Ping-Che Yang, Tsun Ku, and Chao-Lin Liu. 2010. Reducing the false alarm rate of Chinese character error detection and correction. In Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing. https://aclanthology.org/W10-4107
[30]
Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN Bake-off 2013. In Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing. 35–42. https://aclanthology.org/W13-4406
[31]
Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018. CAIL2018: A large-scale legal dataset for judgment prediction. arXiv:1807.02478 (2018).
[32]
Weijian Xie, Peijie Huang, Xinrui Zhang, Kaiduo Hong, Qiang Huang, Bingzhou Chen, and Lei Huang. 2015. Chinese spelling check system based on N-gram model. In Proceedings of the 8th SIGHAN Workshop on Chinese Language Processing. 128–136.
[33]
Yang Xin, Hai Zhao, Yuzhu Wang, and Zhongye Jia. 2014. An improved graph model for Chinese spell checking. In Proceedings of the 3rd CIPS-SIGHAN Joint Conference on Chinese Language Processing. 157–166.
[34]
Heng-Da Xu, Zhongli Li, Qingyu Zhou, Chao Li, Zizhen Wang, Yunbo Cao, Heyan Huang, and Xian-Ling Mao. 2021. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, 716–728.
[35]
Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of the 3rd CIPS-SIGHAN Joint Conference on Chinese Language Processing. 220–223.
[36]
Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 Bake-off for Chinese spelling check. In Proceedings of the 3rd CIPS-SIGHAN Joint Conference on Chinese Language Processing. 126–132.
[37]
H. Zan, W. Li, K. Zhang, Y. Ye, and Z. Sui. 2021. Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation. Chinese Lexical Semantics.
[38]
Ningyu Zhang, Mosha Chen, Zhen Bi, Xiaozhuan Liang, Lei Li, Xin Shang, Kangping Yin, Chuanqi Tan, Jian Xu, Fei Huang, Luo Si, Yuan Ni, Guotong Xie, Zhifang Sui, Baobao Chang, Hui Zong, Zheng Yuan, Linfeng Li, Jun Yan, Hongying Zan, Kunli Zhang, Buzhou Tang, and Qingcai Chen. 2022. CBLUE: A Chinese biomedical language understanding evaluation benchmark. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Volume 1 (Long Papers). 7888–7915.
[39]
Ruiqing Zhang, Chao Pang, Chuanqiang Zhang, Shuohuan Wang, Zhongjun He, Yu Sun, Hua Wu, and Haifeng Wang. 2021. Correcting Chinese spelling errors with phonetic pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, 2250–2261.
[40]
Shaohua Zhang, Haoran Huang, Jicong Liu, and Hang Li. 2020. Spelling error correction with soft-masked BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL ’20). 882–890. https://www.aclweb.org/anthology/2020.acl-main.82/
[41]
Shaohua Zhang, Haoran Huang, Jicong Liu, and Hang Li. 2020. Spelling error correction with soft-masked BERT. CoRR abs/2005.07421 (2020). https://arxiv.org/abs/2005.07421
[42]
Haoxi Zhong, Chaojun Xiao, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018. Overview of CAIL2018: Legal Judgment Prediction competition. arXiv:1810.05851 (2018).

Index Terms

  1. An Unsupervised Domain-Adaptive Framework for Chinese Spelling Checking

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Asian and Low-Resource Language Information Processing
    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 11
    November 2024
    248 pages
    EISSN:2375-4702
    DOI:10.1145/3613714
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 November 2024
    Online AM: 27 August 2024
    Accepted: 18 August 2024
    Revised: 18 July 2024
    Received: 16 August 2023
    Published in TALLIP Volume 23, Issue 11

    Check for updates

    Author Tags

    1. Natural language processing
    2. Chinese spelling check
    3. domain adaptation
    4. unsupervised learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Scientific Research Starting Foundation of Nanjing University of Aeronautics and Astronautics
    • High Performance Computing Platform of Nanjing University of Aeronautics and Astronautics

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 165
      Total Downloads
    • Downloads (Last 12 months)165
    • Downloads (Last 6 weeks)73
    Reflects downloads up to 21 Nov 2024

    Other Metrics

    Citations

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Full Text

    View this article in Full Text.

    Full Text

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media