research-article

Multi-stage Legal Instrument Grammatical Error Correction via Seq2Edit and Data Augmentation

Published: 28 February 2024

Abstract

Legal instruments are the vehicle through which judicial organs and citizens exercise legal rights and enjoy legal benefits, so their text must be highly accurate. To help judicial personnel automatically detect and correct errors in legal instruments using machine-intelligent text proofreading, we propose a multi-stage Chinese Grammatical Error Correction model (MsCGEC). The model consists of two components: a grammatical error correction model and a spelling check model. The grammatical error correction model is based on the neural sequence-to-edit model GECToR, enhanced with pinyin information, and focuses on correcting grammatical errors; the spelling check model is based on MacBERT4CSC and focuses on correcting spelling errors. In addition, we propose a rule-based data augmentation strategy that generates a large amount of training data resembling erroneous legal instrument text as closely as possible, alleviating the data shortage problem. Our model achieves a correction F1 of 67.281 and a detection F1 of 77.028 on the stage 2 test set of the CAIL2022 legal instrument proofreading track.
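To make the sequence-to-edit idea concrete, the sketch below applies one edit tag per source character to produce a corrected sentence. It is a minimal illustration, not the authors' implementation: the tag names follow the basic transformations described for GECToR ($KEEP, $DELETE, $APPEND_x, $REPLACE_x), while the function name `apply_edit_tags`, the example sentence, and the hand-written tags are assumptions made for this example. The pinyin enhancement and the MacBERT4CSC spelling stage are not modelled here.

```python
from typing import List


def apply_edit_tags(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply one edit tag per source token and return the corrected tokens.

    Hypothetical helper: in a real seq2edit system the tags would be
    predicted by a tagger, here they are supplied by hand.
    """
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    out: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)                    # copy the token unchanged
        elif tag == "$DELETE":
            continue                             # drop the token
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])   # substitute the token
        elif tag.startswith("$APPEND_"):
            out.append(token)                    # keep the token ...
            out.append(tag[len("$APPEND_"):])    # ... and insert a new one after it
        else:
            raise ValueError(f"unknown tag: {tag}")
    return out


# Illustrative input with a redundant final character (made up for this sketch).
source = list("被告人犯盗窃罪罪")
tags = ["$KEEP"] * 7 + ["$DELETE"]
corrected = "".join(apply_edit_tags(source, tags))
print(corrected)
```

Because the tagger only has to choose an edit per position rather than regenerate the whole sentence, unchanged text is copied through verbatim, which suits documents where most characters are already correct.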



    Published In

    MLNLP '23: Proceedings of the 2023 6th International Conference on Machine Learning and Natural Language Processing
    December 2023
    252 pages
    ISBN: 9798400709241
    DOI: 10.1145/3639479

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Chinese Grammatical Error Correction
    2. data augmentation
    3. pre-training model
    4. sequence to edit

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MLNLP 2023
