Multi-stage Legal Instrument Grammatical Error Correction via Seq2Edit and Data Augmentation

Authors:

Ting Jin

MLNLP '23: Proceedings of the 2023 6th International Conference on Machine Learning and Natural Language Processing

Pages 215 - 220

https://doi.org/10.1145/3639479.3639524

Published: 28 February 2024

Abstract

Legal instruments are the vehicle through which judicial organs and citizens exercise legal rights and enjoy legal benefits, so their text must be highly accurate. To help judicial personnel automatically detect and correct errors in legal instruments with machine text proofreading, we propose a multi-stage Chinese Grammatical Error Correction model (MsCGEC). The model consists of two components: a grammatical error correction model and a spelling check model. The grammatical error correction model is based on the neural sequence-to-edit model GECToR, enhanced with pinyin information, and focuses on correcting grammatical errors; the spelling check model is based on MacBERT4CSC and focuses on correcting spelling errors. In addition, we propose a rule-based data augmentation strategy that generates a large amount of training data as similar as possible to erroneous legal instrument text, alleviating the data shortage problem. Our model achieves a correction F1 of 67.281 and a detection F1 of 77.028 on the stage 2 test set of the CAIL2022 legal instrument proofreading track.
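To make the sequence-to-edit idea concrete, the following is a minimal sketch of how GECToR-style token-level edit tags can be applied to a source sentence. It is not the paper's implementation: the function name `apply_edit_tags`, the character-level tokenization, and the example sentence and tags are assumptions chosen for illustration. Only the tag families ($KEEP, $DELETE, $APPEND_x, $REPLACE_x) follow the basic transformations described for GECToR [7]. In a real system the tags would be predicted by the tagging model rather than written by hand.

```python
from typing import List

# Basic GECToR-style tag families; the helper itself is hypothetical.
KEEP = "$KEEP"
DELETE = "$DELETE"
APPEND_PREFIX = "$APPEND_"
REPLACE_PREFIX = "$REPLACE_"


def apply_edit_tags(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply exactly one edit tag per source token and return the edited tokens."""
    if len(tokens) != len(tags):
        raise ValueError("expected exactly one tag per token")
    out: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == KEEP:
            out.append(token)                        # copy the token unchanged
        elif tag == DELETE:
            continue                                 # drop a redundant token
        elif tag.startswith(APPEND_PREFIX):
            out.append(token)                        # keep the token ...
            out.append(tag[len(APPEND_PREFIX):])     # ... and insert a new one after it
        elif tag.startswith(REPLACE_PREFIX):
            out.append(tag[len(REPLACE_PREFIX):])    # substitute the token
        else:
            raise ValueError(f"unknown tag: {tag}")
    return out


# Illustrative (made-up) sentence with a missing, a redundant and a wrong character.
source = list("本院为被告人人盗窃财务")
tags = [
    "$KEEP", "$APPEND_认", "$KEEP",            # 本院为  -> 本院认为
    "$KEEP", "$KEEP", "$KEEP", "$DELETE",      # 被告人人 -> 被告人
    "$KEEP", "$KEEP",                          # 盗窃
    "$KEEP", "$REPLACE_物",                    # 财务    -> 财物
]
corrected = "".join(apply_edit_tags(source, tags))
print(corrected)
```

Because each tag is attached to one source token, decoding reduces to a single tagging pass followed by this deterministic rewrite, which is what distinguishes sequence-to-edit models from sequence-to-sequence generation. In the two-component design described above, a substitution such as 财务 to 财物 would more plausibly be handled by the spelling check model; it appears here only to show the replace tag.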

References

[1]

Abhijeet Awasthi, Sunita Sarawagi, Rasna Goyal, Sabyasachi Ghosh, and Vihari Piratla. 2019. Parallel iterative edit models for local sequence transduction. arXiv preprint arXiv:1910.02893 (2019).

[2]

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. arXiv preprint arXiv:2004.13922 (2020).

[3]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[4]

Jiquan Li, Junliang Guo, Yongxin Zhu, Xin Sheng, Deqiang Jiang, Bo Ren, and Linli Xu. 2022. Sequence-to-action: Grammatical error correction with action guided sequence generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 10974–10982.

[5]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).

[6]

Shirong Ma, Yinghui Li, Rongyi Sun, Qingyu Zhou, Shulin Huang, Ding Zhang, Li Yangning, Ruiyang Liu, Zhongli Li, Yunbo Cao, et al. 2022. Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction. arXiv preprint arXiv:2210.10442 (2022).

[7]

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR–grammatical error correction: tag, not rewrite. arXiv preprint arXiv:2005.12592 (2020).

[8]

Gaoqi Rao, Erhong Yang, and Baolin Zhang. 2020. Overview of NLPTEA-2020 shared task for Chinese grammatical error diagnosis. In Proceedings of the 6th workshop on natural language processing techniques for educational applications. 25–35.

[9]

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing. 32–37.

[10]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).

[11]

Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, and Luo Si. 2019. StructBERT: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577 (2019).

[12]

Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing. 35–42.

[13]

Xiuyu Wu and Yunfang Wu. 2022. From Spelling to Grammar: A New Framework for Chinese Grammatical Error Correction. arXiv preprint arXiv:2211.01625 (2022).

[14]

Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, et al. 2018. CAIL2018: A large-scale legal dataset for judgment prediction. arXiv preprint arXiv:1807.02478 (2018).

[15]

Yang Xin, Hai Zhao, Yuzhu Wang, and Zhongye Jia. 2014. An improved graph model for Chinese spell checking. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing. 157–166.

[16]

Hao Xu, Chunhui He, Chong Zhang, Zhen Tan, Shengze Hu, and Bin Ge. 2022. A Multi-channel Chinese Text Correction Method Based on Grammatical Error Diagnosis. In 2022 8th International Conference on Big Data and Information Analytics (BigDIA). IEEE, 396–401.

[17]

Lvxiaowei Xu, Jianwang Wu, Jiawei Peng, Jiayu Fu, and Ming Cai. 2022. FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction. arXiv preprint arXiv:2210.12364 (2022).

[18]

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing. 126–132.

[19]

Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 380–386.

[20]

Zheng Yuan and Mariano Felice. 2013. Constrained grammatical error correction using statistical machine translation. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task. 52–61.

[21]

Tianchi Yue, Shulin Liu, Huihui Cai, Tao Yang, Shengkang Song, and Tinghao Yu. 2022. Improving Chinese Grammatical Error Detection via Data augmentation by Conditional Error Generation. In Findings of the Association for Computational Linguistics: ACL 2022. 2966–2975.

[22]

Yue Zhang, Zhenghua Li, Zuyi Bao, Jiacheng Li, Bo Zhang, Chen Li, Fei Huang, and Min Zhang. 2022. MuCGEC: A multi-reference multi-source evaluation dataset for Chinese grammatical error correction. arXiv preprint arXiv:2204.10994 (2022).

[23]

Yue Zhang, Bo Zhang, Zhenghua Li, Zuyi Bao, Chen Li, and Min Zhang. 2022. SynGEC: Syntax-Enhanced Grammatical Error Correction with a Tailored GEC-Oriented Parser. arXiv preprint arXiv:2210.12484 (2022).

[24]

Hai Zhao, Deng Cai, Yang Xin, Yuzhu Wang, and Zhongye Jia. 2017. A hybrid model for Chinese spelling check. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 16, 3 (2017), 1–22.

[25]

Yuanyuan Zhao, Nan Jiang, Weiwei Sun, and Xiaojun Wan. 2018. Overview of the NLPCC 2018 shared task: Grammatical error correction. In Natural Language Processing and Chinese Computing: 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings, Part II 7. Springer, 439–445.

Cited By

J. Liu and X. Luo. 2024. An MLM Decoding Space Enhancement for Legal Document Proofreading. In Knowledge Science, Engineering and Management, 57–73. Online publication date: 26 July 2024.
https://doi.org/10.1007/978-981-97-5492-2_5

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published In

MLNLP '23: Proceedings of the 2023 6th International Conference on Machine Learning and Natural Language Processing

December 2023

252 pages

ISBN: 9798400709241

DOI: 10.1145/3639479

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Natural Science Foundation of Hainan Province of China
National Natural Science Foundation of China

Conference

MLNLP 2023: 2023 6th International Conference on Machine Learning and Natural Language Processing

December 27 - 29, 2023

Sanya, China
