VSEC: Transformer-Based Model for Vietnamese Spelling Correction

Dinh-Truong Do¹²,
Ha Thanh Nguyen¹³,
Thang Ngoc Bui¹² &
…
Hieu Dinh Vo¹²

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13032)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

Spelling error correction is a long-standing topic in natural language processing. Although previous studies have achieved remarkable results, challenges remain. For Vietnamese, a state-of-the-art method infers a syllable’s context from its adjacent syllables. Its accuracy can be unsatisfactory, however, because the model may lose the context when two or more spelling mistakes occur close together. In this paper, we propose a novel method for correcting Vietnamese spelling errors. We tackle both mistyped and misspelled errors with a deep learning model whose embedding layer is built on byte pair encoding. Unlike previous work on this problem, our approach uses a sequence-to-sequence model based on the Transformer architecture. We train the model on a large synthetic dataset into which spelling errors are randomly introduced, and we evaluate it on a realistic dataset containing 11,202 human-made misspellings in 9,341 different Vietnamese sentences. The experimental results show that our method achieves encouraging performance, detecting 86.8% of errors and correcting 81.5%, which improves on the state-of-the-art approach by 5.6% and 2.2%, respectively.
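The abstract states that the training data is synthetic, built by randomly introducing spelling errors into text. A minimal sketch of that idea follows. It is not the authors' procedure: the error operations, the `NEIGHBOURS` and `CONFUSIONS` tables, the `error_rate` parameter, and the function names are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical keyboard-neighbour map used to simulate "mistyped" errors.
NEIGHBOURS = {"a": "sq", "n": "bm", "o": "ip", "i": "uo", "t": "ry", "g": "fh"}

# Hypothetical initial-consonant confusion pairs used to simulate "misspelled" errors.
CONFUSIONS = [("ch", "tr"), ("s", "x"), ("d", "gi"), ("l", "n")]


def corrupt_syllable(syllable, rng):
    """Return a noisy variant of one syllable using a randomly chosen operation."""
    op = rng.choice(["swap_key", "delete", "transpose", "confuse"])
    if op == "swap_key":
        # Replace one character with a neighbouring key.
        positions = [i for i, c in enumerate(syllable) if c in NEIGHBOURS]
        if positions:
            i = rng.choice(positions)
            return syllable[:i] + rng.choice(NEIGHBOURS[syllable[i]]) + syllable[i + 1:]
    elif op == "delete" and len(syllable) > 1:
        # Drop one character (never producing an empty syllable).
        i = rng.randrange(len(syllable))
        return syllable[:i] + syllable[i + 1:]
    elif op == "transpose" and len(syllable) > 1:
        # Swap two adjacent characters.
        i = rng.randrange(len(syllable) - 1)
        return syllable[:i] + syllable[i + 1] + syllable[i] + syllable[i + 2:]
    elif op == "confuse":
        # Exchange a commonly confused initial consonant.
        for a, b in CONFUSIONS:
            if syllable.startswith(a):
                return b + syllable[len(a):]
            if syllable.startswith(b):
                return a + syllable[len(b):]
    return syllable  # operation not applicable; leave the syllable unchanged


def make_pair(sentence, error_rate=0.15, seed=None):
    """Build one (noisy, clean) training pair from a clean sentence."""
    rng = random.Random(seed)
    clean = sentence.split()
    noisy = [corrupt_syllable(s, rng) if rng.random() < error_rate else s
             for s in clean]
    return " ".join(noisy), " ".join(clean)
```

Calling `make_pair("tôi đang học tiếng việt", error_rate=0.3, seed=0)` yields a noisy source sentence and its clean target with the same number of syllables. Pairs of this shape are what a sequence-to-sequence corrector would be trained on, with the noisy side as input and the clean side as output.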

Acknowledgement

This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.18.61.

Author information

Authors and Affiliations

VNU University of Engineering and Technology, Hanoi, Vietnam
Dinh-Truong Do, Thang Ngoc Bui & Hieu Dinh Vo
Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Ha Thanh Nguyen

Corresponding author

Correspondence to Dinh-Truong Do.

Editor information

Editors and Affiliations

MIMOS Berhad, Kuala Lumpur, Malaysia
Duc Nghia Pham
Sirindhorn International Institute of Science and Technology, Thammasat University, Mueang Pathum Thani, Thailand
Thanaruk Theeramunkong
Data61, CSIRO, Brisbane, QLD, Australia
Guido Governatori
Department of Philosophy, Tsinghua University, Beijing, China
Fenrong Liu

About this paper

Cite this paper

Do, DT., Nguyen, H.T., Bui, T.N., Vo, H.D. (2021). VSEC: Transformer-Based Model for Vietnamese Spelling Correction. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science, vol 13032. Springer, Cham. https://doi.org/10.1007/978-3-030-89363-7_20

DOI: https://doi.org/10.1007/978-3-030-89363-7_20
Published: 01 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89362-0
Online ISBN: 978-3-030-89363-7
eBook Packages: Computer Science, Computer Science (R0)
