Abstract
Spelling error correction is one of the long-standing topics in natural language processing. Although previous studies have achieved remarkable results, challenges remain. For Vietnamese, a state-of-the-art method infers a syllable's context from its adjacent syllables. Its accuracy can be unsatisfactory, however, because the model may lose the context when two or more spelling mistakes occur close together. In this paper, we propose a novel method for correcting Vietnamese spelling errors. We tackle both mistyped errors and misspelled errors with a deep learning model whose embedding layer is powered by the byte pair encoding technique. The use of a sequence-to-sequence model based on the Transformer architecture distinguishes our approach from previous work on the same problem. In the experiment, we train the model on a large synthetic dataset into which spelling errors are randomly introduced. We then test the proposed method on a realistic dataset containing 11,202 human-made misspellings in 9,341 different Vietnamese sentences. The experimental results show that our method achieves encouraging performance, detecting 86.8% of the errors and correcting 81.5% of them, which are improvements of 5.6% and 2.2%, respectively, over the state-of-the-art approach.
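The abstract does not say how the synthetic training errors are generated. As an illustration only, the sketch below shows one plausible way to build (noisy, clean) sentence pairs by randomly corrupting syllables. The edit operations, the Latin-only `ALPHABET`, the `error_rate` parameter, and the function names are all assumptions made for this example, not the authors' procedure.

```python
import random

# Hypothetical replacement alphabet; a realistic Vietnamese noise model
# would also cover diacritics and keyboard-layout (e.g. Telex) confusions.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt_syllable(syllable: str, rng: random.Random) -> str:
    """Apply one random character-level edit to a syllable (assumed noise model)."""
    if len(syllable) < 2:
        return syllable  # too short to edit safely
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    i = rng.randrange(len(syllable) - 1)
    if op == "delete":
        return syllable[:i] + syllable[i + 1:]
    if op == "insert":
        return syllable[:i] + rng.choice(ALPHABET) + syllable[i:]
    if op == "substitute":
        return syllable[:i] + rng.choice(ALPHABET) + syllable[i + 1:]
    # transpose: swap the characters at positions i and i + 1
    return syllable[:i] + syllable[i + 1] + syllable[i] + syllable[i + 2:]


def make_pair(sentence: str, error_rate: float, rng: random.Random):
    """Return (noisy, clean); each syllable is corrupted with probability error_rate."""
    syllables = sentence.split()
    noisy = [
        corrupt_syllable(s, rng) if rng.random() < error_rate else s
        for s in syllables
    ]
    return " ".join(noisy), " ".join(syllables)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated data is reproducible
    noisy, clean = make_pair("tôi đang học tiếng Việt", 0.3, rng)
    print(noisy, "->", clean)
```

Pairs of this kind could serve as source and target sequences for a sequence-to-sequence corrector. Note that a substitution or transposition may occasionally leave a syllable unchanged, so the realized error rate can be slightly below `error_rate`.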
Acknowledgement
This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.18.61.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Do, DT., Nguyen, H.T., Bui, T.N., Vo, H.D. (2021). VSEC: Transformer-Based Model for Vietnamese Spelling Correction. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science, vol 13032. Springer, Cham. https://doi.org/10.1007/978-3-030-89363-7_20
DOI: https://doi.org/10.1007/978-3-030-89363-7_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89362-0
Online ISBN: 978-3-030-89363-7
eBook Packages: Computer Science, Computer Science (R0)