Abstract
Spelling errors are among the most basic errors in written text. The digital era has added a further source of such errors: the keyboard layout. Memorization, language orthography, and keyboard layout are all sources of spelling errors in electronic texts. Because English is the link language of the world, a considerable amount of work has been done on spelling error detection and plausible suggestion generation for English. This is not the case for languages with scarce digital resources, such as the Indian languages. Marathi, the official language of the Indian state of Maharashtra and the world’s 10th most spoken language, is no exception. Various computational approaches to spelling error detection and correction have been proposed in the literature; among these, similarity-based measures have proven prominent. This paper presents a detailed contrastive study of two popular similarity measures, minimum edit distance and cosine similarity, in the context of misspelled Marathi words. The philosophical and empirical aspects of both methods are also presented. For experimentation, we chose a dataset of 9,29,663 unique Marathi words harvested from various sources. We obtained an accuracy of 85.88% for the minimum edit distance algorithm and 86.76% for the cosine similarity algorithm.
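As an illustration of the two measures compared here, the following is a minimal sketch and not the authors' implementation. It assumes unit-cost Levenshtein distance for minimum edit distance and, for cosine similarity, vectors of character-bigram counts with a `#` boundary marker; the abstract does not state how words are vectorized, so that choice is an assumption. The function names, the tiny lexicon, and the parameter `k` are hypothetical. Strings are compared by Unicode code point, so a Devanagari vowel sign or virama counts as its own symbol.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bigram_counts(word: str) -> Counter:
    """Character-bigram counts; '#' marks word boundaries (an assumption)."""
    padded = f"#{word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two bigram-count vectors."""
    va, vb = bigram_counts(a), bigram_counts(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)


def suggest(word: str, lexicon: list[str], k: int = 3):
    """Return the top-k candidates under each measure."""
    by_distance = sorted(lexicon, key=lambda w: edit_distance(word, w))[:k]
    by_cosine = sorted(lexicon, key=lambda w: -cosine_similarity(word, w))[:k]
    return by_distance, by_cosine


if __name__ == "__main__":
    # Hypothetical three-word lexicon; the misspelling drops the virama.
    lexicon = ["पुस्तक", "पुस्तके", "मस्तक"]
    print(suggest("पुसतक", lexicon, k=2))
```

Edit distance ranks candidates by the fewest insertions, deletions, and substitutions, whereas cosine similarity ranks them by the overlap of their bigram profiles, so the two measures can order the same candidate list differently. Scanning a lexicon of several hundred thousand words this way is linear in the lexicon size per query; a practical system would likely prune candidates first.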
Ethics declarations
Conflicts of interests/competing interests
The authors have not received any funding for this research work and have no conflicts of interest or competing interests with respect to this work with any organization or third party. The authors further state that they have no financial interests directly or indirectly related to the work submitted for publication.
About this article
Cite this article
Patil, K.T., Bhavsar, R.P. & Pawar, B.V. Contrastive study of minimum edit distance and cosine similarity measures in the context of word suggestions for misspelled Marathi words. Multimed Tools Appl 82, 15573–15591 (2023). https://doi.org/10.1007/s11042-022-13948-z
DOI: https://doi.org/10.1007/s11042-022-13948-z