Abstract
As the number of Arabic corpora continues to grow, so does the need for concordancing software that supports as many features of the Arabic language as possible and offers users a wider range of functions for corpus search and analysis. This paper evaluates six existing corpus search and analysis tools against eight criteria that appear to be the most essential for searching and analysing Arabic corpora, such as displaying Arabic text in its right-to-left direction, normalising diacritics and Hamza, and providing an Arabic user interface. The evaluation shows that three tools, Khawas, Sketch Engine, and aConCorde, meet most of the criteria and achieve the highest benchmark scores. The paper concludes that the developers’ deliberate attention to the linguistic features of Arabic when designing these three tools is the most significant factor behind their superiority.
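To make the criterion of normalising diacritics and Hamza concrete, the following is a minimal Python sketch of what such normalisation might look like in a search tool. It is not taken from any of the six evaluated tools. The function names `normalise` and `find_matches` are hypothetical, and the choice of which characters to strip or fold (the diacritic range U+064B to U+0652, and three alef variants folded onto bare alef) is an assumption made for illustration; a real tool may define both sets differently.

```python
import re

# Assumption: "diacritics" means the marks in U+064B..U+0652
# (tanwin, short vowels, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumption: Hamza normalisation folds these alef variants onto bare alef.
HAMZA_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda above -> alef
})


def normalise(text):
    """Strip diacritics, then fold Hamza-carrying alef forms to bare alef."""
    return DIACRITICS.sub("", text).translate(HAMZA_MAP)


def find_matches(query, tokens):
    """Return the tokens whose normalised form equals the normalised query."""
    target = normalise(query)
    return [tok for tok in tokens if normalise(tok) == target]
```

Under these assumptions, a query typed without diacritics or Hamza would still retrieve corpus tokens that carry them, because both sides are compared in normalised form while the original tokens are returned unchanged.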
Notes
The ALC may be accessed at http://www.arabiclearnercorpus.com.
Khawas can be downloaded from http://www.sourceforge.net/projects/kacst-acptool.
aConCorde can be downloaded from http://www.andy-roberts.net/coding/aconcorde.
AntConc can be downloaded from http://www.antlab.sci.waseda.ac.jp/software.html.
WordSmith Tools can be downloaded from http://www.lexically.net/wordsmith.
The manual can be accessed at http://www.lexically.net/wordsmith/step_by_step_Arabic6/index.html.
Sketch Engine can be accessed from http://www.sketchengine.co.uk.
IntelliText Corpus Queries can be accessed from http://smlc09.leeds.ac.uk/itb/htdocs/Query.html.
Acknowledgments
The authors would like to thank the developers, Abdulmohsen Althubaity, Andrew Roberts, Laurence Anthony, Mike Scott, Adam Kilgarriff and James Wilson for their valuable comments and suggestions to improve the quality of the paper.
Cite this article
Alfaifi, A., Atwell, E. Comparative evaluation of tools for Arabic corpora search and analysis. Int J Speech Technol 19, 347–357 (2016). https://doi.org/10.1007/s10772-015-9285-5
DOI: https://doi.org/10.1007/s10772-015-9285-5