Hossein Hassani

University of Kurdistan - Hawler, Computer Science and Engineering, Faculty Member

I am a computational linguist and natural language processing expert. Starting as a programmer, I have held positions as software and database designer, senior analyst, project manager, software quality auditor, and senior IT consultant. These positions spanned projects on a variety of themes, from commercial software development for large statewide enterprises to machine translation development, and from Software Quality Management (SQM) to the adaptation of software solution and development frameworks (e.g. MSF and RUP). I joined the UKH academic staff in 2007, after more than 17 years of experience in the software industry. During my career in industry, I maintained my relationship with higher education by giving seminars and occasionally teaching at universities.

Papers by Hossein Hassani

The First Parallel Corpora for Kurdish Sign Language

arXiv (Cornell University), May 11, 2023

A Lingua Franca for Kurdish Populations

HAL (Le Centre pour la Communication Scientifique Directe), Apr 1, 2021

Kurdish Music Genre Recognition Using a CNN and DNN

ASEC 2022

Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

ArXiv, 2020

Segmentation is a fundamental step for most Natural Language Processing tasks. The Kurdish language is a multi-dialect, under-resourced language which is written in different scripts. The lack of various segmented corpora is one of the major bottlenecks in Kurdish language processing. We used Punkt, an unsupervised machine learning method, to segment a Kurdish corpus in the Sorani dialect, written in Persian-Arabic script. According to the literature, studies on using Punkt on non-Latin data are scarce. In our experiment, we achieved an F1 score of 91.10% and had an Error Rate of 16.32%. The high Error Rate is mainly due to the situation of abbreviations in Kurdish and partly because of ordinal numerals. The data, KTC-Segmented, is publicly available for non-commercial use under the CC BY-NC-SA 4.0 licence.
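
As a rough illustration of how an F1 score for sentence segmentation can be computed, the sketch below compares predicted sentence-boundary positions against gold ones. The function name `boundary_scores` and the toy offsets are hypothetical and are not taken from the paper; the abstract does not state how its Error Rate is defined, so that metric is deliberately left out.

```python
def boundary_scores(gold, predicted):
    """Return (precision, recall, F1) over sets of sentence-boundary offsets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # boundaries found in both sets
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical character offsets of sentence ends in a toy text.
gold_boundaries = [12, 40, 77, 103]
predicted_boundaries = [12, 40, 55, 103]  # one miss (77), one false alarm (55)

p, r, f1 = boundary_scores(gold_boundaries, predicted_boundaries)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Treating boundaries as a set of offsets keeps the comparison independent of how the segmenter represents sentences internally.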

Towards Kurdish Text to Sign Translation

The resources and technologies for Sign language processing of resourceful languages are emerging, while the low-resource languages are falling behind. Kurdish is a multi-dialect language, and it is considered a low-resource language. It is spoken by approximately 30 million people in several countries, which denotes that it has a large community with hearing-impairments as well. This paper reports on a project which aims to develop the necessary data and tools to process the Sign language for Sorani as one of the spoken Kurdish dialects. We present the results of developing a dataset in HamNoSys and its corresponding SiGML form for the Kurdish Sign lexicon. We use this dataset to implement a sign-supported Kurdish tool to check the accuracy of the Sign lexicon. We tested the tool by presenting it to hearing-impaired individuals. The experiment showed that 100% of the translated letters were understandable by a hearing-impaired person. The percentages were 65% for isolated words, an...

Towards Finite-State Morphology of Kurdish

ArXiv, 2020

Morphological analysis is the study of the formation and structure of words. It plays a crucial role in various tasks in Natural Language Processing (NLP) and Computational Linguistics (CL) such as machine translation and text and speech generation. Kurdish is a less-resourced multi-dialect Indo-European language with highly inflectional morphology. In this paper, as the first attempt of its kind, the morphology of the Kurdish language (Sorani dialect) is described from a computational point of view. We extract morphological rules which are transformed into finite-state transducers for generating and analyzing words. The result of this research assists in conducting studies on language generation for Kurdish and enhances the Information Retrieval (IR) capacity for the language while leveraging the Kurdish NLP and CL into a more advanced computational level.

Supervision of Undergraduate Final Year Projects in Computing: A Case Study

Education Sciences, 2018

Final Year Projects (FYPs) play a significant role in undergraduate education in the computing field of study, and most of the related university departments and schools consider them an essential contribution to this study. However, issues such as whether to assign the projects individually or to a group of students, the procedures followed in their assignment, the supervision process and the evaluation of the outcomes have been of concern to many academics in the field. In this case study, we present the methods for activities such as assignment, supervision, and evaluation of FYPs at the University of Kurdistan Hewlêr (UKH) between the years 2009 and 2017. We discuss the development of our approach and the lessons learned during the mentioned period. Furthermore, we present our current way of managing the FYP module. The aim is to develop a platform for interested and involved academics to discuss the topic further. Sharing the experiences on managing FYPs would not only help in ...

Part of Speech Tagging (POST) of a Low-resource Language using another Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged Persian (Farsi) Corpus)

Tagged corpora play a crucial role in a wide range of Natural Language Processing tasks. Part of Speech Tagging (POST) is essential in developing tagged corpora. It is time-consuming, effort-intensive, and costly, and therefore it could be more affordable if it were automated. The Kurdish language currently lacks publicly available tagged corpora of proper sizes. Tagging the publicly available Kurdish corpora can leverage the capability of those resources to a higher level than what raw or segmented corpora can provide. Developing POS-tagged lexicons can assist the mentioned task. We use a tagged corpus (Bijankhan corpus) in Persian (Farsi) as a close language to Kurdish to develop a POS-tagged lexicon. This paper presents the approach of leveraging the resource of a close language to Kurdish to enrich its resources. A partial dataset of the results is publicly available for non-commercial use under CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/. We plan to make the whole tagged co...

Kurdish (Sorani) Speech to Text: Presenting an Experimental Dataset

We present an experimental dataset, Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR), which we used in the first attempt at developing an automatic speech recognition system for Sorani Kurdish. The objective of the project was to develop a system that could automatically recognize simple sentences based on the vocabulary used in grades one to three of the primary schools in the Kurdistan Region of Iraq. We used CMUSphinx as our experimental environment. We developed a dataset to train the system. The dataset is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.

Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

Machine translation has been a major motivation of development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems thanks to deep learning methods, parallel corpora have remained indispensable for progress in the field. In an attempt to create parallel corpora for the Kurdish language, in this paper, we describe our approach to retrieving potentially alignable news articles from multi-language websites and manually aligning them across dialects and languages based on lexical similarity and transliteration of scripts. We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani. The corpus is publicly available under the CC BY-NC-SA 4.0 license.

Developing a Fine-grained Corpus for a Less-resourced Language: the case of Kurdish

Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in Sorani dialect. The corpus is normalized and categorized into 12 educational subjects containing 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
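
The token and type counts reported above allow a quick type/token ratio calculation, a common rough indicator of lexical variety. The sketch below only does the arithmetic on the two published figures; the helper name `type_token_ratio` is hypothetical, and the ratio itself is not reported in the abstract.

```python
def type_token_ratio(types, tokens):
    """Ratio of distinct word forms (types) to running words (tokens)."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return types / tokens


# Counts as stated in the abstract for KTC.
KTC_TOKENS = 693_800
KTC_TYPES = 110_297

ratio = type_token_ratio(KTC_TYPES, KTC_TOKENS)
print(f"type/token ratio: {ratio:.3f}")
print(f"average occurrences per type: {KTC_TOKENS / KTC_TYPES:.2f}")
```

The ratio depends strongly on corpus size, so it is best compared only between corpora of similar length.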

Transliterating Kurdish texts in Latin into Persian-Arabic script

Kurdish is written in different scripts. The two most popular scripts are Latin and Persian-Arabic. However, not all Kurdish readers are familiar with both scripts, a problem that automatic transliterators could resolve. So far, the developed tools mostly transliterate Persian-Arabic script into Latin. We present a transliterator that converts Kurdish texts in Latin into Persian-Arabic script. We also discuss the issues that should be considered in the transliteration process. The tool is a part of Kurdish BLARK, and it is publicly available for non-commercial use.
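
To make the idea of Latin to Persian-Arabic transliteration concrete, here is a minimal character-mapping sketch. It is not the tool described above: the mapping table is a commonly cited letter correspondence written from general knowledge, the rule that a word-initial vowel is preceded by a hamza seat is a simplification, and the names `LATIN_TO_ARABIC` and `transliterate_word` are hypothetical. Context-dependent cases are ignored.

```python
# Illustrative letter correspondences (an assumption for this sketch).
# Target letters are written as code points to avoid look-alike characters.
LATIN_TO_ARABIC = {
    "a": "\u0627", "b": "\u0628", "c": "\u062c", "ç": "\u0686",
    "d": "\u062f", "e": "\u06d5", "ê": "\u06ce", "f": "\u0641",
    "g": "\u06af", "h": "\u06be", "i": "",       "î": "\u06cc",
    "j": "\u0698", "k": "\u06a9", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "o": "\u06c6", "p": "\u067e", "q": "\u0642",
    "r": "\u0631", "s": "\u0633", "ş": "\u0634", "t": "\u062a",
    "u": "\u0648", "û": "\u0648\u0648", "v": "\u06a4", "w": "\u0648",
    "x": "\u062e", "y": "\u06cc", "z": "\u0632",
}

# Vowels that, in this simplified sketch, take a hamza seat word-initially.
INITIAL_VOWELS = set("aeêîouû")
HAMZA_SEAT = "\u0626"


def transliterate_word(word):
    """Map one Latin-script word to Persian-Arabic script, letter by letter."""
    word = word.lower()
    out = []
    if word and word[0] in INITIAL_VOWELS:
        out.append(HAMZA_SEAT)
    for ch in word:
        # Unknown characters (digits, punctuation) pass through unchanged.
        out.append(LATIN_TO_ARABIC.get(ch, ch))
    return "".join(out)


for sample in ("kurd", "ziman", "azad"):
    print(sample, "->", transliterate_word(sample))
```

Even this toy table shows why the task is not a plain substitution: the short vowel `i` has no written counterpart here, and `u` and `w` share one target letter, so a real transliterator needs rules beyond a lookup.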

Creating a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish

Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in Sorani dialect. The corpus is normalized and categorized into 12 educational subjects containing 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.

Towards electronic lexicography for the Kurdish language

This project has received funding from the European Union’s Horizon 2020 research and innovation ...

A Method for Proper Noun Extraction in Kurdish

This paper suggests a method for proper noun identification in Kurdish texts. Kurdish proper nouns are not capitalized and they also assume other part-of-speech roles, which leads to a broad ambiguity that should be addressed in Kurdish proper noun recognition applications. Kurdish is also among less-resourced languages. We developed an application based on an architecture which includes a number of name lists, a set of rules, and a set of processes that recognizes Kurdish person names. This can help the study of Information Retrieval (IR) in Kurdish to advance and can also be used in Kurdish machine translation. We conducted several experiments which showed that the precision of the method is more than 95%, the recall is between 40% and 80%, and the F-measure ranges from close to 60% to more than 80%. The low recall was because our name lists were not exhaustive enough to cover the vast majority of Kurdish names.
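
The relation between the reported precision, recall, and F-measure ranges can be checked with a short calculation. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which the abstract does not explicitly name, and plugs in 95% precision as a lower bound; the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


PRECISION = 0.95  # "more than 95%" in the abstract, used here as a lower bound

for recall in (0.40, 0.80):  # the two ends of the reported recall range
    score = f_measure(PRECISION, recall)
    print(f"recall={recall:.2f} -> F1={score:.3f}")
```

Under these assumptions the two ends come out at roughly 0.56 and 0.87, broadly in line with the stated range, and they show how strongly the low recall pulls the F-measure down despite the high precision.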

A Corpus of the Sorani Kurdish Folkloric Lyrics

Kurdish poetry and prose narratives were historically transmitted orally and less in a written form. Being an essential medium of oral narration and literature, Kurdish lyrics have had a unique attribute in becoming a vital resource for different types of studies, including Digital Humanities, Computational Folkloristics and Computational Linguistics. As an initial study of its kind for the Kurdish language, this paper presents our efforts in transcribing and collecting Kurdish folk lyrics as a corpus that covers various Kurdish musical genres, in particular Beyt, Gorani, Bend, and Heyran. We believe that this corpus contributes to Kurdish language processing in several ways, such as compensation for the lack of a long history of written text by incorporating oral literature, presenting an unexplored realm in Kurdish language processing, and assisting the initiation of Kurdish computational folkloristics. Our corpus contains 49,582 tokens in the Sorani dialect of Kurdish. The corpus...

Kurdish Optical Character Recognition

Currently, no offline tool is available for Optical Character Recognition (OCR) in Kurdish. Kurdish is spoken in different dialects and uses several scripts for writing. The Persian/Arabic script is widely used among these dialects. The Persian/Arabic script is written from Right to Left (RTL), it is cursive, and it uses unique diacritics. These features, particularly the last two, affect the segmentation stage in developing a Kurdish OCR. In this article, we introduce an enhanced method based on character segmentation which addresses the mentioned characteristics. We applied the method to text-only images and tested the Kurdish OCR using documents of different fonts, font sizes, and image resolutions. The results of the experiments showed that the character recognition accuracy of the proposed method was 90.82% on average.

Digital Humanities Readiness Assessment Framework: DHuRAF

ArXiv, 2019

This research suggests a framework, Digital Humanities Readiness Assessment Framework (DHuRAF), to assess the maturity level of the required infrastructure for Digital Humanities studies (DH) in different communities. We use a similar approach to the Basic Language Resource Kit (BLARK) in developing the suggested framework. DH as a fairly new field, which has emerged at an intersection of digital technologies and humanities, currently has no framework based on which one could assess the status of the essential elements required for conducting research in a specific language or community. DH offers new research opportunities and challenges in the humanities, computer science and its relevant technologies, hence such a framework could provide a starting point for educational strategists, researchers, and software developers to understand the prerequisites for their tasks and to have a statistical base for their decisions and plans. The suggested framework has been applied in the conte...

Can Linguistic Distance help Language Classification? Assessing Hawrami-Zaza and Kurmanji-Sorani

Whether to consider Hawrami and Zaza (Zazaki) standalone languages or dialects of a language has been discussed and debated for a while among linguists active in studying Iranian languages. The question of whether those languages/dialects belong to the Kurdish language or if they are independent descendants of Iranian languages was answered by MacKenzie (1961). However, a majority of people who speak the dialects are against that answer. Their disapproval mainly seems to be based on the sociological, cultural, and historical relationship among the speakers of the dialects. While the case of Hawrami and Zaza has remained unexplored and under-examined, an almost unanimous agreement exists about the classification of Kurmanji and Sorani as Kurdish dialects. The related studies to address the mentioned cases are primarily qualitative. However, computational linguistics could approach the question from a quantitative perspective. In this research, we look into three questions from a linguistic ...
