1 Introduction
Natural Language Processing (NLP) deals with understanding and generating human language by automated or semi-automated means. In natural language understanding, a machine takes sentences as input and produces their structural representation. In generation, a machine generates sentences of the language using the structure learned in the understanding phase. NLP also deals with machine translation, where the machine takes an input sentence in one language, called the source language, understands it, and then generates a sentence in another language (the target language) as output. Nowadays, information is available over the web in many languages; to make it available in other languages for information sharing and exchange, machine translation is needed.
Part-of-speech (POS) tagging is the process of assigning a lexical label/category to each word in a sentence of a language. For example, given the sentence "Ali wrote a letter," POS tagging will assign the label noun to "Ali," verb to "wrote," determiner to "a," and noun to "letter." POS tagging is an important task in machine translation. A POS tagger is software that reads text in a language, divides it into sentences and words, and assigns a part of speech to each word in a sentence [Toutanova et al. 2003]. The performance of machine translation depends on POS tagging. During translation, it is necessary to understand the structure of a sentence so that, for example, a word tagged "verb" in the source language has an equivalent word or group of words of the same category (verb) in the target language. To improve the performance of machine translators, researchers have explored POS tagging [Khan et al. 2020; Advaith et al. 2022; Dwivedi and Malakar 2015]. Several approaches are used for POS tagging, e.g., rule-based approaches [Vaishali et al. 2022; Li et al. 2021; Dwivedi and Malakar 2015], statistical approaches [Yajnik and Prajapati 2017; Ye et al. 2016; Spitkovsky et al. 2011], hybrid approaches [Izzi and Ferilli 2020; Farrah et al. 2018; Dwivedi and Malakar 2015], maximum entropy-based approaches [Zhao and Wang 2002; Pattnaik and Nayak 2022; Ekbal et al. 2008; Rico-Sulayes et al. 2017], and deep learning-based approaches [Hirpassa and Lehal 2023; Warjri et al. 2021; Pathak et al. 2022; Collobert et al. 2011; Ma and Hovy 2016].

Pashto, also known as Pukhtoo, is a language spoken in Pakistan and Afghanistan. More than 37 million people in Pakistan and Afghanistan speak Pashto as their mother tongue. Furthermore, relatively small Pashto-speaking communities exist in Iran, India, Tajikistan, the United Arab Emirates, and the United Kingdom. In 1936, Pashto became the first official language and Persian the second official language of Afghanistan [Tegey and Robson 1996].
Our focus in this article is on POS tagging for the Pashto language, with special attention to words that have more than one meaning in a sentence. Example sentences containing such words, along with their translations produced by Google Translate, are given in Figure 1. Consider the two sentences: the word "saw" has two meanings (verb and noun), yet in translation it is treated only one way, as a verb, in both sentences. The main contributions of our research work are
(i) Annotated Corpus of Pashto Language: To find the appropriate tag for a word having more than one meaning, we require a corpus that contains such words and their tags. No such corpus, containing words with more than one tag, is available for Pashto. We therefore created an annotated corpus of the Pashto language that contains words with more than one tag. The corpus was created by collecting sentences from Pashto Grammar [Tegey and Robson 1996], Pashto websites,1 and Pashto dictionaries [Achakzai 2017; Khan 2017];
(ii) Deep Learning Approach to POS Tagging: The context of a word plays an important role in determining its correct meaning/tag, specifically when the word has more than one tag associated with it. Deep learning approaches have shown promising results in NLP tasks such as text processing, sequence-to-sequence learning, Named Entity Recognition, Semantic Role Labeling, Machine Translation, and POS tagging. We applied a deep learning-based approach to POS tagging of the Pashto language. For this, we did extensive work to find appropriate representations of words/sentences and an appropriate deep learning architecture; and
(iii) Empirical Validation of POS Tagging Approaches: We conducted experiments to validate our deep learning-based POS tagging approach. Our experiments show that deep learning-based approaches give better results than statistical approaches for Pashto POS tagging.
The rest of this article is organized as follows. Section 2 discusses related work, and Section 3 provides background on deep learning in NLP and the Hidden Markov Model. Section 4 presents our proposed approach and dataset, and Section 5 presents results and evaluation. Finally, Section 6 concludes our work.
3 Deep Learning in NLP
Deep learning is a subfield of machine learning that deals with learning from data at multiple levels of abstraction. Deep learning has solved, with remarkable results, problems of Artificial Intelligence that were considered unsolvable by traditional AI techniques [LeCun et al. 2015; Wang et al. 2017; Zheng et al. 2013]. The ability of deep learning to model complex tasks, such as object detection in images, Named Entity Recognition, Semantic Role Labeling, and Machine Translation, has played a key role in the success of AI.
Deep learning also plays a key role in Natural Language Processing. Researchers are applying deep learning to NLP tasks such as translation, sentence modeling, sentiment analysis, POS tagging, speech recognition, and many other applications [Rawat et al. 2022; De Mulder et al. 2015; Du and Shanker 2009].
Deep learning models learn by finding hierarchical or sequential patterns in the input data [Alzubaidi et al. 2021]. For example, in an image a square object can be formed by connecting four lines of equal length, each line can be formed by adding multiple points depending on its length, and a point can be formed from multiple pixels depending on its size. In contrast, a document is made up of multiple paragraphs arranged in a sequence, a paragraph is formed by adding multiple coherent sentences in a sequence, and a sentence is formed by combining words in a specific sequence; changing that sequence of words can make the sentence meaningless. For hierarchical tasks, convolutional neural networks (CNNs) usually perform well, whereas for sequential learning recurrent neural networks (RNNs) perform well. An RNN learns from its input over time and updates its weights on the basis of the gradient. The gradient gives the direction toward the expected target; if the gradient becomes zero or tends to zero, learning halts before the target/expected output is achieved. An illustration is given in Figure 3: at the first input, the network learns a feature, represented by a dark shadow in the cell; over the next few time steps it decays; and by time step 5 there is no learned information and the cell is completely white [Graves 2012].
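To make the decay concrete, the following is a minimal numeric sketch (an illustrative example of ours, not part of the original study): a gradient that is multiplied by a recurrent factor smaller than one at every time step shrinks toward zero as it is propagated back through the sequence, which is exactly the point at which learning halts.

```python
# Illustrative sketch of the vanishing gradient effect in a simple RNN:
# the gradient reaching earlier time steps is (roughly) a product of
# per-step factors; if those factors are < 1, the product decays to ~0.
def backpropagated_gradient(step_factor: float, num_steps: int) -> float:
    grad = 1.0
    for _ in range(num_steps):
        grad *= step_factor  # one recurrent multiplication per time step
    return grad

for steps in (1, 5, 10, 50):
    print(steps, backpropagated_gradient(0.5, steps))
# 1 0.5, 5 0.03125, 10 ~0.00098, 50 ~8.9e-16 -> effectively no learning signal
```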
Long Short-Term Memory (LSTM) networks are a kind of RNN introduced to overcome the vanishing gradient problem of RNNs. An LSTM maintains a hidden layer of memory cells that store information for a long time. Each cell provides three basic operations (read, write, and erase) through three gates (input gate, output gate, and forget gate). The LSTM reads information through the input gate, learns features from it, and produces output through the output gate. The learned representation of the input propagates through the network until the end of the input sequence. At each time step, the network takes an input, combines it with previous information, and produces output. It can also forget information at any stage if it is no longer needed. In this way, the LSTM does not suffer from the vanishing gradient problem and keeps information for a long time, so long contexts can be captured [Graves 2012]. The information flow and gate mechanism of the LSTM are highlighted in Figure 4, where "o" represents an open gate and "-" represents a closed gate. The network learns from the input sequence over time. At time \(step_1\) the network takes an input, learns from it, and keeps it for future use; at time \(step_4\) and time \(step_6\) the network outputs the learned information while still keeping it for future use; at time \(step_7\) the network allows the information to be overwritten. The LSTM thus stores information and carries it over long distances [Graves 2012].
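For reference, the gate mechanism described above is usually written with the standard LSTM equations (the textbook formulation, not notation introduced in this article), where \(x_t\) is the input, \(h_t\) the hidden state, \(c_t\) the memory cell, \(\sigma\) the logistic sigmoid, \(\odot\) element-wise multiplication, and \(i_t\), \(f_t\), \(o_t\) the input, forget, and output gates:

\(i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)\) (input gate: write),
\(f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)\) (forget gate: erase),
\(o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)\) (output gate: read),
\(\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)\) (candidate cell content),
\(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\) (cell state update),
\(h_t = o_t \odot \tanh(c_t)\) (hidden state/output).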
3.1 Hidden Markov Model
The Hidden Markov Model is a probabilistic, statistical model based on the Markov property, which states that the future depends only on the present, not on the past. At a particular state of the model, we can only consider the likelihood of the next state (a transition, or forward jump), and the events that may occur at that state. The Hidden Markov Model uses transition probabilities \(P_{T}\) to account for the co-occurrence of words and to capture context, and emission probabilities \(P_{E}\) to account for the chance of an event occurring given the present state [Schönhuth 2009]. For the POS tagging problem, consider the words of a sentence as \((w_1 \ldots w_n)\) and the tags as \((t_1 \ldots t_n)\). More formally, we can define a Markov model for POS tagging as follows [Charniak et al. 1993]:

\(P(t_1 \ldots t_n | w_1 \ldots w_n) \approx \prod_{i=1}^{n} P(t_i | t_{i-1})\, P(w_i | t_i)\)  (1)
In Equation (1), the first factor of the product, \(P(t_i | t_{i-1})\), is the probability of tag \(t_i\) given tag \(t_{i-1}\), also known as the transition probability, and the second factor, \(P(w_i| t_i)\), is the probability of word \(w_i\) given tag \(t_i\), also known as the emission probability.
The HMM is a well-known model for tagging sequential data, such as in part-of-speech tagging. Various HMM-based approaches have been proposed for POS tagging, but none of them exploits HMM for POS tagging of the Pashto language. In this work, we tailored the implementation of Rudd [2009] for Pashto.
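To illustrate how the transition and emission probabilities in Equation (1) can be estimated from a tagged corpus and used to decode a tag sequence, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding (our own simplified example with English stand-in data, not the Rudd [2009] implementation used in this work).

```python
from collections import defaultdict

# Minimal bigram-HMM tagger sketch: transition probabilities P(t_i | t_{i-1})
# and emission probabilities P(w_i | t_i) are estimated by counting over a
# tagged corpus, and the best tag sequence is decoded with the Viterbi algorithm.

def normalize(counts):
    """Turn nested count dictionaries into conditional probabilities."""
    return {ctx: {x: c / sum(d.values()) for x, c in d.items()}
            for ctx, d in counts.items()}

def train_hmm(tagged_sentences):
    trans = defaultdict(lambda: defaultdict(int))   # P_T counts
    emit = defaultdict(lambda: defaultdict(int))    # P_E counts
    for sentence in tagged_sentences:               # sentence = [(word, tag), ...]
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return normalize(trans), normalize(emit)

def viterbi(words, trans, emit, tags, eps=1e-12):
    # eps is a crude stand-in for smoothing of unseen transitions/emissions.
    best = {t: (trans.get("<s>", {}).get(t, eps) * emit.get(t, {}).get(words[0], eps), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max((best[p][0] * trans.get(p, {}).get(t, eps)
                        * emit.get(t, {}).get(w, eps), best[p][1] + [t])
                       for p in tags)
                for t in tags}
    return max(best.values())[1]

# Toy usage with English stand-ins for the Pashto corpus:
corpus = [[("Ali", "NOUN"), ("wrote", "VERB"), ("a", "DET"), ("letter", "NOUN")]]
T, E = train_hmm(corpus)
print(viterbi(["Ali", "wrote", "a", "letter"], T, E, {"NOUN", "VERB", "DET"}))
# ['NOUN', 'VERB', 'DET', 'NOUN']
```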
4 Proposed Approach
We propose a new approach to POS tagging of the Pashto language, based on LSTM. The steps of our proposed approach are illustrated in Figure 5.
4.1 Training Input Sentences and Tags
There is no benchmark dataset available for the task of POS tagging of the Pashto language. We created a dataset for Pashto POS tagging with a focus on words having more than one tag. We developed the dataset by collecting sentences from native Pashto speakers and Pashto websites, and had these sentences checked by experts of the language. Our dataset covers 16 ambiguous words; for every ambiguous word, we collected 20 sentences, 10 for each sense.2 For example, in Figure 2 the word has two senses, one as a noun and the other as a verb. Finally, all the sentences were manually hand-tagged using two online dictionaries, "The Pashto online dictionary" [Khan 2017] and "Daryab Pashto dictionary" [Achakzai 2017]. The entire dataset consists of 320 sentences. These sentences form the training input for our model, as shown in Figure 5.
4.2 Encoding to Integers
Encoding is used to let the computer interpret the alphabet, words, and characters of a language. Different encoding schemes are in use, e.g., UTF/Unicode, the ISO-8859-x series, ASCII, and Windows-1252 [Ishida 2015].
Because the Pashto language is not fully supported by UTF encoding, and not every application and IDE supports Pashto, manual encoding is needed.
For our system, we encoded Pashto words into integers by creating a vocabulary of unique words. We created integer sentences by looking up the words of each input sentence in the vocabulary and taking their indices. To make all sentences of equal length, we padded each sentence with zeros. An example of encoding sentences to integers is given in Figure 6, which shows three Pashto sentences. The sentences vary in length and together contain 16 unique words. Each unique word is assigned a unique number, and the three sentences are encoded based on these numbers.
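A minimal sketch of this encoding step is shown below; the identifiers and the English placeholder sentences are ours, standing in for the Pashto text.

```python
# Build a vocabulary of unique words, encode sentences as integer indices,
# and zero-pad them to a common length (0 is reserved for padding).
def build_vocab(sentences):
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1   # indices start at 1
    return vocab

def encode(sentences, vocab, max_len):
    encoded = []
    for sentence in sentences:
        ids = [vocab[w] for w in sentence.split()]
        encoded.append(ids + [0] * (max_len - len(ids)))  # pad with zeros
    return encoded

sentences = ["Ali wrote a letter", "Ali read the letter"]  # stand-ins for Pashto
vocab = build_vocab(sentences)
print(encode(sentences, vocab, max_len=5))
# [[1, 2, 3, 4, 0], [1, 5, 6, 4, 0]]
```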
4.3 Training LSTM Model
Our proposed LSTM model consists of three layers: an embedding layer, a bidirectional LSTM layer, and a linear layer. The embedding layer has an input feature size of 580 and an output feature size of 300, which means that each input token is encoded into a 300-dimensional vector, also known as an embedding vector. The embedding vector is passed into the bidirectional LSTM layer with 300 input features and 150 output features per direction, so the concatenated latent vector has 300 dimensions. This vector is then passed into a linear layer with 300 input features and 16 output features, which means that we have a total of 16 tags in our tag set, including the start and stop tags.
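A sketch of this architecture in PyTorch, under our reading of the stated sizes (vocabulary 580, embedding dimension 300, 150 hidden units per LSTM direction, 16 output tags), is given below; the class and variable names are ours and not taken from the original implementation.

```python
import torch
import torch.nn as nn

class PashtoPOSTagger(nn.Module):
    """Embedding -> bidirectional LSTM -> linear layer over the tag set."""
    def __init__(self, vocab_size=580, embed_dim=300, hidden_dim=150, num_tags=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Both directions are concatenated, so the linear layer sees 2 * 150 = 300 features.
        self.linear = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):             # (batch, seq_len) integer ids
        embedded = self.embedding(token_ids)  # (batch, seq_len, 300)
        hidden, _ = self.lstm(embedded)       # (batch, seq_len, 300)
        return self.linear(hidden)            # (batch, seq_len, 16) tag scores

# Quick shape check on a dummy batch of one padded sentence of length 11:
model = PashtoPOSTagger()
scores = model(torch.randint(1, 580, (1, 11)))
print(scores.shape)  # torch.Size([1, 11, 16])
```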
We trained our model for 20 epochs using the training set and then reported the accuracy score on the test set. To get better performance, we performed multiple experiments with varying numbers of epochs and learning rates. We first tuned the learning rate by trying values ranging from 0.02 to 0.1 with a step of 0.01, and then tried numbers of epochs ranging from 200 to 1,900 with a step of 100, recording the accuracy each time. Figure 7 shows the training loss plotted against the number of epochs.
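The tuning procedure amounts to a simple grid search over the stated ranges, sketched below; train_and_evaluate is a hypothetical helper standing in for our training and evaluation code, not a function from the original implementation.

```python
# Hypothetical grid search over learning rate and number of epochs,
# recording test accuracy for each configuration.
def tune(train_and_evaluate):
    results = {}
    learning_rates = [round(0.02 + 0.01 * i, 2) for i in range(9)]  # 0.02 .. 0.10
    epoch_counts = list(range(200, 2000, 100))                      # 200 .. 1900
    for lr in learning_rates:
        for epochs in epoch_counts:
            results[(lr, epochs)] = train_and_evaluate(lr=lr, epochs=epochs)
    best = max(results, key=results.get)
    return best, results[best]
```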
Encoded input sentences are converted into a format that the LSTM understands; the LSTM understands data in a contextual format. Figure 6 shows an example of converting simple sequential data to contextual data, the data format required by the supervised learning model. There are three textual sentences in Figure 6. To convert them into numeric format, a vocabulary vector consisting of all the words occurring in the three sentences is built, where each word is assigned a unique numeric value; on the basis of that vector, the sentences are represented in numeric format. In Figure 8, the three sentences are shown in encoded format. Next, they are broken down into slices by taking the previous words plus one following word at a time; in this way, one sentence is broken down into four sub-slices. Context is handled using parentheses: inside the parentheses are the context words, and outside the parentheses is the focus word, the word under observation, e.g., [(context words), \(w_i\)]. We take a context length of up to eight words. In the next step, the embedded vectors of the contextual data are fed into the LSTM network, one word vector per cell. An example of a sentence fed into the LSTM is given in Figure 9. The LSTM updates its weights accordingly and keeps the context information until the end of the sentence, no matter how long the sentence is. At the end of the sentence, the LSTM clears its context and gradients, because we take context only at the sentence level. The same process is repeated for each sentence. During each iteration, weights are adjusted so that the distance between the target and observed values is minimized. After a particular number of iterations, we obtain a learned network, which is then used for testing/predicting unseen sentences.
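The slicing of an encoded sentence into (context, focus word) pairs can be sketched as follows (an illustrative helper of ours, assuming the maximum context length of eight words stated above).

```python
# Turn an encoded sentence into (context, focus) training pairs:
# each word becomes a focus word paired with up to the previous
# eight words as its context.
def make_context_pairs(encoded_sentence, max_context=8):
    pairs = []
    for i, focus in enumerate(encoded_sentence):
        context = encoded_sentence[max(0, i - max_context):i]
        pairs.append((tuple(context), focus))
    return pairs

print(make_context_pairs([7, 3, 12, 5]))
# [((), 7), ((7,), 3), ((7, 3), 12), ((7, 3, 12), 5)]  -> four sub-slices
```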
4.4 Testing and Decoding Output to Tags
A word vector represents the encoded form of a Pashto sentence. To predict an appropriate tag for each word, the word vector is passed to the LSTM, which returns a vector of scores for each word of the sentence. We assign to each word the tag with the maximum score in its score vector. We obtain each word and its tag as integers and then decode them back to Pashto word/tag pairs using the dictionary we created for this purpose. After obtaining predicted tags for the word vectors, we calculate accuracy by comparing the predicted tags with the actual tags of the test examples. Table 1 illustrates the decoding of score vectors, where words are assigned tags as word-tag pairs: tag T5 is assigned to word W1, tag T2 to word W2, tag T10 to word W3, and tag T3 to word W4.
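The decoding step reduces to an argmax over each word's score vector followed by a dictionary lookup; a minimal sketch with our own identifiers is given below.

```python
# Pick the highest-scoring tag for each word and map the integer ids
# back to words and tag names through the reverse vocabularies.
def decode(score_vectors, word_ids, id_to_word, id_to_tag):
    tagged = []
    for word_id, scores in zip(word_ids, score_vectors):
        best_tag = max(range(len(scores)), key=lambda t: scores[t])  # argmax
        tagged.append((id_to_word[word_id], id_to_tag[best_tag]))
    return tagged

# Toy usage: two words, three candidate tags each (stand-ins for Pashto data).
id_to_word = {1: "Ali", 2: "wrote"}
id_to_tag = {0: "NOUN", 1: "VERB", 2: "DET"}
print(decode([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [1, 2], id_to_word, id_to_tag))
# [('Ali', 'NOUN'), ('wrote', 'VERB')]
```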
4.5 Features Extraction
To speed up computation and improve performance, a technique called dimensionality reduction is used. There are two approaches to dimensionality reduction: feature extraction and feature selection. Feature extraction transforms a high-dimensional feature vector into a low-dimensional one by finding associations between features, whereas feature selection selects a subset of features that best represents the class in the classification process.
We used feature extraction to find associations between words; for feature extraction, we used word embedding. Word embedding can be learned as a standalone task or jointly with a neural network model. We used word embedding jointly with our model, which takes an input sentence as a vector of integers and transforms it into a low-dimensional vector. These vectors are then passed to the hidden layers of the neural network model.
5 Results and Discussion
Two approaches, Conditional Random Field BLSTM (CRF-BLSTM) and HMM, are used for POS tagging. In this section, we discuss their results and compare their accuracy for both single-tag and two-tag words. Most POS tagging approaches are evaluated by measuring accuracy. We calculate accuracy with the simple formula proposed by Dandapat [2009], given in Equation (2):

\(Accuracy = \frac{correctly\;tagged\;words}{total\;words\;tagged}\)  (2)

where \(correctly\;tagged\;words\) is the number of words that are correctly assigned a tag, and \(total\;words\;tagged\) is the total number of words.
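As a small sketch, the metric in Equation (2) reduces to a single ratio over the test words (our own helper, shown for completeness).

```python
# Accuracy as defined in Equation (2): fraction of words whose predicted
# tag matches the gold tag.
def accuracy(predicted_tags, gold_tags):
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)

print(accuracy(["NOUN", "VERB", "DET"], ["NOUN", "VERB", "NOUN"]))  # 0.666...
```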
5.1 Training and Testing Datasets
To split our data into training and testing sets, we chose two sentences for each ambiguous word: one sentence containing the ambiguous word with one kind of tag and one sentence with the other kind of tag, e.g., one sentence for the noun sense and one for the verb sense. We separated 32 sentences, which sum up to 352 words, as the testing set; the remaining 288 sentences, which sum up to 3,168 words, are used as the training set.
In addition to the split described above, we applied K-fold cross-validation to our data with \(k=10\): we created 10 slices of the data and iterated through them, in each iteration picking one slice for testing and the remaining 9 slices for training.
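This can be done, for example, with scikit-learn's KFold splitter; the sketch below assumes the 320 encoded sentences are held in a list called sentences.

```python
from sklearn.model_selection import KFold

# 10-fold cross-validation: each sentence appears in the test slice exactly once.
sentences = list(range(320))           # placeholder for the 320 encoded sentences
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(sentences)):
    train = [sentences[i] for i in train_idx]   # 288 sentences per fold
    test = [sentences[i] for i in test_idx]     # 32 sentences per fold
    # train the tagger on `train`, evaluate on `test`, and record accuracy
    print(f"fold {fold}: {len(train)} train / {len(test)} test sentences")
```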
5.2 Results of Conditional Random Field BLSTM
We also applied Conditional Random Field BLSTM (CRF-BLSTM) to POS tagging of Pashto. We trained and tested the system with the same dataset used for LSTM and BLSTM, keeping a context size of eight words. For CRF-BLSTM, we also calculated accuracy separately for bi-tag words as well as for all words. The results obtained with CRF-BLSTM are given in Table 2.
5.3 Results of HMM Approach
We also applied HMM, a statistical model that works on the frequencies/probabilities of words, to POS tagging of the Pashto language. We trained and tested the HMM on the same dataset used for LSTM, BLSTM, and CRF-BLSTM. A context size of three words was used, because the HMM is a trigram tagger; if we increase the context size, the computational cost increases exponentially. For the HMM, we also measured accuracy separately for all words as well as for bi-tag words. Table 3 shows the results obtained after applying the HMM.
5.4 Examples of Ambiguous Words Correctly Predicted by Our Model
Figure 10 shows nine examples of sentences with ambiguous words (highlighted in purple). For instance, in sentence 1 the highlighted word is a noun and the model predicted it as a noun, whereas in sentence 2 the same word is used as a verb and the model also predicted it as a verb. A similar situation can be observed in sentences 3 and 4: the highlighted word is used as an adjective in sentence 3, while in sentence 4 it is used as a noun, and our model predicted it correctly in both places. This shows that our model is trained well on the ambiguous words and their tags and predicts them correctly.
5.5 Discussion
We applied CRF-BLSTM and the statistical HMM. We conclude that, for Pashto POS tagging, the CRF-BLSTM-based approach performs better than the statistical HMM approach (as shown in Table 4). The key factor behind the performance of deep learning-based approaches is that they keep contextual information across long sequences: they retain information in the sequence until explicitly directed to forget it, whereas the HMM keeps context only up to a limited length (three words). Moreover, for bi-tag words the HMM fails to decide which tag to assign if each possible tag appears with equal probability in the training set.
5.6 Limitations of the Study
Our current approach is a supervised learning approach that relies on annotated data. The performance of supervised learning models heavily depends on the amount of training data. We therefore need a large amount of annotated data for a low-resource language such as Pashto, but such annotated data is not available for Pashto. This is a limitation of our study.
For the task of POS tagging, there is no corpus readily available for the Pashto language that contains ambiguous words and their associated tags. Ambiguous words in Pashto may take more than one tag in different contexts. Better training of the model on ambiguous words requires a large number of such examples, but because large datasets are unavailable for Pashto, this is not currently possible.
6 Conclusion and Future Work
For the task of Pashto tagging, we created a corpus of words annotated with their grammatical tags, which contains ambiguous words and their associated tags. We proposed an approach based on CRF-BLSTM for Pashto POS tagging, which keeps contextual information throughout the input sentence. We performed experiments on our Pashto dataset, applied CRF-BLSTM to POS tagging of Pashto, and compared the results with a statistical approach, the HMM. Our experiments show that increasing the HMM context size beyond three causes an exponential increase in computational cost. From the results, we conclude that the CRF-BLSTM approach performs better than the statistical HMM approach in terms of accuracy and computational cost.
The current work addresses the problem of ambiguous word tagging for the Pashto language. This study will be beneficial in improving basic NLP tasks, such as part-of-speech tagging, and will also help improve advanced NLP tasks for the Pashto language, such as machine translation, text summarization, and text simplification.
In future work, we can extend this work by increasing the level of ambiguity, changing the tag set, and enlarging the dataset; the number of sentences and examples for each word can also be increased to further improve the results.