1 Introduction
Natural Language Processing (NLP) deals with understanding and generating human language by automated or semi-automated means. In natural language understanding, a machine takes sentences as input and produces their structural representation. In generation, a machine generates sentences of the language using the structure learned in the understanding phase. NLP also deals with machine translation, where the machine takes an input sentence in one language, called the source language, understands it, and then generates a sentence in another language (the target language) as output. Nowadays, information is available over the web in many languages; to make it available in other languages for information sharing and exchange, machine translation is needed.
Part-of-speech (POS) tagging is the process of assigning a lexical label/category to each word in a sentence of a language. For example, given the sentence "Ali wrote a letter," POS tagging will assign the label noun to "Ali," verb to "wrote," determiner to "a," and noun to "letter." POS tagging is an important task in machine translation. A POS tagger is software that reads text in a language, divides it into sentences and words, and assigns a part of speech to each word in a sentence [Toutanova et al. 2003]. The performance of machine translation depends on POS tagging. During translation, it is necessary to understand the structure of a sentence so that, for example, a word tagged "verb" in the source language has an equivalent word or group of words of the same category (verb) in the target language. To improve the performance of machine translators, researchers have explored POS tagging [Khan et al. 2020; Advaith et al. 2022; Dwivedi and Malakar 2015]. Several approaches are used for POS tagging, e.g., rule-based approaches [Vaishali et al. 2022; Li et al. 2021; Dwivedi and Malakar 2015], statistical approaches [Yajnik and Prajapati 2017; Ye et al. 2016; Spitkovsky et al. 2011], hybrid approaches [Izzi and Ferilli 2020; Farrah et al. 2018; Dwivedi and Malakar 2015], maximum entropy-based approaches [Zhao and Wang 2002; Pattnaik and Nayak 2022; Ekbal et al. 2008; Rico-Sulayes et al. 2017], and deep learning-based approaches [Hirpassa and Lehal 2023; Warjri et al. 2021; Pathak et al. 2022; Collobert et al. 2011; Ma and Hovy 2016].

Pashto, also known as Pukhtoo, is a language spoken in Pakistan and Afghanistan. More than 37 million people in Pakistan and Afghanistan speak Pashto as their mother tongue. Furthermore, relatively small Pashto-speaking communities exist in Iran, India, Tajikistan, the United Arab Emirates, and the United Kingdom. In 1936, Pashto became the first official language and Persian the second official language of Afghanistan [Tegey and Robson 1996].
Our focus in this article is on POS tagging for the Pashto language, with special attention to words that have more than one meaning in a sentence. Example sentences containing such words, along with their translations produced by Google Translate, are given in Figure 1. Consider the two sentences: the word "saw" has two meanings (verb and noun), yet in translation it is treated only one way, as a verb, in both sentences. The main contributions of our research work are
(i) Annotated Corpus of Pashto Language: To find the appropriate tag for a word having more than one meaning, we require a corpus that contains such words and their tags. No such corpus, containing words with more than one tag, is available for Pashto. We therefore created an annotated corpus of the Pashto language that contains words with more than one tag. The corpus was created by collecting sentences from Pashto Grammar [Tegey and Robson 1996], Pashto websites,1 and Pashto dictionaries [Achakzai 2017; Khan 2017];
(ii) Deep Learning Approach to POS Tagging: The context of a word plays an important role in determining its correct meaning/tag, specifically when the word has more than one tag associated with it. Deep learning approaches have shown promising results in NLP tasks such as text processing, sequence-to-sequence learning, Named Entity Recognition, Semantic Role Labeling, Machine Translation, and POS tagging. We applied a deep learning-based approach to POS tagging of the Pashto language. For this, we did extensive work to find appropriate representations of words/sentences and an appropriate deep learning architecture; and
(iii) Empirical Validation of POS Tagging Approaches: We conducted experiments to validate our deep learning-based POS tagging approach. Our experiments show that deep learning-based approaches give better results than statistical approaches for Pashto POS tagging.
The rest of this article is organized as follows. Section 2 discusses related work, and Section 3 provides background on deep learning in NLP and the Hidden Markov Model. Section 4 presents our proposed approach and dataset, and Section 5 presents results and evaluation. Finally, Section 6 concludes our work.
3 Deep Learning in NLP
Deep learning is a subfield of machine learning that deals with learning from data at multiple levels of abstraction. Deep learning has solved, with remarkable results, problems of Artificial Intelligence that were considered unsolvable by traditional AI techniques [LeCun et al. 2015; Wang et al. 2017; Zheng et al. 2013]. The ability of deep learning to model complex tasks, such as object detection in images, Named Entity Recognition, Semantic Role Labeling, and Machine Translation, has played a key role in the success of AI.
Deep learning also plays a key role in Natural Language Processing. Researchers are applying deep learning to NLP tasks such as translation, sentence modeling, sentiment analysis, POS tagging, speech recognition, and many other applications [Rawat et al. 2022; De Mulder et al. 2015; Du and Shanker 2009].
Deep learning models learn by finding hierarchical or sequential patterns in the input data [Alzubaidi et al. 2021]. For example, in an image a square object can be formed by connecting four lines of equal length, each line can be formed by adding multiple points depending on its length, and a point can be formed from multiple pixels depending on its size. In contrast, a document is made up of multiple paragraphs arranged in a sequence, a paragraph is formed by adding multiple coherent sentences in a sequence, and a sentence is formed by combining words in a specific sequence; changing that sequence of words can make the sentence meaningless. For hierarchical tasks, convolutional neural networks (CNNs) usually perform well, whereas for sequential learning recurrent neural networks (RNNs) perform well. An RNN learns from its input over time and updates its weights on the basis of the gradient. The gradient gives the direction toward the expected target; if the gradient becomes zero or tends to zero, learning halts before the target/expected output is achieved. An illustration is given in Figure 3: at the first input, the network learns a feature, represented by a dark shadow in the cell; over the next few time steps it decays; and by time step 5 there is no learned information and the cell is completely white [Graves 2012].
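To make the decay concrete, the following is a minimal numeric sketch (an illustrative example of ours, not part of the original study): a gradient that is multiplied by a recurrent factor smaller than one at every time step shrinks toward zero as it is propagated back through the sequence, which is exactly the point at which learning halts.

```python
# Illustrative sketch of the vanishing gradient effect in a simple RNN:
# the gradient reaching earlier time steps is (roughly) a product of
# per-step factors; if those factors are < 1, the product decays to ~0.
def backpropagated_gradient(step_factor: float, num_steps: int) -> float:
    grad = 1.0
    for _ in range(num_steps):
        grad *= step_factor  # one recurrent multiplication per time step
    return grad

for steps in (1, 5, 10, 50):
    print(steps, backpropagated_gradient(0.5, steps))
# 1 0.5, 5 0.03125, 10 ~0.00098, 50 ~8.9e-16 -> effectively no learning signal
```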
Long Short-Term Memory (LSTM) networks are a kind of RNN introduced to overcome the vanishing gradient problem of RNNs. An LSTM maintains a hidden layer of memory cells that store information for a long time. Each cell provides three basic operations (read, write, and erase) through three gates (input gate, output gate, and forget gate). The LSTM reads information through the input gate, learns features from it, and produces output through the output gate. The learned representation of the input propagates through the network until the end of the input sequence. At each time step, the network takes an input, combines it with previous information, and produces output. It can also forget information at any stage if it is no longer needed. In this way, the LSTM does not suffer from the vanishing gradient problem and keeps information for a long time, so long contexts can be captured [Graves 2012]. The information flow and gate mechanism of the LSTM are highlighted in Figure 4, where "o" represents an open gate and "-" represents a closed gate. The network learns from the input sequence over time. At time \(step_1\) the network takes an input, learns from it, and keeps it for future use; at time \(step_4\) and time \(step_6\) the network outputs the learned information while still keeping it for future use; at time \(step_7\) the network allows the information to be overwritten. The LSTM thus stores information and carries it over long distances [Graves 2012].
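For reference, the gate mechanism described above is usually written with the standard LSTM equations (the textbook formulation, not notation introduced in this article), where \(x_t\) is the input, \(h_t\) the hidden state, \(c_t\) the memory cell, \(\sigma\) the logistic sigmoid, \(\odot\) element-wise multiplication, and \(i_t\), \(f_t\), \(o_t\) the input, forget, and output gates:

\(i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)\) (input gate: write),
\(f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)\) (forget gate: erase),
\(o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)\) (output gate: read),
\(\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)\) (candidate cell content),
\(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\) (cell state update),
\(h_t = o_t \odot \tanh(c_t)\) (hidden state/output).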
3.1 Hidden Markov Model
The Hidden Markov Model is a probabilistic, statistical model based on the Markov property, which states that the future depends only on the present, not on the past. At a particular state of the model, we can only consider the likelihood of the next state (a transition, or forward jump), and the events that may occur at that state. The Hidden Markov Model uses transition probabilities \(P_{T}\) to account for the co-occurrence of words and to capture context, and emission probabilities \(P_{E}\) to account for the chance of an event occurring given the present state [Schönhuth 2009]. For the POS tagging problem, consider the words of a sentence as \((w_1 \ldots w_n)\) and the tags as \((t_1 \ldots t_n)\). More formally, we can define a Markov model for POS tagging as follows [Charniak et al. 1993]:

\(P(t_1 \ldots t_n | w_1 \ldots w_n) \approx \prod_{i=1}^{n} P(t_i | t_{i-1})\, P(w_i | t_i)\)  (1)
In Equation (1), the first factor of the product, \(P(t_i | t_{i-1})\), is the probability of tag \(t_i\) given tag \(t_{i-1}\), also known as the transition probability, and the second factor, \(P(w_i| t_i)\), is the probability of word \(w_i\) given tag \(t_i\), also known as the emission probability.
The HMM is a well-known model for tagging sequential data, such as in part-of-speech tagging. Various HMM-based approaches have been proposed for POS tagging, but none of them exploits HMM for POS tagging of the Pashto language. In this work, we tailored the implementation of Rudd [2009] for Pashto.
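To illustrate how the transition and emission probabilities in Equation (1) can be estimated from a tagged corpus and used to decode a tag sequence, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding (our own simplified example with English stand-in data, not the Rudd [2009] implementation used in this work).

```python
from collections import defaultdict

# Minimal bigram-HMM tagger sketch: transition probabilities P(t_i | t_{i-1})
# and emission probabilities P(w_i | t_i) are estimated by counting over a
# tagged corpus, and the best tag sequence is decoded with the Viterbi algorithm.

def normalize(counts):
    """Turn nested count dictionaries into conditional probabilities."""
    return {ctx: {x: c / sum(d.values()) for x, c in d.items()}
            for ctx, d in counts.items()}

def train_hmm(tagged_sentences):
    trans = defaultdict(lambda: defaultdict(int))   # P_T counts
    emit = defaultdict(lambda: defaultdict(int))    # P_E counts
    for sentence in tagged_sentences:               # sentence = [(word, tag), ...]
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return normalize(trans), normalize(emit)

def viterbi(words, trans, emit, tags, eps=1e-12):
    # eps is a crude stand-in for smoothing of unseen transitions/emissions.
    best = {t: (trans.get("<s>", {}).get(t, eps) * emit.get(t, {}).get(words[0], eps), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max((best[p][0] * trans.get(p, {}).get(t, eps)
                        * emit.get(t, {}).get(w, eps), best[p][1] + [t])
                       for p in tags)
                for t in tags}
    return max(best.values())[1]

# Toy usage with English stand-ins for the Pashto corpus:
corpus = [[("Ali", "NOUN"), ("wrote", "VERB"), ("a", "DET"), ("letter", "NOUN")]]
T, E = train_hmm(corpus)
print(viterbi(["Ali", "wrote", "a", "letter"], T, E, {"NOUN", "VERB", "DET"}))
# ['NOUN', 'VERB', 'DET', 'NOUN']
```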
4 Proposed Approach
We propose a new approach to POS tagging of the Pashto language, based on LSTM. The steps of our proposed approach are illustrated in Figure 5.
4.1 Training Input Sentences and Tags
There is no benchmark dataset available for the task of POS tagging of the Pashto language. We created a dataset for Pashto POS tagging with a focus on words having more than one tag. We developed the dataset by collecting sentences from native Pashto speakers and Pashto websites, and had these sentences checked by experts of the language. Our dataset covers 16 ambiguous words; for every ambiguous word, we collected 20 sentences, 10 for each sense.2 For example, in Figure 2 the word has two senses, one as a noun and the other as a verb. Finally, all the sentences were manually hand-tagged using two online dictionaries, "The Pashto online dictionary" [Khan 2017] and "Daryab Pashto dictionary" [Achakzai 2017]. The entire dataset consists of 320 sentences. These sentences form the training input for our model, as shown in Figure 5.
4.2 Encoding to Integers
Encoding is used to let the computer interpret the alphabet, words, and characters of a language. Different encoding schemes are in use, e.g., UTF/Unicode, the ISO-8859-x series, ASCII, and Windows-1252 [Ishida 2015].
Because the Pashto language is not fully supported by UTF encoding, and not every application and IDE supports Pashto, manual encoding is needed.
For our system, we encoded Pashto words into integers by creating a vocabulary of unique words. We created integer sentences by looking up the words of each input sentence in the vocabulary and taking their indices. To make all sentences of equal length, we padded each sentence with zeros. An example of encoding sentences to integers is given in Figure 6, which shows three Pashto sentences. The sentences vary in length and together contain 16 unique words. Each unique word is assigned a unique number, and the three sentences are encoded based on these numbers.
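A minimal sketch of this encoding step is shown below; the identifiers and the English placeholder sentences are ours, standing in for the Pashto text.

```python
# Build a vocabulary of unique words, encode sentences as integer indices,
# and zero-pad them to a common length (0 is reserved for padding).
def build_vocab(sentences):
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1   # indices start at 1
    return vocab

def encode(sentences, vocab, max_len):
    encoded = []
    for sentence in sentences:
        ids = [vocab[w] for w in sentence.split()]
        encoded.append(ids + [0] * (max_len - len(ids)))  # pad with zeros
    return encoded

sentences = ["Ali wrote a letter", "Ali read the letter"]  # stand-ins for Pashto
vocab = build_vocab(sentences)
print(encode(sentences, vocab, max_len=5))
# [[1, 2, 3, 4, 0], [1, 5, 6, 4, 0]]
```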
4.3 Training LSTM Model
Our proposed LSTM model consists of three layers: an embedding layer, a bidirectional LSTM layer, and a linear layer. The embedding layer has an input feature size of 580 and an output feature size of 300, which means that each input token is encoded into a 300-dimensional vector, also known as an embedding vector. The embedding vector is passed into the bidirectional LSTM layer with 300 input features and 150 output features per direction, so the concatenated latent vector has 300 dimensions. This vector is then passed into a linear layer with 300 input features and 16 output features, which means that we have a total of 16 tags in our tag set, including the start and stop tags.
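A sketch of this architecture in PyTorch, under our reading of the stated sizes (vocabulary 580, embedding dimension 300, 150 hidden units per LSTM direction, 16 output tags), is given below; the class and variable names are ours and not taken from the original implementation.

```python
import torch
import torch.nn as nn

class PashtoPOSTagger(nn.Module):
    """Embedding -> bidirectional LSTM -> linear layer over the tag set."""
    def __init__(self, vocab_size=580, embed_dim=300, hidden_dim=150, num_tags=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Both directions are concatenated, so the linear layer sees 2 * 150 = 300 features.
        self.linear = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):             # (batch, seq_len) integer ids
        embedded = self.embedding(token_ids)  # (batch, seq_len, 300)
        hidden, _ = self.lstm(embedded)       # (batch, seq_len, 300)
        return self.linear(hidden)            # (batch, seq_len, 16) tag scores

# Quick shape check on a dummy batch of one padded sentence of length 11:
model = PashtoPOSTagger()
scores = model(torch.randint(1, 580, (1, 11)))
print(scores.shape)  # torch.Size([1, 11, 16])
```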
We trained our model for 20 epochs using the training set and then reported the accuracy score on the test set. To get better performance, we performed multiple experiments with varying numbers of epochs and learning rates. We first tuned the learning rate by trying values ranging from 0.02 to 0.1 with a step of 0.01, and then tried numbers of epochs ranging from 200 to 1,900 with a step of 100, recording the accuracy each time. Figure 7 shows the training loss plotted against the number of epochs.
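The tuning procedure amounts to a simple grid search over the stated ranges, sketched below; train_and_evaluate is a hypothetical helper standing in for our training and evaluation code, not a function from the original implementation.

```python
# Hypothetical grid search over learning rate and number of epochs,
# recording test accuracy for each configuration.
def tune(train_and_evaluate):
    results = {}
    learning_rates = [round(0.02 + 0.01 * i, 2) for i in range(9)]  # 0.02 .. 0.10
    epoch_counts = list(range(200, 2000, 100))                      # 200 .. 1900
    for lr in learning_rates:
        for epochs in epoch_counts:
            results[(lr, epochs)] = train_and_evaluate(lr=lr, epochs=epochs)
    best = max(results, key=results.get)
    return best, results[best]
```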
Encoded input sentences are converted into a format that the LSTM understands; the LSTM understands data in a contextual format. Figure 6 shows an example of converting simple sequential data to contextual data, the data format required by the supervised learning model. There are three textual sentences in Figure 6. To convert them into numeric format, a vocabulary vector consisting of all the words occurring in the three sentences is built, where each word is assigned a unique numeric value; on the basis of that vector, the sentences are represented in numeric format. In Figure 8, the three sentences are shown in encoded format. Next, they are broken down into slices by taking the previous words plus one following word at a time; in this way, one sentence is broken down into four sub-slices. Context is handled using parentheses: inside the parentheses are the context words, and outside the parentheses is the focus word, the word under observation, e.g., [(context words), \(w_i\)]. We take a context length of up to eight words. In the next step, the embedded vectors of the contextual data are fed into the LSTM network, one word vector per cell. An example of a sentence fed into the LSTM is given in Figure 9. The LSTM updates its weights accordingly and keeps the context information until the end of the sentence, no matter how long the sentence is. At the end of the sentence, the LSTM clears its context and gradients, because we take context only at the sentence level. The same process is repeated for each sentence. During each iteration, weights are adjusted so that the distance between the target and observed values is minimized. After a particular number of iterations, we obtain a learned network, which is then used for testing/predicting unseen sentences.
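The slicing of an encoded sentence into (context, focus word) pairs can be sketched as follows (an illustrative helper of ours, assuming the maximum context length of eight words stated above).

```python
# Turn an encoded sentence into (context, focus) training pairs:
# each word becomes a focus word paired with up to the previous
# eight words as its context.
def make_context_pairs(encoded_sentence, max_context=8):
    pairs = []
    for i, focus in enumerate(encoded_sentence):
        context = encoded_sentence[max(0, i - max_context):i]
        pairs.append((tuple(context), focus))
    return pairs

print(make_context_pairs([7, 3, 12, 5]))
# [((), 7), ((7,), 3), ((7, 3), 12), ((7, 3, 12), 5)]  -> four sub-slices
```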
4.4 Testing and Decoding Output to Tags
A word vector represents the encoded form of a Pashto sentence. To predict an appropriate tag for each word, the word vector is passed to the LSTM, which returns a vector of scores for each word of the sentence. We assign to each word the tag with the maximum score in its score vector. We obtain each word and its tag as integers and then decode them back to Pashto word/tag pairs using the dictionary we created for this purpose. After obtaining predicted tags for the word vectors, we calculate accuracy by comparing the predicted tags with the actual tags of the test examples. Table 1 illustrates the decoding of score vectors, where words are assigned tags as word-tag pairs: tag T5 is assigned to word W1, tag T2 to word W2, tag T10 to word W3, and tag T3 to word W4.
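The decoding step reduces to an argmax over each word's score vector followed by a dictionary lookup; a minimal sketch with our own identifiers is given below.

```python
# Pick the highest-scoring tag for each word and map the integer ids
# back to words and tag names through the reverse vocabularies.
def decode(score_vectors, word_ids, id_to_word, id_to_tag):
    tagged = []
    for word_id, scores in zip(word_ids, score_vectors):
        best_tag = max(range(len(scores)), key=lambda t: scores[t])  # argmax
        tagged.append((id_to_word[word_id], id_to_tag[best_tag]))
    return tagged

# Toy usage: two words, three candidate tags each (stand-ins for Pashto data).
id_to_word = {1: "Ali", 2: "wrote"}
id_to_tag = {0: "NOUN", 1: "VERB", 2: "DET"}
print(decode([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [1, 2], id_to_word, id_to_tag))
# [('Ali', 'NOUN'), ('wrote', 'VERB')]
```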
4.5 Features Extraction
To speed up computation and improve performance, a technique called dimensionality reduction is used. There are two approaches to dimensionality reduction: feature extraction and feature selection. Feature extraction transforms a high-dimensional feature vector into a low-dimensional one by finding associations between features, whereas feature selection selects a subset of features that best represents the class in the classification process.
We used feature extraction to find associations between words; for feature extraction, we used word embedding. Word embedding can be learned as a standalone task or jointly with a neural network model. We used word embedding jointly with our model, which takes an input sentence as a vector of integers and transforms it into a low-dimensional vector. These vectors are then passed to the hidden layers of the neural network model.
5 Results and Discussion
Two approaches, Conditional Random Field BLSTM (CRF-BLSTM) and HMM, are used for POS tagging. In this section, we discuss their results and compare their accuracy for both single-tag and two-tag words. Most POS tagging approaches are evaluated by measuring accuracy. We calculate accuracy with the simple formula proposed by Dandapat [2009], given in Equation (2):

\(Accuracy = \frac{correctly\;tagged\;words}{total\;words\;tagged}\)  (2)

where \(correctly\;tagged\;words\) is the number of words that are correctly assigned a tag, and \(total\;words\;tagged\) is the total number of words.
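As a small sketch, the metric in Equation (2) reduces to a single ratio over the test words (our own helper, shown for completeness).

```python
# Accuracy as defined in Equation (2): fraction of words whose predicted
# tag matches the gold tag.
def accuracy(predicted_tags, gold_tags):
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)

print(accuracy(["NOUN", "VERB", "DET"], ["NOUN", "VERB", "NOUN"]))  # 0.666...
```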
5.1 Training and Testing Datasets
To split our data into training and testing sets, we chose two sentences for each ambiguous word: one sentence containing the ambiguous word with one kind of tag and one sentence with the other kind of tag, e.g., one sentence for the noun sense and one for the verb sense. We separated 32 sentences, which sum up to 352 words, as the testing set; the remaining 288 sentences, which sum up to 3,168 words, are used as the training set.
In addition to the split described above, we applied K-fold cross-validation to our data with \(k=10\): we created 10 slices of the data and iterated through them, in each iteration picking one slice for testing and the remaining 9 slices for training.
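This can be done, for example, with scikit-learn's KFold splitter; the sketch below assumes the 320 encoded sentences are held in a list called sentences.

```python
from sklearn.model_selection import KFold

# 10-fold cross-validation: each sentence appears in the test slice exactly once.
sentences = list(range(320))           # placeholder for the 320 encoded sentences
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(sentences)):
    train = [sentences[i] for i in train_idx]   # 288 sentences per fold
    test = [sentences[i] for i in test_idx]     # 32 sentences per fold
    # train the tagger on `train`, evaluate on `test`, and record accuracy
    print(f"fold {fold}: {len(train)} train / {len(test)} test sentences")
```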
5.2 Results of Conditional Random Field BLSTM
We also applied Conditional Random Field BLSTM (CRF-BLSTM) to POS tagging of Pashto. We trained and tested the system with the same dataset used for LSTM and BLSTM, keeping a context size of eight words. For CRF-BLSTM, we also calculated accuracy separately for bi-tag words as well as for all words. The results obtained with CRF-BLSTM are given in Table 2.
5.3 Results of HMM Approach
We also applied HMM, a statistical model that works on the frequencies/probabilities of words, to POS tagging of the Pashto language. We trained and tested the HMM on the same dataset used for LSTM, BLSTM, and CRF-BLSTM. A context size of three words was used, because the HMM is a trigram tagger; if we increase the context size, the computational cost increases exponentially. For the HMM, we also measured accuracy separately for all words as well as for bi-tag words. Table 3 shows the results obtained after applying the HMM.
5.4 Examples of Ambiguous Words Correctly Predicted by Our Model
Figure 10 shows nine examples of sentences with ambiguous words (highlighted in purple). For instance, in sentence 1 the highlighted word is a noun and the model predicted it as a noun, whereas in sentence 2 the same word is used as a verb and the model also predicted it as a verb. A similar situation can be observed in sentences 3 and 4: the highlighted word is used as an adjective in sentence 3, while in sentence 4 it is used as a noun, and our model predicted it correctly in both places. This shows that our model is trained well on the ambiguous words and their tags and predicts them correctly.
5.5 Discussion
We applied CRF-BLSTM and the statistical HMM. We conclude that, for Pashto POS tagging, the CRF-BLSTM-based approach performs better than the statistical HMM approach (as shown in Table 4). The key factor behind the performance of deep learning-based approaches is that they keep contextual information across long sequences: they retain information in the sequence until explicitly directed to forget it, whereas the HMM keeps context only up to a limited length (three words). Moreover, for bi-tag words the HMM fails to decide which tag to assign if each possible tag appears with equal probability in the training set.
5.6 Limitations of the Study
Our current approach is a supervised learning approach that relies on annotated data. The performance of supervised learning models heavily depends on the amount of training data. We therefore need a large amount of annotated data for a low-resource language such as Pashto, but such annotated data is not available for Pashto. This is a limitation of our study.
For the task of POS tagging, there is no corpus readily available for the Pashto language that contains ambiguous words and their associated tags. Ambiguous words in Pashto may take more than one tag in different contexts. Better training of the model on ambiguous words requires a large number of such examples, but because large datasets are unavailable for Pashto, this is not currently possible.
6 Conclusion and Future Work
For the task of Pashto tagging, we created a corpus of words annotated with their grammatical tags, which contains ambiguous words and their associated tags. We proposed an approach based on CRF-BLSTM for Pashto POS tagging, which keeps contextual information throughout the input sentence. We performed experiments on our Pashto dataset, applied CRF-BLSTM to POS tagging of Pashto, and compared the results with a statistical approach, the HMM. Our experiments show that increasing the HMM context size beyond three causes an exponential increase in computational cost. From the results, we conclude that the CRF-BLSTM approach performs better than the statistical HMM approach in terms of accuracy and computational cost.
The current work addresses the problem of ambiguous word tagging for the Pashto language. This study will be beneficial in improving basic NLP tasks, such as part-of-speech tagging, and will also help improve advanced NLP tasks for the Pashto language, such as machine translation, text summarization, and text simplification.
In future work, we can extend this work by increasing the level of ambiguity, changing the tag set, and enlarging the dataset; the number of sentences and examples for each word can also be increased to further improve the results.