
Bidirectional Long Short-Term Memory for Automatic English to Kannada Back-Transliteration

B. S. Sowmya Lakshmi and B. R. Shambhavi
Department of ISE, BMS College of Engineering, Bangalore 560019, India
e-mail: sowmyalakshmibs.ise@bmsce.ac.in, shambhavibr.ise@bmsce.ac.in

Abstract Transliteration is a key component in various Natural Language Processing (NLP) tasks. Transliteration is the process of converting one orthographic system to another. This paper demonstrates transliteration of Romanized Kannada words to Kannada script. Our system utilizes a bilingual corpus of around one lakh words, comprising pairs of a Romanized Kannada word and its corresponding word in Kannada script, and employs orthographic and phonetic information. Recurrent Neural Networks (RNNs) are a widely used neural network model for text and speech processing, as they better predict the next word based on past information. Long Short-Term Memory (LSTM) networks are a special kind of RNN that handles long-term dependencies. A character-level Bidirectional Long Short-Term Memory (BLSTM) paradigm, which drives down the perplexity with respect to a word-level paradigm, has been employed. Knowledge of characters uncovers structural (dis)similarities among words, thus refining the modeling of uncommon and unknown words. Test data of 3000 Romanized Kannada words is used for model evaluation, and we obtained an accuracy of 83.32%.

Keywords Transliteration · Bilingual corpus · RNN · LSTM

1 Introduction

Transliteration is the task of mapping graphemes or phonemes of one language into phoneme approximations of another language. It has various applications in the domain of NLP, such as Machine Translation (MT), Cross Language Information Retrieval (CLIR), and information extraction. Even though the task appears trivial, prediction of the pronunciation of the original word is a crucial factor in the transliteration process. The transliteration effort between two languages is minimal if they share an identical alphabet set.


However, for languages that use nonidentical alphabet sets, words have to be transliterated or portrayed in the native-language alphabet.
The majority of multilingual web users have a tendency to represent their native languages in Roman script on social media platforms. In spite of many recognized transliteration standards, there is a strong inclination to use unofficial transliteration conventions on many websites, social media platforms, and blog sites. There are numerous issues, such as spelling variation, diphthongs, doubled letters, and recurring constructions, which have to be taken care of while transcribing.
Neural networks are a rapidly advancing approach to machine learning [1, 2] and have shown promising performance when applied to a variety of tasks such as image recognition, speech processing, natural language processing, and cognitive modeling. The approach involves using neural networks to train a model for a specific task. This paper demonstrates the application of neural networks to machine transliteration between English and Kannada, two linguistically distant and widely spoken languages.
The rest of this paper is arranged as follows. Section 2 describes prior work in this area. An introduction to LSTM and BLSTM networks is given in Sect. 3. The methodology adopted to build the corpus is presented in Sect. 4. The proposed transliteration network and experimental setup are portrayed in Sect. 5. Section 6 provides details of the results obtained. Section 7 communicates the conclusion and future work of the proposed method.

2 Previous Research

Research on Indic languages in the context of social media is quite ample, with numerous studies concentrating on code-switching, which has become a familiar phenomenon. There are a few substantial works on transliteration or, more precisely, back-transliteration of Indic languages [3, 4]. A shared task which included back-transliteration of Romanized Indic-language words to their native scripts was run in 2014 [5, 6]. In many areas, including machine transliteration, end-to-end deep learning models have become a good alternative to more traditional statistical approaches. A Deep Belief Network (DBN) was developed to transliterate from English to Tamil with a restricted corpus [7] and obtained an accuracy of 79.46%. A character-level attention-based encoder in deep learning was proposed to develop a transliteration model for English–Persian [8]. The model performed well, with a BLEU score of 76.4.
In [9], the authors proposed transliteration of English to Malayalam using phonemes. An English–Malayalam pronunciation dictionary was used to map English graphemes to Malayalam phonemes. Performance of the model was fairly good for phonemes in the pronunciation dictionary. However, it suffered from the out-of-vocabulary issue when a word was not in the pronunciation dictionary.
The most essential requisite of a transliterator is to retain the phonetic structure of the source language after transliterating into the target language. Different transliteration techniques for Indian languages have been proposed. In [10], input text was split into phonemes and classified using the Support Vector Machine (SVM) algorithm. Most of the methods adopted features like n-grams [11], Unicode mapping [12], or a combination-based approach combining phoneme extraction and n-grams [13, 14].
Named Entity (NE) transliteration techniques from English to Kannada have been proposed in [15–17]. In [15, 16], the authors adopted a statistical approach using widely available tools such as Moses and GIZA++, which yielded an accuracy of about 89.27% for English names. The system was also evaluated by comparing it with the Google transliterator. A training corpus of 40,000 Named Entities was used to train an SVM algorithm [17], which obtained an accuracy of 87% on a 1000-word test dataset.

3 LSTM and BLSTM Network

RNNs have been employed to produce promising results on a variety of tasks, including language modeling [18] and speech recognition. An RNN predicts the present output based on the preserved memories of its past information. RNNs are designed for capturing information from sequences or time-series data.
An RNN [19] comprises an input layer, a hidden layer, and an output layer, where each cell preserves a memory of the previous time step. Figure 1 demonstrates a simple RNN model, where X_0, X_1, X_2 are the inputs at timestamps t_0, t_1, t_2 and h_0, h_1, h_2 are the hidden-layer units.

Fig. 1 A simple RNN model


The new state h_t of the RNN at time t is a function of its previous state h_{t-1} at time t − 1 and the input x_t at time t. The output y_t of the hidden-layer units at time t is calculated using this new state and the weight matrix. The mathematics behind the RNN's hidden- and output-layer computations is as follows:
 
h^{(t)} = g_h(W_i X^{(t)} + W_R h^{(t-1)} + b_h)    (1)

Y^{(t)} = g_y(W_Y h^{(t)} + b_y)    (2)

where W_Y, W_R, and W_i are weight matrices to be learned during the training phase, g_h and g_y are activation functions computed using Eqs. (3) and (4) respectively, and b_h and b_y are biases. The RNN uses the backpropagation algorithm, but applied at every time step; this is commonly known as Backpropagation Through Time (BPTT). The dimensionality of the output layer is the same as the number of labels, and the output characterizes the likelihood distribution over the labels at time t.

g_h(z) = 1 / (1 + e^{-z})    (3)

g_y(z_m) = e^{z_m} / \sum_k e^{z_k}    (4)

where g_h is the sigmoid function and g_y is the softmax activation function, which maps the input to the output nonlinearly.
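As a concrete illustration of Eqs. (1)–(4), the following is a minimal NumPy sketch of one recurrent step. The dimensions and random weights here are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def sigmoid(z):
    # Eq. (3): logistic sigmoid activation for the hidden layer
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Eq. (4): softmax activation for the output layer
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, h_prev, W_i, W_R, W_Y, b_h, b_y):
    # Eq. (1): new hidden state from the current input and the previous state
    h_t = sigmoid(W_i @ x_t + W_R @ h_prev + b_h)
    # Eq. (2): likelihood distribution over labels at time t
    y_t = softmax(W_Y @ h_t + b_y)
    return h_t, y_t

# Toy dimensions (assumed): 10-dimensional input, 8 hidden units, 5 output labels
rng = np.random.default_rng(0)
W_i, W_R, W_Y = rng.normal(size=(8, 10)), rng.normal(size=(8, 8)), rng.normal(size=(5, 8))
b_h, b_y = np.zeros(8), np.zeros(5)
h = np.zeros(8)
for x in rng.normal(size=(3, 10)):   # a sequence of three time steps
    h, y = rnn_step(x, h, W_i, W_R, W_Y, b_h, b_y)
```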
LSTM networks are a special kind of RNN that is able to learn long-term dependencies. Consider a language model for text prediction, which predicts the next word based on the previous words. In order to predict "German" as the last word in the sentence "I grew up in Germany. I speak fluent German", the recent information suggests that the next word is probably the name of a language, but to narrow down which language, the context "Germany" is needed, which occurs quite far back in the sentence. This is known as a long-term dependency. LSTMs are capable of handling this type of dependency where plain RNNs fail. Figure 2 shows an LSTM cell.

Fig. 2 An LSTM cell



where
σ = logistic sigmoid function
i = input gate
f = forget gate
o = output gate
c = cell vector
h = hidden vector
W = weight matrix
The LSTM is implemented as follows:

• The primary step in the LSTM is to decide which data is to be discarded from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It is a function of h_{t-1} and x_t, as shown in Eq. (5), and outputs a number between 0 and 1 for each number in the cell state c_{t-1}. A 1 represents "completely keep this", while a 0 represents "completely get rid of this".

 
f_t = σ(W_f [h_{t-1}, x_t] + b_f)    (5)

• The second step is to decide which new data is to be stored. This has two parts: the input obtained from the previous timestamp and the new input are passed through a sigmoid function called the "input gate layer" to obtain i_t, as shown in Eq. (6). Next, the input obtained from the previous timestamp and the new input are passed through a tanh function. Both parts are combined with f_t passed from the previous step, as in Eq. (7).

 
i_t = σ(W_i [h_{t-1}, x_t] + b_i)    (6)

c_t = f_t c_{t-1} + i_t tanh(W_c [h_{t-1}, x_t] + b_c)    (7)

• The last step is to obtain the output using Eqs. (8) and (9), which is based on the cell state. First, a sigmoid layer decides which parts of the cell state are to be output. Then, a tanh function pushes the cell state values between −1 and 1, and the result is multiplied by the output of the sigmoid gate, so that only the decided parts form the output.

   
o_t = σ(W_o [h_{t-1}, x_t] + b_o)    (8)

h_t = o_t tanh(c_t)    (9)
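The gate computations above can be illustrated with a short NumPy sketch of a single LSTM cell step following Eqs. (5)–(9). The dimensions and random weights are illustrative assumptions, not the parameters used in this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # Each gate sees the concatenation [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)                        # Eq. (5): forget gate
    i_t = sigmoid(W_i @ z + b_i)                        # Eq. (6): input gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)   # Eq. (7): cell state update
    o_t = sigmoid(W_o @ z + b_o)                        # Eq. (8): output gate
    h_t = o_t * np.tanh(c_t)                            # Eq. (9): new hidden state
    return h_t, c_t

# Toy dimensions (assumed): 10-dimensional character embedding, 8 hidden units
rng = np.random.default_rng(0)
n_in, n_hid = 10, 8
W = lambda: rng.normal(size=(n_hid, n_hid + n_in))
W_f, W_i, W_c, W_o = W(), W(), W(), W()
b_f = b_i = b_c = b_o = np.zeros(n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(4, n_in)):   # a sequence of four time steps
    h, c = lstm_step(x, h, c, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)
```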

3.1 BLSTM Network

A sequence learning task requires both previous and forthcoming input features at time t. Hence, a BLSTM network is used to utilize past features and upcoming features at a given time t. The hidden layer of a BLSTM network [20] contains a sequence of forward and backward recurrent components connected to the identical output layer. Figure 3 shows a simple BLSTM network with four input units X_0 to X_3. The network hidden layer has four recurrent components h_0 to h_3 in the forward direction and four recurrent components h_0 to h_3 in the backward direction, which help predict the outputs Y_0 to Y_3 by forming an acyclic graph. For most text processing tasks, a BLSTM provides reasonable results in predicting sequences of data.

Fig. 3 A BLSTM network
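To make the bidirectional idea concrete, the following sketch runs a recurrent pass over a sequence in both directions and concatenates the two hidden states at each time step. A plain tanh recurrence is used here for brevity instead of an LSTM cell, and all dimensions are assumed.

```python
import numpy as np

def step(x_t, h_prev, W_x, W_h):
    # One simple recurrent step (a tanh cell stands in for an LSTM cell)
    return np.tanh(W_x @ x_t + W_h @ h_prev)

def bidirectional_pass(X, Wx_f, Wh_f, Wx_b, Wh_b):
    T, n_hid = X.shape[0], Wh_f.shape[0]
    fwd, bwd = np.zeros((T, n_hid)), np.zeros((T, n_hid))
    h = np.zeros(n_hid)
    for t in range(T):               # forward component: past context
        h = step(X[t], h, Wx_f, Wh_f)
        fwd[t] = h
    h = np.zeros(n_hid)
    for t in reversed(range(T)):     # backward component: future context
        h = step(X[t], h, Wx_b, Wh_b)
        bwd[t] = h
    # Both components feed the same output layer, so concatenate per time step
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))         # 4 time steps, 10-dimensional inputs (assumed)
Wx_f, Wh_f = rng.normal(size=(8, 10)), rng.normal(size=(8, 8))
Wx_b, Wh_b = rng.normal(size=(8, 10)), rng.normal(size=(8, 8))
print(bidirectional_pass(X, Wx_f, Wh_f, Wx_b, Wh_b).shape)   # (4, 16)
```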

4 Dataset Collection

Developing a transliteration model using neural networks necessitates a significant amount of bilingual parallel corpus. For a resource-poor language like Kannada, it is hard to obtain or build a large corpus for NLP applications. Therefore, we built a training corpus of around 100,000 English–Kannada bilingual words. The bilingual words were collected from the following sources.
• Various websites were scraped to collect the most familiar Kannada words and their Romanized forms. Special characters, punctuation marks, and numerals were removed by preprocessing (a sketch of this cleaning step is given after the list). This data contributed around 20% of the training data.




• The majority of the bilingual words were collected from music lyrics websites, which contain song lyrics in Kannada script along with the corresponding lyrics in Romanized Kannada. Non-Kannada words, punctuation, and vocalized words in the song lyrics were removed. The obtained list covers viable syllable patterns in Kannada and contributed around 70% of the training data.
• The remaining share of the corpus consisted of manually transliterated NEs.
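As an illustration of the cleaning step mentioned in the first bullet, a minimal sketch is given below; it is an assumed example, not the preprocessing code actually used to build the corpus.

```python
import re

def clean_romanized(line):
    """Keep only Latin-letter tokens: drop special characters, punctuation, and numerals."""
    line = re.sub(r"[^A-Za-z\s]", " ", line)   # strip punctuation, digits, and symbols
    return [w.lower() for w in line.split() if w.isalpha()]

def clean_kannada(line):
    """Keep only tokens written entirely in the Kannada Unicode block (U+0C80 to U+0CFF)."""
    return [w for w in line.split() if re.fullmatch(r"[\u0C80-\u0CFF]+", w)]

# Example on a hypothetical scraped lyric line:
print(clean_romanized("Tirugu, setuve 123!"))   # ['tirugu', 'setuve']
```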

5 Experiments

5.1 Setup

The proposed approach is implemented on the Python platform; the packages used are NumPy and the neural network toolkit Keras with TensorFlow as the backend. The network parameters are set up as in Table 1.

5.2 Training Procedure

The proposed model is implemented using a simple BLSTM network as shown in Fig. 4. The model is trained on the collected dataset, a bilingual corpus of Romanized Kannada words and their corresponding words in Kannada script. During training, 20% of the training data is set aside as validation data. Algorithm 1 describes the network training procedure. In each epoch, the entire training data is divided into batches, and one batch is processed at a given time. The batch size determines the number of words included in a batch. The characters of each input word are embedded and provided as input to the forward and backward states of the LSTM. The errors are then backpropagated from the output to the input to update the network parameters.

Table 1 Model parameters

Parameter                    Value
No. of epochs                30
Batch size                   128
Hidden units                 128
Embedding dimension          64
Validation split             0.2
Output activation function   Softmax
Learning rate                0.001
Training model               Bidirectional LSTM

Fig. 4 BLSTM network

Algorithm 1 BLSTM Model Training Procedure

for each epoch do
    for each batch do
        1) bidirectional LSTM model forward pass:
            forward pass for forward state LSTM
            forward pass for backward state LSTM
        2) bidirectional LSTM model backward pass:
            backward pass for forward state LSTM
            backward pass for backward state LSTM
        3) update parameters
    end for
end for
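For illustration, the following is a minimal Keras sketch of how a character-level BLSTM sequence model could be assembled and trained with the parameters of Table 1. The vocabulary sizes, sequence length, dummy data, and per-position alignment of source and target characters are assumptions; the authors' exact architecture and data preparation may differ.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from tensorflow.keras.optimizers import Adam

# Assumed sizes: 30 source (Latin) characters incl. padding, 60 target (Kannada) characters, max word length 20
num_src_chars, num_tgt_chars, max_len = 30, 60, 20

model = Sequential([
    Embedding(input_dim=num_src_chars, output_dim=64, mask_zero=True),  # embedding dimension 64 (Table 1)
    Bidirectional(LSTM(128, return_sequences=True)),                    # 128 hidden units (Table 1)
    TimeDistributed(Dense(num_tgt_chars, activation="softmax")),        # softmax output (Table 1)
])
model.compile(optimizer=Adam(learning_rate=0.001),                      # learning rate 0.001 (Table 1)
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy stand-ins for the encoded corpus: padded source and target character indices
X = np.random.randint(1, num_src_chars, size=(1000, max_len))
Y = np.random.randint(1, num_tgt_chars, size=(1000, max_len, 1))
model.fit(X, Y, batch_size=128, epochs=30, validation_split=0.2)        # batch size, epochs, split per Table 1
```

In practice, source and target words of unequal length would require an alignment step or an encoder-decoder arrangement; the sketch only shows how the settings in Table 1 plug into a character-level BLSTM.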

6 Results

The model was tested on a dataset of around 3K words collected from random websites. The test dataset contains Romanized words and their transliterated forms in Kannada script, which are kept as references to compare against the results.
A snapshot of the results obtained from the model is shown in Table 2. The correctness of the transliteration is measured by the Accuracy (ACC), also called the Word Accuracy Rate (WAR), yielded by a transliteration model. For completeness, transliteration results obtained by RNN and LSTM networks trained on the same dataset are reported in Table 3.
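For reference, word accuracy (ACC) over a test set can be computed as the fraction of words whose predicted transliteration exactly matches the gold standard. The snippet below is an assumed illustration of this metric, not the evaluation script used here.

```python
def word_accuracy(predictions, references):
    """ACC: percentage of test words whose predicted transliteration exactly matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Hypothetical example with three test words, two predicted correctly
preds = ["tirugu_kn", "setuve_kn", "wrong_kn"]       # placeholder model outputs
refs  = ["tirugu_kn", "setuve_kn", "sadbhava_kn"]    # placeholder gold-standard words
print(word_accuracy(preds, refs))                    # approximately 66.67
```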

Table 2 Snapshot of results

Romanized Kannada word   Gold standard transliterated word   Resultant word   Transliteration result
Tirugu                                                                        Correct
Setuve                                                                        Correct
Aggalikeya                                                                    Correct
Sadbhava                                                                      Incorrect
Anaupacharika                                                                 Incorrect

Table 3 Evaluation results

Model    Accuracy obtained (%)
RNN      74.33
LSTM     79.76
BLSTM    83.32

7 Conclusion and Future Work

Transliteration is the task of mapping graphemes or phonemes of one language into phoneme approximations of another language. It is the elementary step for most NLP applications such as MT, CLIR, and text mining. English and Kannada follow dissimilar scripts and also vary in their phonetics. Furthermore, the Romanization of Kannada words does not follow a standard pattern as far as their pronunciation is concerned. Thus, a particular set of rules does not guarantee an effective back-transliteration. In this paper, we have presented a transliteration model for the English–Kannada language pair.
A character-level BLSTM model was investigated, which utilizes character embeddings for words. Along with BLSTM, the model was also tested with LSTM and RNN networks for English and Kannada. The correctness of the transliteration was measured by accuracy on a test set of 3000 Romanized Kannada words. As a BLSTM has two networks, one accessing information in the forward direction and the other in the reverse direction, the output is generated from both past and future context. The accuracy obtained by the BLSTM was higher than that of the LSTM and RNN on this test data.
There are several possible directions for future improvement. First, the model could be further improved by combining other algorithms with BLSTM; for example, combining a CNN with BLSTM or a CRF with BLSTM could yield better results. Another improvement is to expand the training data by collecting data from other domains such as social media (Twitter and Weibo), which would include all possible orthographic variations. Since the model is not restricted to a specific domain or knowledge, social media text would also provide a fair share of the training data.

References

1. Kalchbrenner, N., & Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
2. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems.
3. Sharma, A., & Rattan, D. (2017). Machine transliteration for Indian languages: A review. International Journal, 8(8).
4. Dhore, M. L., Dhore, R. M., & Rathod, P. H. (2015). Survey on machine transliteration and machine learning models. International Journal on Natural Language Computing (IJNLC), 4(2).
5. Sequiera, R. D., Rao, S. S., & Shambavi, B. R. (2014). Word-level language identification and back transliteration of romanized text: A shared task report by BMSCE. In Shared Task System Description in MSRI FIRE Working Notes.
6. Choudhury, M., et al. (2014). Overview of FIRE 2014 track on transliterated search. Proceedings of FIRE, 68–89.
7. Sanjanaashree, P. (2014). Joint layer based deep learning framework for bilingual machine transliteration. In 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE.
8. Mahsuli, M. M., & Safabakhsh, R. (2017). English to Persian transliteration using attention-based approach in deep learning. In 2017 Iranian Conference on Electrical Engineering (ICEE). IEEE.
9. Sunitha, C., & Jaya, A. (2015). A phoneme based model for English to Malayalam transliteration. In 2015 International Conference on Innovation Information in Computing Technologies (ICIICT). IEEE.
10. Rathod, P. H., Dhore, M. L., & Dhore, R. M. (2013). Hindi and Marathi to English machine transliteration using SVM. International Journal on Natural Language Computing, 2(4), 55–71.
11. Jindal, S. (2015, May). N-gram machine translation system for English to Punjabi transliteration. International Journal of Advances in Electronics and Computer Science, 2(5). ISSN 2393-2835.
12. AL-Farjat, A. H. (2012). Automatic transliteration among Indic scripts using code mapping formula. European Scientific Journal (ESJ), 8(11).
13. Dasgupta, T., Sinha, M., & Basu, A. (2013). A joint source channel model for the English to Bengali back transliteration. In Mining intelligence and knowledge exploration (pp. 751–760). Cham: Springer.
14. Dhindsa, B. K., & Sharma, D. V. (2017). English to Hindi transliteration system using combination-based approach. International Journal, 8(8).
15. Antony, P. J., Ajith, V. P., & Soman, K. P. (2010). Statistical method for English to Kannada transliteration. In Information Processing and Management (pp. 356–362). Berlin, Heidelberg: Springer.
16. Reddy, M. V., & Hanumanthappa, M. (2011). English to Kannada/Telugu name transliteration in CLIR: A statistical approach. International Journal of Machine Intelligence, 3(4).
17. Antony, P. J., Ajith, V. P., & Soman, K. P. (2010). Kernel method for English to Kannada transliteration. In 2010 International Conference on Recent Trends in Information, Telecommunication and Computing (ITC). IEEE.
18. Mikolov, T., et al. (2011). Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
19. Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
20. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
