short-paper

Morphological Segmentation to Improve Crosslingual Word Embeddings for Low Resource Languages

Authors:

Santwana Chimalamarri,

Dinkar Sitaram,

Ashritha Jain

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 19, Issue 5

Article No.: 69, Pages 1 - 15

https://doi.org/10.1145/3390298

Published: 21 June 2020

Abstract

Crosslingual word embeddings developed from multiple parallel corpora help in understanding the relationships between languages and in improving the prediction quality of machine translation. However, for low-resource languages with complex, agglutinative morphology, inducing good-quality crosslingual embeddings is challenging because of complex morphological forms and rare words. This holds even for languages that share a common linguistic structure. We show that performing a simple morphological segmentation of the corpora before generating crosslingual word embeddings for both roots and suffixes greatly improves the prediction quality and captures semantic similarities more effectively. To demonstrate this, we have chosen two related languages of the Dravidian language family: Telugu and Kannada. We have also tested our method on Hindi, a widely spoken North Indian language belonging to the Indo-European language family, and have observed encouraging results.
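
The preprocessing idea in the abstract, segmenting each word into a root and a suffix before embeddings are trained, can be illustrated with a minimal sketch. This is not the authors' segmenter: the suffix inventory `TOY_SUFFIXES`, the `+` marker on suffix tokens, the `min_root` threshold, and the romanized example strings are all assumptions chosen purely for illustration.

```python
from typing import List, Sequence

# Hypothetical romanized suffix inventory (an assumption for illustration,
# not the suffix list used in the paper).
TOY_SUFFIXES: Sequence[str] = ("lu", "lo", "ki")


def segment_token(token: str,
                  suffixes: Sequence[str] = TOY_SUFFIXES,
                  min_root: int = 2) -> List[str]:
    """Split a token into [root, '+suffix'] by longest-suffix matching.

    The token is returned unsegmented if no suffix matches or if the
    remaining root would be shorter than min_root characters.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_root:
            return [token[: -len(suffix)], "+" + suffix]
    return [token]


def segment_sentence(sentence: str,
                     suffixes: Sequence[str] = TOY_SUFFIXES) -> List[str]:
    """Segment every whitespace-separated token of a sentence."""
    segmented: List[str] = []
    for token in sentence.split():
        segmented.extend(segment_token(token, suffixes))
    return segmented


def segment_corpus(lines: Sequence[str]) -> List[List[str]]:
    """Segment a corpus given as a sequence of sentences."""
    return [segment_sentence(line) for line in lines]


if __name__ == "__main__":
    # Placeholder strings, not verified word forms.
    for tokens in segment_corpus(["gudilo pustakamlu", "lo"]):
        print(tokens)
```

Under these assumptions, roots and suffixes become separate tokens, so an embedding trainer run on the segmented corpus would learn one vector per root and one per suffix rather than one vector per rare inflected form.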

Index Terms

Morphological Segmentation to Improve Crosslingual Word Embeddings for Low Resource Languages
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Machine translation
      2. Phonology / morphology

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 19, Issue 5

September 2020

278 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3403646

Editor:
Imed Zitouni
Microsoft, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 June 2020

Online AM: 07 May 2020

Accepted: 01 March 2020

Revised: 01 January 2020

Received: 01 February 2018

Published in TALLIP Volume 19, Issue 5

Qualifiers

Short-paper
Research
Refereed

Cited By

Jha P., Kumar R., Sahula V. (2023). Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 22:4 (1-15). Online publication date: 12-Apr-2023.
https://dl.acm.org/doi/10.1145/3580495
Raja E., Soni B., Borgohain S. (2023). Fake news detection in Dravidian languages using transfer learning with adaptive finetuning. Engineering Applications of Artificial Intelligence 126:PA. Online publication date: 1-Nov-2023.
https://dl.acm.org/doi/10.1016/j.engappai.2023.106877
Shi S., Wu X., Su R., Huang H. (2022). Low-resource Neural Machine Translation: Methods and Trends. ACM Transactions on Asian and Low-Resource Language Information Processing 21:5 (1-22). Online publication date: 15-Nov-2022.
https://dl.acm.org/doi/10.1145/3524300
Kumar A., Albuquerque V. (2021). Sentiment Analysis Using XLM-R Transformer and Zero-shot Transfer Learning on Resource-poor Indian Language. ACM Transactions on Asian and Low-Resource Language Information Processing 20:5 (1-13). Online publication date: 30-Jun-2021.
https://dl.acm.org/doi/10.1145/3461764
Sangwan S., Bhatia M. (2021). Denigrate Comment Detection in Low-Resource Hindi Language Using Attention-Based Residual Networks. ACM Transactions on Asian and Low-Resource Language Information Processing 21:1 (1-14). Online publication date: 29-Nov-2021.
https://dl.acm.org/doi/10.1145/3431729
Chimalamarri S., Sitaram D. (2021). Linguistically enhanced word segmentation for better neural machine translation of low resource agglutinative languages. International Journal of Speech Technology 24:4 (1047-1053). Online publication date: 1-Dec-2021.
https://dl.acm.org/doi/10.1007/s10772-021-09865-5
