Morphological Segmentation to Improve Crosslingual Word Embeddings for Low Resource Languages

Published: 21 June 2020

Abstract

Crosslingual word embeddings learned from multiple parallel corpora help in understanding the relationships between languages and in improving the prediction quality of machine translation. However, for low-resource languages with complex, agglutinative morphology, inducing good-quality crosslingual embeddings is challenging because of the many complex morphological forms and rare words. This holds even for languages that share a common linguistic structure. In this work, we show that performing a simple morphological segmentation on the corpora before generating crosslingual word embeddings, so that embeddings are learned for both roots and suffixes, greatly improves the prediction quality and captures semantic similarities more effectively. To demonstrate this, we chose two related languages of the Dravidian language family, Telugu and Kannada. We also tested our method on Hindi, a widely spoken North Indian language belonging to the Indo-European language family, and observed encouraging results.
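
The preprocessing idea in the abstract, splitting words into roots and suffixes before any embeddings are trained, can be illustrated with a minimal Python sketch. This is not the authors' segmenter: the suffix inventory, the `+` marker convention, the minimum root length, and the function names `segment` and `segment_sentence` are all assumptions made for illustration, and the romanized word forms are toy inputs rather than verified linguistic data.

```python
# Hypothetical suffix inventory and settings (assumptions, not from the paper).
SUFFIXES = ("galu", "lu", "lo", "ki")
MIN_ROOT_LEN = 2  # refuse splits that would leave a very short root


def segment(word, suffixes=SUFFIXES, min_root_len=MIN_ROOT_LEN):
    """Split a word into [root, '+suffix'] using longest-suffix matching.

    Words with no matching suffix, or whose root would be too short,
    are returned unchanged as a one-element list.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        root_len = len(word) - len(suffix)
        if word.endswith(suffix) and root_len >= min_root_len:
            return [word[:root_len], "+" + suffix]
    return [word]


def segment_sentence(tokens):
    """Segment every token so roots and suffixes become separate tokens."""
    segmented = []
    for token in tokens:
        segmented.extend(segment(token))
    return segmented


if __name__ == "__main__":
    # Toy romanized tokens, used only to show the shape of the output.
    print(segment_sentence(["maragalu", "uurilo", "ki"]))
```

A corpus rewritten this way could then be passed to any embedding trainer (the author tags mention word2vec), so that roots and suffixes each receive their own vectors and rare inflected forms share statistics through their common root.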

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 5
      September 2020
      278 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3403646
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 June 2020
      Online AM: 07 May 2020
      Accepted: 01 March 2020
      Revised: 01 January 2020
      Received: 01 February 2018
      Published in TALLIP Volume 19, Issue 5

      Author Tags

      1. Word embeddings
      2. bilingual embeddings
      3. crosslingual embeddings
      4. linear transformation
      5. machine translation
      6. morphologically rich languages
      7. morphology
      8. supervised learning
      9. word2vec

      Qualifiers

      • Short-paper
      • Research
      • Refereed

      Cited By

      • (2023) Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 22:4 (1-15). DOI: 10.1145/3580495. Online publication date: 12-Apr-2023
      • (2023) Fake news detection in Dravidian languages using transfer learning with adaptive finetuning. Engineering Applications of Artificial Intelligence 126:PA. DOI: 10.1016/j.engappai.2023.106877. Online publication date: 1-Nov-2023
      • (2022) Low-resource Neural Machine Translation: Methods and Trends. ACM Transactions on Asian and Low-Resource Language Information Processing 21:5 (1-22). DOI: 10.1145/3524300. Online publication date: 15-Nov-2022
      • (2021) Sentiment Analysis Using XLM-R Transformer and Zero-shot Transfer Learning on Resource-poor Indian Language. ACM Transactions on Asian and Low-Resource Language Information Processing 20:5 (1-13). DOI: 10.1145/3461764. Online publication date: 30-Jun-2021
      • (2021) Denigrate Comment Detection in Low-Resource Hindi Language Using Attention-Based Residual Networks. ACM Transactions on Asian and Low-Resource Language Information Processing 21:1 (1-14). DOI: 10.1145/3431729. Online publication date: 29-Nov-2021
      • (2021) Linguistically enhanced word segmentation for better neural machine translation of low resource agglutinative languages. International Journal of Speech Technology 24:4 (1047-1053). DOI: 10.1007/s10772-021-09865-5. Online publication date: 1-Dec-2021
