Abstract
Most recent state-of-the-art approaches to bilingual lexicon induction rely on pre-trained word embeddings. However, word embeddings introduce noise for both frequent and rare words; rare words in particular are often poorly represented because they occur infrequently in the training data. To alleviate this problem, we propose BLIMO, a simple yet effective approach to automatic lexicon induction. It does not use word embeddings; instead, it casts lexicon induction as a maximum weighted matching problem, which can be solved efficiently by matching optimization with greedy search. Empirical experiments demonstrate that our method substantially outperforms state-of-the-art baselines on two standard benchmarks.
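To make the matching step concrete, below is a minimal sketch of greedy maximum-weight bipartite matching between a source and a target vocabulary. It illustrates the general technique only: the function name greedy_max_weight_matching and the toy score matrix are hypothetical, and the actual weight function and optimization details used in BLIMO may differ from this simplification.

def greedy_max_weight_matching(weights):
    # weights[i][j]: association score between source word i and target word j
    # (e.g. a co-occurrence-based statistic; the exact scoring is an assumption here).
    n_src, n_tgt = len(weights), len(weights[0])
    # Enumerate all candidate pairs and visit them in decreasing weight order.
    pairs = sorted(
        ((weights[i][j], i, j) for i in range(n_src) for j in range(n_tgt)),
        reverse=True,
    )
    matched_src, matched_tgt, matching = set(), set(), []
    for w, i, j in pairs:
        # Greedily accept a pair only if both sides are still unmatched.
        if i not in matched_src and j not in matched_tgt:
            matching.append((i, j, w))
            matched_src.add(i)
            matched_tgt.add(j)
    return matching

# Toy example: 3 source words x 3 target words with hypothetical scores.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.2, 0.5, 0.7],
]
print(greedy_max_weight_matching(scores))
# -> [(0, 0, 0.9), (1, 1, 0.8), (2, 2, 0.7)]

Greedy selection does not guarantee the globally optimal assignment that exact methods (e.g. the Hungarian algorithm) would find, but it avoids their cubic cost: sorting the candidate pairs dominates, giving O(NM log(NM)) time for N source and M target words.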
Acknowledgement
This work was supported by SFSMBRP (2018YFB1005100), BIGKE (No. 20160754021), NSFC (No. 61772076 and 61751201), NSFB (No. Z181100008918002), Major Project of Zhijiang Lab (No. 2019DH0ZX01), and CETC (No. w-2018018).
Cite this paper
Chi, Z., Huang, H., Zhao, S., Xu, H.D., Mao, X.L. (2019). Fast and Accurate Bilingual Lexicon Induction via Matching Optimization. In: Tang, J., Kan, M.Y., Zhao, D., Li, S., Zan, H. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol. 11838. Springer, Cham. https://doi.org/10.1007/978-3-030-32233-5_57