A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

Authors:

Arbi Haza Nasution,

Yohei Murakami,

Toru Ishida

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 17, Issue 2

Article No.: 9, Pages 1 - 29

https://doi.org/10.1145/3138815

Published: 13 November 2017

Abstract

The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction difficult for low-resource languages. The pivot language and cognate recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending the constraints of a recent pivot-based induction technique and by enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article uses four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and numbers of symmetry assumption cycles to obtain the highest F-score. Our proposed methods achieve statistically significant improvements in precision and F-score over our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models that use parallel corpora for high-resource languages, while also handling low-resource languages well.
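As a rough illustration of the Cartesian product baseline and the evaluation metrics named in the abstract, the following Python sketch pairs every source word with every target word that shares a pivot word, then scores the result against a gold standard. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary layout (word mapped to a set of pivot translations), the placeholder words `a1`, `b1`, `p1`, and the choice of the balanced F-score (F1) are all hypothetical.

```python
from itertools import product


def cartesian_pivot_pairs(dict_a_pivot, dict_b_pivot):
    """Pair each A word with each B word that shares at least one pivot word.

    Both arguments map a word to the set of its pivot-language translations
    (an assumed layout, chosen for illustration).
    """
    pivot_to_a = {}
    for a_word, pivots in dict_a_pivot.items():
        for pivot in pivots:
            pivot_to_a.setdefault(pivot, set()).add(a_word)

    pivot_to_b = {}
    for b_word, pivots in dict_b_pivot.items():
        for pivot in pivots:
            pivot_to_b.setdefault(pivot, set()).add(b_word)

    pairs = set()
    # Only pivot words present in both dictionaries can connect A and B.
    for pivot in pivot_to_a.keys() & pivot_to_b.keys():
        pairs.update(product(pivot_to_a[pivot], pivot_to_b[pivot]))
    return pairs


def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for a set of predicted pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Placeholder data: a1/a2 are language A words, b1/b2 are language B words,
# p1/p2 are pivot words. None of these come from the article's datasets.
dict_a_pivot = {"a1": {"p1"}, "a2": {"p1", "p2"}}
dict_b_pivot = {"b1": {"p1"}, "b2": {"p2"}}
gold = {("a1", "b1"), ("a2", "b2")}

pairs = cartesian_pivot_pairs(dict_a_pivot, dict_b_pivot)
precision, recall, f_score = precision_recall_f1(pairs, gold)
print(sorted(pairs))
print(precision, recall, f_score)
```

On this toy input the baseline proposes three pairs, one of which (`a2`, `b1`) is not in the gold set. That shows why an unconstrained Cartesian product tends to favor recall over precision, which is the gap that constraint-based filtering aims to close.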

Index Terms

A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
      2. Lexical semantics

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 17, Issue 2

June 2018

134 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3160862

Editor:
Nianwen Xue
Brandeis University, Waltham, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2017

Accepted: 01 September 2017

Revised: 01 August 2017

Received: 01 February 2017

Published in TALLIP Volume 17, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Grant-in-Aid for Scientific Research (A)
Indonesia Endowment Fund for Education (LPDP)
Grant-in-Aid for Young Scientists (A)
Japan Society for the Promotion of Science (JSPS)
