A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

Published: 13 November 2017

Abstract

The scarcity or absence of parallel and comparable corpora makes bilingual lexicon extraction difficult for low-resource languages. Pivot-language and cognate-recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending the constraints of a recent pivot-based induction technique and by enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article uses four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach lets the user conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and numbers of symmetry assumption cycles to obtain the highest F-score. Our proposed methods yield statistically significant improvements in precision and F-score over our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
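As a rough illustration of the Cartesian-product baseline and the evaluation metrics named in the abstract, the sketch below pairs every source word with every target word that shares a pivot word, then scores the candidates with precision, recall, and F-score. This is a simplified reading of the baseline, not the article's implementation; the toy dictionaries, the gold set, and the function names `pivot_candidates` and `precision_recall_f1` are hypothetical.

```python
from itertools import product


def pivot_candidates(src_to_pivot, tgt_to_pivot):
    """Pair each source word with each target word sharing a pivot word.

    Assumption: each input dictionary maps a word to a set of pivot-language
    translations. This is a simplification of the baseline in the abstract.
    """
    pivot_to_src = {}
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            pivot_to_src.setdefault(pivot, set()).add(src)

    pivot_to_tgt = {}
    for tgt, pivots in tgt_to_pivot.items():
        for pivot in pivots:
            pivot_to_tgt.setdefault(pivot, set()).add(tgt)

    pairs = set()
    for pivot, sources in pivot_to_src.items():
        # Cartesian product of the source and target words meeting at this pivot.
        pairs.update(product(sources, pivot_to_tgt.get(pivot, set())))
    return pairs


def precision_recall_f1(predicted, gold):
    """Score a set of predicted translation pairs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: words a1, a2 (source), b1, b2 (target), p1, p2 (pivot).
src_to_pivot = {"a1": {"p1"}, "a2": {"p1", "p2"}}
tgt_to_pivot = {"b1": {"p1"}, "b2": {"p2"}}
gold = {("a1", "b1"), ("a2", "b2")}

candidates = pivot_candidates(src_to_pivot, tgt_to_pivot)
precision, recall, f1 = precision_recall_f1(candidates, gold)
print(sorted(candidates), precision, recall, f1)
```

In this toy case the baseline over-generates the pair ("a2", "b1"), which lowers precision while recall stays perfect; filtering such spurious pairs is the kind of problem the constraint-based methods in the abstract are meant to address.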


      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 2
      June 2018
      134 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3160862
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 November 2017
      Accepted: 01 September 2017
      Revised: 01 August 2017
      Received: 01 February 2017
      Published in TALLIP Volume 17, Issue 2

      Author Tags

      1. Constraint satisfaction problem
      2. closely-related languages
      3. cognate recognition
      4. low-resource languages
      5. pivot-based bilingual lexicon induction

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • Grant-in-Aid for Scientific Research (A)
      • Indonesia Endowment Fund for Education (LPDP)
      • Grant-in-Aid for Young Scientists (A)
      • Japan Society for the Promotion of Science (JSPS)
