research-article

A Constraint Approach to Pivot-Based Bilingual Dictionary Induction

Published: 21 November 2015

Abstract

High-quality bilingual dictionaries are very useful, but such resources are rarely available for lower-density language pairs, especially closely related ones. Using a third (pivot) language to link two other languages is a well-known solution that usually requires only two input bilingual dictionaries, A-B and B-C, to automatically induce a new one, A-C. This approach, however, has never been shown to use the complete structures of the input bilingual dictionaries, which is a key failing because the dropped meanings negatively influence the result. This article proposes a constraint approach to pivot-based dictionary induction for the case where languages A and C are closely related. We create constraints from language similarity and model the structures of the input dictionaries as a Boolean optimization problem, which is then formulated within the Weighted Partial Max-SAT framework, an extension of Boolean Satisfiability (SAT). The problem is encoded as formulas in Conjunctive Normal Form (CNF), the predominant input language of modern SAT/Max-SAT solvers, and all of the encoded formulas are evaluated by a solver to produce the target (output) bilingual dictionary. We also discuss alternative formalizations as a comparison study. We designed a tool that implements our method, using the Sat4j library as the default solver, and conducted an experiment in which the output bilingual dictionary achieved better quality than that of the baseline method.
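To make the pipeline in the abstract concrete, the following is a minimal sketch in Python, not the authors' encoding or their Sat4j-based tool. It assumes toy dictionaries with made-up words, a one-to-one constraint between A and C words as the only hard constraint, and string similarity (via `difflib`) as the soft-clause weight; all of these choices, and the function names `pivot_candidates`, `solve_wpmaxsat`, and `induce`, are illustrative assumptions. Clauses use DIMACS-style integer literals, and the tiny instance is solved by brute force rather than by a real Max-SAT solver.

```python
from difflib import SequenceMatcher
from itertools import combinations, product

# Hypothetical toy dictionaries (not data from the article): B is the pivot.
dict_ab = {"tasu": {"p1", "p2"}, "miro": {"p2"}}
dict_bc = {"p1": {"taso"}, "p2": {"taso", "mira"}}

def pivot_candidates(ab, bc):
    """Collect every A-C pair linked through at least one pivot word."""
    pairs = set()
    for a, pivots in ab.items():
        for b in pivots:
            for c in bc.get(b, ()):
                pairs.add((a, c))
    return sorted(pairs)

def satisfied(clause, assignment):
    """A clause holds if any literal agrees with the assignment."""
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)

def solve_wpmaxsat(num_vars, hard, soft):
    """Brute-force Weighted Partial Max-SAT: satisfy all hard clauses and
    minimize the total weight of falsified soft clauses."""
    best, best_cost = None, None
    for assignment in product([False, True], repeat=num_vars):
        if not all(satisfied(c, assignment) for c in hard):
            continue
        cost = sum(w for w, c in soft if not satisfied(c, assignment))
        if best_cost is None or cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost

def induce(ab, bc):
    """Encode candidate pairs as Boolean variables and pick the best subset."""
    pairs = pivot_candidates(ab, bc)
    var = {pair: i + 1 for i, pair in enumerate(pairs)}
    # Hard clauses (assumed one-to-one constraint): two selected pairs
    # may not share an A word or a C word.
    hard = [[-var[p], -var[q]] for p, q in combinations(pairs, 2)
            if p[0] == q[0] or p[1] == q[1]]
    # Soft unit clauses: prefer selecting pairs whose spellings are similar.
    soft = [(round(100 * SequenceMatcher(None, a, c).ratio()), [var[(a, c)]])
            for a, c in pairs]
    assignment, _ = solve_wpmaxsat(len(pairs), hard, soft)
    return {pair for pair in pairs if assignment[var[pair] - 1]}

selected = induce(dict_ab, dict_bc)
print(sorted(selected))
```

Brute force is exponential in the number of candidate pairs, so a realistic instance would hand the same hard and soft clauses to a dedicated Max-SAT solver instead.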



    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 1
    January 2016
    89 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/2847552

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 November 2015
    Accepted: 01 January 2015
    Revised: 01 October 2014
    Received: 01 August 2013
    Published in TALLIP Volume 15, Issue 1


    Author Tags

    1. Bilingual dictionary induction
    2. Weighted Partial Max-SAT
    3. constraint satisfaction problem
    4. low-resource languages
    5. pivot language

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • JST RISTEX, and a Grant-in-Aid for Scientific Research (S)


    Cited By

    • (2024) Human–Machine Collaboration for a Multilingual Service Platform. Human-Centered Services Computing for Smart Cities, 57-101. DOI: 10.1007/978-981-97-0779-9_3. Online publication date: 5-May-2024
    • (2023) Sinhala-English Parallel Word Dictionary Dataset. 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS), 61-66. DOI: 10.1109/ICIIS58898.2023.10253560. Online publication date: 25-Aug-2023
    • (2021) Plan Optimization to Bilingual Dictionary Induction for Low-resource Language Families. ACM Transactions on Asian and Low-Resource Language Information Processing 20:2, 1-28. DOI: 10.1145/3448215. Online publication date: 15-Mar-2021
    • (2020) Towards Language Service Creation and Customization for Low-Resource Languages. Information 11:2, 67. DOI: 10.3390/info11020067. Online publication date: 27-Jan-2020
    • (2020) Toward Formalization of Comprehensive Bilingual Dictionaries Creation Planning as Constraint Optimization Problem. Optimization Based Model Using Fuzzy and Other Statistical Techniques Towards Environmental Sustainability, 41-54. DOI: 10.1007/978-981-15-2655-8_3. Online publication date: 28-Feb-2020
    • (2019) Indonesia Language Sphere: an ecosystem for dictionary development for low-resource languages. Journal of Physics: Conference Series 1192, 012001. DOI: 10.1088/1742-6596/1192/1/012001. Online publication date: 17-May-2019
    • (2018) A Constraint Approach to Lexicon Induction for Low-Resource Languages. Services Computing for Language Resources, 109-123. DOI: 10.1007/978-981-10-7793-7_7. Online publication date: 24-Feb-2018
    • (2017) A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families. ACM Transactions on Asian and Low-Resource Language Information Processing 17:2, 1-29. DOI: 10.1145/3138815. Online publication date: 13-Nov-2017
    • (2017) Plan Optimization for Creating Bilingual Dictionaries of Low-Resource Languages. 2017 International Conference on Culture and Computing (Culture and Computing), 35-41. DOI: 10.1109/Culture.and.Computing.2017.21. Online publication date: Sep-2017
    • (2016) Intercultural Collaboration and Support Systems: A Brief History. PRIMA 2016: Principles and Practice of Multi-Agent Systems, 3-19. DOI: 10.1007/978-3-319-44832-9_1. Online publication date: 10-Aug-2016
