Automatic extraction of bilingual word pairs using inductive chain learning in various languages

Authors:

Hiroshi Echizen-ya,

Yoshio Momouchi

Information Processing and Management: an International Journal, Volume 42, Issue 5

Pages 1294 - 1315

https://doi.org/10.1016/j.ipm.2005.11.004

Published: 01 September 2006

Abstract

In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. A cross-language information retrieval system must deal with many languages, so automatic extraction of bilingual word pairs from parallel corpora in various languages is important. However, previous methods based on statistical approaches are insufficient because of the sparse data problem. Our learning method automatically acquires rules that are effective in solving the sparse data problem, using only parallel corpora and no previously prepared bilingual resource (e.g., a bilingual dictionary or a machine translation system). We call this learning method Inductive Chain Learning (ICL). Moreover, a system using ICL can extract bilingual word pairs even from bilingual sentence pairs in which the grammatical structure of the source language differs from that of the target language, because the acquired rules carry the information needed to cope with the different word orders of the source and target languages in local parts of bilingual sentence pairs. Evaluation experiments demonstrated that the recall of systems based on several statistical approaches improved through the use of ICL.

Index Terms

Automatic extraction of bilingual word pairs using inductive chain learning in various languages
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval models and ranking

Published In

Information Processing and Management: an International Journal Volume 42, Issue 5

September 2006

266 pages

ISSN: 0306-4573

Publisher

Pergamon Press, Inc.

United States
