Automatic extraction of bilingual word pairs using inductive chain learning in various languages

Published: 01 September 2006

Abstract

In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. A cross-language information retrieval system must deal with various languages, so automatic extraction of bilingual word pairs from parallel corpora in various languages is important. However, previous work based on statistical methods is insufficient because of the sparse data problem. Our learning method automatically acquires rules that are effective for solving the sparse data problem. It acquires them from parallel corpora alone, without any prior preparation of a bilingual resource (e.g., a bilingual dictionary or a machine translation system). We call this learning method Inductive Chain Learning (ICL). Moreover, a system using ICL can extract bilingual word pairs even from bilingual sentence pairs in which the grammatical structure of the source language differs from that of the target language, because the acquired rules carry the information needed to cope with the different word orders of the source and target languages in local parts of bilingual sentence pairs. Evaluation experiments demonstrated that the recall of systems based on several statistical approaches was improved through the use of ICL.
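To make the sparse data problem concrete, the following is a minimal sketch of one common statistical baseline: scoring candidate word pairs by the Dice coefficient over sentence-level co-occurrence. It is not the ICL method, and the abstract does not say which statistical approaches were evaluated. The function name `dice_scores` and the toy English-French corpus are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Score each (source word, target word) pair by the Dice coefficient.

    Illustrative baseline only (hypothetical helper, not the paper's method).
    Counts are per sentence pair: a word is counted once per sentence.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        # Every source word co-occurs with every target word in the pair.
        pair_freq.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }


# Toy parallel corpus (assumed data, for illustration only).
corpus = [
    ("the dog runs", "le chien court"),
    ("the dog sleeps", "le chien dort"),
    ("the cat sleeps", "le chat dort"),
    ("birds sing", "les oiseaux chantent"),
]
scores = dice_scores(corpus)

# "dog" occurs twice, so co-occurrence separates the candidates.
print(scores[("dog", "chien")], scores[("dog", "le")])

# "birds" occurs once, so every candidate in its sentence ties.
print(sorted(t for (s, t) in scores if s == "birds"))
```

In this sketch, a word seen in two sentence pairs ("dog") gets a clearly best candidate, while a word seen only once ("birds") scores identically against every target word in its sentence. Purely frequency-based scores cannot break that tie, which is the kind of low-frequency case the abstract describes as the sparse data problem.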



Information & Contributors

Information

Published In

Information Processing and Management: an International Journal, Volume 42, Issue 5
September 2006
266 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. bilingual word pairs
  2. learning method
  3. parallel corpora
  4. sparse data problem
  5. statistical approach
  6. various languages

Qualifiers

  • Article
