Abstract
Mining parallel data from comparable corpora is a promising approach to overcoming data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect such parallel phrases creates a valuable resource, especially for low-resource languages. In this chapter we explore three phrase alignment approaches to detect parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment; a phrase extraction approach that does not rely on the Viterbi path but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the effectiveness of these approaches in detecting alignments for phrase pairs that have a known alignment in comparable sentence pairs. The results show that the non-Viterbi alignment approach outperforms the other two approaches in terms of F-measure.
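To make the first of the three approaches concrete, the following is a minimal sketch of the widely used consistency-based phrase extraction heuristic applied to a single (Viterbi) word alignment. It is an illustration under stated assumptions, not the chapter's own implementation: the function name `extract_phrase_pairs`, the `max_len` limit, and the toy alignment are hypothetical, and the usual extension of phrase boundaries over unaligned words is left out for brevity.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]                      # (source index, target index)
Span = Tuple[int, int]                      # inclusive (start, end)


def extract_phrase_pairs(src_len: int,
                         alignment: Set[Link],
                         max_len: int = 4) -> List[Tuple[Span, Span]]:
    """Return phrase pairs that are consistent with the word alignment.

    A pair is consistent when no word inside either span is linked to a
    word outside the other span. Assumption: unaligned boundary words
    are not used to extend the spans (a simplification).
    """
    pairs = []
    for s_start in range(src_len):
        for s_end in range(s_start, min(s_start + max_len, src_len)):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Every link that touches the target span must originate
            # inside the source span.
            consistent = all(s_start <= s <= s_end
                             for (s, t) in alignment
                             if t_start <= t <= t_end)
            if consistent:
                pairs.append(((s_start, s_end), (t_start, t_end)))
    return pairs


# Toy example (hypothetical): three source words, with the last two
# target words in swapped order.
toy_alignment = {(0, 0), (1, 2), (2, 1)}
toy_pairs = extract_phrase_pairs(3, toy_alignment)
```

In this toy case the source span (0, 1) is rejected, because its target span would also cover a word linked to source position 2. Because the heuristic depends entirely on the single best word alignment, any alignment error in a noisy comparable sentence pair can block or distort the extracted phrase pairs.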
Notes
1. Arabic Gigaword Fourth Edition (LDC2009T30).
2. English Gigaword Fourth Edition (LDC2009T13).
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Hewavitharana, S., Vogel, S. (2013). Extracting Parallel Phrases from Comparable Data. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_10
DOI: https://doi.org/10.1007/978-3-642-20128-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science; Computer Science (R0)