Abstract
This paper explores efficient domain adaptation for statistical machine translation by extracting sentence pairs (pseudo in-domain subcorpora that are most relevant to the in-domain corpora) from a large-scale general-domain bilingual web corpus. These sentences are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, our phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole rather than the selection of single words in isolation. The pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system that outperforms the system trained on the entire corpus by 1.6 BLEU points. Performance improves further when we combine the pseudo in-domain corpus/model with the true in-domain corpus/model, with increases of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.
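To illustrate the contrast the abstract draws between bag-of-words and phrase-level selection, the following is a minimal sketch and not the selection model proposed in the paper. It assumes a simple n-gram overlap score on the source side only, whereas the paper selects bilingual sentence pairs. The function names (`ngrams`, `build_phrase_table`, `relevance`, `select`), the `max_n` parameter, and the toy sentences are all hypothetical. Setting `max_n=1` behaves like a bag-of-words matcher; a larger `max_n` rewards sentences that share whole word sequences with the in-domain data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_phrase_table(corpus: Iterable[str], max_n: int) -> Counter:
    """Count every n-gram (n = 1..max_n) seen in the in-domain corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


def relevance(sentence: str, table: Counter, max_n: int) -> float:
    """Fraction of the sentence's n-grams (n = 1..max_n) found in the table.

    This overlap ratio is an illustrative stand-in, not the paper's score.
    """
    tokens = sentence.lower().split()
    grams = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in table) / len(grams)


def select(general: List[str], in_domain: List[str], max_n: int, k: int) -> List[str]:
    """Return the k general-domain sentences most similar to the in-domain data."""
    table = build_phrase_table(in_domain, max_n)
    ranked = sorted(general, key=lambda s: relevance(s, table, max_n), reverse=True)
    return ranked[:k]


# Toy data (invented for illustration only).
in_domain = ["where is the train station", "how much is the ticket"]
general = [
    "the station is where the train is",
    "where is the nearest train station",
    "the parliament adopted the resolution",
]

top_unigram = select(general, in_domain, max_n=1, k=1)
top_phrase = select(general, in_domain, max_n=3, k=1)
```

On this toy data, the unigram setting ranks the scrambled first sentence highest because every one of its words occurs in the in-domain data, while the phrase setting prefers the second sentence, which shares contiguous sequences such as "where is the" and "train station".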
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lu, S., Peng, X., Chen, Z., Xu, B. (2013). Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation. In: Zhou, G., Li, J., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2013. Communications in Computer and Information Science, vol 400. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41644-6_12
DOI: https://doi.org/10.1007/978-3-642-41644-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41643-9
Online ISBN: 978-3-642-41644-6
eBook Packages: Computer Science (R0)