Abstract
This paper focuses on the insensitivity of existing word alignment models to domain differences, which often yields suboptimal results on large heterogeneous data. We propose a novel latent-domain word alignment model that induces domain-focused lexical and alignment statistics. We train the model on a heterogeneous corpus under partial supervision, using a small number of seed samples from different domains. The seed samples allow the model to estimate sharper, domain-focused word alignment statistics for sentence pairs. Our experiments show that the derived domain-focused statistics, once combined, produce significant improvements in both word alignment accuracy and the translation accuracy of the resulting SMT systems. Going beyond these findings, we surmise that virtually any large corpus (e.g., Europarl, Hansards, Common Crawl) harbors a diversity of hidden domains, unknown in advance. We address the novel challenge of unsupervised induction of hidden domains in parallel corpora, applied within a domain-focused word-alignment modeling framework. On the technical side, we contrast flat estimation for the unsupervised induction of domains with a simple form of hierarchical estimation, consisting of two steps designed to avoid poor local maxima. Extensive experiments over seven language pairs, with fully unsupervised induction of domains for word alignment, demonstrate significant improvements in alignment accuracy.
Notes
Although our work focuses on the HMM-based alignment model, the approach can also be straightforwardly applied to fertility-based alignment models (Brown et al. 1993).
We explicitly model distances in the range \(\pm \,5\) in this work.
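The clamping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and parameter names (`jump_bucket`, `max_jump`) are our own.

```python
def jump_bucket(prev_pos, cur_pos, max_jump=5):
    """Map an HMM alignment jump to one of 2*max_jump+1 explicit buckets.

    Jumps outside [-max_jump, +max_jump] are clamped to the nearest
    boundary bucket, so the transition table stays small regardless of
    sentence length.
    """
    jump = cur_pos - prev_pos
    return max(-max_jump, min(max_jump, jump))
```

Clamping keeps the number of jump parameters fixed at 11 for \(\pm 5\), which makes the domain-conditioned transition statistics less sparse.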
\(P(z|\ \mathbf f ,\ \mathbf e )\) can also be heuristically computed using a symmetrized strategy: \(P(z|\ \mathbf f ,\ \mathbf e ) \propto P(z)\big({{P(\mathbf f |\ \mathbf e ,\ z)P(\mathbf e | z)}} + {{P(\mathbf e |\ \mathbf f ,\ z)P(\mathbf f |z)}}\big).\) However, we found that this strategy does not yield any significant improvement in final alignment accuracy.
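The symmetrized posterior in this footnote can be sketched as below. This is an illustrative computation under our own naming conventions (the per-domain log-probability lists are hypothetical inputs), not code from the paper; the component models would in practice come from the directional alignment models.

```python
import math

def domain_posterior(log_pz, log_pf_given_ez, log_pe_given_z,
                     log_pe_given_fz, log_pf_given_z):
    """Symmetrized domain posterior, normalized over domains:

        P(z|f,e) ∝ P(z) * (P(f|e,z)P(e|z) + P(e|f,z)P(f|z))

    Each argument is a list with one log-probability per domain z.
    """
    scores = []
    for pz, fe, ez, ef, fz in zip(log_pz, log_pf_given_ez, log_pe_given_z,
                                  log_pe_given_fz, log_pf_given_z):
        # Sum the two directional scores in probability space, weight by P(z).
        scores.append(math.exp(pz) * (math.exp(fe + ez) + math.exp(ef + fz)))
    total = sum(scores)
    return [s / total for s in scores]
```

As the footnote notes, this symmetrization did not significantly change alignment accuracy compared with the one-directional posterior.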
During initialization, we treat the pool of remaining sentence pairs in the heterogeneous data as an exemplifying sample of the out-domain.
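The initialization split can be sketched as follows, assuming seed samples are identified by their indices in the corpus; `init_domain_pools` and its signature are illustrative names, not from the paper.

```python
def init_domain_pools(corpus, seed_ids):
    """Split a heterogeneous corpus into an in-domain seed pool and the
    remaining sentence pairs, which serve as an exemplifying sample of
    the out-domain during initialization."""
    seed_ids = set(seed_ids)
    seed = [pair for i, pair in enumerate(corpus) if i in seed_ids]
    rest = [pair for i, pair in enumerate(corpus) if i not in seed_ids]
    return seed, rest
```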
Naturally, the data, as any complex and large dataset, contains a wide variety of hidden sub-domains, yet they are not specified in advance. This motivates us to induce these domains automatically. In principle, we could induce domains without reference to the alignment problem and then use the latent domain variable within alignment models. However, we believe that this would not be an optimal choice as such domains are induced to capture phenomena potentially irrelevant to the word alignment problem (e.g., monolingual co-occurrence information).
The corpus, available at http://www.isi.edu/natural-language/download/hansard/index.html, consists of 1.1M sentence pairs. After removing duplicate sentences, we kept 808.39K sentence pairs as the training data.
The corpus is available at http://www.statmt.org/europarl.
Similarly, the original corpus (which contains duplicate sentences) consists of 1.0M sentence pairs and is available at http://optima.jrc.it/Acquis/JRC-Acquis.3.0/alignments/index.html.
We train interpolated 3-gram latent-domain LMs with expected Kneser–Ney smoothing in our experiments.
Other choices of the hyperparameter were also tried, but we did not observe significant differences in model performance.
References
Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: EMNLP
Beal MJ (2003) Variational algorithms for approximate Bayesian inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College, London
Bojar O, Buck C, Callison-Burch C, Federmann C, Haddow B, Koehn P, Monz C, Post M, Soricut R, Specia L (2013) Findings of the 2013 workshop on statistical machine translation. In: WMT
Bojar O, Chatterjee R, Federmann C, Haddow B, Huck M, Hokamp C, Koehn P, Logacheva V, Monz C, Negri M, Post M, Scarton C, Specia L, Turchi M (2015) Findings of the 2015 workshop on statistical machine translation. In: WMT
Brown PF, Pietra VJD, Pietra SAD, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19:263–311
Carpuat M, Goutte C, Foster G (2014) Linear mixture models for robust machine translation. In: WMT
Chang YW, Rush AM, DeNero J, Collins M (2014) A constrained Viterbi relaxation for bidirectional word alignment. In: ACL
Cherry C, Foster G (2012) Batch tuning strategies for statistical machine translation. In: NAACL HLT
Clark JH, Dyer C, Lavie A, Smith NA (2011) Better hypothesis testing for statistical machine translation: controlling for optimizer instability. In: ACL HLT (short papers)
Cuong H, Sima’an K (2014a) Latent domain phrase-based models for adaptation. In: EMNLP
Cuong H, Sima’an K (2014b) Latent domain translation models in mix-of-domains haystack. In: COLING
Cuong H, Sima’an K, Titov I (2016) Adapting to all domains at once: rewarding domain invariance in SMT. TACL. https://transacl.org/ojs/index.php/tacl/article/view/768/176
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38
Denkowski M, Lavie A (2011) METEOR 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. In: WMT
Devlin J, Zbib R, Huang Z, Lamar T, Schwartz R, Makhoul J (2014) Fast and robust neural network joint models for statistical machine translation. In: ACL
Duh K, Sudoh K, Tsukada H (2010) Analysis of translation model adaptation in statistical machine translation. In: IWSLT
Farajian MA, Bertoldi N, Federico M (2014) Online word alignment for online adaptive machine translation. In: Proceedings of the EACL 2014 workshop on humans and computer-assisted translation
Fraser A, Marcu D (2006) Semi-supervised training for statistical word alignment. In: COLING-ACL
Galley M, Manning CD (2008) A simple and effective hierarchical phrase reordering model. In: EMNLP
Gao Q, Bach N, Vogel S (2010) A semi-supervised word alignment algorithm with partial manual alignments. In: WMT
Gao Q, Lewis W, Quirk C, Hwang MY (2011) Incremental training and intentional over-fitting of word alignment. In: MT Summit
Gao Q, Vogel S (2010) Consensus versus expertise: a case study of word alignment with mechanical turk. In: NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk
Graça JV, Ganchev K, Taskar B (2010) Learning tractable word alignment models with complex constraints. Comput Linguist 36(3):481–504. https://doi.org/10.1162/coli_a_00007
Graca J, Pardal JP, Coheur L, Caseiro D (2008) Building a golden collection of parallel multi-language word alignment. In: LREC
Holmqvist M, Ahrenberg L (2011) A gold standard for English–Swedish word alignment. In: Proceedings of the 18th Nordic conference of computational linguistics NODALIDA 2011, vol 11
Wu H, Wang H, Liu Z (2005) Alignment model adaptation for domain-specific word alignment. In: ACL
Huck M, Peitz S, Freitag M, Nuhn M, Ney H (2012) The RWTH Aachen machine translation system for WMT 2012. In: WMT
Kirchhoff K, Bilmes J (2014) Submodularity for data selection in machine translation. In: EMNLP
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) MOSES: open source toolkit for statistical machine translation. In: ACL on interactive poster and demonstration sessions
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: NAACL HLT
Liang P, Taskar B, Klein D (2006) Alignment by agreement. In: HLT-NAACL
Liu C, Liu Y, Sun M, Luan H, Yu H (2015) Generalized agreement for bidirectional word alignment. In: Proceedings of the EMNLP
Mansour Y, Mohri M, Rostamizadeh A (2009a) Domain adaptation with multiple sources. In: Proceedings of NIPS
Mansour Y, Mohri M, Rostamizadeh A (2009b) Multiple source adaptation and the Rényi divergence. In: Proceedings of UAI
Mihalcea R, Pedersen T (2003) An evaluation exercise for word alignment. In: Proceedings of the HLT-NAACL 2003 workshop on building and using parallel texts: data driven machine translation and beyond, vol 3
Och FJ, Gildea D, Khudanpur S, Sarkar A, Yamada K, Fraser A, Kumar S, Shen L, Smith D, Eng K, Jain V, Jin Z, Radev D (2004) A smorgasbord of features for statistical machine translation. In: HLT-NAACL
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51. https://doi.org/10.1162/089120103321337421
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL
Riley D, Gildea D (2012) Improving the IBM alignment models using variational Bayes. In: Proceedings of ACL (short paper)
Shah K, Barrault L, Schwenk H (2010) Translation model adaptation by resampling. In: WMT
Shen S, Liu Y, Sun M, Luan H (2015) Consistency-aware search for word alignment. In: Proceedings of the EMNLP
Simion A, Collins M, Stein C (2013) A convex alternative to IBM model 2. In: EMNLP
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA
Steinberger R, Eisele A, Klocek S, Pilos S, Schlüter P (2012) DGT-TM: a freely available translation memory in 22 languages. In: LREC
Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: LREC
Tam YC, Lane I, Schultz T (2007) Bilingual LSA-based adaptation for statistical machine translation. Mach Transl 21(4):187–207. https://doi.org/10.1007/s10590-008-9045-2
Tamura A, Watanabe T, Sumita E (2014) Recurrent neural networks for word alignment model. In: ACL
Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: COLING, pp 836–841. http://dblp.uni-trier.de/db/conf/coling/coling1996.html#VogelNT96
Wang X, Utiyama M, Finch A, Watanabe T, Sumita E (2015) Leave-one-out word alignment without garbage collector effects. In: Proceedings of the EMNLP
Zhang H, Chiang D (2014) Kneser–Ney smoothing on expected counts. In: Proceedings of ACL
Zhao B, Xing EP (2008) HM-BiTAM: bilingual topic exploration, word alignment, and translation. In: NIPS
Acknowledgements
We thank the anonymous reviewers and Ivan Titov for their input. The second author is supported by VICI Grant Nr 277-89-002 from the Netherlands Organization for Scientific Research (NWO).
Cite this article
Cuong, H., Sima’an, K. Induction of latent domains in heterogeneous corpora: a case study of word alignment. Machine Translation 31, 225–249 (2017). https://doi.org/10.1007/s10590-018-9215-9