Abstract
Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance test-driven comparison of n-gram-based automatic MT evaluation metrics. Statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.
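As a rough illustration of the kind of bootstrap estimate discussed above (segment-level resampling with replacement and percentile confidence intervals, in the spirit of Efron and Tibshirani 1993 and Koehn 2004), the following minimal Python sketch shows how a confidence interval for a corpus-level metric score might be computed. The function name bootstrap_ci and the generic corpus_metric callback are illustrative assumptions for this sketch, not the authors' implementation.

import random
from typing import Callable, List, Sequence, Tuple

def bootstrap_ci(
    hypotheses: Sequence[str],
    references: Sequence[str],
    corpus_metric: Callable[[List[str], List[str]], float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 12345,
) -> Tuple[float, float, float]:
    """Percentile-bootstrap confidence interval for a corpus-level MT metric.

    Resamples test-set segments with replacement, recomputes the metric on
    each pseudo test set, and reads the (alpha/2, 1 - alpha/2) percentiles
    off the resulting score distribution.
    """
    assert len(hypotheses) == len(references)
    rng = random.Random(seed)
    n = len(hypotheses)
    point_estimate = corpus_metric(list(hypotheses), list(references))

    scores = []
    for _ in range(n_resamples):
        # Draw a pseudo test set of the same size by sampling segment indices
        # with replacement, then re-score the whole resampled corpus.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(corpus_metric([hypotheses[i] for i in idx],
                                    [references[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return point_estimate, lower, upper

# Hypothetical usage: plug in any corpus-level scorer (BLEU, NIST, METEOR wrapper):
#   score, low, high = bootstrap_ci(hyps, refs, my_corpus_bleu)

Any corpus-level scorer can be supplied as corpus_metric; the width of the resulting interval then indicates how much of an observed score difference between two systems could be due to test-set sampling noise alone, which is the reliability question the paper investigates.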
References
Amigó E, Gonzalo J, Peñas A, Verdejo F (2005) QARLA: a framework for the evaluation of text summarization systems. In: ACL ’05: Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, Morristown, NJ, USA, pp 280–289
Amigó E, Giménez J, Gonzalo J, Màrquez L (2006) MT evaluation: human-like vs. human acceptable. In: Proceedings of the COLING/ACL on main conference poster sessions. Association for Computational Linguistics, Morristown, NJ, USA, pp 17–24
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. Association for Computational Linguistics, Ann Arbor, Michigan, pp 65–72
Bisani M, Ney H (2004) Bootstrap estimates for confidence intervals in ASR performance evaluation. In: Proceedings of the 2004 IEEE international conference on acoustics, speech, and signal processing (ICASSP 2004). Montreal, Quebec, Canada
Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of Bleu in machine translation research. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics: EACL 2006. Trento, Italy, pp 249–256
Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2007) (Meta-) evaluation of machine translation. In: StatMT ’07: Proceedings of the second workshop on statistical machine translation. Association for Computational Linguistics, Morristown, NJ, USA, pp 136–158
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman & Hall, Boca Raton
Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Proceedings of EMNLP 2004. Barcelona, Spain
Leusch G, Ueffing N, Ney H (2003) A novel string-to-string distance measure with applications to machine translation evaluation. In: Proceedings of MT Summit IX. New Orleans, LA
Lin C-Y, Och FJ (2004) ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In: COLING ’04: Proceedings of the 20th international conference on computational linguistics. Association for Computational Linguistics, Morristown, NJ, USA, p 501
Liu D, Gildea D (2005) Syntactic features for evaluation of machine translation. In: ACL 2005 workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization
Nießen S, Vogel S, Ney H, Tillmann C (1998) A DP based search algorithm for statistical machine translation. In: Proceedings of the 17th international conference on computational linguistics. Association for Computational Linguistics, Morristown, NJ, USA, pp 960–967
NIST (2003) Automatic evaluation of machine translation quality using N-gram co-occurrence statistics. Technical report, NIST, http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
Owczarzak K, van Genabith J, Way A (2007) Evaluating machine translation with LFG dependencies. Mach Transl 21(2): 95–119
Padó S, Galley M, Jurafsky D, Manning CD (2009) Robust machine translation evaluation with entailment features. In: Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, pp 297–305
Papineni K, Roukos S, Ward T, Zhu W (2001) Bleu: a method for automatic evaluation of machine translation. Technical Report RC22176(W0109-022), IBM Research Division, Thomas J. Watson Research Center
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of the Association for Machine Translation in the Americas (AMTA 2006), pp 223–231
Zhang Y (2008) Structured language model for statistical machine translation. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA
Zhang Y, Vogel S, Waibel A (2004) Interpreting Bleu/NIST scores: how much improvement do we need to have a better system? In: Proceedings of the 4th international conference on language resources and evaluation. Lisbon, Portugal
Cite this article
Zhang, Y., Vogel, S. Significance tests of automatic machine translation evaluation metrics. Machine Translation 24, 51–65 (2010). https://doi.org/10.1007/s10590-010-9073-6