DOI: 10.3115/1218955.1218978

Statistical machine translation with word- and sentence-aligned parallel corpora

Published: 21 July 2004

Abstract

The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating word-level alignments into the parameter estimation of the IBM models reduces alignment error rate and increases the Bleu score when compared to training the same models only on sentence-aligned data. On the Verbmobil data set, we attain a 38% reduction in the alignment error rate and a higher Bleu score with half as many training examples. We discuss how varying the ratio of word-aligned to sentence-aligned data affects the expected performance gain.
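
The idea of mixing the two kinds of training data can be illustrated with a minimal sketch. It assumes IBM Model 1 only, omits the NULL word, and combines the two data sources with a single hypothetical weight `lam`; the function name, the weighting scheme, and the toy corpus are illustrative assumptions, not the procedure or data used in the paper. Sentence-aligned pairs contribute expected (fractional) counts in the E-step, while word-aligned pairs contribute their observed links directly.

```python
from collections import defaultdict


def train_model1(sent_pairs, word_aligned, lam=0.5, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with a Model 1 style EM loop.

    sent_pairs:   list of (src_tokens, tgt_tokens), alignment unknown
    word_aligned: list of (src_tokens, tgt_tokens, links), where links is a
                  set of (src_index, tgt_index) pairs marked by an annotator
    lam:          assumed weight in [0, 1] given to the word-aligned counts
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    # Start from a uniform table over pairs that co-occur in a sentence pair.
    t = {(f, e): 1.0 for src, tgt in sent_pairs for f in tgt for e in src}

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)

        # E-step on sentence-aligned data: fractional counts from posteriors.
        for src, tgt in sent_pairs:
            for f in tgt:
                z = sum(t.get((f, e), 0.0) for e in src)
                if z == 0.0:
                    continue
                for e in src:
                    c = (1.0 - lam) * t.get((f, e), 0.0) / z
                    count[(f, e)] += c
                    total[e] += c

        # Word-aligned data: the links are observed, so count them directly.
        for src, tgt, links in word_aligned:
            for i, j in links:
                count[(tgt[j], src[i])] += lam
                total[src[i]] += lam

        # M-step: renormalise the counts for each source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items() if total[e] > 0.0}
    return t


# Toy corpus (invented for illustration).
sent_pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
word_aligned = [(["das", "haus"], ["the", "house"], {(0, 0), (1, 1)})]
table = train_model1(sent_pairs, word_aligned, lam=0.5)
```

Setting `lam` to 0 in this sketch reduces it to ordinary EM on the sentence-aligned data alone, which gives a simple way to compare training with and without the word-level annotations.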



Published In

ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
July 2004
729 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%

Cited By

  • (2015) Toward a semantic stability index (SSI) via a preliminary exploration of translation looping. Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, pp. 1-4. DOI: 10.5555/2857070.2857143. Online publication date: 6-Nov-2015.
  • (2015) Hybrid Approach for Inductive Semi Supervised Learning Using Label Propagation and Support Vector Machine. Proceedings of the 11th International Conference on Machine Learning and Data Mining in Pattern Recognition - Volume 9166, pp. 199-213. DOI: 10.1007/978-3-319-21024-7_14. Online publication date: 20-Jul-2015.
  • (2012) Refining lexical translation training scheme for improving the quality of statistical phrase-based translation. Proceedings of the 3rd Symposium on Information and Communication Technology, pp. 55-62. DOI: 10.1145/2350716.2350727. Online publication date: 23-Aug-2012.
  • (2010) EMDC. Proceedings of the 23rd International Conference on Computational Linguistics, pp. 349-357. DOI: 10.5555/1873781.1873821. Online publication date: 23-Aug-2010.
  • (2010) Improving word alignment by semi-supervised ensemble. Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pp. 135-143. DOI: 10.5555/1870568.1870585. Online publication date: 15-Jul-2010.
  • (2010) A semi-supervised word alignment algorithm with partial manual alignments. Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pp. 1-10. DOI: 10.5555/1868850.1868851. Online publication date: 15-Jul-2010.
  • (2010) Consensus versus expertise. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 30-34. DOI: 10.5555/1866696.1866700. Online publication date: 6-Jun-2010.
  • (2010) Active semi-supervised learning for improving word alignment. Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, pp. 10-17. DOI: 10.5555/1860625.1860627. Online publication date: 6-Jun-2010.
  • (2010) Active learning-based elicitation for semi-supervised word alignment. Proceedings of the ACL 2010 Conference Short Papers, pp. 365-370. DOI: 10.5555/1858842.1858909. Online publication date: 11-Jul-2010.
  • (2009) Improving alignment for SMT by reordering and augmenting the training corpus. Proceedings of the Fourth Workshop on Statistical Machine Translation, pp. 120-124. DOI: 10.5555/1626431.1626456. Online publication date: 30-Mar-2009.
