Detecting Machine-Translated Paragraphs by Matching Similar Words

Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano & Shinsaku Kiyomoto

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13451)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing


Abstract

Machine-translated text plays an important role in modern life by smoothing communication among communities that use different languages. However, unnatural translation may lead to misunderstanding, so a detector is needed to avoid such mistakes. One previous method measured the naturalness of continuous words using an N-gram language model; another matched noncontinuous words across sentences but ignored such words within an individual sentence. We have developed a method that matches similar words throughout the paragraph and estimates paragraph-level coherence in order to identify machine-translated text. An experiment on 2000 English human-generated paragraphs and 2000 English paragraphs machine-translated from German shows that the coherence-based method achieves high performance (accuracy = 87.0%; equal error rate = 13.0%). It is substantially better than previous methods (best accuracy = 72.4%; equal error rate = 29.7%). Similar experiments on Dutch and Japanese obtain 89.2% and 97.9% accuracy, respectively. The results demonstrate that the proposed method remains effective across languages with different resource levels.
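The abstract does not spell out how words are matched or how coherence is scored, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes each word has an embedding vector, matches every word to its most similar other word anywhere in the paragraph using cosine similarity, and summarizes those best-match similarities as a mean and a standard deviation. The function names, the toy two-dimensional vectors, and the choice of summary statistics are all hypothetical; features of this kind would then be passed to a classifier.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coherence_features(words, vectors):
    """Return (mean, std) of each word's best-match similarity in the paragraph.

    Hypothetical feature set: every word with a known vector is matched to its
    most similar *other* word anywhere in the paragraph, within or across
    sentences. Words without a vector are skipped.
    """
    known = [w for w in words if w in vectors]
    best = []
    for i, word in enumerate(known):
        sims = [cosine(vectors[word], vectors[other])
                for j, other in enumerate(known) if j != i]
        if sims:
            best.append(max(sims))
    if not best:
        return 0.0, 0.0
    mean = sum(best) / len(best)
    variance = sum((s - mean) ** 2 for s in best) / len(best)
    return mean, math.sqrt(variance)

# Toy 2-D embeddings, invented purely for illustration.
TOY_VECTORS = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.1),
    "bank": (0.0, 1.0),
    "money": (0.1, 0.9),
}

if __name__ == "__main__":
    paragraph = ["cat", "dog", "bank", "money", "unknownword"]
    print(coherence_features(paragraph, TOY_VECTORS))
```

In practice the toy dictionary would be replaced by pretrained embeddings such as GloVe, which appears in the chapter's reference list, although the abstract does not say how those embeddings are used.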

Author information

Authors and Affiliations

KDDI Research Inc., Saitama, Japan
Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano & Shinsaku Kiyomoto

Corresponding author

Correspondence to Hoang-Quoc Nguyen-Son.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Nguyen-Son, HQ., Thao, T.P., Hidano, S., Kiyomoto, S. (2023). Detecting Machine-Translated Paragraphs by Matching Similar Words. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_36

DOI: https://doi.org/10.1007/978-3-031-24337-0_36
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science, Computer Science (R0)
