Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations

Shiva Taslimipoor¹⁴,
Ruslan Mitkov¹⁴,
Gloria Corpas Pastor¹⁵ &
…
Afsaneh Fazly¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9624))

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

1266 Accesses

Abstract

Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

New Areas of Application of Comparable Corpora

Research on Cross-Language Retrieval Using Bilingual Word Vectors in Different Languages

Notes

1.
We use the lexicon built by applying GIZA++ on the Spanish–English portion of the Europarl.
2.
The software is available in the websites of the authors of [13].
3.
We set the frequency threshold to 10 in our experiments.
4.
http://www.abc.es and http://www.abc.net.au.
5.
http://es.noticias.yahoo.com and http://uk.news.yahoo.com.
6.
http://cnnespanol.cnn.com and http://cnn.com.
7.
http://www.sport.es/es and http://www.sport-english.com/en.
8.
http://es.euronews.com and http://euronews.net.
9.
http://www.accurat-project.eu.
10.
The comparable corpora that we prepared is available on https://github.com/shivaat/EnEsCC.
11.
Note that we add noise in both Spanish–English and English–Spanish directions.

References

Aker, A., Kanoulas, E., Gaizauskas, R.: A light way to collect comparable corpora from the web. In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC 2012) (2012)
Google Scholar
Bannard, C.: A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In: Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pp. 1–8. Association for Computational Linguistics (2007)
Google Scholar
Bouamor, D., Semmar, N., Zweigenbaum, P.: Identifying bilingual multi-word expressions for statistical machine translation. In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey. European Language Resources Association (ELRA) (2012)
Google Scholar
Bouamor, D., Semmar, N., Zweigenbaum, P.: Context vector disambiguation for bilingual lexicon extraction from comparable corpora. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, Short Papers, vol. 2, pp. 759–764. Association for Computational Linguistics (2013)
Google Scholar
Pastor, G.C.: Collocations in e-bilingual dictionaries: from underlying theoretical assumptions to practical lexicography and translation issues. In: Torner, S., Bernal, E. (eds.) Collocations and Other Lexical Combinations in Spanish: Theoretical and Applied Approaches, pp. 173–199. Routledge, Abingdon (2017)
Google Scholar
Evert, S.: The statistics of word cooccurrences : word pairs and collocations. Ph.D. thesis, Universität Stuttgart, Holzgartenstr. 16, 70174 Stuttgart (2005)
Google Scholar
Fazly, A.: Automatic acquisition of lexical knowledge about multiword predicates. Ph.D. thesis, Department of Computer Science, University of Toronto (2007)
Google Scholar
Fung, P.: A statistical view on bilingual lexicon extraction: from parallel corpora to non-parallel corpora. In: Farwell, D., Gerber, L., Hovy, E. (eds.) AMTA 1998. LNCS (LNAI), vol. 1529, pp. 1–17. Springer, Heidelberg (1998). https://doi.org/10.1007/3-540-49478-2_1
Chapter Google Scholar
Fung, P., McKeown, K.: Finding terminology translations from non-parallel corpora. In: Proceedings of the 5th Annual Workshop on Very Large Corpora, pp. 192–202 (1997)
Google Scholar
Ion, R.: PEXACC: a parallel sentence mining algorithm from comparable corpora. In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC 2012) (2012)
Google Scholar
Ismail, A., Manandhar, S.: Bilingual lexicon extraction from comparable corpora using in-domain terms. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 481–489. Association for Computational Linguistics (2010)
Google Scholar
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Conference Proceedings: The Tenth Machine Translation Summit, Phuket, Thailand, pp. 79–86 (2005)
Google Scholar
Levy, O., Goldberg, Y.: Dependency-based word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, Short Papers, vol. 2, pp. 302–308. Association for Computational Linguistics (2014)
Google Scholar
McEnery, A., Xiao, R.: Parallel and comparable corpora: what is happening. In: Incorporating Corpora: The Linguist and the Translator, pp. 18–31 (2007)
Google Scholar
Mendoza Rivera, O., Mitkov, R., Corpas Pastor, G.: A flexible framework for collocation retrieval and translation from parallel and comparable corpora. In: Workshop on Multi-word Units in Machine Translation and Translation Technology (2013)
Google Scholar
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Google Scholar
Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)
Article Google Scholar
Pal, S., Pakray, P., Naskar, S.K.: Automatic building and using parallel resources for SMT from comparable corpora. In: Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) @ EACL, pp. 48–57 (2014)
Google Scholar
Pecina, P.: An extensive empirical study of collocation extraction methods. In: Proceedings of the ACL Student Research Workshop, ACLstudent 2005, Stroudsburg, PA, USA, pp. 13–18. Association for Computational Linguistics (2005)
Google Scholar
Pekar, V., Mitkov, R., Blagoev, D., Mulloni, A.: Finding translations for low-frequency words in comparable corpora. Mach. Transl. 20(4), 247–266 (2006)
Article Google Scholar
Rapp, R.: Automatic identification of word translations from unrelated English and German corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 519–526. Association for Computational Linguistics (1999)
Google Scholar
Rapp, R., Sharoff, S.: Extracting multiword translations from aligned comparable documents. In: Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) @ EACL 2014, Gothenburg, Sweden, pp. 83–91 (2014)
Google Scholar
Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45715-1_1
Chapter Google Scholar
Smadja, F.: Retrieving collocations from text: Xtract. Comput. Linguist. 19, 143–177 (1993)
Google Scholar
Smith, J.R., Quirk, C., Toutanova, K.: Extracting parallel sentences from comparable corpora using document level alignment. In: Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pp. 403–411 (2010)
Google Scholar
Su, F., Babych, B.: Measuring comparability of documents in non-parallel corpora for efficient extraction of (semi-)parallel translation equivalents. In: Proceedings of the Joint Workshop on Exploiting Synergies Between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), EACL 2012, Stroudsburg, PA, USA, pp. 10–19. Association for Computational Linguistics (2012)
Google Scholar
Tiedemann, J.: Extraction of translation equivalents from parallel corpora. In: Proceedings of the 11th Nordic Conference on Computational Linguistics, pp. 120–128 (1998)
Google Scholar

Download references

Acknowledgments

This work has been partially supported by the LATEST (Ref: 327197-FP7-PEOPLE-2012-IEF) project. The authors would like to express their gratitude to Anna de Santis and Lorena Gomez for their annotation work.

Author information

Authors and Affiliations

Research Group in Computational Linguistics, University of Wolverhampton, Wolverhampton, UK
Shiva Taslimipoor & Ruslan Mitkov
Univeristy of Malaga, Malaga, Spain
Gloria Corpas Pastor
VerticalScope Inc., Toronto, Canada
Afsaneh Fazly

Authors

Shiva Taslimipoor
View author publications
You can also search for this author in PubMed Google Scholar
Ruslan Mitkov
View author publications
You can also search for this author in PubMed Google Scholar
Gloria Corpas Pastor
View author publications
You can also search for this author in PubMed Google Scholar
Afsaneh Fazly
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shiva Taslimipoor .

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Taslimipoor, S., Mitkov, R., Corpas Pastor, G., Fazly, A. (2018). Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science(), vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_10

Download citation

DOI: https://doi.org/10.1007/978-3-319-75487-1_10
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics