Appendices

Ahmet Aker¹⁰,
Radu Ion¹¹,
Nikos Mastropavlos¹²,
Monica Paramita¹⁰,
Mārcis Pinnis¹³,
Dan Ştefănescu¹¹,
Fangzhong Su¹⁴,
Gregor Thurmair¹⁵,
Elena Irimia¹¹,
Nikola Ljubešić¹⁶,
Evangelos Kanoulas¹⁰,
Judita Preiss¹⁰,
Rob Gaizauskas¹⁰,
Paul Clough¹⁰,
Emma Barker¹⁰,
Nikos Glaros¹²,
Tiberiu Boroș¹¹,
Inguna Skadiņa¹³ &
Andrejs Vasiļjevs¹³

Part of the book series: Theory and Applications of Natural Language Processing (NLP)


Abstract

The tools developed in the ACCURAT project and presented in this book are packaged as the ACCURAT toolkit (Pinnis et al. 2012a), a collection of tools for collecting comparable corpora and for analysing and extracting parallel data. The ACCURAT toolkit produces


Notes

1. http://www.accurat-project.eu/
2. Although our methods for building comparable corpora from the web may not be directly applicable, it is straightforward to adapt and apply them to digital archives or other very large off-line textual data collections.
3. The EU's multilingual thesaurus, http://eurovoc.europa.eu/
4. http://htmlparser.sourceforge.net/
5. http://www.w3.org/DOM/
6. OpenNLP: http://incubator.apache.org/opennlp/
7. http://www.racai.ro/en/tools/text/
8. The first version of Sisyphus was created by the Belgian METAL team in 1987, in pre-Windows times, to speed up system development. This kind of tool is still needed.
9. Full requirements are defined in the documentation of each tool (ACCURAT D2.6 2012).
10. http://www.accurat-project.eu/index.php?p=toolkit

References

ACCURAT D2.6. (2012). Toolkit for multi-level alignment and information extraction from comparable corpora. http://www.accurat-project.eu
Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. Proceedings of the EACL Workshop on New Text, Trento, Italy.
Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004 (pp. 1313–1316).
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.
Evert, S. (2005). The statistics of word cooccurrences: Word pairs and collocations. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
Ion, R., Ceauşu, A., & Irimia, E. (2011). An expectation maximization algorithm for textual unit alignment. Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC 2011), held at the 49th Annual Meeting of the Association for Computational Linguistics (pp. 128–135), Portland, OR, June 24, 2011. Association for Computational Linguistics. ISBN: 978-1-937284-01-5.
Ion, R. (2012). PEXACC: A parallel data mining algorithm from comparable corpora. Proceedings of LREC 2012, May 21–27, Istanbul, Turkey.
Pecina, P. (2009). Lexical association measures: Collocation extraction. Studies in computational and theoretical linguistics. Prague, Czech Republic: Institute of Formal and Applied Linguistics.
Petrović, S., Šnajder, J., & Bašić, B. D. (2010). Extending lexical association measures for collocation extraction. Computer Speech and Language, 24(2), 383–394.
Pinnis, M. (2012). Latvian and Lithuanian named entity recognition with TildeNER. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.
Pinnis, M., Ion, R., Ştefănescu, D., Su, F., Skadiņa, I., Vasiļjevs, A., et al. (2012a). Toolkit for multi-level alignment and information extraction from comparable corpora. Proceedings of ACL 2012, System Demonstrations Track, Jeju Island, Republic of Korea, July 8–14, 2012.
Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, M., & Gornostay, T. (2012b). Term extraction, tagging, and mapping tools for under-resourced languages. Proceedings of the 10th Conference on Terminology and Knowledge Engineering (TKE 2012), June 20–21, Madrid, Spain.
Skadiņa, I., Aker, A., Giouli, V., Tufiş, D., Gaizauskas, R., Mieriņa, M., et al. (2010). Collection of comparable corpora for under-resourced languages. Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications (Vol. 219, pp. 161–168). IOS Press.
Ştefănescu, D. (2012). Mining for term translations in comparable corpora. Proceedings of the 5th Workshop on Building and Using Comparable Corpora (BUCC 2012), held at the 8th Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May 23–25, 2012.
Ştefănescu, D., Ion, R., & Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy.
Su, F., & Babych, B. (2012a). Development and application of a cross-language document comparability metric. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.
Su, F., & Babych, B. (2012b). Measuring comparability of documents in non-parallel corpora for efficient extraction of (semi-) parallel translation equivalents. Proceedings of EACL’12 Joint Workshop on Exploiting Synergies Between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), Avignon, France.

Author information

Authors and Affiliations

University of Sheffield, Sheffield, UK
Ahmet Aker, Monica Paramita, Evangelos Kanoulas, Judita Preiss, Rob Gaizauskas, Paul Clough & Emma Barker
Romanian Academy, Research Institute for Artificial Intelligence, Bucharest, Romania
Radu Ion, Dan Ştefănescu, Elena Irimia & Tiberiu Boroș
Institute for Language and Speech Processing, Athens, Greece
Nikos Mastropavlos & Nikos Glaros
Tilde, Riga, Latvia
Mārcis Pinnis, Inguna Skadiņa & Andrejs Vasiļjevs
University of Leeds, Leeds, UK
Fangzhong Su
Linguatec, Munich, Germany
Gregor Thurmair
Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić

Corresponding author

Correspondence to Inguna Skadiņa.

Editor information

Editors and Affiliations

Tilde, Riga, Latvia
Inguna Skadiņa
Department of Computer Science, University of Sheffield, Sheffield, UK
Robert Gaizauskas
School of Modern Languages & Cultures, University of Leeds, Leeds, UK
Bogdan Babych
Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Dan Tufiş
Tilde, Riga, Latvia
Andrejs Vasiļjevs

Additional information

Chapter editor: Inguna Skadiņa

About this chapter

Cite this chapter

Aker, A. et al. (2019). Appendices. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_8

DOI: https://doi.org/10.1007/978-3-319-99004-0_8
Published: 07 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)
