Improved Alignment Based Algorithm for Multilingual Text Compression

Ehud S. Conley & Shmuel Tomi Klein

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6638)

Included in the following conference series:

International Conference on Language and Automata Theory and Applications

Abstract

Multilingual text compression exploits the existence of the same text in several languages to compress the second and subsequent copies by reference to the first. This is done based on bilingual text alignment, a mapping of words and phrases in one text to their semantic equivalents in the translation. A new multilingual text compression scheme is suggested, which improves over an immediate generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language text; the compression loss incurred by this overhead is smaller than the savings in the compressed target language texts, provided the number of target texts is large enough. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.
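
The trade-off stated in the abstract can be illustrated with a small break-even calculation. The Python sketch below is not the authors' algorithm. It only assumes, for illustration, that the markup adds a fixed overhead to the compressed source text and yields a constant saving in each compressed target text. The function names and all byte counts are hypothetical.

```python
def net_gain(num_targets: int, source_overhead: int, saving_per_target: int) -> int:
    """Net bytes saved when markup is stored once in the source text.

    Assumption (not from the paper): the overhead is paid once and every
    target text saves the same number of bytes.
    """
    return num_targets * saving_per_target - source_overhead


def break_even_targets(source_overhead: int, saving_per_target: int) -> int:
    """Smallest number of target texts for which the net gain is positive."""
    if saving_per_target <= 0:
        raise ValueError("saving_per_target must be positive")
    # Smallest k with k * saving_per_target > source_overhead.
    return source_overhead // saving_per_target + 1


if __name__ == "__main__":
    # Hypothetical figures: 300 bytes of overhead, 100 bytes saved per target.
    k = break_even_targets(source_overhead=300, saving_per_target=100)
    print("targets needed:", k)
    print("net gain at that point:", net_gain(k, 300, 100))
```

Under this simplified model the scheme only pays off once the number of target texts exceeds the ratio of overhead to per-target saving, which matches the abstract's condition of "a large enough number" of target languages.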

This work was done while the first author was a PhD student at Bar-Ilan University.

Author information

Authors and Affiliations

Jerusalem College of Technology, Jerusalem, Israel
Ehud S. Conley
Bar-Ilan University, Ramat Gan, Israel
Ehud S. Conley & Shmuel Tomi Klein

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Universitat Rovira i Virgili, Avinguda Catalunya, 35, 43002, Tarragona, Spain
Adrian-Horia Dediu & Carlos Martín-Vide
Department of Informatics, Kyushu University, 744 Motooka, 819–0395, Fukuoka, Japan
Shunsuke Inenaga

Cite this paper

Conley, E.S., Klein, S.T. (2011). Improved Alignment Based Algorithm for Multilingual Text Compression. In: Dediu, AH., Inenaga, S., Martín-Vide, C. (eds) Language and Automata Theory and Applications. LATA 2011. Lecture Notes in Computer Science, vol 6638. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21254-3_18

DOI: https://doi.org/10.1007/978-3-642-21254-3_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21253-6
Online ISBN: 978-3-642-21254-3
eBook Packages: Computer Science, Computer Science (R0)
