Abstract
Multilingual text compression exploits the existence of the same text in several languages to compress the second and subsequent copies by reference to the first. This is done on the basis of bilingual text alignment, a mapping of words and phrases in one text to their semantic equivalents in its translation. A new multilingual text compression scheme is suggested, which improves on a straightforward generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language text; the compression loss incurred by this overhead is smaller than the savings in the compressed target language texts, provided the number of target texts is large enough. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.
This work was done while the first author was a PhD student at Bar Ilan University.
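The general idea summarized in the abstract, coding a target text by reference to an aligned source text, can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the one-to-one word alignment, the single-translation lexicon, and the linear cost model in `break_even_targets` are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# An encoded token is either ("ref", source_index) or ("lit", word).
Encoded = List[Tuple[str, object]]


def encode_target(source: List[str], target: List[str],
                  alignment: Dict[int, int],
                  lexicon: Dict[str, str]) -> Encoded:
    """Replace each target word by a pointer into the source text when the
    (hypothetical) lexicon translates the aligned source word to it."""
    encoded: Encoded = []
    for j, word in enumerate(target):
        i = alignment.get(j)  # source position aligned to target position j
        if i is not None and lexicon.get(source[i]) == word:
            encoded.append(("ref", i))      # predictable from the source
        else:
            encoded.append(("lit", word))   # must be stored explicitly
    return encoded


def decode_target(source: List[str], encoded: Encoded,
                  lexicon: Dict[str, str]) -> List[str]:
    """Rebuild the target text from the source text and the encoding."""
    decoded: List[str] = []
    for kind, value in encoded:
        if kind == "ref":
            decoded.append(lexicon[source[value]])
        else:
            decoded.append(value)
    return decoded


def break_even_targets(overhead: int, saving_per_target: int) -> int:
    """Smallest number k of target texts with k * saving > overhead,
    assuming (for illustration) a fixed saving per target text."""
    if saving_per_target <= 0:
        raise ValueError("saving_per_target must be positive")
    return overhead // saving_per_target + 1


# Toy data, invented for this example.
source = ["the", "white", "house"]
target = ["la", "maison", "blanche"]
alignment = {0: 0, 1: 2, 2: 1}  # target index -> source index
lexicon = {"the": "la", "white": "blanche", "house": "maison"}

encoded = encode_target(source, target, alignment, lexicon)
restored = decode_target(source, encoded, lexicon)
```

In this toy case every target word becomes a reference, so the target text carries no literals. The `break_even_targets` helper mirrors the trade-off stated in the abstract under the simplifying assumption of a constant saving per target text: markup stored once in the source pays off only when enough target texts reuse it.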
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Conley, E.S., Klein, S.T. (2011). Improved Alignment Based Algorithm for Multilingual Text Compression. In: Dediu, AH., Inenaga, S., Martín-Vide, C. (eds) Language and Automata Theory and Applications. LATA 2011. Lecture Notes in Computer Science, vol 6638. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21254-3_18
DOI: https://doi.org/10.1007/978-3-642-21254-3_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21253-6
Online ISBN: 978-3-642-21254-3
eBook Packages: Computer Science, Computer Science (R0)