Abstract
Computing the LZ77 factorization is a fundamental task in text compression and indexing, being the size z of this compressed representation closely related to the self-repetitiveness of the text. A long-standing problem is to compute LZ77 using small working space. Considering that \(\mathcal {O}(z)\) words of space can be significantly (up to exponentially) smaller than the size n of the input text, even succinct and entropy-compressed solutions are often unduly memory demanding. In this work we focus on an important measure of text repetitiveness: the number \(r\) of equal-letter runs in the Burrows–Wheeler transform of the reversed input text. As z, the measure \(r\) is closely related to the number of repetitions in the text and can be exponentially smaller than n. We describe two algorithms computing LZ77 in \(\mathcal {O}(r\log n)\) bits of working space and \(\mathcal {O}(n\log r)\) time. Roughly speaking, our algorithms store a constant number of memory words per BWT run to keep track of first-last run-positions and a suitable indexing mechanism to sample the runs of the BWT (instead of its positions). Important consequences of our results include (i) the possibility to convert from RLBWT- to LZ77-based compressed formats without first decompressing the text, and (ii) the existence of asymptotically-optimal construction algorithms for repetition-aware self-indexes based on these compression techniques. We finally describe an implementation of our solutions and present extensive experiments on highly repetitive datasets. Our algorithms use a working space as small as 1% of the dataset size and are two to three orders of magnitude more space-efficient (albeit slower) than existing solutions based, respectively, on entropy compression and suffix arrays.
Similar content being viewed by others
Notes
e.g. by inserting the characters \(c_1^{\lambda _1}c_2^{\lambda _2}\dots c_r^{\lambda _r}\) in the dynamic string structure described in the previous section.
References
Bannai, H., Gawrychowski, P., Inenaga, S., Takeda, M.: Converting SLP to LZ78 in almost Linear Time. In: Proceedings of 24th Annual Symposium on Combinatorial Pattern Matching, CPM 2013, Bad Herrenalb, Germany, June 17–19, 2013, vol. 7922, pp. 38–49. Springer (2013)
Bannai, H., Inenaga, S., Takeda, M.: Efficient LZ78 factorization of grammar compressed text. In: Proceedings of 19th International Symposium on String Processing and Information Retrieval, SPIRE 2012, Cartagena de Indias, Colombia, October 21–25, 2012, vol. 7608, p. 86. Springer (2012)
Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Proceedings of CPM, pp. 26–39 (2015)
Belazzougui, D., Gagie, T., Gawrychowski, P., Kärkkäinen, J., Ordónez, A., Puglisi, S.J., Tabei, Y.: Queries on LZ-Bounded Encodings. arXiv:1412.0967 (2014)
Belazzougui, D., Puglisi, S.J.: Range Predecessor and Lempel–Ziv Parsing. arXiv:1507.07080 (2015)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)
Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inf. 111(3), 313–337 (2011)
Crochemore, M., Ilie, L.: Computing longest previous factor in linear time and applications. Inf. Process. Lett. 106(2), 75–80 (2008)
DYNAMIC: dynamic succinct/compressed data structures library. https://github.com/nicolaprezza/DYNAMIC. Accessed 20 May 2016
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st Annual Symposium on Foundations of Computer Science, 2000. Proceedings, pp. 390–398. IEEE (2000)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: An alphabet-friendly FM-index. In: Proceedings of 11th International Conference on String Processing and Information Retrieval, SPIRE 2004, Padova, Italy, October 5–8, 2004, vol. 3246, p. 150. Springer (2004)
Fici, G.: Factorizations of the fibonacci infinite word. J. Integer Seq. 18(2), 3 (2015)
Fischer, J., Gagie, T., Gawrychowski, P.L., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: ESA 2015 LNCS 9294, p. 533
Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006)
get-git-revisions: Get all revisions of a git repository. https://github.com/nicolaprezza/get-git-revisions. Accessed 20 May 2016
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: 13th International Symposium on Experimental Algorithms, (SEA 2014), pp. 326–337 (2014)
LZ77 factorization algorithms. https://www.cs.helsinki.fi/group/pads/lz77.html. Accessed 20 May 2016
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lightweight Lempel–Ziv parsing. In: Experimental Algorithms, pp. 139–150. Springer (2013)
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: simple, fast, small. In: Proceedings of 24th Annual Symposium on Combinatorial Pattern Matching, CPM 2013, Bad Herrenalb, Germany, June 17–19, 2013, vol. 7922, p. 189. Springer (2013)
Kempa, D., Puglisi, S.J.: Lempel–Ziv factorization: simple, fast, practical. In: Proceedings of the Meeting on Algorithm Engineering & Expermiments, pp. 103–112. Society for Industrial and Applied Mathematics (2013)
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)
Mantaci, S., Restivo, A., Sciortino, M.: Burrows–Wheeler transform and sturmian words. Inf. Process. Lett. 86(5), 241–246 (2003)
Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. SIAM J. Comput. 43(5), 1781–1806 (2014)
Nishimoto, T., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Dynamic Index and LZ Factorization in Compressed Space. In Jan, H., Jan, Ž. (eds.) Proceedings of the Prague Stringology Conference 2016, pp. 158–170. Czech Technical University in Prague, Czech Republic (2016)
Ohlebusch, E., Gog, S.: Lempel–Ziv factorization revisited. In: Proceedings of 22nd Annual Symposium on Combinatorial Pattern Matching, CPM 2011, Palermo, Italy, June 27–29, 2011, vol. 6661, p. 15. Springer (2011)
pizza&chili repetitive corpus. http://pizzachili.dcc.uchile.cl/repcorpus/real/. Accessed 20 May 2016
Policriti, A., Gigante, N., Prezza, N.: Average linear time and compressed space construction of the Burrows–Wheeler transform. In: Language and Automata Theory and Applications. Lecture Notes in Computer Science, vol. 8977, pp. 587–598. Springer International Publishing (2015)
Policriti, A., Prezza, N.: Fast Online Lempel–Ziv Factorization in Compressed Space. In: String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 9309, pp. 13–20. Springer International Publishing (2015). doi:10.1007/978-3-319-23826-5_2
Rytter, W.: Application of Lempel–Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1), 211–222 (2003)
Sirén, J., et al.: Compressed full-text indexes for highly repetitive collections. PhD thesis, Helsingin yliopisto (2012)
Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Proceedings of 15th International Symposium on String Processing and Information Retrieval, SPIRE 2008, Melbourne, Australia, November 10–12, 2008, vol. 5280, p. 164. Springer (2008)
Takabatake, Y., Tabei, Y., Sakamoto, H.: Online self-indexed grammar compression. In Proceedings of SPIRE. Lecture Notes in Computer Science, vol. 9309, pp. 258–269. Springer International Publishing (2015)
Tamakoshi, Y., Tomohiro I., Inenaga, S., Bannai, H., Takeda, M.: From run length encoding to LZ78 and back again. In: Data Compression Conference (DCC), 2013, pp. 143–152. IEEE (2013)
wiki-get: Download all versions of a Wikipedia page. https://github.com/nicolaprezza/wiki_get. Accessed 20 May 2016
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Policriti, A., Prezza, N. LZ77 Computation Based on the Run-Length Encoded BWT. Algorithmica 80, 1986–2011 (2018). https://doi.org/10.1007/s00453-017-0327-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-017-0327-z