Abstract
The Lempel–Ziv 78 (LZ78) factorization is a well-studied technique for data compression. It and its derivates are used in compression formats such as compress or gif. While most research focuses on the factorization of plain data, not much research has been conducted on indexing the data for fast LZ78 factorization. Here, we study the LZ78 factorization in the substring compression model, where we are allowed to index the data and have to return the factorization of a substring specified at query time. In that model, we propose an algorithm that works in CDAWG-compressed space, computing the factorization with a logarithmic slowdown compared to the optimal time complexity.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Arimura, H., Inenaga, S., Kobayashi, Y., Nakashima, Y., Sue, M.: Optimally computing compressed indexing arrays based on the compact directed acyclic word graph. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds.) SPIRE 2023. LNCS, vol. 14240, pp. 28–34. Springer, Heidelberg (2023). https://doi.org/10.1007/978-3-031-43980-3_3
Babenko, M.A., Gawrychowski, P., Kociumaka, T., Starikovskaya, T.: Wavelet trees meet suffix trees. In: Proceedings of SODA, pp. 572–591 (2015)
Bannai, H., Gawrychowski, P., Inenaga, S., Takeda, M.: Converting SLP to LZ78 in almost Linear Time. In: Fischer, J., Sanders, P. (eds.) CPM 2013. LNCS, vol. 7922, pp. 38–49. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38905-4_6
Belazzougui, D., Boldi, P., Vigna, S.: Predecessor search with distance-sensitive query time. ArXiv CoRR arxiv:1209.5441 (2012)
Belazzougui, D., Cunial, F.: Representing the suffix tree with the CDAWG. In: Proceedings of CPM, vol. 78 of LIPIcs, pp. 7:1–7:13 (2017)
Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19929-0_3
Belazzougui, D., Kosolobov, D., Puglisi, S.J., Raman, R.: Weighted ancestors in suffix trees revisited. In: Proceedings of CPM, vol. 191 of LIPIcs, pp. 8:1–8:15 (2021)
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015)
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.I.: The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31–55 (1985)
Cormode, G., Muthukrishnan,: Substring compression problems. In: Proceedings of SODA, pp. 321–330 (2005)
Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). https://doi.org/10.1007/3-540-63220-4_55
Ehrhardt, M., Mulzer, W.: Delta-fast tries: local searches in bounded universes with linear space. In: WADS 2017. LNCS, vol. 10389, pp. 361–372. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62127-2_31
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algor. 13, 1.12:1 – 1.12:31 (2008)
Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: Proceedings of PODS, pp. 181–190 (2008)
Fischer, J., Tomohiro, I., Köppl, D., Sadakane, K.: Lempel–Ziv factorization powered by space efficient suffix trees. Algorithmica 80(7), 2048–2081 (2018)
Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of SODA, pp. 1459–1477 (2018)
Gawrychowski, P., Lewenstein, M., Nicholson, P.K.: Weighted ancestors in suffix trees. In: Schulz, A.S., Wagner, D. (eds.) ESA 2014. LNCS, vol. 8737, pp. 455–466. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44777-2_38
Keller, O., Kopelowitz, T., Feibish, S.L., Lewenstein, M.: Generalized substring compression. Theor. Comput. Sci. 525, 42–54 (2014)
Kempa, D., Kociumaka, T.: Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In: Proceedings of FOCS, pp. 1877–1886 (2023)
Kociumaka, T.: Efficient Data Structures for Internal Queries in Texts. PhD thesis, University of Warsaw (2018)
Köppl, D.: Non-overlapping LZ77 factorization and LZ78 substring compression queries with suffix trees. Algorithms 14(2)(44), 1–21 (2021)
Köppl, D.: Computing LZ78-derivates with suffix trees. In: Proceedings of DCC, pp. 133–142 (2024)
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord. J. Comput. 12(1), 40–66 (2005)
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Nakashima, Y., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Constructing LZ78 tries and position heaps in linear time for large alphabets. Inf. Process. Lett. 115(9), 655–659 (2015)
Navarro, G.: Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv. 54(2), 29:1–29:31 (2021)
Nekrich, Y.: A dynamic stabbing-max data structure with sub-logarithmic query time. In: Asano, T., Nakano, S., Okamoto, Y., Watanabe, O. (eds.) ISAAC 2011. LNCS, vol. 7074, pp. 170–179. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25591-5_19
Raffinot, M.: On maximal repeats in strings. Inf. Process. Lett. 80(3), 165–169 (2001)
Rytter, W.: The structure of subword graphs and suffix trees of Fibonacci words. Theor. Comput. Sci. 363(2), 211–223 (2006)
Sleator, D.D., Tarjan, R.E.: Self-adjusting binary search trees. J. ACM 32(3), 652–686 (1985)
Storer, J.A., Szymanski, T.G.: The macro model for data compression (extended abstract). In: Proceedings of STOC, pp. 30–39 (1978)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of SWAT, pp. 1–11 (1973)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number JP23H04378.
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shibata, H., Köppl, D. (2025). LZ78 Substring Compression with CDAWGs. In: Lipták, Z., Moura, E., Figueroa, K., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2024. Lecture Notes in Computer Science, vol 14899. Springer, Cham. https://doi.org/10.1007/978-3-031-72200-4_22
Download citation
DOI: https://doi.org/10.1007/978-3-031-72200-4_22
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72199-1
Online ISBN: 978-3-031-72200-4
eBook Packages: Computer ScienceComputer Science (R0)