Abstract
This paper considers enumeration of substring equivalence classes introduced by Blumer et al. (J ACM 34(3):578–595, 1987). These equivalence classes were originally proposed to define a text indexing structure called compact directed acyclic word graphs (CDAWGs). These equivalence classes are also useful for text analysis, since they group together redundant substrings with essentially identical occurrences. In this paper, we present how to enumerate these equivalence classes using only suffix arrays and two auxiliary arrays (rank arrays and lcp arrays), in O(n) time for a given string of length n over the integer alphabet. The proposed method overcomes all the existing algorithms which require \(O(n \log \sigma )\) time, where \(\sigma \) is the alphabet size. Our experimental results show that the proposed method is also practically faster and more memory efficient than the existing ones. Furthermore, we propose an O(n)-time algorithm which constructs the CDAWG of an input string over the integer alphabet. This algorithm is based on the above-mentioned algorithm to enumerate equivalence classes. Our experiments show that the proposed method runs faster than the existing algorithm on large alphabets.
Similar content being viewed by others
References
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31–55 (1985)
Blumer, A., Blumer, J., Haussler, D., Mcconnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)
Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs.In: Proceedings of CPM 1997, pp. 116–129 (1997)
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)
Inenaga, S., Hoshinoa, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., Pavesi, G.: On-line construction of compact directed acyclic word graphs. Discret. Appl. Math. 146(2), 156–179 (2005)
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53, 918–936 (2006)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of CPM’01, pp. 181–192 (2001)
Kim, D.K., Sim, J.S., Park, H., Park, K.: Constructing suffix arrays in linear time. J. Discret. Algorithms 3, 126–142 (2005)
Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. J. Discret. Algorithms 3, 143–156 (2005)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Narisawa, K., Bannai, H., Hatano, K., Takeda, M.: Unsupervised spam detection based on string alienness measures. In: Proceedings of DS’07, pp. 159–172 (2007)
Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. In: Proceedings of CPM’07, pp. 340–351 (2007)
Nong, G., Zhang, S., Chan, W.H.: Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)
Okanohara, D., Tsujii, J.: Text categorization with all substring features. In: Proceedings of SDM’09, pp. 838–846 (2009)
Puglisi, S.J., Smyth, W.F., Yusufu, M.: Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of PSC’08, pp. 161–169 (2008)
Revuz, D.: Minimisation of acyclic deterministic automata in linear time. Theor. Comput. Sci. 92(1), 181–189 (1992)
sais. https://sites.google.com/site/yuta256/ Accessed 30 Oct 2013
Takeda, M., Matsumoto, T., Fukuda, T., Nanri, I.: Discovering characteristic expressions in literary works. Theor. Comput. Sci. 292(2), 525–546 (2003)
The canterbury corpus. http://corpus.canterbury.ac.nz/ Accessed 30 Oct 2013
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of 14th IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Acknowledgments
This work was partially supported by KAKENHI Grant Number 25240003.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Narisawa, K., Hiratsuka, H., Inenaga, S. et al. Efficient Computation of Substring Equivalence Classes with Suffix Arrays. Algorithmica 79, 291–318 (2017). https://doi.org/10.1007/s00453-016-0178-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-016-0178-z