Efficient Computation of Substring Equivalence Classes with Suffix Arrays

Conference paper

pp 340–351

Combinatorial Pattern Matching (CPM 2007)

Kazuyuki Narisawa¹, Shunsuke Inenaga², Hideo Bannai¹ & Masayuki Takeda¹,³

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4580)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching


Abstract

This paper considers the enumeration of the substring equivalence classes introduced by Blumer et al. [1], who used them to define an index structure called the compact directed acyclic word graph (CDAWG). In text analysis, these equivalence classes are useful because they group together redundant substrings that have essentially identical occurrences. We show how to enumerate the equivalence classes using suffix arrays. Our algorithm uses the rank and lcp arrays to simulate a traversal of the corresponding suffix tree and needs no other data structure. It runs in time linear in the length of the input string. We report experimental results comparing the running time and space consumption of our algorithm with those of suffix tree and CDAWG based approaches.
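The idea of traversing a suffix tree through the rank and lcp arrays can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the suffix array is built by naive sorting, the lcp array follows the linear-time scheme of Kasai et al., and the internal nodes of the suffix tree are enumerated as lcp-intervals with a stack, in the manner of enhanced suffix arrays. The final filter rests on an assumption that the abstract does not state, namely that the equivalence classes of repeated substrings are represented by maximal repeats (substrings that are both right- and left-branching). All function names are hypothetical, and the left-branching test is deliberately simple rather than linear-time.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; adequate for a demonstration only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_array(sa):
    # rank is the inverse permutation of sa: rank[sa[r]] == r.
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    return rank


def lcp_array(s, sa, rank):
    # Kasai-style computation: lcp[r] is the length of the longest common
    # prefix of the suffixes starting at sa[r - 1] and sa[r]; lcp[0] = 0.
    n = len(s)
    lcp = [0] * n
    h = 0
    for i in range(n):
        r = rank[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp


def lcp_intervals(lcp):
    # Yield (depth, left, right) for every lcp-interval, children before
    # parents. Each interval corresponds to an internal suffix tree node whose
    # path label has length `depth` and whose leaves are sa[left..right].
    n = len(lcp)
    stack = [(0, 0)]  # (depth, left boundary); the root is (0, 0)
    for r in range(1, n + 1):
        cur = lcp[r] if r < n else -1  # -1 flushes the stack at the end
        left = r - 1
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()
            yield depth, left, r - 1
        if not stack or stack[-1][0] < cur:
            stack.append((cur, left))


def maximal_repeats(s):
    # ASSUMPTION: class representatives are taken to be maximal repeats,
    # i.e. internal nodes whose occurrences are preceded by at least two
    # different characters (or by the start of the string).
    sa = suffix_array(s)
    lcp = lcp_array(s, sa, rank_array(sa))
    found = []
    for depth, left, right in lcp_intervals(lcp):
        if depth == 0:
            continue  # skip the root (empty string)
        preceding = {s[sa[k] - 1] if sa[k] > 0 else None
                     for k in range(left, right + 1)}
        if len(preceding) > 1:
            found.append(s[sa[left]:sa[left] + depth])
    return found
```

For the string "banana", the lcp-intervals of positive depth spell "ana", "a", and "na". Every occurrence of "na" is preceded by "a", so it is discarded as a redundant variant of "ana", and the sketch reports only "a" and "ana".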


References

Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)
Narisawa, K., Bannai, H., Hatano, K., Takeda, M.: Unsupervised spam detection based on string alienness measures. Technical report, Department of Informatics, Kyushu University (2007)
Takeda, M., Matsumoto, T., Fukuda, T., Nanri, I.: Discovering characteristic expressions in literary works. Theoretical Computer Science 292(2), 525–546 (2003)
Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., Pavesi, G.: On-line construction of compact directed acyclic word graphs. Discrete Applied Mathematics 146(2), 156–179 (2005)
Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1–11 (1973)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Computing 22(5), 935–948 (1993)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 31–55 (1985)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 943–955. Springer, Heidelberg (2003)
Kim, D.K., Sim, J.S., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 186–199. Springer, Heidelberg (2003)
Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 200–210. Springer, Heidelberg (2003)
Arnold, R., Bell, T.: A corpus for the evaluation of lossless compression algorithms. In: Proc. DCC ’97, pp. 201–210 (1997), http://corpus.canterbury.ac.nz/
Nevill-Manning, C., Witten, I.: Protein is incompressible. In: Proc. DCC 1999, pp. 257–266 (1999), http://www.data-compression.info/Corpora/ProteinCorpus/index.htm
Larsson, N.J., Sadakane, K.: Faster suffix sorting. Technical Report LU-CS-TR:99-214, LUNDFD6/(NFCS-3140)/1–20/(1999) Department of Computer Science, Lund University, Sweden (1999), http://www.larsson.dogma.net/qsufsort.c


Author information

Authors and Affiliations

Department of Informatics, Kyushu University, Fukuoka 819-0395, Japan
Kazuyuki Narisawa, Hideo Bannai & Masayuki Takeda
Department of Computer Science and Communication Engineering, Kyushu University, Fukuoka 819-0395, Japan
Shunsuke Inenaga
SORST, Japan Science and Technology Agency (JST)
Masayuki Takeda


Editor information

Bin Ma, Kaizhong Zhang


Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Narisawa, K., Inenaga, S., Bannai, H., Takeda, M. (2007). Efficient Computation of Substring Equivalence Classes with Suffix Arrays. In: Ma, B., Zhang, K. (eds) Combinatorial Pattern Matching. CPM 2007. Lecture Notes in Computer Science, vol 4580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73437-6_34


DOI: https://doi.org/10.1007/978-3-540-73437-6_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73436-9
Online ISBN: 978-3-540-73437-6
eBook Packages: Computer Science, Computer Science (R0)
