Abstract
We present a solution to the problem of approximate pattern matching on compressed text. The compression format we consider is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) time. The existence problem needs O(mkn) time. We also show that the algorithm can be adapted to run in O(k^2 n + min(mkn, m^2 (mσ)^k) + R) average time, where σ is the alphabet size. The experimental results show a speedup over the basic approach for moderate m and small k.
Work developed during a postdoctoral stay at the University of Helsinki, partially supported by the Academy of Finland (grant 44449) and Fundación Andes. Also supported by Fondecyt grant 1-990627.
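For orientation, the following is a minimal sketch of the baseline that the abstract compares against, assuming that "the basic approach" means decompressing the text and then running a classical approximate search on it. It is not the algorithm of this paper, which works on the compressed blocks directly. The search routine is the standard column-wise dynamic programming of Sellers, taking O(mu) time on the uncompressed text. The function names (`lz78_compress`, `lz78_decompress`, `approx_search`) and the block representation as (reference, character) pairs are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, str]  # (index of an earlier block, extra character)


def lz78_compress(text: str) -> List[Block]:
    """Parse text into LZ78 blocks. Block 0 is the implicit empty phrase."""
    trie: Dict[Tuple[int, str], int] = {}  # (parent block, char) -> block id
    blocks: List[Block] = []
    node = 0
    for ch in text:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt  # keep extending the current phrase
        else:
            blocks.append((node, ch))  # new phrase = old phrase + ch
            trie[(node, ch)] = len(blocks)
            node = 0
    if node != 0:
        # Text ended inside a known phrase: emit it with no extra character.
        blocks.append((node, ""))
    return blocks


def lz78_decompress(blocks: List[Block]) -> str:
    """Rebuild the text by expanding each block from the phrase it references."""
    phrases = [""]
    for ref, ch in blocks:
        phrases.append(phrases[ref] + ch)
    return "".join(phrases[1:])


def approx_search(pattern: str, text: str, k: int) -> List[int]:
    """Return 1-based end positions of substrings of text that match
    pattern with at most k insertions, deletions and substitutions."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends: List[int] = []
    for j, tc in enumerate(text, start=1):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            if pattern[i - 1] == tc:
                col[i] = prev_diag
            else:
                col[i] = 1 + min(prev_diag, old, col[i - 1])
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def search_by_decompression(blocks: List[Block], pattern: str, k: int) -> List[int]:
    """Baseline: decompress the whole text, then search it."""
    return approx_search(pattern, lz78_decompress(blocks), k)


if __name__ == "__main__":
    blocks = lz78_compress("surgery")
    print(search_by_decompression(blocks, "survey", 2))  # [5, 6, 7]
```

The baseline touches all u characters of the text, whereas the bounds in the abstract depend on the compressed length n. That difference is what makes direct search on the compressed form attractive when the text compresses well.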
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kärkkäinen, J., Navarro, G., Ukkonen, E. (2000). Approximate String Matching over Ziv-Lempel Compressed Text. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_18
DOI: https://doi.org/10.1007/3-540-45123-4_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67633-1
Online ISBN: 978-3-540-45123-5
eBook Packages: Springer Book Archive