Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1007352.1007374acmconferencesArticle/Chapter ViewAbstractPublication PagesstocConference Proceedingsconference-collections
Article

Dictionary matching and indexing with errors and don't cares

Published: 13 June 2004 Publication History

Abstract

This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of "don't care" characters are allowed. The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution.The performance bounds all have a similar character. For example, for the indexing problem with n=|t| and m=|p|, the query time for k substitutions is O(m + (c1 log n)k⁄k! + # matches), with a data structure of size O(n (c2 log n)k⁄k!) and a preprocessing time of O(n (c2 log n)k⁄k!), where c1,c2 > 1 are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.

References

[1]
K. Abrahamson. Generalized string matching. SIAM J. Computing, 16(6):1039--1051, 1987.]]
[2]
A. V. Aho and M. J. Corasick. Efficient string matching. Comm. ACM, 18(6):333--340, 1975.]]
[3]
T. Akutsu. Approximate string matching with don't care characters. Information Processing Letters, 55:235--239, 1995.]]
[4]
T. Akutsu. Approximate string matching with variable length don't care characters. IEICE Trans. Information and Systems, E79-D:1353--1354, 1996.]]
[5]
A. Amir and M. Farach. Adaptive dictionary matching. Proc. of the Symposium on Foundations of Computer Science, 1991, 760--766.]]
[6]
A. Amir, M. Farach, R. Giancarlo, Z. Galil and K. Park. Dynamic dictionary matching. J. of Computer and System Sciences, 49(2):208--222, 1994.]]
[7]
A. Amir, M. Farach, R. M. Idury, J. A. La Poutre, and A. A. Schaffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258--282, 1995.]]
[8]
A. Amir, D. Keselman, G. M. Landau, N. Lewenstein, M. Lewenstein, and M. Rodeh. Text Indexing and dictionary matching with one error. J. of Algorithms, 37(2): 309--325, 2000.]]
[9]
A. Amir, G. Landau, M. Lewenstein and D. Sokol. Dynamic pattern, static text matching. Submitted for journal publication; preliminary version, Proc. of Workshop on Algorithms and Data Structures, 2003, 340--352.]]
[10]
A. Amir, M. Lewenstein and E. Porat. Faster algorithms for string matching with k mismatches. Proc. of the Symposium on Discrete Algorithms, 2000, 794--803.]]
[11]
A. Amir, M. Lewenstein and E. Porat. Approximate subset matching with don't cares. Proc. of the Symposium on Discrete Algorithms, 2001, 279--288.]]
[12]
G. S. Brodal and L. Gasieniec. Approximate dictionary queries. Proc. of the Symposium on Combinatorial Pattern Matching, 1996, 65--74. (LNCS 1075).]]
[13]
G. S. Brodal and S. Venkatesh. Improved bounds for dictionary lookup with one error. Information Processing Letters, 75(1-2):57--59 (2000).]]
[14]
A. L. Buchsbaum, M. T. Goodrich, J. Westbrook. Range searching over tree cross products. Proc. of the European Symposium on Algorithms, 2000, 120--131.]]
[15]
R. Cole and R. Hariharan. Approximate string matching: A faster simpler algorithm. Proc. of the Symposium on Discrete Algorithms, 1998, 463--472.]]
[16]
R. Cole and R. Hariharan. Verifying candidate matches in sparse and wildcard matching. Proc. of the Symposium on Theory of Computing, 2002, 592--601.]]
[17]
M. Farach. Optimal suffix tree construction with large alphabets. Proc. of the Symposium on Foundations of Computer Science, 1997, 137--143.]]
[18]
P. Ferragina, S. Muthukrishnan, and M. deBerg. Multi-method dispatching: a geometric approach with applications to string matching. Proc. of the Symposium on the Theory of Computing, 1999, 483--491.]]
[19]
M. J. Fischer and M. S. Paterson. String matching and other products. Complexity of Computation, R. M. Karp (editor), SIAM-AMS Proceedings, 7:113--125, 1974.]]
[20]
Z. Galil and R. Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4), 52--54, 1986.]]
[21]
D. Greene, M. Parnas and F. Yao. Multi-index hashing for information retrieval. Proc. of the Symposium on the Foundations of Computer Science, 1994, 722--731.]]
[22]
R. Grossi, J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. Proc. of the Symposium on Theory of Computing, 2000, 397--406.]]
[23]
T. Hagerup, P. B. Miltersen, R. Pagh. Deterministic dictionaries. J. of Algorithms, 41(1):69--85, 2001.]]
[24]
D. Harel, R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338--55, 1984.]]
[25]
R. M. Idury and A. A Sch"affer. Dynamic dictionary matching with failure functions. Proc. of the 3rd Annual Symposium on Combinatorial Pattern Matching, 1992, 273--284.]]
[26]
P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. of the Symposium on Theory of Computing, 1998, 604--613.]]
[27]
E. Kushilevitz, R. Ostrovsky, Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing, 30(2):457--474, 2000.]]
[28]
G. M. Landau and U. Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43, 239-249, 1986.]]
[29]
G. M. Landau and U. Vishkin. Fast parallel and serial approximate string matching. J. of Algorithms, 10(2):157--169, 1989.]]
[30]
U. Manber and S. Wu. An algorithm for approximate membership checking with applications to password security. Information Processing Letters, 50:92--197, 1994.]]
[31]
E. M. McCreight. A space-economical suffix tree construction algorithm. J. of the ACM, 23:262--272, 1976.]]
[32]
K. Mehlhorn. Data Structures and Efficient Algorithms, Vol. 1: Sorting and Searching, pp. 177--178. Springer Verlag, EATCS Monographs, 1984.]]
[33]
M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, Mass., 1969.]]
[34]
S. C. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. Proc. of the Symposium on Foundations of Computer Science, 1996, 320--328.]]
[35]
B. Schieber and U. Vishkin. On finding lowest common ancestors: simplifications and parallelization. SIAM Journal on Computing, 17(6):1253--62, 1988.]]
[36]
S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu, C. T. Lu. Spatial databases - accomplishments and research needs. TKDE, 11(1):45--55 (1999).]]
[37]
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249--260, 1995.]]
[38]
P. Weiner. Linear pattern matching algorithm. Proc. of the Symposium on Switching and Automata Theory, 1973, 1--11.]]
[39]
D. E. Willard. Log-logarithmic worst-case range queries are possible in space θ(n). Information Processing Letters, 17(2):81--84, 1983.]]
[40]
A. C. -C. Yao and F. F. Yao. Dictionary lookup with one error. J. of Algorithms, 25(1):194--202, 1997.]]

Cited By

View all
  • (2024)An integrated data-driven framework for vehicle quality analysis based on maintenance record mining and Bayesian networkInternational Journal of Quality & Reliability Management10.1108/IJQRM-03-2023-0114Online publication date: 21-May-2024
  • (2024)Pattern Masking for Dictionary Matching: Theory and PracticeAlgorithmica10.1007/s00453-024-01213-886:6(1948-1978)Online publication date: 1-Jun-2024
  • (2024)Elastic-Degenerate String Matching with 1 Error or MismatchTheory of Computing Systems10.1007/s00224-024-10194-868:5(1442-1467)Online publication date: 1-Oct-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
STOC '04: Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
June 2004
660 pages
ISBN:1581138520
DOI:10.1145/1007352
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 June 2004

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. approximate pattern matching
  2. dictionary matching
  3. dictionary query
  4. suffix trees
  5. text indexing
  6. wildcards

Qualifiers

  • Article

Conference

STOC04
Sponsor:
STOC04: Symposium of Theory of Computing 2004
June 13 - 16, 2004
IL, Chicago, USA

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)71
  • Downloads (Last 6 weeks)18
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)An integrated data-driven framework for vehicle quality analysis based on maintenance record mining and Bayesian networkInternational Journal of Quality & Reliability Management10.1108/IJQRM-03-2023-0114Online publication date: 21-May-2024
  • (2024)Pattern Masking for Dictionary Matching: Theory and PracticeAlgorithmica10.1007/s00453-024-01213-886:6(1948-1978)Online publication date: 1-Jun-2024
  • (2024)Elastic-Degenerate String Matching with 1 Error or MismatchTheory of Computing Systems10.1007/s00224-024-10194-868:5(1442-1467)Online publication date: 1-Oct-2024
  • (2024)Bounded-Ratio Gapped String IndexingString Processing and Information Retrieval10.1007/978-3-031-72200-4_9(118-126)Online publication date: 23-Sep-2024
  • (2023)Bidirectional String Anchors for Improved Text Indexing and Top-$K$ Similarity SearchIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.323178035:11(11093-11111)Online publication date: 1-Nov-2023
  • (2023)Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00066(1105-1130)Online publication date: 6-Nov-2023
  • (2022)SIMD-Matcher: A SIMD-based Arbitrary Matching FrameworkACM Transactions on Architecture and Code Optimization10.1145/351424619:3(1-20)Online publication date: 4-May-2022
  • (2022)On the Average-case Complexity of Pattern Matching with WildcardsTheoretical Computer Science10.1016/j.tcs.2022.04.009Online publication date: Apr-2022
  • (2022)Efficient Computation of Sequence MappabilityAlgorithmica10.1007/s00453-022-00934-y84:5(1418-1440)Online publication date: 2-Feb-2022
  • (2022)Elastic-Degenerate String Matching with 1 ErrorLATIN 2022: Theoretical Informatics10.1007/978-3-031-20624-5_2(20-37)Online publication date: 29-Oct-2022
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media