research-article

Approximate string matching by position restricted alignment

Authors:

Sharma V. Thankachan,

Seung-Jong Park,

David Foltz

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops

Pages 384 - 391

https://doi.org/10.1145/2457317.2457388

Published: 18 March 2013

Abstract

Given a collection of strings, the goal of approximate string matching is to efficiently find the strings in the collection that are similar to a query string. In this paper, we focus on edit distance as the measure of similarity between two strings. Existing q-gram-based methods for this problem use inverted indexes to index the q-grams of the given string collection. These methods begin by generating the q-grams of the query string (disjoint or overlapping) and then merge the inverted lists of these q-grams. Several filtering techniques have been proposed to segment the inverted lists into relatively shorter lists, thus reducing the merging cost. We use a filtering technique, which we call "position restricted alignment", that combines the well-known length filtering and position filtering to provide more aggressive pruning. We then provide an indexing scheme that integrates inverted-list storage with the proposed filter, thus enabling us to auto-filter the inverted lists. We evaluate the effectiveness of the proposed approach through thorough experimentation.
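
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of a generic q-gram inverted index with length filtering, position filtering, and a count-based candidate test, followed by edit-distance verification. It is not the paper's indexing scheme: the function names (`qgrams`, `build_index`, `search`), the posting layout, and the use of overlapping q-grams are all assumptions chosen for illustration, and the filters are applied while scanning the lists rather than being built into the list storage as the abstract proposes.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return (position, gram) pairs for the overlapping q-grams of s."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each gram to postings (string_id, gram_position, string_length)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos, gram in qgrams(s, q):
            index[gram].append((sid, pos, len(s)))
    return index


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def search(strings, index, query, q, k):
    """Return the strings within edit distance k of query, in collection order."""
    # Each edit destroys at most q of the query's q-grams, so a true match
    # must preserve at least this many of them (count filter).
    threshold = len(query) - q + 1 - k * q
    if threshold <= 0:
        # The count filter cannot prune; fall back to the length filter alone.
        candidates = [sid for sid, s in enumerate(strings)
                      if abs(len(s) - len(query)) <= k]
    else:
        matched = defaultdict(set)  # string_id -> matched query gram positions
        for qpos, gram in qgrams(query, q):
            for sid, pos, length in index.get(gram, ()):
                if abs(length - len(query)) > k:  # length filter
                    continue
                if abs(pos - qpos) > k:           # position filter
                    continue
                matched[sid].add(qpos)
        candidates = sorted(sid for sid, hits in matched.items()
                            if len(hits) >= threshold)
    # Verification step: keep only candidates that really are within k edits.
    return [strings[sid] for sid in candidates
            if edit_distance(strings[sid], query) <= k]


strings = ["position", "positron", "partition", "petition", "restricted"]
index = build_index(strings, q=2)
print(search(strings, index, "position", q=2, k=1))  # ['position', 'positron']
```

In this sketch every posting is still visited before being filtered. Storing the lists so that postings failing the length or position test are never scanned is the kind of integration the abstract describes, and it is not attempted here.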

Cited By

Dutta S (2015) MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance. Advances in Information Retrieval, pages 284-290. DOI: 10.1007/978-3-319-16354-3_31. Online publication date: 2015.
https://doi.org/10.1007/978-3-319-16354-3_31
Patil M, Shah R, Dyreson C, Li F, Özsu M (2014) Similarity joins for uncertain strings. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 1471-1482. DOI: 10.1145/2588555.2612178. Online publication date: 18-Jun-2014.
https://dl.acm.org/doi/10.1145/2588555.2612178

Index Terms

Approximate string matching by position restricted alignment
1. Information systems
  1. Information systems applications
2. Theory of computation
  1. Design and analysis of algorithms

Information & Contributors

Information

Published In

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops

March 2013

423 pages

ISBN:9781450315999

DOI:10.1145/2457317

General Chair:
Giovanna Guerrini
Università di Genova, Italy

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

EDBT/ICDT '13

EDBT/ICDT '13: Joint 2013 EDBT/ICDT Conferences

March 18 - 22, 2013

Genoa, Italy

Acceptance Rates

EDBT '13 Paper Acceptance Rate 7 of 10 submissions, 70%;

Overall Acceptance Rate 7 of 10 submissions, 70%

Bibliometrics

Total Citations: 2
Total Downloads: 142
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 18 Nov 2024
