Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1007/978-3-642-03973-7_8guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Efficient Set Similarity Joins Using Min-prefixes

Published: 21 August 2009 Publication History

Abstract

Identification of all objects in a dataset whose similarity is not less than a specified threshold is of major importance for management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile to represent a variety of similarity notions. Most set similarity join methods proposed so far present two main phases at a high level of abstraction: candidate generation producing a set of candidate pairs and verification applying the actual similarity measure to the candidates and returning the correct answer. Previous work has primarily focused on the reduction of candidates, where candidate generation presented the major effort to obtain better pruning results. Here, we propose an opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups as compared to previous algorithms.

References

[1]
Arasu, A., Ganti, V., Kaushik, R.: Efficient Exact Set-Similarity Joins. In: Proc. VLDB, pp. 918-929 (2006).
[2]
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling Up All Pairs Similarity Search. In: Proc. WWW, pp. 131-140 (2007).
[3]
Broder, A.Z.: On the Resemblance and Containment of Documents. In: Proc. Compression and Complexity of Sequences, p. 21 (1997).
[4]
Chakrabarti, K., Chaudhuri, S., Ganti, V., Xin, D.: An Efficient Filter for Approximate Membership Checking. In: Proc. SIGMOD, pp. 805-818 (2008).
[5]
Chaudhuri, S., Ganjam, K., Kaushik, R.: A Primitive Operator for Similarity Joins in Data Cleaning. In: Proc. ICDE, p. 5 (2006).
[6]
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., et al.: Approximate String Joins in a Database (Almost) for Free. In: Proc. VLDB, pp. 491-500 (2001).
[7]
Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast Indexes and Algorithms for Set Selection Queries. In: Proc. ICDE, pp. 267-276 (2008).
[8]
Li, C., Lu, J., Lu, Y.: Efficient Merging and Filtering Algorithms for Approximate String Searches. In: Proc. ICDE, pp. 257-266 (2008).
[9]
Sarawagi, S., Kirpal, A.: Efficient Set Joins on Similarity Predicates. In: Proc. SIGMOD, pp. 743-754 (2004).
[10]
Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating Similarity Measures: A Large Scale Study in the Orkut Social Network. In: Proc. KDD, pp. 678-684 (2005).
[11]
Xiao, C., Wang, W., Lin, X.: Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. In: PVLDB, vol. 1(1), pp. 933-944 (2008).
[12]
Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k Set Similarity Joins. In: Proc. ICDE, pp. 916-927 (2009).
[13]
Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient Similarity Joins for Near Duplicate Detection. In: Proc. WWW, pp. 131-140 (2008).

Cited By

View all
  • (2023)FINEX: A Fast Index for Exact & Flexible Density-Based ClusteringProceedings of the ACM on Management of Data10.1145/35889251:1(1-25)Online publication date: 30-May-2023
  • (2016)Set containment join revisitedKnowledge and Information Systems10.1007/s10115-015-0895-749:1(375-402)Online publication date: 1-Oct-2016
  • (2015)Performance prediction for set similarity joinsProceedings of the 30th Annual ACM Symposium on Applied Computing10.1145/2695664.2695694(967-972)Online publication date: 13-Apr-2015
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
ADBIS '09: Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
August 2009
377 pages
ISBN:9783642039720
  • Editors:
  • Janis Grundspenkis,
  • Tadeusz Morzy,
  • Gottfried Vossen

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 21 August 2009

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)FINEX: A Fast Index for Exact & Flexible Density-Based ClusteringProceedings of the ACM on Management of Data10.1145/35889251:1(1-25)Online publication date: 30-May-2023
  • (2016)Set containment join revisitedKnowledge and Information Systems10.1007/s10115-015-0895-749:1(375-402)Online publication date: 1-Oct-2016
  • (2015)Performance prediction for set similarity joinsProceedings of the 30th Annual ACM Symposium on Applied Computing10.1145/2695664.2695694(967-972)Online publication date: 13-Apr-2015
  • (2015)An effective candidate generation method for improving performance of edit similarity query processingInformation Systems10.1016/j.is.2014.07.00547:C(116-128)Online publication date: 1-Jan-2015
  • (2013)Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraintsProceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining10.1145/2487575.2487637(176-184)Online publication date: 11-Aug-2013
  • (2011)Generalizing prefix filtering to improve set similarity joinsInformation Systems10.1016/j.is.2010.07.00336:1(62-78)Online publication date: 1-Mar-2011
  • (2009)A cluster-based approach to XML similarity joinsProceedings of the 2009 International Database Engineering & Applications Symposium10.1145/1620432.1620451(182-193)Online publication date: 16-Sep-2009

View Options

View options

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media