Article

Efficient Set Similarity Joins Using Min-prefixes

Authors:

Leonardo A. Ribeiro,

Theo HärderAuthors Info & Claims

ADBIS '09: Proceedings of the 13th East European Conference on Advances in Databases and Information Systems

Pages 88 - 102

https://doi.org/10.1007/978-3-642-03973-7_8

Published: 21 August 2009 Publication History

Abstract

Identification of all objects in a dataset whose similarity is not less than a specified threshold is of major importance for management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile to represent a variety of similarity notions. Most set similarity join methods proposed so far present two main phases at a high level of abstraction: candidate generation producing a set of candidate pairs and verification applying the actual similarity measure to the candidates and returning the correct answer. Previous work has primarily focused on the reduction of candidates, where candidate generation presented the major effort to obtain better pruning results. Here, we propose an opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups as compared to previous algorithms.

References

[1]

Arasu, A., Ganti, V., Kaushik, R.: Efficient Exact Set-Similarity Joins. In: Proc. VLDB, pp. 918-929 (2006).

Crossref

Google Scholar

[2]

Bayardo, R.J., Ma, Y., Srikant, R.: Scaling Up All Pairs Similarity Search. In: Proc. WWW, pp. 131-140 (2007).

Crossref

Google Scholar

[3]

Broder, A.Z.: On the Resemblance and Containment of Documents. In: Proc. Compression and Complexity of Sequences, p. 21 (1997).

Crossref

Google Scholar

[4]

Chakrabarti, K., Chaudhuri, S., Ganti, V., Xin, D.: An Efficient Filter for Approximate Membership Checking. In: Proc. SIGMOD, pp. 805-818 (2008).

Crossref

Google Scholar

[5]

Chaudhuri, S., Ganjam, K., Kaushik, R.: A Primitive Operator for Similarity Joins in Data Cleaning. In: Proc. ICDE, p. 5 (2006).

Crossref

Google Scholar

[6]

Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., et al.: Approximate String Joins in a Database (Almost) for Free. In: Proc. VLDB, pp. 491-500 (2001).

Crossref

Google Scholar

[7]

Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast Indexes and Algorithms for Set Selection Queries. In: Proc. ICDE, pp. 267-276 (2008).

Crossref

Google Scholar

[8]

Li, C., Lu, J., Lu, Y.: Efficient Merging and Filtering Algorithms for Approximate String Searches. In: Proc. ICDE, pp. 257-266 (2008).

Crossref

Google Scholar

[9]

Sarawagi, S., Kirpal, A.: Efficient Set Joins on Similarity Predicates. In: Proc. SIGMOD, pp. 743-754 (2004).

Crossref

Google Scholar

[10]

Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating Similarity Measures: A Large Scale Study in the Orkut Social Network. In: Proc. KDD, pp. 678-684 (2005).

Crossref

Google Scholar

[11]

Xiao, C., Wang, W., Lin, X.: Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. In: PVLDB, vol. 1(1), pp. 933-944 (2008).

Crossref

Google Scholar

[12]

Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k Set Similarity Joins. In: Proc. ICDE, pp. 916-927 (2009).

Crossref

Google Scholar

[13]

Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient Similarity Joins for Near Duplicate Detection. In: Proc. WWW, pp. 131-140 (2008).

Crossref

Google Scholar

Cited By

View all

Thiel KKocher DAugsten NHütter TMann WSchmitt D(2023)FINEX: A Fast Index for Exact & Flexible Density-Based ClusteringProceedings of the ACM on Management of Data10.1145/35889251:1(1-25)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588925
Bouros PMamoulis NGe STerrovitis M(2016)Set containment join revisitedKnowledge and Information Systems10.1007/s10115-015-0895-749:1(375-402)Online publication date: 1-Oct-2016
https://dl.acm.org/doi/10.1007/s10115-015-0895-7
Sidney CMendes DRibeiro LHärder TWainwright RCorchado JBechini AHong J(2015)Performance prediction for set similarity joinsProceedings of the 30th Annual ACM Symposium on Applied Computing10.1145/2695664.2695694(967-972)Online publication date: 13-Apr-2015
https://dl.acm.org/doi/10.1145/2695664.2695694
Show More Cited By

Recommendations

Top-k Set Similarity Joins
ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering

Similarity join is a useful primitive operation underlying many applications, such as near duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this ...
Generalizing prefix filtering to improve set similarity joins

Identification of all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they ...
Similarity Joins

Similarity Joins are extensively used in multiple application domains and are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

ADBIS '09: Proceedings of the 13th East European Conference on Advances in Databases and Information Systems

August 2009

377 pages

ISBN:9783642039720

Editors:
Janis Grundspenkis
Institute of Applied Computer Systems, Riga Technical University, Riga, Latvia LV 1658
,
Tadeusz Morzy
Institute of Computing Science, University of Technology, Poznán, Poland 60-965
,
Gottfried Vossen
European Research Center for Information Systems, University of Münster, Münster, Germany 48149

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 21 August 2009

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Thiel KKocher DAugsten NHütter TMann WSchmitt D(2023)FINEX: A Fast Index for Exact & Flexible Density-Based ClusteringProceedings of the ACM on Management of Data10.1145/35889251:1(1-25)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588925
Bouros PMamoulis NGe STerrovitis M(2016)Set containment join revisitedKnowledge and Information Systems10.1007/s10115-015-0895-749:1(375-402)Online publication date: 1-Oct-2016
https://dl.acm.org/doi/10.1007/s10115-015-0895-7
Sidney CMendes DRibeiro LHärder TWainwright RCorchado JBechini AHong J(2015)Performance prediction for set similarity joinsProceedings of the 30th Annual ACM Symposium on Applied Computing10.1145/2695664.2695694(967-972)Online publication date: 13-Apr-2015
https://dl.acm.org/doi/10.1145/2695664.2695694
Kim J(2015)An effective candidate generation method for improving performance of edit similarity query processingInformation Systems10.1016/j.is.2014.07.00547:C(116-128)Online publication date: 1-Jan-2015
https://dl.acm.org/doi/10.1016/j.is.2014.07.005
Tabei YKishimoto AKotera MYamanishi YGrossman RUthurusamy RDhillon IKoren Y(2013)Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraintsProceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining10.1145/2487575.2487637(176-184)Online publication date: 11-Aug-2013
https://dl.acm.org/doi/10.1145/2487575.2487637
Ribeiro LHärder T(2011)Generalizing prefix filtering to improve set similarity joinsInformation Systems10.1016/j.is.2010.07.00336:1(62-78)Online publication date: 1-Mar-2011
https://dl.acm.org/doi/10.1016/j.is.2010.07.003
Ribeiro LHärder TPimenta FDesai BSacca DGreco S(2009)A cluster-based approach to XML similarity joinsProceedings of the 2009 International Database Engineering & Applications Symposium10.1145/1620432.1620451(182-193)Online publication date: 16-Sep-2009
https://dl.acm.org/doi/10.1145/1620432.1620451

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Abstract

References

Cited By

Recommendations

Top-k Set Similarity Joins

Generalizing prefix filtering to improve set similarity joins

Similarity Joins

Comments

Information

Published In

Publisher

Publication History

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations