Article

Efficient set joins on similarity predicates

Authors: Sunita Sarawagi, Alok Kirpal

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Pages 743 - 754

https://doi.org/10.1145/1007568.1007652

Published: 13 June 2004

Abstract

In this paper we present an efficient, scalable, and general algorithm for performing set joins on predicates involving various similarity measures such as intersect size, Jaccard coefficient, cosine similarity, and edit distance. This expands the existing suite of algorithms for set joins on simpler predicates such as set containment, equality, and non-zero overlap. We start with a basic inverted-index-based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
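The basic inverted-index probing idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's optimized algorithm: it only shows the unoptimized starting point for a self-join on intersect size, with a Jaccard filter layered on top. The function names `overlap_self_join` and `jaccard_self_join` and the in-memory dictionary index are assumptions made for this illustration.

```python
from collections import defaultdict


def overlap_self_join(records, min_overlap):
    """Return sorted pairs (i, j), i < j, whose sets share >= min_overlap words.

    Hypothetical helper: a plain in-memory inverted index, no compression,
    partitioning, or pruning.
    """
    index = defaultdict(list)  # word -> ids of earlier records containing it
    results = []
    for rid, words in enumerate(records):
        words = set(words)
        counts = defaultdict(int)  # earlier record id -> shared word count
        # Probe: every posting hit adds one to the overlap with that record.
        for w in words:
            for other in index.get(w, ()):
                counts[other] += 1
        results.extend(
            (other, rid) for other, c in counts.items() if c >= min_overlap
        )
        # Index the record only after probing, so each pair is reported once.
        for w in words:
            index[w].append(rid)
    return sorted(results)


def jaccard_self_join(records, threshold):
    """Return sorted pairs whose Jaccard coefficient is >= threshold (> 0).

    Candidates are pairs sharing at least one word; each is then verified.
    """
    sets = [set(r) for r in records]
    out = []
    for i, j in overlap_self_join(sets, 1):
        shared = len(sets[i] & sets[j])
        if shared / len(sets[i] | sets[j]) >= threshold:
            out.append((i, j))
    return out


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}, {"a", "b", "c", "d"}]
    print(overlap_self_join(data, 2))    # [(0, 1), (0, 3), (1, 3)]
    print(jaccard_self_join(data, 0.6))  # [(0, 3), (1, 3)]
```

In this sketch every posting list is scanned in full for every probe, which is the cost that optimizations of the kind the abstract describes aim to reduce.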


Efficient set joins on similarity predicates
1. Information systems
  1. Information retrieval


Information & Contributors

Information

Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

June 2004

988 pages

ISBN:1581138598

DOI:10.1145/1007568

Conference Chairs: Arnd Christian König (Microsoft Research), Stefan Dessloch (University of Kaiserslautern, Germany)

General Chair: Patrick Valduriez (INRIA, France)

Program Chair: Gerhard Weikum (University of the Saarland)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS04

Sponsor:

SIGMOD

SIGMOD/PODS04: International Conference on Management of Data and Symposium on Principles of Database Systems

June 13 - 18, 2004

Paris, France

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 282

Total Downloads: 1,629

Downloads (Last 12 months): 32

Downloads (Last 6 weeks): 4

Reflects downloads up to 13 Nov 2024

Cited By

Yang L, Ye H, Liu X, Mao Y, Zhang J (2023). Authenticating q-Gram-Based Similarity Search Results for Outsourced String Databases. Mathematics 11:9 (2128). Online publication date: 1-May-2023.
https://doi.org/10.3390/math11092128
Li W, Cheng Z, Deng L, Yang Z, Li A (2023). Accelerating large-scale weighted similarity queries based on external storage. Information Systems 117:C. Online publication date: 1-Jul-2023.
https://dl.acm.org/doi/10.1016/j.is.2023.102213
Zeakis A, Skoutas D, Sacharidis D, Papapetrou O, Koubarakis M (2022). TokenJoin. Proceedings of the VLDB Endowment 16:4 (790-802). Online publication date: 1-Dec-2022.
https://dl.acm.org/doi/10.14778/3574245.3574263
樊沁 (2022). Entity Resolution Algorithm Based on Locality Sensitive Hash and Fuzzy Join. Hans Journal of Data Mining 12:03 (280-296). Online publication date: 2022.
https://doi.org/10.12677/HJDM.2022.123028
Yang Z, Zheng B, Wang X, Li G, Zhou X (2022). minIL: A Simple and Small Index for String Similarity Search with Edit Distance. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (565-577). Online publication date: May-2022.
https://doi.org/10.1109/ICDE53745.2022.00047
Wang Z, Wang S, Li J, Yuan C, Gu R, Huang Y (2022). VSIM. Journal of Parallel and Distributed Computing 158:C (29-46). Online publication date: 22-Apr-2022.
https://dl.acm.org/doi/10.1016/j.jpdc.2021.07.009
Siddiqui T, Chaudhuri S, Narasayya V (2021). COMPARE. Proceedings of the VLDB Endowment 14:11 (2419-2431). Online publication date: 27-Oct-2021.
https://dl.acm.org/doi/10.14778/3476249.3476291
Huang J, Li X, Wang J, Yu X, Zhu L, Zhan Y, Gao Y, Huang C (2021). Cross-Dataset Multiple Organ Segmentation From CT Imagery Using FBP-Derived Domain Adaptation. IEEE Access 9 (25025-25035). Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3055803
Feng Z, Jin C, Kim H, Cui X (2021). Time-aware approximate collective keyword search in traffic networks. Knowledge-Based Systems 229:C. Online publication date: 11-Oct-2021.
https://dl.acm.org/doi/10.1016/j.knosys.2021.107367
Pacífico L, Ribeiro L (2021). Streaming Set Similarity Joins. Enterprise Information Systems (24-42). Online publication date: 1-May-2021.
https://doi.org/10.1007/978-3-030-75418-1_2
