Efficient set joins on similarity predicates

Published: 13 June 2004

Abstract

In this paper we present an efficient, scalable, and general algorithm for performing set joins on predicates involving various similarity measures such as intersect size, Jaccard coefficient, cosine similarity, and edit distance. This expands the existing suite of algorithms for set joins on simpler predicates such as set containment, equality, and non-zero overlap. We start with a basic inverted-index-based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
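
To make the starting point concrete, the following is a minimal sketch of an inverted-index probing self-join under a Jaccard-coefficient predicate. It is an illustration only and not the paper's algorithm: it includes none of the optimizations, partitioning, or index compression described above, and the function names `build_inverted_index` and `jaccard_self_join`, the sample records, and the threshold are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(sets):
    """Map each word to the ids of the sets that contain it."""
    index = defaultdict(list)
    for set_id, words in enumerate(sets):
        for word in words:
            index[word].append(set_id)
    return index


def jaccard_self_join(sets, threshold):
    """Return sorted pairs (i, j), i < j, whose Jaccard coefficient is >= threshold."""
    sets = [frozenset(s) for s in sets]
    index = build_inverted_index(sets)
    results = []
    for i, probe in enumerate(sets):
        # Probe the index with every word of set i and count, for each
        # later set j, how many words the two sets share (intersect size).
        overlap = defaultdict(int)
        for word in probe:
            for j in index[word]:
                if j > i:
                    overlap[j] += 1
        # Only sets with non-zero overlap are candidates; verify the predicate.
        for j, shared in overlap.items():
            union = len(probe) + len(sets[j]) - shared
            if shared / union >= threshold:
                results.append((i, j))
    return sorted(results)


# Hypothetical sample data: each record is a set of words.
records = [
    {"efficient", "set", "joins"},
    {"efficient", "set", "join", "joins"},
    {"text", "compression"},
]
pairs = jaccard_self_join(records, 0.7)
```

In this sketch every pair with at least one shared word is scored, so the cost grows with the lengths of the word lists that are probed. That makes it a plausible baseline for the kind of optimizations the abstract refers to, though it says nothing about what those optimizations are.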

Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
June 2004
988 pages
ISBN: 1581138598
DOI: 10.1145/1007568

Publisher

Association for Computing Machinery, New York, NY, United States

Conference

SIGMOD/PODS04

Overall acceptance rate: 785 of 4,003 submissions, 20%
