Auto-join: joining tables by leveraging transformations

Published: 01 June 2017

Abstract

Traditional equi-join relies solely on string equality comparisons to perform joins. However, in scenarios such as ad-hoc data analysis in spreadsheets, users increasingly need to join tables whose join-columns are from the same semantic domain but use different textual representations, for which transformations are needed before equi-join can be performed. We developed Auto-Join, a system that can automatically search over a rich space of operators to compose a transformation program, whose execution makes input tables equi-join-able. We developed an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently, while ensuring joins succeed with high probability. Our evaluation using real test cases collected from both public web tables and proprietary enterprise tables shows that the proposed system performs the desired transformation joins efficiently and with high quality.
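To make the idea of a transformation join concrete, here is a minimal sketch in Python. It is not the Auto-Join algorithm: the abstract says the system searches for a transformation program automatically, whereas here the transformation (`reorder_name`) is written by hand. The two tables, their column names, and both function names are assumptions made up for illustration.

```python
def reorder_name(value: str) -> str:
    """Hypothetical hand-written transformation: 'Last, First' -> 'First Last'."""
    last, sep, first = value.partition(", ")
    if not sep:
        return value  # no ", " separator found: leave the value unchanged
    return f"{first} {last}"


def transform_join(left_rows, right_rows, left_key, right_key, transform):
    """Apply `transform` to the left join column, then do a plain equi-join."""
    # Hash index over the right table, keyed by its join column.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left_rows:
        # After the transformation, matching is ordinary string equality.
        for match in index.get(transform(row[left_key]), []):
            joined.append({**row, **match})
    return joined


# Illustrative data (assumed): same semantic domain, different representations.
presidents = [
    {"name": "Obama, Barack", "term_start": 2009},
    {"name": "Bush, George W.", "term_start": 2001},
]
birthplaces = [
    {"full_name": "Barack Obama", "born": "Honolulu"},
    {"full_name": "George W. Bush", "born": "New Haven"},
]

# Without a transformation, the equi-join finds no matches.
plain = transform_join(presidents, birthplaces, "name", "full_name", lambda s: s)

# With the transformation applied first, every row joins.
result = transform_join(presidents, birthplaces, "name", "full_name", reorder_name)
```

In this sketch `plain` is empty because no left value is string-equal to a right value, while `result` pairs each president with a birthplace. The hard part that the paper addresses, finding such a transformation without the user writing it, is deliberately left out.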


Published In

Proceedings of the VLDB Endowment, Volume 10, Issue 10
June 2017
180 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
