Finding Related Tables in Data Lakes for Interactive Data Science

Authors:

Zachary G. Ives

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1951 - 1966

https://doi.org/10.1145/3318464.3389726

Published: 31 May 2020

Abstract

Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.
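To make the idea of "joinable" tables concrete, the following is a minimal sketch that scores column pairs by how many distinct values they share. It is an illustration only and not the method described in the paper: the function names, the dict-of-lists table representation, the sample data, and the 0.5 containment threshold are all assumptions chosen for the example.

```python
def column_overlap(query_values, candidate_values):
    """Return (jaccard, containment) for two columns treated as value sets.

    Containment is measured relative to the query column: the fraction of
    its distinct values that also appear in the candidate column.
    """
    q, c = set(query_values), set(candidate_values)
    if not q or not c:
        return 0.0, 0.0
    shared = len(q & c)
    return shared / len(q | c), shared / len(q)


def joinable_columns(query_table, candidate_table, min_containment=0.5):
    """List (query_col, candidate_col, containment) pairs above a threshold.

    Tables are plain dicts mapping column name -> list of values
    (a hypothetical representation used only for this sketch).
    """
    matches = []
    for q_name, q_vals in query_table.items():
        for c_name, c_vals in candidate_table.items():
            _, containment = column_overlap(q_vals, c_vals)
            if containment >= min_containment:
                matches.append((q_name, c_name, containment))
    # Highest-containment pairs first.
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical sample tables.
patients = {"patient_id": [1, 2, 3, 4], "age": [34, 51, 29, 62]}
visits = {"pid": [2, 3, 4, 5, 6], "cost": [120.0, 80.5, 43.0, 99.9, 10.0]}
print(joinable_columns(patients, visits))
```

Here `patient_id` and `pid` share three of the four query values, so that pair is reported as a join candidate even though the column names differ. Comparing every column pair exhaustively like this would not scale to a large data lake; it is shown only to illustrate the notion of value overlap.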

Index Terms

Finding Related Tables in Data Lakes for Interactive Data Science
1. Information systems
  1. Data management systems
    1. Information integration
      1. Federated databases

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs:
David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs:
AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs:
Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
