DOI: 10.1145/3318464.3389726
research-article

Finding Related Tables in Data Lakes for Interactive Data Science

Published: 31 May 2020

Abstract

Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.

Supplementary Material

MP4 File (3318464.3389726.mp4)
Presentation Video

    Published In

    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN:9781450367356
    DOI:10.1145/3318464
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data lakes
    2. interactive data science
    3. notebooks
    4. table search

    Conference

    SIGMOD/PODS '20
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
