Draining the Data Swamp: A Similarity-based Approach

Authors:

Will Brackenbury,

Mainack Mondal,

Aaron J. Elmore,

Michael J. Franklin

HILDA '18: Proceedings of the Workshop on Human-In-the-Loop Data Analytics

Article No.: 13, Pages 1 - 7

https://doi.org/10.1145/3209900.3209911

Published: 10 June 2018

Abstract

While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places increasing strain on users who wish to make use of the data. So-called "data lakes" embrace the storage of data in its natural form, integrating and organizing it in a pay-as-you-go fashion. While this model defers the upfront cost of integration, the result is that data is unusable for discovery or analysis until it is processed. Thus, data scientists are forced to spend significant time and energy on mundane tasks such as data discovery, cleaning, integration, and management. When these tasks are neglected, "data lakes" become "data swamps."

Prior work suggests that purely computational methods for resolving issues with the data discovery and management components are insufficient. Here, we provide evidence to confirm this hypothesis, showing that methods such as automated file clustering are unable to extract the necessary features from repositories to provide useful information to end-user data scientists, or to make effective data management decisions on their behalf. We argue that a combination of frameworks for specifying file similarity and human-in-the-loop interaction is needed to aid automated organization. We propose an initial step here, classifying several dimensions by which items may be considered similar: the data, its origin, and its current characteristics. We initially consider this model in the context of identifying data that can be integrated or managed collectively. We additionally explore how current methods can be used to automate decision making using real-world data repositories and file systems, and suggest how an online user study could be developed to further validate this hypothesis.
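
The three similarity dimensions named in the abstract (the data, its origin, and its current characteristics) could be made concrete along the following lines. This is a minimal sketch under stated assumptions, not the authors' method: the `FileRecord` fields, the word-shingle Jaccard measure, the size-and-directory heuristic, and the equal default weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical description of one file; every field name is an assumption."""
    path: str        # current location in the namespace
    source: str      # origin label, e.g. the tool or user that produced the file
    size_bytes: int  # one example of a "current characteristic"
    text: str        # file contents, assumed here to be plain text


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; treated as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles; shorter texts yield one shingle of all words."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def dimension_scores(a: FileRecord, b: FileRecord) -> dict:
    """Score a pair of files separately on each of the three dimensions."""
    # Data: overlap of the contents themselves.
    data = jaccard(shingles(a.text), shingles(b.text))
    # Origin: assumed here to be an exact match on a source label.
    origin = 1.0 if a.source == b.source else 0.0
    # Current characteristics: size ratio plus shared directory components.
    larger = max(a.size_bytes, b.size_bytes)
    size = 1.0 if larger == 0 else min(a.size_bytes, b.size_bytes) / larger
    dirs_a = set(PurePosixPath(a.path).parent.parts)
    dirs_b = set(PurePosixPath(b.path).parent.parts)
    characteristics = (size + jaccard(dirs_a, dirs_b)) / 2
    return {"data": data, "origin": origin, "characteristics": characteristics}


def combined_score(scores: dict, weights: dict = None) -> float:
    """Weighted average of per-dimension scores; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[name] * scores[name] for name in scores) / total
```

Keeping the per-dimension scores separate, rather than returning only the combined number, is one way a human reviewer could see why two files were judged similar and adjust the weights accordingly.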

Information & Contributors

Information

Published In

HILDA '18: Proceedings of the Workshop on Human-In-the-Loop Data Analytics

June 2018

87 pages

ISBN:9781450358279

DOI:10.1145/3209900

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10, 2018

Houston, TX, USA

Acceptance Rates

Overall Acceptance Rate 28 of 56 submissions, 50%
