DOI: 10.1145/3209900.3209911
research-article

Draining the Data Swamp: A Similarity-based Approach

Published: 10 June 2018

Abstract

While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places growing strain on users who wish to make use of that data. So-called "data lakes" embrace the storage of data in its natural form, integrating and organizing it in a pay-as-you-go fashion. While this model defers the upfront cost of integration, the data remains unusable for discovery or analysis until it is processed. Thus, data scientists are forced to spend significant time and energy on mundane tasks such as data discovery, cleaning, integration, and management. When these tasks are neglected, "data lakes" become "data swamps."
Prior work suggests that purely computational methods for resolving issues with the data discovery and management components are insufficient. Here, we provide evidence to confirm this hypothesis, showing that methods such as automated file clustering are unable to extract the necessary features from repositories to provide useful information to end-user data scientists, or to make effective data management decisions on their behalf. We argue that a combination of frameworks for specifying file similarity and human-in-the-loop interaction is needed to aid automated organization. We propose an initial step here, classifying several dimensions along which items may be considered similar: the data, its origin, and its current characteristics. We initially consider this model in the context of identifying data that can be integrated or managed collectively. We additionally explore how current methods can be used to automate decision making on real-world data repositories and file systems, and suggest how an online user study could be developed to further validate this hypothesis.
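
To make the three similarity dimensions named above more concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a hypothetical `FileRecord` type and scores a pair of files separately on data (Jaccard similarity over character shingles), origin (exact match on an assumed `origin` field), and current characteristics (file extension and size). The field names, the shingle length, and the way the characteristic scores are averaged are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical description of one file in a repository."""
    path: str
    content: str      # the data itself
    origin: str       # assumed field, e.g. the producing source or tool
    extension: str    # a current characteristic
    size_bytes: int   # a current characteristic


def shingles(text: str, k: int = 4) -> set[str]:
    """Return the set of overlapping k-character substrings of text."""
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union.

    Two empty sets are treated as identical (similarity 1.0).
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(x: FileRecord, y: FileRecord) -> dict[str, float]:
    """Score two files independently along the three dimensions."""
    data = jaccard(shingles(x.content), shingles(y.content))
    origin = 1.0 if x.origin == y.origin else 0.0
    same_ext = 1.0 if x.extension == y.extension else 0.0
    larger = max(x.size_bytes, y.size_bytes)
    size_ratio = min(x.size_bytes, y.size_bytes) / larger if larger else 1.0
    # Averaging the two characteristic scores is an arbitrary choice.
    return {
        "data": data,
        "origin": origin,
        "characteristics": (same_ext + size_ratio) / 2,
    }
```

Keeping the three scores separate, rather than collapsing them into one number, leaves room for a human in the loop to decide which dimension matters for a given integration or management decision.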

Published In

HILDA '18: Proceedings of the Workshop on Human-In-the-Loop Data Analytics
June 2018
87 pages
ISBN: 9781450358279
DOI: 10.1145/3209900
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '18

Acceptance Rates

Overall Acceptance Rate 28 of 56 submissions, 50%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 46
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 19 Sep 2024

Citations

Cited By

  • (2024) Managing Personal Identifiable Information in Data Lakes. IEEE Access 12, 32164-32180. DOI: 10.1109/ACCESS.2024.3365042. Online publication date: 2024
  • (2024) Generative mechanisms of AI implementation. Information and Organization 34:2. DOI: 10.1016/j.infoandorg.2024.100503. Online publication date: 1-Jun-2024
  • (2023) OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Query Event Logs. Proceedings of the VLDB Endowment 16:12, 3662-3675. DOI: 10.14778/3611540.3611555. Online publication date: 1-Aug-2023
  • (2023) Data Lakes: A Survey of Functions and Systems. IEEE Transactions on Knowledge and Data Engineering 35:12, 12571-12590. DOI: 10.1109/TKDE.2023.3270101. Online publication date: 1-Dec-2023
  • (2023) Automated Process Capability Analysis for Product Quality Improvements. 2023 IEEE International Conference on Engineering, Technology and Innovation (ICE/ITMC), 1-9. DOI: 10.1109/ICE/ITMC58018.2023.10332307. Online publication date: 19-Jun-2023
  • (2023) DataCockpit: A Toolkit for Data Lake Navigation and Monitoring Utilizing Quality and Usage Information. 2023 IEEE International Conference on Big Data (BigData), 5305-5310. DOI: 10.1109/BigData59044.2023.10386133. Online publication date: 15-Dec-2023
  • (2022) An introduction to data lakes for academic librarians. Information Services and Use 42:3-4, 397-407. DOI: 10.3233/ISU-220176. Online publication date: 1-Jan-2022
  • (2022) An overview about data integration in data lakes. 2022 17th Iberian Conference on Information Systems and Technologies (CISTI), 1-7. DOI: 10.23919/CISTI54924.2022.9820576. Online publication date: 22-Jun-2022
  • (2021) Data Lakehouse - a Novel Step in Analytics Architecture. 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO), 1242-1246. DOI: 10.23919/MIPRO52101.2021.9597091. Online publication date: 27-Sep-2021
  • (2021) VizCommender: Computing Text-Based Similarity in Visualization Repositories for Content-Based Recommendations. IEEE Transactions on Visualization and Computer Graphics 27:2, 495-505. DOI: 10.1109/TVCG.2020.3030387. Online publication date: Feb-2021
