Wisteria: nurturing scalable data cleaning infrastructure

Published: 01 August 2015

Abstract

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations, and driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We overview the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
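The separation of logical operations from physical implementations described in the abstract can be illustrated with a minimal sketch. This is not Wisteria's API: the names `LogicalOp`, `dedup_exact`, `dedup_normalized`, and `replace` are hypothetical, the "sample" is a simple prefix rather than a statistically sound sample, and the analyst feedback step is reduced to a manual swap. It only shows the idea of trying a cheap physical operator on a small sample, replacing it, and then applying the workflow to the full dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = str
PhysicalOp = Callable[[List[Record]], List[Record]]


def dedup_exact(records: List[Record]) -> List[Record]:
    """Cheap physical operator: drop identical strings, keep first occurrence."""
    return list(dict.fromkeys(records))


def dedup_normalized(records: List[Record]) -> List[Record]:
    """Costlier physical operator: compare after lowercasing and collapsing whitespace."""
    seen = set()
    kept = []
    for record in records:
        key = " ".join(record.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


@dataclass
class LogicalOp:
    """Hypothetical logical step (what to do) bound to a swappable physical operator (how)."""
    name: str
    candidates: Dict[str, PhysicalOp]
    chosen: str

    def run(self, records: List[Record]) -> List[Record]:
        return self.candidates[self.chosen](records)

    def replace(self, new_choice: str) -> None:
        """Swap the physical implementation without changing the logical workflow."""
        if new_choice not in self.candidates:
            raise KeyError(f"unknown physical operator: {new_choice}")
        self.chosen = new_choice


dirty = ["UC Berkeley", "uc berkeley ", "MIT", "MIT", "Stanford"]
dedup = LogicalOp(
    name="deduplicate",
    candidates={"exact": dedup_exact, "normalized": dedup_normalized},
    chosen="exact",
)

# Explore on a small sample first (a prefix here, as a stand-in for real sampling).
sample = dirty[:3]
first_pass = dedup.run(sample)   # exact matching misses the case/whitespace variant

# The analyst sees the missed duplicate and swaps in the costlier operator,
# then applies the same logical workflow to the full dataset.
dedup.replace("normalized")
cleaned = dedup.run(dirty)
print(cleaned)
```

Keeping the logical step fixed while the physical operator changes means the rest of the workflow does not need to be rewritten when a different implementation is chosen.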


    Published In

    Proceedings of the VLDB Endowment, Volume 8, Issue 12
    Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
    August 2015
    728 pages
    ISSN: 2150-8097
    Publisher

    VLDB Endowment

    Qualifiers

    • Research-article
