research-article

Wisteria: nurturing scalable data cleaning infrastructure

Authors: Sanjay Krishnan, Michael J. Franklin, Eugene Wu

Editors: Chen Li, Volker Markl

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Pages 2004 - 2007

https://doi.org/10.14778/2824032.2824122

Published: 01 August 2015

Abstract

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations, and driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We overview the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
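
The separation of logical operations from physical implementations described above can be illustrated with a small sketch. This is not Wisteria's actual API: the `Workflow` class, the `add`/`replace`/`run` methods, and the two deduplication functions are hypothetical names chosen for illustration, and the toy records are made up. The idea shown is that the analyst names a logical step once, tries a cheap physical operator on a sample, and then swaps in a different physical operator before running on the full dataset.

```python
import random
from typing import Callable, Dict, List

# All names below are hypothetical; this is not Wisteria's real interface.
Record = Dict[str, str]
PhysicalOp = Callable[[List[Record]], List[Record]]


def dedup_exact(rows: List[Record]) -> List[Record]:
    """Physical operator 1: keep the first row for each exact 'name' value."""
    seen, out = set(), []
    for row in rows:
        if row["name"] not in seen:
            seen.add(row["name"])
            out.append(row)
    return out


def dedup_normalized(rows: List[Record]) -> List[Record]:
    """Physical operator 2: compare names after lowercasing and collapsing spaces."""
    seen, out = set(), []
    for row in rows:
        key = " ".join(row["name"].lower().split())
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


class Workflow:
    """Ordered logical steps, each bound to a replaceable physical operator."""

    def __init__(self) -> None:
        self.plan: List[str] = []                  # logical step names, in order
        self.bindings: Dict[str, PhysicalOp] = {}  # logical name -> physical operator

    def add(self, logical: str, physical: PhysicalOp) -> None:
        self.plan.append(logical)
        self.bindings[logical] = physical

    def replace(self, logical: str, physical: PhysicalOp) -> None:
        # Swap the implementation without changing the logical plan.
        if logical not in self.bindings:
            raise KeyError(f"unknown logical operator: {logical}")
        self.bindings[logical] = physical

    def run(self, rows: List[Record]) -> List[Record]:
        for logical in self.plan:
            rows = self.bindings[logical](rows)
        return rows


# Made-up toy data containing near-duplicate names.
data = [
    {"name": "Acme Corp"},
    {"name": "acme  corp"},
    {"name": "Acme Corp"},
    {"name": "Initech"},
]

wf = Workflow()
wf.add("deduplicate", dedup_exact)

# Explore on a small sample first, then refine the physical choice.
sample = random.Random(0).sample(data, 3)
sample_result = wf.run(sample)

wf.replace("deduplicate", dedup_normalized)
full_result = wf.run(data)
```

Because the plan refers to steps only by their logical names, the replacement touches a single binding and the rest of the workflow is unchanged. A real system would also have to choose among operators by cost and accuracy, which this sketch leaves out.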

Index Terms: Information systems > Data management systems

Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

August 2015

728 pages

ISSN: 2150-8097

Editors: Chen Li (University of California, Irvine), Volker Markl (TU Berlin)

Publisher: VLDB Endowment
