ActiveClean: interactive data cleaning for statistical modeling

Authors: Sanjay Krishnan, Michael J. Franklin, Ken Goldberg

Proceedings of the VLDB Endowment, Volume 9, Issue 12

Pages 948 - 959

https://doi.org/10.14778/2994509.2994514

Published: 01 August 2016

Abstract

Analysts often clean dirty data iteratively: cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs) with both real and synthetic errors. The results show that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
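
The iterative loop described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the model is least-squares linear regression (one convex loss), records are prioritized by the norm of their gradient under the current model, that priority is mixed half-and-half with uniform sampling to keep the importance weights bounded, and `iterative_clean`, `clean_fn`, the step size, and the synthetic data are all hypothetical names and values chosen for the example.

```python
import math
import random

def predict(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def gradient(theta, x, y):
    """Gradient of the squared loss 0.5 * (theta . x - y)^2 for one record."""
    err = predict(theta, x) - y
    return [err * xi for xi in x]

def grad_norm(theta, record):
    return math.sqrt(sum(g * g for g in gradient(theta, *record)))

def iterative_clean(records, clean_fn, theta, rounds=30, batch=5, step=0.2, seed=0):
    """Alternate between cleaning a small sample and taking one gradient step.

    clean_fn(i) is a hypothetical oracle returning the cleaned version of record i.
    """
    rng = random.Random(seed)
    data = list(records)
    n, d = len(data), len(theta)
    theta = list(theta)
    cleaned = set()
    for _ in range(rounds):
        dirty_idx = [i for i in range(n) if i not in cleaned]
        m = len(dirty_idx)
        # Average gradient over the records cleaned in earlier rounds.
        g_clean = [0.0] * d
        for i in cleaned:
            for j, g in enumerate(gradient(theta, *data[i])):
                g_clean[j] += g / len(cleaned)
        g_sample = [0.0] * d
        if m > 0:
            # Assumed priority: gradient norm under the current model,
            # mixed with uniform sampling so every weight is at most 2.
            norms = [grad_norm(theta, data[i]) for i in dirty_idx]
            total = sum(norms)
            probs = [0.5 / m + 0.5 * (w / total if total > 0 else 1.0 / m)
                     for w in norms]
            picks = rng.choices(range(m), weights=probs, k=batch)
            for k in picks:
                i = dirty_idx[k]
                data[i] = clean_fn(i)  # clean the sampled record
                cleaned.add(i)
                weight = 1.0 / (m * probs[k])  # importance weight
                for j, g in enumerate(gradient(theta, *data[i])):
                    g_sample[j] += weight * g / batch
        c = n - m  # number of records that were clean at the start of the round
        theta = [t - step * ((c / n) * gc + (m / n) * gs)
                 for t, gc, gs in zip(theta, g_clean, g_sample)]
    return theta, cleaned

# Synthetic example: y = 1 + 2u, with every fifth label corrupted by +5.
truth = [((1.0, u / 99), 1.0 + 2.0 * u / 99) for u in range(100)]
dirty = [(x, y + 5.0) if i % 5 == 0 else (x, y) for i, (x, y) in enumerate(truth)]
theta, cleaned = iterative_clean(dirty, lambda i: truth[i], [0.0, 0.0])
print(theta, len(cleaned))
```

In this sketch only cleaned records ever contribute to a gradient step; the dirty values influence which records are sampled, not the update itself.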

Published In

Proceedings of the VLDB Endowment Volume 9, Issue 12

August 2016

345 pages

ISSN:2150-8097

Editors: Surajit Chaudhuri (Microsoft Research), Jayant Haritsa (I.I.Sc. Bangalore)

Publisher

VLDB Endowment

Qualifiers

Research-article
