research-article

ActiveClean: interactive data cleaning for statistical modeling

Published: 01 August 2016

Abstract

Analysts often clean dirty data iteratively: cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs) with both real and synthetic errors. The results show that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
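
As a rough illustration of the clean-some, train-some loop the abstract describes, the following minimal sketch fits a one-parameter linear regression (a convex loss model) while cleaning a few records per round. It is not the paper's algorithm: the function names (`mean_gradient`, `clean_and_train`), the `clean_fn` oracle, the gradient-magnitude prioritization, and the toy data are all assumptions made for illustration. The sketch also omits any importance weighting, so it makes no claim about convergence guarantees.

```python
import random


def mean_gradient(theta, records):
    """Mean gradient of the squared loss 0.5 * (theta * x - y) ** 2 over (x, y) records."""
    return sum((theta * x - y) * x for x, y in records) / len(records)


def clean_and_train(dirty, clean_fn, batch_size=5, rounds=50, step=1.0, seed=0):
    """Alternate between cleaning a small batch and taking one gradient step."""
    rng = random.Random(seed)
    theta = 0.0
    cleaned = {}  # record index -> cleaned (x, y) pair
    for _ in range(rounds):
        remaining = [i for i in range(len(dirty)) if i not in cleaned]
        if remaining:
            # Assumed heuristic: favor records with a large gradient under the
            # current model, as a stand-in for "likely to affect the results".
            weights = [abs(mean_gradient(theta, [dirty[i]])) + 1e-9 for i in remaining]
            for i in set(rng.choices(remaining, weights=weights, k=batch_size)):
                cleaned[i] = clean_fn(dirty[i])
        # Update the model using only the records cleaned so far.
        theta -= step * mean_gradient(theta, list(cleaned.values()))
    return theta, len(cleaned)


# Toy data: the true relationship is y = 2x; every fourth label is corrupted to 0.
xs = [i / 20 for i in range(1, 21)]
dirty = [(x, 0.0 if k % 4 == 0 else 2 * x) for k, x in enumerate(xs)]
theta, n_cleaned = clean_and_train(dirty, clean_fn=lambda rec: (rec[0], 2 * rec[0]))
print(round(theta, 3), n_cleaned)
```

In this toy setting the hypothetical `clean_fn` restores the true label exactly, so the fitted slope approaches 2 once enough records have been cleaned. With a real cleaning procedure, the quality of the result would depend on how records are chosen and how the gradient estimate is corrected for that choice.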


    Information & Contributors

    Information

    Published In

    Proceedings of the VLDB Endowment  Volume 9, Issue 12
    August 2016
    345 pages
    ISSN:2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 August 2016
    Published in PVLDB Volume 9, Issue 12

    Qualifiers

    • Research-article


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 288
    • Downloads (Last 6 weeks): 27
    Reflects downloads up to 09 Nov 2024


    Citations

    Cited By

    • (2024) Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines. Proceedings of the VLDB Endowment 17:11 (3072-3081). DOI: 10.14778/3681954.3681984. Online publication date: 1-Jul-2024
    • (2024) Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment 17:10 (2617-2630). DOI: 10.14778/3675034.3675051. Online publication date: 1-Jun-2024
    • (2024) BUNNI: Learning Repair Actions in Rule-driven Data Cleaning. Journal of Data and Information Quality 16:2 (1-31). DOI: 10.1145/3665930. Online publication date: 25-May-2024
    • (2024) Certain and Approximately Certain Models for Statistical Learning. Proceedings of the ACM on Management of Data 2:3 (1-25). DOI: 10.1145/3654929. Online publication date: 30-May-2024
    • (2024) Cleenex: Support for User Involvement during an Iterative Data Cleaning Process. Journal of Data and Information Quality 16:1 (1-26). DOI: 10.1145/3648476. Online publication date: 19-Mar-2024
    • (2024) Humans-in-the-loop. Engineering Applications of Artificial Intelligence 132:C. DOI: 10.1016/j.engappai.2024.107875. Online publication date: 1-Jun-2024
    • (2024) Data cleaning and machine learning: a systematic literature review. Automated Software Engineering 31:2. DOI: 10.1007/s10515-024-00453-w. Online publication date: 1-Nov-2024
    • (2024) Streaming data cleaning based on speed change. The VLDB Journal — The International Journal on Very Large Data Bases 33:1 (1-24). DOI: 10.1007/s00778-023-00796-y. Online publication date: 1-Jan-2024
    • (2024) Corrector LSTM: built-in training data correction for improved time-series forecasting. Neural Computing and Applications 36:26 (16213-16231). DOI: 10.1007/s00521-024-09962-x. Online publication date: 1-Sep-2024
    • (2023) Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets. Proceedings of the VLDB Endowment 16:11 (3349-3362). DOI: 10.14778/3611479.3611531. Online publication date: 24-Aug-2023
