research-article

A sample-and-clean framework for fast and accurate query processing on dirty data

Authors:

Sanjay Krishnan,

Michael J. Franklin,

Tova MiloAuthors Info & Claims

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 469 - 480

https://doi.org/10.1145/2588555.2610505

Published: 18 June 2014 Publication History

Abstract

In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all, and in fact, exacerbates answer quality problems by introducing sampling error. In this paper, we explore an intriguing opportunity. That is, we explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers. We derive confidence intervals as a function of sample size and show how our approach addresses error bias. We evaluate the Sample-and-Clean framework using data from three sources: the TPC-H benchmark with synthetic noise, a subset of the Microsoft academic citation index and a sensor data set. Our results are consistent with the theoretical confidence intervals and suggest that the Sample-and-Clean framework can produce significant improvements in accuracy compared to query processing without data cleaning and speed compared to data cleaning without sampling.

References

[1]

S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In SIGMOD Conference, pages 487--498, 2000.

Digital Library

[2]

S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In SIGMOD Conference, pages 574--576, 1999.

Digital Library

[3]

S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29--42, 2013.

Digital Library

[4]

A. Aldroubi. Non-uniform weighted average sampling and reconstruction in shift-invariant and wavelet spaces. Applied and Computational Harmonic Analysis, 13(2):151--161, 2002.

[5]

B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD Conference, pages 539--550, 2003.

Digital Library

[6]

Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, pages 1--10. Springer, 2002.

Digital Library

[7]

V. Barnett. Sample survey. Principles and method, 3, 1991.

[8]

K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD Conference, pages 199--210, 2007.

Digital Library

[9]

M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39--48, 2003.

Digital Library

[10]

M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In PODS, pages 268--279, 2000.

Digital Library

[11]

S. Chaudhuri, G. Das, and V. R. Narasayya. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 32(2):9, 2007.

Digital Library

[12]

P. Christen. Febrl: a freely available record linkage system with a graphical user interface. In HDKM, pages 17--25, 2008.

Digital Library

[13]

P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24(9):1537--1555, 2012.

Digital Library

[14]

T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears. Online aggregation and continuous query support in mapreduce. In SIGMOD Conference, pages 1115--1118, 2010.

Digital Library

[15]

J. Considine, F. Li, G. Kollios, and J. W. Byers. Approximate aggregation techniques for sensor databases. In ICDE, pages 449--460, 2004.

Digital Library

[16]

G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1--294, 2012.

Digital Library

[17]

M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD Conference, pages 541--552, 2013.

Digital Library

[18]

T. Dasu and T. Johnson. Exploratory data mining and data cleaning. Wiley, 2003.

Digital Library

[19]

DataWrangler. http://vis.stanford.edu/wrangler.

[20]

G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469--478, 2012.

Digital Library

[21]

X. L. Dong and D. Srivastava. Big data integration. PVLDB, 6(11):1188--1189, 2013.

Digital Library

[22]

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007.

Digital Library

[23]

W. Fan and F. Geerts. Foundations of data quality management. Synthesis Lectures on Data Management, 2012.

Digital Library

[24]

W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1):173--184, 2010.

Digital Library

[25]

Full version. http://goo.gl/SPNP1R.

[26]

M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001.

Digital Library

[27]

P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In VLDB, pages 311--322, 1995.

Digital Library

[28]

G. J. Hahn and W. Q. Meeker. Statistical intervals: a guide for practitioners, volume 328. Wiley. com, 2011.

[29]

M. H. Hansen. Some history and reminiscences on survey sampling. Statistical Science, 2(2):180--190, 1987.

[30]

D. Harnik, O. Margalit, D. Naor, D. Sotnikov, and G. Vernik. Estimation of deduplication ratios in large data sets. In MSST, pages 1--11, 2012.

[31]

J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.

[32]

J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD Conference, pages 171--182, 1997.

Digital Library

[33]

D. V. Hinkley. Bootstrap methods. Journal of the Royal Statistical Society. Series B, pages 321--337, 1988.

[34]

J. Jacod and A. N. Shiryaev. Limit theorems for stochastic processes, volume 288. Springer-Verlag Berlin, 1987.

[35]

S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom. Declarative support for sensor data cleaning. In Pervasive, pages 83--100, 2006.

Digital Library

[36]

S. R. Jeffery, M. J. Franklin, and A. Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD Conference, pages 847--860, 2008.

Digital Library

[37]

S. Jukna. Analysis of boolean functions. In Boolean Function Complexity, pages 55--77. Springer, 2012.

[38]

G. Kalton. Introduction to survey sampling, volume 7. SAGE Publications, Incorporated, 1983.

[39]

H. Kopcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 3(1):484--493, 2010.

Digital Library

[40]

S. Kullback. Information theory and statistics. Courier Dover Publications, 1997.

[41]

J. Lee. U-statistics: Theory and Practice. CRC Press, 1990.

[42]

J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113--119, 1996.

[43]

S. L. Lohr. Sampling: design and analysis. Cengage Learning, 2010.

[44]

C. McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148--188, 1989.

[45]

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21:1087, 1953.

[46]

OpenRefine. http://openrefine.org.

[47]

N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11):1135--1145, 2011.

Digital Library

[48]

H. Park and J. Widom. Crowdfill: Collecting structured data from the crowd. Technical report, Stanford University.

[49]

C.-E. Sarndal, B. Swensson, and J. H. Wretman. Model assisted survey sampling. Springer, 2003.

[50]

L. Sidirourgos, M. L. Kersten, and P. A. Boncz. SciBORQ: scientific data management with bounds on runtime and quality. In CIDR, pages 296--301, 2011.

[51]

N. Swartz. Gartner warns firms of 'dirty data'. Information Management Journal, 41(3), 2007.

[52]

B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013.

Digital Library

[53]

R. Valliant, A. H. Dorfman, and R. M. Royall. Finite population sampling and inference: a prediction approach. Wiley New York, 2000.

[54]

J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483--1494, 2012.

Digital Library

[55]

J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD Conference, pages 229--240, 2013.

Digital Library

[56]

H. Weisberg. The Total Survey Error Approach: A Guide to the New Science of Survey Research. University of Chicago Press, 2009.

[57]

S. Wu, S. Jiang, B. C. Ooi, and K.-L. Tan. Distributed online aggregation. PVLDB, 2(1):443--454, 2009.

Digital Library

[58]

M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279--289, 2011.

Digital Library

Cited By

Marchesin SSilvello G(2024)Efficient and Reliable Estimation of Knowledge Graph AccuracyProceedings of the VLDB Endowment10.14778/3665844.366586517:9(2392-2403)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.14778/3665844.3665865
Xiong KXu XFu SWeng DWang YWu Y(2024)JsonCurer: Data Quality Management for JSON Based on an Aggregated SchemaIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.338855630:6(3008-3021)Online publication date: Jun-2024
https://doi.org/10.1109/TVCG.2024.3388556
Schreiber RAmsterdamer Y(2024)CleanEr: Interactive, Query-Guided Error Mitigation for Data Cleaning Systems2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00418(5421-5424)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00418
Show More Cited By

Index Terms

A sample-and-clean framework for fast and accurate query processing on dirty data
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

A Taxonomy of Dirty Data

Today large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining,...
An Enhanced Technique to Clean Data in the Data Warehouse
DESE '11: Proceedings of the 2011 Developments in E-systems Engineering

Data quality is a critical factor for the success of data warehousing projects. Improving the quality of data is important in data warehouse, because it is used in the process of decision support, which requires accurate data. There are many errors and ...
A Comparative Study of Data Cleaning Tools

In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs:
Curtis Dyreson
Utah State University, USA
,
Feifei Li
University of Utah, USA
,
Program Chair:
M. Tamer Özsu
University of Waterloo, Canada

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Division of Computing and Communication Foundations
Defense Advanced Research Projects Agency
European Research Council
Israel Ministry of Science
LBNL

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Utah, Snowbird, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

113
Total Citations
View Citations
1,119
Total Downloads

Downloads (Last 12 months)77
Downloads (Last 6 weeks)8

Reflects downloads up to 22 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Marchesin SSilvello G(2024)Efficient and Reliable Estimation of Knowledge Graph AccuracyProceedings of the VLDB Endowment10.14778/3665844.366586517:9(2392-2403)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.14778/3665844.3665865
Xiong KXu XFu SWeng DWang YWu Y(2024)JsonCurer: Data Quality Management for JSON Based on an Aggregated SchemaIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.338855630:6(3008-3021)Online publication date: Jun-2024
https://doi.org/10.1109/TVCG.2024.3388556
Schreiber RAmsterdamer Y(2024)CleanEr: Interactive, Query-Guided Error Mitigation for Data Cleaning Systems2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00418(5421-5424)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00418
Drien OFreiman MAmarilli AAmsterdamer Y(2023)Query-Guided Resolution in Uncertain DatabasesProceedings of the ACM on Management of Data10.1145/35893251:2(1-27)Online publication date: 20-Jun-2023
https://doi.org/10.1145/3589325
Obraczka DSaeedi AChristen VRahm E(2023)Big Data Integration for Industry 4.0Digital Transformation10.1007/978-3-662-65004-2_10(247-268)Online publication date: 3-Feb-2023
https://doi.org/10.1007/978-3-662-65004-2_10
Chai CWang JLuo YNiu ZLi G(2022)Data Management for Machine Learning: A SurveyIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3148237(1-1)Online publication date: 2022
https://doi.org/10.1109/TKDE.2022.3148237
Sisodiya SPrabhu S(2022)Analysis of Data cleaning Framework: Technological advancements2022 International Interdisciplinary Humanitarian Conference for Sustainability (IIHC)10.1109/IIHC55949.2022.10059820(36-40)Online publication date: 18-Nov-2022
https://doi.org/10.1109/IIHC55949.2022.10059820
Riondato M(2022)Sampling-Based Data Mining Algorithms: Modern Techniques and Case StudiesMachine Learning and Knowledge Discovery in Databases10.1007/978-3-662-44845-8_48(516-519)Online publication date: 10-Mar-2022
https://dl.acm.org/doi/10.1007/978-3-662-44845-8_48
Papadakis GIoannou EThanos EPalpanas T(2021)The Four Generations of Entity ResolutionSynthesis Lectures on Data Management10.2200/S01067ED1V01Y202012DTM06416:2(1-170)Online publication date: 15-Mar-2021
https://doi.org/10.2200/S01067ED1V01Y202012DTM064
Rezig EOuzzani MAref WElmagarmid AMahmood AStonebraker M(2021)HorizonProceedings of the VLDB Endowment10.14778/3476249.347630114:11(2546-2554)Online publication date: 1-Jul-2021
https://dl.acm.org/doi/10.14778/3476249.3476301
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents