Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2588555.2610505acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

A sample-and-clean framework for fast and accurate query processing on dirty data

Published: 18 June 2014 Publication History

Abstract

In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all, and in fact, exacerbates answer quality problems by introducing sampling error. In this paper, we explore an intriguing opportunity. That is, we explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers. We derive confidence intervals as a function of sample size and show how our approach addresses error bias. We evaluate the Sample-and-Clean framework using data from three sources: the TPC-H benchmark with synthetic noise, a subset of the Microsoft academic citation index and a sensor data set. Our results are consistent with the theoretical confidence intervals and suggest that the Sample-and-Clean framework can produce significant improvements in accuracy compared to query processing without data cleaning and speed compared to data cleaning without sampling.

References

[1]
S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In SIGMOD Conference, pages 487--498, 2000.
[2]
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In SIGMOD Conference, pages 574--576, 1999.
[3]
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29--42, 2013.
[4]
A. Aldroubi. Non-uniform weighted average sampling and reconstruction in shift-invariant and wavelet spaces. Applied and Computational Harmonic Analysis, 13(2):151--161, 2002.
[5]
B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD Conference, pages 539--550, 2003.
[6]
Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, pages 1--10. Springer, 2002.
[7]
V. Barnett. Sample survey. Principles and method, 3, 1991.
[8]
K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD Conference, pages 199--210, 2007.
[9]
M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39--48, 2003.
[10]
M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In PODS, pages 268--279, 2000.
[11]
S. Chaudhuri, G. Das, and V. R. Narasayya. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 32(2):9, 2007.
[12]
P. Christen. Febrl: a freely available record linkage system with a graphical user interface. In HDKM, pages 17--25, 2008.
[13]
P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24(9):1537--1555, 2012.
[14]
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears. Online aggregation and continuous query support in mapreduce. In SIGMOD Conference, pages 1115--1118, 2010.
[15]
J. Considine, F. Li, G. Kollios, and J. W. Byers. Approximate aggregation techniques for sensor databases. In ICDE, pages 449--460, 2004.
[16]
G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1--294, 2012.
[17]
M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD Conference, pages 541--552, 2013.
[18]
T. Dasu and T. Johnson. Exploratory data mining and data cleaning. Wiley, 2003.
[19]
DataWrangler. http://vis.stanford.edu/wrangler.
[20]
G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469--478, 2012.
[21]
X. L. Dong and D. Srivastava. Big data integration. PVLDB, 6(11):1188--1189, 2013.
[22]
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007.
[23]
W. Fan and F. Geerts. Foundations of data quality management. Synthesis Lectures on Data Management, 2012.
[24]
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1):173--184, 2010.
[25]
Full version. http://goo.gl/SPNP1R.
[26]
M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001.
[27]
P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In VLDB, pages 311--322, 1995.
[28]
G. J. Hahn and W. Q. Meeker. Statistical intervals: a guide for practitioners, volume 328. Wiley. com, 2011.
[29]
M. H. Hansen. Some history and reminiscences on survey sampling. Statistical Science, 2(2):180--190, 1987.
[30]
D. Harnik, O. Margalit, D. Naor, D. Sotnikov, and G. Vernik. Estimation of deduplication ratios in large data sets. In MSST, pages 1--11, 2012.
[31]
J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.
[32]
J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD Conference, pages 171--182, 1997.
[33]
D. V. Hinkley. Bootstrap methods. Journal of the Royal Statistical Society. Series B, pages 321--337, 1988.
[34]
J. Jacod and A. N. Shiryaev. Limit theorems for stochastic processes, volume 288. Springer-Verlag Berlin, 1987.
[35]
S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom. Declarative support for sensor data cleaning. In Pervasive, pages 83--100, 2006.
[36]
S. R. Jeffery, M. J. Franklin, and A. Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD Conference, pages 847--860, 2008.
[37]
S. Jukna. Analysis of boolean functions. In Boolean Function Complexity, pages 55--77. Springer, 2012.
[38]
G. Kalton. Introduction to survey sampling, volume 7. SAGE Publications, Incorporated, 1983.
[39]
H. Kopcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 3(1):484--493, 2010.
[40]
S. Kullback. Information theory and statistics. Courier Dover Publications, 1997.
[41]
J. Lee. U-statistics: Theory and Practice. CRC Press, 1990.
[42]
J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113--119, 1996.
[43]
S. L. Lohr. Sampling: design and analysis. Cengage Learning, 2010.
[44]
C. McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148--188, 1989.
[45]
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21:1087, 1953.
[46]
OpenRefine. http://openrefine.org.
[47]
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11):1135--1145, 2011.
[48]
H. Park and J. Widom. Crowdfill: Collecting structured data from the crowd. Technical report, Stanford University.
[49]
C.-E. Sarndal, B. Swensson, and J. H. Wretman. Model assisted survey sampling. Springer, 2003.
[50]
L. Sidirourgos, M. L. Kersten, and P. A. Boncz. SciBORQ: scientific data management with bounds on runtime and quality. In CIDR, pages 296--301, 2011.
[51]
N. Swartz. Gartner warns firms of 'dirty data'. Information Management Journal, 41(3), 2007.
[52]
B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013.
[53]
R. Valliant, A. H. Dorfman, and R. M. Royall. Finite population sampling and inference: a prediction approach. Wiley New York, 2000.
[54]
J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483--1494, 2012.
[55]
J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD Conference, pages 229--240, 2013.
[56]
H. Weisberg. The Total Survey Error Approach: A Guide to the New Science of Survey Research. University of Chicago Press, 2009.
[57]
S. Wu, S. Jiang, B. C. Ooi, and K.-L. Tan. Distributed online aggregation. PVLDB, 2(1):443--454, 2009.
[58]
M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279--289, 2011.

Cited By

View all
  • (2024)Efficient and Reliable Estimation of Knowledge Graph AccuracyProceedings of the VLDB Endowment10.14778/3665844.366586517:9(2392-2403)Online publication date: 1-May-2024
  • (2024)JsonCurer: Data Quality Management for JSON Based on an Aggregated SchemaIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.338855630:6(3008-3021)Online publication date: Jun-2024
  • (2024)CleanEr: Interactive, Query-Guided Error Mitigation for Data Cleaning Systems2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00418(5421-5424)Online publication date: 13-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2014

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. aggregate query
  2. data cleaning
  3. dirty data
  4. sampling

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS'14
Sponsor:

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)77
  • Downloads (Last 6 weeks)8
Reflects downloads up to 22 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Efficient and Reliable Estimation of Knowledge Graph AccuracyProceedings of the VLDB Endowment10.14778/3665844.366586517:9(2392-2403)Online publication date: 1-May-2024
  • (2024)JsonCurer: Data Quality Management for JSON Based on an Aggregated SchemaIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.338855630:6(3008-3021)Online publication date: Jun-2024
  • (2024)CleanEr: Interactive, Query-Guided Error Mitigation for Data Cleaning Systems2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00418(5421-5424)Online publication date: 13-May-2024
  • (2023)Query-Guided Resolution in Uncertain DatabasesProceedings of the ACM on Management of Data10.1145/35893251:2(1-27)Online publication date: 20-Jun-2023
  • (2023)Big Data Integration for Industry 4.0Digital Transformation10.1007/978-3-662-65004-2_10(247-268)Online publication date: 3-Feb-2023
  • (2022)Data Management for Machine Learning: A SurveyIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3148237(1-1)Online publication date: 2022
  • (2022)Analysis of Data cleaning Framework: Technological advancements2022 International Interdisciplinary Humanitarian Conference for Sustainability (IIHC)10.1109/IIHC55949.2022.10059820(36-40)Online publication date: 18-Nov-2022
  • (2022)Sampling-Based Data Mining Algorithms: Modern Techniques and Case StudiesMachine Learning and Knowledge Discovery in Databases10.1007/978-3-662-44845-8_48(516-519)Online publication date: 10-Mar-2022
  • (2021)The Four Generations of Entity ResolutionSynthesis Lectures on Data Management10.2200/S01067ED1V01Y202012DTM06416:2(1-170)Online publication date: 15-Mar-2021
  • (2021)HorizonProceedings of the VLDB Endowment10.14778/3476249.347630114:11(2546-2554)Online publication date: 1-Jul-2021
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media