DOI: 10.1145/2882903.2882909
research-article
Public Access

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Published: 14 June 2016

Abstract

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results?
In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
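
To make the key idea concrete, the following is a minimal sketch of how overlap between sources can drive an estimate of missing items. It is not the estimator developed in the paper: it assumes the classical bias-corrected Chao1 species estimator for the item count and plain mean substitution for the missing values, and it treats every source as an independent sample of the same population. The function names `chao1_estimate` and `naive_sum_estimate` and the toy sources are hypothetical.

```python
from collections import Counter


def chao1_estimate(observations):
    """Estimate the total number of distinct items (seen and unseen).

    Uses the bias-corrected Chao1 formula c + f1*(f1-1) / (2*(f2+1)),
    where c is the number of distinct observed items, f1 the number of
    items seen exactly once and f2 the number seen exactly twice.
    """
    counts = Counter(observations)
    c = len(counts)
    f1 = sum(1 for k in counts.values() if k == 1)
    f2 = sum(1 for k in counts.values() if k == 2)
    return c + f1 * (f1 - 1) / (2 * (f2 + 1))


def naive_sum_estimate(sources):
    """Estimate SUM over the full population from overlapping sources.

    `sources` is a list of dicts mapping item id -> value. Assumption:
    unseen items have the same mean value as the observed ones.
    """
    # Every mention of an item in any source counts as one observation.
    observations = [item for src in sources for item in src]
    # The integrated data set: one value per distinct item.
    integrated = {}
    for src in sources:
        integrated.update(src)
    observed_sum = sum(integrated.values())
    mean_value = observed_sum / len(integrated)
    n_missing = chao1_estimate(observations) - len(integrated)
    return observed_sum + n_missing * mean_value


# Three hypothetical sources with partial overlap.
source_a = {"a": 10, "b": 20, "c": 30}
source_b = {"b": 20, "c": 30, "d": 40}
source_c = {"c": 30, "e": 50}

print(naive_sum_estimate([source_a, source_b, source_c]))
```

Items that appear in only one source (singletons) push the estimate up, while items that appear in many sources suggest the integrated data is close to complete. Mean substitution ignores any correlation between how likely an item is to be observed and its value, so the result is best read as a rough baseline.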

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. aggregate query
  2. data integration
  3. open-world assumption
  4. species estimation
  5. unknown unknowns

Qualifiers

  • Research-article

Funding Sources

  • NSF IIS
  • Air Force

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2021) ReStore - Neural Data Completion for Relational Databases. Proceedings of the 2021 International Conference on Management of Data, pages 710-722. DOI: 10.1145/3448016.3457264. Online publication date: 9-Jun-2021.
  • (2020) Thrifty Query Execution via Incrementability. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1241-1256. DOI: 10.1145/3318464.3389756. Online publication date: 11-Jun-2020.
  • (2019) Distribution-Aware Crowdsourced Entity Collection. IEEE Transactions on Knowledge and Data Engineering 31:7, pages 1312-1326. DOI: 10.1109/TKDE.2016.2611509. Online publication date: 1-Jul-2019.
  • (2018) Northstar. Proceedings of the VLDB Endowment 11:12, pages 2150-2164. DOI: 10.14778/3229863.3240493. Online publication date: 1-Aug-2018.
  • (2018) CYADB. Proceedings of the VLDB Endowment 11:12, pages 2038-2041. DOI: 10.14778/3229863.3236254. Online publication date: 1-Aug-2018.
  • (2018) Estimating the Impact of Unknown Unknowns on Aggregate Query Results. ACM Transactions on Database Systems 43:1, pages 1-37. DOI: 10.1145/3167970. Online publication date: 6-Mar-2018.
  • (2018) Incentive-Based Entity Collection Using Crowdsourcing. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 341-352. DOI: 10.1109/ICDE.2018.00039. Online publication date: Apr-2018.
  • (2018) Crowdsourced Operators. Crowdsourced Data Management, pages 97-154. DOI: 10.1007/978-981-10-7847-7_7. Online publication date: 13-Oct-2018.
  • (2017) A data quality metric (DQM). Proceedings of the VLDB Endowment 10:10, pages 1094-1105. DOI: 10.14778/3115404.3115414. Online publication date: 1-Jun-2017.
  • (2017) What you see is not what you get! Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, pages 1-5. DOI: 10.1145/3077257.3077266. Online publication date: 14-May-2017.
