Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Authors: Michael Lind Mortensen, Carsten Binnig, Tim Kraska

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 861 - 876

https://doi.org/10.1145/2882903.2882909

Published: 14 June 2016

Abstract

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results?

In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
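
To make the overlap idea concrete, the following is a minimal sketch and not the paper's own method: the abstract does not say which estimators are used. It assumes Chao's 1984 nonparametric species estimator (often called Chao1) for the number of unseen items, treats every mention of an item by a source as one observation, and fills in unseen items with the observed mean value. The function names `chao1_unseen` and `estimate_sum`, and the dictionary-per-source input format, are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def chao1_unseen(sources: Iterable[Iterable[Hashable]]) -> float:
    """Estimate how many distinct items no source has observed (Chao1)."""
    # Count in how many sources each distinct item appears.
    occurrences = Counter(item for src in sources for item in set(src))
    f1 = sum(1 for c in occurrences.values() if c == 1)  # items seen in exactly one source
    f2 = sum(1 for c in occurrences.values() if c == 2)  # items seen in exactly two sources
    if f2 == 0:
        # Bias-corrected form, which avoids dividing by zero.
        return f1 * (f1 - 1) / 2.0
    return f1 * f1 / (2.0 * f2)


def estimate_sum(sources: List[Dict[Hashable, float]]) -> Dict[str, float]:
    """Correct an observed SUM for unseen items (assumption: mean substitution)."""
    merged: Dict[Hashable, float] = {}
    for src in sources:
        merged.update(src)  # assumes sources agree on each item's value
    observed_sum = sum(merged.values())
    unseen = chao1_unseen(src.keys() for src in sources)
    mean_value = observed_sum / len(merged) if merged else 0.0
    return {
        "observed_sum": observed_sum,
        "estimated_unseen": unseen,
        "estimated_sum": observed_sum + unseen * mean_value,
    }


if __name__ == "__main__":
    a = {"A": 10.0, "B": 20.0, "C": 30.0}
    b = {"B": 20.0, "C": 30.0, "D": 40.0}
    c = {"C": 30.0, "E": 50.0}
    print(estimate_sum([a, b, c]))
```

In this sketch, many items seen by only one source (a large `f1` relative to `f2`) signal low overlap and therefore a large estimated number of missing items. Mean substitution ignores any correlation between an item's value and its chance of being observed, so the corrected sum is only a rough indication.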

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN: 9781450335317

DOI: 10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF IIS
Air Force

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

Hilprecht B, Binnig C, Li G, Li Z, Idreos S, Srivastava D (2021). ReStore - Neural Data Completion for Relational Databases. Proceedings of the 2021 International Conference on Management of Data, pages 710-722. Online publication date: 9-Jun-2021.
https://dl.acm.org/doi/10.1145/3448016.3457264

Tang D, Shang Z, Elmore A, Krishnan S, Franklin M, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). Thrifty Query Execution via Incrementability. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1241-1256. Online publication date: 11-Jun-2020.
https://dl.acm.org/doi/10.1145/3318464.3389756

Fan J, Wei Z, Zhang D, Yang J, Du X (2019). Distribution-Aware Crowdsourced Entity Collection. IEEE Transactions on Knowledge and Data Engineering, 31(7), pages 1312-1326. Online publication date: 1-Jul-2019.
https://doi.org/10.1109/TKDE.2016.2611509

Kraska T (2018). Northstar. Proceedings of the VLDB Endowment, 11(12), pages 2150-2164. Online publication date: 1-Aug-2018.
https://dl.acm.org/doi/10.14778/3229863.3240493

Shang Z, Brackenbury W, Elmore A, Franklin M (2018). CYADB. Proceedings of the VLDB Endowment, 11(12), pages 2038-2041. Online publication date: 1-Aug-2018.
https://dl.acm.org/doi/10.14778/3229863.3236254

Chung Y, Mortensen M, Binnig C, Kraska T (2018). Estimating the Impact of Unknown Unknowns on Aggregate Query Results. ACM Transactions on Database Systems, 43(1), pages 1-37. Online publication date: 6-Mar-2018.
https://dl.acm.org/doi/10.1145/3167970

Chai C, Fan J, Li G (2018). Incentive-Based Entity Collection Using Crowdsourcing. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 341-352. Online publication date: Apr-2018.
https://doi.org/10.1109/ICDE.2018.00039

Li G, Wang J, Zheng Y, Fan J, Franklin M (2018). Crowdsourced Operators. Crowdsourced Data Management, pages 97-154. Online publication date: 13-Oct-2018.
https://doi.org/10.1007/978-981-10-7847-7_7

Chung Y, Krishnan S, Kraska T (2017). A data quality metric (DQM). Proceedings of the VLDB Endowment, 10(10), pages 1094-1105. Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.14778/3115404.3115414

Guo Y, Binnig C, Kraska T (2017). What you see is not what you get! Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, pages 1-5. Online publication date: 14-May-2017.
https://dl.acm.org/doi/10.1145/3077257.3077266
