Abstract
Data imputation is a well-known technique for repairing missing data values, but it can incur a prohibitive cost when applied to large data sets. Query-driven imputation offers a better alternative, as it allows for fixing only the data that is relevant to a query. We adopt a rule-based query rewriting technique for imputing the answers of analytic queries that are missing or incorrect due to data incompleteness. We present a novel query rewriting mechanism guided by partition patterns, which are compact representations of complete and missing data partitions. Our solution strives to infer the largest possible set of missing answers while improving the precision of incorrect ones.
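To make the idea of partition patterns more concrete, the following is a minimal sketch and not the mechanism presented in the paper. It assumes that a pattern is a tuple over the partitioning attributes in which `*` acts as a wildcard, and that a query answer needs imputation only when its partition is covered by a missing pattern. The attribute schema (floor, room, week), the names `covers`, `classify` and `partitions_to_impute`, and the label `unknown` are all hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple

WILDCARD = "*"  # assumed notation for "any value" in a pattern
Pattern = Tuple[str, ...]


def covers(pattern: Pattern, key: Pattern) -> bool:
    """Return True if each pattern position is a wildcard or equals the key's value."""
    return len(pattern) == len(key) and all(
        p == WILDCARD or p == k for p, k in zip(pattern, key)
    )


def classify(key: Pattern, complete: Sequence[Pattern], missing: Sequence[Pattern]) -> str:
    """Label a partition key using the two pattern sets (complete is checked first)."""
    if any(covers(p, key) for p in complete):
        return "complete"
    if any(covers(p, key) for p in missing):
        return "missing"
    return "unknown"  # hypothetical label: no pattern describes this partition


def partitions_to_impute(
    keys: Sequence[Pattern], complete: Sequence[Pattern], missing: Sequence[Pattern]
) -> List[Pattern]:
    """Keep only the partitions a query touches that are covered by a missing pattern."""
    return [k for k in keys if classify(k, complete, missing) == "missing"]


# Hypothetical schema: (floor, room, week)
complete_patterns = [("F1", WILDCARD, "w1")]   # every room of floor F1 is complete in week w1
missing_patterns = [("F2", "R1", WILDCARD)]    # room R1 of floor F2 is missing in every week

# Partitions that a hypothetical aggregate query groups over
query_keys = [("F1", "R1", "w1"), ("F2", "R1", "w2"), ("F3", "R1", "w1")]

to_fix = partitions_to_impute(query_keys, complete_patterns, missing_patterns)
print(to_fix)
```

Under these assumptions, only the partitions selected by `partitions_to_impute` would be candidates for repair, which is what keeps the cost proportional to the query rather than to the whole data set.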
Notes
1. We omit attribute names when they’re not necessary for understanding.
Acknowledgement
This work has partially been supported by the EBITA collaborative research project between the Fraunhofer Institute and Sorbonne Université.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hannou, FZ., Amann, B., Baazizi, MA. (2019). Query-Oriented Answer Imputation for Aggregate Queries. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds) Advances in Databases and Information Systems. ADBIS 2019. Lecture Notes in Computer Science, vol 11695. Springer, Cham. https://doi.org/10.1007/978-3-030-28730-6_19
DOI: https://doi.org/10.1007/978-3-030-28730-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28729-0
Online ISBN: 978-3-030-28730-6
eBook Packages: Computer Science, Computer Science (R0)