research-article

Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions

Authors:

Nezihe Merve Gürel,

Ce ZhangAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 14, Issue 3

Pages 255 - 267

https://doi.org/10.14778/3430915.3430917

Published: 01 November 2020 Publication History

Abstract

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) --- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP'ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed --- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% gap on average.

References

[1]

Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases: The Logical Level (1st ed.). Addison-Wesley Longman Publishing Co., Inc., USA.

Digital Library

[2]

Pankaj K. Agarwal, Boris Aronov, Sariel Har-Peled, Jeff M. Phillips, Ke Yi, and Wuzhou Zhang. 2016. Nearest-Neighbor Searching Under Uncertainty II. ACM Trans. Algorithms 13, 1, Article Article 3 (Oct. 2016), 25 pages.

Digital Library

[3]

Pankaj K. Agarwal, Alon Efrat, Swaminathan Sankararaman, and Wuzhou Zhang. 2012. Nearest-Neighbor Searching under Uncertainty. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '12). Association for Computing Machinery, New York, NY, USA, 225--236.

Digital Library

[4]

Marcelo Arenas, Leopoldo Bertossi, and Jan Chomicki. 1999. Consistent query answers in inconsistent databases. In Proc. 18th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. 68--79.

Digital Library

[5]

Leopoldo E Bertossi. 2011. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers.

Digital Library

[6]

Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing Value Imputation for Tables. Journal of Machine Learning Research 20, 175 (2019), 1--6.

[7]

Lane F Burgette and Jerome P Reiter. 2010. Multiple imputation for missing data via sequential regression trees. American journal of epidemiology 172, 9 (2010), 1070--1076.

[8]

Yuxin Chen, S. Hamed Hassani, Amin Karbasi, and Andreas Krause. 2015. Sequential Information Maximization: When is Greedy Near-optimal?. In Conference on Learning Theory.

[9]

Xu Chu, John Morcos, Ihab F Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 1247--1261.

Digital Library

[10]

Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen. [n.d.]. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/projects/data.

[11]

Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological) (1977), 1--38.

[12]

Pedro J. García-Laencina, José-Luis Sancho-Gómez, and Aníbal R. Figueiras-Vidal. 2010. Pattern Classification with Missing Data: A Review. Neural Comput. Appl. 19, 2 (March 2010), 263--282.

Digital Library

[13]

Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM.

Digital Library

[14]

Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. 2019. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. Proc. VLDB Endow. 12, 11 (July 2019), 1610--1623.

Digital Library

[15]

Bojan Karlaš, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, and Ce Zhang. 2020. Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions. arXiv:2005.05117

[16]

Pasha Khosravi, YooJung Choi, Yitao Liang, Antonio Vergari, and Guy Van den Broeck. 2019. On Tractable Computation of Expected Predictions. In Advances in Neural Information Processing Systems 32 (NeurIPS). http://starai.cs.ucla.edu/papers/KhosraviNeurIPS19.pdf

Digital Library

[17]

Pasha Khosravi, Yitao Liang, YooJung Choi, and Guy Van den Broeck. 2019. What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features. CoRR abs/1903.01620 (2019). arXiv:1903.01620 http://arxiv.org/abs/1903.01620

[18]

Pasha Khosravi, Yitao Liang, YooJung Choi, and Guy Van Den Broeck. 2019. What to expect of classifiers? reasoning about logistic regression with missing features. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2716--2724.

Digital Library

[19]

Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6--11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 1885--1894. http://proceedings.mlr.press/v70/koh17a.html

Digital Library

[20]

Hans-Peter Kriegel, Peter Kunath, and Matthias Renz. 2007. Probabilistic Nearest-Neighbor Query on Uncertain Objects. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA'07). Springer-Verlag, Berlin, Heidelberg, 337--348.

Digital Library

[21]

Sanjay Krishnan, Michael J Franklin, Ken Goldberg, and Eugene Wu. 2017. Boost-clean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017).

[22]

Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. Proc. VLDB Endowment 9, 12 (2016), 948--959.

Digital Library

[23]

Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. Activeclean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment 9, 12 (2016), 948--959.

Digital Library

[24]

Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2019. CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]. arXiv preprint arXiv:1904.09483 (2019).

[25]

Andrei Lopatenko and Leopoldo E Bertossi. 2007. Complexity of Consistent Query Answering in Databases Under Cardinality-Based and Incremental Repair Semantics. In Proc. 11th Int. Conf. on Database Theory. 179--193.

Digital Library

[26]

Harris Papadopoulos, Vladimir Vovk, and Alex Gammerman. 2011. Regression Conformal Prediction with Nearest Neighbours. J. Artif. Int. Res. 40, 1 (Jan. 2011), 815--840.

Digital Library

[27]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825--2830.

Digital Library

[28]

Carl E Rasmussen, Radford M Neal, Geoffrey E Hinton, Drew van Camp, Michael Revow, Zoubin Ghahramani, R Kustra, and Robert Tibshirani. 1996. The DELVE manual. URL http://www.cs.toronto.edu/~delve (1996).

[29]

Theodoros Rekatsinas, Xu Chu, Ihab F Ilyas, and Christopher Ré. 2017. Holo-Clean: holistic data repairs with probabilistic inference. Proceedings of the VLDB Endowment 10, 11 (2017), 1190--1201.

Digital Library

[30]

Donald B Rubin. 1976. Inference and missing data. Biometrika 63, 3 (1976), 581--592.

[31]

Donald B. Rubin. 1996. Multiple Imputation After 18+ Years. J. Amer. Statist. Assoc. 91, 434(1996), 473--489. http://www.jstor.org/stable/2291635

[32]

Boris Sharchilev, Yury Ustinovskiy, Pavel Serdyukov, and Maarten de Rijke. 2018. Finding Influential Training Samples for Gradient Boosted Decision Trees. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10--15, 2018 (Proceedings of Machine Learning Research), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. PMLR, 4584--4592. http://proceedings.mlr.press/v80/sharchilev18a.html

[33]

Jeffrey S Simonoff. 2013. Analyzing categorical data. Springer Science & Business Media.

[34]

C. Sitawarin and D. Wagner. 2019. On the Robustness of Deep K-Nearest Neighbors. In 2019 IEEE Security and Privacy Workshops (SPW). 1--7.

[35]

Daniel J Stekhoven and Peter Bühlmann. 2012. MissForest---non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 1 (2012), 112--118.

Digital Library

[36]

Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. 2011. Probabilistic Databases. Synthesis Lectures on Data Management 3, 2 (2011), 1--180. arXiv:https://doi.org/10.2200/S00362ED1V01Y201105DTM016

Digital Library

[37]

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 6 (2001), 520--525.

[38]

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49--60.

Digital Library

[39]

Chun Wa Ko, Jon Lee, and Maurice Queyranne. 1995. An Exact Algorithm for Maximum Entropy Sampling. Oper. Res. 43, 4 (1995), 684--691.

Digital Library

[40]

Jiannan Wang, Sanjay Krishnan, Michael J Franklin, Ken Goldberg, Tim Kraska, and Tova Milo. 2014. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 469--480.

Digital Library

[41]

Maurice Weber, Xiaojun Xu, Bojan Karlas, Ce Zhang, and Bo Li. 2020. RAB: Provable Robustness Against Backdoor Attacks. arXiv e-prints, Article arXiv:2003.08904 (March 2020), arXiv:2003.08904 pages. arXiv:cs.LG/2003.08904

[42]

Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. Proceedings of Machine Learning and Systems (2020), 307--325.

Cited By

Reis EAbdelaal MBinnig C(2024)Generalizable Data Cleaning of Tabular Data in Latent SpaceProceedings of the VLDB Endowment10.14778/3704965.370498317:13(4786-4798)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.14778/3704965.3704983
Sun YZhu JXu XXu XSun YSong SLi XYuan X(2024)Win-Win: On Simultaneous Clustering and Imputing over Incomplete DataProceedings of the VLDB Endowment10.14778/3681954.368198217:11(3045-3057)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3681982
Gao HCai SDinh THuang ZOoi B(2024)CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine LearningProceedings of the ACM on Management of Data10.1145/36988312:6(1-27)Online publication date: 20-Dec-2024
https://dl.acm.org/doi/10.1145/3698831
Show More Cited By

Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions
1. Computing methodologies

Recommendations

Prototype Selection for Composite Nearest Neighbor Classifiers
Hierarchical distance learning by stacking nearest neighbor classifiers

A hierarchical decision fusion and distance learning method, called FSG, is proposed.FSG is employed to bridge the gap between Bayes and N-sample classification error.A measure is proposed to learn expertise of base-layer classifiers of the FSG.FSG ...
Prototype selection for composite nearest neighbor classifiers

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 14, Issue 3

November 2020

217 pages

ISSN:2150-8097

Editors:
Xin Luna Dong
Amazon
,
Felix Naumann
HPI, University of Potsdam

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 November 2020

Published in PVLDB Volume 14, Issue 3

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

10
Total Citations
View Citations
72
Total Downloads

Downloads (Last 12 months)31
Downloads (Last 6 weeks)0

Reflects downloads up to 17 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Reis EAbdelaal MBinnig C(2024)Generalizable Data Cleaning of Tabular Data in Latent SpaceProceedings of the VLDB Endowment10.14778/3704965.370498317:13(4786-4798)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.14778/3704965.3704983
Sun YZhu JXu XXu XSun YSong SLi XYuan X(2024)Win-Win: On Simultaneous Clustering and Imputing over Incomplete DataProceedings of the VLDB Endowment10.14778/3681954.368198217:11(3045-3057)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3681982
Gao HCai SDinh THuang ZOoi B(2024)CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine LearningProceedings of the ACM on Management of Data10.1145/36988312:6(1-27)Online publication date: 20-Dec-2024
https://dl.acm.org/doi/10.1145/3698831
Pereira JFonseca MLopes AGalhardas H(2024)Cleenex: Support for User Involvement during an Iterative Data Cleaning ProcessJournal of Data and Information Quality10.1145/364847616:1(1-26)Online publication date: 19-Mar-2024
https://dl.acm.org/doi/10.1145/3648476
Perini MNikolic M(2024)In-Database Data ImputationProceedings of the ACM on Management of Data10.1145/36393262:1(1-27)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639326
Chai CLiu JTang NFan JMiao DWang JLuo YLi G(2023)GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete DataProceedings of the ACM on Management of Data10.1145/35893021:2(1-27)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589302
Fan ZKoutris POuyang XWijsen J(2023)LinCQA: Faster Consistent Query Answering with Linear Time GuaranteesProceedings of the ACM on Management of Data10.1145/35887181:1(1-25)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588718
Whang SRoh YSong HLee J(2023)Data collection and quality challenges in deep learning: a data-centric AI perspectiveThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-022-00775-932:4(791-813)Online publication date: 1-Jul-2023
https://dl.acm.org/doi/10.1007/s00778-022-00775-9
Jiang JBurkhalter LFu FDing BDu BHithnawi ALi BZhang CKoyejo SMohamed SAgarwal ABelgrave DCho KOh A(2022)VF-PSProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3600422(2088-2101)Online publication date: 28-Nov-2022
https://dl.acm.org/doi/10.5555/3600270.3600422
Eyuboglu SKarlaš BRé CZhang CZou JBoehm MVarma PXin D(2022)dcbenchProceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning10.1145/3533028.3533310(1-4)Online publication date: 12-Jun-2022
https://dl.acm.org/doi/10.1145/3533028.3533310

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents