Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions

Published: 01 November 2020 Publication History

Abstract

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) --- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP'ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed --- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% gap on average.

References

[1]
Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases: The Logical Level (1st ed.). Addison-Wesley Longman Publishing Co., Inc., USA.
[2]
Pankaj K. Agarwal, Boris Aronov, Sariel Har-Peled, Jeff M. Phillips, Ke Yi, and Wuzhou Zhang. 2016. Nearest-Neighbor Searching Under Uncertainty II. ACM Trans. Algorithms 13, 1, Article Article 3 (Oct. 2016), 25 pages.
[3]
Pankaj K. Agarwal, Alon Efrat, Swaminathan Sankararaman, and Wuzhou Zhang. 2012. Nearest-Neighbor Searching under Uncertainty. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '12). Association for Computing Machinery, New York, NY, USA, 225--236.
[4]
Marcelo Arenas, Leopoldo Bertossi, and Jan Chomicki. 1999. Consistent query answers in inconsistent databases. In Proc. 18th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. 68--79.
[5]
Leopoldo E Bertossi. 2011. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers.
[6]
Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing Value Imputation for Tables. Journal of Machine Learning Research 20, 175 (2019), 1--6.
[7]
Lane F Burgette and Jerome P Reiter. 2010. Multiple imputation for missing data via sequential regression trees. American journal of epidemiology 172, 9 (2010), 1070--1076.
[8]
Yuxin Chen, S. Hamed Hassani, Amin Karbasi, and Andreas Krause. 2015. Sequential Information Maximization: When is Greedy Near-optimal?. In Conference on Learning Theory.
[9]
Xu Chu, John Morcos, Ihab F Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 1247--1261.
[10]
Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen. [n.d.]. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/projects/data.
[11]
Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological) (1977), 1--38.
[12]
Pedro J. García-Laencina, José-Luis Sancho-Gómez, and Aníbal R. Figueiras-Vidal. 2010. Pattern Classification with Missing Data: A Review. Neural Comput. Appl. 19, 2 (March 2010), 263--282.
[13]
Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM.
[14]
Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. 2019. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. Proc. VLDB Endow. 12, 11 (July 2019), 1610--1623.
[15]
Bojan Karlaš, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, and Ce Zhang. 2020. Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions. arXiv:2005.05117
[16]
Pasha Khosravi, YooJung Choi, Yitao Liang, Antonio Vergari, and Guy Van den Broeck. 2019. On Tractable Computation of Expected Predictions. In Advances in Neural Information Processing Systems 32 (NeurIPS). http://starai.cs.ucla.edu/papers/KhosraviNeurIPS19.pdf
[17]
Pasha Khosravi, Yitao Liang, YooJung Choi, and Guy Van den Broeck. 2019. What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features. CoRR abs/1903.01620 (2019). arXiv:1903.01620 http://arxiv.org/abs/1903.01620
[18]
Pasha Khosravi, Yitao Liang, YooJung Choi, and Guy Van Den Broeck. 2019. What to expect of classifiers? reasoning about logistic regression with missing features. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2716--2724.
[19]
Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6--11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 1885--1894. http://proceedings.mlr.press/v70/koh17a.html
[20]
Hans-Peter Kriegel, Peter Kunath, and Matthias Renz. 2007. Probabilistic Nearest-Neighbor Query on Uncertain Objects. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA'07). Springer-Verlag, Berlin, Heidelberg, 337--348.
[21]
Sanjay Krishnan, Michael J Franklin, Ken Goldberg, and Eugene Wu. 2017. Boost-clean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017).
[22]
Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. Proc. VLDB Endowment 9, 12 (2016), 948--959.
[23]
Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. Activeclean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment 9, 12 (2016), 948--959.
[24]
Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2019. CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]. arXiv preprint arXiv:1904.09483 (2019).
[25]
Andrei Lopatenko and Leopoldo E Bertossi. 2007. Complexity of Consistent Query Answering in Databases Under Cardinality-Based and Incremental Repair Semantics. In Proc. 11th Int. Conf. on Database Theory. 179--193.
[26]
Harris Papadopoulos, Vladimir Vovk, and Alex Gammerman. 2011. Regression Conformal Prediction with Nearest Neighbours. J. Artif. Int. Res. 40, 1 (Jan. 2011), 815--840.
[27]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825--2830.
[28]
Carl E Rasmussen, Radford M Neal, Geoffrey E Hinton, Drew van Camp, Michael Revow, Zoubin Ghahramani, R Kustra, and Robert Tibshirani. 1996. The DELVE manual. URL http://www.cs.toronto.edu/~delve (1996).
[29]
Theodoros Rekatsinas, Xu Chu, Ihab F Ilyas, and Christopher Ré. 2017. Holo-Clean: holistic data repairs with probabilistic inference. Proceedings of the VLDB Endowment 10, 11 (2017), 1190--1201.
[30]
Donald B Rubin. 1976. Inference and missing data. Biometrika 63, 3 (1976), 581--592.
[31]
Donald B. Rubin. 1996. Multiple Imputation After 18+ Years. J. Amer. Statist. Assoc. 91, 434(1996), 473--489. http://www.jstor.org/stable/2291635
[32]
Boris Sharchilev, Yury Ustinovskiy, Pavel Serdyukov, and Maarten de Rijke. 2018. Finding Influential Training Samples for Gradient Boosted Decision Trees. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10--15, 2018 (Proceedings of Machine Learning Research), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. PMLR, 4584--4592. http://proceedings.mlr.press/v80/sharchilev18a.html
[33]
Jeffrey S Simonoff. 2013. Analyzing categorical data. Springer Science & Business Media.
[34]
C. Sitawarin and D. Wagner. 2019. On the Robustness of Deep K-Nearest Neighbors. In 2019 IEEE Security and Privacy Workshops (SPW). 1--7.
[35]
Daniel J Stekhoven and Peter Bühlmann. 2012. MissForest---non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 1 (2012), 112--118.
[36]
Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. 2011. Probabilistic Databases. Synthesis Lectures on Data Management 3, 2 (2011), 1--180. arXiv:https://doi.org/10.2200/S00362ED1V01Y201105DTM016
[37]
Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 6 (2001), 520--525.
[38]
Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49--60.
[39]
Chun Wa Ko, Jon Lee, and Maurice Queyranne. 1995. An Exact Algorithm for Maximum Entropy Sampling. Oper. Res. 43, 4 (1995), 684--691.
[40]
Jiannan Wang, Sanjay Krishnan, Michael J Franklin, Ken Goldberg, Tim Kraska, and Tova Milo. 2014. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 469--480.
[41]
Maurice Weber, Xiaojun Xu, Bojan Karlas, Ce Zhang, and Bo Li. 2020. RAB: Provable Robustness Against Backdoor Attacks. arXiv e-prints, Article arXiv:2003.08904 (March 2020), arXiv:2003.08904 pages. arXiv:cs.LG/2003.08904
[42]
Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. Proceedings of Machine Learning and Systems (2020), 307--325.

Cited By

View all
  • (2024)Generalizable Data Cleaning of Tabular Data in Latent SpaceProceedings of the VLDB Endowment10.14778/3704965.370498317:13(4786-4798)Online publication date: 1-Sep-2024
  • (2024)Win-Win: On Simultaneous Clustering and Imputing over Incomplete DataProceedings of the VLDB Endowment10.14778/3681954.368198217:11(3045-3057)Online publication date: 30-Aug-2024
  • (2024)CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine LearningProceedings of the ACM on Management of Data10.1145/36988312:6(1-27)Online publication date: 20-Dec-2024
  • Show More Cited By
  1. Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 14, Issue 3
    November 2020
    217 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 November 2020
    Published in PVLDB Volume 14, Issue 3

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)31
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 17 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Generalizable Data Cleaning of Tabular Data in Latent SpaceProceedings of the VLDB Endowment10.14778/3704965.370498317:13(4786-4798)Online publication date: 1-Sep-2024
    • (2024)Win-Win: On Simultaneous Clustering and Imputing over Incomplete DataProceedings of the VLDB Endowment10.14778/3681954.368198217:11(3045-3057)Online publication date: 30-Aug-2024
    • (2024)CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine LearningProceedings of the ACM on Management of Data10.1145/36988312:6(1-27)Online publication date: 20-Dec-2024
    • (2024)Cleenex: Support for User Involvement during an Iterative Data Cleaning ProcessJournal of Data and Information Quality10.1145/364847616:1(1-26)Online publication date: 19-Mar-2024
    • (2024)In-Database Data ImputationProceedings of the ACM on Management of Data10.1145/36393262:1(1-27)Online publication date: 26-Mar-2024
    • (2023)GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete DataProceedings of the ACM on Management of Data10.1145/35893021:2(1-27)Online publication date: 20-Jun-2023
    • (2023)LinCQA: Faster Consistent Query Answering with Linear Time GuaranteesProceedings of the ACM on Management of Data10.1145/35887181:1(1-25)Online publication date: 30-May-2023
    • (2023)Data collection and quality challenges in deep learning: a data-centric AI perspectiveThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-022-00775-932:4(791-813)Online publication date: 1-Jul-2023
    • (2022)VF-PSProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3600422(2088-2101)Online publication date: 28-Nov-2022
    • (2022)dcbenchProceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning10.1145/3533028.3533310(1-4)Online publication date: 12-Jun-2022

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media