Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3514221.3517867acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures

Published: 11 June 2022 Publication History

Abstract

The concept of learned index structures relies on the idea that the input-output functionality of a database index can be viewed as a prediction task and, thus, implemented using a machine learning model instead of traditional algorithmic techniques. This novel angle for a decades-old problem has inspired exciting results at the intersection of machine learning and data structures. However, the advantage of learned index structures, i.e., the ability to adjust to the data at hand via the underlying ML-model, can become a disadvantage from a security perspective as it could be exploited.
In this work, we present the first study of data poisoning attacks on learned index structures. Our poisoning approach is different from all previous works since the model under attack is trained on a cumulative distribution function (CDF) and, thus, every injection on the training set has a cascading impact on multiple data values. We formulate the first poisoning attacks on linear regression models trained on a CDF, which is a basic building block of the proposed learned index structures. We generalize our poisoning techniques to attack the advanced two-stage design of learned index structures called recursive model index (RMI), which has been shown to outperform traditional B-Trees. We evaluate our attacks under a variety of parameterizations of the model and show that the error of the RMI increases up to 300x and the error of its second-stage models increases up to 3000x.

References

[1]
Evan Ackerman. 2019. Three Small Stickers in Intersection Can Cause Tesla Autopilot to Swerve Into Wrong Lane. In IEEE Spectrum. https://spectrum.ieee.org/three-small-stickers-on-road-can-steer-tesla-autopilot-into-oncoming-lane [Accessed: 11-Oct-2021].
[2]
Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. 2014. H2O: A Hands-Free Adaptive Store. In Proc. of ACM SIGMOD. 1103--1114.
[3]
Battista Biggio, Blaine Nelson, and Pavel Laskov. 2012. Poisoning Attacks Against Support Vector Machines. arxiv: 1206.6389
[4]
Battista Biggio, Ignazio Pillai, Samuel Rota Bulò, Davide Ariu, Marcello Pelillo, and Fabio Roli. 2013. Is Data Clustering in Adversarial Settings Secure? Proc. of AISec.
[5]
Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. 2021. A "Learned" Approach to Quicken and Compress Rank/Select Dictionaries. In ALENEX .
[6]
Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. 2018. PAC-learning in the presence of adversaries. In Proc. of NeurIPS. 230--241.
[7]
Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, and Tim Kraska. 2020 a. ALEX: An Updatable Adaptive Learned Index. In Proc. of the ACM SIGMOD. 969--984.
[8]
Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020 b. Tsunami: A Learned Multi-Dimensional Index for Correlated Data and Skewed Workloads. Proc. VLDB Endow., Vol. 14, 2 (Oct. 2020), 74--86.
[9]
Yihe Dong, Piotr Indyk, Ilya P. Razenshteyn, and Tal Wagner. 2019. Learning Sublinear-Time Indexing for Nearest Neighbor Search. arxiv: 1901.08544
[10]
Martin Eppert, Philipp Fent, and Thomas Neumann. 2021. A Tailored Regression for Learned Indexes: Logarithmic Error Regression. Proc. of the International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM), 9--15.
[11]
Minghong Fang, Minghao Sun, Qi Li, Neil Zhenqiang Gong, Jin Tian, and Jia Liu. 2021. Data Poisoning Attacks and Defenses to Crowdsourcing Systems. In Proceedings of the Web Conference (WWW). 969--980.
[12]
Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. 2020. Why Are Learned Indexes So Effective?. In Proc. of the 37th ICML, Vol. 119.
[13]
Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. 2021. On the Performance of Learned Data Structures. Theoretical Computer Science (2021), 107--120.
[14]
Paolo Ferragina and Giorgio Vinciguerra. 2020 a. Learned Data Structures. In Recent Trends in Learning From Data. 5--41.
[15]
Paolo Ferragina and Giorgio Vinciguerra. 2020 b. The PGM-Index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds. Proc. VLDB Endow., Vol. 13, 8 (April 2020), 1162--1175.
[16]
Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures. In Proc. of the 22nd ACM CCS. 1322--1333.
[17]
Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In Proc. of ACM SIGMOD. 1189--1206.
[18]
Ali Hadian and Thomas Heinis. 2019 a. Considerations for Handling Updates in Learned Index Structures. Proc. of the International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM) .
[19]
Ali Hadian and Thomas Heinis. 2019 b. Interpolation-friendly B-trees: Bridging the Gap Between Algorithmic and Learned Indexes. In Proc. of EDBT. 710--713.
[20]
Ali Hadian and Thomas Heinis. 2021. Shift-Table: A Low-latency Learned Index for Range Queries using Model Correction. arxiv: 2101.10457
[21]
Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-Based Frequency Estimation Algorithms. In Proc. of ICLR.
[22]
Hai Huang, Jiaming Mu, Neil Zhenqiang Gong, Qi Li, Bin Liu, and Mingwei Xu. 2021. Data Poisoning Attacks to Deep Learning Based Recommender Systems. Proc. of NDSS.
[23]
Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, and Zichen Zhu. 2019. Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn. In Proc. of CIDR.
[24]
Stratos Idreos and Tim Kraska. 2019. From Auto-Tuning One Size Fits All to Self-Designed and Learned Data-Intensive Systems. In Proc. of ACM SIGMOD. 2054--2059.
[25]
Piotr Indyk, Ali Vakilian, and Yang Yuan. 2019. Learning-Based Low-Rank Approximations. In Proc. of NeurIPS. 7400--7410.
[26]
Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. 2018. Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning. Proc. of IEEE S&P.
[27]
Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. arxiv: 1911.13014
[28]
Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2020. RadixSpline: A Single-Pass Learned Index. Proc. of the International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM) .
[29]
Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In Proc. of CIDR.
[30]
Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proc. of ACM SIGMOD. 489--504.
[31]
Yaliang Li, Daoyuan Chen, Bolin Ding, Kai Zeng, and Jingren Zhou. 2021. A Pluggable Learned Index Method via Sampling and Gap Insertion. arxiv: 2101.00808
[32]
Xuanqing Liu, Si Si, Xiaojin Zhu, Yang Li, and Cho-Jui Hsieh. 2019. A Unified Framework for Data Poisoning Attack to Graph-based Semi-supervised Learning. arxiv: 1910.14147
[33]
Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J. Gordon. 2018. Query-based Workload Forecasting for Self-Driving Database Management Systems. In Proc. of ACM SIGMOD. 631--645.
[34]
Shiqing Ma and Yingqi Liu. 2019. NIC: Detecting Adversarial Samples with Neural Network Invariant Checking. In Proc. of the NDSS.
[35]
Stephen Macke, Alex Beutel, Tim Kraska, Maheswaran Sathiamoorthy, Derek Zhiyuan Cheng, and Ed H. Chi. 2018. Lifting the Curse of Multidimensional Data with Learned Existence Indexes. In Workshop on ML for Systems at NeurIPS .
[36]
Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In Proc. of ACM SIGCOMM. 270--288.
[37]
Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, and Tim Kraska. 2020. Benchmarking Learned Indexes. arxiv: 2006.12804
[38]
Florida Miami-Dade County. 2019. Employee Pay Information. https://gis-mdc.opendata.arcgis.com/datasets/employee-pay-information/data.
[39]
David J. Miller and Hasan S. Uyar. 1996. A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. In Proc. of NeurIPS. 571--577.
[40]
Kenneth S. Miller. 1966. An Introduction to the Calculus of Finite Differences and Difference Equations, by Kenneth S. Miller .Dover Publications, New York.
[41]
Michael Mitzenmacher. 2018. A Model for Learned Bloom Filters and Optimizing by Sandwiching. In Proc. of NeurIPS. 464--473.
[42]
Douglas C. Montgomery. 2006. Design and Analysis of Experiments .John Wiley & Sons, Inc., Hoboken, NJ, USA.
[43]
Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-Dimensional Indexes. In Proc. of ACM SIGMOD. 985--1000.
[44]
OpenStreetMap contributors. 2017. Planet dump retrieved from https://planet.osm.org. https://www.openstreetmap.org .
[45]
Jack W. Rae, Sergey Bartunov, and Timothy P. Lillicrap. 2019. Meta-Learning Neural Bloom Filters. In Proc. of ICML, Vol. 97. 5271--5280.
[46]
Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. 2019. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Proc. of NDSS .
[47]
Naufal Fikri Setiawan, Benjamin I. P. Rubinstein, and Renata Borovica-Gajic. 2020. Function Interpolation for Learned Index Structures. In ADC - Databases Theory and Applications. Springer, 68--80.
[48]
Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, and Tim Kraska. 2021. Bounding the Last Mile: Efficient Learned String Indexing. arxiv: 2111.14905
[49]
Mihail Stoian, Andreas Kipf, Ryan Marcus, and Tim Kraska. 2021. Towards Practical Learned Indexing. arxiv: 2108.05117
[50]
Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitras. 2018. When Does Machine Learning FAIL? Generalized Transferability for Evasion and Poisoning Attacks. In Proc. of USENIX Security. 1299--1316.
[51]
Mingjie Sun, Jian Tang, Huichen Li, Bo Li, Chaowei Xiao, Yao Chen, and Dawn Song. 2018. Data Poisoning Attack against Unsupervised Node Embedding Methods. arxiv: 1810.12881
[52]
Chuzhe Tang, Zhiyuan Dong, Minjie Wang, Zhaoguo Wang, and Haibo Chen. 2019. Learned Indexes for Dynamic Workloads. arxiv: 1902.00655
[53]
Kapil Vaidya, Eric Knorr, Tim Kraska, and Michael Mitzenmacher. 2020. Partitioned Learned Bloom Filter. arxiv: 2006.03176
[54]
Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-Scale Machine Learning. In Proc. of ACM SIGMOD. 1009--1024.
[55]
Peter Van Sandt, Yannis Chronis, and Jignesh M. Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proc. of ACM SIGMOD. 36--53.
[56]
Giorgio Vinciguerra, Paolo Ferragina, and Michele Miccinesi. 2019. Superseding Traditional Indexes by Orchestrating Learning and Geometry. arxiv: 1903.00507
[57]
B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao. 2019. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In Proc. of the 40th IEEE S&P. 707--723.
[58]
Yingjun Wu, Jia Yu, Yuanyuan Tian, Richard Sidle, and Ronald Barber. 2019. Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations. In Proc. of ACM SIGMOD. 1223--1240.
[59]
Wenkun Xiang, Hao Zhang, Rui Cui, Xing Chu, Keqin Li, and Wei Zhou. 2019. Pavo: A RNN-Based Learned Inverted Index, Supervised or Unsupervised? IEEE Access, Vol. 7 (2019), 293--303.
[60]
Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. 2015. Is Feature Selection Secure against Training Data Poisoning?. In Proc. of the 32nd ICML, Vol. 37. 1689--1698.
[61]
Chaofei Yang, Qing Wu, Hai Li, and Yiran Chen. 2017. Generative Poisoning Attack Method Against Neural Networks. arxiv: 1703.01340

Cited By

View all
  • (2024)Algorithmic Complexity Attacks on Dynamic Learned IndexesProceedings of the VLDB Endowment10.14778/3636218.363623217:4(780-793)Online publication date: 5-Mar-2024
  • (2024)PACE: Poisoning Attacks on Learned Cardinality EstimationProceedings of the ACM on Management of Data10.1145/36392922:1(1-27)Online publication date: 26-Mar-2024
  • (2024)Grafite: Taming Adversarial Queries with Optimal Range FiltersProceedings of the ACM on Management of Data10.1145/36392582:1(1-23)Online publication date: 26-Mar-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN:9781450392495
DOI:10.1145/3514221
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Check for updates

Author Tags

  1. attacks
  2. data poisoning
  3. indexing
  4. learned systems

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '22
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)181
  • Downloads (Last 6 weeks)30
Reflects downloads up to 03 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Algorithmic Complexity Attacks on Dynamic Learned IndexesProceedings of the VLDB Endowment10.14778/3636218.363623217:4(780-793)Online publication date: 5-Mar-2024
  • (2024)PACE: Poisoning Attacks on Learned Cardinality EstimationProceedings of the ACM on Management of Data10.1145/36392922:1(1-27)Online publication date: 26-Mar-2024
  • (2024)Grafite: Taming Adversarial Queries with Optimal Range FiltersProceedings of the ACM on Management of Data10.1145/36392582:1(1-23)Online publication date: 26-Mar-2024
  • (2024)TRAP: Tailored Robustness Assessment for Index Advisors via Adversarial Perturbation2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00011(42-55)Online publication date: 13-May-2024
  • (2023)Sieve: A Learned Data-Skipping Index for Data AnalyticsProceedings of the VLDB Endowment10.14778/3611479.361152016:11(3214-3226)Online publication date: 24-Aug-2023
  • (2023)Learned Index: A Comprehensive Experimental EvaluationProceedings of the VLDB Endowment10.14778/3594512.359452816:8(1992-2004)Online publication date: 1-Apr-2023
  • (2023)On Nonlinear Learned String IndexingIEEE Access10.1109/ACCESS.2023.329543411(74021-74034)Online publication date: 2023
  • (2023)Morphtree: a polymorphic main-memory learned index for dynamic workloadsThe VLDB Journal10.1007/s00778-023-00823-y33:4(1065-1084)Online publication date: 1-Dec-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media