research-article

Open access

The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures

Authors:

Evgenios M. Kornaropoulos,

Roberto TamassiaAuthors Info & Claims

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

Pages 1331 - 1344

https://doi.org/10.1145/3514221.3517867

Published: 11 June 2022 Publication History

Abstract

The concept of learned index structures relies on the idea that the input-output functionality of a database index can be viewed as a prediction task and, thus, implemented using a machine learning model instead of traditional algorithmic techniques. This novel angle for a decades-old problem has inspired exciting results at the intersection of machine learning and data structures. However, the advantage of learned index structures, i.e., the ability to adjust to the data at hand via the underlying ML-model, can become a disadvantage from a security perspective as it could be exploited.

In this work, we present the first study of data poisoning attacks on learned index structures. Our poisoning approach is different from all previous works since the model under attack is trained on a cumulative distribution function (CDF) and, thus, every injection on the training set has a cascading impact on multiple data values. We formulate the first poisoning attacks on linear regression models trained on a CDF, which is a basic building block of the proposed learned index structures. We generalize our poisoning techniques to attack the advanced two-stage design of learned index structures called recursive model index (RMI), which has been shown to outperform traditional B-Trees. We evaluate our attacks under a variety of parameterizations of the model and show that the error of the RMI increases up to 300x and the error of its second-stage models increases up to 3000x.

References

[1]

Evan Ackerman. 2019. Three Small Stickers in Intersection Can Cause Tesla Autopilot to Swerve Into Wrong Lane. In IEEE Spectrum. https://spectrum.ieee.org/three-small-stickers-on-road-can-steer-tesla-autopilot-into-oncoming-lane [Accessed: 11-Oct-2021].

[2]

Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. 2014. H2O: A Hands-Free Adaptive Store. In Proc. of ACM SIGMOD. 1103--1114.

Digital Library

[3]

Battista Biggio, Blaine Nelson, and Pavel Laskov. 2012. Poisoning Attacks Against Support Vector Machines. arxiv: 1206.6389

[4]

Battista Biggio, Ignazio Pillai, Samuel Rota Bulò, Davide Ariu, Marcello Pelillo, and Fabio Roli. 2013. Is Data Clustering in Adversarial Settings Secure? Proc. of AISec.

Digital Library

[5]

Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. 2021. A "Learned" Approach to Quicken and Compress Rank/Select Dictionaries. In ALENEX .

[6]

Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. 2018. PAC-learning in the presence of adversaries. In Proc. of NeurIPS. 230--241.

[7]

Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, and Tim Kraska. 2020 a. ALEX: An Updatable Adaptive Learned Index. In Proc. of the ACM SIGMOD. 969--984.

Digital Library

[8]

Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020 b. Tsunami: A Learned Multi-Dimensional Index for Correlated Data and Skewed Workloads. Proc. VLDB Endow., Vol. 14, 2 (Oct. 2020), 74--86.

[9]

Yihe Dong, Piotr Indyk, Ilya P. Razenshteyn, and Tal Wagner. 2019. Learning Sublinear-Time Indexing for Nearest Neighbor Search. arxiv: 1901.08544

[10]

Martin Eppert, Philipp Fent, and Thomas Neumann. 2021. A Tailored Regression for Learned Indexes: Logarithmic Error Regression. Proc. of the International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM), 9--15.

Digital Library

[11]

Minghong Fang, Minghao Sun, Qi Li, Neil Zhenqiang Gong, Jin Tian, and Jia Liu. 2021. Data Poisoning Attacks and Defenses to Crowdsourcing Systems. In Proceedings of the Web Conference (WWW). 969--980.

Digital Library

[12]

Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. 2020. Why Are Learned Indexes So Effective?. In Proc. of the 37th ICML, Vol. 119.

[13]

Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. 2021. On the Performance of Learned Data Structures. Theoretical Computer Science (2021), 107--120.

[14]

Paolo Ferragina and Giorgio Vinciguerra. 2020 a. Learned Data Structures. In Recent Trends in Learning From Data. 5--41.

[15]

Paolo Ferragina and Giorgio Vinciguerra. 2020 b. The PGM-Index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds. Proc. VLDB Endow., Vol. 13, 8 (April 2020), 1162--1175.

Digital Library

[16]

Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures. In Proc. of the 22nd ACM CCS. 1322--1333.

Digital Library

[17]

Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In Proc. of ACM SIGMOD. 1189--1206.

Digital Library

[18]

Ali Hadian and Thomas Heinis. 2019 a. Considerations for Handling Updates in Learned Index Structures. Proc. of the International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM) .

Digital Library

[19]

Ali Hadian and Thomas Heinis. 2019 b. Interpolation-friendly B-trees: Bridging the Gap Between Algorithmic and Learned Indexes. In Proc. of EDBT. 710--713.

[20]

Ali Hadian and Thomas Heinis. 2021. Shift-Table: A Low-latency Learned Index for Range Queries using Model Correction. arxiv: 2101.10457

[21]

Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-Based Frequency Estimation Algorithms. In Proc. of ICLR.

[22]

Hai Huang, Jiaming Mu, Neil Zhenqiang Gong, Qi Li, Bin Liu, and Mingwei Xu. 2021. Data Poisoning Attacks to Deep Learning Based Recommender Systems. Proc. of NDSS.

[23]

Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, and Zichen Zhu. 2019. Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn. In Proc. of CIDR.

[24]

Stratos Idreos and Tim Kraska. 2019. From Auto-Tuning One Size Fits All to Self-Designed and Learned Data-Intensive Systems. In Proc. of ACM SIGMOD. 2054--2059.

Digital Library

[25]

Piotr Indyk, Ali Vakilian, and Yang Yuan. 2019. Learning-Based Low-Rank Approximations. In Proc. of NeurIPS. 7400--7410.

[26]

Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. 2018. Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning. Proc. of IEEE S&P.

[27]

Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. arxiv: 1911.13014

[28]

Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2020. RadixSpline: A Single-Pass Learned Index. Proc. of the International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM) .

Digital Library

[29]

Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In Proc. of CIDR.

[30]

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proc. of ACM SIGMOD. 489--504.

Digital Library

[31]

Yaliang Li, Daoyuan Chen, Bolin Ding, Kai Zeng, and Jingren Zhou. 2021. A Pluggable Learned Index Method via Sampling and Gap Insertion. arxiv: 2101.00808

[32]

Xuanqing Liu, Si Si, Xiaojin Zhu, Yang Li, and Cho-Jui Hsieh. 2019. A Unified Framework for Data Poisoning Attack to Graph-based Semi-supervised Learning. arxiv: 1910.14147

[33]

Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J. Gordon. 2018. Query-based Workload Forecasting for Self-Driving Database Management Systems. In Proc. of ACM SIGMOD. 631--645.

[34]

Shiqing Ma and Yingqi Liu. 2019. NIC: Detecting Adversarial Samples with Neural Network Invariant Checking. In Proc. of the NDSS.

[35]

Stephen Macke, Alex Beutel, Tim Kraska, Maheswaran Sathiamoorthy, Derek Zhiyuan Cheng, and Ed H. Chi. 2018. Lifting the Curse of Multidimensional Data with Learned Existence Indexes. In Workshop on ML for Systems at NeurIPS .

[36]

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In Proc. of ACM SIGCOMM. 270--288.

Digital Library

[37]

Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, and Tim Kraska. 2020. Benchmarking Learned Indexes. arxiv: 2006.12804

[38]

Florida Miami-Dade County. 2019. Employee Pay Information. https://gis-mdc.opendata.arcgis.com/datasets/employee-pay-information/data.

[39]

David J. Miller and Hasan S. Uyar. 1996. A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. In Proc. of NeurIPS. 571--577.

[40]

Kenneth S. Miller. 1966. An Introduction to the Calculus of Finite Differences and Difference Equations, by Kenneth S. Miller .Dover Publications, New York.

[41]

Michael Mitzenmacher. 2018. A Model for Learned Bloom Filters and Optimizing by Sandwiching. In Proc. of NeurIPS. 464--473.

[42]

Douglas C. Montgomery. 2006. Design and Analysis of Experiments .John Wiley & Sons, Inc., Hoboken, NJ, USA.

Digital Library

[43]

Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-Dimensional Indexes. In Proc. of ACM SIGMOD. 985--1000.

Digital Library

[44]

OpenStreetMap contributors. 2017. Planet dump retrieved from https://planet.osm.org. https://www.openstreetmap.org .

[45]

Jack W. Rae, Sergey Bartunov, and Timothy P. Lillicrap. 2019. Meta-Learning Neural Bloom Filters. In Proc. of ICML, Vol. 97. 5271--5280.

[46]

Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. 2019. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Proc. of NDSS .

[47]

Naufal Fikri Setiawan, Benjamin I. P. Rubinstein, and Renata Borovica-Gajic. 2020. Function Interpolation for Learned Index Structures. In ADC - Databases Theory and Applications. Springer, 68--80.

[48]

Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, and Tim Kraska. 2021. Bounding the Last Mile: Efficient Learned String Indexing. arxiv: 2111.14905

[49]

Mihail Stoian, Andreas Kipf, Ryan Marcus, and Tim Kraska. 2021. Towards Practical Learned Indexing. arxiv: 2108.05117

[50]

Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitras. 2018. When Does Machine Learning FAIL? Generalized Transferability for Evasion and Poisoning Attacks. In Proc. of USENIX Security. 1299--1316.

[51]

Mingjie Sun, Jian Tang, Huichen Li, Bo Li, Chaowei Xiao, Yao Chen, and Dawn Song. 2018. Data Poisoning Attack against Unsupervised Node Embedding Methods. arxiv: 1810.12881

[52]

Chuzhe Tang, Zhiyuan Dong, Minjie Wang, Zhaoguo Wang, and Haibo Chen. 2019. Learned Indexes for Dynamic Workloads. arxiv: 1902.00655

[53]

Kapil Vaidya, Eric Knorr, Tim Kraska, and Michael Mitzenmacher. 2020. Partitioned Learned Bloom Filter. arxiv: 2006.03176

[54]

Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-Scale Machine Learning. In Proc. of ACM SIGMOD. 1009--1024.

Digital Library

[55]

Peter Van Sandt, Yannis Chronis, and Jignesh M. Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proc. of ACM SIGMOD. 36--53.

[56]

Giorgio Vinciguerra, Paolo Ferragina, and Michele Miccinesi. 2019. Superseding Traditional Indexes by Orchestrating Learning and Geometry. arxiv: 1903.00507

[57]

B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao. 2019. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In Proc. of the 40th IEEE S&P. 707--723.

[58]

Yingjun Wu, Jia Yu, Yuanyuan Tian, Richard Sidle, and Ronald Barber. 2019. Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations. In Proc. of ACM SIGMOD. 1223--1240.

Digital Library

[59]

Wenkun Xiang, Hao Zhang, Rui Cui, Xing Chu, Keqin Li, and Wei Zhou. 2019. Pavo: A RNN-Based Learned Inverted Index, Supervised or Unsupervised? IEEE Access, Vol. 7 (2019), 293--303.

[60]

Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. 2015. Is Feature Selection Secure against Training Data Poisoning?. In Proc. of the 32nd ICML, Vol. 37. 1689--1698.

[61]

Chaofei Yang, Qing Wu, Hai Li, and Yiran Chen. 2017. Generative Poisoning Attack Method Against Neural Networks. arxiv: 1703.01340

Cited By

Yang RKornaropoulos ECheng Y(2024)Algorithmic Complexity Attacks on Dynamic Learned IndexesProceedings of the VLDB Endowment10.14778/3636218.363623217:4(780-793)Online publication date: 5-Mar-2024
https://dl.acm.org/doi/10.14778/3636218.3636232
Zhang JZhang CLi GChai C(2024)PACE: Poisoning Attacks on Learned Cardinality EstimationProceedings of the ACM on Management of Data10.1145/36392922:1(1-27)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639292
Costa MFerragina PVinciguerra G(2024)Grafite: Taming Adversarial Queries with Optimal Range FiltersProceedings of the ACM on Management of Data10.1145/36392582:1(1-23)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639258
Show More Cited By

Index Terms

The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures

Recommendations

Defending Against Adversarial Denial-of-Service Data Poisoning Attacks
DYNAMICS '20: Proceedings of the 2020 Workshop on DYnamic and Novel Advances in Machine Learning and Intelligent Cyber Security

Data poisoning is one of the most relevant security threats against machine learning and data-driven technologies. Since many applications rely on untrusted training data, an attacker can easily craft malicious samples and inject them into the training ...
Data poisoning attacks against machine learning algorithms
Abstract
For the past decade, machine learning technology has increasingly become popular and it has been contributing to many areas that have the potential to influence the society considerably. Generally, machine learning is used by various ...
Highlights
- A new approach to analyze robustness of machine learning.
- Machine learning ...
Stronger data poisoning attacks break data sanitization defenses
Abstract
Machine learning models trained on data from the outside world can be corrupted by data poisoning attacks that inject malicious points into the models’ training sets. A common defense against these attacks is data sanitization: first filter out ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

June 2022

2597 pages

ISBN:9781450392495

DOI:10.1145/3514221

General Chair:
Zachary Ives
University of Pennsylvania (USA)
,
Program Chairs:
Angela Bonifati
Lyon 1 University (France)
,
Amr El Abbadi
University of California, Santa Barbara (USA)

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS '22

Sponsor:

SIGMOD

SIGMOD/PODS '22: International Conference on Management of Data

June 12 - 17, 2022

PA, Philadelphia, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
715
Total Downloads

Downloads (Last 12 months)181
Downloads (Last 6 weeks)30

Reflects downloads up to 03 Oct 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yang RKornaropoulos ECheng Y(2024)Algorithmic Complexity Attacks on Dynamic Learned IndexesProceedings of the VLDB Endowment10.14778/3636218.363623217:4(780-793)Online publication date: 5-Mar-2024
https://dl.acm.org/doi/10.14778/3636218.3636232
Zhang JZhang CLi GChai C(2024)PACE: Poisoning Attacks on Learned Cardinality EstimationProceedings of the ACM on Management of Data10.1145/36392922:1(1-27)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639292
Costa MFerragina PVinciguerra G(2024)Grafite: Taming Adversarial Queries with Optimal Range FiltersProceedings of the ACM on Management of Data10.1145/36392582:1(1-23)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639258
Zhou WLin CZhou XLi GWang T(2024)TRAP: Tailored Robustness Assessment for Index Advisors via Adversarial Perturbation2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00011(42-55)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00011
Tong YLiu JWang HZhou KHe RZhang QWang C(2023)Sieve: A Learned Data-Skipping Index for Data AnalyticsProceedings of the VLDB Endowment10.14778/3611479.361152016:11(3214-3226)Online publication date: 24-Aug-2023
https://dl.acm.org/doi/10.14778/3611479.3611520
Sun ZZhou XLi G(2023)Learned Index: A Comprehensive Experimental EvaluationProceedings of the VLDB Endowment10.14778/3594512.359452816:8(1992-2004)Online publication date: 1-Apr-2023
https://dl.acm.org/doi/10.14778/3594512.3594528
Ferragina PFrasca MMarinò GVinciguerra G(2023)On Nonlinear Learned String IndexingIEEE Access10.1109/ACCESS.2023.329543411(74021-74034)Online publication date: 2023
https://doi.org/10.1109/ACCESS.2023.3295434
Luo YJin PChu ZWang XYuan YZhang ZLuo YWu XZou P(2023)Morphtree: a polymorphic main-memory learned index for dynamic workloadsThe VLDB Journal10.1007/s00778-023-00823-y33:4(1065-1084)Online publication date: 1-Dec-2023
https://doi.org/10.1007/s00778-023-00823-y

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents