DOI: 10.1145/3665451.3665533
Research Article
Open Access

MEGEX: Data-Free Model Extraction Attack Against Gradient-Based Explainable AI

Published: 23 July 2024

Abstract

Explainable AI encourages machine learning applications in the real world, while data-free model extraction attacks (DFME), in which an adversary steals a trained machine learning model by crafting input queries with a generative model instead of collecting training data, have attracted attention as a serious threat. In this paper, we propose MEGEX, a data-free model extraction attack against explainable AI that provides gradient-based explanations for inference results, and we investigate whether gradient-based explanations increase the vulnerability to data-free model extraction attacks. In MEGEX, the adversary leverages Vanilla Gradient explanations as derivative values for training a generative model. We prove that MEGEX is identical to white-box data-free knowledge distillation, whereby the adversary can train the generative model with exact gradients. Our experiments show that the adversary in MEGEX can steal highly accurate models, reaching 0.98×, 0.91×, and 0.96× the victim model's accuracy on the SVHN, Fashion-MNIST, and CIFAR-10 datasets given 1.5M, 5M, and 20M queries, respectively. In addition, we apply more sophisticated gradient-based explanations, i.e., SmoothGrad and Integrated Gradients, to MEGEX. The experimental results indicate that these explanations are potential countermeasures to MEGEX. We also find that the accuracy of the model stolen by the adversary depends on the diversity of the query inputs produced by the generative model.
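
To make the mechanism concrete, the sketch below illustrates the core idea stated in the abstract: when the victim API returns a gradient-based explanation alongside its prediction, the adversary can backpropagate an exact gradient through the query input into its generative model, rather than estimating that gradient with zeroth-order methods as in fully black-box data-free attacks. This is a minimal PyTorch sketch, not the authors' implementation; the names `victim_api`, `Generator`, `megex_generator_step`, and the L1 disagreement loss are illustrative assumptions, and the API is assumed (hypothetically) to return the victim-side gradient of that loss with respect to the query, standing in for what the adversary would reconstruct from a Vanilla Gradient explanation.

```python
# Minimal sketch of the MEGEX idea (illustrative assumptions, not the authors' code):
# the victim returns a gradient-based explanation with each prediction, and the
# adversary uses it as an exact input gradient when updating its query generator,
# instead of the zeroth-order estimation used by fully black-box data-free attacks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Generator(nn.Module):
    """Toy generator mapping noise vectors z to flattened 32x32x3 query images."""

    def __init__(self, z_dim: int = 100, img_dim: int = 3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


def megex_generator_step(generator, student, victim_api, g_opt,
                         z_dim: int = 100, batch_size: int = 32) -> None:
    """One generator update that maximizes student-victim disagreement.

    `victim_api(x)` is a hypothetical MLaaS endpoint returning
    (victim probabilities, victim-side gradient of the disagreement loss
    w.r.t. the query); the second output stands in for what the adversary
    would reconstruct from a Vanilla Gradient explanation.
    """
    z = torch.randn(batch_size, z_dim)
    x = generator(z)                                   # synthetic queries, differentiable w.r.t. G

    victim_probs, grad_wrt_x = victim_api(x.detach())  # one batch of queries to the victim

    # Locally differentiable part of the generator loss: push the student's
    # prediction away from the victim's (victim_probs is a constant here).
    loss = -F.l1_loss(F.softmax(student(x), dim=1), victim_probs)

    g_opt.zero_grad()
    loss.backward(retain_graph=True)                   # gradient through the student path

    # Inject the exact victim-side gradient obtained from the explanation into
    # the graph at x, as white-box data-free knowledge distillation would.
    x.backward(gradient=grad_wrt_x)
    g_opt.step()
    student.zero_grad(set_to_none=True)                # student is trained in a separate step
```

In a complete attack loop, as in prior data-free extraction work, this generator step would alternate with several distillation steps that train the student on the same synthetic queries; the sketch only isolates the point made in the abstract, namely that the returned explanation replaces gradient estimation, so every query contributes an exact gradient signal to the generator.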


Cited By

  • (2024) Continual Learning From a Stream of APIs. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), 11432-11445. https://doi.org/10.1109/TPAMI.2024.3460871. Online publication date: December 2024.

      Information

      Published In

      SecTL '24: Proceedings of the 2nd ACM Workshop on Secure and Trustworthy Deep Learning Systems
      July 2024
      69 pages
      ISBN:9798400706912
      DOI:10.1145/3665451
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 23 July 2024

      Author Tags

      1. Data-Free Model Extraction
      2. Explainable AI
      3. Vanilla Gradient

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ASIA CCS '24
