Research Article | Open Access
DOI: 10.1145/3691620.3695532

Constructing Surrogate Models in Machine Learning Using Combinatorial Testing and Active Learning

Published: 27 October 2024

Abstract

Machine learning (ML) models are often black boxes, making it challenging to understand and interpret their decision-making processes. Surrogate models are constructed to approximate the behavior of a target model and are an essential tool for analyzing black-box models. Constructing a surrogate model typically involves querying the target model with carefully selected data points and using the target's responses to infer information about its structure and parameters.
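To make this query-and-fit loop concrete, here is a minimal, hypothetical sketch (not the paper's implementation): a scikit-learn random forest stands in for the black-box target, and a shallow decision tree serves as the surrogate fitted on the target's responses.

```python
# Hypothetical sketch of the generic query-and-fit loop; model choices are
# illustrative stand-ins, not the paper's implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in black-box target
from sklearn.tree import DecisionTreeClassifier      # interpretable surrogate

rng = np.random.default_rng(0)

# Train a target model that we will afterwards treat as a black box.
X_train = rng.random((500, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.0).astype(int)
target = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Query the target with selected data points and record its responses.
queries = rng.random((200, 4))
responses = target.predict(queries)

# Fit the surrogate on the (query, response) pairs.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(queries, responses)

# Fidelity: how often the surrogate agrees with the target on fresh inputs.
probe = rng.random((1000, 4))
print(f"agreement: {(surrogate.predict(probe) == target.predict(probe)).mean():.2%}")
```

Fidelity, the agreement rate between surrogate and target on fresh inputs, is the usual yardstick for how closely a surrogate mirrors its target.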
In this paper, we propose an approach to surrogate model construction that uses combinatorial testing and active learning to efficiently capture the essential feature interactions that drive the target model's predictions. Our approach first leverages t-way testing to generate data points that cover all t-way feature interactions. We then use an iterative process to isolate the essential feature interactions, i.e., those that can determine a model prediction. In each iteration, we remove nonessential feature interactions, generate additional data points that cover the remaining interactions, and employ active learning to select a subset of these data points for updating the surrogate model. This process continues until the surrogate model closely mirrors the target model's behavior. We evaluate our approach on 4 public datasets and 12 ML models and compare the results with state-of-the-art (SOTA) approaches. Our experimental results show that our approach outperforms the SOTA approaches in most cases in terms of both accuracy and efficiency.
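The two building blocks named above can be illustrated in isolation. The sketch below is illustrative only; the function names and the least-confidence strategy are assumptions, not the paper's tool. It enumerates every t-way combination of discretized feature values (the naive counterpart of the covering arrays used in combinatorial testing) and then performs an uncertainty-based active-learning selection against a stand-in surrogate.

```python
# Illustrative sketch of t-way interaction enumeration and uncertainty-based
# selection; all names and strategies are assumptions, not the paper's tool.
from itertools import combinations, product

import numpy as np
from sklearn.linear_model import LogisticRegression

def t_way_interactions(domains, t=2):
    """Yield every t-way combination of feature indices and their values.

    domains: list where domains[i] holds the discrete values of feature i.
    This is exhaustive enumeration; combinatorial testing compresses these
    interactions into a much smaller covering array of concrete test inputs.
    """
    for feats in combinations(range(len(domains)), t):
        for values in product(*(domains[f] for f in feats)):
            yield tuple(zip(feats, values))

# 3 binary features -> 3 feature pairs x 4 value pairs = 12 interactions.
print(len(list(t_way_interactions([[0, 1]] * 3, t=2))))  # 12

# Active learning step: keep the k candidate points the current surrogate is
# least confident about (least-confidence uncertainty sampling).
rng = np.random.default_rng(0)
X, y = rng.random((100, 3)), rng.integers(0, 2, 100)
surrogate = LogisticRegression().fit(X, y)  # stand-in surrogate

candidates = rng.random((50, 3))
uncertainty = 1.0 - surrogate.predict_proba(candidates).max(axis=1)
selected = candidates[np.argsort(uncertainty)[-10:]]  # 10 most uncertain points
```

In the approach described by the abstract, points like `selected` would be sent to the target model for labeling and used to update the surrogate before the next iteration prunes further nonessential interactions.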




Published In

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
October 2024
2587 pages
ISBN: 9798400712487
DOI: 10.1145/3691620
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. machine learning
  2. surrogate model
  3. proxy model
  4. model extraction attack
  5. combinatorial testing
  6. feature interactions
  7. test case generation

Qualifiers

  • Research-article

Conference

ASE '24

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

