DOI: 10.1145/3611643.3616297

CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models

Published: 30 November 2023

Abstract

Code datasets are of immense value for training neural-network-based code completion models, and companies or organizations have made substantial investments to establish and process these datasets. Unfortunately, these datasets, whether built for proprietary or public use, face a high risk of unauthorized exploitation resulting from data leakage, license violations, etc. Even worse, the "black-box" nature of neural models sets a high barrier for outsiders to audit their training datasets, which further abets such unauthorized usage. Watermarking methods have been proposed to prohibit inappropriate usage of image and natural language datasets. However, due to domain specificity, they are not directly applicable to code datasets, leaving the copyright protection of this emerging and important field of code data exposed to threats. To fill this gap, we propose a method, named CodeMark, to embed user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models. Imperceptibility is vital, as it prevents adversaries with an ulterior motive from removing the watermarks. CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes covert against rule-breakers. We implement CodeMark in a toolkit and conduct an extensive evaluation on code completion models. CodeMark is validated to fulfill all desired properties of practical watermarks: harmlessness to model accuracy, verifiability, robustness, and imperceptibility.
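The core mechanism named in the abstract, a semantic-preserving transformation, can be sketched as follows. This is an illustrative toy, not the paper's implementation (CodeMark operates on syntax trees via tree-sitter): the transformation pair `len(x) == 0` ↔ `not x` is a hypothetical example of two functionally equivalent forms (for standard Python sequences), one of which the watermarker consistently prefers so that a model trained on the watermarked dataset learns the same preference.

```python
import re

# Hypothetical watermark rule: rewrite the "trigger" form `len(x) == 0`
# into the equivalent "target" form `not x`. A regex stands in for the
# AST-level rewriting that a real tool would perform.
TRIGGER = re.compile(r"\blen\((\w+)\)\s*==\s*0")
TARGET = r"not \1"

def embed_watermark(snippet: str) -> str:
    """Rewrite every trigger occurrence into the equivalent target form."""
    return TRIGGER.sub(TARGET, snippet)

watermarked = embed_watermark("if len(items) == 0:\n    return None")
print(watermarked)  # first line becomes: if not items:
```

A dataset owner could later verify suspected training usage by checking whether a model completes trigger contexts with the target form significantly more often than a model trained on unmarked data would.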

Supplementary Material

Video (fse23main-p471-p-video.mp4)


Cited By

  • (2024) CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation. Proceedings of the ACM Turing Award Celebration Conference - China 2024, 120–125. DOI: 10.1145/3674399.3674447. Online publication date: 5-Jul-2024.


Published In
ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023
2215 pages
ISBN:9798400703270
DOI:10.1145/3611643
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Code dataset
  2. Neural code completion models
  3. Watermarking

Qualifiers

  • Research-article

Conference

ESEC/FSE '23
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

