poster

Public Access

Comprehensive comparisons of embedding approaches for cryptographic API completion: poster

Authors:

Ya Xiao,

Salman Ahmed,

Xinyang Ge,

Bimal Viswanath,

Na Meng,

Danfeng (Daphne) YaoAuthors Info & Claims

ICSE '22: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings

Pages 360 - 361

https://doi.org/10.1145/3510454.3528645

Published: 19 October 2022 Publication History

PDF eReader

Abstract

In this paper, we conduct a measurement study to comprehensively compare the accuracy of Cryptographic API completion tasks trained with multiple API embedding options. Embedding is the process of automatically learning to represent program elements as low-dimensional vectors. Our measurement aims to uncover the impacts of applying program analysis, token-level embedding, and sequence-level embedding on the Cryptographic API completion accuracies. Our findings show that program analysis is necessary even under advanced embedding. The results show 36.10% accuracy improvement on average when program analysis preprocessing is applied to transfer byte code sequences into API dependence paths. The best accuracy (93.52%) is achieved on API dependence paths with embedding techniques. On the contrary, the pure data-driven approach without program analysis only achieves a low accuracy (around 57.60%), even after the powerful sequence-level embedding is applied. Although sequence-level embedding shows slight accuracy advantages (0.55% on average) over token-level embedding in our basic data split setting, it is not recommended considering its expensive training cost. A more obvious accuracy improvement (5.10%) from sequence-level embedding is observed under the cross-project learning scenario when task data is insufficient. Hence, we recommend applying sequence-level embedding for cross-project learning with limited task-specific data.

References

[1]

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1--29.

Digital Library

Google Scholar

[2]

Steven HH Ding, Benjamin CM Fung, and Philippe Charland. 2019. Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 472--489.

Crossref

Google Scholar

[3]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).

Google Scholar

[4]

Jordan Henkel, Shuvendu K Lahiri, Ben Liblit, and Thomas Reps. 2018. Code vectors: Understanding programs through embedded abstracted symbolic traces. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 163--174.

Digital Library

Google Scholar

[5]

Na Meng, Stefan Nagy, Danfeng Yao, Wenjie Zhuang, and Gustavo Arango-Argoty. 2018. Secure coding practices in Java: Challenges and vulnerabilities. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 372--383.

Digital Library

Google Scholar

[6]

Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, and Tien N Nguyen. 2017. Exploring API embedding for API usages and applications. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 438--449.

Digital Library

Google Scholar

[7]

Sazzadur Rahaman, Ya Xiao, Sharmin Afrose, Fahad Shaon, Ke Tian, Miles Frantz, Murat Kantarcioglu, and Danfeng Yao. 2019. Cryptoguard: High precision detection of cryptographic vulnerabilities in massive-sized java projects. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 2455--2472.

Digital Library

Google Scholar

[8]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008.

Google Scholar

[9]

Fei Zuo, Xiaopeng Li, Patrick Young, Lannan Luo, Qiang Zeng, and Zhexin Zhang. 2019. Neural machine translation inspired binary code similarity comparison beyond function pairs.

Google Scholar

Index Terms

Comprehensive comparisons of embedding approaches for cryptographic API completion: poster

Index terms have been assigned to the content through auto-classification.

Recommendations

Measurement of Embedding Choices on Cryptographic API Completion Tasks
In this article, we conduct a measurement study to comprehensively compare the accuracy impacts of multiple embedding options in cryptographic API completion tasks. Embedding is the process of automatically learning vector representations of program ...
Lossless data embedding: new paradigm in digital watermarking

One common drawback of virtually all current data embedding methods is the fact that the original image is inevitably distorted due to data embedding itself. This distortion typically cannot be removed completely due to quantization, bit-replacement, or ...
Reversible Data Embedding for Tamper-Proof Watermarks
ICICIC '06: Proceedings of the First International Conference on Innovative Computing, Information and Control - Volume 3

In this paper, a novel reversible data embedding for tamper-proof watermarks is proposed. A reversible watermark is embedded into robust watermark in the discrete wavelet transform (DWT) domain using a feature map and a location map. Generally, the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

ICSE '22: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings

May 2022

394 pages

ISBN:9781450392235

DOI:10.1145/3510454

General Chair:
Matthew B Dwyer
University of Virginia

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2022

Check for updates

Qualifiers

Poster

Funding Sources

NSF (National Science Foundation)

Conference

ICSE '22

Sponsor:

SIGSOFT

ICSE '22: 44th International Conference on Software Engineering

May 21 - 29, 2022

Pennsylvania, Pittsburgh

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Upcoming Conference

ICSE 2025

2025 IEEE/ACM 46th International Conference on Software Engineering

April 26 - May 3, 2025

Ottawa , ON , Canada

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
125
Total Downloads

Downloads (Last 12 months)81
Downloads (Last 6 weeks)17

Reflects downloads up to 09 Nov 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Index Terms

Recommendations

Measurement of Embedding Choices on Cryptographic API Completion Tasks

Lossless data embedding: new paradigm in digital watermarking

Reversible Data Embedding for Tamper-Proof Watermarks