DOI: 10.1145/3533767.3534383

Improving cross-platform binary analysis using representation learning via graph alignment

Published: 18 July 2022

Abstract

Cross-platform binary analysis requires a common representation of binaries across platforms, on which a specific analysis can be performed. Recent work proposed to learn low-dimensional, numeric vector representations (i.e., embeddings) of disassembled binary code, and to perform binary analysis in the embedding space. However, existing techniques fall short: they are either (i) specific to a single platform, producing embeddings that are not aligned across platforms, or (ii) not designed to capture the rich contextual information available in a disassembled binary.
We present a novel deep learning-based method, XBA, which addresses both problems. We first abstract binaries as typed graphs, dubbed binary disassembly graphs (BDGs), which encode control flow and other rich contextual information about the different entities found in a disassembled binary, including basic blocks, external functions called, and string literals referenced. We then formulate binary code representation learning as a graph alignment problem, i.e., finding the node correspondences between BDGs extracted from two binaries compiled for different platforms. XBA uses graph convolutional networks to learn the semantics of each node by (i) using its rich contextual information encoded in the BDG, and (ii) aligning its embeddings across platforms. Our formulation allows XBA to learn semantic alignments between two BDGs in a semi-supervised manner, requiring only a limited number of node pairs to be aligned across platforms for training. Our evaluation shows that XBA can learn semantically rich embeddings of binaries aligned across platforms without a priori platform-specific knowledge. When trained with only 50% of the oracle alignments, XBA predicted, on average, 75% of the rest. Our case studies further show that the learned embeddings encode knowledge useful for cross-platform binary analysis.
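The alignment formulation described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: the two four-node graphs, the one-hot node-type features, the untrained identity weight matrix, the L1 distance, and the margin-based hinge loss are all assumptions made for illustration. It applies one graph-convolution layer with shared weights to two small typed graphs, evaluates a hinge loss on seed pairs, and ranks candidates for held-out pairs by embedding distance.

```python
import math

def normalized_adjacency(n, edges):
    # Build A + I for an undirected graph, then scale to D^-1/2 (A + I) D^-1/2.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(x, y):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def gcn_layer(a_hat, h, w):
    # One graph-convolution layer: ReLU(A_hat @ H @ W).
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_hat, h), w)]

def l1(u, v):
    return sum(abs(p - q) for p, q in zip(u, v))

def margin_loss(emb_a, emb_b, seeds, negatives, margin):
    # Hinge loss: each seed pair should be closer than its mismatched pair by `margin`.
    total = 0.0
    for (i, j), (ni, nj) in zip(seeds, negatives):
        total += max(0.0, l1(emb_a[i], emb_b[j]) + margin - l1(emb_a[ni], emb_b[nj]))
    return total

def hits_at_k(emb_a, emb_b, pairs, k=1):
    # Fraction of pairs whose true counterpart is among the k nearest candidates.
    hits = 0
    for i, j in pairs:
        ranked = sorted(range(len(emb_b)), key=lambda c: l1(emb_a[i], emb_b[c]))
        hits += j in ranked[:k]
    return hits / len(pairs)

# Hypothetical node types, loosely following the entity kinds named in the abstract.
TYPE = {"block": [1.0, 0.0, 0.0], "extern": [0.0, 1.0, 0.0], "string": [0.0, 0.0, 1.0]}

# Toy graph for "platform A": two basic blocks, one external call, one string reference.
types_a = ["block", "block", "extern", "string"]
edges_a = [(0, 1), (1, 2), (0, 3)]
# Toy graph for "platform B": the same structure with node indices permuted.
types_b = ["extern", "string", "block", "block"]
edges_b = [(2, 3), (3, 0), (2, 1)]
oracle = [(0, 2), (1, 3), (2, 0), (3, 1)]  # ground-truth node correspondences

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # shared, untrained weights
emb_a = gcn_layer(normalized_adjacency(4, edges_a), [TYPE[t] for t in types_a], w)
emb_b = gcn_layer(normalized_adjacency(4, edges_b), [TYPE[t] for t in types_b], w)

seeds, held_out = oracle[:2], oracle[2:]  # half for "training", half for evaluation
negatives = [(0, 3), (1, 2)]              # one mismatched pair per seed
loss = margin_loss(emb_a, emb_b, seeds, negatives, margin=0.5)
score = hits_at_k(emb_a, emb_b, held_out, k=1)
print(loss, score)
```

In a real setting the weight matrix would be trained by minimizing such a loss over the seed pairs, and the two graphs would differ in structure; the toy graphs here are isomorphic so that the ranking step is easy to follow.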

Published In

ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2022
808 pages
ISBN:9781450393799
DOI:10.1145/3533767
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Binary analysis
  2. Cross-platform
  3. Graph alignment

Qualifiers

  • Research-article

Conference

ISSTA '22
Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Cited By

  • (2024) Analysis of Decompiled Program Code Using Abstract Syntax Trees. Automatic Control and Computer Sciences 57:8 (958-967). DOI: 10.3103/S0146411623080060. Online publication date: 29-Feb-2024
  • (2024) CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking. Proceedings of the ACM on Software Engineering 1:FSE (562-585). DOI: 10.1145/3643752. Online publication date: 12-Jul-2024
  • (2024) BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639100. Online publication date: 20-May-2024
  • (2024) Comprehensive Security Analysis and Threat Mitigation Strategies for React.js Applications: Leveraging SonarQube for Robust Security Assurance. 2024 IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC) (1-6). DOI: 10.1109/KHI-HTC60760.2024.10482157. Online publication date: 8-Jan-2024
  • (2024) Firm-Vehicle: Trusted Communication Enabled Instruction Embedding Model for Resource-Constrained VANET Environments. Advanced Intelligent Computing Technology and Applications (391-402). DOI: 10.1007/978-981-97-5603-2_32. Online publication date: 1-Aug-2024
  • (2023) BlockMatch: A Fine-Grained Binary Code Similarity Detection Approach Using Contrastive Learning for Basic Block Matching. Applied Sciences 13:23 (12751). DOI: 10.3390/app132312751. Online publication date: 28-Nov-2023
  • (2023) PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (401-412). DOI: 10.1145/3611643.3616301. Online publication date: 30-Nov-2023
  • (2023) Third-Party Library Dependency for Large-Scale SCA in the C/C++ Ecosystem: How Far Are We? Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (1383-1395). DOI: 10.1145/3597926.3598143. Online publication date: 12-Jul-2023
  • (2023) ConFunc: Enhanced Binary Function-Level Representation through Contrastive Learning. 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (1241-1248). DOI: 10.1109/TrustCom60117.2023.00169. Online publication date: 1-Nov-2023
  • (2022) NeuDep: neural binary memory dependence analysis. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (747-759). DOI: 10.1145/3540250.3549147. Online publication date: 7-Nov-2022
