research-article

Open access

Effectiveness of ChatGPT for Static Analysis: How Far Are We?

Authors:

Mohammad Mahdi Mohajer,

Nima Shiri Harzevili,

Alvine Boaye Belle,

Hung Viet Pham,

Song Wang

AIware 2024: Proceedings of the 1st ACM International Conference on AI-Powered Software

Pages 151 - 160

https://doi.org/10.1145/3664646.3664777

Published: 10 July 2024

Abstract

This paper presents a novel study of the capabilities of ChatGPT, a state-of-the-art LLM, in two static analysis tasks: static bug detection and false-positive warning removal. Our evaluation focuses on two typical and critical bug types targeted by static bug detection, namely Null Dereference and Resource Leak. We use Infer, a well-established static analyzer, to help collect instances of these two bug types from 10 open-source projects. The resulting experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that ChatGPT can achieve remarkable performance in both tasks. In static bug detection, ChatGPT achieves accuracy and precision of up to 68.37% and 63.76% for detecting Null Dereference bugs and 76.95% and 82.73% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, ChatGPT reaches a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs, surpassing existing state-of-the-art false-positive warning removal tools.
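For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and precision are conventionally computed from a detector's confusion counts. It assumes the standard definitions; the abstract does not spell out the paper's exact formulas, and the class name, function names, and counts below are hypothetical illustrations, not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary bug detector (hypothetical container)."""
    tp: int  # warnings raised on code that really contains the bug
    fp: int  # warnings raised on code that does not contain the bug
    tn: int  # bug-free code that was correctly left unflagged
    fn: int  # real bugs that the detector missed


def accuracy(c: ConfusionCounts) -> float:
    """Share of all decisions (flagged or not) that were correct."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of raised warnings that point at real bugs."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


# Illustrative counts only; these are NOT results from the paper.
counts = ConfusionCounts(tp=30, fp=10, tn=50, fn=10)
print(f"accuracy={accuracy(counts):.2%}, precision={precision(counts):.2%}")
```

Under these definitions, removing false-positive warnings lowers `fp`, which raises precision without necessarily changing how many real bugs are found.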


Index Terms

Effectiveness of ChatGPT for Static Analysis: How Far Are We?
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging
  2. Software organization and properties
    1. Software functional properties
      1. Formal methods
        Automated static analysis


Information & Contributors

Information

Published In


AIware 2024: Proceedings of the 1st ACM International Conference on AI-Powered Software

July 2024

182 pages

ISBN:9798400706851

DOI:10.1145/3664646

General Chairs: Bram Adams (Queen’s University, Canada), Thomas Zimmermann (Microsoft Research, USA), Ipek Ozkaya (Carnegie Mellon University, USA)

Program Chairs: Dayi Lin (Huawei, Canada), Jie M. Zhang (King’s College London, United Kingdom)

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States


Badges

Artifacts Available / v1.1


Conference

AIware '24

Sponsor:

SIGSOFT

AIware '24: 1st ACM International Conference on AI-Powered Software

July 15 - 16, 2024

Porto de Galinhas, Brazil
