DOI: 10.1145/3664646.3664777
research-article
Open access

Effectiveness of ChatGPT for Static Analysis: How Far Are We?

Published: 10 July 2024

Abstract

This paper presents a novel study of the capabilities of ChatGPT, a state-of-the-art LLM, in static analysis tasks, namely static bug detection and false-positive warning removal. Our evaluation focuses on two typical and critical bug types targeted by static bug detection: Null Dereference and Resource Leak. We use Infer, a well-established static analyzer, to help gather instances of these two bug types from 10 open-source projects. The resulting dataset contains 222 Null Dereference bugs and 46 Resource Leak bugs. Our study shows that ChatGPT achieves remarkable performance on both tasks. In static bug detection, ChatGPT achieves accuracy and precision of up to 68.37% and 63.76% for Null Dereference bugs and 76.95% and 82.73% for Resource Leak bugs, improving on the precision of Infer, the current leading bug detector, by 12.86% and 43.13%, respectively. In false-positive warning removal, ChatGPT reaches a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs, surpassing existing state-of-the-art false-positive warning removal tools.
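The abstract reports accuracy and precision without restating how they are defined. As a reminder, the sketch below computes both from confusion-matrix counts using the standard definitions. The function name and the counts in the usage example are hypothetical; they are not taken from the paper's dataset or replication package.

```python
def accuracy_and_precision(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, precision) from confusion-matrix counts.

    tp: warnings flagged as bugs that are real bugs
    fp: warnings flagged as bugs that are not bugs
    tn: warnings dismissed that are indeed not bugs
    fn: warnings dismissed that are real bugs
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    # Accuracy: share of all decisions that were correct.
    accuracy = (tp + tn) / total
    # Precision: share of flagged items that were real bugs.
    flagged = tp + fp
    precision = tp / flagged if flagged else 0.0
    return accuracy, precision


# Hypothetical counts, chosen only to illustrate the arithmetic.
acc, prec = accuracy_and_precision(tp=8, fp=2, tn=6, fn=4)
print(f"accuracy={acc:.2%}, precision={prec:.2%}")
```

With these made-up counts, 14 of 20 decisions are correct and 8 of 10 flagged items are real bugs. The example shows that precision ignores missed bugs (`fn`), which is why the abstract reports it alongside accuracy.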



Published In

AIware 2024: Proceedings of the 1st ACM International Conference on AI-Powered Software
July 2024
182 pages
ISBN: 9798400706851
DOI: 10.1145/3664646
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ChatGPT
  2. Large language models
  3. Static analysis

Conference

AIware '24