Research Article • Open Access

ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification

Published: 12 July 2024

Abstract

Large Language Models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in automatically generating code from provided natural language requirements. However, in real-world practice, it is inevitable that the requirements written by users are ambiguous or insufficient. Current LLMs directly generate programs from such unclear requirements without any interactive clarification, and the resulting code is likely to deviate from the original user intent. To bridge this gap, we introduce a novel framework named ClarifyGPT, which aims to enhance code generation by empowering LLMs with the ability to identify ambiguous requirements and ask targeted clarifying questions. Specifically, ClarifyGPT first detects whether a given requirement is ambiguous by performing a code consistency check. If it is ambiguous, ClarifyGPT prompts an LLM to generate targeted clarifying questions. After receiving the responses to those questions, ClarifyGPT refines the ambiguous requirement and feeds it into the same LLM to generate a final code solution. To evaluate ClarifyGPT, we invite ten participants to use it for code generation on two benchmarks: MBPP-sanitized and MBPP-ET. The results show that ClarifyGPT elevates the performance (Pass@1) of GPT-4 from 70.96% to 80.80% on MBPP-sanitized. Furthermore, to conduct large-scale automated evaluations of ClarifyGPT across different LLMs and benchmarks without requiring user participation, we introduce a high-fidelity simulation method for user responses. The results demonstrate that ClarifyGPT significantly enhances code generation performance compared to the baselines. In particular, ClarifyGPT improves the average performance of GPT-4 and ChatGPT across five benchmarks from 62.43% to 69.60% and from 54.32% to 62.37%, respectively. A human evaluation also confirms the effectiveness of ClarifyGPT in detecting ambiguous requirements and generating high-quality clarifying questions. We believe that ClarifyGPT can effectively facilitate the practical application of LLMs in real-world development environments.
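
To make the workflow concrete, the sketch below outlines in plain Python the four stages the abstract describes: sample several candidate solutions, flag the requirement as ambiguous when the samples behave inconsistently on shared test inputs, refine the requirement with the answers to clarifying questions, and regenerate. The helpers `llm_complete`, `run_on_inputs`, and `ask_user`, the prompts, and the `solution` naming convention are all illustrative assumptions for this sketch, not the paper's actual implementation.

```python
from typing import Callable

def llm_complete(prompt: str, n: int = 1) -> list[str]:
    """Hypothetical LLM call (e.g., to ChatGPT or GPT-4) returning n sampled completions."""
    raise NotImplementedError("wire this up to an LLM provider")

def run_on_inputs(code: str, test_inputs: list) -> list:
    """Execute one candidate solution on shared test inputs and collect its outputs."""
    namespace: dict = {}
    exec(code, namespace)  # assumes each sample defines a function named `solution`
    return [namespace["solution"](x) for x in test_inputs]

def clarify_gpt(requirement: str, ask_user: Callable[[str], str], n_samples: int = 5) -> str:
    # Step 1: sample several candidate solutions for the raw requirement.
    candidates = llm_complete(
        f"Write a Python function `solution` for:\n{requirement}", n=n_samples
    )

    # Step 2: code consistency check -- if the samples disagree on the same
    # test inputs, treat the requirement as ambiguous.
    test_inputs = [...]  # placeholder; the paper derives test inputs automatically
    behaviours = {repr(run_on_inputs(c, test_inputs)) for c in candidates}

    if len(behaviours) > 1:
        # Step 3: ask targeted clarifying questions and refine the requirement.
        questions = llm_complete(
            f"The following requirement is ambiguous; ask targeted clarifying "
            f"questions about it:\n{requirement}"
        )[0]
        answers = ask_user(questions)  # real users, or simulated responses
        requirement = f"{requirement}\nClarifications:\n{answers}"

    # Step 4: generate the final solution from the (possibly refined) requirement.
    return llm_complete(f"Write a Python function `solution` for:\n{requirement}")[0]
```

In the paper's large-scale automated evaluation, `ask_user` would be played by the high-fidelity user-response simulator rather than a human participant.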

Published In

Proceedings of the ACM on Software Engineering, Volume 1, Issue FSE
July 2024, 2770 pages
EISSN: 2994-970X
DOI: 10.1145/3554322

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 12 July 2024
Published in PACMSE Volume 1, Issue FSE

Author Tags

1. Code Generation
2. Large Language Model
3. Prompt Engineering

Qualifiers

• Research-article

Funding Sources

• Youth Innovation Promotion Association Chinese Academy of Sciences
• Basic Research Program of ISCAS
• Major Program of ISCAS
• National Natural Science Foundation of China

