
Automated Commit Intelligence by Pre-training

Published: 22 November 2024

Abstract

GitHub commits, which record code changes together with natural language messages describing them, play a critical role in software developers' comprehension of software evolution. Owing to their importance in software development, several learning-based approaches have been proposed for GitHub commits, such as commit message generation and security patch identification. However, most existing works focus on customizing specialized neural networks for individual tasks. Inspired by code pre-trained models, whose effectiveness has been confirmed across diverse downstream tasks, and to promote the development of the open-source software community, we first collect a large-scale commit benchmark containing over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder–decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks spanning three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. We evaluate the model on one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task contributes to the model's performance.
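The abstract names contrastive learning as one of the three pre-training categories. The following is a minimal, illustrative sketch (not taken from the paper or its released code) of an InfoNCE-style contrastive objective with in-batch negatives over commit-fragment embeddings; the function name, embedding dimensions, and the way the two views are produced are assumptions made purely for illustration.

```python
# Illustrative sketch only: an InfoNCE-style contrastive loss with in-batch
# negatives, the general family of objective the abstract refers to. The
# encoder that produces the embeddings is assumed and not shown here.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """anchor, positive: (batch, dim) embeddings of two views of the same commit fragment."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # Similarity of every anchor against every positive in the batch;
    # diagonal entries are the true pairs, off-diagonals serve as negatives.
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Random stand-ins for pooled encoder states of original and perturbed commits.
    a = torch.randn(8, 768)
    b = torch.randn(8, 768)
    print(info_nce_loss(a, b).item())
```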


Cited By

  • (2024) Demystifying React Native Android Apps for Static Analysis. ACM Transactions on Software Engineering and Methodology. DOI: 10.1145/3702977. Online publication date: 12-Nov-2024.
  • (2024) Ratchet: Retrieval Augmented Transformer for Program Repair. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), 427–438. DOI: 10.1109/ISSRE62328.2024.00048. Online publication date: 28-Oct-2024.


Published In

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 8
November 2024
975 pages
EISSN: 1557-7392
DOI: 10.1145/3613733

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 November 2024
Online AM: 01 July 2024
Accepted: 05 June 2024
Revised: 30 May 2024
Received: 12 November 2023
Published in TOSEM Volume 33, Issue 8


Author Tags

  1. GitHub commit
  2. code pre-training model

Qualifiers

  • Research-article

Funding Sources

  • National Research Foundation (NRF), Singapore, and the Cyber Security Agency under its National Cybersecurity R & D Programme
  • NRF, Singapore, and DSO National Laboratories under the AI Singapore Programme
  • NRF Investigatorship

Article Metrics

  • Downloads (last 12 months): 412
  • Downloads (last 6 weeks): 69
Reflects downloads up to 10 Feb 2025

