
DOI: 10.1109/ASE51524.2021.9678559

On multi-modal learning of editing source code

Published: 24 June 2022

Abstract

In recent years, neural machine translation (NMT) has shown promise in automatically editing source code. A typical NMT-based code editor takes only the code that needs to be changed as input and presents developers with a ranked list of candidate patches to choose from, where the correct one may not always appear at the top. While NMT-based code-editing systems generate a broad spectrum of plausible patches, the correct one depends on the developers' requirements and often on the context in which the patch is applied. Thus, NMT models can benefit from hints that developers provide, whether in natural language or as patch context.
As a proof of concept, in this research we leverage three modalities of information to automatically generate edits with NMT models: the edit location, the code context around the edit, and the commit message (as a proxy for the developer's hint in natural language). To that end, we build Modit, a multi-modal NMT-based code-editing engine. Through in-depth investigation and analysis, we show that developers' hints, as an input modality, can narrow the search space for patches and allow Modit to outperform state-of-the-art models by generating correctly patched code in the top-1 position.
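
To make the multi-modal input concrete, the sketch below shows one way the three modalities could be serialized into a single sequence for a pretrained sequence-to-sequence model and decoded with beam search into a ranked list of patches. This is a minimal illustration under assumptions, not the authors' released implementation: the uclanlp/plbart-base checkpoint, the separator convention, and the example strings are all stand-ins chosen for demonstration.

    # Minimal sketch (assumed setup, not Modit's exact pipeline): serialize
    # three modalities -- code to edit, surrounding context, and a commit
    # message as a natural-language hint -- into one encoder input.
    # Requires: pip install transformers sentencepiece torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    checkpoint = "uclanlp/plbart-base"  # assumed choice of pretrained model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    code_to_edit = "if (x == null) return;"  # the edit location
    context = "void check(Object x) { if (x == null) return; use(x); }"
    commit_message = "Throw an exception instead of silently returning on null"

    # Join the modalities with the tokenizer's separator token so the
    # encoder can attend across all three at once.
    sep = tokenizer.sep_token or "</s>"
    source = f"{code_to_edit} {sep} {context} {sep} {commit_message}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True)

    # Beam search yields a ranked list of candidate patches; the paper's
    # claim is that a good hint pushes the correct patch toward rank 1.
    outputs = model.generate(
        **inputs, num_beams=5, num_return_sequences=5, max_length=64
    )
    for rank, ids in enumerate(outputs, start=1):
        print(rank, tokenizer.decode(ids, skip_special_tokens=True))

Under this scheme, dropping the commit message from the serialized source gives a single-modality baseline, which is one way to probe how much the natural-language hint narrows the beam.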


Published In

ASE '21: Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering
November 2021
1446 pages
ISBN: 9781665403375

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. automated programming
  2. neural machine translator
  3. neural networks
  4. pretraining
  5. source code edit
  6. transformers

Qualifiers

  • Research-article

Conference

ASE '21

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 1

Reflects downloads up to 27 Feb 2025

Cited By

  • (2024) Unsupervised evaluation of code LLMs with round-trip correctness. Proceedings of the 41st International Conference on Machine Learning, pp. 1050-1066, 10.5555/3692070.3692114. Online publication date: 21-Jul-2024.
  • (2024) Improving Source Code Pre-Training via Type-Specific Masking. ACM Transactions on Software Engineering and Methodology, 34(3), pp. 1-34, 10.1145/3699599. Online publication date: 11-Oct-2024.
  • (2024) Automated Commit Intelligence by Pre-training. ACM Transactions on Software Engineering and Methodology, 33(8), pp. 1-30, 10.1145/3674731. Online publication date: 1-Jul-2024.
  • (2024) Automatic Programming vs. Artificial Intelligence. Proceedings of the 1st ACM International Conference on AI-Powered Software, pp. 144-146, 10.1145/3664646.3664775. Online publication date: 10-Jul-2024.
  • (2024) CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 466-478, 10.1145/3650212.3652142. Online publication date: 11-Sep-2024.
  • (2024) Towards AI-Assisted Synthesis of Verified Dafny Methods. Proceedings of the ACM on Software Engineering, 1(FSE), pp. 812-835, 10.1145/3643763. Online publication date: 12-Jul-2024.
  • (2024) On the Reliability and Explainability of Language Models for Program Generation. ACM Transactions on Software Engineering and Methodology, 33(5), pp. 1-26, 10.1145/3641540. Online publication date: 3-Jun-2024.
  • (2024) Automated Code Editing With Search-Generate-Modify. IEEE Transactions on Software Engineering, 50(7), pp. 1675-1686, 10.1109/TSE.2024.3376387. Online publication date: 1-Jul-2024.
  • (2023) A Survey of Learning-based Automated Program Repair. ACM Transactions on Software Engineering and Methodology, 33(2), pp. 1-69, 10.1145/3631974. Online publication date: 23-Dec-2023.
  • (2023) LExecutor: Learning-Guided Execution. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1522-1534, 10.1145/3611643.3616254. Online publication date: 30-Nov-2023.
