

Revisiting file context for source code summarization

Published: 27 July 2024

Abstract

Source code summarization is the task of writing natural language descriptions of source code. A typical use case is generating short summaries of subroutines for use in API documentation. The heart of almost all current research into code summarization is the encoder–decoder neural architecture, and the encoder input is almost always a single subroutine or other short code snippet. The problem with this setup is that the information needed to describe the code is often not present in the code itself—that information often resides in other nearby code. In this paper, we revisit the idea of “file context” for code summarization. File context is the idea of encoding select information from other subroutines in the same file. We propose a novel modification of the Transformer architecture that is purpose-built to encode file context and demonstrate its improvement over several baselines. We find that file context helps on a subset of challenging examples where traditional approaches struggle.
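The abstract describes "file context" as encoding select information from other subroutines in the same file alongside the target subroutine. As a hedged illustration only (the paper's actual Transformer modification is not reproduced here), the core idea can be sketched as one attention step in which the target subroutine's token embeddings query a memory that also contains embeddings from sibling subroutines; the names `attend` and `softmax`, the dimension `d_model`, and the random embeddings are all illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention (Vaswani et al. 2017)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d_model = 16
target = rng.normal(size=(6, d_model))    # embeddings of the target subroutine's tokens
context = rng.normal(size=(20, d_model))  # embeddings of tokens from sibling subroutines

# The target tokens attend over a memory spanning both the subroutine
# itself and the rest of the file, so summary-relevant information
# outside the subroutine can flow into its representation.
memory = np.concatenate([target, context], axis=0)
out = attend(target, memory, memory)
print(out.shape)  # (6, 16)
```

In a full encoder-decoder model this attention step would sit inside the encoder, with learned projections for queries, keys, and values; the sketch omits those to keep the file-context mechanism visible.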




Published In

Automated Software Engineering  Volume 31, Issue 2
Nov 2024
1350 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 27 July 2024
Accepted: 07 July 2024
Received: 17 August 2023

Author Tags

  1. Source code summarization
  2. Program comprehension
  3. Software and its documentation
  4. Information systems
  5. Natural language processing
  6. Machine translation

Qualifiers

  • Research-article

