

Revisiting file context for source code summarization

Published: 27 July 2024

Abstract

Source code summarization is the task of writing natural language descriptions of source code. A typical use case is generating short summaries of subroutines for use in API documentation. The heart of almost all current research into code summarization is the encoder–decoder neural architecture, and the encoder input is almost always a single subroutine or other short code snippet. The problem with this setup is that the information needed to describe the code is often not present in the code itself—that information often resides in other nearby code. In this paper, we revisit the idea of “file context” for code summarization. File context is the idea of encoding select information from other subroutines in the same file. We propose a novel modification of the Transformer architecture that is purpose-built to encode file context and demonstrate its improvement over several baselines. We find that file context helps on a subset of challenging examples where traditional approaches struggle.
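The abstract describes "file context" as encoding select information from other subroutines in the same file alongside the target subroutine. As a hedged illustration only (the paper's actual Transformer modification is not reproduced here), the core idea can be sketched as one attention step in which the target subroutine's token embeddings query a memory that also contains embeddings from sibling subroutines; the names `attend` and `softmax`, the dimension `d_model`, and the random embeddings are all illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention (Vaswani et al. 2017)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d_model = 16
target = rng.normal(size=(6, d_model))    # embeddings of the target subroutine's tokens
context = rng.normal(size=(20, d_model))  # embeddings of tokens from sibling subroutines

# The target tokens attend over a memory spanning both the subroutine
# itself and the rest of the file, so summary-relevant information
# outside the subroutine can flow into its representation.
memory = np.concatenate([target, context], axis=0)
out = attend(target, memory, memory)
print(out.shape)  # (6, 16)
```

In a full encoder-decoder model this attention step would sit inside the encoder, with learned projections for queries, keys, and values; the sketch omits those to keep the file-context mechanism visible.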




Published In

Automated Software Engineering  Volume 31, Issue 2
Nov 2024
1350 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 27 July 2024
Accepted: 07 July 2024
Received: 17 August 2023

Author Tags

  1. Source code summarization
  2. Program comprehension
  3. Software and its documentation
  4. Information systems
  5. Natural language processing
  6. Machine translation

Qualifiers

  • Research-article

