LongCoder: a long-range pre-trained language model for code completion
Article No. 486, Pages 12098-12107
Abstract
In this paper, we introduce a new task for code completion that focuses on handling long code input and propose a sparse Transformer model, called LongCoder, to address this task. LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens, bridge tokens and memory tokens, to improve performance and efficiency. Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction, while memory tokens are included to highlight important statements that may be invoked later and need to be memorized, such as package imports and definitions of classes, functions, or structures. We conduct experiments on a newly constructed dataset that contains longer code context, as well as on the publicly available CodeXGLUE benchmark. Experimental results demonstrate that LongCoder achieves superior performance on code completion tasks compared to previous models while maintaining comparable computational efficiency during inference.
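The abstract describes the attention pattern only in prose. The sketch below (Python/NumPy, not taken from the paper's released code) shows one way such a sparse mask could be assembled: a causal sliding window plus columns kept open for bridge and memory tokens. The window size and the bridge/memory positions are illustrative assumptions.

```python
# A minimal sketch, assuming a LongCoder-style sparse attention pattern:
# causal sliding-window attention plus globally readable bridge and memory
# tokens. Illustrative only; window size and token positions are made up.
import numpy as np

def sparse_attention_mask(seq_len, window, bridge_positions, memory_positions):
    """Boolean (seq_len, seq_len) mask: [i, j] is True if query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    idx = np.arange(seq_len)

    # Causal sliding window: each token sees the `window` most recent tokens.
    dist = idx[:, None] - idx[None, :]
    mask |= (dist >= 0) & (dist < window)

    # Bridge tokens: aggregate preceding context and stay readable by all
    # later tokens, giving distant positions an indirect path to earlier code.
    for b in bridge_positions:
        mask[b, : b + 1] = True   # bridge token summarizes everything up to itself
        mask[b:, b] = True        # every later token can read the bridge token

    # Memory tokens: mark statements likely to be invoked later (imports,
    # class/function definitions) so they remain globally accessible.
    for m in memory_positions:
        mask[m:, m] = True        # every later token can read the memory token

    return mask

# Example: 16 tokens, local window of 4, bridge tokens at positions 7 and 15,
# and a memory token marking a (hypothetical) import statement at position 1.
print(sparse_attention_mask(16, window=4,
                            bridge_positions=[7, 15],
                            memory_positions=[1]).astype(int))
```

Because each row keeps only a fixed-size local window plus a handful of global columns, the number of attended positions grows roughly linearly with sequence length rather than quadratically, which is what allows longer code context at comparable inference cost.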
Information
Published in: July 2023 (43479 pages); Copyright © 2023
Publisher: JMLR.org
Publication history: Published 23 July 2023
Qualifiers: Research article; Research; Refereed limited