
LongCoder: a long-range pre-trained language model for code completion

Published: 23 July 2023

Abstract

In this paper, we introduce a new task for code completion that focuses on handling long code input and propose a sparse Transformer model, called LongCoder, to address this task. LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens, bridge tokens and memory tokens, to improve performance and efficiency. Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction, while memory tokens are included to highlight important statements that may be invoked later and need to be memorized, such as package imports and definitions of classes, functions, or structures. We conduct experiments on a newly constructed dataset that contains longer code context and on the publicly available CodeXGLUE benchmark. Experimental results demonstrate that LongCoder achieves superior performance on code completion tasks compared to previous models while maintaining comparable efficiency in terms of computational resources during inference.
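
As a rough illustration of the attention pattern described above, the following minimal sketch (not taken from the paper) builds a boolean causal attention mask that combines a local sliding window with a few globally accessible positions standing in for bridge or memory tokens. The function name build_sparse_causal_mask, the window size, and the chosen global positions are illustrative assumptions, not LongCoder's actual implementation.

```python
# Minimal sketch of a sliding-window causal mask with globally accessible tokens.
# All names and values here are illustrative assumptions, not the paper's code.
import numpy as np

def build_sparse_causal_mask(seq_len, window, global_positions):
    """Return a boolean mask where mask[i, j] is True if token i may attend to token j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local sliding window: each token attends to itself and the previous tokens
        # inside the window.
        start = max(0, i - window + 1)
        mask[i, start:i + 1] = True
    for g in global_positions:
        # Globally accessible positions (standing in for bridge/memory tokens):
        # every later token can attend to position g, and position g can attend
        # to its full causal prefix, so causality is preserved.
        mask[g:, g] = True
        mask[g, :g + 1] = True
    return mask

# Example: 12 tokens, a window of 4, with positions 0 and 6 treated as global.
print(build_sparse_causal_mask(12, window=4, global_positions=[0, 6]).astype(int))
```

In the printed mask, row i shows which earlier positions token i may attend to: a short local window plus the designated global positions, which is the kind of sparsity that keeps attention cost manageable on long code inputs.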

Information

Published In

ICML'23: Proceedings of the 40th International Conference on Machine Learning
July 2023
43479 pages

Publisher

JMLR.org
