DOI: 10.1145/3510003.3510125

Cross-domain deep code search with meta learning

Published: 05 July 2022

Abstract

Pre-trained programming language models such as CodeBERT have recently demonstrated substantial gains in code search. Despite their success, they rely on large amounts of parallel data to fine-tune the semantic mappings between queries and code, which limits their practicality in domain-specific languages where data are relatively scarce and expensive to obtain. In this paper, we propose CDCS, a novel approach for domain-specific code search. CDCS employs a transfer learning framework in which an initial program representation model is pre-trained on a large corpus of common programming languages (such as Java and Python) and is then adapted to domain-specific languages such as Solidity and SQL. Unlike cross-language CodeBERT, which is directly fine-tuned on the target language, CDCS adapts a few-shot meta-learning algorithm, MAML, to learn a good initialization of model parameters that can be effectively reused in a domain-specific language. We evaluate the approach on two domain-specific languages, Solidity and SQL, with models transferred from two widely used languages (Python and Java). Experimental results show that CDCS significantly outperforms conventional pre-trained code models that are directly fine-tuned on domain-specific languages, and that it is particularly effective when data are scarce.
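As context for the adaptation step described above, the sketch below shows the characteristic inner/outer-loop structure of MAML [10] applied to a query-code matching objective. This is a minimal PyTorch illustration, assuming a toy bi-encoder and random token batches in place of the paper's CodeBERT-based model and real meta-training tasks; the names, hyperparameters, and in-batch contrastive loss are all illustrative assumptions, not the authors' implementation.

```python
# Minimal MAML sketch for code search (illustrative; not the CDCS implementation).
# Assumes PyTorch >= 2.0 for torch.func.functional_call.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiEncoder(nn.Module):
    """Toy stand-in for a pre-trained code model: embeds token-id sequences
    (queries or code snippets) into a shared, L2-normalized vector space."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings

    def forward(self, token_ids):                      # (batch, seq_len) int64
        return F.normalize(self.embed(token_ids), dim=-1)

def matching_loss(model, queries, codes, params):
    """In-batch contrastive loss: each query should rank its paired snippet
    above the other snippets in the batch. functional_call evaluates the
    model under the (possibly adapted) parameters without mutating it."""
    q = torch.func.functional_call(model, params, (queries,))
    c = torch.func.functional_call(model, params, (codes,))
    logits = (q @ c.t()) / 0.07                        # scaled similarity matrix
    return F.cross_entropy(logits, torch.arange(q.size(0)))

def maml_step(model, meta_opt, tasks, inner_lr=0.01, inner_steps=1):
    """One meta-update: adapt a copy of the parameters on each task's support
    set, evaluate the adapted parameters on the task's query set, and
    backpropagate the averaged query loss through the adaptation itself."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for support, query in tasks:                       # one task per source domain
        params = dict(model.named_parameters())
        for _ in range(inner_steps):                   # inner loop: task adaptation
            loss = matching_loss(model, *support, params)
            grads = torch.autograd.grad(loss, list(params.values()),
                                        create_graph=True)  # keep graph for outer loop
            params = {name: p - inner_lr * g
                      for (name, p), g in zip(params.items(), grads)}
        meta_loss = meta_loss + matching_loss(model, *query, params)
    (meta_loss / len(tasks)).backward()                # outer loop: meta-update
    meta_opt.step()

# Demo with random token ids standing in for tokenized (query, code) pairs.
model = BiEncoder()
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = lambda: (torch.randint(0, 1000, (8, 16)), torch.randint(0, 1000, (8, 16)))
tasks = [(batch(), batch()) for _ in range(4)]         # (support, query) per task
maml_step(model, meta_opt, tasks)
# After meta-training, the learned initialization is fine-tuned as usual on
# the scarce domain-specific query-code pairs (e.g., Solidity or SQL).
```

The `create_graph=True` in the inner loop is what makes this full (second-order) MAML: the meta-gradient flows through the adaptation steps themselves. First-order approximations such as Reptile [24] drop that term to reduce cost.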

References

[1] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. arXiv:2103.06333 [cs.CL]
[2] Sushil Bajracharya, Trung Ngo, Erik Linstead, Yimeng Dou, Paul Rigor, Pierre Baldi, and Cristina Lopes. 2006. Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st ACM SIGPLAN symposium on Object-oriented programming systems, languages, and applications. 681--682.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[4] Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When deep learning met code search. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 964--974.
[5] Casey Casalnuovo, Kenji Sagae, and Prem Devanbu. 2018. Studying the Difference Between Natural and Programming Language Corpora. arXiv:1806.02437 [cs.CL]
[6] Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1. IEEE, 539--546.
[7] Premkumar T. Devanbu. 1995. On "A framework for source code search using program patterns". IEEE Transactions on Software Engineering 21, 12 (1995), 1009--1010.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
[9] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1536--1547.
[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. PMLR, 1126--1135.
[11] Wei Fu and Tim Menzies. 2017. Easy over hard: A case study on deep learning. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 49--60.
[12] Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. 2018. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437 (2018).
[13] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 933--944.
[14] Hamel Husain and Ho-Hsiang Wu. 2018. How to create natural language semantic search for arbitrary objects with deep learning. Retrieved November 5, 2019.
[15] Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]
[16] Christoph Lange and Michael Kohlhase. 2008. SWIM: A semantic wiki for mathematical knowledge management. In Emerging Technologies for Semantic Work Environments: Techniques, Methods, and Applications. IGI Global, 47--68.
[17] Otávio AL Lemos, Adriano C de Paula, Felipe C Zanichelli, and Cristina V Lopes. 2014. Thesaurus-based automatic query expansion for interface-driven code search. In Proceedings of the 11th Working Conference on Mining Software Repositories. 212--221.
[18] Wei Li, Haozhe Qin, Shuhan Yan, Beijun Shen, and Yuting Chen. 2020. Learning Code-Query Interaction for Enhancing Code Searches. In IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 115--126.
[19] Chao Liu, Xin Xia, David Lo, Zhiwei Liu, Ahmed E Hassan, and Shanping Li. 2020. Simplifying Deep-Learning-Based Model for Code Search. arXiv preprint arXiv:2005.14373 (2020).
[20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
[21] Meili Lu, Xiaobing Sun, Shaowei Wang, David Lo, and Yucong Duan. 2015. Query expansion via wordnet for effective code search. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 545--549.
[22] Fei Lv, Hongyu Zhang, Jian-Guang Lou, Shaowei Wang, Dongmei Zhang, and Jianjun Zhao. 2015. CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E). In 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). 260--270.
[23] Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 336--347.
[24] Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018).
[25] Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, and Yanfang Ye. 2021. CoTexT: Multi-task Learning with Code-Text Transformer. arXiv preprint arXiv:2105.08645 (2021).
[26] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[27] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019).
[28] Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on source code: a neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 31--41.
[29] Pasquale Salza, Christoph Schwizer, Jian Gu, and Harald C Gall. 2021. On the Effectiveness of Transfer Learning for Code Search. arXiv preprint arXiv:2108.05890 (2021).
[30] Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical Networks for Few-shot Learning. arXiv:1703.05175 [cs.LG]
[31] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. 2019. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 403--412.
[32] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1199--1208.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.
[34] Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a markov random field language model. arXiv preprint arXiv:1902.04094 (2019).
[35] Chaozheng Wang, Zhenhao Nong, Cuiyun Gao, Zongjie Li, Jichuan Zeng, Zhenchang Xing, and Yang Liu. 2022. Enriching query semantics for code search with reinforcement learning. Neural Networks 145 (2022), 22--32.
[36] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696--8708.
[37] Maximilian Wöhrer and Uwe Zdun. 2018. Smart contracts: security patterns in the ethereum ecosystem and solidity. In International Workshop on Blockchain Oriented Software Engineering (IWBOSE). IEEE, 2--8.
[38] Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, and Lingxiao Jiang. 2020. Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 344--354.
[39] Zhen Yang, Jacky Keung, Xiao Yu, Xiaodong Gu, Zhengyuan Wei, Xiaoxue Ma, and Miao Zhang. 2021. A Multi-Modal Transformer-based Code Summarization Approach for Smart Contracts. arXiv preprint arXiv:2103.07164 (2021).
[40] Ziyu Yao, Jayavardhan Reddy Peddamail, and Huan Sun. 2019. Coacor: Code annotation for code retrieval with reinforcement learning. In The World Wide Web Conference. 2203--2214.
[41] Xin Ye, Razvan Bunescu, and Chang Liu. 2014. Learning to rank relevant files for bug reports using domain knowledge. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 689--699.
[42] Wenpeng Yin. 2020. Meta-learning for Few-shot Natural Language Processing: A Survey. arXiv:2007.09604 [cs.CL]
[43] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
[44] Jakub Zakrzewski. 2018. Towards verification of Ethereum smart contracts: a formalization of core of Solidity. In Working Conference on Verified Software: Theories, Tools, and Experiments. Springer, 229--247.
[45] Qihao Zhu, Zeyu Sun, Xiran Liang, Yingfei Xiong, and Lu Zhang. 2020. OCoR: an overlapping-aware code retriever. In 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 883--894.



Published In

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022
2508 pages
ISBN:9781450392211
DOI:10.1145/3510003

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. code search
  2. deep learning
  3. few-shot learning
  4. meta learning
  5. pre-trained code models

Qualifiers

  • Research-article

Funding Sources

  • The National Natural Science Foundation of China
  • CCF-Baidu Open Fund

Conference

ICSE '22

Acceptance Rates

Overall acceptance rate: 276 of 1,856 submissions (15%)


