
DOI: 10.1145/3551349.3561168

Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

Published: 05 January 2023

Abstract

Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use, in part because the source code corpora used to train them can contain noisy samples. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods, developed in the machine learning community, evaluate the similarity of a target sample to a set of correct samples in order to determine whether the target sample is noisy. Our evaluation shows that data-influence methods can identify noisy samples in the training data of neural code models on classification-based tasks. This approach contributes to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing source code models that are useful in practice.
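To make the idea concrete, the sketch below shows one way a data-influence score of this kind can be computed. It follows the TracIn approach of Pruthi et al. (NeurIPS 2020), which approximates the influence of one sample on another by the dot product of their loss gradients at a model checkpoint; a target sample whose gradient disagrees with those of trusted, correctly labelled samples is flagged as potentially noisy. This is a minimal single-checkpoint illustration in PyTorch, not the authors' implementation, and the names model, loss_fn, target, and reference_set are placeholder assumptions.

    import torch

    def loss_grad(model, loss_fn, x, y):
        """Flattened gradient of the per-sample loss w.r.t. all trainable parameters."""
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(
            loss, [p for p in model.parameters() if p.requires_grad])
        return torch.cat([g.reshape(-1) for g in grads])

    def influence_score(model, loss_fn, target, reference_set):
        """Mean gradient similarity between `target` and trusted reference samples.

        A low or negative score means the target's gradient points away from
        those of the correctly labelled references, marking it as potentially noisy.
        """
        x_t, y_t = target
        g_t = loss_grad(model, loss_fn, x_t, y_t)
        sims = [torch.dot(g_t, loss_grad(model, loss_fn, x, y))
                for x, y in reference_set]
        return torch.stack(sims).mean().item()

    # Hypothetical usage: rank training samples and review the lowest scorers.
    # scores = [influence_score(model, loss_fn, (x, y), trusted_samples)
    #           for x, y in training_corpus]

In the full TracIn formulation the dot products are summed over several training checkpoints and weighted by the learning rate; the single-checkpoint version above is kept deliberately small.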


Cited By

  • (2024) An Empirical Study on Noisy Label Learning for Program Understanding. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-12. https://doi.org/10.1145/3597503.3639217. Online publication date: 20-May-2024.



          Published In

          ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
          October 2022
          2006 pages
          ISBN:9781450394758
          DOI:10.1145/3551349

          Publisher

Association for Computing Machinery, New York, NY, United States



          Qualifiers

          • Short-paper
          • Research
          • Refereed limited

          Conference

          ASE '22

          Acceptance Rates

Overall Acceptance Rate: 82 of 337 submissions, 24%


          Article Metrics

• Downloads (last 12 months): 37
• Downloads (last 6 weeks): 2
Reflects downloads up to 26 Nov 2024

