
DOI: 10.1145/3611643.3616320

Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice

Published: 30 November 2023

Abstract

Software engineering (SE) techniques increasingly rely on deep learning approaches to support many SE tasks, from bug triaging to code generation. To assess the efficacy of such techniques, researchers typically perform controlled experiments. Conducting these experiments, however, is particularly challenging given the complexity of the space of variables involved, from specialized and intricate architectures and algorithms to a large number of training hyper-parameters and choices of evolving datasets, all compounded by how rapidly machine learning technology is advancing and by the inherent sources of randomness in the training process. In this work we conduct a mapping study, examining 194 experiments with techniques that rely on deep neural networks (DNNs) appearing in 55 papers published in premier SE venues, to characterize the state of the practice and to pinpoint common trends and pitfalls in these experiments. Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings. More specifically, we find: 1) weak analyses to determine that there is a true relationship between independent and dependent variables (87% of the experiments); 2) limited control over the space of DNN-relevant variables, which can render a relationship between dependent variables and treatments that may not be causal but rather correlational (100% of the experiments); and 3) a lack of specificity about which DNN variables and values are utilized in the experiments to define the treatments being applied (86% of the experiments), which makes it unclear whether the techniques designed are the ones being assessed, or how the sources of extraneous variation are controlled. We provide practical recommendations to address these limitations.
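
As a rough illustration of the kind of analysis the first two findings call for (a sketch under stated assumptions, not the procedure used or prescribed in the paper), the snippet below compares two hypothetical DNN4SE techniques across repeated seeded training runs using a non-parametric test and an effect-size estimate. The score arrays are synthetic placeholders standing in for real training and evaluation results.

import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder scores: in a real experiment each value would come from training
# and evaluating one model under a given treatment with a distinct random seed.
rng = np.random.default_rng(0)
scores_new = rng.normal(loc=0.83, scale=0.02, size=10)       # hypothetical new technique
scores_baseline = rng.normal(loc=0.81, scale=0.02, size=10)  # hypothetical baseline

# Non-parametric test: do runs of the new technique tend to score higher than
# runs of the baseline, beyond seed-induced variation?
stat, p_value = mannwhitneyu(scores_new, scores_baseline, alternative="greater")

# Vargha-Delaney A12 effect size: probability that a random run of the new
# technique beats a random run of the baseline (0.5 means no difference).
gt = (scores_new[:, None] > scores_baseline[None, :]).mean()
eq = (scores_new[:, None] == scores_baseline[None, :]).mean()
a12 = gt + 0.5 * eq

print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.4f}, A12 = {a12:.2f}")

Reporting a significance test together with an effect size over multiple seeded runs, rather than a single run per treatment, is one way to strengthen the evidence for a true relationship between the independent and dependent variables; documenting the remaining DNN variables and their values (architecture, hyper-parameters, dataset versions) likewise makes the treatments explicit and reproducible.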

Supplementary Material

Video (fse23main-p692-p-video.mp4)
"Binary similarity analysis determines if two binary executables are from the same source program. Existing techniques leverage static and dynamic program features and may utilize advanced Deep Learning techniques. Although they have demonstrated great potential, the community believes that a more effective representation of program semantics can further improve similarity analysis. In this paper, we propose a new method to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. More importantly, it ensures that the collected samples are comparable across binaries, addressing the substantial variations of input specifications. Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings, outperforming the baselines by 10-20%."

Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023
2215 pages
ISBN:9798400703270
DOI:10.1145/3611643
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 November 2023

Author Tags

  1. deep learning
  2. machine learning for software engineering
  3. software engineering experimentation

Qualifiers

  • Research-article

Funding Sources

  • MCIN/AEI/10.13039/501100011033, ERDF A way of making Europe
  • NSF

Conference

ESEC/FSE '23

Acceptance Rates

Overall Acceptance Rate: 112 of 543 submissions, 21%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 61
  • Downloads (Last 12 months): 61
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 04 Oct 2024
