research-article

Seeing the Whole Elephant: Systematically Understanding and Uncovering Evaluation Biases in Automated Program Repair

Published: 27 April 2023

Abstract

Evaluation is the foundation of automated program repair (APR), as it provides empirical evidence on the strengths and weaknesses of APR techniques. However, the reliability of such evaluation is often threatened by various biases introduced during the evaluation process. Consequently, bias exploration, which uncovers biases in APR evaluation, has become a pivotal activity and has been performed since the early years, when the pioneering APR techniques were proposed. Unfortunately, there is still no methodology that supports the systematic comprehension and discovery of evaluation biases in APR, which impedes the mitigation of such biases and threatens the evaluation of APR techniques.
In this work, we propose to systematically understand existing evaluation biases by rigorously conducting the first systematic literature review of known biases, and to systematically uncover new biases by building a taxonomy that categorizes evaluation biases. As a result, we identify 17 biases that have already been investigated and uncover a new bias in the use of patch validation strategies. To validate this new bias, we devise and implement an executable framework, APRConfig, with which we evaluate three typical patch validation strategies using four representative heuristic-based and constraint-based APR techniques on three bug datasets. Overall, this article distills 13 findings for bias understanding, discovery, and validation. The systematic exploration performed and the open-source executable framework proposed in this article provide new insights, as well as an infrastructure, for the future exploration and mitigation of biases in APR evaluation.
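The abstract reports a bias in how patch validation strategies are used. As a rough intuition for why that choice can matter, the toy sketch below shows the same candidate patches receiving different verdicts under different strategies. It is not APRConfig and does not reproduce the three strategies evaluated in the article; the test IDs, candidate patches, and strategy names (`failing_only`, `full_suite`, `partial_regression`) are hypothetical. Each strategy decides which tests to rerun on a candidate patch, and a patch counts as plausible only if every rerun test passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate patch, described only by the tests it makes fail."""
    name: str
    fails: FrozenSet[str]


# Hypothetical test suite: one bug-exposing test and three regression tests.
FAILING_TESTS = ["t_fail_1"]
PASSING_TESTS = ["t_pass_1", "t_pass_2", "t_pass_3"]

# Hypothetical candidates: p1 passes everything, p2 breaks a late regression
# test, and p3 does not fix the bug at all.
CANDIDATES = [
    Candidate("p1", frozenset()),
    Candidate("p2", frozenset({"t_pass_3"})),
    Candidate("p3", frozenset({"t_fail_1"})),
]


def run_tests(c: Candidate, tests: List[str]) -> Tuple[bool, int]:
    """Run tests in order and stop at the first failure.

    Returns (all_passed, number_of_test_executions).
    """
    executed = 0
    for t in tests:
        executed += 1
        if t in c.fails:
            return False, executed
    return True, executed


# Illustrative strategies only (not the ones studied in the article).
STRATEGIES: Dict[str, Callable[[Candidate], Tuple[bool, int]]] = {
    # Rerun only the tests that exposed the bug.
    "failing_only": lambda c: run_tests(c, FAILING_TESTS),
    # Rerun the bug-exposing tests, then every originally passing test.
    "full_suite": lambda c: run_tests(c, FAILING_TESTS + PASSING_TESTS),
    # Rerun the bug-exposing tests plus a subset of the regression tests.
    "partial_regression": lambda c: run_tests(c, FAILING_TESTS + PASSING_TESTS[:1]),
}


def evaluate(
    strategy: Callable[[Candidate], Tuple[bool, int]], candidates: List[Candidate]
) -> Tuple[List[str], int]:
    """Return the names of plausible patches and the total number of test executions."""
    plausible: List[str] = []
    cost = 0
    for c in candidates:
        ok, n = strategy(c)
        cost += n
        if ok:
            plausible.append(c.name)
    return plausible, cost


if __name__ == "__main__":
    for name, strategy in STRATEGIES.items():
        print(name, evaluate(strategy, CANDIDATES))
```

Running it prints `failing_only (['p1', 'p2'], 3)`, `full_suite (['p1'], 9)`, and `partial_regression (['p1', 'p2'], 5)`: in this toy setup, patch `p2` counts as plausible under two strategies but not under the full suite, and the cost in test executions also differs, so comparing techniques validated under different strategies would not be a like-for-like comparison.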

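Cited By

  • (2025) GenProgJS: A Baseline System for Test-Based Automated Repair of JavaScript Programs. IEEE Transactions on Software Engineering 51, 2, 325–343. DOI: 10.1109/TSE.2024.3497798. Online publication date: Feb 2025.
  • (2024) Evolving Paradigms in Automated Program Repair: Taxonomy, Challenges, and Opportunities. ACM Computing Surveys 57, 2, 1–43. DOI: 10.1145/3696450. Online publication date: 10 Oct 2024.
  • (2024) Exploring and Unleashing the Power of Large Language Models in Automated Code Translation. Proceedings of the ACM on Software Engineering 1, FSE, 1585–1608. DOI: 10.1145/3660778. Online publication date: 12 Jul 2024.
  • (2024) Towards More Precise Coincidental Correctness Detection With Deep Semantic Learning. IEEE Transactions on Software Engineering 50, 12, 3265–3289. DOI: 10.1109/TSE.2024.3481893. Online publication date: 1 Dec 2024.
  • (2024) Improving effort-aware defect prediction by directly learning to rank software modules. Information and Software Technology 165, C. DOI: 10.1016/j.infsof.2023.107250. Online publication date: 1 Jan 2024.
  • (2023) Revisiting ‘revisiting supervised methods for effort-aware cross-project defect prediction’. IET Software 17, 4, 472–495. DOI: 10.1049/sfw2.12133. Online publication date: 27 Jun 2023.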

Information

Published In

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 3
May 2023
937 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3594533
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 April 2023
Online AM: 04 September 2022
Accepted: 26 August 2022
Revised: 19 August 2022
Received: 05 December 2021
Published in TOSEM Volume 32, Issue 3

Author Tags

  1. Automated program repair
  2. bias study
  3. empirical evaluation

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Major Key Project of PCL

Article Metrics

  • Downloads (last 12 months): 144
  • Downloads (last 6 weeks): 10
Reflects downloads up to 13 Feb 2025
