Abstract
In this article, we present a systematic mapping study of replications in software engineering. The goal is to plot the landscape of current published replications of empirical studies in software engineering research. We applied the systematic review method to search and select published articles, and to extract and synthesize data from the selected articles that reported replications. Our search retrieved more than 16,000 articles, from which we selected 96 articles, reporting 133 replications performed between 1994 and 2010, of 72 original studies. Nearly 70 % of the replications were published after 2004 and 70 % of these studies were internal replications. The topics of software requirements, software construction, and software quality concentrated over 55 % of the replications, while software design, configuration management, and software tools and methods were the topics with the smallest number of replications. We conclude that the number of replications has grown in the last few years, but the absolute number of replications is still small, in particular considering the breadth of topics in software engineering. We still need incentives to perform external replications, better standards to report empirical studies and their replications, and collaborative research agendas that could speed up development and publication of replications.
Notes
http://dl.acm.org/citation.cfm?doid=1838687.1838698 (last visited April, 2012)
http://dl.acm.org/citation.cfm?doid=2088883.2088889 (last visited April, 2012)
http://jabref.sourceforge.net. JabRef is an open source bibliography reference manager. We used JabRef to record the data extracted from the articles, including the reference data and extracts of the text that we used to answer the research questions.
http://www.mendeley.com. We used Mendeley to share the consolidated references of the selected papers on the Web, so multiple researchers could access them.
We provide information about the inter-rater agreements in the specific sections below.
References
Abran A, Moore J, Bourque P, Dupuis R (Eds.) (2004) Guide to the software engineering body of knowledge, IEEE Computer Society. 204
Almqvist JPF (2006) Replication of controlled experiments in empirical software engineering —a survey. Master’s Thesis, Department of Computer Science, Faculty of Science, Lund University, Sweden. 129
Arksey H, O’Malley L (2005) Scoping studies: towards a methodological framework. Int J Soc Res Meth 8:19–32
Basili V et al (1999) Building knowledge through families of experiments. IEEE Trans Software Eng 25:456–473. doi:10.1109/32.799939
Brooks A et al. (1995) Replication of Experimental Results in Software Engineering. Technical Report, EFoCS-17-95 [RR/95/193], Dept. of Computer Science, Univ. of Strathclyde. 38
Brooks A et al. (2007) Replication’s role in software engineering. In F Shull, J Singer, and DIK Sjøberg (eds) Guide to Advanced Empirical Software Engineering. Springer, pp 365–379
Carver JC. (2010) Towards Reporting Guidelines for Experimental Replications: A Proposal. In RESER’2010: Proceedings of the 1st International Workshop on Replication in Empirical Software Engineering Research, Cape Town, South Africa. 4
Carver JC et al. (2003) Issues in using students in empirical studies in software engineering education. In Proceedings of the 9th International Software Metrics Symposium (METRICS 2003), pp 239–249
Ciolkowski M et al. (2004) Using academic courses for empirical validation of software development processes. In Proceedings of the 30th Euromicro Conference, pp 354–361
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46. doi:10.1177/001316446002000104
da Silva FQB et al (2011a) Six years of systematic literature reviews in software engineering: an updated tertiary study. Inform Software Tech 53(9):899–913. doi:10.1016/j.infsof.2011.04.004
da Silva FQB et al. (2011b) Replication of empirical studies in software engineering: Preliminary findings from a systematic mapping study. Proceedings of the 2nd International Workshop on Replication in Empirical Software Engineering Research RESER’2011, pp 61–70
Daly J, Brooks A, Miller J, Roper M, Wood M (1994) Verification of Results in Software Maintenance Through External Replication. IEEE International Conference on Software Maintenance, pp. 50–57
Davidsen MK, Krogstie J (2010) A longitudinal study of development and maintenance. Inform Software Tech 52(7):707–719
Dybå T, Dingsøyr T (2008) Empirical studies of agile software development: a systematic review. Inform Software Tech 50:833–859
Easterbrook SM et al. (2007) Selecting Empirical Methods for Software Engineering Research. In: F Shull, J Singer and D Sjøberg (eds) Guide to Advanced Empirical Software Engineering. Springer, pp 285–311
França ACC et al. (2010) The Effect of Reasoning Strategies on Success in Early Learning of Programming: Lessons Learned from an External Experiment Replication. In EASE’2010: 14th International Conference on Evaluation and Assessment in Software Engineering, Keele University, UK. 10
Gómez OS, Juristo N, Vegas S (2010a) Replication, Reproduction and Re-analysis: Three ways for verifying experimental findings. In RESER’2010: Proceedings of the 1st International Workshop on Replication in Empirical Software Engineering Research. Cape Town, South Africa. pp 42–44
Gómez OS, Juristo N, Vegas S (2010b) Replications Types in Experimental Disciplines. In ESEM’2010: Proceedings of the ACM/IEEE 4th International Symposium on Empirical Software Engineering and Measurement, September 16–17, Bolzano-Bozen, Italy. pp 1–10
Gould J, Kolb WL (eds) (1964) A dictionary of the social sciences. Tavistock Publications, London, 761
Holgeid KK, Krogstie J, Sjøberg DIK (2000) A study of development and maintenance in Norway: assessing the efficiency of information systems support using functional maintenance. Inform Software Tech 42:687–700
Juristo N, Vegas S (2009) Using differences among replications of software engineering experiments to gain knowledge. In ESEM’09: Proceedings of the ACM/IEEE 3rd International Symposium on Empirical Software Engineering and Measurement. IEEE Computer Society, Washington, DC, USA, pp 356–366
Kitchenham B (2008) The role of replications in empirical software engineering—a word of warning. Empir Software Eng 13:219–221
Kitchenham B, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering, Technical Report EBSE-2007-01, School of Computer Science and Mathematics, Keele University
Kitchenham BA, Pfleeger SL (2007) Personal Opinion Surveys. In: F Shull, J Singer, D Sjøberg (eds) Guide to Advanced Empirical Software Engineering. Springer, pp 63–92
Kitchenham B, Dybå T, Jørgensen M (2004) Evidence-based Software Engineering. In ICSE’2004: Proceedings of the 26th International Conference on Software Engineering, Washington DC, USA. pp 273–281
Kitchenham B et al (2010) Literature reviews in software engineering—a tertiary study. Inform Software Tech 52:792–805
Krein Jonathan L, Knutson Charles D (2010) A Case for Replication: Synthesizing Research Methodologies in Software Engineering. In RESER’2010: Proceedings of the 1st International Workshop on Replication in Empirical Software Engineering Research, Cape Town, South Africa. 10
Krogstie J, Sølvberg A (1994) Software Maintenance in Norway: a survey investigation. In ICSM’1994: Proceedings of the International Conference on Software Maintenance. pp 304–313
Krogstie J, Jahr A, Sjøberg DIK (2006) A longitudinal study of development and maintenance in Norway: report from the 2003 investigation. Inform Software Tech 48:993–1005
La Sorte MA (1972) Replication as a verification technique in survey research: a paradigm. Sociol Q 13(2):219–227
Lindsay RM, Ehrenberg A (1993) The design of replicated studies. Am Stat 47(3):217–228
Lung J et al. (2008) On the difficulty of replicating human subjects studies in software engineering. In ICSE’2008: Proceedings of the 30th International Conference on Software Engineering, New York, USA: ACM Press. pp 191–200
Petticrew M, Roberts H (2006) Systematic Reviews in the Social Sciences. Blackwell Publishing. 336
Popper K (1959) The Logic of Scientific Discovery. Hutchinson & Co. 513
Schmidt S (2009) Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Rev Gen Psychol 13:90–100. doi:10.1037/a0015108
Shull F, Basili V, Carver J, Maldonado JC, Travassos GH, Mendonça M, Fabbri S (2002) Replicating software engineering experiments: Addressing the tacit knowledge problem. In ISESE’2002: Proc. Int. Symp. on Empirical Softw. Eng., Washington, DC, USA, IEEE Computer Society. 10
Shull F, Carver J, Vegas S, Juristo N (2008) The Role of Replications in Empirical Software Engineering. Empir Software Eng 13:211–218
Sjøberg D (2010) Confronting the myth of rapid obsolescence in computing research. Commun ACM 53(9):62–67
Sjøberg D et al (2005) A survey of controlled experiments in software engineering. IEEE Trans Software Eng 31:733–753
Vegas S et al. (2006) Analysis of the Influence of Communication between Researchers on Experiment Replication. In ISESE’2006: Proceedings of the 5th International Symposium on Empirical Software Engineering. September 20–21, Rio de Janeiro, Brazil. pp 28–37
Yin RK (2009) Case study research: Design and methods, 4th edn. Sage Publications, London, 240
Zhang H, Babar MA, Tell P (2010) Identifying relevant studies in software engineering. Inform Software Tech 53(6):625–637, http://dx.doi.org/10.1016/j.infsof.2010.12.010
Acknowledgments
Fabio Q. B. da Silva holds a research grant from the Brazilian National Research Council (CNPq), process #314523/2009-0. This article was written while Prof. Fabio Silva was on sabbatical leave at the University of Toronto, supported by a CAPES research grant, process #6441/10-6. A. César C. França is a doctoral student at the Center of Informatics of the Federal University of Pernambuco, where he receives a scholarship from the Brazilian National Research Council (CNPq), process #141156/2010-4. We would like to thank Prof. Steve Easterbrook, Jonathan Lung, and Elizabeth Patitsas for many discussions, comments, and criticisms that led to important improvements in the content and structure of this article. We also thank Prof. André Santos, Rodrigo Lopes, João Paulo Oliveira, and Leonardo Oliveira for their participation in the earlier version of this study published at RESER’2011. Finally, we are grateful for the partial support of the Samsung Institute for Development of Informatics (Samsung/SIDI) for this research.
Additional information
Editor: Natalia Juristo
Preliminary and partial results of this mapping study were published and presented at the 2nd International Workshop on Replication in Empirical Software Engineering Research (RESER’2011).
Appendices
Appendix A—Selected Primary Studies
In this Appendix, we describe the selected papers that report replications and primary studies, which form the dataset of our mapping study. In Section A.1, we describe the two types of papers reporting replications: Original-Included reports and Replication-Only reports. In Section A.2, we present the complete list of references of these papers. In Section A.3, we present the complete reference list of the papers reporting solely original studies.
1.1 A.1 Selected Papers Reporting Replications
In this section we present a summary of the papers reporting replications in Section A.1.1. We then provide details about the papers that compose the sets of replications in Section A.1.2.
1.1.1 A.1.1 Descriptive Information of Papers Reporting Replications
In Table 14, we present an overview of the 96 papers reporting the 133 replications. The table is ordered by the year of publication of the paper reporting the replication. The quality score is discussed in detail in Appendix B. The column Original Ref. presents the reference to the paper reporting the original study. When this column is empty, it indicates that the paper reporting the replication also reported the original study, i.e., it is an Original-Included report of internal replications.
1.1.2 A.1.2 Composition of the Sets of Replications
In Table 15, we present the sets with two or more replications and their original studies, grouped by SWEBOK chapter, which enables the reader to identify the members of each set and their corresponding original studies. Table 15 also shows the dates of publication of the original study and the replications, the type of each replication in the set, and the number of replications reported in each paper.
1.2 A.2 Reference List of Replications
[REP001] English M, Buckley J, Cahill T (2010) A replicated and refined empirical study of the use of friends in C ++ software. The Journal of Systems & Software 83(11):2275–2286. doi:10.1016/j.jss.2010.07.013
[REP003] Abrahão S, Poels G (2009) A family of experiments to evaluate a functional size measurement procedure for Web applications. The Journal of Systems & Software 82(2):253–269. doi:10.1016/j.jss.2008.06.031
[REP005] Zhang H (2009) An Investigation of the Relationships between Lines of Code and Defects. 2009 IEEE International Conference on Software Maintenance 274–283.
[REP006] Huynh T, Miller J (2009) Another viewpoint on “evaluating web software reliability based on workload and failure data extracted from server logs.” Empirical Software Engineering 14(4):371–396. doi:10.1007/s10664-008-9084-6
[REP007] Reynoso L, Manso E, Genero M, Piattini M (2010) Assessing the influence of import-coupling on OCL expression maintainability: A cognitive theory-based perspective. Information Sciences 180(20):3837–3862. doi:10.1016/j.ins.2010.06.028
[REP009] Dias-Neto AC, Travassos GH (2009) Evaluation of Model-Based Testing Techniques Selection Approaches: an External Replication. 3rd International Symposium on Empirical Software Engineering and Measurement 269–278.
[REP010] Geet JV, Demeyer S (2009) Feature Location in COBOL Mainframe Systems: an Experience Report. IEEE International Conference on Software Maintenance 361–370.
[REP011] Abrahão S, Insfran E, Gravino C, Scanniello G (2009) On the Effectiveness of Dynamic Modeling in UML: Results from an External Replication. 3rd International Symposium on Empirical Software Engineering and Measurement 468–472.
[REP012] Ricca F, Scanniello G, Torchiano M, Reggio G, Astesiano E (2010) On the Effectiveness of Screen Mockups in Requirements Engineering: Results from an Internal Replication. Empirical Software Engineering and Measurement.
[REP014] Ceccato M, Penta MD, Nagra J, Falcarin P, Ricca F, Torchiano M, Tonella P (2009) The Effectiveness of Source Code Obfuscation: an Experimental Assessment. 17th International Conference 178–187.
[REP015] Do H, Mirarab S, Tahvildari L, Rothermel G (2010) The Effects of Time Constraints on Test Case Prioritization: A Series of Controlled Experiments. IEEE Transactions on Software Engineering 36(5):593–617.
[REP016] Cruz-Lemus JA, Maes A, Genero M, Poels G, Piattini M (2010) The impact of structural complexity on the understandability of UML statechart diagrams. Information Sciences 180(11):2209–2220. doi:10.1016/j.ins.2010.01.026
[REP019] Du G, McElroy J, Ruhe G (2006) A Family of Empirical Studies to Compare Informal and Optimization-based Planning of Software Releases. International Symposium on Empirical Software Engineering 212–221.
[REP020] Andersson C (2007) A replicated empirical study of a selection method for software reliability growth models. Empirical Software Engineering 12(2):161–182. doi:10.1007/s10664-006-9018-0
[REP021] Andersson C, Runeson P, Member S (2007) A Replicated Quantitative Analysis of Fault Distributions in Complex Software Systems. IEEE Transactions on Software Engineering 33(5):273–286.
[REP023] Falessi D, Capilla R, Cantone G (2008) A Value-Based Approach for Documenting Design Decisions Rationale: A Replicated Experiment. International Conference on Software Engineering 63–70.
[REP024] Ardimento P, Baldassarre MT, Caivano D, Visaggio G (2006) Assessing multiview framework (MF) comprehensibility and efficiency: A replicated experiment. Information and Software Technology 48(5):313–322. doi:10.1016/j.infsof.2005.09.010
[REP025] Lokan C, Mendes E (2006) Cross-company and Single-company Effort Models Using the ISBSG Database: a Further Replicated Study. International Symposium on Empirical Software Engineering 75–84.
[REP026] Staron M, Kuzniarz L, Wohlin C (2006) Empirical assessment of using stereotypes to improve comprehension of UML models: A set of experiments. Journal of Systems and Software 79(5):727–742. doi:10.1016/j.jss.2005.09.014
[REP027] Canfora G, Cimitile A, Garcia F, Piattini M, Visaggio CA (2007) Evaluating performances of pair designing in industry. Journal of Systems and Software 80(8):1317–1327. doi:10.1016/j.jss.2006.11.004
[REP028] Mendes E, Lokan C (2008) Replicating studies on cross- vs single-company effort models using the ISBSG Database. Empirical Software Engineering 13(1):3–37. doi:10.1007/s10664-007-9045-5
[REP029] Ricca F, Penta MD, Torchiano M, Tonella P, Ceccato M (2007) The Role of Experience and Ability in Comprehension Tasks supported by UML Stereotypes. 29th International Conference on Software Engineering 375–384.
[REP030] Baresi L, Morasca S (2007) Three Empirical Studies on Estimating the Design Effort of Web Applications. ACM Transactions on Software Engineering and Methodology 16(4):15. doi:10.1145/1276933.1276936
[REP031] Ricca F, Torchiano M, Penta MD, Ceccato M, Tonella P (2009) Using acceptance tests as a support for clarifying requirements: A series of experiments. Information and Software Technology 51(2):270–283. doi:10.1016/j.infsof.2008.01.007
[REP032] Vokác M, Tichy W, Sjøberg DIK, Arisholm E, Aldrin M (2004) A Controlled Experiment Comparing the Maintainability of Programs Designed with and without Design Patterns—A Replication in a Real Programming Environment. Empirical Software Engineering 9(3):149–195.
[REP033] Briand LC, Bunse C, Daly JW (2001) A Controlled Experiment for Evaluating Quality Guidelines on the Maintainability of Object-Oriented Designs. IEEE Transactions on Software Engineering 27(6):513–530.
[REP034] Prechelt L, Unger B, Philippsen M, Tichy W (2001) A Controlled Experiment on Inheritance Depth as a Cost Factor for Code Maintenance. Journal of Systems and Software 65(2):115–132.
[REP035] Canfora G, García F, Piattini M, Ruiz F, Visaggio CA (2005) A family of experiments to validate metrics for software process models. Journal of Systems and Software 77(2):113–129. doi:10.1016/j.jss.2004.11.007
[REP036] Mendes E, Mosley N, Counsell S (2003) A Replicated Assessment of the Use of Adaptation Rules to Improve Web Cost Estimation. International Symposium on Empirical Software Engineering 100–109.
[REP037] Mendes E, Lokan C, Harrison R, Triggs C (2005) A Replicated Comparison of Cross-company and Within-company Effort Estimation Models using the ISBSG Database. 11th IEEE International Software Metrics Symposium 331–340.
[REP038] Thelin T, Andersson C, Runeson P, Dzamashvili-fogelström N (2004) A Replicated Experiment of Usage-Based and Checklist-Based Reading. 10th International Symposium on Software Metrics.
[REP039] Shepperd M, Cartwright M (2005) A Replication of the Use of Regression Towards the Mean (R2M) as an Adjustment to Effort Estimation Models. 11th IEEE International Software Metrics Symposium
[REP040] Herbsleb JD, Mockus A (2003) An Empirical Study of Speed and Communication in Globally Distributed Software Development. IEEE Transactions on Software Engineering 29(6):481–494.
[REP041] Lanubile F, Mallardo T, Calefato F, Denger C, Ciolkowski M (2004) Assessing the Impact of Active Guidance for Defect Detection: A Replicated Experiment. 10th International Symposium on Software Metrics 269–278.
[REP043] Thelin T, Petersson H, Runeson P (2002) Confidence intervals for capture-recapture estimations in software inspections. Information and Software Technology 44:683–702.
[REP045] Myrtveit I, Stensrud E (1999) A Controlled Experiment to Assess the Benefits of Estimating with Analogy and Regression Models. IEEE Transactions on Software Engineering 25(4):510–525.
[REP046] Briand LC, Langley T, Wieczorek I (2000) A replicated Assessment and Comparison of Common Software Cost Modeling Techniques. International Conference on Software Engineering (2):377–386.
[REP047] Fusaro P, Lanubile F, Visaggio G (1997) A Replicated Experiment to Assess Requirements Inspection Techniques. Empirical Software Engineering 2:39–57.
[REP048] Roper M, Wood M, Miller J (1997) An empirical evaluation of defect detection techniques. Information and Software Technology 39:763–775.
[REP049] Miller J, Macdonald F (2000) An empirical incremental approach to tool evaluation and improvement. Journal of Systems and Software 51(1):19–35.
[REP050] Sandahl K, Blomkvist O, Karlsson J, Krysander C, Lindvall M, Ohlsson N (1998) An Extended Replication of an Experiment for Assessing Methods for Software Requirements Inspections. Empirical Software Engineering 3:327–354.
[REP051] Laitenberger O, Emam KE, Harbich TG (2001) An Internally Replicated Quasi-Experimental Comparison of Checklist and Perspective-Based Reading of Code Documents. IEEE Transactions on Software Engineering 27(5):387–421.
[REP052] Visaggio G (1999) Assessing the maintenance process through replicated, controlled experiments. Journal of Systems and Software 44(3):187–197.
[REP053] Porter A, Votta LG, Basili V (1995) Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment. IEEE Transactions on Software Engineering 21(6):563–575.
[REP055] Briand LC, Wüst J, Ikonomovski SV, Lounis H (1999) Investigating Quality Factors in Object-Oriented Designs: an Industrial Case Study. International Conference on Software Engineering 345–354.
[REP058] Cartwright M, Shepperd M (1998) An Empirical View of Inheritance. Information and Software Technology 40(14):795–799.
[REP060] Cox K, Phalp K (2000) Replicating the CREWS Use Case Authoring Guidelines Experiment. Empirical Software Engineering 5(3):245–267.
[REP061] Lindvall M, Rus I, Donzelli P, Memon A, Zelkowitz MV, Betin-Can A, Bultan T, et al. (2007) Experimenting with software testbeds for evaluating new technologies. Empirical Software Engineering 12(4):417–444. doi:10.1007/s10664-006-9034-0
[REP065] Shah HB, Görg C, Harrold MJ (2010) Understanding Exception Handling: Viewpoints of Novices and Experts. IEEE Transactions on Software Engineering 36(2):150–161.
[REP066] Lui KM, Chan KCC, Nosek JT (2008) The Effect of Pairs in Program Design Tasks. IEEE Transactions on Software Engineering 34(2):197–211.
[REP068] Anda B, Sjøberg DIK (2005) Investigating the Role of Use Cases in the Construction of Class Diagrams. Empirical Software Engineering 10(3):285–309.
[REP071] Hochstein L, Carver J, Shull F, Asgari S, Basili V, Hollingsworth JK, Zelkowitz MV (2005) Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers. ACM/IEEE Supercomputing Conference 35–43.
[REP072] Lucia AD, Gravino C, Oliveto R, Tortora G (2010) An experimental comparison of ER and UML class diagrams for data modelling. Empirical Software Engineering 15(5):455–492. doi:10.1007/s10664-009-9127-7
[REP073] Daly J, Brooks A, Miller J, Roper M, Wood M (1994) Verification of Results in Software Maintenance Through External Replication. IEEE International Conference on Software Maintenance 50–57.
[REP076] Dinh-Trong TT, Bieman JM (2005) The FreeBSD Project: A Replication Case Study of Open Source Development. IEEE Transactions on Software Engineering 31(6):481–494.
[REP082] Jørgensen M, Teigen KH, Moløkken K (2004) Better Sure Than Safe? Overconfidence in Judgment Based Software Development Effort Prediction Intervals. Journal of Systems and Software 70(1):79–93.
[REP083] Agarwal R, De P, Sinha AP (1999) Comprehending Object and Process Models: An Empirical Study. IEEE Transactions on Software Engineering 25(4):541–556.
[REP085] Kamsties E, Lott CM (1995) An Empirical Evaluation of Three Defect-Detection Techniques. Information and Software Technology 39(11):763–775.
[REP086] Kiper J, Auerheimer B (1997) Visual Depiction of Decision Statements: What is Best For Programmers and Non-programmers.
[REP087] Land LPW, Jeffery R, Sauer C (1997) Validating the Defect Detection Performance Advantage of Group Designs for Software Reviews: Report of a Replicated Experiment. Engineering 17–26.
[REP088] Miller J, Wood M, Roper M (1998) Further Experiences with Scenarios and Checklists. Empirical Software Engineering 3:37–64.
[REP089] Zweben SH, Edwards SH, Weide BW, Hollingsworth JE (1995) The Effects of Layering and Encapsulation on Software Development Cost and Quality. IEEE Transactions on Software Engineering 21(3):200–208.
[REP090] Harrison R, Counsell S, Nithi R (2000) Experimental assessment of the effect of inheritance on the maintainability of object-oriented systems. Journal of Systems and Software 52:173–179.
[REP091] Prechelt L, Unger-lamprecht B, Philippsen M, Tichy WF (2002) Two Controlled Experiments Assessing the Usefulness of Design Pattern Documentation in Program Maintenance. IEEE Transactions on Software Engineering 28(6):595–606.
[REP092] Regnell B, Runeson P, Thelin T (2001) Are the Perspectives Really Different?—Further Experimentation on Scenario-Based Reading of Requirements. Empirical Software Engineering 5(1):331–356.
[REP093] Arisholm E, Sjøberg DIK (2004) Evaluating the Effect of a Delegated versus Centralized Control Style on the Maintainability of Object-Oriented Software. IEEE Transactions on Software Engineering 30(8):521–534.
[REP094] Ricca F, Penta MD, Torchiano M, Tonella P, Ceccato M (2010) How Developers’ Experience and Ability Influence Web Application Comprehension Tasks Supported by UML Stereotypes: A Series of Four Experiments. IEEE Transactions on Software Engineering 36(1):96–118.
[REP095] Jørgensen M (2010) Identification of more risks can lead to increased over-optimism of and over-confidence in software development effort estimates. Information and Software Technology 52(5):506–516. doi:10.1016/j.infsof.2009.12.002
[REP098] Maldonado JC, Carver J, Shull F, Fabbri S, Dória E, Martiniano L, Mendonça M, et al. (2006) Perspective-Based Reading: A Replicated Experiment Focused on Individual Reviewer Effectiveness. Empirical Software Engineering 11(1):119–142. doi:10.1007/s10664-006-5967-6
[REP101] Verelst J (2005) The Influence of the Level of Abstraction on the Evolvability of Conceptual Models of Information Systems. Empirical Software Engineering 10(4):467–494.
[REP102] Koru AG, Emam KE, Zhang D, Liu H, Mathew D (2008) Theory of relative defect proneness - Replicated studies on the functional form of the size-defect relationship. Empirical Software Engineering 13:473–498. doi:10.1007/s10664-008-9080-x
[REP103] Müller MM (2005) Two controlled experiments concerning the comparison of pair programming to peer review. Journal of Systems and Software 78:166–179. doi:10.1016/j.jss.2004.12.019
[REP104] Calefato F, Gendarmi D, Lanubile F (2010) Investigating the use of tags in collaborative development environments: a replicated study. International Symposium on Empirical Software Engineering and Measurement 24:1–24:9.
[REP105] Mendes E, Lokan C (2009) Investigating the Use of Chronological Splitting to Compare Software Cross-company and Single-company Effort Predictions: A Replicated Study. 32nd Australian Conference on Computer Science.
[REP106] Wesslén A (2000) A Replicated Empirical Study of the Impact of the Methods in the PSP on Individual Engineers. Empirical Software Engineering 5:93–123.
[REP107] Phongpaibul M, Boehm B (2007) A Replicate Empirical Comparison between Pair Development and Software Development with Inspection. 1st International Symposium on Empirical Software Engineering and Measurement 265–274. doi:10.1109/ESEM.2007.33
[REP111] Lucia AD, Oliveto R, Tortora G (2009) Assessing IR-based traceability recovery tools through controlled experiments. Empirical Software Engineering 14:57–92. doi:10.1007/s10664-008-9090-8
[REP112] Porter A, Votta LG (1998) Comparing Detection Methods For Software Requirements Inspections: A Replication Using Professional Subjects. Empirical Software Engineering 3:355–379.
[REP113] Ciolkowski M, Differding C, Laitenberger O, Münch J (1997) Empirical Investigation of Perspective-based Reading: A Replicated Experiment.
[REP118] Lung J, Aranda J, Easterbrook S, Wilson G (2008) On the Difficulty of Replicating Human Subjects Studies in Software Engineering. International Conference on Software Engineering 191–200.
[REP119] Caspersen ME, Bennedsen J, Larsen KD (2007) Mental Models and Programming Aptitude. ACM SIGCSE Bulletin 39(3):206–210.
[REP120] França ACC, da Cunha PRM, Da Silva FQB (2010) The Effect of Reasoning Strategies on Success in Early Learning of Programming: Lessons Learned from an External Experiment Replication. 14th International Conference on Evaluation and Assessment in Software Engineering 1–10.
[REP121] Ma X, Zhou M, Mei H (2010) How Developers Participate in Open Source Projects: a Replicate Case Study on JBossAS, JOnAS and Apache Geronimo. Workshop on Replication in Empirical Software Engineering Research.
[REP122] Genero M, Piattini M, Jiménez L (2001) Empirical validation of class diagram complexity metrics. 21st International Conference of the Chilean Computer Science Society 95–104. doi:10.1109/SCCC.2001.972637
[REP123] Genero M, Jiménez L, Piattini M (2002) A Controlled Experiment for Validating Class Diagram Structural Complexity Metrics. OOIS’02 Proceedings of the 8th International Conference on Object-Oriented 372–383.
[REP124] Genero M, Piattini M, Manso E, Cantone G (2003) Building UML Class Diagram Maintainability Prediction Models Based on Early Metrics. 9th International Software Metrics Symposium 263–275.
[REP125] Genero M, Piattini M, Manso E (2004) Finding “Early” Indicators of UML Class Diagrams Understandability and Modifiability. 2004 International Symposium on Empirical Software Engineering 207–216.
[REP126] Genero M, Manso E, Visaggio A, Canfora G, Piattini M (2007) Building measure-based prediction models for UML class diagram maintainability. Empirical Software Engineering 12(5):517–549. doi:10.1007/s10664-007-9038-4
[REP127] Bornat R, Dehnadi S, Simon (2008) Mental models, Consistency and Programming Aptitude. ACE’08 Proceedings of the tenth conference on Australasian computing education.
[REP128] Wray S (2007) SQ Minus EQ can Predict Programming Aptitude. Psychology of Programming Interest Group 19th Annual Workshop 243–254.
[REP129] Cruz-Lemus JA, Genero M, Manso ME, Piattini M (2005) Evaluating the Effect of Composite States on the Understandability of UML Statechart Diagrams. Lecture Notes in Computer Science 3713:113–125.
[REP130] Cruz-Lemus JA, Genero M, Piattini M, Morasca S (2006) Improving the Experimentation for Evaluating the Effect of Composite States on the Understandability of UML Statechart Diagrams. 5th ACM-IEEE International Symposium on Empirical Software Engineering (ISESE 2006) 9–11.
[REP131] Cruz-Lemus JA, Genero M, Manso ME, Morasca S, Piattini M (2009) Assessing the understandability of UML statechart diagrams with composite states—A family of empirical studies. Empirical Software Engineering 14(6):685–719. doi:10.1007/s10664-009-9106-z
[REP132] Mockus A, Fielding RT, Herbsleb JD (2002) Two Case Studies of Open Source Software Development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology 11(3):309–346.
[REP133] Phongpaibul M, Boehm B (2006) An Empirical Comparison Between Pair Development and Software Inspection in Thailand. International Symposium on Empirical Software Engineering 85–94.
[REP134] Dehnadi S, Bornat R (2006) The camel has two humps (working title). Little Psychology of Programming Interest Group 2(23):1–21.
1.3 A.3 Reference List of Original Studies
[ORI001] Counsell S (2000) Use of friends in C++ software: an empirical investigation. Journal of Systems and Software 53(1):15–21. doi:10.1016/S0164-1212(00)00004-2
[ORI006] Tian J, Rudraraju S, Li Z (2004) Evaluating Web software reliability based on workload and failure data extracted from server logs. IEEE Transactions on Software Engineering 30(11):754–769. doi:10.1109/TSE.2004.87
[ORI009] Vegas S, Basili V (2005) A Characterisation Schema for Software Testing Techniques. Empirical Software Engineering 10(4):437–466. doi:10.1007/s10664-005-3862-1
[ORI010] Eisenbarth T, Koschke R, Simon D (2003) Locating Features in Source Code. IEEE Transactions on Software Engineering 29(3)
[ORI011] Gravino C, Scanniello G, Tortora G (2008) An Empirical Investigation on Dynamic Modeling in Requirements Engineering. MoDELS 615–629.
[ORI012] Ricca F, Scanniello G, Torchiano M, Reggio G, Astesiano, E (2010) Can screen mockups improve the comprehension of functional requirements? http://www.scienzemfn.unisa.it/scanniello/ScreenMockupExp/material/main.pdf. Accessed 13 October 2011
[ORI020] Stringfellow C, Andrews AA (2002) An Empirical Method for Selecting Software Reliability Growth Models. Empirical Software Engineering 7:319–343.
[ORI021] Fenton NE, Ohlsson N (2000) Quantitative Analysis of Faults and Failures in a Complex Software System. IEEE Transactions on Software Engineering 26(8):797–814.
[ORI023] Falessi D, Cantone G, Kruchten P (2008) Value-Based Design Decision Rationale Documentation: Principles and Empirical Feasibility Study. 7th Working IEEE/IFIP Conference on Software Architecture 189–198. doi:10.1109/WICSA.2008.8
[ORI024] Baldassarre MT, Caivano D, Visaggio G (2003) Comprehensibility and efficiency of multiview framework for measurement plan design. International Symposium on Empirical Software Engineering. 89–98. doi:10.1109/ISESE.2003.1237968
[ORI025] Jeffery R, Ruhe M, Wieczorek I (2001) Using Public Domain Metrics to Estimate Software Development Effort. 7th International Software Metrics Symposium 16–27.
[ORI027] Canfora G, Cimitile A, Garcia F, Piattini M, Visaggio CA (2006) Performances of pair designing on software evolution: a controlled experiment. Conference on Software Maintenance and Reengineering 197–205. doi:10.1109/CSMR.2006.40
[ORI032] Prechelt L, Unger B, Tichy WF, Brössler P, Votta LG (2001) A Controlled Experiment in Maintenance Comparing Design Patterns to Simpler Solutions. IEEE Transactions on Software Engineering 27(12):1134–1144.
[ORI033] Briand LC, Bunse C, Daly JW, Differding C (1996) An experimental comparison of the maintainability of object-oriented and structured design documents. International Conference on Software Maintenance 130–138. doi:10.1109/ICSM.1997.624239
[ORI034] Daly J, Brooks A, Miller J, Roper M, Wood M (1996) Evaluating Inheritance Depth on the Maintainability of Object-Oriented Software. Empirical Software Engineering 1(2):109–132.
[ORI036] Mendes E, Mosley N, Counsell S (2003) Do Adaptation Rules Improve Web Cost Estimation? 14th ACM Conference on Hypertext and Hypermedia 174–183.
[ORI038] Thelin T, Runeson P, Wohlin C (2003) An Experimental Comparison of Usage-Based and Checklist-Based Reading. IEEE Transactions on Software Engineering 29(8):687–704.
[ORI039] Jørgensen M, Indahl U, Sjøberg D (2003) Effort Estimation: Software Effort Estimation by Analogy and “Regression Toward the Mean.” Journal of Systems and Software 68:253–262.
[ORI041] Denger C, Ciolkowski M, Lanubile F (2003) Does Active Guidance Improve Software Inspections? A Preliminary Empirical Study. IEEE Transactions on Software Engineering 29(6):408–413.
[ORI043] Miller J (1999) Estimating the number of remaining defects after inspection. Software Testing, Verification and Reliability, 9(3): 167–189.
[ORI045] Shepperd M, Schofield C, Kitchenham B (1996) Effort Estimation Using Analogy. International Conference on Software Engineering 170–178.
[ORI046] Briand LC, Emam KE, Surmann D, Wieczorek I, Maxwell KD (1999) An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques. 21st International Conference on Software Engineering 313–322.
[ORI055] Briand LC, Daly JW, Porter V, Wüst J (1998) A Comprehensive Empirical Validation of Product Measures for Object-Oriented Systems. 5th International Software Metrics Symposium 246–257.
[ORI060] Achour CB, Rolland C, Maiden NAM, Souveyet C (1998) Guiding Use Case Authoring: Results of an Empirical Study. 4th IEEE International Symposium on Requirements Engineering.
[ORI085] Basili VR, Selby RW (1987) Comparing the Effectiveness of Software Testing Strategies. IEEE Transactions on Software Engineering SE-13(12):1278–1296.
[ORI087] Land LPW, Sauer C, Jeffery R (1997) Validating the Defect Detection Performance Advantage of Group Designs for Software Reviews: Report of a Laboratory Experiment Using Program Code. European Software Engineering Conference 22–25.
[ORI092] Basili VR, Green S, Laitenberger O, Lanubile F, Shull F, Sørumgård S, Zelkowitz MV (1996) The Empirical Investigation of Perspective-Based Reading. Empirical Software Engineering 1(2):133–164.
[ORI093] Arisholm E, Sjøberg DIK, Jørgensen M (2001) Assessing the Changeability of two Object-Oriented Design Alternatives - a Controlled Experiment. Empirical Software Engineering 6(3):231–277.
[ORI104] Treude C, Storey M-A (2009) How Tagging helps bridge the Gap between Social and Technical Aspects in Software Development. International Conference on Software Engineering 12–22.
[ORI105] Lokan C, Mendes E (2008) Investigating the Use of Chronological Splitting to Compare Software Cross-company and Single-company Effort Predictions. 12th International Conference on Evaluation and Assessment in Software Engineering.
[ORI106] Hayes W, Over JW (1997) The Personal Software Process (PSP): An Empirical Study of the Impact of PSP on Individual Engineers.
[ORI111] Lucia AD, Fasano F, Oliveto R (2007) Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods. ACM Transactions on Software Engineering and Methodology 16(4):13. doi:10.1145/1276933.1276934
[ORI121] Mockus A, Fielding RT, Herbsleb J (2000) A Case Study of Open Source Software Development: The Apache Server. International Conference on Software Engineering 263–272.
[ORI122] Genero M, Jiménez L, Piattini M (2001) A Prediction Model for OO Information System Quality Based on Early Indicators. Advances in Databases and Information Systems.
Appendix B—Quality Assessment
In this mapping study, we are interested in evaluating the quality of the empirical studies in general and the quality aspects specific to replication reports. We did not use the quality assessment to exclude reports; we used it only to support comparative analyses of the studies with respect to the quality of the information reported in the papers.
2.1 B.1 Quality Assessment Criteria
Kitchenham and Charters (2007, p. 25) and Dybå and Dingsøyr (2008) informed our choice of quality assessment criteria to assess issues related to empirical studies in general. We added replication-specific criteria extracted from the propositions presented by Carver (2010). Table 16 shows the complete set of criteria, which includes the nine replication-specific criteria marked (RS) and the seven generic criteria marked (GC).
It is important to note that we assessed the quality of the papers reporting replications, not the quality of each individual replication reported in the papers. When a paper reported more than one replication, we assessed the quality of the entire study, not each replication separately. Although this might be a limitation, the lack of detailed information about each individual replication within the papers made individual quality assessment difficult or even impossible.
In the initial definition of the review protocol, we built one set of quality criteria to be used with all papers reporting replications. As explained above, these criteria included items to assess the replication specific aspects, including the quality of the description of the original study, which was advocated to be an essential part of replication reports by Carver (2010). During the quality assessment process, we noticed that several papers reporting internal replications presented the replication (or a set of replications) and the original study in the same paper. For most of these papers, we could not find a clear-cut way to separate the description of the original study from the description of the replications and therefore could not evaluate most of the (RS) criteria.
At this point in the analysis, we found it necessary to separate the papers that reported one or more replications together with an original study (called Original-Included reports) from the papers that reported the original study separately (Replication-Only reports). Replication-Only reports (of both internal and external replications) were assessed using the criteria presented in Table 16 (the initial set of criteria). Original-Included reports of internal replications were assessed using the criteria in Table 17 (a subset of the initial set with seven (RS) criteria removed). We updated the mapping study protocol to reflect this deviation from the initial plan and discuss the implications of having two sets of criteria in Section B.2.
Two researchers assessed each paper by assigning a score to each quality item on a three-point scale. A third researcher resolved the disagreements. When the third assessment did not resolve a disagreement, the final quality score was decided in a consensus meeting with the rest of the research team. We used Cohen’s Kappa (κ) coefficient (Cohen 1960) to measure the agreement level between assessments before disagreements were resolved.
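For readers unfamiliar with the coefficient, Cohen’s kappa contrasts the observed agreement between two raters with the agreement expected by chance given each rater’s marginal score distribution. The short Python sketch below illustrates the computation only; it is not the script used in this study, and the rater scores are hypothetical.

from collections import Counter

def cohen_kappa(rater_a, rater_b):
    # kappa = (p_o - p_e) / (1 - p_e)
    n = len(rater_a)
    # Observed agreement: proportion of items scored identically by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal distribution over the scale values.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical scores (0/1/2 on the three-point scale) given by two raters to one quality item across ten papers.
rater_1 = [2, 1, 2, 0, 1, 2, 2, 1, 0, 2]
rater_2 = [2, 1, 1, 0, 1, 2, 2, 2, 0, 2]
print(round(cohen_kappa(rater_1, rater_2), 2))  # prints 0.68 for this toy data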
2.2 B.2 Quality Assessment Results
In the quality assessment process, we achieved an acceptable inter-rater agreement (κ = 0.69) between two researchers, before a third researcher or a consensus meeting resolved the disagreements. In Table 18, we present references to the replication papers, grouped by quartiles of the quality score and by type of report (Original-Included Internal, Replication-Only Internal, and External). The scores are presented as a percentage of the maximum score, making it possible to compare the three sets.
From Table 18, we can see that papers with Original-Included internal replications seem to score higher than Replication-Only reports of internal and external replications. Furthermore, external replications seem to score lower than internal ones. These tendencies are better visualized in Fig. 10.
We calculated the average of the quality scores for each replication type (Table 19). Since the scores were obtained using two different sets of criteria, we did not calculate the average for the entire set of replications nor averaged the quality of Original-Included and Replication-Only reports together.
As visualized in Table 18 and Fig. 10, the entire set of Replication-Only reports (internal and external) scored significantly lower than the Original-Included reports of internal replications (t = −4.769, df = 94, p < 0.001). Similarly, the Replication-Only internal and Replication-Only external sub-sets each scored significantly lower than the Original-Included internal sub-set (t = −3.629, df = 63, p < 0.001 and t = −4.246, df = 65, p < 0.001, respectively). As Fig. 10 suggests, no significant difference in mean quality score was found between Replication-Only internal and external replications.
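For illustration, an independent-samples t-test of this kind can be computed as in the sketch below (assuming SciPy is available); the quality scores listed are invented for the example and are not the study’s data.

from scipy import stats

# Hypothetical quality scores (percentage of the maximum) for two groups of papers.
original_included = [92, 88, 85, 90, 87, 91, 86, 89]
replication_only = [70, 75, 68, 80, 72, 78, 74, 66]

# Pooled-variance two-sample t-test; df = n1 + n2 - 2 under equal-variance assumption.
t, p = stats.ttest_ind(original_included, replication_only)
print(f"t = {t:.3f}, df = {len(original_included) + len(replication_only) - 2}, p = {p:.2g}")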
These differences can be explained by looking at the individual quality assessment criteria. Replication-Only external and internal replications were assessed against the replication-specific (RS) criteria, which include criteria that evaluate the level of information given about the original study, and they scored consistently lower on most of these criteria than on the remaining criteria, as can be seen in the scores highlighted in boldface in Table 20. As explained above, most of the (RS) criteria were not used to assess Original-Included internal replications, due to a lack of consistent information in the papers. When we used the same sub-set of criteria that was used to evaluate the Original-Included reports (i.e., removing the criteria highlighted in boldface in Table 20), the average quality score of external replications increased to 87 %, and the average for Replication-Only reports (internal and external together) increased to 89 %, both very close to the 88 % of the set of Original-Included internal replications.
These results indicate that the overall quality of the papers is good if we consider only the generic quality assessment criteria used to evaluate empirical studies in general (the GC criteria discussed in Section B.1). However, if we add the replication-specific (RS) quality criteria proposed by Carver (2010), the quality of the Replication-Only reports decreases.
One could argue that our choice of criteria biased the results, since we picked the criteria on which the replication reports (in particular the Replication-Only reports) scored consistently low. However, it is important to remember that the choice of criteria was made during protocol development, before we selected the papers, and was based on accepted guidelines. Although it is true that the two sets of replications scored similarly on the (GC) criteria, the differences in the scores on the replication-specific criteria were relevant for this study, because we were interested in the evaluation of replication reports. Furthermore, Original-Included and Replication-Only reports of replications were also distinct with respect to other factors presented in the results section. These distinctions must be carefully assessed, because they may be indicative of limitations and threats to validity of replication studies. In particular, we argue that these distinctions are indicative of publication bias, as discussed in the main part of our article.
We also compared the quality scores of papers published in journals with the scores of papers published in other sources (conference proceedings, etc.). Journal papers scored significantly higher than non-journal papers (t = 3.269, df = 131, p = 0.001). Furthermore, we compared the quality of Original-Included reports with the other types of reports in journal and in non-journal sources. In both cases, Original-Included reports scored significantly higher than the other types of reports, with (t = 3.703, df = 70, p = 0.001) for the difference in the journal papers and (t = 4.732, df = 59, p < 0.001) for the difference in the non-journal papers. The scores of the Original-Included reports in journal and non-journal sources are not significantly different. Replication-Only reports also show no significant difference in scores between journal and non-journal papers. Table 21 shows the mean score and standard deviation for each subset of report type.
In general, replication reports (specifically the Replication-Only reports) lack descriptive information detailing the original studies and other replication specific information. We can think of three reasons to explain these poor descriptions:
First, a complete description of the original study requires space and may have been left out of publications with constraints on page count (typical of conference proceedings, but also found in some journals or special issues). Second, the researchers may have been unaware that such descriptive information should be included in their papers. Third, detailed information about the original study may not have been available, so the researchers reported what they could find. Replication reporting guidelines would address the second reason. A persistent repository to store data about experiments and replications could help with the first and third reasons. If the third reason applies, it would also help to explain our findings in RQ6 related to confirmation of results.
Cite this article
da Silva, F.Q.B., Suassuna, M., França, A.C.C. et al. Replication of empirical studies in software engineering research: a systematic mapping study. Empir Software Eng 19, 501–557 (2014). https://doi.org/10.1007/s10664-012-9227-7