Abstract
While the amount of research in software engineering is ever increasing, selecting a research data set remains a challenge. Quite a number of data sets have been proposed, but we still lack a systematic approach to creating ones that evolve together with the industry. We aim to present a systematic method of selecting data sets of industry-relevant software projects for the purposes of software engineering research. We present a set of guidelines for filtering GitHub projects and implement those guidelines in the form of an R script. In particular, we select mostly projects from the biggest industrial open source contributors and remove from the data set any project that falls in the first quartile of any of several categories. We use the latest GitHub GraphQL API to select the desired set of repositories, and we evaluate the technique on Java projects. The presented technique systematizes methods for creating software development data sets and evolving them over time. The proposed algorithm has reasonable precision (between 0.65 and 0.80) and can be used as a baseline for further refinements.
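The quartile-based filtering step described above can be sketched as follows. This is an illustrative Python sketch, not the authors' R script: the metric categories (`stars`, `commits`, `contributors`), the sample project records, and the strict-inequality cutoff are all assumptions made for the example. The idea is that a project is removed if it falls into the first quartile in any single category.

```python
import statistics

# Hypothetical project records with per-category metric values
# (the categories and numbers are illustrative, not the paper's data).
projects = [
    {"name": "alpha",   "stars": 120, "commits": 900,  "contributors": 14},
    {"name": "beta",    "stars": 15,  "commits": 2500, "contributors": 30},
    {"name": "gamma",   "stars": 480, "commits": 60,   "contributors": 5},
    {"name": "delta",   "stars": 300, "commits": 1800, "contributors": 22},
    {"name": "epsilon", "stars": 90,  "commits": 700,  "contributors": 9},
]

METRICS = ["stars", "commits", "contributors"]

def first_quartiles(projects, metrics):
    """Compute the first quartile (Q1) of each metric across all projects."""
    return {
        m: statistics.quantiles([p[m] for p in projects], n=4)[0]
        for m in metrics
    }

def filter_projects(projects, metrics):
    """Keep only projects above Q1 in every metric: a project that falls
    into the first quartile in any single category is removed."""
    q1 = first_quartiles(projects, metrics)
    return [p for p in projects if all(p[m] > q1[m] for m in metrics)]

kept = filter_projects(projects, METRICS)
print([p["name"] for p in kept])  # → ['alpha', 'delta', 'epsilon']
```

Note that `beta` is dropped for low stars and `gamma` for low commits and contributors, even though each is strong in another category; requiring `all(...)` rather than `any(...)` is what makes a single weak category disqualifying.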
Notes
1. Except for specific cases such as insufficient performance or reverse engineering, which we ignore in this discussion.
2. This also means that we rejected all forks and kept only the main repository.
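The repository selection via the GitHub GraphQL API can be sketched as below. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the search qualifiers (`language:Java`, `stars:>100`) and the page size are example values, and the snippet only assembles the request payload rather than calling the network. GitHub's repository search excludes forks by default, but the `isFork` field is still fetched so the client can double-check, matching the note above about keeping only main repositories.

```python
import json

# GitHub's GraphQL endpoint (a real URL; authentication via an
# "Authorization: bearer <token>" header is required to actually query it).
GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

QUERY = """
query ($searchQuery: String!, $pageSize: Int!) {
  search(query: $searchQuery, type: REPOSITORY, first: $pageSize) {
    repositoryCount
    nodes {
      ... on Repository {
        nameWithOwner
        isFork
        stargazerCount
      }
    }
  }
}
"""

def build_request(language="Java", min_stars=100, page_size=50):
    """Assemble the JSON payload for a POST to the GraphQL endpoint.
    The qualifiers and thresholds here are illustrative assumptions."""
    variables = {
        "searchQuery": f"language:{language} stars:>{min_stars}",
        "pageSize": page_size,
    }
    return json.dumps({"query": QUERY, "variables": variables})

payload = build_request()
```

The payload would then be POSTed to `GITHUB_GRAPHQL_URL` with a bearer token; any repository whose `isFork` field is true can be discarded on the client side as an extra safeguard.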
Acknowledgements
This work has been conducted as a part of research and development project POIR.01.01.01-00-0792/16 supported by the National Centre for Research and Development (NCBiR). We would like to thank Tomasz Korzeniowski and Marek Skrajnowski from code quest sp. z o.o. for their comments and feedback from a real-world software engineering environment.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Lewowski, T., Madeyski, L. (2020). Creating Evolving Project Data Sets in Software Engineering. In: Jarzabek, S., Poniszewska-Marańda, A., Madeyski, L. (eds) Integrating Research and Practice in Software Engineering. Studies in Computational Intelligence, vol 851. Springer, Cham. https://doi.org/10.1007/978-3-030-26574-8_1
Print ISBN: 978-3-030-26573-1
Online ISBN: 978-3-030-26574-8
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)