A random forest model for early-stage software effort estimation for the SEERA dataset

Published: 02 July 2024

Abstract

Context

Publicly available software cost estimation datasets are outdated and may not represent current industrial environments. Thus most research has concentrated on the development and evaluation of estimation models with limited evidence of their applicability to industrial practice. Moreover, these datasets and models may not be applicable in (under-represented) technically and economically constrained environments such as the software development environment in Sudan.

Objective

This paper aims to develop a machine learning model that is suitable for the Sudanese software industry. To demonstrate the suitability of our approach, we evaluate our model using the publicly available SEERA (Software enginEERing in SudAn) dataset, which is a software cost estimation dataset from organizations in Sudan.

Method

We demonstrated the suitability of the SEERA dataset for effort estimation by comparing the attributes that had a high correlation with actual effort and actual duration to the cost factors identified by (Sudanese) experts. In addition, we developed an early-stage Random Forest model to estimate project effort and duration from the SEERA dataset. Early-stage estimation is in line with current Sudanese industrial practice. We investigated the impact of oversampling, feature selection, heterogeneity and local environmental factors on model accuracy.
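The correlation screening described above can be illustrated with a minimal sketch. It assumes Pearson correlation and a cut-off of 0.5, neither of which is stated in this abstract. The attribute names and values are hypothetical and are not taken from the SEERA dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_attributes(attributes, target, threshold=0.5):
    """Return (name, r) pairs with |r| >= threshold, strongest first.

    The threshold is an assumption made for this sketch only.
    """
    scored = [(name, pearson(values, target)) for name, values in attributes.items()]
    kept = [(name, r) for name, r in scored if abs(r) >= threshold]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical attribute columns and effort values (not SEERA data).
attributes = {
    "team_size": [2, 3, 5, 8, 10],
    "requirements_count": [10, 22, 28, 45, 50],
    "office_floor": [3, 1, 4, 1, 3],
}
actual_effort = [100, 210, 300, 420, 500]

selected = rank_attributes(attributes, actual_effort)
for name, r in selected:
    print(f"{name}: r = {r:.3f}")
```

The attributes that survive such a screening could then be compared by hand with the cost factors named by experts, which is the comparison the Method paragraph describes.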

Results

Our experimental results showed that the Random Forest model with oversampling and feature selection provided accurate estimates that were better than random guessing (standardized accuracy > 70 %). Our results were similar to accuracies reported in the literature. In addition, we demonstrated that our random forest model provided estimations that were more accurate than (Sudanese) expert judgement.
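For readers unfamiliar with standardized accuracy (SA), the sketch below uses a definition that is common in the effort estimation literature: SA = 1 - MAR / MAR_P0, where MAR is the mean absolute residual of the model and MAR_P0 is the expected MAR of random guessing. Here MAR_P0 is computed exactly as the mean absolute difference over all pairs of projects. Whether the paper computed the baseline this way or by repeated random runs is not stated in this abstract, and the effort values are hypothetical.

```python
from itertools import combinations


def mean_absolute_residual(actual, predicted):
    """Mean absolute residual (MAR) between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def random_guess_mar(actual):
    """Expected MAR of random guessing.

    Random guessing predicts each project's effort with the actual effort of
    another project chosen uniformly at random, so its expected MAR equals
    the mean absolute difference over all distinct pairs of projects.
    """
    pairs = list(combinations(actual, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def standardized_accuracy(actual, predicted):
    """Standardized accuracy as a percentage: 100 * (1 - MAR / MAR_P0)."""
    mar = mean_absolute_residual(actual, predicted)
    return 100.0 * (1.0 - mar / random_guess_mar(actual))


# Hypothetical effort values in person-hours (not SEERA data).
actual_effort = [100.0, 200.0, 400.0, 800.0]
predicted_effort = [120.0, 180.0, 450.0, 700.0]

sa = standardized_accuracy(actual_effort, predicted_effort)
print(f"SA = {sa:.2f} %")
```

Under this definition, an SA of 0 % means the model is no better than random guessing and 100 % means perfect prediction, so the reported SA above 70 % indicates a large improvement over the random baseline.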

Conclusion

This study has demonstrated the feasibility of our random forest model for early-stage effort and duration estimation for Sudanese software projects. The results demonstrate the importance of representative models and datasets for non-traditional technical environments. Further research is required to investigate the impact of local environmental factors on software cost estimation.


Published In

Information and Software Technology, Volume 169, Issue C
May 2024
167 pages

Publisher

Butterworth-Heinemann

United States

Author Tags

  1. Software effort and duration estimation
  2. Random forest
  3. Early-stage
  4. The SEERA dataset
  5. Technically constrained environments
  6. SMOGN

Qualifiers

  • Research-article
