A random forest model for early-stage software effort estimation for the SEERA dataset

Authors:

Emtinan I. Mustafa,

Rasha Osman

Volume 169, Issue C

https://doi.org/10.1016/j.infsof.2024.107413

Published: 02 July 2024

Abstract

Context

Publicly available software cost estimation datasets are outdated and may not represent current industrial environments. As a result, most research has concentrated on developing and evaluating estimation models, with limited evidence of their applicability to industrial practice. Moreover, these datasets and models may not be applicable in under-represented, technically and economically constrained environments such as the software development environment in Sudan.

Objective

This paper aims to develop a machine learning model that is suitable for the Sudanese software industry. To demonstrate the suitability of our approach, we evaluate our model using the publicly available SEERA (Software enginEERing in SudAn) dataset, which is a software cost estimation dataset from organizations in Sudan.

Method

We demonstrated the suitability of the SEERA dataset for effort estimation by comparing the attributes that had a high correlation with actual effort and actual duration to the cost factors identified by Sudanese experts. In addition, we developed an early-stage Random Forest model to estimate project effort and duration from the SEERA dataset. Early-stage estimation is in line with current Sudanese industrial practice. We investigated the impact of oversampling, feature selection, heterogeneity and local environmental factors on model accuracy.
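The correlation step described above can be illustrated with a minimal sketch. It assumes Pearson correlation and a cut-off of 0.5, neither of which is stated in the abstract, and the attribute names and values are invented for illustration; they are not taken from the SEERA dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(attributes, target, threshold=0.5):
    """Return (name, r) pairs with |r| >= threshold, strongest first.

    The threshold is an assumed value, not one reported by the paper.
    """
    scores = {name: pearson(values, target) for name, values in attributes.items()}
    selected = [(name, r) for name, r in scores.items() if abs(r) >= threshold]
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: five projects, three made-up attributes.
toy_attributes = {
    "team_size": [2, 3, 5, 8, 10],
    "requirements_count": [10, 22, 28, 45, 50],
    "office_floor": [3, 1, 4, 1, 5],  # deliberately unrelated to effort
}
toy_effort = [100, 210, 300, 480, 520]  # actual effort, arbitrary units

ranked = rank_by_correlation(toy_attributes, toy_effort)
for name, r in ranked:
    print(f"{name}: r = {r:.3f}")
```

The resulting shortlist is what would then be compared against the cost factors named by experts; attributes that fall below the cut-off are simply left out of that comparison.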

Results

Our experimental results showed that the Random Forest model with oversampling and feature selection provided accurate estimates that were better than random guessing (standardized accuracy > 70%). Our results were similar to accuracies reported in the literature. In addition, we demonstrated that our Random Forest model provided estimates that were more accurate than Sudanese expert judgement.
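For readers unfamiliar with the metric, standardized accuracy (SA) is commonly defined in the effort estimation literature as SA = (1 - MAE / MAE_P0) x 100, where MAE_P0 is the mean absolute error of random guessing, averaged over many runs in which each project is "predicted" with the actual value of another randomly chosen project. The sketch below assumes that definition; the abstract does not spell it out, and the function names, run count, and toy values are illustrative only.

```python
import random


def mean_absolute_error(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def random_guess_mae(actual, runs=1000, seed=0):
    """Average MAE of the random-guessing baseline.

    In each run, every project is predicted with the actual value of a
    different, randomly chosen project. Requires at least two projects.
    """
    rng = random.Random(seed)
    n = len(actual)
    total = 0.0
    for _ in range(runs):
        guesses = []
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:  # skip index i so a project never predicts itself
                j += 1
            guesses.append(actual[j])
        total += mean_absolute_error(actual, guesses)
    return total / runs


def standardized_accuracy(actual, predicted, runs=1000, seed=0):
    """SA in percent: 0 is no better than random guessing, 100 is perfect."""
    baseline = random_guess_mae(actual, runs, seed)
    return (1.0 - mean_absolute_error(actual, predicted) / baseline) * 100.0


# Hypothetical toy values, not results from the paper.
toy_actual = [100, 210, 300, 480, 520]
toy_predicted = [120, 200, 310, 450, 500]
sa = standardized_accuracy(toy_actual, toy_predicted)
print(f"SA = {sa:.1f}%")
```

Under this definition, an SA above 70% means the model's mean absolute error is less than 30% of the error obtained by guessing at random from the observed values.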

Conclusion

This study has demonstrated the feasibility of our random forest model for early-stage effort and duration estimation for Sudanese software projects. The results demonstrate the importance of representative models and datasets for non-traditional technical environments. Further research is required to investigate the impact of local environmental factors on software cost estimation.
