research-article

Open access

SATDAUG - A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt

Authors:

Andrea Capiluppi

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories

Pages 289 - 293

https://doi.org/10.1145/3643991.3644880

Published: 02 July 2024

Abstract

Self-admitted technical debt (SATD) is a form of technical debt in which developers explicitly acknowledge and document technical shortcuts, workarounds, or temporary solutions in the codebase. In recent years, researchers have manually labeled datasets derived from several software development artifacts: source code comments, issue tracker messages, pull request sections, and commit messages. These datasets are used to train, evaluate, validate, and improve machine learning and deep learning models that identify SATD instances. However, class imbalance poses a serious challenge across all existing datasets, particularly when researchers want to categorize the specific types of SATD. To address the scarcity of labeled data for SATD identification (i.e., whether an instance is SATD or not) and categorization (i.e., which type of SATD an instance belongs to), we share the SATDAUG dataset, an augmented version of existing SATD datasets covering source code comments, issue trackers, pull requests, and commit messages. The augmented datasets have been balanced in relation to the available artifacts and provide a much richer source of labeled data for training machine learning or deep learning models.
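To make the distinction between identification and categorization concrete, the following minimal sketch shows one way class imbalance could be quantified for both tasks. It is an illustration only: the label names, the toy counts, and the helper functions `class_distribution`, `imbalance_ratio`, and `to_identification` are assumptions introduced here and are not taken from SATDAUG or its source datasets.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the dataset as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def to_identification(labels, non_debt_label="non-SATD"):
    """Collapse categorization labels into binary identification labels."""
    return ["non-SATD" if label == non_debt_label else "SATD" for label in labels]


# Hypothetical toy labels for illustration; not real SATDAUG counts.
category_labels = (
    ["non-SATD"] * 90 + ["design"] * 6 + ["test"] * 3 + ["documentation"] * 1
)

if __name__ == "__main__":
    # Categorization: every SATD type is its own class.
    print(class_distribution(category_labels))
    print(imbalance_ratio(category_labels))
    # Identification: all SATD types are merged into a single class.
    print(imbalance_ratio(to_identification(category_labels)))
```

In this toy data the imbalance is far more pronounced for categorization than for identification, because the SATD instances are split across several small classes; this is the situation that balancing through augmentation is meant to mitigate.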



Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories

April 2024

788 pages

ISBN:9798400705878

DOI:10.1145/3643991

Chair:
Diomidis Spinellis,
Program Chair:
Alberto Bacchelli,
Program Co-chair:
Eleni Constantinou

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MSR '24

Sponsor:

SIGSOFT

MSR '24: 21st International Conference on Mining Software Repositories

April 15 - 16, 2024

Lisbon, Portugal
