DOI: 10.1145/3643991.3644880
research-article
Open access

SATDAUG - A Balanced and Augmented Dataset for Detecting Self-Admitted Technical Debt

Published: 02 July 2024

Abstract

Self-admitted technical debt (SATD) is a form of technical debt in which developers explicitly acknowledge and document technical shortcuts, workarounds, or temporary solutions in the codebase. In recent years, researchers have manually labeled datasets derived from several software development artifacts: source code comments, messages from issue trackers and pull requests, and commit messages. These datasets are used to train, evaluate, validate, and improve machine learning and deep learning models that identify SATD instances. However, class imbalance poses a serious challenge across all existing datasets, particularly when researchers want to categorize specific types of SATD. To address the scarcity of labeled data for SATD identification (i.e., whether an instance is SATD or not) and categorization (i.e., which type of SATD an instance belongs to), we share the SATDAUG dataset, an augmented version of existing SATD datasets covering source code comments, issue trackers, pull requests, and commit messages. The augmented datasets have been balanced with respect to the available artifacts and provide a much richer source of labeled data for training machine learning or deep learning models.
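
To make the class imbalance problem concrete, the following minimal sketch shows one common way to quantify how skewed a labeled dataset is: the ratio between the largest and smallest class, and the Shannon entropy of the label distribution normalized to the range 0 to 1. The function name `class_balance`, the label names, and the toy counts are hypothetical illustrations; they are not taken from SATDAUG and do not describe how the authors measured balance.

```python
from collections import Counter
from math import log


def class_balance(labels):
    """Summarize how balanced a list of class labels is.

    Returns (counts, imbalance_ratio, normalized_entropy), where
    normalized_entropy is 1.0 for a perfectly balanced label set and
    approaches 0.0 as one class dominates.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two distinct classes")
    total = sum(counts.values())
    # Imbalance ratio: size of the largest class over the smallest class.
    ratio = max(counts.values()) / min(counts.values())
    # Shannon entropy of the empirical label distribution (natural log).
    entropy = -sum((n / total) * log(n / total) for n in counts.values())
    # Divide by the maximum possible entropy, log(k), for k classes.
    return counts, ratio, entropy / log(len(counts))


# Hypothetical toy labels, for illustration only (not SATDAUG data).
toy_labels = (
    ["non-SATD"] * 90
    + ["design-debt"] * 6
    + ["requirement-debt"] * 3
    + ["test-debt"] * 1
)

counts, ratio, balance = class_balance(toy_labels)
print(counts.most_common())
print(f"imbalance ratio: {ratio:.1f}")
print(f"normalized entropy: {balance:.3f}")
```

A dataset whose normalized entropy is close to 1.0 has roughly equal class sizes, which is the property an augmented and balanced dataset aims for in the categorization task.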

Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
April 2024
788 pages
ISBN:9798400705878
DOI:10.1145/3643991
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. self-admitted technical debt
  2. data augmentation
  3. class imbalance

Qualifiers

  • Research-article

Conference

MSR '24