Abstract
Commit messages contain diverse and valuable types of knowledge in all aspects of software maintenance and evolution. Links are an example of such knowledge. Previous work on “9.6 million links in source code comments” showed that links are prone to decay, become outdated, and lack bidirectional traceability. We conducted a large-scale study of 18,201,165 links from commits in 23,110 GitHub repositories to investigate whether they suffer the same fate. Results show that referencing external resources is prevalent and that the most frequent domains other than github.com are the external domains of Stack Overflow and Google Code. Similarly, links serve as source code context to commit messages, with inaccessible links being frequent. Although repeatedly referencing links is rare (4%), 14% of links that are prone to evolve become unavailable over time; e.g., tutorials or articles and software homepages become unavailable over time. Furthermore, we find that 70% of the distinct links suffer from decay; the domains that occur the most frequently are related to Subversion repositories. We summarize that links in commits share the same fate as links in code, opening up avenues for future work.
Similar content being viewed by others
Data Availability
The datasets generated during and/or analysed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.7536500.
Notes
MySQL database dump 2019-02-01 from http://ghtorrent.org/downloads.html.
References
Abdalkareem R, Mujahid S, Shihab E (2020) A machine learning approach to improve the detection of ci skip commits. IEEE Trans Softw Eng 47:2740–2754
Aghajani E, Nagy C, Vega-Márquez OL, Linares-Vásquez M, Moreno L, Bavota G, Lanza M (2019) Software documentation issues unveiled. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, pp 1199–1210
Agrawal R, Srikant R et al (1994) Fast algorithms for mining association rules. In: Proc. 20th int. conf. very large data bases, VLDB, Citeseer, vol 1215. pp 487–499
Alali A, Kagdi H, Maletic JI (2008) What’s a typical commit? a characterization of open source software repositories. In: 2008 16th IEEE international conference on program comprehension. IEEE, pp 182–191
Aniche M, Treude C, Steinmacher I, Wiese I, Pinto G, Storey MA, Gerosa MA (2018) How modern news aggregators help development communities shape and share knowledge. In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, pp 499–510
Baltes S, Diehl S (2019) Usage and attribution of stack overflow code snippets in github projects. Empir Softw Eng 24(3):1259–1295
Baltes S, Dumani L, Treude C, Diehl S (2018) Sotorrent: reconstructing and analyzing the evolution of stack overflow posts. In: Proceedings of the 15th international conference on mining software repositories. pp 319–330
Baltes S, Treude C, Robillard MP (2022) Contextual documentation referencing on stack overflow. IEEE Trans Software Eng 48(2):135–149. https://doi.org/10.1109/TSE.2020.2981898
Barrie JM, Presti DE (2000) Digital plagiarism-the web giveth and the web shall taketh. J Med Internet Res 2(1):e6
Buse RP, Weimer WR (2010) Automatically documenting program changes. In: Proceedings of the IEEE/ACM international conference on Automated software engineering. pp 33–42
Dabbish L, Stuart C, Tsay J, Herbsleb J (2012) Social coding in github: transparency and collaboration in an open software repository. In: Proceedings of the ACM 2012 conference on computer supported cooperative work. pp 1277–1286
D’Ambros M, Lanza M, Robbes R (2010) Commit 2.0. In: Proceedings of the 1st Workshop on Web 2.0 for Software Engineering. pp 14–19
Dragan N, Collard ML, Hammad M, Maletic JI (2011) Using stereotypes to help characterize commits. In: 2011 27th IEEE International Conference on Software Maintenance (ICSM). IEEE, pp 520–523
Dyer R, Nguyen HA, Rajan H, Nguyen TN (2013) Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: 2013 35th International Conference on Software Engineering (ICSE). IEEE, pp 422–431
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378
Forte A, Kittur N, Larco V, Zhu H, Bruckman A, Kraut RE (2012) Coordination and beyond: social functions of groups in open content production. In: Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. pp 417–426
Fu Y, Yan M, Zhang X, Xu L, Yang D, Kymer JD (2015) Automated classification of software change messages by semi-supervised latent dirichlet allocation. Inf Softw Technol 57:369–377
Girba T, Kuhn A, Seeberger M, Ducasse S (2005) How developers drive software evolution. In: Eighth international workshop on principles of software evolution (IWPSE’05). IEEE, pp 113–122
Gómez C, Cleary B, Singer L (2013) A study of innovation diffusion through link sharing on stack overflow. In: 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, pp 81–84
Gousios G (2013) The ghtorent dataset and tool suite. In: 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, pp 233–236
Hassan AE (2008) The road ahead for mining software repositories. In: 2008 Frontiers of Software Maintenance. IEEE, pp 48–57
Hata H, Treude C, Kula RG, Ishio T (2019) 9.6 million links in source code comments: Purpose, evolution, and decay. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, pp 1211–1221
Hata H, Novielli N, Baltes S, Kula RG, Treude C (2022) Github discussions: an exploratory study of early adoption. Empir Softw Eng 27(1):1–32
Huang Y, Jia N, Zhou HJ, Chen XP, Zheng ZB, Tang MD (2020) Learning human-written commit messages to document code changes. J Comput Sci Technol 35(6):1258–1277
Kehoe C, Pitkow J, Rogers J (1998) Gvu’s ninth www user survey report. Office of Technology Licensing, Georgia Tech Research Corp., Atlanta
Kittur A, Kraut RE (2010) Beyond wikipedia: coordination and conflict in online production groups. In: Proceedings of the 2010 ACM conference on Computer supported cooperative work. pp 215–224
Krasniqi R, Cleland-Huang J (2020) Enhancing source code refactoring detection with explanations from commit messages. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp 512–516
Krejcie RV, Morgan DW (1970) Determining sample size for research activities. Educ Psychol Meas 30(3):607–610
Liu B, Zhang L, Jiang J, Wang L (2022) A method for identifying references between projects in github. Sci Comput Program 222:102858
Liu J, Xia X, Lo D, Zhang H, Zou Y, Hassan AE, Li S (2021) Broken external links on stack overflow. IEEE Trans Softw Eng 48:3242–3267
Liu J, Zhang H, Xia X, Lo D, Zou Y, Hassan AE, Li S (2022) An exploratory study on the repeatedly shared external links on stack overflow. Empir Softw Eng 27(1):1–32
Liu S, Gao C, Chen S, Yiu NL, Liu Y (2020) Atom: commit message generation based on abstract syntax tree and hybrid ranking. IEEE Trans Softw Eng 48:1800–1817
Maalej W, Happel HJ (2009) From work to word: how do software developers describe their work? In: 2009 6th IEEE International Working Conference on Mining Software Repositories. IEEE, pp 121–130
Maalej W, Happel HJ (2010) Can development work describe itself? In: 2010 7th IEEE working conference on mining software repositories (MSR 2010). IEEE, pp 191–200
Mockus A, Votta LG (2000) Identifying reasons for software changes using historic databases. In: icsm. pp 120–130
Moreno L, Bavota G, Di Penta M, Oliveto R, Marcus A, Canfora G (2014) Automatic generation of release notes. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp 484–495
Movshovitz-Attias D, Movshovitz-Attias Y, Steenkiste P, Faloutsos C (2013) Analysis of the reputation system and user contributions on a question answering website: stackoverflow. In: 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013). IEEE, pp 886–893
Murphy G (2009) Attacking information overload in software development. In: 2009 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, pp 4–4
Murphy J, Hashim NH, O’Connor P (2007) Take me back: validating the wayback machine. J Comput-Mediated Commun 13(1):60–75
Nagar Y (2012) What do you think? the structuring of an online community as a collective-sensemaking process. In: Proceedings of the ACM 2012 conference on computer supported cooperative work. pp 393–402
O’mahony S, Ferraro F (2007) The emergence of governance in an open source community. Acad Manag J 50(5):1079–1106
Rath M, Rendall J, Guo JL, Cleland-Huang J, Mäder P (2018) Traceability in the wild: automatically augmenting incomplete trace links. In: Proceedings of the 40th International Conference on Software Engineering. pp 834–845
Rebai S, Kessentini M, Alizadeh V, Sghaier OB, Kazman R (2020) Recommending refactorings via commit message analysis. Inf Softw Technol 126:106332
Santos EA, Hindle A (2016) Judging a commit by its cover. In: Proceedings of the 13th International Workshop on Mining Software Repositories-MSR, vol 16. pp 504–507
Sarwar MU, Zafar S, Mkaouer MW, Walia GS, Malik MZ (2020) Multi-label classification of commit messages using transfer learning. In: 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, pp 37–42
Schermann G, Brandtner M, Panichella S, Leitner P, Gall H (2015) Discovering loners and phantoms in commit and issue data. In: 2015 IEEE 23rd International Conference on Program Comprehension. IEEE, pp 4–14
Sun Y, Wang Q, Yang Y (2017) Frlink: Improving the recovery of missing issue-commit links by revisiting file relevance. Inf Softw Technol 84:33–47
Tian Y, Zhang Y, Stol KJ, Jiang L, Liu H (2022) What makes a good commit message? In: Proceedings of the 44th International Conference on Software Engineering, pp 2389–2401. https://doi.org/10.1145/3510003.3510205
Vasilescu B, Filkov V, Serebrenik A (2013) Stackoverflow and github: Associations between software development and crowdsourced knowledge. In: 2013 International Conference on Social Computing. IEEE, pp 188–195
Vasilescu B, Serebrenik A, Devanbu P, Filkov V (2014) How social q &a sites are changing knowledge sharing in open source software communities. In: Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, pp 342–354
Viera A, Garrett J (2005) Understanding interobserver agreement: the kappa statistic. Family Med 37:360–363
Wang D, Xiao T, Thongtanunam P, Kula RG, Matsumoto K (2021) Understanding shared links and their intentions to meet information needs in modern code review. Empir Softw Eng 26(5):1–32
Wattanakriengkrai S, Chinthanet B, Hata H, Kula RG, Treude C, Guo J, Matsumoto K (2022) Github repositories with links to academic papers: public access, traceability, and evolution. J Syst Softw 183:111117
Wu J, He H, Xiao W, Gao K, Zhou M (2022) Demystifying software release note issues on github. In: 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC). pp 602–613. https://doi.org/10.1145/3524610.3527919
Xiao T, Wang D, Mcintosh S, Hata H, Kula RG, Ishio T, Matsumoto K (2021) Characterizing and mitigating self-admitted technical debt in build systems. IEEE Trans Softw Eng 48:4214–4228
Xie R, Chen L, Ye W, Li Z, Hu T, Du D, Zhang S (2019) Deeplink: a code knowledge graph based deep learning approach for issue-commit link recovery. In: 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp 434–444
Xiong Y, Meng Z, Shen B, Yin W (2017) Mining developer behavior across github and stackoverflow. In: SEKE. pp 578–583
Ye D, Xing Z, Kapre N (2017) The structure and dynamics of knowledge network in domain-specific q &a sites: a case study of stack overflow. Empir Softw Eng 22(1):375–406
Zampetti F, Ponzanelli L, Bavota G, Mocci A, Di Penta M, Lanza M (2017) How developers document pull requests with external references. In: 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). IEEE, pp 23–33
Zhang Y, Yu Y, Wang H, Vasilescu B, Filkov V (2018) Within-ecosystem issue linking: a large-scale study of rails. In: Proceedings of the 7th International Workshop on Software Mining. pp 12–19
Zhang Y, Wu Y, Wang T, Wang H (2020) ilinker: a novel approach for issue knowledge acquisition in github projects. World Wide Web 23(3):1589–1619
Zhou Y, Sharma A (2017) Automated identification of security issues from commit messages and bug reports. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering. pp 914–919
Acknowledgements
This work was inspired by the International Workshop series on Dynamic Software Documentation, held at McGill’s Bellairs Research Institute, and was supported by JSPS KAKENHI under Grants JP18H04094, JP20K19774, JP20H05706, and JP22K11970.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
The authors declare that Sebastian Baltes, Hideaki Hata, Christoph Treude, and Raula Gaikovina Kula are members of the EMSE Editorial Board. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report.
Additional information
Communicated by: Sven Apel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xiao, T., Baltes, S., Hata, H. et al. 18 million links in commit messages: purpose, evolution, and decay. Empir Software Eng 28, 91 (2023). https://doi.org/10.1007/s10664-023-10325-8
Accepted:
Published:
DOI: https://doi.org/10.1007/s10664-023-10325-8