Abstract
Recovering traceability links between source code and software documentation is an important research topic in software maintenance and software reuse. Considerable research effort has gone into recovering traceability between documentation and code elements (classes, interfaces, methods, etc.), mostly based on program analysis. However, existing approaches still establish many noise links. In this paper, we propose a novel approach that classifies the code elements occurring in a document into contextual code elements and salient code elements. This allows us to filter out the noise traceability links between a software document and its contextual code elements and obtain a higher-quality link set. Our classifier is trained on the source code of the open source project Lucene and 1899 StackOverflow answer documents about Lucene. We extract code elements from these documents, represent each code element as a 7-dimensional feature vector, and use a decision-tree-based learning model to classify it as salient or not. In our experiments, we achieve a precision of 70.7% in recognizing the salient code elements of these documents, a 12% improvement over Rigby’s work. Depending on the classifier threshold, we filter out 56.5% to 69.3% of the noise traceability links. The approach can thus improve the quality of traceability links between source code and related software documents in practice.
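The threshold-based filtering step described above can be illustrated with a minimal sketch. It assumes the classifier has already assigned each code element a salience score in [0, 1]; the `Link` type, the helper names, the scores, and the labels are all hypothetical and invented for illustration, not taken from the paper's data or implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A candidate traceability link (hypothetical structure)."""
    doc_id: str       # identifier of the document, e.g. an answer post
    element: str      # code element mentioned in the document
    score: float      # assumed classifier salience score in [0, 1]
    is_salient: bool  # ground-truth label, used only for evaluation


def filter_links(links, threshold):
    """Keep only links whose predicted salience reaches the threshold."""
    return [link for link in links if link.score >= threshold]


def precision(kept):
    """Fraction of kept links that are truly salient."""
    if not kept:
        return 0.0
    return sum(link.is_salient for link in kept) / len(kept)


def noise_filtered(links, kept):
    """Fraction of contextual (noise) links removed by the filter."""
    noise_before = sum(not link.is_salient for link in links)
    noise_after = sum(not link.is_salient for link in kept)
    if noise_before == 0:
        return 0.0
    return (noise_before - noise_after) / noise_before


# Toy data: scores and labels are made up for this example.
links = [
    Link("answer-1", "IndexWriter", 0.9, True),
    Link("answer-1", "Document", 0.4, False),
    Link("answer-1", "String", 0.1, False),
    Link("answer-2", "QueryParser", 0.8, True),
    Link("answer-2", "IndexSearcher", 0.6, False),
]

kept = filter_links(links, threshold=0.5)
print(len(kept), precision(kept), noise_filtered(links, kept))
```

Raising the threshold in this sketch removes more contextual links at the risk of also dropping salient ones, which is the trade-off behind reporting a range of filtered noise links for different thresholds.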
References
Antoniol G, Canfora G, Casazza G, et al. Recovering traceability links between code and documentation. IEEE Trans Softw Eng, 2002, 28: 970–983
Marcus A, Maletic J I. Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of the 25th International Conference on Software Engineering, Portland, 2003. 125–135
Robillard M P, Marcus A, Treude C, et al. On-demand developer documentation. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2017), Shanghai, 2017. 479–483
Bacchelli A, D’Ambros M, Lanza M, et al. Benchmarking lightweight techniques to link e-mails and source code. In: Proceedings of the 16th Working Conference on Reverse Engineering (WCRE 2009), Lille, 2009. 205–214
Bacchelli A, Lanza M, Robbes R. Linking e-mails and source code artifacts. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, Cape Town, 2010. 375–384
Dagenais B, Robillard M P. Recovering traceability links between an API and its learning resources. In: Proceedings of the 34th International Conference on Software Engineering (ICSE 2012), Zurich, 2012. 47–57
Rigby P C, Robillard M P. Discovering essential code elements in informal documentation. In: Proceedings of the 2013 International Conference on Software Engineering, San Francisco, 2013. 832–841
McMillan C, Poshyvanyk D, Revelle M. Combining textual and structural analysis of software artifacts for traceability link recovery. In: Proceedings of ICSE Workshop on Traceability in Emerging Forms of Software Engineering. Washington: IEEE Computer Society, 2009. 41–48
Panichella A, McMillan C, Moritz E, et al. When and how using structural information to improve IR-based traceability recovery. In: Proceedings of the 17th European Conference on Software Maintenance and Reengineering (CSMR 2013), Genova, 2013. 199–208
Subramanian S, Inozemtseva L, Holmes R. Live API documentation. In: Proceedings of the 36th International Conference on Software Engineering (ICSE 2014), Hyderabad, 2014. 643–652
Petrosyan G, Robillard M P, Mori R D. Discovering information explaining API types using text classification. In: Proceedings of the 37th International Conference on Software Engineering-Volume 1, Florence, 2015. 869–879
Jiang H, Zhang J, Li X, et al. A more accurate model for finding tutorial segments explaining APIs. In: Proceedings of IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER 2016), Suita, 2016. 157–167
Zou Y Z, Ye T, Lu Y Y, et al. Learning to rank for question-oriented software text retrieval. In: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015), Lincoln, 2015. 1–11
Lin Z Q, Xie B, Zou Y Z, et al. Intelligent development environment and software knowledge graph. J Comput Sci Technol, 2017, 32: 242–249
Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. ArXiv:1301.3781
Friedman J H. Greedy function approximation: a gradient boosting machine. Ann Stat, 2001, 29: 1189–1232
Friedman J H. Stochastic gradient boosting. Comput Stat Data Anal, 2002, 38: 367–378
Tsuchiya R, Kato T, Washizaki H, et al. Recovering traceability links between requirements and source code in the same series of software products. In: Proceedings of the 17th International Software Product Line Conference, Tokyo, 2013. 121–130
Tsuchiya R, Washizaki H, Fukazawa Y, et al. Recovering traceability links between requirements and source code using the configuration management log. IEICE Trans Inf Syst, 2015, 98: 852–862
Xu Y, Liu C. Research on retrieval methods for traceability between Chinese documentation and source code based on LDA. Comput Eng Appl, 2013, 49: 70–76
Lai G, Wang X, Liu C. Analysis and improvement on retrieval methods for traceability links between source code and documentation. ACTA Electron Sin, 2009, 37: 22–30
Yang B, Liu C. Research on traceability recovery between documentation and source code based on software structure. J Front Comput Sci Tech, 2014, 6: 7
Ye X, Shen H, Ma X, et al. From word embeddings to document similarities for improved information retrieval in software engineering. In: Proceedings of the 38th International Conference on Software Engineering, Austin, 2016. 404–415
Rahimi M, Goss W, Cleland-Huang J. Evolving requirements-to-code trace links across versions of a software system. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, 2016. 99–109
Zhang Y, Lo D, Xia X, et al. Inferring links between concerns and methods with multi-abstraction vector space model. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, 2016. 110–121
Kim S, Kim H Y, Kim J A, et al. A study on traceability between documents of a software R&D project. In: Advanced Multimedia and Ubiquitous Engineering. Berlin: Springer, 2016. 203–210
de Lucia A, Fasano F, Oliveto R, et al. Enhancing an artefact management system with traceability recovery features. In: Proceedings of the 20th International Conference on Software Maintenance (ICSM 2004), Chicago, 2004. 306–315
Nishikawa K, Washizaki H, Fukazawa Y, et al. Recovering transitive traceability links among software artifacts. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, 2015. 576–580
Ye D, Xing Z, Foo C Y, et al. Learning to extract API mentions from informal natural language discussions. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, 2016. 389–399
Sridhara G, Hill E, Muppaneni D, et al. Towards automatically generating summary comments for Java methods. In: Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering (ASE 2010), Antwerp, 2010. 43–52
Eddy B P, Kraft N A. Using structured queries for source code search. In: Proceedings of the 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, 2014. 431–435
Ponzanelli L, Mocci A, Bacchelli A, et al. Improving low quality Stack Overflow post detection. In: Proceedings of the 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, 2014. 541–544
Lin Y, Liu Z, Sun M, et al. Learning entity and relation embeddings for knowledge graph completion. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, 2015. 2181–2187
Creation O W L. To generate the ontology from Java source code. Int J Adv Comput Sci Appl, 2011, 2: 111–116
McMillan C, Grechanik M, Poshyvanyk D, et al. Portfolio: finding relevant functions and their usage. In: Proceedings of the 33rd International Conference on Software Engineering, Waikiki, 2011. 111–120
Bajracharya S K, Ossher J, Lopes C V. Leveraging usage similarity for effective retrieval of examples in code repositories. In: Proceedings of the 18th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Santa Fe, 2010. 157–166
Butler S, Wermelinger M, Yu Y J. Investigating naming convention adherence in Java references. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, 2015. 41–50
Acknowledgements
This paper was supported by National Key Research and Development Project of China (Grant No. 2016YFB1000804) and National Natural Science Fund for Distinguished Young Scholars (Grant No. 61525201).
Cite this article
Cao, Y., Zou, Y., Luo, Y. et al. Toward accurate link between code and software documentation. Sci. China Inf. Sci. 61, 050105 (2018). https://doi.org/10.1007/s11432-017-9402-3
DOI: https://doi.org/10.1007/s11432-017-9402-3