Establishing Traceability Between Natural Language Requirements and Software Artifacts by Combining RAG and LLMs

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15238)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

Software Engineering aims to effectively translate stakeholders’ requirements into executable code to fulfill their needs. Traceability from natural language use case requirements to classes in a UML class diagram, subsequently translated into code implementation, is essential in systems development and maintenance. Tasks such as assessing the impact of changes and enhancing software reusability require a clear link between these requirements and their software implementation. However, establishing such links manually across extensive codebases is prohibitively challenging. Requirements, typically articulated in natural language, embody semantics that clarify the purpose of the codebase. Conventional traceability methods, relying on textual similarities between requirements and code, often suffer from low precision due to the semantic gap between high-level natural language requirements and the syntactic nature of code. The advent of Large Language Models (LLMs) provides new methods to address this challenge through their advanced capability to interpret both natural language and code syntax. Furthermore, representing code as a knowledge graph facilitates the use of graph structural information to enhance traceability links. This paper introduces an LLM-supported retrieval-augmented generation (RAG) approach for enhancing requirements traceability to the class diagram of the code, incorporating keyword, vector, and graph indexing techniques, and their integrated application. We compare the approach against conventional methods and analyze how different indexing strategies and parameterizations affect performance. Our results demonstrate that this methodology significantly improves the efficiency and accuracy of establishing traceability links in software development processes.
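The combination of keyword, vector, and graph indexing named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the class summaries, the edges, the function names, and the `alpha` and `graph_boost` parameters are all hypothetical, a term-count cosine stands in for a learned embedding model, and the final step in which an LLM would judge the retrieved candidates is omitted.

```python
import math
from collections import Counter

# Hypothetical class summaries and class-diagram edges (illustrative only).
CLASSES = {
    "Order": "order with items total price and customer",
    "Customer": "customer account name address",
    "Invoice": "invoice document billing amount for an order",
    "Logger": "writes diagnostic log messages",
}
EDGES = {"Order": ["Customer", "Invoice"], "Customer": ["Order"],
         "Invoice": ["Order"], "Logger": []}


def tokens(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Keyword index stand-in: Okapi BM25 over whitespace tokens."""
    toks = {name: tokens(text) for name, text in docs.items()}
    n = len(toks)
    avg_len = sum(len(t) for t in toks.values()) / n
    scores = {}
    for name, doc in toks.items():
        tf = Counter(doc)
        score = 0.0
        for term in set(tokens(query)):
            df = sum(1 for d in toks.values() if term in d)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores[name] = score
    return scores


def cosine_scores(query, docs):
    """Vector index stand-in: cosine similarity of term-count vectors.
    A real system would use learned embeddings instead (assumption)."""
    q = Counter(tokens(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = {}
    for name, text in docs.items():
        d = Counter(tokens(text))
        dot = sum(q[t] * d[t] for t in q)
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        scores[name] = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
    return scores


def normalize(scores):
    """Scale scores so that the best candidate has score 1."""
    top = max(scores.values())
    return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}


def hybrid_candidates(query, top_k=2, alpha=0.5, graph_boost=0.25):
    """Blend keyword and vector scores, then let the top_k seed classes
    pass a share of their score to their class-diagram neighbours."""
    kw = normalize(bm25_scores(query, CLASSES))
    vec = normalize(cosine_scores(query, CLASSES))
    combined = {c: alpha * kw[c] + (1 - alpha) * vec[c] for c in CLASSES}
    seeds = sorted(combined, key=combined.get, reverse=True)[:top_k]
    boosted = dict(combined)
    for seed in seeds:
        for neighbour in EDGES[seed]:
            boosted[neighbour] += graph_boost * combined[seed]
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)


requirement = "The customer places an order and receives an invoice"
for name, score in hybrid_candidates(requirement):
    print(f"{name}: {score:.3f}")
```

In a retrieval-augmented setup, the ranked candidates would be placed in the LLM prompt together with the requirement text. The weighted sum and the one-hop neighbour boost are only one plausible way to integrate the three signals; the sketch makes no claim about how the paper combines them.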

Author information

Authors and Affiliations

TU Wien, Business Informatics Group, Vienna, Austria
Syed Juned Ali & Dominik Bork
Microsoft, Hyderabad, India
Varun Naganathan

Corresponding author

Correspondence to Syed Juned Ali.

Editor information

Editors and Affiliations

Saarland University and German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
Wolfgang Maass
Illinois State University, Normal, IL, USA
Hyoil Han
Software Engineering Institute – Carnegie Mellon University, Pittsburgh, PA, USA
Hasan Yasar
Pacific Northwest National Laboratory, Richland, WA, USA
Nick Multari

About this paper

Cite this paper

Ali, S.J., Naganathan, V., Bork, D. (2025). Establishing Traceability Between Natural Language Requirements and Software Artifacts by Combining RAG and LLMs. In: Maass, W., Han, H., Yasar, H., Multari, N. (eds) Conceptual Modeling. ER 2024. Lecture Notes in Computer Science, vol 15238. Springer, Cham. https://doi.org/10.1007/978-3-031-75872-0_16

DOI: https://doi.org/10.1007/978-3-031-75872-0_16
Published: 21 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-75871-3
Online ISBN: 978-3-031-75872-0
eBook Packages: Computer Science, Computer Science (R0)
