Abstract
Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA). LDA is used for concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current attempts to interpret topics assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy, consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provide fresh insight into their evolving software development activities.
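The labelling step described above can be illustrated with a minimal sketch. It assumes the topics have already been extracted by LDA and are represented by their top words, and it assigns a label whenever a topic shares at least one word with a label's word list. The word lists, the `label_topic` helper, and the sample topics below are hypothetical illustrations, not the taxonomy or the implementation used in the study.

```python
# Hypothetical word lists for three non-functional requirements (NFRs).
# These are illustrative assumptions, not the word lists used in the study.
NFR_WORDS = {
    "portability": {"port", "platform", "windows", "linux", "compile"},
    "efficiency": {"performance", "speed", "slow", "fast", "memory", "optimize"},
    "reliability": {"crash", "bug", "fail", "error", "recover"},
}


def label_topic(topic_words, nfr_words=NFR_WORDS):
    """Return the sorted NFR labels whose word list shares a word with the topic."""
    words = {w.lower() for w in topic_words}
    return sorted(label for label, vocab in nfr_words.items() if words & vocab)


# Assumed input: the top words of each LDA topic extracted from commit-log comments.
topics = [
    ["fix", "crash", "windows", "build"],
    ["optimize", "index", "scan", "memory"],
    ["update", "copyright", "year"],
]

# One list of labels per topic; a topic may receive several labels or none.
labels = [label_topic(t) for t in topics]
```

A topic that matches no word list stays unlabelled, which keeps the sketch from forcing a label onto topics unrelated to any non-functional requirement.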
Notes
Generated using David A. Wheeler’s SLOCCount, http://dwheeler.com/sloccount.
NLTK: http://www.nltk.org/.
For our word lists visit http://softwareprocess.es/nomen/.
Since the MySQL and MaxDB data had poor records for developer ids, we focused on PostgreSQL.
Additional information
Editors: Tao Xie, Thomas Zimmermann and Arie van Deursen
Cite this article
Hindle, A., Ernst, N.A., Godfrey, M.W. et al. Automated topic naming. Empir Software Eng 18, 1125–1155 (2013). https://doi.org/10.1007/s10664-012-9209-9
DOI: https://doi.org/10.1007/s10664-012-9209-9