Automated topic naming

Abram Hindle¹,
Neil A. Ernst²,
Michael W. Godfrey³ &
John Mylopoulos⁴

Abstract

Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA). LDA is used for concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current attempts to interpret topics assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy, consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provide fresh insight into their evolving software development activities.
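The labelling step described above can be sketched roughly as follows. This is a minimal illustration, assuming each LDA topic is already available as a list of its top words and that each non-functional requirement is represented by a small word list. The word lists, the `label_topic` helper, and the sample topics are hypothetical placeholders, not the authors' published lists or implementation.

```python
# Hypothetical word lists keyed by non-functional requirement.
# Illustrative only; these are NOT the word lists used in the article.
NFR_WORD_LISTS = {
    "portability": {"portable", "platform", "windows", "linux", "compile"},
    "efficiency": {"performance", "slow", "fast", "optimize", "memory"},
    "reliability": {"crash", "bug", "failure", "error", "recover"},
}


def label_topic(topic_words, word_lists=NFR_WORD_LISTS):
    """Return the sorted labels whose word list shares a word with the topic."""
    words = {w.lower() for w in topic_words}
    return sorted(label for label, vocab in word_lists.items() if words & vocab)


# Stand-ins for topics that LDA might extract from commit-log comments.
topics = [
    ["fix", "crash", "windows", "build"],
    ["optimize", "index", "scan", "fast"],
    ["update", "copyright", "year"],
]

labels = [label_topic(t) for t in topics]
for topic, topic_labels in zip(topics, labels):
    print(topic, "->", topic_labels or ["(unlabelled)"])
```

A topic may receive several labels or none under this scheme; a topic with no matching words simply stays unlabelled.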

Notes

Generated using David A. Wheeler’s SLOCCount, http://dwheeler.com/sloccount.
http://www.sdn.sap.com/irj/sdn/maxdb
http://www.postgresql.org/docs/7.3/static/
NLTK: http://www.nltk.org/.
For our word lists visit http://softwareprocess.es/nomen/.
Since the MySQL and MaxDB data had poor records for developer ids, we focused on PostgreSQL.

Author information

Authors and Affiliations

Dept. of Computing Science, University of Alberta, Edmonton, AB, Canada
Abram Hindle
Dept. of Computer Science, University of British Columbia, Vancouver, BC, Canada
Neil A. Ernst
David Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
Michael W. Godfrey
Dept. Information Eng. and Computer Science, University of Trento, Trento, Italy
John Mylopoulos

Corresponding author

Correspondence to Abram Hindle.

Additional information

Editors: Tao Xie, Thomas Zimmermann and Arie van Deursen

About this article

Cite this article

Hindle, A., Ernst, N.A., Godfrey, M.W. et al. Automated topic naming. Empir Software Eng 18, 1125–1155 (2013). https://doi.org/10.1007/s10664-012-9209-9

Published: 03 May 2012
Issue Date: December 2013
DOI: https://doi.org/10.1007/s10664-012-9209-9
