Toward mining "concept keywords" from identifiers in large software projects

Published: 17 May 2005

Abstract

We propose the Concept Keyword Term Frequency/Inverse Document Frequency (ckTF/IDF) method as a novel technique to efficiently mine concept keywords from identifiers in large software projects. ckTF/IDF is suitable for this task because it is more lightweight than the TF/IDF method and its heuristics are tuned for identifiers in programs. We then experimentally applied ckTF/IDF to our educational operating system udos, which consists of around 5,000 lines of C code, with promising results: the udos source code was processed in 1.4 seconds with an accuracy of around 57%. This preliminary result suggests that our approach is useful for mining concept keywords from identifiers, although more research and experience are needed.
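
For readers unfamiliar with the baseline that ckTF/IDF is compared against, the following is a minimal sketch of plain TF/IDF applied to words split out of identifiers. It is not the authors' ckTF/IDF: the abstract does not describe its heuristics, so none are reproduced here. The function names (`split_identifier`, `tfidf_keywords`), the splitting rules, and the sample file and identifier names are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def split_identifier(identifier):
    """Split an identifier such as 'read_fat_entry' or 'readFatEntry' into lowercase words.

    Assumed splitting rules (not taken from the paper): break on underscores,
    digits and other non-letters, then on camelCase boundaries.
    """
    words = []
    for part in re.split(r"[_\W\d]+", identifier):
        # An all-caps run (e.g. 'FAT') or an optionally capitalised lowercase run.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words if w]


def tfidf_keywords(files, top_n=3):
    """Rank the words in each file by standard TF/IDF.

    files: dict mapping a file name to the list of identifiers found in it.
    Returns a dict mapping each file name to its top_n highest-scoring words.
    """
    # Term counts per file, treating each file as one "document".
    term_counts = {
        name: Counter(w for ident in idents for w in split_identifier(ident))
        for name, idents in files.items()
    }
    # Document frequency: in how many files does each word occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(files)
    result = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Highest score first; ties broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = ranked[:top_n]
    return result


# Hypothetical identifiers; these are NOT taken from the udos source code.
example_files = {
    "fat.c": ["read_fat_entry", "fat_cluster_next", "buf_read"],
    "sched.c": ["task_switch", "task_queue_push", "buf_read"],
    "console.c": ["console_putc", "console_buf_flush", "buf_read"],
}
keywords = tfidf_keywords(example_files, top_n=1)
print(keywords)
```

In this sketch, words that occur in every file (here `buf` and `read`) receive an inverse document frequency of zero and so never rank as keywords, while words concentrated in a single file rise to the top. That is the general intuition behind using TF/IDF-style weighting to surface concept keywords.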

      Published In

      ACM SIGSOFT Software Engineering Notes, Volume 30, Issue 4
      July 2005
      1514 pages
      ISSN: 0163-5948
      DOI: 10.1145/1082983
      • MSR '05: Proceedings of the 2005 international workshop on Mining software repositories
        May 2005
        109 pages
        ISBN: 1595931236
        DOI: 10.1145/1083142
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 May 2005
      Published in SIGSOFT Volume 30, Issue 4

      Author Tags

      1. TF/IDF
      2. concept keywords
      3. identifiers
      4. program understanding

      Qualifiers

      • Article

      Cited By

      • (2023) Towards semantically enhanced detection of emerging quality-related concerns in source code. Software Quality Journal, 31:3 (865-915). DOI: 10.1007/s11219-023-09614-8. Online publication date: 17-Feb-2023
      • (2023) PyScribe: Learning to describe python code. Software: Practice and Experience, 54:3 (501-527). DOI: 10.1002/spe.3291. Online publication date: 9-Dec-2023
      • (2014) Learning natural coding conventions. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (281-293). DOI: 10.1145/2635868.2635883. Online publication date: 11-Nov-2014
      • (2021) Multilabel Sentiment Prediction by Addressing Imbalanced Class Problem Using Oversampling. Advances in Smart Communication Technology and Information Processing (239-249). DOI: 10.1007/978-981-15-9433-5_23. Online publication date: 16-Feb-2021
      • (2020) Automated Fine Grained Traceability Links Recovery between High Level Requirements and Source Code Implementations. ParadigmPlus, 1:2 (18-41). DOI: 10.55969/paradigmplus.v1n2a2. Online publication date: 19-Aug-2020
      • (2018) Interactive Query Reformulation for Source-Code Search With Word Relations. IEEE Access, 6 (75660-75668). DOI: 10.1109/ACCESS.2018.2883963. Online publication date: 2018
      • (2018) Lexical Similarity Between Argument and Parameter Names: An Empirical Study. IEEE Access, 6 (58461-58481). DOI: 10.1109/ACCESS.2018.2875125. Online publication date: 2018
      • (2017) Automatically generating natural language descriptions for object-related statement sequences. 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER) (205-216). DOI: 10.1109/SANER.2017.7884622. Online publication date: Feb-2017
      • (2015) Navigating source code with words. 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM) (71-80). DOI: 10.1109/SCAM.2015.7335403. Online publication date: Sep-2015
      • (2015) Developing a model of loop actions by mining loop characteristics from a large code corpus. Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) (51-60). DOI: 10.1109/ICSM.2015.7332451. Online publication date: 29-Sep-2015
