DOI: 10.5555/2337223.2337268
Article

Content classification of development emails

Published: 02 June 2012

Abstract

Emails related to the development of a software system contain information about design choices and issues encountered during the development process. Exploiting the knowledge embedded in emails with automatic tools is challenging because of the unstructured, noisy, and mixed-language nature of this communication medium. Natural language text is often not well-formed and is interleaved with content that follows other syntaxes, such as code or stack traces.
We present an approach to classify email content at the line level. Our technique classifies email lines into five categories (text, junk, code, patch, and stack trace) so that analysis techniques tailored to each category can subsequently be applied. We evaluated our approach on a statistically significant set of emails gathered from mailing lists of four unrelated open source systems.
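
To make the idea of line-level classification concrete, here is a minimal Python sketch that assigns each email line to one of the five categories named above. It is not the authors' technique, which the abstract does not describe. The regular expressions, the matching order, and the names `classify_line` and `classify_email` are all assumptions chosen for illustration.

```python
import re

# Illustrative heuristics only: every pattern below is an assumption made for
# this sketch, not the classifier described in the paper.
PATCH = re.compile(r"^(diff --git |Index: |--- |\+\+\+ |@@ )")
STACK = re.compile(r"^\s*(at\s+[\w.$<>]+\(.*\)|Caused by:|[\w.]+(Exception|Error)\b)")
JUNK = re.compile(r"^\s*(>|--\s*$)")   # quoted replies and signature separators
CODE = re.compile(r"[;{}]\s*$")        # lines ending like C-family statements


def classify_line(line: str) -> str:
    """Return one of: text, junk, code, patch, stack trace."""
    if not line.strip():
        return "junk"                  # treat blank lines as noise
    if PATCH.match(line):
        return "patch"
    if STACK.match(line):
        return "stack trace"
    if JUNK.match(line):
        return "junk"
    if CODE.search(line):
        return "code"
    return "text"                      # default: natural language


def classify_email(body: str) -> list[tuple[str, str]]:
    """Classify every line of an email body independently."""
    return [(classify_line(line), line) for line in body.splitlines()]


if __name__ == "__main__":
    sample = "Hi all,\n> old reply\nint x = foo();\n    at org.example.Foo.bar(Foo.java:42)"
    for category, line in classify_email(sample):
        print(f"{category:12} | {line}")
```

Each line is judged in isolation here, so added or removed lines inside a patch hunk would be missed. A more robust classifier would also need the context of neighbouring lines.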

Published In

ICSE '12: Proceedings of the 34th International Conference on Software Engineering
June 2012
1657 pages
ISBN:9781467310673

Publisher

IEEE Press

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Cited By

  • (2022) Predictive Models in Software Engineering: Challenges and Opportunities. ACM Transactions on Software Engineering and Methodology, 31:3, pp. 1-72. DOI: 10.1145/3503509. Online publication date: 9-Apr-2022
  • (2021) Promises and Perils of Inferring Personality on GitHub. Proceedings of the 15th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1-11. DOI: 10.1145/3475716.3475775. Online publication date: 11-Oct-2021
  • (2020) Automatic Identification of Decisions from the Hibernate Developer Mailing List. Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering, pp. 51-60. DOI: 10.1145/3383219.3383225. Online publication date: 15-Apr-2020
  • (2019) On reliability of patch correctness assessment. Proceedings of the 41st International Conference on Software Engineering, pp. 524-535. DOI: 10.1109/ICSE.2019.00064. Online publication date: 25-May-2019
  • (2019) Classifying code comments in Java software systems. Empirical Software Engineering, 24:3, pp. 1499-1537. DOI: 10.1007/s10664-019-09694-w. Online publication date: 1-Jun-2019
  • (2018) Classifying code comments in Java mobile applications. Proceedings of the 5th International Conference on Mobile Software Engineering and Systems, pp. 39-40. DOI: 10.1145/3197231.3198444. Online publication date: 27-May-2018
  • (2018) Self-reported activities of Android developers. Proceedings of the 5th International Conference on Mobile Software Engineering and Systems, pp. 144-155. DOI: 10.1145/3197231.3197251. Online publication date: 27-May-2018
  • (2018) Natural language or not (NLON). Proceedings of the 15th International Conference on Mining Software Repositories, pp. 387-391. DOI: 10.1145/3196398.3196444. Online publication date: 28-May-2018
  • (2017) Boosting complete-code tool for partial program. Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pp. 671-681. DOI: 10.5555/3155562.3155646. Online publication date: 30-Oct-2017
  • (2017) Classifying code comments in Java open-source software systems. Proceedings of the 14th International Conference on Mining Software Repositories, pp. 227-237. DOI: 10.1109/MSR.2017.63. Online publication date: 20-May-2017
