Article

Content classification of development emails

Authors:

Alberto Bacchelli,

Tommaso Dal Sasso,

Marco D'Ambros,

Michele Lanza

ICSE '12: Proceedings of the 34th International Conference on Software Engineering

Pages 375 - 385

Published: 02 June 2012

Abstract

Emails related to the development of a software system contain information about design choices and issues encountered during the development process. Exploiting the knowledge embedded in emails with automatic tools is challenging because this communication medium is unstructured, noisy, and mixes several languages. Natural language text is often not well-formed and is interleaved with content that follows other syntaxes, such as code or stack traces.

We present an approach to classify email content at the line level. Our technique classifies email lines into five categories (i.e., text, junk, code, patch, and stack trace), so that analysis techniques tailored to each category can be applied afterwards. We evaluated our approach on a statistically significant set of emails gathered from mailing lists of four unrelated open source systems.
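To make the idea of line-level classification concrete, the following is a minimal sketch that assigns each email line to one of the five categories named above. It assumes a handful of hand-written regular-expression rules; the rules, their order, and the names `classify_line` and `classify_email` are illustrative assumptions, not the technique evaluated in the paper.

```python
import re
from typing import List, Tuple

# Hypothetical rules, checked in order; the first match wins.
# They only illustrate line-level labelling and are NOT the paper's method.
_RULES = [
    # Unified-diff headers and hunk markers. Added/removed content lines
    # ("+..." / "-...") are deliberately not handled in this sketch.
    ("patch", re.compile(r"^(diff --git |index [0-9a-f]+\.\.|--- |\+\+\+ |@@ .* @@)")),
    # Java-style stack frames and exception headers.
    ("stack trace", re.compile(
        r"^\s*(at [\w$.<>]+\(.*\)|Caused by: |Exception in thread |[\w.]+(Exception|Error)(:|$))")),
    # Quoted replies, signature delimiters, separators, attribution lines.
    ("junk", re.compile(r"^\s*(>|-- $|--$|_{5,}|On .* wrote:$)")),
    # Lines ending in ';', '{' or '}', or starting with a common keyword.
    ("code", re.compile(
        r"(;\s*$|[{}]\s*$|^\s*(public|private|protected|static|class|import|return)\b)")),
]


def classify_line(line: str) -> str:
    """Return the category of a single email line; default is 'text'."""
    for label, pattern in _RULES:
        if pattern.search(line):
            return label
    return "text"


def classify_email(body: str) -> List[Tuple[str, str]]:
    """Label every line of an email body as (category, line)."""
    return [(classify_line(line), line) for line in body.splitlines()]


if __name__ == "__main__":
    sample = "\n".join([
        "Hi all, the parser seems to drop comments.",
        "> I agree with the proposal",
        "int x = compute(y);",
        "    at org.example.Foo.bar(Foo.java:42)",
        "+++ b/src/Foo.java",
    ])
    for label, line in classify_email(sample):
        print(f"{label:12} | {line}")
```

Once every line carries a label, a downstream tool could route each category to its own analysis, for instance sending only the lines labelled text to a natural language pipeline. Fixed rules like these are brittle, so the sketch should be read as a baseline for illustration only.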

Published In

ICSE '12: Proceedings of the 34th International Conference on Software Engineering

June 2012

1657 pages

ISBN: 9781467310673

General Chair: Martin Glinz
Program Chairs: Gail Murphy, Mauro Pezzè

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

IEEE Press

Qualifiers

Article

Conference

ICSE '12

Sponsor:

SIGSOFT

ICSE '12: 34th International Conference on Software Engineering

June 2 - 9, 2012

Zurich, Switzerland

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics

Total citations: 26
Total downloads: 327
Downloads (last 12 months): 0
Downloads (last 6 weeks): 0

Reflects downloads up to 23 Nov 2024

Cited By

Yang Y, Xia X, Lo D, Bi T, Grundy J, Yang X (2022). Predictive Models in Software Engineering: Challenges and Opportunities. ACM Transactions on Software Engineering and Methodology 31:3 (1-72). Online publication date: 9-Apr-2022.
https://dl.acm.org/doi/10.1145/3503509
van Mil F, Rastogi A, Zaidman A, Lanubile F (2021). Promises and Perils of Inferring Personality on GitHub. Proceedings of the 15th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (1-11). Online publication date: 11-Oct-2021.
https://dl.acm.org/doi/10.1145/3475716.3475775
Li X, Liang P, Li Z, Li J, Jaccheri L, Dingsøyr T, Chitchyan R (2020). Automatic Identification of Decisions from the Hibernate Developer Mailing List. Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering (51-60). Online publication date: 15-Apr-2020.
https://dl.acm.org/doi/10.1145/3383219.3383225
Le X, Bao L, Lo D, Xia X, Li S, Pasareanu C, Atlee J, Bultan T, Whittle J (2019). On reliability of patch correctness assessment. Proceedings of the 41st International Conference on Software Engineering (524-535). Online publication date: 25-May-2019.
https://dl.acm.org/doi/10.1109/ICSE.2019.00064
Pascarella L, Bruntink M, Bacchelli A (2019). Classifying code comments in Java software systems. Empirical Software Engineering 24:3 (1499-1537). Online publication date: 1-Jun-2019.
https://dl.acm.org/doi/10.1007/s10664-019-09694-w
Pascarella L, Julien C, Lewis G, Segall I (2018). Classifying code comments in Java mobile applications. Proceedings of the 5th International Conference on Mobile Software Engineering and Systems (39-40). Online publication date: 27-May-2018.
https://dl.acm.org/doi/10.1145/3197231.3198444
Pascarella L, Geiger F, Palomba F, Di Nucci D, Malavolta I, Bacchelli A, Julien C, Lewis G, Segall I (2018). Self-reported activities of Android developers. Proceedings of the 5th International Conference on Mobile Software Engineering and Systems (144-155). Online publication date: 27-May-2018.
https://dl.acm.org/doi/10.1145/3197231.3197251
Mäntylä M, Calefato F, Claes M, Zaidman A, Kamei Y, Hill E (2018). Natural language or not (NLON). Proceedings of the 15th International Conference on Mining Software Repositories (387-391). Online publication date: 28-May-2018.
https://dl.acm.org/doi/10.1145/3196398.3196444
Zhong H, Wang X, Rosu G, Di Penta M, Nguyen T (2017). Boosting complete-code tool for partial program. Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (671-681). Online publication date: 30-Oct-2017.
https://dl.acm.org/doi/10.5555/3155562.3155646
Pascarella L, Bacchelli A, Gonzalez-Barahona J, Hindle A, Tan L (2017). Classifying code comments in Java open-source software systems. Proceedings of the 14th International Conference on Mining Software Repositories (227-237). Online publication date: 20-May-2017.
https://dl.acm.org/doi/10.1109/MSR.2017.63
