
Redundancy-free analysis of multi-revision software artifacts

Authors: Carol V. Alexandru, Sebastiano Panichella, Sebastian Proksch, Harald C. Gall

Empirical Software Engineering, Volume 24, Issue 1

Pages 332 - 380

https://doi.org/10.1007/s10664-018-9630-9

Published: 01 February 2019

Abstract

Researchers often analyze several revisions of a software project to obtain historical data about its evolution. For example, they statically analyze the source code and monitor the evolution of certain metrics over multiple revisions. The time and resource requirements for running these analyses often make it necessary to limit the number of analyzed revisions, e.g., by only selecting major revisions or by using a coarse-grained sampling strategy, which could remove significant details of the evolution. Most existing analysis techniques are not designed for the analysis of multi-revision artifacts and treat each revision individually. However, the actual difference between two subsequent revisions is typically very small. Thus, tools tailored for the analysis of multiple revisions should only analyze these differences, thereby preventing re-computation and storage of redundant data, improving scalability and enabling the study of a larger number of revisions. In this work, we propose the Lean Language-Independent Software Analyzer (LISA), a generic framework for representing and analyzing multi-revisioned software artifacts. It employs a redundancy-free, multi-revision representation for artifacts and avoids re-computation by only analyzing changed artifact fragments across thousands of revisions. The evaluation of our approach consists of measuring the effect of each individual technique incorporated, an in-depth study of LISA's resource requirements and a large-scale analysis over 7 million program revisions of 4,000 software projects written in four languages. We show that the time and space requirements for multi-revision analyses can be reduced by multiple orders of magnitude when compared to traditional, sequential approaches.
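
The core idea in the abstract (analyze only what changed between revisions, and store each result once for the range of revisions it applies to) can be illustrated with a small sketch. This is not LISA's implementation: the abstract does not describe its data structures, and the sketch works at whole-file granularity rather than on artifact fragments. The names `Revision`, `count_lines` and `analyze_history` are hypothetical, and the line-count metric is only a stand-in for a real static analysis.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model of one revision: file path -> file content.
Revision = Dict[str, str]


def count_lines(content: str) -> int:
    """Toy metric standing in for a real (expensive) static analysis."""
    return len(content.splitlines())


def analyze_history(
    revisions: List[Revision],
    metric: Callable[[str], int] = count_lines,
) -> Tuple[Dict[str, List[List[int]]], int]:
    """Compute `metric` for every file in every revision.

    Results are stored as [first_rev, last_rev, value] ranges per path,
    so a value that stays valid over many revisions is stored only once.
    Returns the ranges and the number of metric computations performed.
    """
    ranges: Dict[str, List[List[int]]] = {}
    previous: Revision = {}
    analyses = 0

    for rev, files in enumerate(revisions):
        for path, content in files.items():
            path_ranges = ranges.setdefault(path, [])
            if previous.get(path) == content and path_ranges:
                # Unchanged since the previous revision: extend the open
                # range instead of re-analyzing and re-storing the value.
                path_ranges[-1][1] = rev
            else:
                # New or modified file: analyze it and open a new range.
                path_ranges.append([rev, rev, metric(content)])
                analyses += 1
        previous = files

    return ranges, analyses


# Three revisions in which only b.py changes once.
history = [
    {"a.py": "x\ny\n", "b.py": "z\n"},
    {"a.py": "x\ny\n", "b.py": "z\nw\n"},
    {"a.py": "x\ny\n", "b.py": "z\nw\n"},
]
result, performed = analyze_history(history)
print(result)     # per-file revision ranges with their metric values
print(performed)  # metric computations actually performed
```

In this toy history, a revision-by-revision analysis would compute the metric six times (two files in each of three revisions), while the sketch computes it three times and stores three ranges. The saving grows with the number of revisions in which a given file does not change.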

Information

Published In

Empirical Software Engineering, Volume 24, Issue 1, February 2019, 535 pages

ISSN: 1382-3256

Copyright © 2019 Springer Science+Business Media, LLC, part of Springer Nature.

Publisher

Kluwer Academic Publishers, United States
