Dynamically reconfiguring software microbenchmarks: reducing execution time without sacrificing result quality

Published: 08 November 2020 · DOI: 10.1145/3368089.3409683

Abstract

Executing software microbenchmarks, a form of small-scale performance tests predominantly used for libraries and frameworks, is a costly endeavor. Full benchmark suites can take multiple hours or even days to execute, rendering frequent checks, e.g., as part of continuous integration (CI), infeasible. However, altering benchmark configurations to reduce execution time without considering the impact on result quality can lead to benchmark results that are not representative of the software’s true performance.
We propose the first technique to dynamically stop software microbenchmark executions when their results are sufficiently stable. Our approach implements three statistical stoppage criteria and is capable of reducing Java Microbenchmark Harness (JMH) suite execution times by 48.4% to 86.0%. At the same time, it retains the same result quality for 78.8% to 87.6% of the benchmarks, compared to executing the suite for the default duration.
The proposed approach does not require developers to manually craft custom benchmark configurations; instead, it provides automated mechanisms for dynamic reconfiguration. This makes dynamic reconfiguration highly effective and efficient, potentially paving the way for including JMH microbenchmarks in CI.
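To make the idea concrete, the following minimal sketch (not the authors' implementation; the coefficient-of-variation criterion, window size, and threshold are assumptions chosen for illustration, standing in for the paper's statistical stoppage criteria) shows how a dynamic stoppage check over recent measurement iterations could look in Java:

import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of a dynamic stoppage check: iteration results are fed in one at a
 * time, and the benchmark stops early once the most recent window of
 * iterations is judged stable. The coefficient of variation (CV) used here
 * is only a stand-in for a statistical stoppage criterion.
 */
public class StabilityCheck {

    private final int windowSize;      // number of recent iterations to inspect (assumed, e.g. 10)
    private final double cvThreshold;  // relative variability bound (assumed, e.g. 0.01 == 1%)
    private final Deque<Double> window = new ArrayDeque<>();

    public StabilityCheck(int windowSize, double cvThreshold) {
        this.windowSize = windowSize;
        this.cvThreshold = cvThreshold;
    }

    /** Record one iteration result (e.g. average ns/op) and decide whether to stop. */
    public boolean addMeasurementAndCheck(double value) {
        window.addLast(value);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent iterations
        }
        if (window.size() < windowSize) {
            return false; // not enough data yet to judge stability
        }
        double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = window.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .sum() / (windowSize - 1);
        double cv = Math.sqrt(variance) / mean;
        return cv < cvThreshold; // stable: further iterations are unlikely to change the result
    }
}

In a harness integration, such a check would run after every measurement (or warmup) iteration; once it fires, the remaining statically configured iterations can be skipped, which is where the execution-time savings come from.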

Supplementary Material

Auxiliary Teaser Video (fse20main-p168-p-teaser.mp4)
This is the main video for our ESEC/FSE '20 paper "Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality". We introduce a technique to dynamically stop software microbenchmark executions when their results are sufficiently stable, based on statistical stoppage criteria. It reduces the execution time of Java Microbenchmark Harness (JMH) suites by 66.2% to 82.0% compared to standard JMH, while 78.8% to 87.6% of the microbenchmarks retain the same result.
Auxiliary Presentation Video (fse20main-p168-p-video.mp4)
This is the main video for our ESEC/FSE '20 paper "Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality". We introduce a technique to dynamically stop software microbenchmark executions when their results are sufficiently stable, based on statistical stoppage criteria. It reduces the execution time of Java Microbenchmark Harness (JMH) suites by 66.2% to 82.0% compared to standard JMH, while 78.8% to 87.6% of the microbenchmarks retain the same result.
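For context on the "standard JMH" baseline mentioned above: a conventional JMH microbenchmark declares fixed numbers of forks and warmup/measurement iterations up front and always executes all of them, even if the measurements stabilize much earlier. A minimal, hypothetical example (the class, the workload, and the iteration counts are illustrative, not JMH's actual defaults):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

// Statically configured benchmark: every run executes all declared forks,
// warmup iterations, and measurement iterations, regardless of stability.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
public class ExampleBenchmark {

    @Benchmark
    public String concatenate() {
        // Hypothetical workload; any small unit of library functionality would do.
        return "result-" + System.nanoTime();
    }
}

Dynamic reconfiguration replaces these fixed counts with a run-time decision to stop as soon as a stoppage criterion reports stable results.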





Information

Published In

ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2020
1703 pages
ISBN: 9781450370431
DOI: 10.1145/3368089

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 November 2020


Author Tags

  1. JMH
  2. configuration
  3. performance testing
  4. software benchmarking

Qualifiers

  • Research-article

Funding Sources

  • Vetenskapsrådet
  • Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung

Conference

ESEC/FSE '20

Acceptance Rates

Overall Acceptance Rate: 112 of 543 submissions (21%)

Article Metrics

  • Downloads (Last 12 months): 50
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 20 Nov 2024

Cited By

  • (2024) AI-driven Java Performance Testing: Balancing Result Quality with Testing Time. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 443-454. https://doi.org/10.1145/3691620.3695017. Online publication date: 27-Oct-2024.
  • (2024) Evaluating Search-Based Software Microbenchmark Prioritization. IEEE Transactions on Software Engineering 50(7), 1687-1703. https://doi.org/10.1109/TSE.2024.3380836. Online publication date: 1-Jul-2024.
  • (2023) Automated Generation and Evaluation of JMH Microbenchmark Suites From Unit Tests. IEEE Transactions on Software Engineering 49(4), 1704-1725. https://doi.org/10.1109/TSE.2022.3188005. Online publication date: 1-Apr-2023.
  • (2023) Using Microbenchmark Suites to Detect Application Performance Changes. IEEE Transactions on Cloud Computing 11(3), 2575-2590. https://doi.org/10.1109/TCC.2022.3217947. Online publication date: 1-Jul-2023.
  • (2023) Early Stopping of Non-productive Performance Testing Experiments Using Measurement Mutations. 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 86-93. https://doi.org/10.1109/SEAA60479.2023.00022. Online publication date: 6-Sep-2023.
  • (2023) Towards Continuous Performance Assessment of Java Applications With PerfBot. 2023 IEEE/ACM 5th International Workshop on Bots in Software Engineering (BotSE), 6-8. https://doi.org/10.1109/BotSE59190.2023.00009. Online publication date: May-2023.
  • (2023) Towards effective assessment of steady state performance in Java software: are we there yet? Empirical Software Engineering 28(1). https://doi.org/10.1007/s10664-022-10247-x. Online publication date: 1-Jan-2023.
  • (2022) Exploring Performance Assurance Practices and Challenges in Agile Software Development: An Ethnographic Study. Empirical Software Engineering 27(3). https://doi.org/10.1007/s10664-021-10069-3. Online publication date: 1-May-2022.
  • (2021) Using application benchmark call graphs to quantify and improve the practical relevance of microbenchmark suites. PeerJ Computer Science 7, e548. https://doi.org/10.7717/peerj-cs.548. Online publication date: 28-May-2021.
  • (2021) How Software Refactoring Impacts Execution Time. ACM Transactions on Software Engineering and Methodology 31(2), 1-23. https://doi.org/10.1145/3485136. Online publication date: 24-Dec-2021.
