Abstract
In high performance computers, degraded component health, contention for shared resources, operating system interference, and other factors cause the runtime performance of system components to vary. As system scale and the parallelism of numerical simulation programs increase, the impact of this performance variability is magnified. It degrades performance and limits application scalability and overall system throughput. Performance variability has therefore become an important question for both HPC systems and numerical simulation applications, and further research on it will help guide system and application design for future exascale computing. This paper reviews the literature on the quantitative measurement of performance variability in HPC systems. We summarize the quantitative measurement methods for three components: computation, memory, and communication. Finally, we analyze the gap between existing research and current demands, and we outline potential research issues and future work.
This research is supported by the National Key R&D Plan of China (No. 2016YFBO201403) and the National Natural Science Foundation of China (No. 61672003).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wu, L., Xu, X., Wei, Y., Liu, X. (2017). A Survey About Quantitative Measurement of Performance Variability in High Performance Computers. In: Dou, Y., Lin, H., Sun, G., Wu, J., Heras, D., Bougé, L. (eds) Advanced Parallel Processing Technologies. APPT 2017. Lecture Notes in Computer Science, vol 10561. Springer, Cham. https://doi.org/10.1007/978-3-319-67952-5_7
DOI: https://doi.org/10.1007/978-3-319-67952-5_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67951-8
Online ISBN: 978-3-319-67952-5
eBook Packages: Computer Science, Computer Science (R0)