Abstract
Current high performance computing systems rely on parallel processing techniques to achieve high performance. As parallel computer systems scale up, the new generation of high performance computers places more emphasis on "high productivity" [1] than on the "high performance" pursued in the past. These new systems must not only meet the traditional requirements of computing performance, but also address the ongoing technical challenges of the high-end computing domain, such as energy consumption and reliability.
Energy consumption increases dramatically as a computer system scales up [2]. High energy consumption implies high maintenance costs and low system stability; for example, the peak power consumption of the Earth Simulator and BlueGene/L is 18 MW and 1.6 MW, respectively. As for reliability, as the complexity of a computer system increases, its mean time between failures (MTBF) becomes significantly shorter than what many current high performance computing applications require [3], as observed on systems such as BlueGene/L. Therefore, energy optimization techniques and fault tolerance techniques must be introduced into computer systems to achieve low energy consumption and high reliability.
To improve the productivity of high performance computing systems, we need a proper way to measure it. Unfortunately, traditional measurement models cannot evaluate system productivity comprehensively and effectively [4]. To address this issue, this paper proposes an effective scalability metric for high performance computing systems based on Gustafson's speedup law. The metric strikes a balance among runtime productivity factors, including computing performance, energy consumption, and reliability. The contributions of our work lie in the following three aspects.
First, to measure the scalability of an energy-consumption-optimized parallel program, we should consider not only whether the program's computing performance is scalable, but also whether its energy consumption increases smoothly as the computing performance scales up. We therefore propose an energy-smoothed scalability metric based on a new definition of energy efficiency, which reflects the effect of energy consumption on runtime performance. This metric can be used to determine whether energy consumption increases smoothly as the computer system scales up.
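For context, Gustafson's law, on which the metric is built, gives the scaled speedup on $p$ processors as $S(p) = p - \alpha(p-1)$, where $\alpha$ is the serial fraction of the workload. As an illustration only (the formal definition appears in the body of the paper, so the expression here is an assumption), the proposed energy efficiency can be thought of as performance delivered per unit of energy, normalized to the single-processor case, e.g. $E_{energy}(p) = \dfrac{S(p)/p}{E(p)/(p\,E(1))}$, where $E(p)$ is the energy consumed on $p$ processors; a program is then energy-smoothed if this quantity does not degrade as $p$ grows.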
Second, when evaluating the scalability of parallel programs, we should consider the effect of fault tolerance overhead on program performance. We therefore propose a reliability-assured scalability metric based on a new definition of reliable efficiency, which reflects the effect of fault tolerance overhead on runtime performance. This metric can be used to determine whether performance remains scalable, once fault tolerance overhead is introduced, as the computer system scales up.
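In the same illustrative spirit, and again not necessarily the paper's exact formulation, reliable efficiency can be sketched as the parallel efficiency computed from the execution time that includes fault tolerance overhead, e.g. $E_{rel}(p) = \dfrac{T(1)}{p\,\bigl(T(p) + T_{ft}(p)\bigr)}$, where $T(p)$ is the failure-free parallel execution time and $T_{ft}(p)$ is the time spent on checkpointing, recovery, and other fault tolerance activities; performance is reliability-assured if this efficiency remains stable as $p$ grows.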
Third, building on the analyses above, we propose a synthetic scalability metric, which measures whether a system remains energy-smoothed and reliability-assured as it scales up. This metric simultaneously accounts for the productivity factors of computing performance, energy consumption, and reliability.
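Under the illustrative definitions above, one plausible composition, offered here only as a sketch rather than the paper's formal result, takes the synthetic scalability between system sizes $p$ and $p'$ as the ratio of combined efficiencies, $\Psi(p, p') = \dfrac{E_{energy}(p')\,E_{rel}(p')}{E_{energy}(p)\,E_{rel}(p)}$, so that a value close to 1 indicates that computing performance, energy consumption, and reliability all scale smoothly together.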
The metric is demonstrated by applying it to several well-known energy optimization and fault tolerance techniques. The case studies indicate that our model is effective for solving the following problems: first, measuring the scalability of high performance computing systems by quantifying the effect of runtime factors, including computing performance, energy consumption, and reliability, on scalability; and second, providing suggestions on how to maintain and improve the scalability of high performance computing systems, and guiding the proper selection of energy optimization and fault tolerance techniques to achieve high scalability.