Specialized 3-D graphics processors (GPUs) for the commodity market first appeared in the mid-1990s. By then, commodity CPU development had a nearly 20-year head start on specialized graphics hardware. GPU architects exploited the rapid advances in semiconductor technology driven by the mature CPU manufacturers, filling the available silicon real estate with logic. In the intervening decade, the complexity of GPUs has advanced considerably, with most of the additional complexity arising from increased parallelization through the introduction of more vertex and fragment processing pipelines. GPUs parallelized to the limits of the lithographic process, placing as many transistors as possible in each new generation of product.
By 2004, this increased complexity was leading to thermal management issues, which strongly influence both lifetime and reliability. At that time, reliability was not high on the list of priorities for GPU vendors, though the cost of the cooling solution was. Meanwhile, beginning in 2003, vendors had started adding programmability to some functional units of the graphics pipeline in commodity GPUs. As this programmability has continued to evolve, researchers have put graphics processors to use as highly parallel, floating-point co-processors for scientific calculation (General-Purpose computation on Graphics Processing Units, or GPGPU). With the advent of GPGPU came a push for reliability in the results of the computation. Indeed, GPU-based supercomputers have already been built, and high error rates are regularly observed.
This dissertation presents Qsilver, a simulation framework for graphics architectures; uses Qsilver to analyze the application of CPU static and dynamic thermal management techniques to the graphics domain; characterizes the effects of transient errors on traditional graphics workloads, including an assessment of the most vulnerable architectural state for traditional graphics; and finally provides a detailed survey and analysis of proposed transient fault detection and recovery mechanisms for GPGPU on modern graphics processors.
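The specific detection and recovery mechanisms surveyed in the dissertation are not reproduced here, but the simplest software-level approach to transient fault detection, redundant execution with result comparison (dual-modular redundancy), can be sketched as follows. The `saxpy` kernel and `run_with_dmr` helper are illustrative assumptions, not taken from the source; on a real GPU the two executions would be separate kernel launches.

```python
def saxpy(a, x, y):
    """Stand-in for a GPGPU kernel: elementwise y <- a*x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def run_with_dmr(kernel, *args):
    """Execute `kernel` twice and compare the results.

    A mismatch signals a transient fault; recovery here is a third
    re-execution used as a tiebreaker (majority vote).
    """
    first = kernel(*args)
    second = kernel(*args)
    if first == second:
        return first              # results agree: accept
    third = kernel(*args)         # disagree: re-execute and vote
    return first if first == third else second

result = run_with_dmr(saxpy, 2.0, [1.0, 2.0], [3.0, 4.0])
```

The cost of this scheme is at least a doubling of execution time, which is one reason the dissertation's survey also considers lighter-weight detection mechanisms.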
Index Terms
- Physical challenges in reliable graphics hardware design