research-article

Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems

Authors: Perhaad Mistry, David Kaeli

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Pages 54 - 65

https://doi.org/10.1145/2458523.2458529

Published: 16 March 2013

Abstract

Heterogeneous systems have grown in popularity within the commercial platform and application developer communities. A growing number of systems incorporate CPUs, Graphics Processors (GPUs), and Accelerated Processing Units (APUs, which combine a CPU and a GPU on the same chip). This emerging class of platforms is now being targeted to accelerate applications in which the host processor (typically a CPU) and the compute device (typically a GPU) cooperate on a computation. In this scenario, application performance depends not only on the processing power of the respective heterogeneous processors, but also on efficient interaction and communication between them.

To help architects and application developers quantify many of the key aspects of heterogeneous execution, this paper presents a new set of benchmarks called Valar. The Valar benchmarks are applications specifically chosen to study the dynamic behavior of OpenCL applications that will benefit from host-device interaction. We describe the general characteristics of our benchmarks, focusing on those that can help characterize heterogeneous applications. For the purposes of this paper we focus on OpenCL as our programming environment, though we envision versions of Valar in additional heterogeneous programming languages.

We profile the Valar benchmarks based on their mapping and execution on different heterogeneous systems. Our evaluation examines optimizations for host-device communication and the effects of closely coupled execution of the benchmarks on the multiple OpenCL devices present in heterogeneous systems.

Published In

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

March 2013

156 pages

ISBN: 9781450320177

DOI: 10.1145/2458523

Editors: John Cavazos (University of Delaware), Xiang Gong, David Kaeli
Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

GPGPU-6

GPGPU-6: Sixth Workshop on General Purpose Processing Using GPUs

March 16, 2013

Houston, Texas, USA

Acceptance Rates

GPGPU-6 Paper Acceptance Rate 15 of 37 submissions, 41%;

Overall Acceptance Rate 57 of 129 submissions, 44%

Bibliometrics

Total Citations: 13
Total Downloads: 386
Downloads (Last 12 months): 19
Downloads (Last 6 weeks): 1

Reflects downloads up to 05 Mar 2025
Cited By

Arif M, Rafique M, Lim S, Malik Z (2020). Infrastructure-Aware TensorFlow for Heterogeneous Datacenters. 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 1-8. Online publication date: 17-Nov-2020.
https://doi.org/10.1109/MASCOTS50786.2020.9285969

Siddiqui A, Khan G (2019). Design Space Exploration of Embedded Applications on Heterogeneous CPU-GPU Platforms. 2019 International Conference on High Performance Computing & Simulation (HPCS), pages 620-627. Online publication date: Jul-2019.
https://doi.org/10.1109/HPCS48598.2019.9188052

Jamieson P, Sanaullah A, Herbordt M (2018). Benchmarking Heterogeneous HPC Systems Including Reconfigurable Fabrics: Community Aspirations for Ideal Comparisons. 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1-6. Online publication date: Sep-2018.
https://doi.org/10.1109/HPEC.2018.8547635

Mishra A, Li L, Kong M, Finkel H, Chapman B (2017). Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading. Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, pages 1-10. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3148173.3148184

Gomez-Luna J, Hajj I, Chang L, Garcia-Flores V, de Gonzalo S, Jablin T, Pena A, Hwu W (2017). Chai: Collaborative heterogeneous applications for integrated-architectures. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 43-54. Online publication date: Apr-2017.
https://doi.org/10.1109/ISPASS.2017.7975269

García V, Gomez-Luna J, Grass T, Rico A, Ayguade E, Pena A (2016). Evaluating the effect of last-level cache sharing on integrated GPU-CPU systems with heterogeneous applications. 2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1-10. Online publication date: Sep-2016.
https://doi.org/10.1109/IISWC.2016.7581277

Xie B, Liu X, McKee S, Zhan J, Jia Z, Wang L, Zhang L (2016). Understanding Data Analytics Workloads on Intel(R) Xeon Phi(R). 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 206-215. Online publication date: Dec-2016.
https://doi.org/10.1109/HPCC-SmartCity-DSS.2016.0039

Ndu G, Navaridas J, Luján M, McIntosh-Smith S, Bergen B (2015). CHO. Proceedings of the 3rd International Workshop on OpenCL, pages 1-10. Online publication date: 12-May-2015.
https://dl.acm.org/doi/10.1145/2791321.2791331

Mittal S, Vetter J (2015). A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys, 47:4, pages 1-35. Online publication date: 21-Jul-2015.
https://dl.acm.org/doi/10.1145/2788396

Ukidave Y, Paravecino F, Yu L, Kalra C, Momeni A, Chen Z, Materise N, Daley B, Mistry P, Kaeli D, John L, Smith C, Sachs K, Lladó C (2015). NUPAR. Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, pages 253-264. Online publication date: 28-Jan-2015.
https://dl.acm.org/doi/10.1145/2668930.2688046
