research-article

The tradeoffs of fused memory hierarchies in heterogeneous computing architectures

Authors: Kyle L. Spafford, Jeremy S. Meredith, Philip C. Roth, Jeffrey S. Vetter

CF '12: Proceedings of the 9th conference on Computing Frontiers

Pages 103 - 112

https://doi.org/10.1145/2212908.2212924

Published: 15 May 2012

Abstract

With the rise of general purpose computing on graphics processing units (GPGPU), the influence from consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.

Index Terms

The tradeoffs of fused memory hierarchies in heterogeneous computing architectures
1. Computer systems organization
  1. Architectures
    1. Other architectures

Information & Contributors

Information

Published In

CF '12: Proceedings of the 9th conference on Computing Frontiers

May 2012

320 pages

ISBN: 9781450312158

DOI: 10.1145/2212908

General Chair: John Feo (Pacific Northwest National Laboratory, USA)

Program Chairs: Paolo Faraboschi (HP Labs, Spain), Oreste Villa (Pacific Northwest National Laboratory, USA)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CF'12

Sponsor:

SIGMICRO

CF'12: Computing Frontiers Conference

May 15 - 17, 2012

Cagliari, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%
