research-article

Harmonia: balancing compute and memory power in high-performance GPUs

Authors:

Sudhakar Yalamanchili

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 54 - 65

https://doi.org/10.1145/2749469.2750404

Published: 13 June 2015

Abstract

In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute versus memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation.

Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables: compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED²) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on average is within 3% of the oracle scheme.
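
For readers unfamiliar with the ED² metric reported above, the following is a minimal sketch of how an energy-delay-squared product and a relative improvement could be computed. The function names and the sample numbers are assumptions chosen for illustration; they are not taken from the paper or its measurements.

```python
def ed2(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay-squared product, E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2


def ed2_improvement(baseline: float, candidate: float) -> float:
    """Fractional ED^2 reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


# Hypothetical measurements, for illustration only (not from the paper):
# same run time, 20% less energy under the managed configuration.
baseline = ed2(energy_joules=100.0, delay_seconds=2.0)
managed = ed2(energy_joules=80.0, delay_seconds=2.0)
print(f"ED^2 improvement: {ed2_improvement(baseline, managed):.0%}")
```

Because delay is squared, the metric penalizes slowdowns more heavily than it rewards energy savings, which is consistent with the stated goal of reducing power with minimal performance degradation.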

Index Terms

Harmonia: balancing compute and memory power in high-performance GPUs
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Hardware
  1. Electronic design automation
    1. Physical design (EDA)
  2. Hardware validation

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor:
Doug DeGroot
acm dot org

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '15

Sponsor:

IEEE TCCA
SIGARCH

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total Citations: 37
Total Downloads: 1,199
Downloads (Last 12 months): 33
Downloads (Last 6 weeks): 2

Reflects downloads up to 23 Nov 2024

Citations

Cited By

Zhang Y, Wang Q, Lin Z, Xu P, Wang B (2024). Improving GPU Energy Efficiency through an Application-transparent Frequency Scaling Policy with Performance Assurance. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 769-785. Online publication date: 22-Apr-2024
https://dl.acm.org/doi/10.1145/3627703.3629584
Wang Y, Hao M, He H, Zhang W, Tang Q, Sun X, Wang Z (2024). DRLCAP: Runtime GPU Frequency Capping With Deep Reinforcement Learning. IEEE Transactions on Sustainable Computing 9:5, pp. 712-726. Online publication date: Sep-2024
https://doi.org/10.1109/TSUSC.2024.3362697
Hussain H, S T (2024). Analysis of Energy-Efficient LCRM Optimization Algorithm in Computer Vision-based CNNs. 2024 IEEE 8th Energy Conference (ENERGYCON), pp. 1-6. Online publication date: 4-Mar-2024
https://doi.org/10.1109/ENERGYCON58629.2024.10488814
Alawneh T, Jarajreh M, Alkasassbeh J, Sharadqh A (2023). High-Performance and Power-Saving Mechanism for Page Activations Based on Full Independent DRAM Sub-Arrays in Multi-Core Systems. IEEE Access 11, pp. 79801-79822. Online publication date: 2023
https://doi.org/10.1109/ACCESS.2023.3299848
Hussain H, Tamizharasan P, Yadav P (2023). LCRM: Layer-Wise Complexity Reduction Method for CNN Model Optimization on End Devices. IEEE Access 11, pp. 66838-66857. Online publication date: 2023
https://doi.org/10.1109/ACCESS.2023.3290620
Maura D, Goel T, Goswami K, Banerjee D, Das S (2023). Variation aware power management for GPU memories. Microprocessors & Microsystems 96:C. Online publication date: 1-Feb-2023
https://dl.acm.org/doi/10.1016/j.micpro.2022.104711
Chitkara Y (2022). A Review on Statistical Power Modelling for a Graphics Processing Unit (GPU). 2022 Sixth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pp. 327-330. Online publication date: 10-Nov-2022
https://doi.org/10.1109/I-SMAC55078.2022.9987403
Mutlu O, Ghose S, Gómez-Luna J, Ausavarungnirun R (2022). A Modern Primer on Processing in Memory. Emerging Computing: From Devices to Systems, pp. 171-243. Online publication date: 9-Jul-2022
https://doi.org/10.1007/978-981-16-7487-7_7
Straube K, Lowe-Power J, Nitta C, Farrens M, Akella V (2020). HCAPP: Scalable Power Control for Heterogeneous 2.5D Integrated Systems. Proceedings of the 49th International Conference on Parallel Processing, pp. 1-11. Online publication date: 17-Aug-2020
https://dl.acm.org/doi/10.1145/3404397.3404448
Wang Y, Wang Q, Shi S, He X, Tang Z, Zhao K, Chu X (2020). Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 744-751. Online publication date: May-2020
https://doi.org/10.1109/CCGrid49817.2020.00-15
