Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-Core Processors

Published: 27 September 2017

Abstract

Heterogeneous chip-multiprocessors that integrate a CPU and a GPU on the same die allow CPU and GPU applications to share critical memory system resources. Such architectures give rise to challenging resource scheduling problems. In this paper, we explore memory access scheduling algorithms driven by the criticality of GPU accesses in such systems. Different GPU access streams originate from different parts of the GPU rendering pipeline, which behaves very differently from a typical CPU pipeline and therefore requires new techniques for estimating GPU access criticality. We propose a novel queuing network model to estimate the performance-criticality of the GPU access streams. If a GPU application performs below its quality-of-service requirement (e.g., the frame rate in 3D scene rendering), the memory access scheduler uses the estimated criticality information to accelerate the critical GPU accesses. Detailed simulations of a heterogeneous chip-multiprocessor model with one GPU and four CPU cores, running heterogeneous mixes of DirectX, OpenGL, and CPU applications, show that our proposal improves GPU performance by 15% on average without much degradation in CPU performance. Extensions proposed for mixes containing GPGPU applications, which have no quality-of-service requirement, improve performance by 7% on average for these mixes.
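The scheduling idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the `Request` type, the `critical` flag (standing in for the output of a criticality estimator such as the proposed queuing network model), the `pick_next` function, and the oldest-first fallback policy are all assumptions introduced here for illustration. The sketch only shows the gating behavior: critical GPU accesses are prioritized when the observed frame rate falls below a target, and otherwise requests are served in arrival order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    source: str             # "cpu" or "gpu" (hypothetical labels)
    arrival: int            # arrival time in cycles
    critical: bool = False  # hypothetical tag from a criticality estimator


def pick_next(queue: List[Request],
              observed_fps: float,
              target_fps: float) -> Optional[Request]:
    """Remove and return the next request to service, or None if empty."""
    if not queue:
        return None
    # Accelerate critical GPU accesses only while the GPU misses its target.
    if observed_fps < target_fps:
        critical = [r for r in queue if r.source == "gpu" and r.critical]
        if critical:
            chosen = min(critical, key=lambda r: r.arrival)
            queue.remove(chosen)
            return chosen
    # Assumed fallback: serve the oldest request regardless of source.
    chosen = min(queue, key=lambda r: r.arrival)
    queue.remove(chosen)
    return chosen
```

With a queue holding a CPU request that arrived at cycle 1 and a critical GPU request that arrived at cycle 2, `pick_next` returns the GPU request when the observed frame rate is below the target and the CPU request otherwise. A real DRAM scheduler would also account for row-buffer state, bank parallelism, and CPU fairness, which this sketch deliberately omits.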


    Information & Contributors

    Information

    Published In

    ACM Transactions on Embedded Computing Systems  Volume 16, Issue 5s
    Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017
    October 2017
    1448 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/3145508
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 September 2017
    Accepted: 01 June 2017
    Revised: 01 June 2017
    Received: 01 April 2017
    Published in TECS Volume 16, Issue 5s

    Author Tags

    1. 3D rendering
    2. CPU-GPU heterogeneous multi-core
    3. DRAM access scheduling
    4. GPGPU
    5. GPU access criticality

    Qualifiers

    • Research-article
    • Research
    • Refereed

