research-article
Open access

Symmetry-Agnostic Coordinated Management of the Memory Hierarchy in Multicore Systems

Published: 04 January 2016

Abstract

In a multicore system, many applications share the last-level cache (LLC) and memory bandwidth. These resources need to be managed carefully and in a coordinated way to maximize performance. DRAM is still the technology of choice in most systems. However, as traditional DRAM technology faces energy, reliability, and scalability challenges, nonvolatile memory (NVM) technologies are gaining traction. While DRAM is read/write symmetric (a read operation has latency and energy consumption comparable to those of a write operation), many NVM technologies, such as phase-change memory (PCM), exhibit read/write asymmetry: write operations are typically much slower and more power hungry than read operations. Whether the memory’s characteristics are symmetric or asymmetric influences how shared resources should be managed.
We propose two symmetry-agnostic schemes that manage a shared LLC through way partitioning and memory through bandwidth allocation. Both schemes work well for symmetric and asymmetric memory alike. The first is an exhaustive search that finds the best combination of cache way partition and bandwidth allocation. The second is an approximate scheme, derived from a theoretical model, that avoids the overhead of exhaustive search. Simulation results show that the approximate scheme improves weighted speedup by at least 14% on average (regardless of the memory symmetry) over a state-of-the-art combination of way partitioning and memory bandwidth allocation. They also show that the approximate scheme achieves weighted speedup comparable to that of XChange, a state-of-the-art multiple-resource management scheme, for symmetric memory, and outperforms it by an average of 10% for asymmetric memory.
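The idea of exhaustively searching over cache way partitions and bandwidth allocations can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state the performance model, so `estimated_ipc`, the application profiles, the 8-way cache, and the 10-unit bandwidth granularity are all hypothetical choices made only to keep the example runnable. The toy model also ignores read/write asymmetry. Weighted speedup is computed in the usual way, as the sum over applications of shared-mode IPC divided by stand-alone IPC.

```python
from itertools import product

def compositions(total, parts, minimum=1):
    """Yield every tuple of `parts` integers >= minimum that sums to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest

def estimated_ipc(app, ways, bw_share):
    """Toy model (an assumption, not the paper's): CPI = base CPI + miss stalls.

    Misses per instruction shrink as the application gets more ways; the cost
    of each miss grows as its bandwidth share shrinks.
    """
    misses_per_instr = app["mpi_one_way"] / ways
    miss_penalty = app["penalty_full_bw"] / bw_share
    return 1.0 / (app["base_cpi"] + misses_per_instr * miss_penalty)

def weighted_speedup(apps, ways, shares, total_ways):
    """Sum of each application's shared-mode IPC over its stand-alone IPC."""
    return sum(
        estimated_ipc(a, w, s) / estimated_ipc(a, total_ways, 1.0)
        for a, w, s in zip(apps, ways, shares)
    )

def exhaustive_search(apps, total_ways=8, bw_units=10):
    """Try every (way partition, bandwidth allocation) pair; return the best."""
    n = len(apps)
    best = None
    for ways, units in product(compositions(total_ways, n),
                               compositions(bw_units, n)):
        shares = tuple(u / bw_units for u in units)
        score = weighted_speedup(apps, ways, shares, total_ways)
        if best is None or score > best[0]:
            best = (score, ways, shares)
    return best

# Hypothetical application profiles, invented for illustration only.
apps = [
    {"base_cpi": 1.0, "mpi_one_way": 0.08, "penalty_full_bw": 100.0},
    {"base_cpi": 0.8, "mpi_one_way": 0.01, "penalty_full_bw": 100.0},
]
score, ways, shares = exhaustive_search(apps)
print(f"ways={ways} bandwidth shares={shares} weighted speedup={score:.3f}")
```

Even in this small setting the search evaluates every pairing of a way partition with a bandwidth allocation, and the number of pairings grows combinatorially with the number of applications and the allocation granularity. That growth is the overhead an approximate scheme aims to avoid.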

Supplementary Material

TACO1204-61 (taco1204-61.pdf)
Slide deck associated with this paper

References

[1]
Ramazan Bitirgen, Engin Ipek, and Jose F. Martinez. 2008. Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. In MICRO’08.
[2]
Niladrish Chatterjee, Naveen Muralimanohar, Rajeev Balasubramonian, Al Davis, and Norman P. Jouppi. 2012. Staged reads: Mitigating the impact of DRAM writes on DRAM reads. In HPCA’12.
[3]
Jian Chen and Lizy Kurian John. 2011. Predictive coordination of multiple on-chip resources for chip multiprocessors. In ICS’11.
[4]
Sangyeun Cho and Hyunjin Lee. 2009. Flip-N-write: A simple deterministic technique to improve PRAM write performance, energy and endurance. In MICRO’09.
[5]
Seungryul Choi and Donald Yeung. 2006. Learning-based SMT processor resource distribution via hill-climbing. In ISCA’06.
[6]
Jianbo Dong, Lei Zhang, Yinhe Han, and Xiaowei Li. 2011. Wear rate leveling: Lifetime enhancement of PRAM with endurance variation. In DAC’11.
[7]
Yu Du, Miao Zhou, Bruce Childers, Daniel Mossé, and Rami Melhem. 2013. Bit mapping for balanced PCM cell programming. In ISCA’13.
[8]
Yu Du, Miao Zhou, Bruce Childers, Daniel Mossé, and Rami Melhem. 2013. Delta-compressed caching for overcoming the write bandwidth limitation of hybrid main memory. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013).
[9]
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. 2010. Fairness via source throttling: A configurable and high-performance fairness substrate for multi-core memory systems. In ASPLOS’10.
[10]
Philip G. Emma. 1997. Understanding some simple processor-performance limits. IBM J. Res. Dev. 41, 3 (May 1997).
[11]
Richard Fackenthal, Makoto Kitagawa, Wataru Otsuka, Kirk Prall, Duane Mills, Keiichi Tsutsui, Jahanshir Javanifard, Kerry Tedrow, Tomohito Tsushima, Yoshiyuki Shibahara, and Glen Hush. 2014. 19.7 A 16Gb ReRAM with 200MB/s write and 1GB/s read in 27nm technology. In ISSCC’14.
[12]
Alexandre P. Ferreira, Bruce Childers, Rami Melhem, Daniel Mossé, and Mazin Yousif. 2010a. Using PCM in next-generation embedded space applications. In RTAS’10.
[13]
Alexandre P. Ferreira, Miao Zhou, Santiago Bock, Bruce Childers, Rami Melhem, and Daniel Mossé. 2010b. Increasing PCM main memory lifetime. In DATE’10.
[14]
Allan Hartstein, Viji Srinivasan, Thomas R. Puzak, and Philip G. Emma. 2006. In Computing Frontiers’06.
[15]
Henry Cook, Miquel Moreto, Sarah Bird, Khanh Dao, David A. Patterson, and Krste Asanovic. 2013. A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness. In ISCA’13.
[16]
Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana. 2008. Self-optimizing memory controllers: A reinforcement learning approach. In ISCA’08.
[17]
Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA’08.
[18]
Juan A. Colmenares, Gage Eads, Steven Hofmeyr, Sarah Bird, Miquel Moreto, David Chou, Brian Gluzman, Eric Roman, Davide B. Bartolini, Nitesh Mor, Krste Asanovic, and John D. Kubiatowicz. 2013. Tessellation: Refactoring the OS around explicit resource containers with continuous adaptation. In DAC’13.
[19]
Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu. 2013. Evaluating STT-RAM as an energy-efficient main memory alternative. In ISPASS’13.
[20]
Kyle J. Nesbit, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Mateo Valero, and James E. Smith. 2008. Multicore resource management. IEEE Micro 28, 3 (May 2008).
[21]
Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009. Architecting phase change memory as a scalable DRAM alternative. In ISCA’09.
[22]
Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. When prefetching works, when it doesn’t, and why. ACM Trans. Archit. Code Optim. 9, 1 (March 2012).
[23]
Fang Liu, Xiaowei Jiang, and Yan Solihin. 2010. Understanding how off-chip memory bandwidth partitioning in chip multiprocessors affects system performance. In HPCA’10.
[24]
Kun Luo, J. Gummaraju, and M. Franklin. 2001. Balancing throughput and fairness in SMT processors. In ISPASS’01.
[25]
Yong Luo, Olaf M. Lubeck, Harvey Wasserman, Federico Bassetti, and Kirk W. Cameron. 1998. Development and validation of a hierarchical memory model incorporating CPU- and memory-operation overlap model. In Proc. of the 1st Intl. Workshop on Software and Performance.
[26]
Rustam Miftakhutdinov, Eiman Ebrahimi, and Yale N. Patt. 2012. Predicting performance impact of DVFS for realistic memory systems. In MICRO’12.
[27]
Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Rizos Sakellariou, and Mateo Valero. 2009. FlexDCP: A QoS framework for CMP architectures. SIGOPS Oper. Syst. Rev. 43, 2 (April 2009).
[28]
Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and Mateo Valero. 2008. MLP-aware dynamic cache partitioning. In HiPEAC’08.
[29]
Kyle J. Nesbit, Nidhi Aggarwal, James Laudon, and James E. Smith. 2006. Fair queuing memory systems. In MICRO’06.
[30]
John D. Owens, Peter Mattson, Ujval J. Kapasi, William J. Dally, and Scott Rixner. 2000. Memory access scheduling. In ISCA’00.
[31]
Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hallberg, Johan Hogberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. 2002. Simics: A full system simulation platform. Computer 35, 2 (Feb. 2002).
[32]
Moinuddin K. Qureshi and Yale N. Patt. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO’06.
[33]
Moinuddin K. Qureshi, John Karidis, Michele Franceschini, Vijayalakshmi Srinivasan, Luis Lastras, and Bulent Abali. 2009a. Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In MICRO’09.
[34]
Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. 2009b. Scalable high performance main memory system using phase-change memory technology. In ISCA’09.
[35]
Moinuddin K. Qureshi, Michele Franceschini, and Luis Alfonso Lastras-Montaño. 2010. Improving read performance of phase change memories via write cancellation and write pausing. In HPCA’10.
[36]
Daniel Sanchez and Christos Kozyrakis. 2011. Vantage: Scalable and efficient fine-grain cache partitioning. In ISCA’11.
[37]
Sangbeom Kang et al. 2006. A 0.1 μm 1.8V 256Mb 66MHz synchronous burst PRAM. In ISSCC’06.
[38]
Allan Snavely, Dean M. Tullsen, and Geoff Voelker. 2002. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In SIGMETRICS’02.
[39]
Jeffrey Stuecheli, Dimitris Kaseridis, David Daly, Hillery C. Hunter, and Lizy K. John. 2010. The virtual write queue: Coordinating DRAM and last-level cache policies. In ISCA’10.
[40]
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu. 2013. MISE: Providing performance predictability and improving fairness in shared main memory systems. In HPCA’13.
[41]
Gookwon E. Suh, Larry S. Rudolph, and Srinivas Devadas. 2004. Dynamic partitioning of shared cache memory. J. Supercomput. 28, 1 (April 2004).
[42]
Augusto Vega, Alper Buyuktosunoglu, Heather Hanson, Pradip Bose, and Srinivasan Ramani. 2013. Crank it up or dial it down: Coordinated multiprocessor frequency and folding control. In MICRO’13.
[43]
Ruisheng Wang, Lizhong Chen, and Timothy Pinkston. 2013. An analytical performance model for partitioning off-chip memory bandwidth. In IPDPS’13.
[44]
Weixun Wang, P. Mishra, and S. Ranka. 2011. Dynamic cache reconfiguration and partitioning for energy optimization in real-time multi-core systems. In DAC’11.
[45]
Xiaodong Wang and J. F. Martinez. 2015. XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures. In HPCA’15.
[46]
Ying Ye, Richard West, Zhuoqun Cheng, and Ye Li. 2014. COLORIS: A dynamic cache partitioning system using page coloring. In PACT’14.
[47]
Chenjie Yu and Peter Petrov. 2010. Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms. In DAC’10.
[48]
Wangyuan Zhang and Tao Li. 2009. Exploring phase change memory and 3D die-stacking for power/thermal friendly, fast and durable memory architectures. In PACT’09.
[49]
Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. 2000. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In MICRO’00.
[50]
Jishen Zhao, Cong Xu, and Yuan Xie. 2011. Bandwidth-aware reconfigurable cache design with hybrid memory technologies. In ICCAD’11.
[51]
Miao Zhou, Santiago Bock, Alexandre Ferreira, Bruce Childers, Rami Melhem, and Daniel Mossé. 2011. Real-time scheduling for phase change main memory systems. In TrustCom’11.
[52]
Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, and Daniel Mossé. 2012. Writeback-aware partitioning and replacement for last-level caches in phase change main memory systems. ACM Trans. Archit. Code Optim. 8, 4 (Jan. 2012).
[53]
Miao Zhou, Yu Du, Bruce R. Childers, Rami Melhem, and Daniel Mossé. 2013. Writeback-aware bandwidth partitioning for multi-core systems with PCM. In PACT’13.
[54]
Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. 2009. A durable and energy efficient main memory using phase change memory technology. In ISCA’09.

Cited By

  • (2019) Enforcing Last-Level Cache Partitioning through Memory Virtual Channels. 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 97-109. DOI: 10.1109/PACT.2019.00016. Online publication date: Sep-2019
  • (2018) Cross-layer optimization for many-to-one wireless video streaming systems. Multimedia Tools and Applications 77:19, 24789-24811. DOI: 10.1007/s11042-018-5698-x. Online publication date: 1-Oct-2018
  • (2017) A Survey of Techniques for Cache Partitioning in Multicore Processors. ACM Computing Surveys 50:2, 1-39. DOI: 10.1145/3062394. Online publication date: 10-May-2017

Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4
January 2016
848 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2836331
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 January 2016
Accepted: 01 November 2015
Revised: 01 September 2015
Received: 01 June 2015
Published in TACO Volume 12, Issue 4

Author Tags

  1. Shared resource
  2. cache partitioning
  3. memory bandwidth partitioning
  4. nonvolatile memory
  5. phase change memory

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • PCM@Pitt group
  • NSF

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 45
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 24 Sep 2024

Citations

Cited By

  • (2016) A First Look at Quality of Mobile Live Streaming Experience. Proceedings of the 2016 Internet Measurement Conference, 477-483. DOI: 10.1145/2987443.2987472. Online publication date: 14-Nov-2016
  • (2016) An Adaptive Demand-Based Caching Mechanism for NAND Flash Memory Storage Systems. ACM Transactions on Design Automation of Electronic Systems 22:1, 1-22. DOI: 10.1145/2947658. Online publication date: 13-Dec-2016
