
research-article
Open access

BPM/BPM+: Software-based dynamic memory partitioning mechanisms for mitigating DRAM bank-/channel-level interferences in multicore systems

Published: 01 February 2014

Abstract

The main memory system is a shared resource in modern multicore machines; contention for it can cause serious interference, reducing throughput and fairness. Many new memory scheduling mechanisms have been proposed to address the interference problem. However, these mechanisms usually employ relatively complex scheduling logic and require modifications to Memory Controllers (MCs), which incur expensive hardware design and manufacturing overheads.
This article presents a practical software approach that effectively eliminates the interference without any hardware modification. The key idea is to modify the OS memory management system to adopt a page-coloring-based Bank-level Partitioning Mechanism (BPM), which allocates dedicated DRAM banks to each core (or thread). With BPM, memory requests from distinct programs are segregated across different memory banks, promoting locality and fairness and reducing interference. We further extend BPM to BPM+ by incorporating channel-level partitioning, which we show yields additional gains over BPM in many cases. To benefit applications with diverse memory needs while avoiding performance degradation from resource underutilization, we propose a dynamic mechanism on top of BPM/BPM+ that assigns appropriate bank/channel resources based on application memory and bandwidth demands, monitored through the PMU (performance-monitoring unit) and a low-overhead OS page-table scanning process.
We implement BPM/BPM+ in the Linux 2.6.32.15 kernel and evaluate the technique on four-core and eight-core real machines by running a large number of randomly generated multiprogrammed and multithreaded workloads. Experimental results show that BPM/BPM+ improves overall system throughput by 4.7%/5.9% on average (up to 8.6%/9.5%) and reduces unfairness by an average of 4.2%/6.1% (up to 15.8%/13.9%).
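The page-coloring idea behind BPM can be sketched in a few lines of C. The bit positions and the XOR-based bank mapping below are illustrative assumptions (real DRAM address mappings are controller-specific and must be determined per platform), and `bank_color`/`page_allowed` are hypothetical helpers, not the paper's code:

```c
#include <stdint.h>

/* Illustrative sketch of page-coloring-based bank partitioning (BPM).
 * ASSUMPTION: the bank index comes from physical address bits 14..16,
 * XORed with row bits 20..22 (permutation-based interleaving). Real
 * mappings differ by chipset and must be reverse-engineered or taken
 * from the memory controller's documentation. */
unsigned bank_color(uint64_t paddr)
{
    unsigned bank_bits = (paddr >> 14) & 0x7;
    unsigned row_bits  = (paddr >> 20) & 0x7;
    return bank_bits ^ row_bits;   /* one of 8 bank colors */
}

/* A BPM-style allocator hands a core only those free pages whose color
 * falls in the core's assigned set, encoded here as a bitmask. */
int page_allowed(uint64_t paddr, uint8_t color_mask)
{
    return (color_mask >> bank_color(paddr)) & 1;
}
```

Under this mapping, two cores given disjoint color masks (say `0x0F` and `0xF0`) never touch the same banks, so their row-buffer localities cannot interfere with each other.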

References

[1]
N. Aggarwal, et al. 2008. Power Efficient DRAM Speculation. In HPCA-14.
[2]
R. Azimi, D. K. Tam, L. Soares, and M. Stumm. 2009. Enhancing Operating System Support for Multicore Processors by Using Hardware Performance Monitoring. ACM SIGOPS Operating Systems Review 43, 2, 56--65.
[3]
Y. Bao, et al. 2008. HMTT: A Platform Independent Full-System Memory Trace Monitoring System. In SIGMETRICS'08.
[4]
S. Beamer, et al. 2010. Re-Architecting DRAM Memory Systems with Monolithically Integrated Silicon Photonics. In ISCA-37.
[5]
C. Bienia, et al. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Technical Report TR-811-08, Princeton University.
[6]
S. Cho and L. Jin. 2006. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In MICRO-39.
[7]
Z. Cui, et al. 2011. A Fine-grained Component-Level Power Measurement Method. In IGCC'11.
[8]
R. Das, et al. 2013. Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems. In HPCA'13.
[9]
J. Demme, et al. 2011. Rapid Identification of Architectural Bottlenecks via Precise Event Counting. In ISCA'11.
[10]
G. Dhiman, G. Marchetti, and T. Rosing. 2009. vGreen: A System for Energy Efficient Computing in Virtualized Environments. In ISLPED'09.
[11]
E. Ebrahimi, et al. 2010. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems. In ASPLOS'10.
[12]
Hewlett-Packard Development Company Perfmon Project. 2005. Retrieved from http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html?jumpid=reg_r1002_usen_c-001_title_r0001.
[13]
I. Hur and C. Lin. 2007. Memory Scheduling for Modern Microprocessors. ACM Transactions on Computer Systems 25, 4, Article 10.
[14]
R. Iyer, et al. 2007. QoS policy and Architecture for Cache/Memory in CMP Platforms. In SIGMETRICS'07.
[15]
M. K. Jeong, et al. 2012. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. In HPCA-18.
[16]
Y. Kim, et al. 2010. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA-16.
[17]
Y. Kim, M. Papamichael, and O. Mutlu. 2010. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO-43.
[18]
D. Kaseridis, J. Stuecheli, and L. K. John. 2011. Minimalist Open-Page: A DRAM Page-Mode Scheduling Policy for the Many-Core Era. In MICRO-44.
[19]
R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. 2008. Using OS Observations to Improve Performance in Multicore Systems. In MICRO-41.
[20]
C. J. Lee, et al. 2009. Improving Memory Bank-level Parallelism in the Presence of Prefetching. In MICRO-42.
[21]
J. Liedtke, H. Haertig, and M. Hohmuth. 1997. OS-Controlled Cache Predictability for Real-Time Systems. In RTAS-3.
[22]
Linux/RK. 2013. Homepage. Retrieved from https://rtml.ece.cmu.edu/redmine/projects/rk.
[23]
J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. 2008. Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems. In HPCA-14.
[24]
L. Liu, et al. 2012. A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems. In PACT-21.
[25]
W. Mi, et al. 2010. Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors. In Proceedings of the 2010 IFIP International Conference on Network and Parallel Computing (NPC).
[26]
T. Moscibroda and O. Mutlu. 2007. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In USENIX Security.
[27]
S. P. Muralidhara, et al. 2011. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. In MICRO-44.
[28]
O. Mutlu and T. Moscibroda. 2008. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA-35.
[29]
O. Mutlu and T. Moscibroda. 2007. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO-40.
[30]
C. Natarajan, B. Christenson, and F. Briggs. 2004. A Study of Performance Impact of MC Features in Multi-Processor Environment. In Proceedings of WMPI.
[31]
K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. 2006. Fair Queuing Memory Systems. In MICRO-39.
[32]
H. Park, et al. 2013. Regularities Considered Harmful: Forcing Randomness to Memory Accesses to Reduce Row Buffer Conflicts for Multi-Core, Multi-Bank Systems. In ASPLOS'13.
[33]
M. K. Qureshi and Y. N. Patt. 2006. Utility-based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In MICRO-39.
[34]
S. Rixner, et al. 2000. Memory Access Scheduling. In ISCA-27.
[35]
B. Rogers, et al. 2009. Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling. In ISCA-36.
[36]
Standard Performance Evaluation Corporation. 2011. SPEC CPU2006. Retrieved from http://www.spec.org/cpu2006.
[37]
K. Sudan, et al. 2010. Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. In ASPLOS'10.
[38]
G. E. Suh, L. Rudolph, and S. Devadas. 2004. Dynamic Partitioning of Shared Cache Memory. Journal of Supercomputing 28, 1, 7--26.
[39]
G. E. Suh, S. Devadas, and L. Rudolph. 2002. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. In HPCA-8.
[40]
N. Suzuki, et al. 2013. Coordinated Bank and Cache Coloring for Temporal Protection of Memory Accesses. In ICESS'13.
[41]
H. S. Stone, J. Turek, and J. L. Wolf. 1992. Optimal Partitioning of Cache Memory. IEEE Transactions on Computers 41, 9, 1054--1068.
[42]
L. Subramanian, et al. 2013. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In HPCA'13.
[43]
A. Udipi, et al. 2010. Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores. In ISCA'10.
[44]
G. L. Yuan, et al. 2009. Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures. In MICRO-42.
[45]
X. Zhang, S. Dwarkadas, and K. Shen. 2009. Hardware Execution Throttling for Multi-Core Resource Management. In USENIX ATC'09.
[46]
Z. Zhang, Z. Zhu, and X. Zhang. 2000. A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality. In MICRO-33.
[47]
S. Zhuravlev, S. Blagodurov, and A. Fedorova. 2010. Addressing Shared Resource Contention in Multicore Processors via Scheduling. In ASPLOS'10.



Reviews

Anshuman Gupta

Dynamic random access memory (DRAM) interference in shared memory systems can, as demonstrated in this paper, lead to degraded performance. The paper proposes a software-based solution that isolates applications from one another by allocating different banks (the bank-level partitioning mechanism, BPM) and channels (BPM+) to different applications, thereby improving performance. It also proposes dynamically allocating banks to applications based on their memory requirements, with the objective of attaining uniform channel utilization.

The paper attacks the important problem of interference in shared memory systems, and does an excellent job of quantifying the harmful effects of interference and showing how those effects will worsen in the future. Moreover, it clearly quantifies the benefits of resource isolation and dynamic resource allocation, reaffirming ideas that have been proposed in the past. However, the paper's software-based solution for reducing interference, determining memory requirements, and computing dynamic allocations lacks detail, is poorly evaluated, and might not be lightweight enough to be useful in real systems with rapidly changing application phases or complicated memory behavior. There is also no analysis of how the proposed solution scales to future systems, which will share ever more resources among an increasing number of cores.

Here are some key thoughts on the paper in more detail. The paper quantifies the impact of interference on fairness and throughput: the uneven and large slowdowns that applications suffer from bandwidth contention or row-buffer conflicts can degrade overall system performance. It demonstrates that resource isolation reduces interference, which in turn reduces unfairness and improves throughput. This, however, comes at the cost of fragmentation: allocating banks and channels to an application with smaller requirements can lead to underutilization. The paper demonstrates that dynamic resource allocation is required because static, fixed allocation is insufficient; the system should be able to change allocations as application behavior and overall resource availability change.

The evaluation is not very thorough. While the authors use modern benchmarks, both single-threaded and multithreaded, they do not show results for all benchmark combinations under all scenarios, and the results look cherry-picked. Some claims are not backed by data: for example, dynamic allocation is evaluated at ten-second intervals, but no justification is given for that interval. The paper shows a graph relating channel utilization mismatch to performance improvement and concludes that better balance in channel utilization leads to higher performance; this may be correlation rather than causation, since higher channel utilization overall would produce both better balance and higher performance.

While resource isolation helps reduce interference and dynamic allocation helps improve resource utilization, both mechanisms must be lightweight and accurate to keep up with changing application phases, and the paper's software approach might be too expensive and too slow to adapt to phase changes. The paper uses the last-level cache (LLC) miss rate to characterize an application's memory behavior, which is insufficient, since a low miss rate can be attributed either to a large working set size or to low memory-level parallelism in the current phase; how application memory requirements are derived from the LLC miss rate is not explained. The authors comment that hardware solutions are complicated, but the software solution might be more expensive in execution time and energy, and they provide no evaluation of the mechanism's performance and energy overheads. Finally, the related work section overlooks very recent work in this area (for example, TimeCube [1]).

Overall, I would recommend reading this paper for insight into the worsening effects of interference in shared DRAM systems and the qualitative and quantitative benefits of resource isolation and dynamic allocation. However, I would be cautious about the paper's software-based solution for achieving these two properties.

Online Computing Reviews Service
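The reviewer's point about uniform channel utilization can be made concrete with a small sketch: given per-application bandwidth demands (for instance, estimated from PMU miss counters), a greedy pass places each application on the currently least-loaded channel. The function, names, and numbers below are illustrative assumptions, not the paper's actual algorithm:

```c
/* Greedy channel assignment aiming at balanced channel utilization.
 * demands[] holds per-application bandwidth estimates, assumed sorted
 * in descending order; each application is placed on the least-loaded
 * channel. Purely a sketch of the balancing objective, not BPM+ code. */
void assign_channels(const double *demands, int napps,
                     double *load, int nch, int *channel_of)
{
    for (int c = 0; c < nch; c++)
        load[c] = 0.0;
    for (int a = 0; a < napps; a++) {
        int best = 0;
        for (int c = 1; c < nch; c++)
            if (load[c] < load[best])
                best = c;              /* pick least-loaded channel */
        channel_of[a] = best;
        load[best] += demands[a];
    }
}
```

For demands {5, 4, 3, 2} over two channels, this yields loads of 7 and 7, i.e., zero utilization mismatch; the correlation-vs-causation concern raised above is about whether such balance, by itself, is what produces the speedups.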




Published In

ACM Transactions on Architecture and Code Optimization  Volume 11, Issue 1
February 2014
373 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2591460
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2014
Accepted: 01 December 2013
Revised: 01 November 2013
Received: 01 June 2013
Published in TACO Volume 11, Issue 1


Author Tags

  1. Main memory
  2. interference
  3. memory scheduling
  4. multicore

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 138
  • Downloads (last 6 weeks): 26
Reflects downloads up to 14 Dec 2024.


Cited By

  • (2024) REDB. Journal of Systems Architecture: the EUROMICRO Journal 151:C. DOI: 10.1016/j.sysarc.2024.103135. Online publication date: 1-Jun-2024.
  • (2023) QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs. ACM Transactions on Architecture and Code Optimization 21:1 (1-25). DOI: 10.1145/3632955. Online publication date: 14-Nov-2023.
  • (2021) Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller. Electronics 10:4 (438). DOI: 10.3390/electronics10040438. Online publication date: 10-Feb-2021.
  • (2021) Monitoring Memory Behaviors and Mitigating NUMA Drawbacks on Tiered NVM Systems. Network and Parallel Computing (386-391). DOI: 10.1007/978-3-030-79478-1_33. Online publication date: 23-Jun-2021.
  • (2019) Hierarchical Hybrid Memory Management in OS for Tiered Memory Systems. IEEE Transactions on Parallel and Distributed Systems 30:10 (2223-2236). DOI: 10.1109/TPDS.2019.2908175. Online publication date: 1-Oct-2019.
  • (2019) Research on Shared Resource Contention of Cloud Data Center. High-Performance Computing Applications in Numerical Simulation and Edge Computing (186-197). DOI: 10.1007/978-981-32-9987-0_16. Online publication date: 29-Aug-2019.
  • (2018) A performance & power comparison of modern high-speed DRAM architectures. Proceedings of the International Symposium on Memory Systems (341-353). DOI: 10.1145/3240302.3240315. Online publication date: 1-Oct-2018.
  • (2018) A case for richer cross-layer abstractions. Proceedings of the 45th Annual International Symposium on Computer Architecture (207-220). DOI: 10.1109/ISCA.2018.00027. Online publication date: 2-Jun-2018.
  • (2017) Multiple Physical Mappings. Proceedings of the 8th Asia-Pacific Workshop on Systems (1-9). DOI: 10.1145/3124680.3124742. Online publication date: 2-Sep-2017.
  • (2017) Towards "Full Containerization" in Containerized Network Function Virtualization. ACM SIGARCH Computer Architecture News 45:1 (467-481). DOI: 10.1145/3093337.3037713. Online publication date: 4-Apr-2017.
