research-article

Micro-pages: increasing DRAM efficiency with locality-aware data placement

Authors: Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems

Pages 219 - 230

https://doi.org/10.1145/1736020.1736045

Published: 13 March 2010

Abstract

Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request. But only a small fraction of these bits are ever returned back to the CPU. This ends up wasting energy and time to read (and subsequently write back) bits which are used rarely. Traditionally, an open-page policy has been used for uni-processor systems and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, thus destroying the available locality and significantly under-utilizing the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems.
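The locality loss described above can be illustrated with a toy model. The following is a minimal sketch, not the paper's simulation setup: it assumes a single DRAM bank with an open-page policy, the 8 KB row buffer mentioned in the abstract, and a 64-byte cache block (an assumption). The function names and the two synthetic access streams are hypothetical.

```python
ROW_BYTES = 8 * 1024   # row-buffer size quoted in the abstract
BLOCK_BYTES = 64       # assumed cache-block size


def row_hit_rate(addresses, row_bytes=ROW_BYTES):
    """Fraction of accesses that find their row already open.

    Toy model: one bank, open-page policy, no scheduling or timing.
    """
    if not addresses:
        return 0.0
    open_row = None
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1          # row-buffer hit
        else:
            open_row = row     # row-buffer miss: activate the new row
    return hits / len(addresses)


def sequential_stream(base, n_blocks):
    """Addresses of n_blocks consecutive cache blocks starting at base."""
    return [base + i * BLOCK_BYTES for i in range(n_blocks)]


def interleave(*streams):
    """Round-robin merge of equally long per-core access streams."""
    return [addr for group in zip(*streams) for addr in group]


# Two hypothetical cores, each streaming through its own region.
core0 = sequential_stream(0, 256)
core1 = sequential_stream(1 << 20, 256)

alone = row_hit_rate(core0)
mixed = row_hit_rate(interleave(core0, core1))
print(f"single stream: {alone:.3f}, interleaved: {mixed:.3f}")
```

In this deliberately extreme case a stream that hits the open row almost every time on its own never hits once it is interleaved with a second stream mapped to a different row. Real systems have multiple banks and access scheduling, so the loss is smaller in practice.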

The schemes presented here are motivated by our observations that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks. Thus, the co-location of chunks (from different OS pages) in a row-buffer will improve the overall utilization of the row buffer contents, and consequently reduce memory energy consumption and access time. Such co-location can be achieved in many ways, notably involving a reduction in OS page size and software or hardware assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved along with energy and performance improvements from each scheme. On average, for applications with room for improvement, our best performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).
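The benefit of co-locating hot chunks can be sketched in the same toy style. This is an illustration under stated assumptions, not the mechanism evaluated in the paper: it assumes one bank with an open-page policy, 4 KB OS pages, a 1 KB chunk size, and a simple remapping table (`remap`) standing in for whatever software or hardware migration support is used. All names and sizes other than the 8 KB row buffer are hypothetical.

```python
ROW_BYTES = 8 * 1024     # row-buffer size quoted in the abstract
PAGE_BYTES = 4 * 1024    # assumed OS page size
CHUNK_BYTES = 1024       # assumed size of a hot "chunk"
BLOCK_BYTES = 64         # assumed cache-block size


def row_hit_rate(addresses):
    """Open-page hit rate for a single bank (toy model, no timing)."""
    open_row, hits = None, 0
    for addr in addresses:
        row = addr // ROW_BYTES
        if row == open_row:
            hits += 1
        else:
            open_row = row
    return hits / len(addresses)


def hot_chunk_accesses(page_bases, rounds):
    """Round-robin over the first chunk of each hot page."""
    blocks_per_chunk = CHUNK_BYTES // BLOCK_BYTES
    trace = []
    for r in range(rounds):
        offset = (r % blocks_per_chunk) * BLOCK_BYTES
        for base in page_bases:
            trace.append(base + offset)
    return trace


def translate(addr, remap):
    """Apply a chunk-granularity remapping; unmapped chunks stay in place."""
    chunk_base = addr - addr % CHUNK_BYTES
    return remap.get(chunk_base, chunk_base) + addr % CHUNK_BYTES


# Eight hot pages, spaced so that each one falls in a different DRAM row.
hot_pages = [i * 4 * PAGE_BYTES for i in range(8)]
trace = hot_chunk_accesses(hot_pages, rounds=16)

# Hypothetical migration: pack the eight hot chunks into one reserved row.
reserved_base = 1 << 24
remap = {base: reserved_base + i * CHUNK_BYTES
         for i, base in enumerate(hot_pages)}

before = row_hit_rate(trace)
after = row_hit_rate([translate(a, remap) for a in trace])
print(f"hit rate before: {before:.3f}, after co-location: {after:.3f}")
```

Because eight 1 KB chunks fit in one 8 KB row, every access after the first lands in the open row once the chunks are packed together. The sketch ignores the cost of migrating data and maintaining the translation table, which are among the trade-offs the abstract says the paper examines.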

Index Terms

Micro-pages: increasing DRAM efficiency with locality-aware data placement
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems

March 2010

422 pages

ISBN:9781605588391

DOI:10.1145/1736020

General Chair: James C. Hoe, Carnegie Mellon University, USA
Program Chair: Vikram S. Adve, University of Illinois at Urbana-Champaign, USA

ACM SIGARCH Computer Architecture News Volume 38, Issue 1
ASPLOS '10
March 2010
399 pages
ISSN:0163-5964
DOI:10.1145/1735970
ACM SIGPLAN Notices Volume 45, Issue 3
ASPLOS '10
March 2010
399 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1735971

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '10

ASPLOS '10: Architectural Support for Programming Languages and Operating Systems

March 13 - 17, 2010

Pittsburgh, Pennsylvania, USA

Acceptance Rates

ASPLOS XV Paper Acceptance Rate 32 of 181 submissions, 18%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%
