Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1736020.1736045acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

Micro-pages: increasing DRAM efficiency with locality-aware data placement

Published: 13 March 2010 Publication History

Abstract

Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request. But only a small fraction of these bits are ever returned back to the CPU. This ends up wasting energy and time to read (and subsequently write back) bits which are used rarely. Traditionally, an open-page policy has been used for uni-processor systems and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, thus destroying the available locality and significantly under-utilizing the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems.
The schemes presented here are motivated by our observations that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks. Thus, the co-location of chunks (from different OS pages) in a row-buffer will improve the overall utilization of the row buffer contents, and consequently reduce memory energy consumption and access time. Such co-location can be achieved in many ways, notably involving a reduction in OS page size and software or hardware assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved along with energy and performance improvements from each scheme. On average, for applications with room for improvement, our best performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).

References

[1]
STREAM -- Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream/.
[2]
Virtutech Simics Full System Simulator. http://www.virtutech.com.
[3]
Java Server Benchmark, 2005. Available at http://www.spec.org/jbb2005/.
[4]
K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung. BioBench: A Benchmark Suite of Bioinformatics Applications. In Proceedings of ISPASS, 2005.
[5]
K. Asanovic and et. al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical report, EECS Department, University of California, Berkeley, 2006.
[6]
M. Awasthi, K. Sudan, R. Balasubramonian, and J. Carter. Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches. In Proceedings of HPCA, 2009.
[7]
D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, D. Dagum, R. Fatoohi, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. International Journal of Supercomputer Applications, 5(3): 63.73, Fall 1991.
[8]
L. Barroso and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool, 2009.
[9]
C. Benia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Technical report, Department of Computer Science, Princeton University, 2008.
[10]
B. Bershad, B. Chen, D. Lee, and T. Romer. Avoiding Conflict Misses Dynamically in Large Direct-Mapped Caches. In Proceedings of ASPLOS, 1994.
[11]
J. Carter, W. Hsieh, L. Stroller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a Smarter Memory Controller. In Proceedings of HPCA, 1999.
[12]
R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum. Scheduling and Page Migration for Multiprocessor Compute Servers. In Proceedings of ASPLOS, 1994.
[13]
M. Chaudhuri. PageNUCA: Selected Policies For Page-Grain Locality Management In Large Shared Chip-Multiprocessor Caches. In Proceedings of HPCA, 2009.
[14]
S. Cho and L. Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In Proceedings of MICRO, 2006.
[15]
J. Corbalan, X. Martorell, and J. Labarta. Page Migration with Dynamic Space-Sharing Scheduling Policies: The case of SGI 02000. International Journal of Parallel Programming, 32(4), 2004.
[16]
R. Crisp. Direct Rambus Technology: The New Main Memory Standard. In Proceedings of MICRO, 1997.
[17]
V. Cuppu and B. Jacob. Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance. In Proceedings of ISCA, 2001.
[18]
V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. In Proceedings of ISCA, 1999.
[19]
V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. Irwin. DRAM Energy Management Using Software and Hardware Directed Power Mode Control. In Proceedings of HPCA, 2001.
[20]
X. Ding, D. S. Nikopoulosi, S. Jiang, and X. Zhang. MESA: Reducing Cache Conflicts by Integrating Static and Run-Time Methods. In Proceedings of ISPASS, 2006.
[21]
X. Fan, H. Zeng, and C. Ellis. Memory Controller Policies for DRAM Power Management. In Proceedings of ISLPED, 2001.
[22]
Z. Fang, L. Zhang, J. Carter, S. McKee, and W. Hsieh. Online Superpage Promotion Revisited (Poster Session). SIGMETRICS Perform. Eval. Rev., 2000.
[23]
N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement And Replication In Distributed Caches. In Proceedings of ISCA, 2009.
[24]
J. L. Henning. SPEC CPU2006 Benchmark Descriptions. In Proceedings of ACM SIGARCH Computer Architecture News, 2005.
[25]
H. Huang, P. Pillai, and K. G. Shin. Design And Implementation Of Power-Aware Virtual Memory. In Proceedings Of The Annual Conference On Usenix Annual Technical Conference, 2003.
[26]
H. Huang, K. Shin, C. Lefurgy, and T. Keller. Improving Energy Efficiency by Making DRAM Less Randomly Accessed. In Proceedings of ISLPED, 2005.
[27]
Intel 845G/845GL/845GV Chipset Datasheet: Intel 82845G/82845GL/82845GV Graphics and Memory Controller Hub (GMCH). Intel Corporation, 2002. http://download.intel.com/design/chipsets/datashts/29074602.pdf.
[28]
ITRS. International Technology Roadmap for Semiconductors, 2007 Edition. http://www.itrs.net/Links/2007ITRS/Home2007.htm.
[29]
B. Jacob, S.W. Ng, and D. T.Wang. Memory Systems -- Cache, DRAM, Disk. Elsevier, 2008.
[30]
JEDEC. JESD79: Double Data Rate (DDR) SDRAM Specification. JEDEC Solid State Technology Association, Virginia, USA, 2003.
[31]
N. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of ISCA-17, pages 364.373, May 1990.
[32]
R. E. Kessler and M. D. Hill. Page Placement Algorithms for Large Real-Indexed Caches. ACM Trans. Comput. Syst., 10(4), 1992.
[33]
D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley, third edition, 1997.
[34]
R. LaRowe and C. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. Technical report, 1990.
[35]
R. LaRowe and C. Ellis. Page Placement policies for NUMA multiprocessors. J. Parallel Distrib. Comput., 11(2), 1991.
[36]
R. LaRowe, J. Wilkes, and C. Ellis. Exploiting Operating System Support for Dynamic Page Placement on a NUMA Shared Memory Multiprocessor. In Proceedings of PPOPP, 1991.
[37]
K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. Reinhardt, and T. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of ISCA, 2009.
[38]
K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt. Understanding and Designing New Server Architectures for Emerging Warehouse--Computing Environments. In Proceedings of ISCA, 2008.
[39]
W. Lin, S. Reinhardt, and D. Burger. Designing a Modern Memory Hierarchy with Hardware Prefetching. In Proceedings of IEEE Transactions on Computers, 2001.
[40]
P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50.58, February 2002.
[41]
Micron DDR2 SDRAM Part MT47H64M8. Micron Technology Inc., 2004.
[42]
R. Min and Y. Hu. Improving Performance of Large Physically Indexed Caches by Decoupling Memory Addresses from Cache Addresses. IEEE Trans. Comput., 50(11), 2001.
[43]
N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proceedings of MICRO, 2007.
[44]
O. Mutlu and T. Moscibroda. Stall--Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of MICRO, 2007.
[45]
O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In Proceedings of ISCA, 2008.
[46]
J. Navarro, S. Iyer, P. Druschel, and A. Cox. Practical, Transparent Operating System
[47]
N. Rafique, W. Lim, and M. Thottethodi. Architectural Support for Operating System Driven CMP Cache Management. In Proceedings of PACT, 2006.
[48]
S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory Access Scheduling. In Proceedings of ISCA, 2000.
[49]
T. Romer, W. Ohlrich, A. Karlin, and B. Bershad. Reducing TLB and Memory Overhead Using Online Superpage Promotion. In Proceedings of ISCA-22, 1995.
[50]
T. Sherwood, B. Calder, and J. Emer. Reducing Cache Misses Using Hardware and Software Page Placement. In Proceedings of SC, 1999.
[51]
A. Snavely, D. Tullsen, and G. Voelker. Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor. In Proceedings of SIGMETRICS, 2002.
[52]
M. Swanson, L. Stoller, and J. Carter. Increasing TLB Reach using Superpages Backed by Shadow Memory. In Proceedings of ISCA, 1998.
[53]
M. Talluri and M. D. Hill. Surpassing the TLB Performance of Superpages with Less Operating System Support. In Proceedings of ASPLOS-VI, 1994.
[54]
S. Thoziyoor, N. Muralimanohar, and N. Jouppi. CACTI 5.0. Technical report, HP Laboratories, 2007.
[55]
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. SIGPLAN Not., 31(9), 1996.
[56]
D. Wallin, H. Zeffer, M. Karlsson, and E. Hagersten. VASA: A Simulator Infrastructure with Adjustable Fidelity. In Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems, 2005.
[57]
D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMsim: A Memory-System Simulator. In SIGARCH Computer Architecture News, volume 33, September 2005.
[58]
X. Zhang, S. Dwarkadas, and K. Shen. Hardware Execution Throttling for Multi-core Resource Management. In Proceedings of USENIX, 2009.
[59]
Z. Zhang, Z. Zhu, and X. Zhand. A Permutation-Based Page Interleaving Scheme to Reduce Row--Buffer Conflicts and Exploit Data Locality. In Proceedings of MICRO, 2000.
[60]
H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-Rank: Adaptive DRAM Architecture For Improving Memory Power Efficiency. In Proceedings of MICRO, 2008.
[61]
H. Zheng, J. Lin, Z. Zhang, and Z. Zhu. Decoupled DIMM: Building High-Bandwidth Memory System from Low-Speed DRAM Devices. In Proceedings of ISCA, 2009.
[62]
Z. Zhu and Z. Zhang. A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. In Proceedings of HPCA, 2005.
[63]
Z. Zhu, Z. Zhang, and X. Zhang. Fine-grain Priority Scheduling on Multi-channel Memory Systems. In Proceedings of HPCA, 2002

Cited By

View all
  • (2024)Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM ArchitectureACM Transactions on Architecture and Code Optimization10.1145/3673653Online publication date: 14-Jun-2024
  • (2023)A Hardware-Based Approach to Determine the Frequently Accessed DRAM Pages for Multi-Core Systems2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)10.1109/JEEIT58638.2023.10185689(146-153)Online publication date: 22-May-2023
  • (2021)Memory Access Optimization of a Neural Network Accelerator Based on Memory ControllerElectronics10.3390/electronics1004043810:4(438)Online publication date: 10-Feb-2021
  • Show More Cited By

Index Terms

  1. Micro-pages: increasing DRAM efficiency with locality-aware data placement

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
    March 2010
    422 pages
    ISBN:9781605588391
    DOI:10.1145/1736020
    • General Chair:
    • James C. Hoe,
    • Program Chair:
    • Vikram S. Adve
    • cover image ACM SIGARCH Computer Architecture News
      ACM SIGARCH Computer Architecture News  Volume 38, Issue 1
      ASPLOS '10
      March 2010
      399 pages
      ISSN:0163-5964
      DOI:10.1145/1735970
      Issue’s Table of Contents
    • cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 45, Issue 3
      ASPLOS '10
      March 2010
      399 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/1735971
      Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 March 2010

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. data placement
    2. dram row-buffer management

    Qualifiers

    • Research-article

    Conference

    ASPLOS '10

    Acceptance Rates

    ASPLOS XV Paper Acceptance Rate 32 of 181 submissions, 18%;
    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)56
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 21 Sep 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM ArchitectureACM Transactions on Architecture and Code Optimization10.1145/3673653Online publication date: 14-Jun-2024
    • (2023)A Hardware-Based Approach to Determine the Frequently Accessed DRAM Pages for Multi-Core Systems2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)10.1109/JEEIT58638.2023.10185689(146-153)Online publication date: 22-May-2023
    • (2021)Memory Access Optimization of a Neural Network Accelerator Based on Memory ControllerElectronics10.3390/electronics1004043810:4(438)Online publication date: 10-Feb-2021
    • (2020)FIGARO: Improving System Performance via Fine-Grained In-DRAM Data Relocation and Caching2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO50266.2020.00036(313-328)Online publication date: Oct-2020
    • (2020)Reliable Reverse Engineering of Intel DRAM Addressing Using Performance Counters2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)10.1109/MASCOTS50786.2020.9285962(1-8)Online publication date: 17-Nov-2020
    • (2019)Project PBerryProceedings of the Workshop on Hot Topics in Operating Systems10.1145/3317550.3321424(127-135)Online publication date: 13-May-2019
    • (2019)ReTaggerProceedings of the 56th Annual Design Automation Conference 201910.1145/3316781.3317895(1-6)Online publication date: 2-Jun-2019
    • (2019)SOML ReadProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304035(955-969)Online publication date: 4-Apr-2019
    • (2019)Exploiting Latency and Error Tolerance of GPGPU Applications for an Energy-Efficient DRAM2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)10.1109/DSN.2019.00046(362-374)Online publication date: Jun-2019
    • (2018)Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded ProgramsIEICE Transactions on Information and Systems10.1587/transinf.2017EDP7296E101.D:9(2247-2257)Online publication date: 1-Sep-2018
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media