DOI: 10.1145/2749469.2750385

PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture

Published: 13 June 2015

Abstract

Processing-in-memory (PIM) is rapidly rising as a viable solution to the memory wall crisis, rebounding from its unsuccessful attempts in the 1990s due to practicality concerns, which are now alleviated by recent advances in 3D stacking technologies. However, it is still challenging to integrate PIM architectures with existing systems in a seamless manner due to two common characteristics: unconventional programming models for in-memory computation units and the lack of ability to utilize large on-chip caches.
In this paper, we propose a new PIM architecture that (1) does not change the existing sequential programming models and (2) automatically decides whether to execute PIM operations in memory or processors depending on the locality of data. The key idea is to implement simple in-memory computation using compute-capable memory commands and use specialized instructions, which we call PIM-enabled instructions, to invoke in-memory computation. This allows PIM operations to be interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms with no modification. In addition, we introduce a simple hardware structure that monitors the locality of data accessed by a PIM-enabled instruction at runtime to adaptively execute the instruction at the host processor (instead of in memory) when the instruction can benefit from large on-chip caches. Consequently, our architecture provides the illusion that PIM operations are executed as if they were host processor instructions.
We provide a case study of how ten emerging data-intensive workloads can benefit from our new PIM abstraction and its hardware implementation. Evaluations show that our architecture significantly improves system performance and, more importantly, combines the best parts of conventional and PIM architectures by adapting to the data locality of applications.
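To make the abstraction concrete, the sketch below shows how a sequential kernel might issue a PIM operation through an ordinary-looking call. This is a minimal illustration under stated assumptions, not the paper's actual interface: the intrinsic name pim_int_add and its software fallback are hypothetical, and the operation is simply emulated on the host so the sketch compiles anywhere.

    #include <stdio.h>

    /* Hypothetical PEI intrinsic: adds 'val' to '*addr'. On PEI hardware,
     * a runtime locality monitor would execute this update either in the
     * 3D-stacked memory or at the host processor, transparently to the
     * program; here it is emulated with a plain update so the sketch
     * compiles on any machine. */
    static inline void pim_int_add(int *addr, int val) {
        *addr += val;
    }

    /* Example use: accumulating per-vertex values over an edge list, the
     * kind of irregular, cache-unfriendly update pattern that PIM
     * operations target. */
    int main(void) {
        int rank[4] = {0, 0, 0, 0};
        int edges[][2] = { {0, 1}, {1, 2}, {2, 3}, {3, 0}, {1, 3} };
        int n_edges = (int)(sizeof(edges) / sizeof(edges[0]));

        for (int e = 0; e < n_edges; e++) {
            int dst = edges[e][1];
            pim_int_add(&rank[dst], 1); /* candidate PEI: may run in memory */
        }

        for (int v = 0; v < 4; v++)
            printf("rank[%d] = %d\n", v, rank[v]);
        return 0;
    }

In the architecture described above, such code would run unmodified under the existing sequential programming model; the hardware, not the programmer, decides per operation whether the large on-chip caches or in-memory execution is the better target.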

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469

Publisher

Association for Computing Machinery, New York, NY, United States
