DOI: 10.1145/2749469.2750385

PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture

Published: 13 June 2015

Abstract

Processing-in-memory (PIM) is rapidly rising as a viable solution to the memory wall crisis, rebounding from its unsuccessful attempts in the 1990s due to practicality concerns, which are now alleviated by recent advances in 3D stacking technologies. However, it is still challenging to integrate PIM architectures with existing systems in a seamless manner due to two common characteristics: unconventional programming models for in-memory computation units and the lack of ability to utilize large on-chip caches.
In this paper, we propose a new PIM architecture that (1) does not change the existing sequential programming models and (2) automatically decides whether to execute PIM operations in memory or processors depending on the locality of data. The key idea is to implement simple in-memory computation using compute-capable memory commands and use specialized instructions, which we call PIM-enabled instructions, to invoke in-memory computation. This allows PIM operations to be interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms with no modification. In addition, we introduce a simple hardware structure that monitors the locality of data accessed by a PIM-enabled instruction at runtime to adaptively execute the instruction at the host processor (instead of in memory) when the instruction can benefit from large on-chip caches. Consequently, our architecture provides the illusion that PIM operations are executed as if they were host processor instructions.
We provide a case study of how ten emerging data-intensive workloads can benefit from our new PIM abstraction and its hardware implementation. Evaluations show that our architecture significantly improves system performance and, more importantly, combines the best parts of conventional and PIM architectures by adapting to the data locality of applications.
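To make the abstraction concrete, the sketch below shows how a sequential kernel might issue a PIM operation through an ordinary-looking call. This is a minimal illustration under stated assumptions, not the paper's actual interface: the intrinsic name pim_int_add and its software fallback are hypothetical, and the operation is simply emulated on the host so the sketch compiles anywhere.

    #include <stdio.h>

    /* Hypothetical PEI intrinsic: adds 'val' to '*addr'. On PEI hardware,
     * a runtime locality monitor would execute this update either in the
     * 3D-stacked memory or at the host processor, transparently to the
     * program; here it is emulated with a plain update so the sketch
     * compiles on any machine. */
    static inline void pim_int_add(int *addr, int val) {
        *addr += val;
    }

    /* Example use: accumulating per-vertex values over an edge list, the
     * kind of irregular, cache-unfriendly update pattern that PIM
     * operations target. */
    int main(void) {
        int rank[4] = {0, 0, 0, 0};
        int edges[][2] = { {0, 1}, {1, 2}, {2, 3}, {3, 0}, {1, 3} };
        int n_edges = (int)(sizeof(edges) / sizeof(edges[0]));

        for (int e = 0; e < n_edges; e++) {
            int dst = edges[e][1];
            pim_int_add(&rank[dst], 1); /* candidate PEI: may run in memory */
        }

        for (int v = 0; v < 4; v++)
            printf("rank[%d] = %d\n", v, rank[v]);
        return 0;
    }

In the architecture described above, such code would run unmodified under the existing sequential programming model; the hardware, not the programmer, decides per operation whether the large on-chip caches or in-memory execution is the better target.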

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469

Publisher

Association for Computing Machinery, New York, NY, United States
