GP-SIMD Processing-in-Memory

Authors:

Ran Ginosar

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 4

Article No.: 53, Pages 1 - 26

https://doi.org/10.1145/2686875

Published: 09 January 2015

Abstract

GP-SIMD, a novel hybrid general-purpose SIMD computer architecture, resolves the issue of data synchronization through in-memory computing, combining data storage with massively parallel processing. GP-SIMD employs a two-dimensional access memory with modified SRAM storage cells and a bit-serial processing unit per memory row. An analytic performance model of the GP-SIMD architecture is presented, comparing it with associative processor and conventional SIMD architectures. Cycle-accurate simulation of four workloads supports the analytical comparison. Assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.
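
To make the phrase "a bit-serial processing unit per memory row" more concrete, the following is a minimal sketch and not the authors' design. It models memory as a list of rows of bits and adds two fixed-width fields in every row, one bit position per step, with each row keeping its own carry. The names `to_bits`, `from_bits`, `bit_serial_add`, and `make_row`, the field layout, and the least-significant-bit-first order are all assumptions made for illustration.

```python
# Illustrative model only: names, field layout, and bit order are assumptions,
# not taken from the GP-SIMD article.

def to_bits(value, width):
    """Return `value` as a list of `width` bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits: read a least-significant-first bit list as an int."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(rows, a_off, b_off, out_off, width):
    """Add two `width`-bit fields in every row, one bit position per step.

    Each row keeps a one-bit carry, standing in for a per-row processing unit.
    The outer loop is the serial part (over bit positions); the inner loop
    stands in for all rows working in lockstep. The final carry is dropped,
    so the result wraps modulo 2**width.
    """
    carries = [0] * len(rows)
    for i in range(width):
        for r, row in enumerate(rows):
            a, b, c = row[a_off + i], row[b_off + i], carries[r]
            row[out_off + i] = a ^ b ^ c          # sum bit of a full adder
            carries[r] = (a & b) | (c & (a ^ b))  # carry bit of a full adder
    return rows


def make_row(a, b, width):
    """Lay out one row as [field A | field B | result], each `width` bits."""
    return to_bits(a, width) + to_bits(b, width) + [0] * width


WIDTH = 8
operands = [(3, 5), (100, 27), (255, 1)]
memory = [make_row(a, b, WIDTH) for a, b in operands]
bit_serial_add(memory, 0, WIDTH, 2 * WIDTH, WIDTH)
sums = [from_bits(row[2 * WIDTH:3 * WIDTH]) for row in memory]
print(sums)  # [8, 127, 0]
```

In this toy model the number of serial steps depends on the field width rather than on the number of rows; the last pair wraps to 0 because the carry out of the top bit is discarded.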

Index Terms

GP-SIMD Processing-in-Memory
1. Computer systems organization
  1. Architectures
    1. Other architectures
    2. Parallel architectures
      1. Multiple instruction, multiple data


Published In

ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4

January 2015

797 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2695583

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2015

Accepted: 01 October 2014

Revised: 01 October 2014

Received: 01 May 2014

Published in TACO Volume 11, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)
Hasso-Plattner Institute (HPI)
