DOI: 10.1145/3288599.3288605 · ICDCN Conference Proceedings · Research article

Reconfigurable dataflow graphs for processing-in-memory

Published: 04 January 2019

Abstract

To bridge the ever-widening gap between processor clock speeds and memory access times, there has been growing interest in moving computation closer to memory. Near-data processing, or processing-in-memory (PIM), is particularly well suited to very high-bandwidth memories such as 3D-DRAMs. Several designs have been proposed for PIMs, including simple in-order processors, GPUs, specialized ASICs, and reconfigurable logic. In our case, we use coarse-grained reconfigurable logic to build dataflow graphs for computational kernels as the PIM. We show that our approach can achieve significant speedups and reduce the energy consumed by computations. We evaluated our designs using several process technologies for building the coarse-grained logic units. The DFPIM concept showed good performance improvement and excellent energy efficiency for the streaming benchmarks that were analyzed. A DFPIM in a 28 nm process, with an implementation in each of the 16 vaults of a 3D-DRAM logic layer, showed an average speedup of 7.2 over 32 cores of an Intel Xeon server system. The server processor required 368 times more energy to execute the benchmarks than the DFPIM implementation.
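As an illustrative aside, the dataflow execution model that underlies designs like DFPIM can be sketched in a few lines: each graph node fires as soon as all of its operands have arrived, with no program counter sequencing the work. The sketch below is a generic software model of that firing rule under stated assumptions, not the paper's hardware implementation; all names (`Node`, `run`, the example kernel) are hypothetical.

```python
# Minimal sketch of dataflow-graph execution: a node "fires" once all of
# its input operands are available. This models the execution semantics
# only; it is not the DFPIM hardware design described in the paper.

class Node:
    def __init__(self, name, op, arity):
        self.name, self.op, self.arity = name, op, arity
        self.operands = []      # values received so far
        self.consumers = []     # downstream nodes fed by this node's result

    def receive(self, value, ready):
        self.operands.append(value)
        if len(self.operands) == self.arity:   # all inputs present: fire
            ready.append(self)

def run(sources):
    """Drive the graph to completion; return each node's computed value."""
    ready, results = [], {}
    for node, value in sources:                # inject source operands
        node.receive(value, ready)
    while ready:
        node = ready.pop()
        value = node.op(*node.operands)
        results[node.name] = value
        for consumer in node.consumers:        # forward result on each arc
            consumer.receive(value, ready)
    return results

# Dataflow graph for the kernel y = (a + b) * (a - b)
add = Node("add", lambda x, y: x + y, 2)
sub = Node("sub", lambda x, y: x - y, 2)
mul = Node("mul", lambda x, y: x * y, 2)
add.consumers = [mul]
sub.consumers = [mul]

a, b = 7, 3
out = run([(add, a), (add, b), (sub, a), (sub, b)])
print(out["mul"])   # (7 + 3) * (7 - 3) = 40
```

Because nodes fire purely on operand availability, independent subgraphs (here, `add` and `sub`) can execute in any order or in parallel, which is the property a CGRA-based PIM exploits.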


Cited By

  • (2023) Evaluating Machine Learning Workloads on Memory-Centric Computing Systems. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 35-49. DOI: 10.1109/ISPASS57527.2023.00013. Online publication date: April 2023.


    Published In

    ICDCN '19: Proceedings of the 20th International Conference on Distributed Computing and Networking
    January 2019, 535 pages
    ISBN: 9781450360944
    DOI: 10.1145/3288599
    General Chairs: R. C. Hansdah, Dilip Krishnaswamy, Nitin Vaidya
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D-stacked memories
    2. coarse grained reconfigurable logic
    3. dataflow architectures
    4. processing in memory

    Qualifiers

    • Research-article

    Conference

    ICDCN '19
    Sponsors: SIGOPS, Indian Institute of Science
