DOI: 10.1145/3288599.3288605 · ICDCN Conference Proceedings · Research article

Reconfigurable dataflow graphs for processing-in-memory

Published: 04 January 2019

Abstract

To bridge the ever-widening gap between processor clock speeds and memory access times, there has been growing interest in moving computation closer to memory. Near-data processing, or processing-in-memory (PIM), is particularly well suited to very high-bandwidth memories such as 3D-DRAMs. Several designs have been proposed for PIMs, including simple in-order processors, GPUs, specialized ASICs, and reconfigurable logic. In our case, we use coarse-grained reconfigurable logic to build dataflow graphs for computational kernels as the PIM. We show that our approach can achieve significant speedups and reduce the energy consumed by computations. We evaluated our designs using several process technologies for building the coarse-grained logic units. The DFPIM concept showed good performance improvement and excellent energy efficiency for the streaming benchmarks that were analyzed. A DFPIM in a 28 nm process, with an implementation in each of the 16 vaults of a 3D-DRAM logic layer, showed an average speedup of 7.2 over 32 cores of an Intel Xeon server system. The server processor required 368 times more energy to execute the benchmarks than the DFPIM implementation.
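As an illustrative aside, the dataflow execution model that underlies designs like DFPIM can be sketched in a few lines: each graph node fires as soon as all of its operands have arrived, with no program counter sequencing the work. The sketch below is a generic software model of that firing rule under stated assumptions, not the paper's hardware implementation; all names (`Node`, `run`, the example kernel) are hypothetical.

```python
# Minimal sketch of dataflow-graph execution: a node "fires" once all of
# its input operands are available. This models the execution semantics
# only; it is not the DFPIM hardware design described in the paper.

class Node:
    def __init__(self, name, op, arity):
        self.name, self.op, self.arity = name, op, arity
        self.operands = []      # values received so far
        self.consumers = []     # downstream nodes fed by this node's result

    def receive(self, value, ready):
        self.operands.append(value)
        if len(self.operands) == self.arity:   # all inputs present: fire
            ready.append(self)

def run(sources):
    """Drive the graph to completion; return each node's computed value."""
    ready, results = [], {}
    for node, value in sources:                # inject source operands
        node.receive(value, ready)
    while ready:
        node = ready.pop()
        value = node.op(*node.operands)
        results[node.name] = value
        for consumer in node.consumers:        # forward result on each arc
            consumer.receive(value, ready)
    return results

# Dataflow graph for the kernel y = (a + b) * (a - b)
add = Node("add", lambda x, y: x + y, 2)
sub = Node("sub", lambda x, y: x - y, 2)
mul = Node("mul", lambda x, y: x * y, 2)
add.consumers = [mul]
sub.consumers = [mul]

a, b = 7, 3
out = run([(add, a), (add, b), (sub, a), (sub, b)])
print(out["mul"])   # (7 + 3) * (7 - 3) = 40
```

Because nodes fire purely on operand availability, independent subgraphs (here, `add` and `sub`) can execute in any order or in parallel, which is the property a CGRA-based PIM exploits.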


Cited By

  • (2023) Evaluating Machine Learning Workloads on Memory-Centric Computing Systems. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 35-49. DOI: 10.1109/ISPASS57527.2023.00013. Online publication date: April 2023.


    Published In

    ICDCN '19: Proceedings of the 20th International Conference on Distributed Computing and Networking
    January 2019, 535 pages
    ISBN: 9781450360944
    DOI: 10.1145/3288599
    General Chairs: R. C. Hansdah, Dilip Krishnaswamy, Nitin Vaidya
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D-stacked memories
    2. coarse grained reconfigurable logic
    3. dataflow architectures
    4. processing in memory

    Qualifiers

    • Research-article

    Conference

    ICDCN '19
    Sponsors: SIGOPS, Indian Institute of Science
