IMC-Sort: In-Memory Parallel Sorting Architecture using Hybrid Memory Cube

Authors:

Nagadastagiri Challapalle,

Akshay Krishna Ramanathan,

Vijaykrishnan Narayanan

GLSVLSI '20: Proceedings of the 2020 on Great Lakes Symposium on VLSI

Pages 45 - 50

https://doi.org/10.1145/3386263.3407581

Published: 07 September 2020

Abstract

Processing-in-memory (PIM) architectures have gained significant importance as an alternative to von Neumann architectures, alleviating the memory wall and technology scaling problems. PIM architectures have achieved significant latency and energy improvements for various emerging and widely used workloads such as deep neural networks, graph analytics, databases, and computational genomics. In this work, we propose a PIM-based accelerator architecture (IMC-Sort) for sorting. Sorting is one of the fundamental and widely used algorithms in applications such as databases, networking, and data analytics. The IMC-Sort architecture augments the Hybrid Memory Cube (HMC) memory system by incorporating a custom sorting network in the logic layer of each HMC vault. IMC-Sort uses an optimized folded bitonic sort and merge network to sort input sequences of arbitrary length at each vault, and an optimized address mapping mechanism to distribute the input data across HMC vaults. Merging of the sorted results across individual vaults is also performed using the vaults' sorting networks, which communicate with other vaults through the HMC's crossbar network. Overall, IMC-Sort achieves 16.8x and 1.1x speedups and 375.5x and 13.6x energy savings compared to a widely used CPU implementation and a state-of-the-art near-memory custom sort accelerator, respectively.
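The dataflow summarized in the abstract (distribute keys across vaults, sort each vault's share with a bitonic network, then merge the per-vault results) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the vault count `NUM_VAULTS`, the round-robin interleaving, the padding with infinity sentinels, and the function names `bitonic_sort` and `imc_sort_model` are all hypothetical, and `heapq.merge` stands in for the cross-vault merge that the paper performs on the vault sorting networks over the crossbar.

```python
import heapq
from typing import List

NUM_VAULTS = 4  # assumption: illustrative only, not the paper's configuration


def bitonic_sort(values: List[int]) -> List[int]:
    """Sort a list with Batcher's bitonic network (software model).

    Inputs of arbitrary length are padded to the next power of two with
    +infinity sentinels, which sink to the tail and are dropped at the end.
    """
    n = len(values)
    if n <= 1:
        return list(values)
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [float("inf")] * (size - n)

    k = 2
    while k <= size:            # length of the bitonic sequences being merged
        j = k // 2
        while j >= 1:           # compare-exchange distance within this stage
            for i in range(size):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate direction; the last stage is all ascending.
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data[:n]


def imc_sort_model(values: List[int], num_vaults: int = NUM_VAULTS) -> List[int]:
    """Model the distribute / per-vault sort / merge flow in software."""
    # Assumed mapping: interleave keys round-robin across the vaults.
    vaults = [values[v::num_vaults] for v in range(num_vaults)]
    # Each vault sorts its local share independently.
    sorted_runs = [bitonic_sort(chunk) for chunk in vaults]
    # Stand-in for the cross-vault merge described in the abstract.
    return list(heapq.merge(*sorted_runs))
```

For example, `imc_sort_model([5, 3, 9, 1, 7, 2, 8])` splits the seven keys over four vaults, sorts each share, and merges the runs into one ascending sequence. In hardware the inner `for i` loop corresponds to compare-exchange units operating in parallel, which is where a sorting network gets its speed; the sequential loop here only reproduces the comparison pattern.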

Supplementary Material

MP4 File (3386263.3407581.mp4)

Presentation video

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Hardware

Published In

GLSVLSI '20: Proceedings of the 2020 on Great Lakes Symposium on VLSI

September 2020

597 pages

ISBN: 9781450379441

DOI: 10.1145/3386263

General Chairs: Tinoosh Mohsenin (University of Maryland, Baltimore County, USA), Weisheng Zhao (Beihang University, China)

Program Chairs: Yiran Chen (Duke University, USA), Onur Mutlu (ETH Zurich, Switzerland)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Semiconductor Research Corporation

Conference

GLSVLSI '20

GLSVLSI '20: Great Lakes Symposium on VLSI 2020

September 7 - 9, 2020

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 312 of 1,156 submissions, 27%
