Research Article | Open Access

Automatic Sublining for Efficient Sparse Memory Accesses

Published: 10 May 2021

Abstract

Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (a subline access), can solve these issues.
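To make the access pattern concrete, the following is a small illustration (not taken from the paper) of the kind of indirect, data-dependent access the abstract describes; the function and variable names are ours. Each iteration reads a single 8 B element at an effectively random position, so most of the 64 B cache line fetched for it is never used:

```cpp
// Illustrative sketch of a sparse (indirect) access pattern.
// Names and sizes are assumptions for illustration, not code from the paper.
#include <cstddef>
#include <cstdint>
#include <vector>

double sparse_gather_sum(const std::vector<double>& values,           // large data structure
                         const std::vector<std::uint32_t>& indices) { // scattered positions
    double sum = 0.0;
    for (std::size_t i = 0; i < indices.size(); ++i) {
        // values[indices[i]] is the sparse access: a conventional cache pulls in a
        // full line for it, while a subline access fetches only the 8 B element.
        sum += values[indices[i]];
    }
    return sum;
}
```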
Deciding which accesses to handle as sparse accesses and which as regular cached accesses is a challenging task with a large potential impact on performance. Treating sparse accesses as regular accesses reduces performance, but not caching accesses that do have locality also hurts performance, by significantly increasing their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal.
We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while regular accesses remain cached. ISLE does not require modifying source code or binaries, and it adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads and show that ISLE outperforms no subline accesses, manual sublining, and prior work on detecting sparse accesses.
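The abstract does not detail ISLE's internal organization, but the underlying idea, classifying load instructions by how many words of each fetched cache line they actually use, can be sketched in software. The model below is a hedged approximation: the PC-indexed table, the 16-sample warm-up, and the 1.5 words-per-line threshold are illustrative assumptions, not the hardware design from the paper.

```cpp
// Hypothetical software model of a per-instruction spatial-locality estimator.
// Table layout, counter sizes, and thresholds are illustrative assumptions only.
#include <cstdint>
#include <unordered_map>

struct LocalityStats {
    std::uint64_t lines_touched = 0;  // cache lines fetched on behalf of this PC
    std::uint64_t words_used    = 0;  // 8 B words of those lines actually read
};

class IsleModel {
    std::unordered_map<std::uint64_t, LocalityStats> table_;  // keyed by load PC
    static constexpr double kSparseThreshold = 1.5;   // avg words/line below which a PC is sparse
    static constexpr std::uint64_t kMinSamples = 16;  // history needed before deciding

public:
    // Called when a line fetched for `pc` is evicted, with the number of
    // distinct 8 B words that were read from it while it was cached.
    void observe(std::uint64_t pc, unsigned words_used_in_line) {
        LocalityStats& s = table_[pc];
        s.lines_touched += 1;
        s.words_used    += words_used_in_line;
    }

    // Decide whether future loads from this PC should bypass the cache and
    // issue uncached 8 B subline accesses instead of full-line fills.
    bool use_subline(std::uint64_t pc) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.lines_touched < kMinSamples)
            return false;  // not enough history: default to regular cached access
        double avg_words =
            static_cast<double>(it->second.words_used) / it->second.lines_touched;
        return avg_words < kSparseThreshold;
    }
};
```

In a hardware realization one would expect such statistics to live in a small PC-indexed table that is consulted at issue time, so that instructions classified as sparse are steered to uncached subline requests while all other loads keep using the regular cached path.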


Cited By

  • (2024) The First Direct Mesh-to-Mesh Photonic Fabric. IEEE Micro 44(3), 25-32. DOI: 10.1109/MM.2024.3387828. Online publication date: 1-May-2024.
  • (2023) The Intel Programmable and Integrated Unified Memory Architecture Graph Analytics Processor. IEEE Micro 43(5), 78-87. DOI: 10.1109/MM.2023.3295848. Online publication date: 1-Sep-2023.



Published In

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 3
September 2021
370 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3460978
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2021
Accepted: 01 February 2021
Revised: 01 January 2021
Received: 01 October 2020
Published in TACO Volume 18, Issue 3


Author Tags

  1. Graph analytics
  2. sparse computation

Qualifiers

  • Research-article
  • Research
  • Refereed
