Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3470496.3527400acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article
Open access

Tiny but mighty: designing and realizing scalable latency tolerance for manycore SoCs

Published: 11 June 2022 Publication History

Abstract

Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy.
Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification.
We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls, and enabling greater memory parallelism (MLP).
In April 2021 we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targetting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications with full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques.

References

[1]
Sam Ainsworth and Timothy M. Jones. 2016. Graph Prefetching Using Data Structure Knowledge. In Proceedings of the 2016 International Conference on Supercomputing (Istanbul, Turkey) (ICS '16). Association for Computing Machinery, NewYork, NY, USA, Article 39, 11 pages.
[2]
Sam Ainsworth and Timothy M Jones. 2017. Software prefetching for indirect memory accesses. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 305--317.
[3]
Sam Ainsworth and Timothy M Jones. 2018. An event-triggered programmable prefetcher for irregular workloads. ACM SIGPLAN Notices 53, 2 (2018), 578--592.
[4]
A. Amid, D. Biancolin, A. Gonzalez, D. Grubb, S. Karandikar, H. Liew, A. Magyar, H. Mao, A. Ou, N. Pemberton, P. Rigge, C. Schmidt, J. Wright, J. Zhao, Y. S. Shao, K. Asanović, and B. Nikolić. 2020. Chipyard: Integrated Design, Simulation, and Implementation Framework for Custom SoCs. IEEE Micro 40, 4 (2020), 10--21.
[5]
Krste Asanovic, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Daniel Dabbelt, John Hauser, Adam Izraelevitz, et al. 2016. The Rocket chip generator. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17 (2016).
[6]
Jonathan Balkind, Katie Lim, Fei Gao, Jinzheng Tu, David Wentzlaff, Michael Schaffner, Florian Zaruba, and Luca Benini. 2019. OpenPiton+Ariane: The First Open-Source, SMP Linux-booting RISC-V System Scaling From One to Many Cores. In Third Workshop on Computer Architecture Research with RISC-V, CARRV, Vol. 19.
[7]
Jonathan Balkind, Katie Lim, Michael Schaffner, Fei Gao, Grigory Chirkov, Ang Li, Alexey Lavrov, Tri M. Nguyen, Yaosheng Fu, Florian Zaruba, Kunal Gulati, Luca Benini, and David Wentzlaff. 2020. BYOC: A "Bring Your Own Core" Framework for Heterogeneous-ISA Research. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 699--714.
[8]
Jonathan Balkind, Michael McKeown, Yaosheng Fu, Tri Nguyen, Yanqi Zhou, Alexey Lavrov, Mohammad Shahrad, Adi Fuchs, Samuel Payne, Xiaohua Liang, Matthew Matl, and David Wentzlaff. 2016. OpenPiton: An Open Source Manycore Research Framework. In ASPLOS. ACM, 217--232.
[9]
A. Basak, S. Li, X. Hu, S. M. Oh, X. Xie, L. Zhao, X. Jiang, and Y. Xie. 2019. Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 373--386.
[10]
Nathan Beckmann. 2021. The Case for a Programmable Memory Hierarchy. https://www.sigarch.org/the-case-for-a-programmable-memory-hierarchy/.
[11]
Peter L Bird, Alasdair Rawsthorne, and Nigel P Topham. 1993. The effectiveness of decoupling. In Proceedings of the 7th international conference on Supercomputing. 47--56.
[12]
Cadence Design Systems. 2015. JasperGold Apps User's Guide.
[13]
Luca P. Carloni. 2016. The Case for Embedded Scalable Platforms. In Proceedings of the 53rd Design Automation Conference (DAC). 17:1--17:6.
[14]
Jamison D Collins, Hong Wang, Dean M Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, and John P Shen. 2001. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings 28th Annual International Symposium on Computer Architecture. IEEE, 14--25.
[15]
Neal Clayton Crago and Sanjay Jeram Patel. 2011. OUTRIDER: Efficient memory latency tolerance with decoupled strands. In Proceedings of the 38th annual international symposium on Computer architecture. 117--128.
[16]
Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering 5, 1 (Jan. 1998), 10 pages.
[17]
Scott Davidson, Shaolin Xie, Christopher Torng, Khalid Al-Hawai, Austin Rovinski, Tutu Ajayi, Luis Vega, Chun Zhao, Ritchie Zhao, Steve Dai, Aporva Amarnath, Bandhav Veluri, Paul Gao, Anuj Rao, Gai Liu, Rajesh K. Gupta, Zhiru Zhang, Ronald Dreslinski, Christopher Batten, and Michael Bedford Taylor. 2018. The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips. IEEE Micro 38, 2 (2018), 30--41.
[18]
Timothy A Davis. 2015. SuiteSparse: A suite of sparse matrix software. URL http://faculty.cse.tamu.edu/davis/suitesparse.html (2015).
[19]
Esperanto Technologies. 2021. Esperanto's ET-Minion on-chip RISC-V cores. https://www.esperanto.ai/technology/.
[20]
Harry D. Foster. 2015. Trends in Functional Verification: A 2014 Industry Study. In Proceedings of the 52nd Annual Design Automation Conference (San Francisco, California) (DAC '15). Association for Computing Machinery, New York, NY, USA, Article 48, 6 pages.
[21]
James R Goodman, Jian-tu Hsieh, Koujuch Liou, Andrew R Pleszkun, PB Schechter, and Honesty C Young. 1985. PIPE: a VLSI decoupled architecture. ACM SIGARCH Computer Architecture News 13, 3 (1985), 20--27.
[22]
Tae Jun Ham, Juan L. Aragón, and Margaret Martonosi. 2015. DeSC: Decoupled Supply-compute Communication Management for Heterogeneous Architectures. In MICRO. ACM.
[23]
Tae Jun Ham, Juan L. Aragón, and Margaret Martonosi. 2019. Efficient Data Supply for Parallel Heterogeneous Architectures. ACM TACO 16, 2, Article 9 (2019), 23 pages.
[24]
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam. 2015. Efficient execution of memory access phases using dataflow specialization. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 118--130.
[25]
IEEE. 2013. Standard for SystemVerilog-Unified Hardware Design, Specification, and Verification Language. (2013), 1--1315.
[26]
Yasuo Ishii, Mary Inaba, and Kei Hiraki. 2011. Access map pattern matching for high performance data cache prefetch. Journal of Instruction-Level Parallelism 13, 2011 (2011), 1--24.
[27]
Lizy Kurian John, Vinod Reddy, Paul T Hulina, and Lee D Coraor. 1995. Program balance and its impact on high performance RISC architectures. In Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture. IEEE, 370--379.
[28]
Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amaras- inghe. 2017. The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1--29.
[29]
Manjunath Kudlur and Scott Mahlke. 2008. Orchestrating the execution of stream programs on multicore platforms. ACM SIGPLAN Notices 43, 6 (2008), 114--124.
[30]
Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO. IEEE Press.
[31]
Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. When Prefetching Works, When It Doesn't, and Why. ACM Trans. Archit. Code Optim. 9, 1, Article 2 (March 2012), 29 pages.
[32]
Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. 2010. Kronecker Graphs: An Approach to Modeling Networks. Journal of Machine Learning Reseach (JMLR) 11 (March 2010), 985--1042.
[33]
Sean Lie. 2021. Multi-Million Core, Multi-Wafer AI Cluster. In 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE Computer Society, 1--41.
[34]
Katie Lim, Jonathan Balkind, and David Wentzlaff. 2019. JuxtaPiton: Enabling Heterogeneous-ISA Research with RISC-V and SPARC FPGA Soft-Cores. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Seaside, CA, USA) (FPGA '19). Association for Computing Machinery, New York, NY, USA, 184.
[35]
Chi-Keung Luk. 2001. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Proceedings 28th Annual International Symposium on Computer Architecture. IEEE, 40--51.
[36]
William Mangione-Smith, Santosh G Abraham, and Edward S Davidson. 1990. The effects of memory latency and fine-grain parallelism on astronautics ZS-1 performance. In Twenty-Third Annual Hawaii International Conference on System Sciences, Vol. 1. IEEE, 288--296.
[37]
Aninda Manocha, Tyler Sorensen, Esin Tureci, Opeoluwa Matthews, Juan L Aragón, and Margaret Martonosi. 2021. GraphAttack: Optimizing Data Supply for Graph Applications on In-Order Multicore Architectures. ACM Transactions on Architecture and Code Optimization (TACO) 18, 4 (2021), 1--26.
[38]
Opeoluwa Matthews, Aninda Manocha, Davide Giri, Marcelo Orenes-Vera, Esin Tureci, Tyler Sorensen, Tae Jun Ham, Juan L Aragón, Luca P Carloni, and Margaret Martonosi. 2020. MosaicSim: A Lightweight, Modular Simulator for Heterogeneous Systems. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 136--148.
[39]
Mohammad Hasanzadeh Mofrad, Rami Melhem, Yousuf Ahmad, and Mohammad Hammoud. 2019. Multithreaded Layer-wise Training of Sparse Deep Neural Networks using Compressed Sparse Column. In 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1--6.
[40]
K. J. Nesbit and J. E. Smith. 2004. Data Cache Prefetching Using a Global History Buffer. In 10th International Symposium on High Performance Computer Architecture (HPCA'04). 96--96.
[41]
Quan M Nguyen and Daniel Sanchez. 2020. Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 596--608.
[42]
Marcelo Orenes-Vera, Aninda Manocha, David Wentzlaff, and Margaret Martonosi. 2021. AutoSVA: Democratizing Formal Verification of RTL Module Interactions. In Proceedings of the 58th Design Automation Conference (DAC).
[43]
A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, and J. Emer. 2013. Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures. In ISCA.
[44]
J-M Parcerisa and Antonio Gonzalez. 1999. The synergy of multithreading and access/execute decoupling. In Proceedings Fifth International Symposium on High-Performance Computer Architecture. IEEE, 59--63.
[45]
Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August. 2004. Decoupled Software Pipelining with the Synchronization Array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[46]
RISC-V Foundation. 2019. Riscv-tests. https://github.com/riscv/riscv-tests.
[47]
Karl Rupp. 2018. 42 Years of Microprocessor Trend Data. https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/.
[48]
James Smith. 1982. Decoupled Access/Execute Computer Architectures. In Proceedings of the 9th Annual Symposium on Computer Architecture (Austin, Texas, USA) (ISCA). 8 pages. http://dl.acm.org/citation.cfm?id=800048.801719
[49]
James E Smith. 1982. Decoupled access/execute computer architectures. In ACM SIGARCH Computer Architecture News, Vol. 10. IEEE Press.
[50]
Tyler Sorensen, Aninda Manocha, Esin Tureci, Marcelo Orenes-Vera, Juan L Aragón, and Margaret Martonosi. 2020. A simulator and compiler framework for agile hardware-software co-design evaluation and exploration. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). IEEE, 1--9.
[51]
Tyler Sorensen, Aninda Manocha, Esin Tureci, Marcelo Orenes-Vera, Juan L. Aragón, and Margaret Martonosi. 2020. A Simulator and Compiler Framework for Agile Hardware-Software Co-design Evaluation and Exploration. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1--9.
[52]
V. Srinivasan, R. B. R. Chowdhury, and E. Rotenberg. 2020. Slipstream Processors Revisited: Exploiting Branch Sets. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). 105--117.
[53]
Xian-He Sun and Yong Chen. 2010. Reevaluating Amdahl's law in the multicore era. Journal of Parallel and distributed Computing 70, 2 (2010), 183--188.
[54]
Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg. 2000. Slipstream processors: Improving both performance and fault tolerance. ACM SIGPLAN Notices 35, 11 (2000), 257--268.
[55]
Michael Sung, Ronny Krashinsky, and Krste Asanović. 2001. Multithreading decoupled architectures for complexity-effective general purpose computing. ACM SIGARCH Computer Architecture News 29, 5 (2001), 56--61.
[56]
Nishil Talati, Kyle May, Armand Behroozi, Yichen Yang, Kuba Kaszyk, Christos Vasiladiotis, Tarunesh Verma, Lu Li, Brandon Nguyen, Jiawen Sun, et al. 2021. Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 654--667.
[57]
Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, et al. 2002. The raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE micro 22, 2 (2002), 25--35.
[58]
Kim-Anh Tran, Trevor E Carlson, Konstantinos Koukos, Magnus Själander, Vasileios Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean. 2017. Clairvoyance: look-ahead compile-time scheduling. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 171--184.
[59]
Kim-Anh Tran, Alexandra Jimborean, Trevor E Carlson, Konstantinos Koukos, Magnus Själander, and Stefanos Kaxiras. 2018. SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 328--343.
[60]
P. N. Whatmough, M. Donato, G. G. Ko, S. K. Lee, D. Brooks, and G. Wei. 2020. CHIPKIT: An Agile, Reusable Open-Source Framework for Rapid Test Chip Development. IEEE Micro 40, 4 (2020), 32--40.
[61]
Wm A Wulf. 1992. Evaluation of the WM Architecture. In Proceedings of the 19th annual international symposium on Computer architecture. 382--390.
[62]
Xiangyao Yu, Christopher J Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect memory prefetcher. In Proceedings of the 48th International Symposium on Microarchitecture. 178--190.
[63]
Florian Zaruba and Luca Benini. 2018. Ariane: An Open-Source 64-bit RISC-V Application Class Processor and latest Improvements. Technical talk at the RISC-V Workshop https://www.youtube.com/watch?v=8HpvRNh0ux4.
[64]
F. Zaruba and L. Benini. 2019. The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 11 (Nov 2019), 2629--2640. https://github.com/openhwgroup/cva6.
[65]
Florian Zaruba, Fabian Schuiki, and Luca Benini. 2020. Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing. IEEE Micro 41, 2 (2020), 36--42.
[66]
Weifeng Zhang, Dean M Tullsen, and Brad Calder. 2007. Accelerating and adapting precomputation threads for efficient prefetching. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture. IEEE, 85--95.
[67]
Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman Amarasinghe. 2018. Graphit: A high-performance graph dsl. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1--30.

Cited By

View all
  • (2024)A Novel Approach to Managing System-on-Chip Sub-Blocks Using a 16-Bit Real-Time Operating SystemElectronics10.3390/electronics1310197813:10(1978)Online publication date: 18-May-2024
  • (2024)ARQUIN: Architectures for Multinode Superconducting Quantum ComputersACM Transactions on Quantum Computing10.1145/36741515:3(1-59)Online publication date: 19-Sep-2024
  • (2024)Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory AccessACM Transactions on Architecture and Code Optimization10.1145/366347921:3(1-28)Online publication date: 9-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022
1097 pages
ISBN:9781450386104
DOI:10.1145/3470496
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

In-Cooperation

  • IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Check for updates

Author Tags

  1. decoupling
  2. latency tolerance
  3. memory
  4. modular RTL

Qualifiers

  • Research-article

Funding Sources

  • Defense Advanced Research Projects Agency (DARPA)
  • National Science Foundation (NSF)
  • Air Force Research Laboratory (AFRL)

Conference

ISCA '22
Sponsor:

Acceptance Rates

ISCA '22 Paper Acceptance Rate 67 of 400 submissions, 17%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%

Upcoming Conference

ISCA '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)619
  • Downloads (Last 6 weeks)49
Reflects downloads up to 21 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Novel Approach to Managing System-on-Chip Sub-Blocks Using a 16-Bit Real-Time Operating SystemElectronics10.3390/electronics1310197813:10(1978)Online publication date: 18-May-2024
  • (2024)ARQUIN: Architectures for Multinode Superconducting Quantum ComputersACM Transactions on Quantum Computing10.1145/36741515:3(1-59)Online publication date: 19-Sep-2024
  • (2024)Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory AccessACM Transactions on Architecture and Code Optimization10.1145/366347921:3(1-28)Online publication date: 9-May-2024
  • (2024)Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMMProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640427(1200-1217)Online publication date: 27-Apr-2024
  • (2024)MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS61541.2024.00015(48-60)Online publication date: 5-May-2024
  • (2024)HotTiles: Accelerating SpMM with Heterogeneous Accelerator Architectures2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00081(1012-1028)Online publication date: 2-Mar-2024
  • (2023)Graph3PO: A Temporal Graph Data Processing Method for Latency QoS Guarantee in Object Cloud Storage SystemProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607075(1-16)Online publication date: 12-Nov-2023
  • (2023)Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071089(718-730)Online publication date: Feb-2023
  • (2022)An Architecture Interface and Offload Model for Low-Overhead, Near-Data, Distributed AcceleratorsProceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO56248.2022.00083(1160-1177)Online publication date: 1-Oct-2022

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media