research-article

Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign

Authors:

Newsha Ardalani,

Karthikeyan Sankaralingam,

Jian Weng

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Article No.: 36, Pages 1 - 15

https://doi.org/10.1145/3243176.3243212

Published: 01 November 2018

Abstract

Recent programmable accelerators are faster and more energy efficient than general purpose processors, but expose complex hardware/software abstractions for compilers. A key problem is instruction scheduling, which requires sophisticated algorithms for mapping instructions to distributed processing elements, routing of operand dependences, and timing the arrival of operands to enable high throughput.

The complex dependences between mapping, communication and timing make prior scheduling techniques insufficient. Optimization-based approaches are too slow, and heuristic-based approaches cannot achieve high quality. Our first insight is that the scheduling problem can be solved in a series of phases with overlapping responsibilities to reduce complexity. Our second insight is that optimization-based and stochastic heuristic-based search strategies can be combined to exploit the best features of both. These insights lead to the two primary techniques we explore: phase overlapping and hybridization.

In this work we explore the codesign of scheduling algorithms with a challenging-to-schedule programmable accelerator. We show we can improve its area by 35% by trimming its scheduling-friendly structures, using a scheduling algorithm that is 5× faster than the state-of-the-art optimization-based scheduler, with up to 2× better throughput.
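The hybridization idea in the abstract, pairing a stochastic heuristic with an optimization-based search, can be illustrated with a toy sketch. This is not the authors' scheduler. The placement-only cost model (total Manhattan wire length on a grid of processing elements), the function names `wire_cost`, `exact_replace` and `hybrid_schedule`, and all parameters are assumptions made purely for illustration. A random step picks a few instructions, and an exhaustive step then re-places only those instructions optimally while the rest of the mapping stays fixed.

```python
import itertools
import random


def wire_cost(edges, place):
    """Total Manhattan distance over all dependence edges (toy cost model)."""
    return sum(
        abs(place[u][0] - place[v][0]) + abs(place[u][1] - place[v][1])
        for u, v in edges
    )


def exact_replace(edges, place, nodes, free_slots):
    """Optimization step: exhaustively re-place a few nodes.

    Candidate slots are the slots the chosen nodes already hold plus a few
    free ones, so the mapping stays one instruction per processing element.
    """
    slots = [place[n] for n in nodes] + list(free_slots)
    best, best_cost = dict(place), wire_cost(edges, place)
    for perm in itertools.permutations(slots, len(nodes)):
        trial = dict(place)
        trial.update(zip(nodes, perm))
        cost = wire_cost(edges, trial)
        if cost < best_cost:
            best, best_cost = trial, cost
    return best


def hybrid_schedule(edges, n_nodes, grid_w, grid_h, iters=200, k=3, seed=0):
    """Stochastic outer loop that calls the exact step on small sub-problems."""
    rng = random.Random(seed)
    all_slots = [(x, y) for x in range(grid_w) for y in range(grid_h)]
    if n_nodes > len(all_slots):
        raise ValueError("more instructions than processing elements")
    # Random initial mapping of instructions to distinct grid slots.
    place = dict(zip(range(n_nodes), rng.sample(all_slots, n_nodes)))
    for _ in range(iters):
        # Heuristic step: randomly choose which instructions to reconsider.
        nodes = rng.sample(range(n_nodes), min(k, n_nodes))
        used = set(place.values())
        free = [s for s in all_slots if s not in used]
        free_pick = rng.sample(free, min(2, len(free)))
        # Optimization step: solve the small sub-problem exactly.
        place = exact_replace(edges, place, nodes, free_pick)
    return place


# Usage: a six-instruction dependence chain mapped onto a 3x3 grid.
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
mapping = hybrid_schedule(chain, n_nodes=6, grid_w=3, grid_h=3)
print(mapping, wire_cost(chain, mapping))
```

Keeping the exact search restricted to a handful of instructions bounds the cost of each optimization call, while the random selection lets the search move across the whole mapping over many iterations. Routing and operand timing, which the abstract names as part of the real problem, are deliberately left out of this sketch.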

Information & Contributors

Information

Published In

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

November 2018

494 pages

ISBN:9781450359863

DOI:10.1145/3243176

General Chair: Skevos Evripidou, University of Cyprus, Cyprus

Program Chairs: Per Stenström, Chalmers University of Technology, Sweden; Michael O'Boyle, University of Edinburgh, UK

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IFIP WG 10.3
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '18

Sponsor:

SIGARCH

PACT '18: International conference on Parallel Architectures and Compilation Techniques

November 1 - 4, 2018

Limassol, Cyprus

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 33
Total Downloads: 481
Downloads (Last 12 months): 56
Downloads (Last 6 weeks): 6

Reflects downloads up to 16 Nov 2024

Citations

Cited By

Wei Z, Yu J, Ren J, Duan W, Wu D (2024). Progressive Simulated Annealing Algorithm for the Pipeline Allocation Problem of Protocol Independent Switch Architecture Chips. 2024 6th International Conference on Data-driven Optimization of Complex Systems (DOCS), 243-249. Online publication date: 16-Aug-2024.
https://doi.org/10.1109/DOCS63458.2024.10704552

Sriram K, Pothukuchi R, Gerasimiuk M, Ugur M, Ye O, Manohar R, Khandelwal A, Bhattacharjee A, Solihin Y, Heinrich M (2023). SCALO: An Accelerator-Rich Distributed System for Scalable Brain-Computer Interfacing. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-20. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589107

Wang D, Lou J, Jin N, Mascarenhas E, Mahapatra R, Kinzer S, Ghodrati S, Yazdanbakhsh A, Esmaeilzadeh H, Kim N, Solihin Y, Heinrich M (2023). MESA: Microarchitecture Extensions for Spatial Architecture Generation. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-14. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589084

Kong X, Huang Y, Zhu J, Man X, Liu Y, Feng C, Gou P, Tang M, Wei S, Liu L, Solihin Y, Heinrich M (2023). MapZero: Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-14. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589081

Maas M, Beaugnon U, Chauhan A, Ilbeyi B, Aamodt T, Jerger N, Swift M (2023). TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 123-137. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3567955.3567961

Song L, Guo L, Basalama S, Chi Y, Lucas R, Cong J, Ienne P, Zhang Z (2023). Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver. Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 247-258. Online publication date: 12-Feb-2023.
https://dl.acm.org/doi/10.1145/3543622.3573182

Liu T, Li W, Fan Z (2023). DFGC: DFG-aware NoC Control based on Time Stamp Prediction for Dataflow Architecture. 2023 IEEE 41st International Conference on Computer Design (ICCD), 432-439. Online publication date: 6-Nov-2023.
https://doi.org/10.1109/ICCD58817.2023.00071

Shah N, Meert W, Verhelst M (2023). DAG Processing Unit Version 2 (DPU-v2): Efficient Execution of Irregular Workloads on a Spatial Datapath. Efficient Execution of Irregular Dataflow Graphs, 89-123. Online publication date: 26-Apr-2023.
https://doi.org/10.1007/978-3-031-33136-7_5

Dadu V, Nowatzki T, Falsafi B, Ferdman M, Lu S, Wenisch T (2022). TaskStream: accelerating task-parallel workloads by recovering program structure. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 1-13. Online publication date: 28-Feb-2022.
https://dl.acm.org/doi/10.1145/3503222.3507706

Sankaralingam K, Nowatzki T, Gangadhar V, Shah P, Davies M, Galliher W, Guo Z, Khare J, Vijay D, Palamuttam P, Punde M, Tan A, Thiruvengadam V, Wang R, Xu S, Salapura V, Zahran M, Chong F, Tang L (2022). The Mozart reuse exposed dataflow processor for AI and beyond. Proceedings of the 49th Annual International Symposium on Computer Architecture, 978-992. Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3533040
