DOI: 10.1145/3410463.3414627
Research Article | Public Access

Transmuter: Bridging the Efficiency Gap using Memory and Dataflow Reconfiguration

Published: 30 September 2020

Abstract

With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build hardware for emerging applications that meets power and performance targets while remaining flexible and programmable for end users. This is particularly true for domains that have frequently changing algorithms and applications involving mixed sparse/dense data structures, such as those in machine learning and graph analytics. To overcome this, we present a flexible accelerator called Transmuter, in a novel effort to bridge the gap between General-Purpose Processors (GPPs) and Application-Specific Integrated Circuits (ASICs). Transmuter adapts to changing kernel characteristics, such as data reuse and control divergence, through the ability to reconfigure the on-chip memory type, resource sharing, and dataflow at run-time within a short latency. This is facilitated by a fabric of lightweight cores connected to a network of reconfigurable caches and crossbars. Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications. Finally, to support programmability and ease of adoption, we prototype a software stack composed of low-level runtime routines and a high-level language library called TransPy, which cater to expert programmers and end-users, respectively.
Our evaluations with Transmuter demonstrate average throughput (energy-efficiency) improvements of 5.0× (18.4×) and 4.2× (4.0×) over a high-end CPU and GPU, respectively, across a diverse set of kernels predominant in graph analytics, scientific computing and machine learning. Transmuter achieves energy-efficiency gains averaging 3.4× and 2.0× over prior FPGA and CGRA implementations of the same kernels, while remaining on average within 9.3× of state-of-the-art ASICs.
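
The abstract describes TransPy only at a high level. As a rough, hypothetical sketch of what a NumPy-style front end to a run-time-reconfigurable accelerator could look like, the snippet below offloads a sparse-dense matrix product after selecting a memory/dataflow configuration. The module, class, and method names (`reconfigure`, `spmm`) and all parameter names are illustrative assumptions, not the actual TransPy API, and the "accelerator" is simply emulated on the host with NumPy/SciPy.

```python
# Hypothetical sketch only: a stand-in for a TransPy-like interface,
# emulated on the host; the real TransPy API may differ entirely.
import numpy as np
import scipy.sparse as sp

class FakeTranspy:
    """Minimal stand-in for a NumPy-like accelerator front end."""
    def __init__(self):
        self.config = ("cache", "shared")   # assumed default configuration

    def reconfigure(self, memory="cache", dataflow="shared"):
        # On real hardware this would trigger run-time reconfiguration of the
        # on-chip memory type and dataflow; here it only records the choice.
        self.config = (memory, dataflow)

    def spmm(self, a, b):
        # Sparse x dense matrix multiply, executed on the host as a fallback.
        return a @ b

tp = FakeTranspy()
tp.reconfigure(memory="scratchpad", dataflow="private")   # assumed knob names

a = sp.random(1024, 1024, density=0.01, format="csr")
b = np.random.rand(1024, 64)
c = tp.spmm(a, b)
print(c.shape)   # (1024, 64)
```

On real hardware, the `reconfigure` step would correspond to switching the on-chip memory between cache and scratchpad modes and selecting a dataflow, as described in the abstract; the kernel call itself keeps the familiar array-library look for end users.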

Published In

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
September 2020
505 pages
ISBN: 9781450380751
DOI: 10.1145/3410463
  • General Chair: Vivek Sarkar
  • Program Chair: Hyesoon Kim

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2020

Author Tags

  1. dataflow reconfiguration
  2. general-purpose acceleration
  3. hardware acceleration
  4. memory reconfiguration
  5. reconfigurable architectures

Qualifiers

  • Research-article

Conference

PACT '20

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
