LogCA: A High-Level Performance Model for Hardware Accelerators

Authors:

Muhammad Shoaib Bin Altaf,

David A. Wood

ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture

Pages 375 - 388

https://doi.org/10.1145/3079856.3080216

Published: 24 June 2017

Abstract

With the end of Dennard scaling, architects have increasingly turned to special-purpose hardware accelerators to improve the performance and energy efficiency of some applications. Unfortunately, accelerators do not always live up to expectations and may under-perform in some situations. Understanding the factors that affect the performance of an accelerator is crucial for both architects and programmers early in the design stage. Detailed models can be highly accurate, but they often require low-level details that are not available until late in the design cycle. In contrast, simple analytical models can provide useful insights by abstracting away low-level system details.

In this paper, we propose LogCA, a high-level performance model for hardware accelerators. LogCA helps both programmers and architects identify performance bounds and design bottlenecks early in the design cycle, and provides insight into which optimizations may alleviate these bottlenecks. We validate our model across a variety of kernels, ranging from sub-linear to super-linear complexities, on both on-chip and off-chip accelerators. We also describe the utility of LogCA using two retrospective case studies. First, we discuss the evolution of interface design in SUN/Oracle's encryption accelerators. Second, we discuss the evolution of memory interface design in three different GPU architectures. In both cases, we show that the design optimizations adopted for these machines are similar to the optimizations LogCA suggests. We argue that architects and programmers can use insights from these retrospective studies to improve future designs.

Published In

ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture

June 2017

736 pages

ISBN:9781450348928

DOI:10.1145/3079856

ACM SIGARCH Computer Architecture News Volume 45, Issue 2
ISCA'17
May 2017
715 pages
ISSN:0163-5964
DOI:10.1145/3140659
Editor:
Babak Falsafi
Interim

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE: IEEE Computer Society Technical Committee on Design Automation
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

ISCA '17

Sponsor:

IEEE
SIGARCH

ISCA '17: The 44th Annual International Symposium on Computer Architecture

June 24 - 28, 2017

Toronto, ON, Canada

Acceptance Rates

ISCA '17 Paper Acceptance Rate 54 of 322 submissions, 17%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%
