research-article

Fusion: design tradeoffs in coherent cache hierarchies for accelerators

Authors:

Snehasish Kumar,

Arrvindh Shriraman,

Naveen Vedula

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 733 - 745

https://doi.org/10.1145/2749469.2750421

Published: 13 June 2015

Abstract

Chip designers have shown increasing interest in integrating specialized fixed-function coprocessors into multicore designs to improve energy efficiency. Recent work in academia [11, 37] and industry [16] has sought to enable finer-grained offloading at the granularity of functions and loops. The sequential program must now migrate across the chip, using the appropriate accelerator for each program region. As execution migrates, it becomes increasingly challenging to retain the temporal and spatial locality of the original program and to manage data sharing.

We show that, with the increasing energy cost of wires and caches relative to compute operations, it is imperative to optimize data movement to retain the energy benefits of accelerators. We develop FUSION, a lightweight coherent cache hierarchy for accelerators, and study its tradeoffs against a scratchpad-based architecture. We find that coherence, both among the accelerators and with the CPU, can help minimize data movement and save energy. FUSION leverages temporal coherence [32] to optimize data movement within the accelerator tile. The accelerator tile includes small per-accelerator L0 caches to minimize hit energy and a per-tile shared cache to improve localized sharing between accelerators and minimize data exchanges with the host LLC. Overall, FUSION improves performance by 4.3× compared to an oracle DMA that pushes data into the scratchpad. In workloads with inter-accelerator sharing, we reduce the dynamic energy of the cache hierarchy by up to 10× by minimizing host-accelerator data ping-ponging.
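The abstract does not spell out the protocol, so the following is only a rough illustration of the general idea behind lease-based (temporal) coherence as applied to a per-accelerator L0 cache backed by a per-tile shared cache. It is a minimal sketch under stated assumptions, not FUSION's actual design: the class names, the `LEASE_CYCLES` value, and the rule that a write simply waits for outstanding leases to expire are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical lease length in cycles; the abstract does not give a value.
LEASE_CYCLES = 100


@dataclass
class SharedTileCache:
    """Illustrative per-tile shared cache that remembers the latest lease granted per line."""
    data: dict = field(default_factory=dict)       # addr -> value
    lease_end: dict = field(default_factory=dict)  # addr -> cycle at which the last lease expires

    def read(self, addr, now):
        # Grant the reader a lease: it may keep the copy until `expiry`.
        expiry = now + LEASE_CYCLES
        self.lease_end[addr] = max(self.lease_end.get(addr, 0), expiry)
        return self.data.get(addr, 0), expiry

    def write(self, addr, value, now):
        # Assumption: the write stalls until every outstanding lease has expired,
        # so no L0 can still hold a valid stale copy when it completes.
        # The caller is expected to resume the simulation at the returned cycle.
        done_at = max(now, self.lease_end.get(addr, 0))
        self.data[addr] = value
        return done_at


@dataclass
class L0Cache:
    """Illustrative per-accelerator L0 cache whose lines self-invalidate when their lease expires."""
    shared: SharedTileCache
    lines: dict = field(default_factory=dict)  # addr -> (value, lease_expiry)

    def read(self, addr, now):
        entry = self.lines.get(addr)
        if entry is not None and now < entry[1]:
            return entry[0], "hit"   # lease still valid: no message leaves the L0
        # Missing or expired: refetch from the shared tile cache and take a new lease.
        value, expiry = self.shared.read(addr, now)
        self.lines[addr] = (value, expiry)
        return value, "miss"
```

In this sketch no invalidation message is ever sent to an L0: a stale copy simply stops being usable once its lease runs out, which is the property that lets a timestamp-based scheme avoid invalidation traffic inside the tile. A consumer that reads a line at cycle 10 holds it until cycle 110, so a write issued at cycle 30 completes at cycle 110, and the consumer's next read at or after that cycle misses and fetches the new value.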

References

[1]

Macsim: Simulator for heterogeneous architecture - https://code.google.com/p/macsim/.

[2]

J. Balfour. Efficient Embedded Computing.

[3]

A. Basu, M. D. Hill, and M. M. Swift. Reducing memory reference energy with opportunistic virtual caching. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pages 297--308, Washington, DC, USA, 2012. IEEE Computer Society.

[4]

B. Blaner, B. Abali, B. Bass, S. Chari, R. Kalla, S. Kunkel, K. Lauricella, R. Leavens, J. Reilly, and P. Sandon. IBM POWER7+ processor on-chip accelerators for cryptography and active memory expansion. IBM Journal of Research and Development, 57(6):3:1--3:16, Nov 2013.

[5]

J. Brown, S. Woodward, B. Bass, and C. Johnson. IBM Power Edge of Network Processor: A Wire-Speed System on a Chip. IEEE Micro, 31(2):76--85, 2011.

[6]

S. R. Chalamalasetti, K. Lim, M. Wright, A. AuYoung, P. Ranganathan, and M. Margala. An FPGA memcached appliance, 2013.

[7]

B. Dally. Power, programmability, and granularity: The challenges of exascale computing. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 878--878. IEEE, 2011.

[8]

A. Farmahini-Farahani, N. S. Kim, and K. Morrow. Energy-efficient reconfigurable cache architectures for accelerator-enabled embedded systems. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, pages 211--220. IEEE, 2014.

[9]

J. Goodman. Source Snooping Cache Coherence Protocols. Science, 2009.

[10]

Goodridge. The effect and technique of system coherence in ARM multicore technology. 2008.

[11]

V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim. DySER: Unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro, 32(5):38--51, 2012.

[12]

Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures.

[13]

R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz. Understanding sources of inefficiency in general-purpose chips. In PROC of the 37th ISCA, 2010.

[14]

J. Hestness, S. W. Keckler, and D. A. Wood. A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior. In IEEE International Symposium on Workload Characterization, 2014.

[15]

R. Huggahalli, R. Iyer, and S. Tetrick. Direct Cache Access for High Bandwidth Network I/O. In PROC of the 32nd ISCA, 2005.

[16]

Intel. Xeon chip with integrated FPGA. 2014.

[17]

S. Kaxiras and G. Keramidas. SARC coherence: Scaling directory cache coherence in performance and power. IEEE Micro, 30(5):54--65, Sept 2010.

[18]

S. Kaxiras and A. Ros. A new perspective for efficient virtual-cache coherence. In PROC of the 40th ISCA, pages 1--12, Apr. 2013.

[19]

S. Kumar, N. Vedula, A. Shriraman, and V. Srinivasan. DASX: Hardware accelerator for software data structures. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS 2015, June 2015.

[20]

K. Lim, D. Meisner, A. G. Saidi, P. Ranganathan, and T. F. Wenisch. Thin servers with smart pipes: designing SoC accelerators for memcached. In PROC of the 40th ISCA, 2013.

[21]

M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's general execution-driven multiprocessor simulator (gems) toolset. SIGARCH Comput. Archit. News, 33(4):92--99, Nov. 2005.

[22]

S. L. Min and J. L. Baer. Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps. IEEE Trans. Parallel Distrib. Syst., 3(1), 1992.

[23]

N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In PROC of the 40th MICRO, 2007.

[24]

P. Kogge (Ed.), K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. In DARPA IPTO, September 2008.

[25]

A. Putnam, S. Eggers, D. Bennett, E. Dellinger, J. Mason, H. Styles, P. Sundararajan, and R. Wittig. Performance and power of cache-based reconfigurable computing. In PROC of the 36th ISCA, 2009.

[26]

W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz. Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing. PROC of the 40th ISCA, pages 1--12, Apr. 2013.

[27]

B. Reagen, R. Adolf, S. Y. Shao, G.-Y. Wei, and D. Brooks. Machsuite: Benchmarks for accelerator design and customized architectures. In IEEE International Symposium on Workload Characterization (IISWC), 2014.

[28]

G. Research. ARM CTO warns of dark silicon. 2010.

[29]

S. Ricketts. Efficient Cache-Coherent Migration for Heterogeneous Coprocessors in Dark Silicon Limited Technology.

[30]

J. Sampson, G. Venkatesh, N. Goulding-Hotta, S. Garcia, S. Swanson, and M. B. Taylor. Efficient complex operators for irregular codes. In PROC of the 17th HPCA, 2011.

[31]

K. S. Shim et al. Library Cache Coherence. CSAIL Technical Report MIT-CSAIL-TR-2011-027, May 2011.

[32]

I. Singh, A. Shriraman, W. W. Fung, M. O'Connor, and T. M. Aamodt. Cache coherence for GPU architectures. In HPCA, pages 578--590, 2013.

[33]

S. Somogyi, T. F. Wenisch, A. Ailamaki, and B. Falsafi. Spatiotemporal memory streaming. In PROC of the 36th ISCA, 2009.

[34]

D. Stasiak, R. Chaudhry, D. Cox, S. Posluszny, J. Warnock, S. Weitzel, D. Wendel, and M. Wang. Cell processor low-power design methodology. In IEEE Micro, 2005.

[35]

S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor. SD-VBS: The San Diego Vision Benchmark Suite. IEEE, Oct. 2009.

[36]

G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. Conservation cores: reducing the energy of mature computations. In ACM SIGARCH Computer Architecture News, volume 38, pages 205--218. ACM, 2010.

[37]

G. Venkatesh, J. Sampson, N. Goulding-Hotta, S. K. Venkata, M. B. Taylor, and S. Swanson. Qscores: Trading dark silicon for scalable energy efficiency with quasi-specific cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pages 163--174. ACM, 2011.

[38]

M. Vuletic, P. Ienne, C. Claus, and W. Stechele. Multithreaded virtual-memory-enabled reconfigurable hardware accelerators. In Field Programmable Technology, 2006. FPT 2006. IEEE International Conference on, pages 197--204. IEEE, 2006.

[39]

T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. Temporal Streaming of Shared Memory. In PROC of the 32nd ISCA, 2005.

[40]

L. Wu, R. J. Barker, M. A. Kim, and K. A. Ross. Navigating big data with high-throughput, energy-efficient data partitioning. In PROC of the 40th ISCA, 2013.

[41]

Q. Zheng, N. Goulding-Hotta, S. Ricketts, S. Swanson, M. B. Taylor, and J. Sampson. Exploring energy scalability in coprocessor-dominated architectures for dark silicon. ACM Trans. Embed. Comput. Syst., 13(4s):130:1--130:24, Apr. 2014.

Index Terms

Fusion: design tradeoffs in coherent cache hierarchies for accelerators
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information & Contributors

Information

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Article Metrics

Total Citations: 27
Total Downloads: 729
Downloads (Last 12 months): 39
Downloads (Last 6 weeks): 1

Reflects downloads up to 01 Oct 2024

Cited By

Gupta S, Dwarkadas S (2024). RELIEF: Relieving Memory Pressure In SoCs Via Data Movement-Aware Accelerator Scheduling. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1063-1079. Online publication date: 2-Mar-2024.
https://doi.org/10.1109/HPCA57654.2024.00084
Li A, Ning A, Wentzlaff D (2023). Duet: Creating Harmony between Processors and Embedded FPGAs. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 745-758. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10070989
López-Paradís G, Venu B, Armejach A, Moretó M (2023). Characterization of a Coherent Hardware Accelerator Framework for SoCs. Embedded Computer Systems: Architectures, Modeling, and Simulation, pages 91-106. Online publication date: 7-Nov-2023.
https://doi.org/10.1007/978-3-031-46077-7_7
Naghibijouybari H, Koruyeh E, Abu-Ghazaleh N (2022). Microarchitectural Attacks in Heterogeneous Systems: A Survey. ACM Computing Surveys, 55:7, pages 1-40. Online publication date: 15-Dec-2022.
https://dl.acm.org/doi/10.1145/3544102
Zuckerman J, Giri D, Kwon J, Mantovani P, Carloni L (2021). Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 350-365. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480065
Xi S, Yao Y, Bhardwaj K, Whatmough P, Wei G, Brooks D (2020). SMAUG. ACM Transactions on Architecture and Code Optimization, 17:4, pages 1-26. Online publication date: 10-Nov-2020.
https://dl.acm.org/doi/10.1145/3424669
Boroumand A, Ghose S, Patel M, Hassan H, Lucia B, Ausavarungnirun R, Hsieh K, Hajinazar N, Malladi K, Zheng H, Mutlu O, Manne S, Hunter H, Altman E (2019). CoNDA. Proceedings of the 46th International Symposium on Computer Architecture, pages 629-642. Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3307650.3322266
Giri D, Mantovani P, Carloni L, Shibuya T (2019). Runtime reconfigurable memory hierarchy in embedded scalable platforms. Proceedings of the 24th Asia and South Pacific Design Automation Conference, pages 719-726. Online publication date: 21-Jan-2019.
https://dl.acm.org/doi/10.1145/3287624.3288755
Goyat S, Kant S, Dhariwal N (2019). Dynamic Heterogeneous scheduling of GPU-CPU in Distributed Environment. 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), pages 329-336. Online publication date: Nov-2019.
https://doi.org/10.1109/ICSSIT46314.2019.8987886
Fang Z, Javadi F, Cong J, Reinman G (2019). Understanding Performance Gains of Accelerator-Rich Architectures. 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 239-246. Online publication date: Jul-2019.
https://doi.org/10.1109/ASAP.2019.00013