DOI: 10.1145/2925426.2926258
Public Access

Exploiting Private Local Memories to Reduce the Opportunity Cost of Accelerator Integration

Published: 01 June 2016

Abstract

We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds low complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage.
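The exposure mechanism described above (an idle accelerator lends its PLM banks to the cache substrate, and reclaims them, invalidating any borrowed lines, when acceleration resumes) can be sketched as a toy model. This is a minimal illustrative sketch under our own assumptions; the class and method names are hypothetical and not the paper's implementation:

```python
class RocaTile:
    """Toy model of a Roca-enabled accelerator tile: its private local
    memory (PLM) backs last-level-cache lines whenever the accelerator
    is idle. All names here are hypothetical, for illustration only."""

    def __init__(self, plm_lines):
        self.busy = False        # is the accelerator currently running?
        self.cache = {}          # tag -> line, held in the repurposed PLM
        self.plm_lines = plm_lines

    def start_accel(self):
        # Reclaiming the PLM for acceleration drops the borrowed lines.
        self.cache.clear()
        self.busy = True

    def stop_accel(self):
        # Once idle again, the PLM banks rejoin the cache substrate.
        self.busy = False

    def fill(self, tag, line):
        # The cache substrate may place lines only while the PLM is unused.
        if self.busy or len(self.cache) >= self.plm_lines:
            return False
        self.cache[tag] = line
        return True

    def lookup(self, tag):
        # Lookups simply miss while the accelerator owns its PLM.
        return None if self.busy else self.cache.get(tag)
```

In this sketch a reclaim just drops the borrowed lines, so the only extra state the tile carries for caching is the tag bookkeeping, consistent with the abstract's note that Roca's area overhead is almost entirely additional tag storage.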
We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Simulating non-accelerated multiprogrammed workloads on a 16-core system with a 2MB S-NUCA baseline, we show that a 6MB Roca-enabled last-level cache built upon typical accelerators (i.e., accelerators whose area is 66% memory) can, on average, realize 70% of the performance and 68% of the energy-efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.
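The area accounting behind these figures reduces to simple arithmetic: the same silicon area either buys plain cache directly, or buys accelerators whose memory fraction (66% for the typical accelerators above) becomes reusable LLC capacity. The function below is a hypothetical helper of our own, assuming a linear area-to-capacity model:

```python
def roca_llc_capacity_mb(baseline_mb, invested_area_mb, memory_fraction):
    """Effective last-level-cache capacity when area worth
    `invested_area_mb` of plain cache is instead spent on Roca-enabled
    accelerators whose PLMs make up `memory_fraction` of their area.
    Hypothetical back-of-envelope helper, not from the paper."""
    return baseline_mb + invested_area_mb * memory_fraction

# The abstract's configuration: a 2MB S-NUCA baseline plus area worth
# 6MB of plain cache.
plain_cache = roca_llc_capacity_mb(2, 6, 1.00)  # 8.0 MB same-area S-NUCA
roca_cache = roca_llc_capacity_mb(2, 6, 0.66)   # ~6 MB Roca-enabled LLC
```

Under this model the Roca configuration gives up roughly 2MB of raw capacity to the accelerator logic, yet, per the abstract, still recovers 70% of the performance benefit of the full 8MB cache.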



    Published In

    cover image ACM Conferences
    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN:9781450343619
    DOI:10.1145/2925426
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Accelerator memory
    2. non-uniform cache architectures
    3. opportunity cost
    4. private local memory

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 311
    • Downloads (last 6 weeks): 34
    Reflects downloads up to 21 Nov 2024

    Cited By

    • (2023) "Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications." 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 718-730. DOI: 10.1109/HPCA56546.2023.10071089. Online publication date: Feb-2023.
    • (2021) "Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods." IEEE Transactions on Parallel and Distributed Systems, 32(8):2035-2048. DOI: 10.1109/TPDS.2021.3056045. Online publication date: 1-Aug-2021.
    • (2016) "Handling large data sets for high-performance embedded applications in heterogeneous systems-on-chip." Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pages 1-10. DOI: 10.1145/2968455.2968509. Online publication date: 1-Oct-2016.
    • (2016) "Invited - The case for embedded scalable platforms." Proceedings of the 53rd Annual Design Automation Conference, pages 1-6. DOI: 10.1145/2897937.2905018. Online publication date: 5-Jun-2016.