DOI: 10.1145/2925426.2926258
Public Access

Exploiting Private Local Memories to Reduce the Opportunity Cost of Accelerator Integration

Published: 01 June 2016

Abstract

We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds low complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage.
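The exposure mechanism described above (an idle accelerator lends its PLM banks to the cache substrate, and reclaims them, invalidating any borrowed lines, when acceleration resumes) can be sketched as a toy model. This is a minimal illustrative sketch under our own assumptions; the class and method names are hypothetical and not the paper's implementation:

```python
class RocaTile:
    """Toy model of a Roca-enabled accelerator tile: its private local
    memory (PLM) backs last-level-cache lines whenever the accelerator
    is idle. All names here are hypothetical, for illustration only."""

    def __init__(self, plm_lines):
        self.busy = False        # is the accelerator currently running?
        self.cache = {}          # tag -> line, held in the repurposed PLM
        self.plm_lines = plm_lines

    def start_accel(self):
        # Reclaiming the PLM for acceleration drops the borrowed lines.
        self.cache.clear()
        self.busy = True

    def stop_accel(self):
        # Once idle again, the PLM banks rejoin the cache substrate.
        self.busy = False

    def fill(self, tag, line):
        # The cache substrate may place lines only while the PLM is unused.
        if self.busy or len(self.cache) >= self.plm_lines:
            return False
        self.cache[tag] = line
        return True

    def lookup(self, tag):
        # Lookups simply miss while the accelerator owns its PLM.
        return None if self.busy else self.cache.get(tag)
```

In this sketch a reclaim just drops the borrowed lines, so the only extra state the tile carries for caching is the tag bookkeeping, consistent with the abstract's note that Roca's area overhead is almost entirely additional tag storage.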
We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Simulating non-accelerated multiprogrammed workloads on a 16-core system with a 2MB S-NUCA baseline, we show that a 6MB Roca-enabled last-level cache built upon typical accelerators (i.e., accelerators whose area is 66% memory) can, on average, realize 70% of the performance and 68% of the energy-efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.
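The area accounting behind these figures reduces to simple arithmetic: the same silicon area either buys plain cache directly, or buys accelerators whose memory fraction (66% for the typical accelerators above) becomes reusable LLC capacity. The function below is a hypothetical helper of our own, assuming a linear area-to-capacity model:

```python
def roca_llc_capacity_mb(baseline_mb, invested_area_mb, memory_fraction):
    """Effective last-level-cache capacity when area worth
    `invested_area_mb` of plain cache is instead spent on Roca-enabled
    accelerators whose PLMs make up `memory_fraction` of their area.
    Hypothetical back-of-envelope helper, not from the paper."""
    return baseline_mb + invested_area_mb * memory_fraction

# The abstract's configuration: a 2MB S-NUCA baseline plus area worth
# 6MB of plain cache.
plain_cache = roca_llc_capacity_mb(2, 6, 1.00)  # 8.0 MB same-area S-NUCA
roca_cache = roca_llc_capacity_mb(2, 6, 0.66)   # ~6 MB Roca-enabled LLC
```

Under this model the Roca configuration gives up roughly 2MB of raw capacity to the accelerator logic, yet, per the abstract, still recovers 70% of the performance benefit of the full 8MB cache.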



    Published In

    cover image ACM Conferences
    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN:9781450343619
    DOI:10.1145/2925426
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Accelerator memory
    2. non-uniform cache architectures
    3. opportunity cost
    4. private local memory

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 311
    • Downloads (last 6 weeks): 34
    Reflects downloads up to 21 Nov 2024

    Cited By

    • (2023) "Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications." 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 718-730. DOI: 10.1109/HPCA56546.2023.10071089. Online publication date: Feb-2023.
    • (2021) "Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods." IEEE Transactions on Parallel and Distributed Systems, 32(8):2035-2048. DOI: 10.1109/TPDS.2021.3056045. Online publication date: 1-Aug-2021.
    • (2016) "Handling large data sets for high-performance embedded applications in heterogeneous systems-on-chip." Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pages 1-10. DOI: 10.1145/2968455.2968509. Online publication date: 1-Oct-2016.
    • (2016) "Invited - The case for embedded scalable platforms." Proceedings of the 53rd Annual Design Automation Conference, pages 1-6. DOI: 10.1145/2897937.2905018. Online publication date: 5-Jun-2016.