DOI: 10.1145/1454115.1454156
research-article

Hybrid access-specific software cache techniques for the Cell BE architecture

Published: 25 October 2008

Abstract

Programming difficulty is one of the main impediments to the broad acceptance of multi-core systems that lack hardware support for transparent data transfer between local and global memories. A software cache is a robust way to give the user a transparent view of the memory architecture, but it can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes, high-locality and irregular. Our approach then steers each memory reference toward one of two cache structures, each optimized for its access pattern. These structures are designed to let high-level compiler optimizations aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that the optimized software-cache structures, combined with the proposed code optimizations, yield speedup factors of 3.5 to 8.4 over a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
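
The split between a high-locality path and an irregular path can be illustrated with a small host-side simulation in C. This is only a sketch under stated assumptions, not the authors' implementation: the line size, the cache geometry, the `memcpy`-based `dma_get`, and every function name below are hypothetical, and real Cell BE code would issue DMA transfers from the SPE instead. The point it shows is that the high-locality loop performs one lookup per cache line and then runs its inner loop with no cache code at all, while the irregular path pays a tag check on every access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All sizes and names here are illustrative assumptions. */
#define LINE_WORDS   32     /* words per cache line */
#define NUM_LINES    16     /* lines in the irregular-access cache */
#define GLOBAL_WORDS 1024   /* size of the simulated global memory */

int  global_mem[GLOBAL_WORDS];  /* stands in for main (global) memory */
long dma_transfers;             /* number of simulated line fetches */
long lookups;                   /* number of cache lookups / tag checks */

/* Simulated DMA: copy one line from global memory into a local buffer. */
void dma_get(int *dst, size_t line) {
    memcpy(dst, &global_mem[line * LINE_WORDS], sizeof(int) * LINE_WORDS);
    dma_transfers++;
}

/* Irregular path: direct-mapped cache with a tag check on every access. */
int    irr_data[NUM_LINES][LINE_WORDS];
size_t irr_tag[NUM_LINES];
int    irr_valid[NUM_LINES];

int irregular_read(size_t addr) {
    assert(addr < GLOBAL_WORDS);
    size_t line = addr / LINE_WORDS;
    size_t slot = line % NUM_LINES;
    lookups++;
    if (!irr_valid[slot] || irr_tag[slot] != line) {  /* miss: fetch line */
        dma_get(irr_data[slot], line);
        irr_tag[slot] = line;
        irr_valid[slot] = 1;
    }
    return irr_data[slot][addr % LINE_WORDS];
}

/* High-locality path: the lookup is hoisted to once per line, so the
 * innermost loop reads the local buffer directly with no cache overhead. */
long sum_high_locality(size_t start, size_t n) {
    int buf[LINE_WORDS];
    long sum = 0;
    assert(start % LINE_WORDS == 0 && n % LINE_WORDS == 0);
    assert(start + n <= GLOBAL_WORDS);
    for (size_t base = start; base < start + n; base += LINE_WORDS) {
        lookups++;                          /* one lookup per line */
        dma_get(buf, base / LINE_WORDS);
        for (size_t i = 0; i < LINE_WORDS; i++)
            sum += buf[i];                  /* plain local-store reads */
    }
    return sum;
}

/* Baseline: the same contiguous loop routed through the per-access cache. */
long sum_traditional(size_t start, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += irregular_read(start + i);
    return sum;
}

/* Reset the simulation: global_mem[i] = i, empty cache, zeroed counters. */
void reset_sim(void) {
    for (size_t i = 0; i < GLOBAL_WORDS; i++)
        global_mem[i] = (int)i;
    memset(irr_valid, 0, sizeof irr_valid);
    dma_transfers = 0;
    lookups = 0;
}
```

In this toy model, summing 256 contiguous words fetches 8 lines on either path, but the per-access cache performs 256 tag checks where the high-locality loop performs 8. The sketch does not model write-back, coherence, or DMA latency, and it says nothing about the speedups reported in the paper.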


Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques
October 2008
328 pages
ISBN:9781605582825
DOI:10.1145/1454115

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OpenMP
  2. compiler optimizations
  3. local memories
  4. memory classification
  5. software cache

Qualifiers

  • Research-article

Conference

PACT '08
Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Cited By

  • (2022) A new software cache structure on Sunway TaihuLight. The Journal of Supercomputing 78:4 (4779-4798). DOI: 10.1007/s11227-021-04056-0. Online publication date: 1-Mar-2022
  • (2018) Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators. Journal of Real-Time Image Processing 15:1 (73-92). DOI: 10.1007/s11554-015-0544-0. Online publication date: 1-Jun-2018
  • (2018) Data-Driven Thread Execution on Heterogeneous Processors. International Journal of Parallel Programming 46:2 (198-224). DOI: 10.1007/s10766-016-0486-6. Online publication date: 1-Apr-2018
  • (2016) Partitioning and Data Mapping in Reconfigurable Cache and Scratchpad Memory-Based Architectures. ACM Transactions on Design Automation of Electronic Systems 22:1 (1-25). DOI: 10.1145/2934680. Online publication date: 2-Sep-2016
  • (2015) Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. ACM SIGARCH Computer Architecture News 43:3S (720-732). DOI: 10.1145/2872887.2750411. Online publication date: 13-Jun-2015
  • (2015) Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. Proceedings of the 42nd Annual International Symposium on Computer Architecture (720-732). DOI: 10.1145/2749469.2750411. Online publication date: 13-Jun-2015
  • (2015) Hardware–Software Coherence Protocol for the Coexistence of Caches and Local Memories. IEEE Transactions on Computers 64:1 (152-165). DOI: 10.1109/TC.2013.194. Online publication date: Jan-2015
  • (2015) Caching Puts and Gets in a PGAS Language Runtime. Proceedings of the 2015 9th International Conference on Partitioned Global Address Space Programming Models (13-24). DOI: 10.1109/PGAS.2015.10. Online publication date: 16-Sep-2015
  • (2014) Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (231-242). DOI: 10.1109/ISPASS.2014.6844487. Online publication date: Mar-2014
  • (2014) Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators. Proceedings of the 2014 Conference on Design and Architectures for Signal and Image Processing (1-8). DOI: 10.1109/DASIP.2014.7115617. Online publication date: Oct-2014
