DOI: 10.1145/1375527.1375570
research-article

Orchestrating data transfer for the Cell/B.E. processor

Published: 07 June 2008

Abstract

In heterogeneous multi-core systems, such as the Cell/B.E. or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence between the local and global memories. It is software's responsibility to dynamically transfer the working set into the local memory when the total data set is too large to fit in the local memory. The data can be transferred through either a software controlled cache or a direct buffer. Such a software cache can maintain correctness and exploit reuse among references, especially when complicated aliasing or data dependences exist. However, the software cache introduces the extra overhead of cache lookup. Direct buffering, on the other hand, is fast but is limited by the compiler's ability to disambiguate memory references. It is desirable to judiciously use both methods, for irregular and regular accesses respectively. However, when a datum resides in both the software cache and the direct buffer, coherence problems occur.
In this paper, we propose a solution that combines compile-time analysis with runtime maintenance to address this coherence issue. We use compiler analysis to guarantee that the software cache is not accessed within the local live range of a direct buffer, and rely on runtime support to update values from or to the software cache at the entry or exit of the direct buffer. Further, we present a global data flow analysis design to eliminate redundant coherence maintenance, and we overlap computation with DMA accesses to reduce runtime overhead. We have implemented this method in our Single Source Compiler for Cell and have conducted experiments with the NAS OpenMP benchmarks. The results show that our method maintains correctness while keeping most of the opportunities for direct buffering. Execution performance improves by more than 3x compared to approaches using only the software cache. Furthermore, compile-time analysis eliminates 90% of the runtime updates, improving performance by a further 20%.
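
The entry/exit coherence step described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's implementation: the names SoftwareCache, DirectBuffer, and run_demo are hypothetical, the cache holds one word per line, and "DMA" is modeled as a plain list copy. It also assumes, as the compile-time analysis is said to guarantee, that the software cache is never accessed while the direct buffer is live.

```python
class SoftwareCache:
    """Toy write-back software cache (hypothetical; one word per line)."""

    def __init__(self, global_mem):
        self.mem = global_mem   # stands in for global memory
        self.lines = {}         # address -> value held in local memory
        self.dirty = set()      # addresses newer in the cache than in global memory
        self.lookups = 0        # every access pays a lookup

    def read(self, addr):
        self.lookups += 1
        if addr not in self.lines:              # miss: fetch from global memory
            self.lines[addr] = self.mem[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lookups += 1
        self.lines[addr] = value
        self.dirty.add(addr)                    # global memory is now stale


class DirectBuffer:
    """Toy direct buffer over mem[start:start+length] with runtime coherence."""

    def __init__(self, cache, start, length):
        self.cache, self.start, self.length = cache, start, length
        self.data = []

    def __enter__(self):
        end = self.start + self.length
        # Bulk "DMA get" of the whole region: no per-element lookup.
        self.data = self.cache.mem[self.start:end]
        # Entry maintenance: pull in values that are newer in the software cache.
        for addr in range(self.start, end):
            if addr in self.cache.dirty:
                self.data[addr - self.start] = self.cache.lines[addr]
        return self.data

    def __exit__(self, exc_type, exc, tb):
        end = self.start + self.length
        # Bulk "DMA put" back to global memory.
        self.cache.mem[self.start:end] = self.data
        # Exit maintenance: refresh cached copies so later lookups are not stale.
        for addr in range(self.start, end):
            if addr in self.cache.lines:
                self.cache.lines[addr] = self.data[addr - self.start]
                self.cache.dirty.discard(addr)
        return False


def run_demo():
    mem = list(range(8))                 # global memory: [0, 1, ..., 7]
    cache = SoftwareCache(mem)
    cache.write(2, 100)                  # irregular access goes through the cache
    with DirectBuffer(cache, 0, 4) as buf:
        for i in range(len(buf)):        # regular access uses the buffer directly
            buf[i] += 1
    return mem, cache.read(2)
```

Without the entry step the loop would increment the stale global value at address 2, and without the exit step a later cache read of address 2 would return the old cached value. The redundancy elimination mentioned in the abstract would correspond to skipping these loops when analysis shows no cached copy can exist; that part is not modeled here.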



    Published In

    ICS '08: Proceedings of the 22nd annual international conference on Supercomputing
    June 2008
    390 pages
    ISBN:9781605581583
    DOI:10.1145/1375527
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. memory coherence
    2. multi-core system
    3. software controlled cache

    Qualifiers

    • Research-article

    Conference

    ICS08
    ICS08: International Conference on Supercomputing
    June 7 - 12, 2008
    Island of Kos, Greece

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Cited By

    • (2018) Data-Driven Thread Execution on Heterogeneous Processors. International Journal of Parallel Programming 46:2 (198-224). DOI: 10.1007/s10766-016-0486-6. Online publication date: 1-Apr-2018
    • (2017) Memory Architectures. Handbook of Hardware/Software Codesign (1-31). DOI: 10.1007/978-94-017-7358-4_14-1. Online publication date: 8-Apr-2017
    • (2017) Memory Architectures. Handbook of Hardware/Software Codesign (411-441). DOI: 10.1007/978-94-017-7267-9_14. Online publication date: 27-Sep-2017
    • (2016) Partitioning and Data Mapping in Reconfigurable Cache and Scratchpad Memory--Based Architectures. ACM Transactions on Design Automation of Electronic Systems 22:1 (1-25). DOI: 10.1145/2934680. Online publication date: 2-Sep-2016
    • (2013) SPM-Sieve. Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (1-10). DOI: 10.5555/2555729.2555750. Online publication date: 29-Sep-2013
    • (2013) SPM-Sieve: A framework for assisting data partitioning in scratch pad memory based systems. 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES) (1-10). DOI: 10.1109/CASES.2013.6662527. Online publication date: Sep-2013
    • (2012) Integrating software caches with scratch pad memory. Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems (201-210). DOI: 10.1145/2380403.2380440. Online publication date: 7-Oct-2012
    • (2011) DDM-VMc. Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers (25-34). DOI: 10.1145/1944862.1944869. Online publication date: 24-Jan-2011
    • (2011) Automatic Loop Tiling for Direct Memory Access. Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (479-489). DOI: 10.1109/IPDPS.2011.53. Online publication date: 16-May-2011
    • (2010) A study of a software cache implementation of the OpenMP memory model for multicore and manycore architectures. Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II (341-352). DOI: 10.5555/1885276.1885310. Online publication date: 31-Aug-2010