research-article

COMIC: A Coherent Shared Memory Interface for Cell BE

Authors:

SangYong Han

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Pages 303 - 314

https://doi.org/10.1145/1454115.1454157

Published: 25 October 2008

Abstract

The Cell BE processor is a heterogeneous multicore that contains one PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Each SPE has a small software-managed local store. Applications must explicitly control all DMA transfers of code and data between the SPE local stores and the main memory, and they must perform any coherence actions required for data transferred. The need for explicit memory management, together with the limited size of the SPE local stores, makes it challenging to program the Cell BE and achieve high performance. In this paper, we present the design and implementation of our COMIC runtime system and its programming model. It provides the program with an illusion of a globally shared memory, in which the PPE and each of the SPEs can access any shared data item, without the programmer having to worry about where the data is, or how to obtain it. COMIC is implemented entirely in software with the aid of user-level libraries provided by the Cell SDK. For each read or write operation in SPE code, a COMIC runtime function is inserted to check whether the data is available in its local store, and to automatically fetch it if it is not. We propose a memory consistency model and a programming model for COMIC, in which the management of synchronization and coherence is centralized in the PPE. To characterize the effectiveness of the COMIC runtime system, we evaluate it with twelve OpenMP benchmark applications on a Cell BE system and an SMP-like homogeneous multicore (Xeon).
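
The access check described in the abstract (test whether the data is in the local store, fetch it if not) can be illustrated with a small simulation. The following is a minimal sketch in plain C, not COMIC's actual implementation: the names `shared_read`, `shared_write`, `shared_flush`, the direct-mapped layout, and the sizes are all hypothetical, and `memcpy` stands in for the DMA transfers that real SPE code would issue. The flush at a synchronization point is an assumption borrowed from common software-cache designs, not a description of the COMIC protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128u   /* hypothetical cache-line size */
#define NUM_LINES  16u    /* hypothetical number of local-store slots */
#define MAIN_BYTES 4096u  /* size of the simulated main memory */

uint8_t main_memory[MAIN_BYTES];  /* stands in for main memory */

typedef struct {
    int      valid;
    int      dirty;
    uint32_t tag;                 /* main-memory line number held in this slot */
    uint8_t  data[LINE_BYTES];
} cache_line;

cache_line local_store[NUM_LINES]; /* stands in for the SPE local store */
unsigned   miss_count;             /* number of lines fetched so far */

/* Copy a dirty line back to main memory (a DMA put on real hardware). */
void write_back(cache_line *line) {
    if (line->valid && line->dirty) {
        memcpy(&main_memory[line->tag * LINE_BYTES], line->data, LINE_BYTES);
        line->dirty = 0;
    }
}

/* Return the slot holding addr, fetching the line first if it is absent. */
cache_line *lookup(uint32_t addr) {
    assert(addr < MAIN_BYTES);
    uint32_t tag = addr / LINE_BYTES;
    cache_line *line = &local_store[tag % NUM_LINES];
    if (!line->valid || line->tag != tag) {
        write_back(line);  /* evict whatever currently occupies the slot */
        memcpy(line->data, &main_memory[tag * LINE_BYTES], LINE_BYTES);
        line->tag = tag;
        line->valid = 1;
        line->dirty = 0;
        miss_count++;
    }
    return line;
}

/* Check inserted before a read of shared data. */
uint8_t shared_read(uint32_t addr) {
    return lookup(addr)->data[addr % LINE_BYTES];
}

/* Check inserted before a write of shared data; the line is marked dirty. */
void shared_write(uint32_t addr, uint8_t value) {
    cache_line *line = lookup(addr);
    line->data[addr % LINE_BYTES] = value;
    line->dirty = 1;
}

/* Assumed behavior at a synchronization point: write back, then invalidate. */
void shared_flush(void) {
    for (unsigned i = 0; i < NUM_LINES; i++) {
        write_back(&local_store[i]);
        local_store[i].valid = 0;
    }
}
```

In this sketch a write stays in the local copy until `shared_flush` runs, so other cores would observe it only after the synchronization point. A direct-mapped layout keeps the hit path to one comparison, at the cost of conflict evictions when two hot lines map to the same slot.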

Index Terms

COMIC: A Coherent Shared Memory Interface for Cell BE
1. Hardware
  1. Communication hardware, interfaces and storage
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Virtual memory

Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

October 2008

328 pages

ISBN: 9781605582825

DOI: 10.1145/1454115

General Chair: Andreas Moshovos (University of Toronto, Canada)

Program Chairs: David Tarditi (Microsoft, USA), Kunle Olukotun (Stanford University, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '08: International Conference on Parallel Architectures and Compilation Techniques

October 25 - 29, 2008

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

Total Citations: 32
Total Downloads: 715
Downloads (Last 12 months): 17
Downloads (Last 6 weeks): 3

Reflects downloads up to 30 Nov 2024

Cited By

Tagliavini G, Haugou G, Marongiu A, Benini L (2018). Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators. Journal of Real-Time Image Processing 15:1 (73-92). DOI: 10.1007/s11554-015-0544-0. Online publication date: 1-Jun-2018.
https://dl.acm.org/doi/10.1007/s11554-015-0544-0
Chakraborty P, Panda P, Sen S (2016). Partitioning and Data Mapping in Reconfigurable Cache and Scratchpad Memory-Based Architectures. ACM Transactions on Design Automation of Electronic Systems 22:1 (1-25). DOI: 10.1145/2934680. Online publication date: 2-Sep-2016.
https://dl.acm.org/doi/10.1145/2934680
Cai J, Shrivastava A (2016). Software Coherence Management on Non-coherent Cache Multi-cores. Proceedings of the 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID) (397-402). DOI: 10.1109/VLSID.2016.70. Online publication date: 4-Jan-2016.
https://dl.acm.org/doi/10.1109/VLSID.2016.70
Dehyadegari M, Marongiu A, Kakoee M, Mohammadi S, Yazdani N, Benini L (2015). Architecture Support for Tightly-Coupled Multi-Core Clusters with Shared-Memory HW Accelerators. IEEE Transactions on Computers 64:8 (2132-2144). DOI: 10.1109/TC.2014.2360522. Online publication date: 1-Aug-2015.
https://dl.acm.org/doi/10.1109/TC.2014.2360522
Lim J, Kim H (2014). Design Space Exploration of Memory Model for Heterogeneous Computing. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (160-167). DOI: 10.1109/SBAC-PAD.2014.9. Online publication date: 22-Oct-2014.
https://dl.acm.org/doi/10.1109/SBAC-PAD.2014.9
Tagliavini G, Haugou G, Benini L (2014). Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators. Proceedings of the 2014 Conference on Design and Architectures for Signal and Image Processing (1-8). DOI: 10.1109/DASIP.2014.7115617. Online publication date: Oct-2014.
https://doi.org/10.1109/DASIP.2014.7115617
Papagiannis A, Nikolopoulos D (2014). Hybrid address spaces. Journal of Systems and Software 97:C (47-64). DOI: 10.1016/j.jss.2014.06.058. Online publication date: 1-Oct-2014.
https://dl.acm.org/doi/10.1016/j.jss.2014.06.058
Pinto C, Benini L (2014). A Novel Object-Oriented Software Cache for Scratchpad-Based Multi-Core Clusters. Journal of Signal Processing Systems 77:1-2 (77-93). DOI: 10.1007/s11265-014-0881-4. Online publication date: 1-Oct-2014.
https://dl.acm.org/doi/10.1007/s11265-014-0881-4
Pinto C, Benini L (2013). A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters. Proceedings of the 2013 IEEE 24th International Conference on Application-specific Systems, Architectures and Processors (ASAP) (281-288). DOI: 10.1109/ASAP.2013.6567591. Online publication date: 5-Jun-2013.
https://dl.acm.org/doi/10.1109/ASAP.2013.6567591
Azevedo A, Juurlink B (2012). A Multidimensional Software Cache for Scratchpad-Based Systems. Innovations in Embedded and Real-Time Systems Engineering for Communication (59-78). DOI: 10.4018/978-1-4666-0912-9.ch004. Online publication date: 2012.
https://doi.org/10.4018/978-1-4666-0912-9.ch004
