
Towards an Effective Unified Programming Model for Many-Cores

2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

Ana Lucia Varbanescu∗†, Pieter Hijma∗, Rob van Nieuwpoort∗‡ and Henri Bal∗
∗ Computing Systems Group, Vrije Universiteit Amsterdam, The Netherlands
† Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands
‡ ASTRON, Dwingeloo, The Netherlands
e-mail: {analucia,rob,hphijma,bal}@cs.vu.nl

Abstract—Building an effective programming model for many-core processors is challenging. On the one hand, the increasing variety of platforms and their specific programming models force users to take a hardware-centric approach not only for implementing parallel applications, but also for designing them. This approach diminishes portability and, eventually, limits performance. On the other hand, to effectively cope with the increased number of large-scale workloads that require parallelization, a portable, application-centric programming model is desirable. Such a model enables programmers to focus first on extracting and exploiting parallelism from their applications, as opposed to generating parallelism for specific hardware, and only second on platform-specific implementation and optimizations. In this paper, we first present a survey of programming models designed for programming three families of many-cores: general purpose many-cores (GPMCs), graphics processing units (GPUs), and the Cell/B.E. We analyze the usability of these models, their ability to improve platform programmability, and the specific features that contribute to this improvement. Next, we discuss two types of generic models: parallelism-centric and application-centric. We also analyze their features and impact on platform programmability. Based on this analysis, we recommend two application-centric models (OmpSs and OpenCL) as promising candidates for a unified programming model for many-cores and we discuss potential enhancements for them.

I. INTRODUCTION

Modern multi-core processors and many-core accelerators (simply, many-cores), such as graphics processing units (GPUs) and the Cell processor, offer unprecedented performance levels by exploiting hardware parallelism on a large scale. Inevitably, they are seen as a solution to the performance problems that arise in many applications. This assumption only holds if applications respond with comparable parallelism, a non-trivial task for most types of workloads.

Programming many-cores is difficult, as it is a problem with multiple constraints: we want applications to deliver great performance, to be easy to program, and to be portable between architectures. The perceived levels of these parameters - performance, productivity, and portability - combine into a measure of platform programmability, which is a good indicator for the success of an architecture. As matters are only getting worse with the increased variety of many-cores (see Section II for a comprehensive overview), both vendors and academia provide platform-specific programming models and tools, aiming to improve platform programmability (i.e., to make these architectures easier to program).

This paper presents a survey of programming models used for general purpose and high-performance computing on many-core architectures. To evaluate these models, we focus on the model features used to improve platform programmability (Section III). Our discussion has two parts.
First, we present a set of representative hardware-centric models (Section IV), which aim to make many of the complicated low-level architectural features of many-cores transparent to the users. These models tackle mostly application development (aiming to improve usability by offering an easy-to-use development environment), and do not provide enough support for parallel application design. Further, we analyze several representative generic programming models - i.e., models that can be successfully used for more than one many-core family (Section V). We discuss both parallelism-centric and application-centric models, and we show that these models offer enough support for application development, and provide additional support for expressing and exploiting parallelism at design time.

Each one of the twenty models we survey, with its qualities and drawbacks, has a positive impact on platform programmability. Still, most of these models are scarcely ever used. We believe this poor adoption of high-level programming models is due, to a large extent, to the multitude of models available. It is difficult for potential users to understand the limitations of each model, the differences between models, or the impact a model has on a specific workload. Therefore, programmers use the native models which, although cumbersome and low-level, guarantee full flexibility. To counter this approach, we propose to choose the most promising models and identify how they can be transformed into a generic application-centric model for many-core applications. Based on their features, user base, and current and projected development status, we choose two good candidates: OpenCL [1] and OmpSs [2], [3]. We compare these models in more detail (Section VI), and we make a couple of recommendations on what they can improve in terms of programmability impact and generality. We conclude (Section VII) that adopting these "candidate" models pays off, as investments in them speed up their improvement and limit the use of the lower-level native and/or hardware-centric models.

II. A COMPARATIVE ANALYSIS OF MANY-CORES

In this section we discuss three classes of many-core devices used for high performance computing: multi-core processors (1), graphical processing units (GPUs), and the Cell Broadband Engine (Cell/B.E.).

General purpose multi-cores (GPMCs) have been replacing traditional single-core CPUs in both personal computers and servers since 2006. GPMCs are homogeneous platforms with complex cores, based on traditional processor architectures. They are typically shared-memory architectures, with multiple levels of caches. We emphasize the diversity of the spectrum of GPMCs by discussing multi-cores from different vendors: the Intel Nehalem EX (Xeon 7500 series), the AMD Magny Cours, the IBM POWER7, and the Sun Niagara 2 (UltraSPARC T2+). Many-core programming models should be retargetable to all these architectures.

Since the first HPC applications ran on a GPU tens of times faster than on the CPUs of the time (2007), GPUs have constantly increased their computation abilities. Nowadays, state-of-the-art GPU architectures target HPC markets directly, by adding computation-only features to their graphics pipelines. GPUs are shared memory machines, with a complex memory hierarchy, combining different types of caches and scratchpads at each level. GPUs are used as accelerators, which requires very little flexibility in the hardware; in turn, this allows for architectures that provide high memory throughput and computation capabilities. The PCI Express bus is used to connect a GPU to a host system. We discuss GPUs from NVIDIA (the GF100 or Fermi architecture) and AMD/ATI (the Radeon HD 5870 or Cypress).

Finally, we discuss the Cell/B.E., a 9-core heterogeneous processor (1 PPE and 8 SPEs) with a very basic programming model, in which a lot of architecture-related optimizations must be done by the programmer. The eight SPEs are dedicated to high-speed processing, have their own local scratchpad memories, and access the main memory by explicit DMAs. The main system memory is directly accessible to the PPE (which also has its private L1 and an L2 shared with the SPEs). The Cell can be used both as a stand-alone processor and as an accelerator (2).

(1) We generically call all these platforms "many-cores" due to their relatively large numbers of hardware threads. However, we preserve the name "multi-cores" as traditional for general-purpose many-cores.
(2) The Cell/B.E. is the main processor in the PlayStation 3; RoadRunner (http://www.lanl.gov/roadrunner/) uses Cells as accelerators next to AMD Opterons as main processors.

TABLE I: KEY COMPUTING AND MEMORY PROPERTIES OF MANY-CORE PROCESSORS
Intel Nehalem EX: 8 cores / 16 threads; 4-wide vectors; 64 ALUs; OS scheduling; 2 parallelism levels; shared memory space; R/W access; transparent L1-L3 caches.
AMD Magny Cours: 12 cores / 12 threads; 4-wide vectors; 48 ALUs; OS scheduling; 2 parallelism levels; shared memory space; R/W access; transparent L1-L3 caches.
IBM POWER7: 8 cores / 64 threads; 4-wide vectors; 256 ALUs; OS scheduling; 2 parallelism levels; shared memory space; R/W access; transparent L1-L3 caches.
Sun Niagara II: 8 cores / 64 threads; no vectors; 64 ALUs; OS scheduling; 1 parallelism level; shared memory space; R/W access; transparent L1-L2 caches.
NVIDIA GF100: (16x32)xSIMT = 10K-100K threads; no vectors; 512 ALUs; HW scheduling; 3 parallelism levels; shared, device, (host) memory spaces; R/W access (transparent L1-L2); app-controlled shared store.
AMD HD5870: (20x16)xSIMT = 10K-100K threads; 4-wide vectors; 1600 ALUs; HW scheduling; 4 parallelism levels; shared, device, (host) memory spaces; R/W access (transparent L1-L2); app-controlled shared store.
Cell/B.E.: 9 cores / 10 threads; 4-wide vectors; 36 ALUs; application scheduling; 5 parallelism levels; SPE (local store) and PPE memory spaces; DMA and R/W access; app-controlled local store.

Table I presents some key computing and memory properties of many-core platforms. Note the increase in the number of parallelism levels: programming models can handle these explicitly or implicitly, trading performance for programmability (see Section IV). Further, note that many-cores have increasingly complex memory and caching hierarchies. This happens because they have to compensate for the inherent decrease in memory bandwidth per core that comes with the increase in the number of cores and ALUs. One of the key differences between multi-core CPUs on the one hand, and the GPUs and the Cell/B.E. on the other, is that the memory hierarchy is more exposed, and often explicitly handled, in the latter. This has a clear impact on the programming effort that is needed to achieve good performance.

III. EVALUATING PROGRAMMING MODELS

We start from the assumption that programming models are built to improve platform programmability. Therefore, this section defines platform programmability and its components, and presents a list of features used to evaluate many-core programming models in terms of usability and programmability impact.

A. Programmability

Due to the multiple levels of parallelism many-core platforms offer, their peak performance is only achievable if applications are able to extract and express enough layers of parallelism, on par with those offered by the hardware platform. Platform programmability is a measure of how easy it is for (generic) applications to express enough parallelism to match what the hardware offers.
Typically, the native programming model of a platform exposes its "bare" programmability, as it provides users with the means to express parallelism in a platform-specific form, and it imposes only minor limitations on achievable performance. Higher-level programming models aim to improve programmability by (1) offering users easier abstractions for designing and building parallelism, and (2) building better back-end components (i.e., compilers and runtime systems) to minimize the impact on performance.

We judge the impact a programming model can have on platform programmability as a combination of its productivity, portability, and performance. Productivity is a measure of the development effort (typically, the time spent by the user when designing and developing the application). A model's portability indicates the potential re-usability of the solutions built using it. A model's performance indicates the achievable performance of a solution; from the model's perspective, the performance potential is usually measured as efficiency (i.e., how much of the platform's peak performance is achievable when using the chosen model). Note that productivity, portability, and performance are strongly interconnected. For example, obtaining maximum performance might increase development time, thus decreasing productivity; similarly, a highly portable solution typically uses no hardware-specific optimizations, thus limiting achievable performance. Therefore, a programming model has a positive impact on platform programmability if it can increase productivity and portability without (negatively) affecting the achievable performance.

B. Features

The features of a programming model are essential for its adoption: while the availability of features is not sufficient to guarantee success, the lack thereof is definitely a showstopper. There are three categories of features we use to evaluate the programming models: usability, design support, and implementation support. We further explain the features we consider representative for each category, and show how they influence platform programmability.

Usability: We include here a set of practical features that programming models offer; these features are linked to the ease of use of a programming model, ultimately aiming to increase programmers' productivity.

Class: We separate models into three classes: parallelism-centric, hardware-centric, and application-centric. Parallelism-centric models usually provide model-specific constructs to parallelize an application; to be parallelized, applications have to be altered in such a way that they use these constructs. Hardware-centric models focus on providing a simplified interface for exploiting platform-specific parallelism. Finally, application-centric models help programmers to design an inherently parallel program. These models typically provide abstractions for parallelism.

Initial problem specification: Programming models require different ways of exposing the problem to be solved. We distinguish here models that start from the sequential code (typically enhanced with parallelism by the programmer), models that start from an algorithm and apply model-specific parallelization (this usually requires finding a new, parallel algorithm), and models that start from an application specification. Note that for hierarchical models (e.g., models that use a host-accelerator(s) structure), it is typical that the problem specification differs between layers.
Actions: The actions to be performed to transform the problem specification into a parallel solution, as well as the way they are done (by the user or (semi-)automated), are essential in increasing productivity. Typical actions are specific parallelization, where the user parallelizes the given algorithm to fit the target model, loop-level parallelization, usually done by the compiler, kernel isolation and fine-grain parallelization, where the user needs to isolate the highly parallel regions in the code and exploit loop-like parallelism within the model space, and data clustering, where users specify collections of data to be processed in parallel, and a compiler or runtime system uses these elements as concurrency units.

Impact: The problem specification is important for productivity because using the right initial description helps with correct solution design and minimizes the time spent in design, thus making the process more efficient. For example, if sequential code is not available for a given application, choosing a model that requires sequential code to design the parallel version is counter-productive. Furthermore, models that require a detailed application specification and complex actions to be performed by the user have good performance potential, but their impact on productivity is negative; by contrast, models that rely on automated transformations of the application specification typically show both improved productivity and performance limitations. The class of a model is only indirectly linked to usability: users are responsible for choosing a model that suits their knowledge level (models based on known programming languages or familiar programming constructs, as well as models that use libraries of pre-optimized components, have proven to be more productive than models that use new abstractions and languages).

Design support: There are four features we evaluate to determine if programming models offer support for parallel application design.

Algorithm view: The algorithm view [4] of a programming model can be fragmented or global. The parallelism constructs of a fragmented-view model are usually explicit, and interleaved with the processing constructs. In this case, the processing appears as fragmented, as in the classical example of MPI [5]. In contrast, a global-view model typically uses implicit communication and synchronization constructs, resulting in little interference with the processing, thus preserving the global view of the algorithm. A typical example is High Performance Fortran [6].

Parallelism: A model can support multiple types of parallelism. At a lower level, models can offer support for SIMD (single instruction, multiple data - typically known as vectorization) or SIMT (single instruction, multiple threads - also known as lock-step execution), targeting fine-grain parallelism. At a higher level, models can offer SPMD or MPMD [7] (single/multiple program, multiple data), targeting coarse-grain parallelism. Finally, at the highest level, models can offer one or several patterns for both SPMD and MPMD, such as divide-and-conquer, map-reduce, pipelining, or streaming.

Concurrency units and granularity: A programming model can define its own concurrency units and provide mechanisms to control their granularity. Concurrency units can range from data items (in flat, data-parallelism oriented models) to functions and/or processes; models can also have hierarchies of different concurrency units.
Furthermore, models that can alter the granularity of their concurrency units (more or less dynamically) are said to have "granularity control". Available models range from those which have no abstractions for granularity control other than changing the program (the majority), to programming models that offer automatic and/or dynamic granularity control.

Data layout: A model can provide ways to explicitly specify a data layout and/or a data distribution among the concurrency units. This improves the control users have to limit unneeded communication and to influence future mapping and scheduling decisions.

Impact: The overall design support of a model has an important impact on productivity, portability, and performance. The algorithm view influences portability and productivity: global-view algorithms are easier to reason about and simpler to port to different platforms than fragmented-view ones; in turn, fragmented models might simplify debugging and implementation. The supported types of parallelism are directly linked to productivity: a good match of the application parallelism with the programming model parallelism leads to a programmability boost, while a mismatch typically requires a lot of empirical changes to the algorithm, decreasing productivity. The control over the granularity of the concurrency units can contribute to performance, but also to portability. A model that allows explicit granularity definition without changing the program may contribute to performance; in addition, applications can be ported to an architecture which needs more fine-grained or coarse-grained parallelism. Finally, data distribution contributes to productivity and portability, as a model with explicit data distribution is easier to reason about, debug, and tune for different platforms.

Implementation support: We discuss here four features that offer implementation-level support and impact overall platform programmability.

Mapping and scheduling: By mapping and scheduling, we refer to the way the concurrency units are "placed" on the platform resources and executed to improve concurrency. Models typically choose one of the following solutions: (1) require users to make an explicit mapping, (2) determine the mappings automatically or even dynamically (using their own runtime system), or (3) rely on either the Operating System (OS) or the hardware schedulers for a "default" mapping.

Data transfers: Due to their complex memory hierarchies, many-cores need data transfers between memory levels. Therefore, applications need to program transfers between concurrency units (and, eventually, concurrency layers). Programming models can choose to require transfers to be made explicitly or to deal with them implicitly (i.e., transparently). As memory hierarchies vary a lot between platforms, it is likely that a hybrid approach - where some transfers are explicit, while the rest are taken care of by either the hardware (shared memories) or a runtime system - will prevail.

Communication and synchronization: An important part of application development deals with the communication and synchronization between concurrency units. The traditional alternatives are implicit (i.e., transparent to the user) and explicit. Implicit communication and synchronization is essentially solved by the model, thus avoiding typical parallelism problems (like deadlocks or race conditions). Models that choose an explicit approach require users to program the communication and/or synchronization explicitly.
Hybrid approaches are possible, and fairly common.

Optimizations: Programming models can simplify certain types of optimizations. If such optimizations can be performed automatically (without users tweaking the code), their positive influence on performance translates into a positive impact on programmability. However, optimizations are typically low-level and platform-specific (see memory coalescing for GPUs and SIMD extensions for the Cell/B.E. or the GPMCs), requiring the user's intervention and diminishing solution portability and productivity. Among the models that require users' intervention for optimization, we distinguish those models that encourage users to freely apply low-level optimizations (by simply altering the code) from those which limit or even obstruct this action, mainly because such interventions on the code lower the ability of the model's analyzers to parse it and extract other parameters and/or parallelization opportunities.

Impact: Programming models that offer enhanced implementation support for parallel applications on many-cores have to cover these features. Furthermore, the way these features are reflected by models impacts programmability. For example, using explicit mapping and scheduling increases solution complexity, lowering productivity; the implicit alternative typically affects performance. The way data transfers between concurrency units (and, eventually, concurrency layers) are done impacts both performance and productivity. Requiring data transfers to be made explicitly affects portability and increases the complexity of the solution, while making them implicit without performance penalties requires the programming model to know the mapping of the concurrency units.

IV. HARDWARE-CENTRIC PROGRAMMING MODELS FOR MANY-CORES

Hardware-centric programming models aim to replace the native platform programming (typically supported by a low-level model) with higher-level, user-friendly solutions. In this section, we present a set of selected models designed to address the challenges posed by the three families of many-cores we target.

A. GPMC Programming Models

The native parallelism model of GPMCs is symmetrical multithreading, as we deal with homogeneous architectures. GPMCs target coarse-grain MPMD or SPMD workloads. Programmers cannot control scheduling and mapping through the programming model - these are typically done by the operating system. Memory consistency and contention are additional problems: consistency may impact solution correctness, while memory contention often limits performance. The hardware features with the most influence on platform performance are the multiple hardware threads, the caches, the memory hierarchy, and the OS-based scheduler. GPMCs are best suited for coarse-grain parallel workloads, i.e., applications consisting of multiple complex, yet independent tasks, or massive data-parallel processing.

Intel Threading Building Blocks: Intel Threading Building Blocks (TBB [8]) is a C++ library with a strong focus on data parallelism. It centers around the concept of tasks rather than threads, where tasks perform work on data in parallel. The library provides scalability by means of recursive subdivision of data into tasks, and by work-stealing among the threads that execute those tasks. The library has three types of building blocks: built-in constructs like parallel_reduce that can be composed recursively, thread-safe container classes, and locking classes (although the usage of the latter is discouraged). TBB can be used to parallelize parts of an existing C++ program.
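As a concrete flavor of this style, the following minimal sketch (our own illustration, not taken from the paper or the TBB documentation) sums a vector with parallel_reduce; TBB recursively splits the iteration range into tasks and its scheduler steals work between threads:

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    // Illustrative example: parallel sum of a vector using TBB's
    // built-in parallel_reduce construct.
    double parallel_sum(const std::vector<double>& data) {
        return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, data.size()),
            0.0,                                      // identity value
            [&](const tbb::blocked_range<size_t>& r, double partial) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    partial += data[i];               // per-subrange work
                return partial;
            },
            [](double a, double b) { return a + b; }  // combine partial results
        );
    }

Note how the data subdivision, mapping, and scheduling are entirely handled by the library; the programmer only expresses the reduction.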
TBB provides parallelism at a level of abstraction above threads, with support for concurrent container classes and reduction constructs. However, it is still a flexible framework in which programmers can also use lower-level constructs. With the predefined constructs, TBB offers a global view, and ensures that algorithms remain general. TBB is rather flexible and offers both task and data parallelism, and good control over task creation and granularity. TBB does not offer mechanisms to specify data layout, task mapping, or data transfers, but it is possible to control task scheduling. It is also possible to perform communication and synchronization by hand, but this is not recommended. The model allows other optimizations at a later stage.

Ct: C for throughput (Ct [9]) extends C++ with support for nested data parallelism. The programming model centers around a special throughput vector (TVEC), which allows more irregular data parallelism than flat data-parallel models. The model targets GPMCs, and aims to be scalable as the number of cores on a chip increases. A TVEC represents a single-assignment vector that can be constructed to hold values of various scalar types and, perhaps in the future, also structures. A TVEC can also contain nested TVECs, achieving nested data parallelism. TVECs are constructed in a separate memory space and garbage collected by the Ct Memory Manager. The model is limited to nested data parallelism, but it offers a global view of the computation. Therefore, the algorithms that conform to this model are general enough to be portable to other families of architectures. The model does not have the concept of tasks, but it does allow specifying the data layout with the help of the TVEC construct. Ct is high-level, which means that it does not offer control over mapping, scheduling, data transfers, communication, or synchronization. It also obstructs other optimizations. Ct has been discontinued, as it has been reused and rebranded as the ArBB model.

Intel Array Building Blocks: Array Building Blocks (ArBB) is a continuation of C for throughput (Ct [9]) with some features from RapidMind [10]. It extends C++ with a library, a JIT compiler, and ArBB-specific constructs such as special for and while statements. The programming model centers around special data containers with support for regular data (dense or sparse) and irregular data. Through the use of these data types, ArBB supports data and nested data parallelism. Operations on these data types, such as reductions, are logically performed in a separate memory space and garbage collected to obtain a deterministic programming model. ArBB targets GPMCs using vector instructions, and aims to be scalable as the number of cores on a chip increases or vector instructions become wider. The model is well suited to parallelize data-intensive parts with numerical computations in an existing C++ program. There are special copy-in and copy-out instructions to logically define the data in the ArBB memory space. This does not mean that the data is actually copied, but from the programming model's point of view this results in a fragmented view of algorithms. Users have no control over task granularity, as the dynamic compilation phase of ArBB operations takes care of this. Users have control over data layout by specifying their data structures in a different way.
Summary: All three models preserve the native parallelism models of GPMCs - ArBB and Ct focus on SPMD, as they target data-parallel workloads, while TBB allows for more generality by supporting both SPMD and MPMD, and focusing on task-parallel workloads. All models simplify both the data distribution and the communication/synchronization between threads (within the limits of the types of workloads they consider). None of the models has a clear definition of a concurrency unit and its granularity - arbitrarily sized data elements (ArBB) or functions (TBB) are used instead. None of the models deals with the memory hierarchy or cache optimizations - these are left to the user and/or the compiler. Mapping is offloaded to a runtime system (which maps the model's concurrency units on threads), while scheduling is offloaded to the OS (which maps virtual threads on hardware cores/threads). Overall, TBB offers more flexibility in expressing a parallel solution, but it has no support for parallel solution design.

B. GPGPU Programming Models

Programming GPUs combines coarse-level parallelism (offloading) and fine-grained parallelism (massive multithreading): the host CPU offloads the data-parallel kernels as large collections of threads on the GPU. The native parallelism models for the GPUs themselves are SIMD/SIMT (3), with medium and low granularity. Note that due to the time-consuming offloading of the kernels from the host (CPU) to the GPU, kernels of very low granularity are only suitable for these architectures in large numbers. For GPUs, the architectural features with the highest impact on programmability are: the very large number of threads and the dynamically partitioned register counts (i.e., registers are dynamically partitioned between running threads, resulting in a trade-off between using more registers per thread, or more threads with fewer registers per thread), the hardware-based mapping and scheduling, the complex memory hierarchy and its parameters (sizes, latencies, and bandwidths), and the different memory spaces (host and device). GPUs are typically used for highly data-parallel workloads, where hundreds to thousands of threads can compute concurrently.

(3) SIMT stands for "Single Instruction, Multiple Threads"; it can be seen as a finer-grained version of SIMD.

NVIDIA CUDA: NVIDIA's native programming model is called CUDA [11]. Based on C, CUDA uses language extensions for separating device (i.e., GPU) code and data from host code and data, as well as for launching CUDA kernels. An advantage of NVIDIA hardware and CUDA is that the application does not have to do vectorization, since all cores have their own address generation units. All data parallelism is expressed by using threads. The programmer has to explicitly group threads in thread blocks; all threads in a block run on the same streaming multiprocessor. Thread blocks are in turn grouped in a grid. While considered a fairly simple programming model, CUDA is still a low-level tool, and it requires a lot of programmer insight and experience to achieve impressive performance results. With CUDA, one essentially subdivides the work explicitly over the streaming multiprocessors, and has to define correct and suitable grid configurations. In addition, the programmer has to consider many details such as memory coalescing, the texture cache, etc. CUDA is a global-algorithm-view model, where kernels need to be separated (using special constructs) from the host code and explicitly launched.
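The following minimal CUDA C++ sketch (our own illustration, with assumed names such as scale) shows this structure: a kernel marked with __global__, a host-side launch that picks the thread-block and grid sizes, and the explicit host-device transfers:

    // Illustrative CUDA sketch: one thread processes one data element.
    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) data[i] *= factor;
    }

    void scale_on_gpu(float* host_data, int n) {
        float* dev_data;
        cudaMalloc(&dev_data, n * sizeof(float));
        // Explicit host-to-device transfer over PCI Express.
        cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

        int threadsPerBlock = 256;                       // block size chosen by the programmer
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(dev_data, 2.0f, n);

        cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev_data);
    }

Choosing the block and grid configuration, and keeping memory accesses coalesced, remain the programmer's responsibility.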
The model uses kernels as the main concurrency unit for the overall application, and data elements for the SIMT/SIMD parallelization of the kernels themselves (finer granularity). The application data layout is also specified in two layers: the data structures used by the kernels are simply moved when and where they are needed; for the kernels themselves, the data layout results from the access patterns of the threads. Mapping and scheduling are performed by the hardware, and the data transfers from host to device are explicit. Low-level optimizations are left to the user.

Stanford Brook / AMD Brook+: In terms of workloads, ATI GPUs target applications similar to those targeted by NVIDIA's processors: highly data-parallel applications, with medium and low granularity. Therefore, choosing between the two becomes a matter of performance and ease of programming. For high-level programming, ATI adopted Brook, which was originally developed at Stanford [12]. ATI's extended version is called Brook+ [13]. In contrast to CUDA, Brook+ offers a programming model that is based on streaming. Therefore, a paradigm shift is needed to port CPU applications to Brook+, making this a more difficult task than porting applications to CUDA. With Brook+, the programmer has to do the vectorization, unlike with NVIDIA GPUs. Brook+ provides a feature called swizzling, which is used to select parts of vector registers in arithmetic operations, improving readability. In our experience, the high-level Brook+ model does not achieve acceptable performance. The low-level CAL model that AMD also provides does, but it is difficult to use. Recently, AMD adopted OpenCL as a high-level programming model for their GPUs, but also for the CPUs, and therefore Brook will most likely be discontinued. Brook is a fragmented-view model, which uses explicit data transfers between host and device, and implicit data layouts and transfers on the device itself. The concurrency units are kernels (coarse granularity) and stream elements (fine granularity). These are both controllable through the code, using language constructs. Mapping and scheduling are implicit. The model allows low-level optimizations.

PGI Fortran & C Accelerator Programming Model: Using PGI's Accelerator compilers [14], programmers can accelerate applications by adding OpenMP-like compiler directives to existing high-level Fortran and C programs. In this respect, PGI's programming model is similar to PathScale's. Compute-intensive kernels are offloaded to the GPU, using two levels of explicit parallelism: an outer forall loop, and an inner synchronous SIMD loop level. Based on sequential code reuse, PGI Accelerator is a global-view model which uses pragmas to separate the potential kernels (its main concurrency units). Data layouts and transfers are both implicit, as kernels are automatically offloaded and parallelized. As the model uses CUDA as a back-end, most of the implementation features - hardware-based mapping and synchronization, SIMT-based granularity, etc. - are inherited from CUDA. Still, the model does not allow for hand-tuning or architecture-specific optimizations on potential kernel code, as these interfere with the ability of the compiler to automatically parallelize the kernel loops.

PathScale ENZO: The PathScale compiler company (4) has recently released a GPU software suite called ENZO. Although the programming model is device-independent, it initially targets NVIDIA GPUs and GPMCs, as well as hybrid combinations of the two.
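To give an impression of this directive-based style, shared by the PGI Accelerator model and ENZO, the following sketch is our own illustration of annotating a loop nest for offloading; the pragma spelling is indicative only and does not reproduce the exact syntax of either product:

    // Illustrative directive-based offload: the compiler is asked to turn the
    // annotated loop into a GPU kernel and to generate the data transfers.
    void saxpy(int n, float a, const float* x, float* y) {
        #pragma acc region   // hypothetical/illustrative pragma spelling
        {
            for (int i = 0; i < n; ++i)    // outer parallel (forall) loop
                y[i] = a * x[i] + y[i];    // inner work mapped to SIMD/SIMT lanes
        }
    }

The sequential source is preserved; the price is that architecture-specific optimizations inside the annotated region are largely out of the programmer's hands.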
ENZO comes with its own hardware device drivers, which focus on computing and do not support graphics. This way, PathScale expects to achieve better performance than CUDA. With ENZO, programmers annotate their code with directives to indicate which code regions should be parallelized on the GPU. The C-with-annotations approach is similar to OpenMP's and PGI's Accelerator models, preserving their advantages (e.g., portability, a relatively high level of abstraction, and starting from sequential code), as well as their drawbacks (e.g., important architecture-specific optimizations cannot be expressed). ENZO is a global-view model, using pragmas for kernel granularity control and mapping/scheduling. It relies on hardware-based mapping and scheduling at kernel level. Data transfers are automatically generated and performed, and there are no special constructs for data layouts. To compensate for the lack of low-level optimizations, the model uses pre-optimized libraries, code generators, and auto-tuning to improve kernel performance.

(4) See http://www.pathscale.com.

Summary: All four GPU programming models are very close to the architecture; they all approach application parallelization by identifying and offloading the potential kernels to be accelerated. The way the offload is performed differs: CUDA and Brook require users to code this offload explicitly, while PGI and ENZO work by isolating kernels (with user-inserted pragmas) from available sequential code. Further, the models tackle the massive parallelism in the kernels differently. PGI and ENZO use a compiler to extract and exploit the fine-grain concurrency units; CUDA requires the programmers to do so manually, while Brook uses streams to help the user identify the fine-grain concurrency. None of the models allows for specifying granularity requirements - these are guessed and tuned by the user. Data distribution is simplified, but not optimized - i.e., the models do not tune the virtual organization of threads (blocks and grids) to match the application requirements. Mapping and scheduling are left to the hardware (impossible otherwise). All models but CUDA (the native model) simplify the offloading procedures, minimizing the problem of different memory spaces. The effective use of the accelerator memory hierarchy remains the duty of the programmer.

C. Cell/B.E. Programming Models

Cell/B.E. programming is based on a simple multi-threading model: the PPE spawns threads that execute asynchronously on the SPEs, until interaction and/or synchronization is required. The communication between the SPEs and the PPE is bidirectional and on-demand. All data distribution, task scheduling, communication, and processing optimizations are to be performed "manually" by the user/application. As a result, programming the Cell/B.E., and especially optimizing code for it, are notoriously difficult. The hardware features with the highest impact on programmability are: the heterogeneous cores, the inter-core communication and synchronization, the manual work and data distribution, the task scheduling (including thread management), the SIMD intrinsics, and the DMA operations. Given that the architecture supports MPMD, SPMD, and SIMD parallelism, it is suitable for both coarse- and fine-grain parallel applications, and especially for workloads that exhibit nested parallelism (i.e., combinations of MPMD/SPMD and SIMD).
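As a flavor of what "manual" means here, the following simplified sketch (our own illustration, assuming the MFC DMA intrinsics from the SPE toolchain) shows an SPE-side explicit transfer of one buffer from main memory into the local store and back:

    #include <spu_mfcio.h>   // MFC (Memory Flow Controller) DMA intrinsics

    #define CHUNK 4096
    // Local-store buffer; DMA requires aligned, size-restricted transfers.
    static float buffer[CHUNK / sizeof(float)] __attribute__((aligned(128)));

    void process_chunk(unsigned long long ea_in, unsigned long long ea_out) {
        unsigned int tag = 1;                       // DMA tag chosen by the programmer
        mfc_get(buffer, ea_in, CHUNK, tag, 0, 0);   // main memory -> local store
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                  // wait for the transfer to complete

        for (unsigned i = 0; i < CHUNK / sizeof(float); i++)
            buffer[i] *= 2.0f;                      // the actual (vectorizable) computation

        mfc_put(buffer, ea_out, CHUNK, tag, 0, 0);  // local store -> main memory
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }

Double buffering, vectorization with SIMD intrinsics, and the PPE-side thread management come on top of this, which is why the higher-level models below try to hide as much of it as possible.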
IBM Cell SDK: The SDK is the native Cell/B.E. programming model, developed to allow full flexibility for any workload. The model offers all the needed constructs to define the PPE and SPE codes, and their interaction. With the SDK, programmers design and develop the main control flow on the PPE (using simple C or C++ code), and the computation kernels for the SPEs (using C and special intrinsics for optimized processing). Both SPMD and MPMD execution are supported for the SPE kernels. The SPE kernels are managed and coordinated by the PPE. Data layout is implicit - the SPEs and the PPE collaborate in data transfers. There are no special constructs for data layouts, as most of these are settled through programmed DMA accesses to the main memory. To summarize, the SDK is a fragmented-view model, with kernels as the main concurrency units. Granularity is derived from the code (no special constructs are provided), and concurrency is achieved by running different threads on the SPEs. Data transfers are all explicit, and programmed by the user using DMAs; so are synchronization and communication, which use special channels, but still require code to manage the protocol. Mapping and scheduling are performed by the user via the PPE code. Low-level optimizations (mostly vectorization and SIMD operations) are also explicitly performed by the user.

ALF: ALF (Accelerated Library Framework [15], [16]) provides a set of functions/APIs for the development of data-parallel applications following an SPMD model on a (multi-level) hierarchical host-accelerators system (PPE-SPEs and/or host-Cells). In ALF, the host runs the control task and the accelerators run the compute tasks. The same program runs on all accelerators at one level. ALF provides data transfer management, task management, double buffering support, and data partitioning. Further, the model uses three types of code: (1) accelerator-optimized code (compute tasks optimized for a specific accelerator), (2) accelerated libraries (kernels and their interfaces, including the data management and buffering), and (3) applications (user-defined aggregations of compute tasks). Overall, ALF is an elegant high-level model which offers high productivity and, provided with the right libraries, also very good performance. Feature-wise, ALF is also a fragmented-view model, with coarse, task-level granularity. The model is hierarchical, but focuses on SPMD parallelism. Kernels are defined by the programmer as concurrency units. Kernel mapping and scheduling are solved transparently by the runtime system; the same holds for communication and synchronization. Data layout can be pre-set for the SPMD tasks, and data transfers, communication, and synchronization are implicit. Optimizations are typically performed by the user in the kernel code, but the model can use kernels imported from pre-optimized libraries.

Cell SuperScalar: Cell SuperScalar (CellSS) (5) is a pragmatic model, suitable for quick porting of existing sequential applications to the Cell/B.E. [17]. CellSS uses a compiler and a runtime system. The compiler separates an annotated sequential application into two parts: a PPE part (the main application thread), and an SPE part (a collection of functions to be offloaded as SPE tasks). The runtime system maintains a dynamic data dependency graph with all active tasks. When the PPE reaches a task invocation, it requests the CellSS runtime to add a new task to the execution list.
When a task is ready for execution (i.e., all its data dependencies are satisfied and there is an SPE available), the DMA transfers are transparently started (and optimized), and the task itself is started on the available SPE. Various scheduling optimizations are performed to limit communication overhead. Additionally, CellSS provides execution tracing, a mechanism included in the runtime system to allow performance debugging by inspecting the collected traces. When starting from suitable sequential code, CellSS is a very productive global-view model for first-order implementations of Cell applications. As mapping and scheduling are dynamically optimized, the overall performance depends on the kernels' performance. As kernels are generated from sequential user-written functions, low-level optimizations need to be performed by hand, and some manual (re)sizing might be needed to avoid task imbalance.

(5) The model is based on the principles of Grid SuperScalar, a successful grid programming model.

Summary: The three models presented here are quite different: the SDK is the native model, which provides the means to program the platform, but offloads all tasks to the user; ALF includes some rudimentary design elements (it builds a hierarchy of tasks), and from there it derives additional optimizations for mapping and scheduling; CellSS enables quick parallelization for the Cell, guided entirely by the user. Except for the SDK, the presented models simplify (1) mapping and scheduling at the MPMD/SPMD level on the SPEs (based on a runtime system), (2) data distribution, including the complete DMA transfers, with reasonable optimizations (prefetching, double buffering), and (3) inter-core communication. Low-level optimizations are left to the user (with potential help from the compiler). The concurrency units and their granularity are arbitrarily chosen by the user, and none of these three models helps in properly estimating them.

V. GENERIC PROGRAMMING MODELS

Our analysis of hardware-centric models shows that their design support is rather rudimentary. Rather, these models focus almost entirely on simplifying the user experience by tuning application parallelization to match a chosen platform. Given the size of the parallelization problem the software community is facing - i.e., all applications have to be parallelized, sooner or later, to make effective use of many-cores - using hardware-centric models becomes a non-scalable solution. Alternatively, one can use generic programming models (i.e., models which run on more than one family of many-core processors), focusing first on the application parallelism, and only second on mapping the parallel solution on one platform or another. We split these "generic" models into two categories: parallelism-centric models and application-centric models. We briefly discuss each category.

A. Parallelism-centric models

Parallelism-centric models are built to allow users to express typical parallelism constructs in a simple and effective way, and at various levels of abstraction. The higher the level of abstraction, the fewer (explicit) parallelism constructs are available, but also the lower the flexibility and expressiveness of the model. Parallelism-centric models are typically used to express complex parallel algorithms (i.e., the design of the parallel solution for the application and its implementation are decoupled).
Threads with shared memory: Threading libraries such as POSIX threads or the Java Thread class extend the sequential imperative programming model in a natural way to obtain parallelism. Functions are spawned as new threads that globally share data. Threads provide mechanisms at the lowest level of abstraction of parallel programming and are very flexible. It is natural to spawn threads with different functions to obtain task parallelism, but threads can also be spawned in a loop, with the same function operating on different data, which results in data parallelism. There is extensive synchronization support, such as joins, barriers, and condition variables. Threads offer a fragmented view, as programmers need to divide data among threads and join to collect the results. Algorithms expressed in this model are not portable to other architectures, and there is no concept of tasks that can be sized or resized other than functions. Users have no control over mapping and scheduling, and the model has no specific means to change the data layout. Threads do allow other low-level optimizations. The flexibility of threads gives programmers a lot of control over task creation and synchronization. Threads are well suited for coarse-grained tasks that need much synchronization. Because threads are so low-level, programmers have many opportunities to optimize, for example, synchronization.

MPI: The Message Passing Interface (MPI) [5] targets both distributed-memory systems and shared-memory machines, but is normally used for distributed memory. An application consists of multiple processes that communicate with messages. MPI gives the programmer much control, due to the strong separation of communication and computation, but it is not suitable for fine-grained parallelism. MPI is the typical example of a fragmented-view programming model. The user is responsible for all communication and synchronization. Explicitly parallel algorithms are needed, which results in algorithms that are not easily portable to another family of platforms, or even to other platforms of the same family. There is no way to size MPI tasks other than changing the program. Data transfers and layout are explicit, but only at a high granularity. Communication and synchronization are explicit, by means of messages, and the user has no control over how processes are scheduled. However, MPI allows other optimizations at a later stage; for example, MPI can be mixed with OpenMP [18]. MPI is well suited for applications where the input and communication are static, for example a regular data structure that can be divided into regular coarse-grained blocks, each computed on a different processor.

Cilk: The language Cilk allows programmers to write parallel divide-and-conquer programs. It extends C and C++ with keywords such as spawn and sync. A spawn in front of a function call creates a non-blocking function call that is executed in its own thread and may spawn other (possibly recursive) functions. Multiple consecutive spawned function calls create parallelism in the program. The keyword sync blocks the calling thread until the results of the spawned function calls are available. Cilk offers a global view of the algorithm: syntactically, a sequential divide-and-conquer algorithm is very similar to a parallel one. The language is limited to divide-and-conquer parallelism. The model gives no control over tasks.
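The classic Fibonacci example below (our own sketch, in MIT Cilk-5 style syntax) illustrates how close the parallel code stays to its sequential counterpart:

    // Illustrative Cilk sketch: spawn forks a recursive call into its own
    // task, sync waits until the spawned calls have produced their results.
    cilk int fib(int n) {
        int x, y;
        if (n < 2) return n;
        x = spawn fib(n - 1);   // runs in parallel with the next spawn
        y = spawn fib(n - 2);
        sync;                   // block until both spawned calls complete
        return x + y;
    }

Removing the cilk, spawn, and sync keywords yields exactly the sequential algorithm, which is what gives the model its global view.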
Programmers often control the granularity of the recursive tasks by manually choosing between a sequential and a parallel version, based on the size of the data that needs to be processed. Data is communicated to threads by using the parameters of spawned function calls. The model offers several ways to obtain more control over synchronization. For example, the abort keyword aborts other spawned threads; a typical use-case is a parallel search where one thread finds an item and aborts the others. Another example is the use of inlets, which guarantee that results of spawned threads are treated atomically with respect to the other spawned threads. The Cilk system dynamically schedules threads and dynamically maps them to hardware.

B. Application-centric models

Application-centric models tackle application parallelization from design to implementation. Some of them also include several generic optimizations. These models have fewer explicit parallelism constructs. Their goal is to help users find effective, (partially) platform-agnostic parallel solutions for their applications, and implement them using a limited set of concurrency, granularity, and parallelism constructs.

CHARM++ and the Offload API: CHARM++ is an existing parallel programming model adapted to run on accelerator-based systems [19], [20]. A CHARM++ program consists of a number of chares (i.e., the equivalents of tasks) distributed across the processors of the parallel machine. These chares can be dynamically created and destroyed at run-time, and can communicate with each other using messages. For accelerators, CHARM++ uses an Offload API: a chare can offload work requests, which are computation-intensive kernels to be accelerated by the accelerators (e.g., the SPEs on the Cell/B.E.); on the host side (e.g., the PPE on the Cell/B.E.), the Offload API manages each work request, coordinating data movement, execution, and completion notifications. CHARM++ is a fragmented-view model, with coarse parallelism expressed by chares. The model supports both SPMD and MPMD. The granularity is controlled at design time, by defining the chares; data distribution is implicitly defined by the data usage of these chares, and data transfers are automated. The dynamic mapping and scheduling, together with the highly optimized data transfers, contribute to a high performance potential.

Sequoia: Sequoia requires the programmer to reason about a parallel application focusing on memory locality [21]. A Sequoia application is a tree-like hierarchy of parametrized tasks; running tasks concurrently leads to parallelism. A tree has two different types of tasks: inner nodes, which spawn child threads, and leaf nodes, which run the computation itself. The task hierarchy has to be mapped onto the memory hierarchy of the target machine (not only the Cell/B.E.) by the programmer. Tasks run in isolation, using only their local memory, while data movement is exclusively done by passing arguments (no shared variables). One task can have multiple implementation versions, which can be used interchangeably; each implementation uses both application and platform parameters, whose values are fixed during the mapping phase. For the Cell/B.E., all inner nodes run on the PPE, while the leaf nodes are executed by the SPEs. The PPE runs the main application thread and handles the SPE thread scheduling.
Each SPE uses a single thread for the entire lifespan of the application, and it continuously waits, idle, for the PPE to asynchronously request the execution of a computation task. As a generic model, Sequoia uses a fragmented algorithm view. Based on coarse-granularity SPMD parallelism, the model requires applications to be designed using a divide-and-conquer approach. The granularity can be controlled at both compile time and runtime. Data transfers are implicit. The special feature of the model is its user-defined mapping and scheduling (specified in a user file), which also results in implicit, yet automated data layouts. Optimizations are allowed, as different versions of the same kernel can be used interchangeably during the lifespan of the application. Still, Sequoia has limited productivity, as the model is difficult to use for non-divide-and-conquer applications. The application and machine decompositions are independently reusable. The manual application-to-machine mapping offers a flexible environment for tuning and testing application performance.

Pattern-based models: OPL: Pattern-based models allow users to focus entirely on application analysis. An application is built as a composition of nested patterns. Once all these patterns are implemented for a platform, their composition, also a pattern, leads to a complete application on that platform. Pattern languages use different classes of patterns, applied at different stages of application design. First, the high-level application structure is described in terms of structural patterns (a graph of tasks) and computational patterns (the computation of each task). Next, the algorithm strategy patterns are employed: they identify and exploit application concurrency as exposed by the structural patterns. The way the program and data are organized is specified by implementation strategy patterns, which are ways of implementing each algorithmic pattern. Finally, the low-level parallelism support, matching both the application and the target architecture, is included in the so-called parallel execution patterns. A pattern-based language is a promising abstract concept, being elegant, generic, and systematic, and allowing for feedback loops and incremental re-design. Implementing such a language, however, requires platform-specific pattern implementations, an effort that depends on the number of commonly used patterns. Furthermore, the first and last categories of patterns - the structural and the parallel execution patterns - are not trivial to apply to a new application. Structural mistakes can significantly affect performance, but the simplicity of the model and, eventually, the limited number of structural patterns should allow for extensive auto-tuning. One example of a pattern-based language is OPL (Our Pattern Language), developed at Berkeley (6). Details on the five categories of patterns the language offers can be found in [22]; however, there is no implementation of OPL so far, so technical details are missing. Therefore, we consider that the practical side of this solution is yet to be proven.

(6) http://parlab.eecs.berkeley.edu/wiki/patterns/patterns

OmpSs: OmpSs [2], [3] addresses the programmability of heterogeneous architectures by allowing the user to exploit task-level parallelism (7), using an approach similar to that of OpenMP.
Based on pragmas, the model uses a source-to-source translator to separate the code into dedicated programs for the different components of the heterogeneous system; furthermore, the runtime system schedules tasks for execution, preserving and optimizing the dependencies among tasks. The system is based on incremental parallelization of a single source code, allowing step-by-step restructuring and optimization, and a separation of the implementation from the platform-specific details (which are, of course, encapsulated in the runtime system). OmpSs is also portable, as the same code (typically, a sequential C or Fortran application with pragmas) runs on any machine to which the back-end is ported. Programmers may choose to apply platform-specific optimizations (i.e., design and implement platform-specific versions of the tasks), but they may also choose to ignore them, preserving portability at the expense of performance. OmpSs is a global-view model with coarse granularity, MPMD (and SPMD) parallelism, implicit data transfers, pragmas for data distribution, and runtime-based dynamic mapping and scheduling. Platform-specific optimizations can be applied "outside the model" - i.e., the model allows multiple kernel implementations to be plugged in.

(7) Instances of the generic StarSs programming model include GRIDSs (for the Grid), CellSs (for the Cell/B.E.), SMPSs (for multi-core processors), and GPUSs (for GPUs); Ss stands for Superscalar.

OpenCL: OpenCL was proposed in 2008 (by the Khronos Group) as a solution to the platform diversity problem. OpenCL proposes a common hardware model for all multi-core platforms. The user programs this "virtual" platform, and the resulting source code is portable to any OpenCL-compliant platform (8). The OpenCL platform model consists of a host connected to one or more OpenCL compute devices. A compute device is divided into multiple compute units (CUs); CUs are divided into multiple processing elements (PEs); PEs perform the computations. The PEs can behave either as SIMD or as SPMD units: a kernel is executed concurrently on multiple processing elements, each with its own data and a shared program counter (SIMD), or each with its own data and its own program counter (SPMD). SIMD execution implies that all PEs execute a strictly identical set of instructions, which cannot always be true for SPMD due to possible branching in a kernel. Further, the OpenCL platform has a multi-level shared memory model, featuring four distinct memory spaces: private, local, constant, and global. Private memory can only be used by a single work-item (like registers private to that work-item). Local memory can be used by the work-items in a work-group (similar to on-chip shared memory). Constant memory can be used to store constant data for read-only access by all of the compute units in the device during the execution of a kernel; the host allocates and initializes the memory objects placed in constant memory. This is similar to the constant caches that are available on AMD GPUs. Finally, global memory is memory that can be used by all the compute units on the device. This is similar to the off-chip GPU memory that is available on AMD GPUs. Note that the mapping onto real hardware depends on the real memory subsystem, and it might require different memory spaces to be collapsed together.

(8) Currently, ATI's and NVIDIA's GPUs, AMD's and Intel's multi-cores, ARM's embedded processors, and the Cell/B.E. have hardware drivers and compiler back-ends for OpenCL.
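As a small illustration of these memory spaces (our own sketch, not taken from the OpenCL specification), an OpenCL C kernel annotates each buffer with the address space it lives in, and every work-item processes one element of the index space:

    // Illustrative OpenCL C kernel: __global, __constant and __local map the
    // kernel's data onto the distinct memory spaces of the platform model.
    __kernel void scale(__global float* data,
                        __constant float* factor,
                        __local float* scratch) {
        size_t gid = get_global_id(0);   // this work-item's index in the NDRange
        size_t lid = get_local_id(0);    // its index within the work-group

        scratch[lid] = data[gid];        // stage the element in on-chip local memory
        barrier(CLK_LOCAL_MEM_FENCE);    // synchronize the work-group

        data[gid] = scratch[lid] * factor[0];
    }

How these logical spaces map onto caches, scratchpads, or plain DRAM is decided by the vendor's driver for each concrete device.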
An OpenCL program has two parts: the compute kernels, which will be executed on one or more OpenCL devices, and a host program, which defines the context for the kernels and initiates and manages their execution. The “main” OpenCL application runs on the host and submits commands to be executed on the processing elements within a device. Kernels can run either in-order or out-of-order. Events allow checking the status of outstanding kernel-execution requests and other runtime requests. The execution domain of a kernel is an N-dimensional computation domain, which specifies how large a problem the user needs to solve. Each element in the execution domain is a work-item, and OpenCL provides the ability to group work-items into work-groups for synchronization and communication purposes.
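The host-side flow can be sketched as follows (error handling, platform/device discovery, and buffer creation are omitted; identifiers such as device, src, n, d_x, d_y, and d_alpha are placeholders, and the kernel name refers to the kernel sketch given earlier):

    /* Sketch of the host program: build the kernel, set its arguments, and launch it
       over a one-dimensional execution domain of n work-items in work-groups of 64. */
    cl_int err;
    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    cl_program       prog  = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel        krn   = clCreateKernel(prog, "scale_add", &err);

    clSetKernelArg(krn, 0, sizeof(cl_mem), &d_x);        /* buffers created beforehand */
    clSetKernelArg(krn, 1, sizeof(cl_mem), &d_y);
    clSetKernelArg(krn, 2, sizeof(cl_mem), &d_alpha);

    size_t global = n;    /* the N-dimensional computation domain (here, 1D)           */
    size_t local  = 64;   /* work-group size; assumed to divide the global size        */
    cl_event done;
    clEnqueueNDRangeKernel(queue, krn, 1, NULL, &global, &local, 0, NULL, &done);
    clWaitForEvents(1, &done);                            /* events track request status */

This context-setting code is the main source of OpenCL's verbosity, a point we return to in the comparison below.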
VI. THE UNIFIED MODEL: OPENCL OR OMPSS?

So far, we have discussed twenty programming models for many-core processors - and these are only the representative ones. Each of these models has its qualities and drawbacks, and each has a positive impact on platform programmability. Still, many of them are used only sporadically. We believe this poor adoption of high-level programming models is due, to a large extent, to the multitude of models available. It is difficult for potential users to understand the limitations of each model, the differences between models, or the impact a model has on a specific workload. Therefore, programmers choose the native models which, although cumbersome and low-level, offer full flexibility. We believe that the space of programming models needs to be pruned to only a few items: generic models and application-specific models. In this context, we extract two candidates for a unified many-core programming model - OpenCL and OmpSs - and compare them from this perspective. While we are not able to declare a clear winner, we can make several recommendations for each of the models to improve their potential impact on platform programmability.

A. Programmability Impact

We briefly analyze the impact of both OpenCL and OmpSs on platform programmability by looking at its three components, as described in Section III-A.

1) Productivity: OpenCL supports application-centric programming, covering a good mix of both design and implementation features. Application parallelization starts from a specification and/or an algorithm. It is a kernel-based global-view model: it preserves the algorithm design and offloads designated kernels to be accelerated. Kernel mapping and scheduling are not controllable, but they can be influenced by using asynchronous queues. Fine-grain data parallelism is well supported. The model has no notion of a task (and, implicitly, no control over task sizing and composition), but kernels can be used to implement some degree of task parallelism, provided that the hardware platform supports it. There are virtually no limitations in expressing parallelism in OpenCL. However, the way this parallelism is translated from the virtual OpenCL platform to the real hardware platform is hidden behind vendor-specific drivers, and can be counter-intuitive. In contrast, OmpSs starts from sequential code and produces a parallel application by offloading the kernels (specified by the user) and parallelizing them accordingly. OmpSs is also a global-view model. Despite its less developed support for parallel design, OmpSs allows for quick parallelization of existing applications by user-placed pragmas. This approach leads to excellent productivity. Kernel-level optimizations are very similar to the ones in OpenCL (in fact, one of the back-ends of OmpSs is OpenCL). There can be limitations induced at the design level by a poorly written sequential application. In this case, a fall-back to a parallel solution designed from scratch is recommended (for which an implementation in OpenCL is an alternative). Finally, OmpSs is slightly friendlier for coding: the application is single-source (separated by the compiler into host code and kernels), and it is less verbose than the original OpenCL (the context-setting code is generated). The most difficult problem remains parallelization from scratch. For OpenCL, applications need to fit the architectural model of the common middleware. These are workloads with large collections of fine-grain work-items, very limited synchronization, and no need for user-controlled mapping and scheduling; the applications can only be driven/managed from the host, with no device-initiated communication. Essentially, applications with coarse tasks and many interdependencies are unsuitable for OpenCL. OmpSs requires a good sequential implementation (or at least its control-flow skeleton together with hardware-specific kernels) to generate a good parallelization. This makes for a much quicker design, but the performance behavior will vary considerably across real hardware platforms. Overall, OmpSs is more productive than OpenCL when sequential code is available. On the other hand, already parallelized codes (such as, for example, CUDA applications) are typically much easier to translate to OpenCL than (directly) to OmpSs.

2) Portability: The strongest point of OpenCL is its portability by construction. This is a result of using the common platform model as a virtual middleware. Further, this separates the design and implementation concerns: OpenCL's back-end targets one machine type only, and it is the responsibility of the hardware vendors to provide the OpenCL drivers; programmers are only concerned with designing a parallel application for a given platform model. Note that the portability of OpenCL (and its relative success in both academia and industry) is a good incentive for all processor vendors to produce high-performing OpenCL drivers for their platforms. OmpSs is portable at source level: each platform for which OmpSs is available has its own compiler and run-time system, fully optimized for the specific platform. While this is, in a sense, similar to OpenCL's portability provided by the drivers, the incentive and the driving force behind the model are significantly lower. Thus, we expect the number of devices supported by OpenCL and not by OmpSs to increase rapidly. Overall, the two models are equally portable. However, the performance penalties incurred by this portability might differ significantly.

3) Performance: Finally, when it comes to the performance impact, OpenCL is expected to enable GPUs to achieve results comparable to their “native” models - there are no fundamental reasons for OpenCL to behave worse: the parallel design is similar, low-level optimizations can be enabled, and CUDA-like execution can be emulated perfectly. Therefore, poorly performing drivers and/or compilers, as well as bad programming, could be the only causes of lower OpenCL performance. For the GPMCs and the Cell, the performance depends on (1) how well the application maps to the OpenCL platform, and (2) how well the hardware maps to the OpenCL platform.
Overall, the few performance case studies available so far for real applications are inconclusive [23], [24], [25], [26]: performance variations range between 10% and 90%. Our own preliminary evaluations show that, for fair comparisons, the gap is not larger than 10%. For OmpSs, performance penalties come from two sources: poor parallelization of the application (due to bad use of pragmas or unsuitable sequential code) and non-optimized kernels. The first can be attributed to programmers, and it is impossible to eradicate. The second is addressed by the model itself, which allows multiple implementations of the same kernel to be included in the program; the system can then choose, at run-time, which implementation is the most suitable one for the platform in use. However, there are also performance improvements that are specific to OmpSs: its run-time system provides dynamic mapping and scheduling, taking into account data locality and minimizing data transfers. Therefore, it is most likely that GPU-like applications will perform about the same in both models, while for the GPMCs and the Cell, OmpSs will perform better. Performance-wise, the systems are comparable; a more precise ranking can only be made on a per-application basis.

B. Recommendations

Based on the detailed analysis provided above, we conclude that both models provide the mix of features and the positive impact on platform programmability needed to become generic models for many-core processors. Before that, more case studies have to be developed to assess whether the theoretical assumptions we have made here hold in practice for real workloads. Still, we recommend two improvements for each model. For OpenCL: (1) provide basic communication options to allow the computing elements to communicate (at least with the host), and (2) allow some degree of control over the mapping of work-items onto the platform, enhancing support for task parallelism. For OmpSs: (1) find a higher-level application specification to be used for cases where sequential code fails, and (2) enable an auto-tuning-like technique to allow the automated selection (or even the tuning/generation) of the best available kernel implementation for a given platform.

VII. CONCLUSIONS AND RECOMMENDATIONS

Multi-core processors are here to stay. With that, application parallelism has become mandatory. The software community is (suddenly) faced with a large problem: virtually every application will have to run on a parallel machine, sooner rather than later. And with multi-core complexity steadily increasing, addressing applications and machines one pair at a time will be counterproductive. Using hardware-centric programming models can speed up the implementation process per platform, but the lack of portability will eventually lead to lower performance and/or lower productivity. Therefore, a more effective way to address the mass-parallelization problem is to focus on application-centric programming: first design a parallel solution for the problem, and then implement it on one or multiple platform(s). However, this is easier said than done: we analyzed twenty many-core programming models and showed that most of them lack either the design or the implementation component. Our analysis leads to three important conclusions. First, the perceived difficulty of many-core programming generates a lot of hardware-centric models, which are platform-specific and non-portable.
Second, application-centric models can be used to improve platform programmability, as they offer increased productivity and portability with minor performance penalties. Third, although there is no unified model for efficient programming of many-cores, OpenCL and OmpSs are promising candidates for achieving such a consensus. For the near future, we have three suggestions. For hardware vendors: support OpenCL by developing drivers for your platforms, rather than building yet another hardware-centric programming model. For (third-party) programming model designers: support application-centric models by using OmpSs or OpenCL as a compilation target for your higher-level models. For programmers: use OpenCL and OmpSs for prototyping and development - a quicker adoption will speed up their development, increasing the chances for a unified model for programming many-cores to emerge.

REFERENCES

[1] OpenCL committee, “OpenCL 1.1 standard,” http://www.khronos.org/opencl/, October 2010.
[2] R. Ferrer, J. Planas, P. Bellens, A. Duran, M. Gonzalez, X. Martorell, R. M. Badia, E. Ayguade, and J. Labarta, “Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL,” in LCPC 2010, October 2010.
[3] E. Ayguade, R. M. Badia, P. Bellens, D. Cabrera, A. Duran, M. Gonzalez, F. Igual, D. Jimenez-Gonzalez, J. Labarta, L. Martinell, X. Martorell, R. Mayo, J. M. Perez, J. Planas, and E. S. Quintana-Orti, “Extending OpenMP to survive the heterogeneous multi-core era,” International Journal of Parallel Programming, vol. 38, no. 5–6, pp. 440–459, June 2010.
[4] B. Chamberlain, D. Callahan, and H. Zima, “Parallel programmability and the Chapel language,” Int. J. High Perform. Comput. Appl., vol. 21, no. 3, pp. 291–312, 2007.
[5] MPI Forum, “MPI: A Message-Passing Interface standard,” 1994.
[6] K. Kennedy, C. Koelbel, and H. Zima, “The rise and fall of High Performance Fortran: an historical object lesson,” in HOPL III: Proceedings of the Third ACM SIGPLAN Conference on History of Programming Languages. New York, NY, USA: ACM, 2007, pp. 7-1–7-22.
[7] M. J. Flynn, “Some computer organizations and their effectiveness,” IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948–960, 1972.
[8] J. Reinders, Intel Threading Building Blocks. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2007.
[9] A. Ghuloum, T. Smith, G. Wu, X. Zhou, J. Fang, P. Guo, B. So, M. Rajagopalan, Y. Chen, and B. Chen, “Future-proof data parallel algorithms and software on Intel multi-core architecture,” Intel Technology Journal, vol. 11, no. 4, pp. 333–348, 2007.
[10] M. McCool, “Developing for GPUs, Cell, and multi-core CPUs using a unified programming model,” http://www.linux-mag.com/id/6374, July 2008.
[11] CUDA Programming Guide, NVIDIA, 2007.
[12] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: Stream computing on graphics hardware,” in ACM Transactions on Graphics, Proceedings of SIGGRAPH 2004, Los Angeles, California, August 2004, pp. 777–786.
[13] Advanced Micro Devices Corporation (AMD), AMD Stream Computing User Guide, August 2008, revision 1.1.
[14] PGI Fortran & C Accelerator Programming Model white paper, version 1.2 ed., The Portland Group, March 2010, http://www.pgroup.com/lit/whitepapers/pgi_accel_prog_model_1.2.pdf.
[15] Cell/B.E. Programming Tutorial, 2nd ed., IBM, December 2006.
[16] A. Buttari, P. Luszczek, J. Kurzak, J. Dongarra, and G. Bosilca, “A rough guide to scientific computing on the PlayStation 3,” ICL, University of Tennessee Knoxville, Tech. Rep. UT-CS-07-595, May 2007.
[17] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta, “CellSs: A programming model for the Cell BE architecture,” in SC ’06. IEEE Computer Society Press, November 2006.
[18] E. Ayguade, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, “The design of OpenMP tasks,” IEEE Transactions on Parallel and Distributed Systems, vol. 20, pp. 404–418, 2009.
[19] D. Kunzman, “CHARM++ on the Cell processor,” Master’s thesis, Dept. of Computer Science, University of Illinois, 2006.
[20] D. Kunzman and L. Kalé, “Towards a framework for abstracting accelerators in parallel applications: Experience with Cell,” in SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1–12.
[21] K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, “Sequoia: Programming the memory hierarchy,” in SC ’06. ACM Press, November 2006.
[22] K. Keutzer and T. Mattson, “OPL: Our Pattern Language,” October 2009. [Online]. Available: http://parlab.eecs.berkeley.edu/wiki/_media/patterns/opl-new_with_appendix-20091014.pdf
[23] S. M. Cho, D. W. Im, O. Y. Jang, H. J. Song, B. D. Paulovicks, V. Sheinin, and H. Yeo, “OpenCL and parallel primitives for digital TV applications,” IBM Journal of Research and Development, vol. 54, no. 5, pp. 1–14, September 2010.
[24] P. Du, P. Luszczek, and J. Dongarra, “OpenCL evaluation for numerical linear algebra library development,” in SAAHPC ’10, June 2010.
[25] S. Rul, H. Vandierendonck, J. D’Haene, and K. De Bosschere, “An experimental study on performance portability of OpenCL kernels,” in SAAHPC ’10, June 2010.
[26] K. Komatsu, K. Sato, Y. Arai, K. Koyama, H. Takizawa, and H. Kobayashi, “Evaluating performance and portability of OpenCL programs,” in VecPar ’10, 2010.