
1 Introduction

Graph analytics systems must handle very large datasets with billions of nodes and trillions of edges [16]. Graphs of this size are too big to fit into the memory of a single machine, so one approach is to use distributed-memory clusters consisting of multicore processors. Writing efficient distributed-memory programs can be difficult, so a number of frameworks and libraries, such as Pregel [18], PowerGraph [12], and Gemini [33], have been developed to ease the burden of writing graph analytics applications for such machines. New trends in processor architecture have made this programming problem much more difficult. To reduce energy consumption, computer manufacturers are turning to heterogeneous processor architectures in which each machine has a multicore processor along with GPUs or FPGAs. To exploit such platforms, we must tackle the twin challenges of processor heterogeneity and distributed-memory computing. Frameworks like Lux [15] and Gluon [10] permit graph analytics application writers to use distributed GPUs, but they require writing platform-specific programs that are not portable.

Ideally, we would have a compiler that takes single-source, high-level specifications of graph analytics algorithms and automatically translates them into distributed, heterogeneous implementations while optimizing them for diverse processor architectures. This paper describes such a compiler, called Abelian. Application programs are generalized vertex programs written in the Galois programming model, which provides programming patterns and data structures to support graph applications [20]. Section 2 describes this programming model in more detail. The Abelian compiler, described in Sect. 3, targets the Gluon runtime [10], which implements bulk-synchronous execution. Unlike other systems in this space, this runtime supports a number of graph partitioning policies, including edge-cuts and vertex-cuts, and the programmer can choose any of these policies. The compiler exploits domain knowledge to generate distributed code, inserting optimized communication code. Back-end compilers generate optimized code for NUMA multicores and GPUs from the output of Abelian.

The experimental results presented in Sect. 4 show that the communication optimizations in Abelian reduce communication volume by 23\(\times \), enabling Abelian-generated implementations to match the performance of hand-tuned distributed-CPU and distributed-GPU programs on the same platform. In addition, the distributed-CPU implementations produced by Abelian yield a geometric mean speedup of 2.4\(\times \) over those in the stand-alone distributed-CPU system Gemini [33] on the same hardware. This shows that the flexibility of Abelian in compiling a high-level, shared-memory, single address space specification for heterogeneous and distributed-memory architectures does not come at the cost of performance, even when compared to integrated, homogeneous systems.

2 Programming Model

The Abelian compiler supports a generalized vertex programming model [12, 18, 33] that is a restriction of the Galois programming model [20, 24]. Nodes and edges of the graph have labels that are updated iteratively by the program until some quiescence condition is reached. Labels are updated by applying operators to active nodes in the graph; each such application is called an activity. A push-style operator uses the label of the active node to conditionally update the labels of the immediate neighbors of that node, while a pull-style operator reads the labels of the immediate neighbors and conditionally updates the label of the active node.
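
To make the distinction concrete, the following is a minimal sketch of push- and pull-style operators for an sssp-like computation. This is our own illustration with a simplified graph representation; it is not code from the paper's figures and does not use the Galois API.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal graph used only for this illustration: a per-node label 'dist' and
// explicit in/out adjacency lists with edge weights.
struct Graph {
  std::vector<std::vector<std::pair<uint32_t, uint32_t>>> outEdges; // (dst, weight)
  std::vector<std::vector<std::pair<uint32_t, uint32_t>>> inEdges;  // (src, weight)
  std::vector<std::atomic<uint32_t>> dist;
};

// Push-style: read the active node's label and conditionally update the labels
// of its out-neighbors (atomic min, since neighbors may be updated concurrently).
void pushOperator(Graph& g, uint32_t src) {
  uint32_t d = g.dist[src].load();
  for (auto [dst, w] : g.outEdges[src]) {
    uint32_t cand = d + w;
    uint32_t old  = g.dist[dst].load();
    while (cand < old && !g.dist[dst].compare_exchange_weak(old, cand)) { }
  }
}

// Pull-style: read the labels of the in-neighbors and conditionally update only
// the active node's label (no concurrent writers to g.dist[dst] in this style).
void pullOperator(Graph& g, uint32_t dst) {
  uint32_t best = g.dist[dst].load();
  for (auto [src, w] : g.inEdges[dst])
    best = std::min(best, g.dist[src].load() + w);
  g.dist[dst].store(best);
}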

Abelian supports more general operators than other systems in this space. In particular, an operator is allowed to update the labels of both the active node and its immediate neighbors, which is useful for applications like matrix completion using stochastic gradient descent. In addition, Abelian does not require updates to node labels to be reduction operations. For example, k-core decomposition evaluated in Sect. 4 uses subtraction on node labels.

In addition to the operator, the programmer must specify how active nodes are found in the graph [19]. The simplest approach is to execute the program in rounds and apply the operator to every node in each round. The order in which nodes are visited is unspecified, and the implementation is free to choose whatever order is convenient. These topology-driven algorithms [24] terminate when a global quiescence condition is reached. The Bellman-Ford algorithm for single-source shortest-path (sssp) is an example.

An alternative strategy is to track active nodes in the graph and apply the operator only to those nodes, which potentially creates new active nodes. These data-driven algorithms [24] terminate when there are no more active nodes in the graph. As before, the order in which active nodes are to be processed is left unspecified, and the implementation is free to choose whatever order is convenient. Chaotic relaxation sssp uses this style of execution. Tracking of active nodes can be implemented by maintaining a work-list of active nodes. Alternatively, this can be implemented by marking active nodes in the graph and making sweeps over the graph, applying the operator only to marked nodes; we call this approach filtering. Fine-grain synchronization in marking and unmarking nodes can be avoided by using Jacobi-style iteration with two flags, say current and next, on each node; in a round, active nodes whose current flag is set are processed, and if a node becomes active in that round, its next flag is set using an ordinary write operation. The roles of these flags are exchanged at the end of each round. In our programming model, data-driven algorithms are written using work-lists, but the compiler transforms the code to use a filtering implementation. The correctness of this transformation is ensured by the fact that active nodes can be processed in any order.
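
The following is a minimal sketch of this two-flag filtering scheme. It is our own illustration (the flag and function names are assumptions, not compiler output), intended only to show how a round proceeds and how quiescence is detected.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Per-node flags for Jacobi-style filtering (illustrative only).
struct Flags {
  std::vector<uint8_t> current; // node is active in the current round
  std::vector<uint8_t> next;    // node becomes active in the next round
};

// One round: visit all nodes and apply the operator only to those whose current
// flag is set. The operator activates a neighbor through the callback, which
// sets the next flag with an ordinary write (double-marking is harmless).
template <typename OperatorFn>
bool runRound(std::size_t numNodes, Flags& flags, OperatorFn op) {
  bool anyActivated = false;
  for (std::size_t n = 0; n < numNodes; ++n) { // a real implementation parallelizes this sweep
    if (!flags.current[n]) continue;
    flags.current[n] = 0;                      // consume the activation
    op(n, [&](std::size_t neighbor) {
      flags.next[neighbor] = 1;
      anyActivated = true;
    });
  }
  std::swap(flags.current, flags.next);        // exchange the roles of the flags
  return anyActivated;                         // false => no active nodes => quiescence
}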

Implementation: This programming model is implemented in C++ using the Galois library [20]. Figure 1 shows a program for a push-style, data-driven pagerank algorithm. A work-list is used to track active nodes. The Galois::for_each in line 30 initially populates the work-list with all nodes in the graph and then iterates over it until the work-list is empty. The operator computes the update to the pagerank of the active node and pushes this update to all of its neighbors. If the residual at a neighbor exceeds a user-specified threshold, that neighbor becomes active and is pushed to the work-list.
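
For illustration, a sketch of the kind of operator Fig. 1 contains is given below. It is not the figure's exact code (the line numbers cited in this section refer to the original figure), and the graph type, GNode, ALPHA, TOLERANCE, and the atomicAdd helper (assumed to return the previous value) are placeholders; the field names rank, residual, and nout are those used in the text.

// Illustrative Galois-style push operator for data-driven pagerank (a sketch,
// not the code in Fig. 1).
struct PageRank {
  Graph& graph;
  void operator()(GNode src, Galois::UserContext<GNode>& ctx) const {
    auto& sdata = graph.getData(src);
    float oldResidual = sdata.residual.exchange(0);  // read and reset residual at the source
    sdata.rank += oldResidual;                       // fold the residual into the rank
    float delta = oldResidual * ALPHA / sdata.nout;  // contribution pushed along each out-edge
    for (auto e : graph.edges(src)) {
      GNode dst   = graph.getEdgeDst(e);
      auto& ddata = graph.getData(dst);
      float old = atomicAdd(ddata.residual, delta);  // reduction at the destination
      if (old + delta > TOLERANCE)
        ctx.push(dst);                               // neighbor becomes active
    }
  }
};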

The semantics of the Galois::for_each iterator permit work-list elements to be processed in any order. In a parallel implementation of the iterator, each operator application must appear to have been executed atomically. To ensure this, the application programmer must use data structures provided in the Galois library which include graphs, work-lists, and accumulators. This permits the runtime to manage updates to distributed data structures on heterogeneous devices and allows the compiler to treat data structures as objects with known semantics, which enables program optimization and generation of parallel code from implicitly parallel programs as described in Sect. 3.

Restrictions on Operators: As in other programming models for graph analytics [12, 15, 26, 33] and compilers for data-parallel languages [3, 27, 30], operators cannot perform I/O operations. They also cannot perform explicit dynamic memory allocation, since some devices (like GPUs) have limited runtime support for it. The library data structures can perform dynamic storage allocation, but this is done transparently to the programmer.

Fig. 1. Pagerank source program

Fig. 2. Compiler-generated synchronization structures for field contrib in pagerank

Fig. 3. Compiler-generated pagerank program

3 Abelian Compiler

Figure 4 is an overview of how input programs are compiled for execution on distributed, heterogeneous architectures. The Abelian compiler (implemented as a source-to-source translation tool based on Clang’s libTooling) analyzes the patterns of data accesses in operators, restructures programs for execution on distributed-memory architectures, and inserts code for optimized communication. The output of the Abelian compiler is a bulk-synchronous parallel C++ program with calls to the Gluon [10] communication runtime (Fig. 3). Gluon transparently handles the graph partitioning while loading the input graph. The generated code is independent of the partitioning policy, but the partitioning policy determines which portions of this code are executed. This permits Gluon’s optimization that exploits structural invariants in partitioning without recompiling the program. The Abelian compiler also generates IrGL [22] intermediate representation kernels corresponding to each Galois::do_all call in the C++ program and inserts code in the C++ program to switch between calling the Galois::do_all and the corresponding IrGL kernel depending on the configuration chosen for the host (these are not shown in Fig. 3 for brevity). The C++ program and the IrGL intermediate code are then compiled using device-specific compilers. The output executable is parameterized by the graph input, the partitioning policy, and the number of hosts and their configuration (CPU or GPU). The user can thus experiment with a variety of partitioning strategies and heterogeneous devices with a single command-line switch.

Fig. 4. System overview

3.1 Graph-Data Access Analysis

The access analysis pass determines which fields are accessed in an operator and how. The results of this analysis are used to insert the required communication code. Field accesses are classified as follows (an annotated example follows the two lists below):

  • Reduction: The field is read and updated using a reduction operation inside an edge iterator within the operator (e.g., addition to residual in line 22 in Fig. 1). This is a common and important pattern in graph analytics applications.

  • Read: The field is read, and it is not part of a reduction (e.g., read from nout in line 17 in Fig. 1).

  • Write: The field is written, and it is not part of a reduction (e.g., write to rank in line 16, Fig. 1).

In addition, it is useful to abstract the context in which a field access is made.

  • At source: The field is accessed at the source node of an edge.

  • At destination: The field is accessed at the destination node of an edge.

  • At any: The field is accessed at a node independent of any edge or at both endpoints of an edge.
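
As an illustration (our own example, not from the paper's figures), the accesses in the following push-style, sssp-like operator body would be classified as indicated in the comments; atomicMin is an assumed helper, and the field names are placeholders.

// Illustrative access classification for an operator that iterates over the
// outgoing edges of the active node 'src'.
auto& sdata = graph.getData(src);
Dist d = sdata.dist;                               // Read, at source (the active node is the
                                                   // source of the edges iterated below)
sdata.visited = true;                              // Write, at any (independent of an edge)
for (auto e : graph.edges(src)) {
  GNode dst   = graph.getEdgeDst(e);
  auto& ddata = graph.getData(dst);
  atomicMin(ddata.dist, d + graph.getEdgeData(e)); // Reduction (min), at destination
}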

3.2 Restructuring Computation

The goal of computation restructuring is to bridge the semantic gap between the programming model, which has a single address space, and the execution model, which is distributed-memory and bulk-synchronous parallel. The semantics of Galois iterators permit iterations to be executed in parallel as long as each iteration appears to execute atomically. The Abelian compiler must convert this fine-grain, iteration-level parallelism into round-based, bulk-synchronous parallelism. This involves eliminating global variables (similar to closure conversion in functional languages) by making them members of the operator structure, and it requires the two key transformations described next.

Splitting Operators. When active nodes are processed in parallel on a shared-memory machine, fine-grain synchronization may be needed for correct execution. This problem appears in a different guise on distributed-memory machines: if two active nodes that update a common neighbor are on different hosts, proxies for that neighbor exist on both hosts, and the values pushed to these proxies must be reconciled so that the semantics of the program are respected. The bulk-synchronous execution model does not permit fine-grain synchronization, so these problems must be solved, in general, by breaking the operator into phases and introducing sync calls between the phases. There are a number of cases to consider depending on the type of field access as determined by the graph-data access analysis. We describe one such case.

In the PageRank source code in Fig. 1, the residual field is read (line 14) to update the rank field (line 16) and written (line 14, using exchange(0)) at the source, but it is also reduced (line 22) at the destination. Since different hosts could update residual, any host reading it must see the reduced value. To handle this, the compiler splits any operator that has such a dependence into multiple operators (a form of loop fission): one with only Read and Write accesses to the field and another with only Reduction accesses, as shown in the PageRank and PageRank_splitOp operators, respectively (lines 12–41 in Fig. 3). This may involve introducing new fields to store intermediate values (e.g., contrib). The compiler also transforms some non-reduction read-after-write operations (e.g., subtraction) into equivalent reduction operations (e.g., addition) in a similar way. After this transformation, sync calls are introduced between the parallel phases, as described in Sect. 3.3.
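
The following sketch shows the shape of this split (our own simplification, not the compiler's actual output in Fig. 3; ALPHA and the atomicAdd helper are assumptions). The first operator has only Read and Write accesses to residual at the source and stages the pushed value in the new contrib field; the second has only the Reduction access at the destination, and Gluon sync calls are inserted between the phases (Sect. 3.3).

// Phase 1: Read/Write accesses at the source; the value to be pushed is staged
// in the new 'contrib' field instead of being applied to neighbors directly.
struct PageRank {
  Graph& graph;
  void operator()(GNode src) const {
    auto& sdata = graph.getData(src);
    float oldResidual = sdata.residual.exchange(0);
    sdata.rank   += oldResidual;
    sdata.contrib = oldResidual * ALPHA / sdata.nout; // intermediate value for phase 2
  }
};

// Phase 2: only the Reduction access to residual, at the destination.
struct PageRank_splitOp {
  Graph& graph;
  void operator()(GNode src) const {
    auto& sdata = graph.getData(src);
    if (sdata.contrib == 0) return;                   // nothing to push this round
    for (auto e : graph.edges(src)) {
      auto& ddata = graph.getData(graph.getEdgeDst(e));
      atomicAdd(ddata.residual, sdata.contrib);
    }
  }
};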

Eliminating Work-Lists. The Abelian compiler eliminates work-lists by using filtering, as explained in Sect. 2: in a given round, all nodes in the graph are visited and the operator is applied to nodes whose current flag is set. This flag is reset, and if a node becomes active in that round, its next flag is set; the roles of the flags are exchanged at the end of each round.

In some algorithms, the predicate used in the source program to push an active node to the work-list can be used during filtering to check whether a node is active. Extracting this predicate involves a form of loop fission, and it avoids introducing flags and synchronizing their accesses. For example, in Fig. 1, the code in lines 23–24 adds active nodes to the work-list. In the generated code, this is eliminated, and a new operator is created to conditionally activate nodes, as shown in line 18 in Fig. 3. Another operator is created to execute all nodes that would have been on the initial work-list (line 42). Abelian can also directly take a filter-based implementation of a data-driven algorithm as input, in which case this transformation is not required. Termination is detected using a distributed accumulator (lines 19 and 63) provided by Gluon.
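
An illustrative sketch of such an extracted activation operator follows (field, flag, and type names are assumptions; the generated code in Fig. 3 differs in detail). It replaces the work-list push of lines 23–24 in Fig. 1 with a pass that marks nodes whose residual exceeds the threshold and bumps a distributed accumulator used for termination detection.

// Conditionally activate nodes instead of pushing them onto a work-list (sketch).
struct ActivateNodes {
  Graph& graph;
  DGAccumulator<unsigned>& activeCount; // distributed accumulator provided by Gluon
                                        // (type name illustrative)
  void operator()(GNode n) const {
    auto& data = graph.getData(n);
    if (data.residual > TOLERANCE) {
      data.flag_next = true;            // ordinary write; double-marking is harmless
      activeCount += 1;
    }
  }
};
// The outer loop runs the PageRank phases plus ActivateNodes each round, swaps
// the flags, and terminates when the accumulator reduces to 0 across all hosts.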

3.3 Inserting Communication

The final pass of the Abelian compiler inserts code for communication and synchronization. A simple approach is the following: in each round, every mirror sends its value to its master, where these values are combined, and the result is broadcast back to all the mirrors. This is essentially the gather-apply-scatter model used by most systems in this space, and it can be implemented by inserting a Gluon [10] sync call after each operator for every field that might be updated by that operator. Compilers for heterogeneous systems, such as Falcon [30], Dandelion [27], Liquid Metal [3], and DMLL [6], take a similar approach since their granularity of synchronization is an object or a field. This coarse-grained approach can be seen as a more elaborate version of the write-broadcast cache-coherence protocol used in systems with hardware cache coherence. Abelian implements a different, fine-grained communication protocol to reduce communication volume: a host sends the value of a field to other hosts only if that field has been updated in a previous round and if the value will be read in the current round. Static analysis is not adequate to determine these properties, so instrumentation code is inserted to track them dynamically. The actual communication is performed by the Gluon runtime, which is invoked by inserting sync calls into the code.

Fine-Grained Communication. In graph analytics applications, each round typically updates the field of only a small subset of graph nodes. A device-local, field-specific bit-vector is used to track updates to nodes’ fields that participate in communication. The analysis pass determines points in the operator where these fields might be updated, and the compiler inserts instrumentation code at those points to also update the node’s bit in the bit-vector for that field (lines 23, 29, 38 in Fig. 3). The Gluon sync interface permits this bit-vector to be passed to the runtime system, which uses it to avoid sending node values that have not been updated in the current round.
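
For example, the instrumented reduction to residual might look like the sketch below, shown as a helper function for brevity (our own illustration; the bit-vector type and atomicAdd are assumed names, and the actual generated code appears in Fig. 3, lines 23, 29, 38).

DynamicBitSet bitset_residual;      // device-local, field-specific bit-vector (name illustrative)

void reduceResidual(Graph& graph, GNode dst, float delta) {
  auto& ddata = graph.getData(dst);
  atomicAdd(ddata.residual, delta); // the original reduction
  bitset_residual.set(dst);         // inserted instrumentation: mark the node as updated so
                                    // the Gluon sync sends only updated entries of this field
}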

On-Demand Communication. Using the bit-vector ensures that only updated values are communicated, but it does not by itself enable Gluon's communication optimization that exploits structural invariants in partitioning policies [10]. That optimization requires the abstract write location of the last reduction access(es) to a field and the abstract read location of its next read access; if these are unspecified or imprecise, Gluon may conservatively perform redundant synchronization. The Abelian compiler can precisely identify the abstract locations of field accesses within an operator, but it cannot be precise about future accesses. Therefore, after an operator, it inserts code that invalidates the field-specific sync-state flags for the fields that could be written in the operator, recording the abstract write location (lines 49, 50, 62 in Fig. 3). Before an operator, it inserts the synchronization structures shown in Fig. 2 (equivalent GPU functions generated for a vector of nodes are omitted for brevity) and the communication code for the fields that could be read in the operator (lines 46, 52–59 in Fig. 3). This code checks the field-specific sync-state flags and, if a flag is invalidated, calls the Gluon sync routine with the precise write and read locations.
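
Schematically, the inserted code follows the pattern below; all names here are hypothetical placeholders standing in for the structures shown in Fig. 2 and the calls in Fig. 3, not the actual Gluon API.

// Inserted after an operator that may update field X at the source nodes of edges:
flags_fieldX.markWritten(AT_SOURCE);        // invalidate sync state, record the write location

// Inserted before an operator that reads field X at the destination nodes of edges:
if (flags_fieldX.needsSync(AT_DESTINATION)) {
  // Call the Gluon sync routine with the precise write and read locations; the
  // field's bit-vector restricts communication to nodes updated since the last sync.
  gluonSync_fieldX(WRITE_AT_SOURCE, READ_AT_DESTINATION, bitset_fieldX);
  flags_fieldX.clear();
}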

3.4 Device-Specific Compilers

The Abelian compiler outputs C++ code that can be compiled with existing compilers like g++ to execute on shared-memory NUMA multicores using the Galois runtime [20]. A naive translation of this C++ code to CUDA or OpenCL is not likely to yield high-performance code because it will not exploit SIMD execution. We instead use the IrGL [22] compiler, which produces highly optimized CUDA and OpenCL code from an intermediate representation intended for graph applications. This compiler exploits nested parallelism, which is important when processing scale-free graphs. To interface with the IrGL compiler, the Abelian compiler generates IrGL intermediate code, translating the data layout of node fields from an array of structures to a structure of arrays.
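
As a generic illustration of this layout translation (not IrGL-specific code), the per-node structure used on the CPU side becomes one array per field on the GPU side, so that threads in a warp access consecutive addresses.

#include <cstdint>
#include <vector>

// Array-of-structures layout used by the generated C++/CPU code.
struct NodeData {
  float rank;
  float residual;
  uint32_t nout;
};
std::vector<NodeData> nodesAoS;   // access: nodesAoS[n].residual

// Structure-of-arrays layout generated for the GPU: one contiguous array per field,
// enabling coalesced memory accesses.
struct NodeDataSoA {
  std::vector<float> rank;
  std::vector<float> residual;
  std::vector<uint32_t> nout;
};
NodeDataSoA nodesSoA;             // access: nodesSoA.residual[n]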

4 Experimental Results

To evaluate the performance of programs generated by the Abelian compiler, we studied a number of graph analytics applications: betweenness centrality (bc), breadth-first search (bfs), connected components (cc), k-core decomposition (kcore), pagerank (pr), single-source shortest path (sssp), and matrix completion using stochastic gradient descent (sgd). We specify the programs in Galois C++: a pull-style topology-driven algorithm for pr, a push-and-pull-style topology-driven algorithm for sgd, and push-style work-list-driven algorithms for the rest. The Abelian compiler analyzes each program, restructures the operators, and synthesizes precise communication. Unless otherwise noted, all optimizations are applied in our evaluation, including eliminating work-lists. The programs work with different partitioning policies. In our evaluation, we choose an incoming edge-cut for pr, a Cartesian vertex-cut for sgd, and an outgoing edge-cut for all other benchmarks. We have empirically found these policies to work well in practice; an exhaustive search to find the best policy is outside the scope of this work.

Table 1. Inputs and their key properties

Table 1 shows the input graphs we used along with their properties. All the CPU experiments were done on the Texas Advanced Computing Center’s [2] Stampede [28] KNL Cluster. For GPU experiments, the Bridges [21] supercomputer at the Pittsburgh Supercomputing Center [1, 29] was used. Table 2 shows the configuration of these clusters used in our experiments. In all our experiments, we choose the max-degree node as the source for bc, bfs, and sssp. For kcore, we solve for \(k=100\). We present the mean execution time of 3 runs, excluding graph partitioning time. We run pr and sgd for 100 and 50 iterations, respectively; all other algorithms are run until convergence.

 

Table 2. Cluster configurations
Table 3. Bridges: execution time (in seconds) on 16 GPUs for rmat28

4.1 Comparison with the State-of-the-Art

We compare Abelian-generated programs with handwritten D-Galois programs for CPU-only systems [10] and handwritten D-IrGL programs for GPU-only systems [10]. D-Galois and D-IrGL programs have explicit synchronization specified by the programmer; in contrast, synchronization in programs produced by the Abelian compiler is introduced automatically by the compiler. However, all these programs use Gluon [10], a communication substrate that optimizes communication at runtime by exploiting structural and temporal invariants in partitioning (Gluon uses LCI [9] for message transport between hosts). In addition, D-Galois and Abelian use the same Galois [20] computation operators on the CPU, while D-IrGL and Abelian use the same IrGL [22] computation kernels on the GPU. Therefore, differences in performance between Abelian-generated code and D-Galois/D-IrGL code arise mainly from how synchronization code is inserted by the Abelian compiler.

Table 4. Stampede: execution time (in seconds) (H: hosts)

We also compare Abelian-generated programs with distributed-CPU programs written in the Gemini framework [33] (Gemini does not have kcore and sgd; bc in Gemini uses bfs while that in Abelian uses sssp, so it is omitted). Gemini has explicit communication messages in the programming model, and it provides a third-party baseline for our study.

Tables 3 and 4 show the distributed-GPU and distributed-CPU results. Abelian programs match the performance of D-Galois and D-IrGL programs; the difference is not more than \(12\%\). Gemini is \(15\%\) faster than Abelian for pr with kron30 on 8 hosts. In all other cases, Abelian matches or outperforms Gemini. The geometric mean speedup of Abelian over Gemini on 32 KNL hosts is \(2.4\times \). These results show that Abelian is able to compile a high-level, shared-memory, single address space specification into efficient implementations that either match or beat the state-of-the-art graph analytics platform. Although the Abelian compiler produces code for heterogeneous devices, we do not report numbers for distributed CPU+GPU execution because the 4 GPUs on a node on Bridges outperform the CPU by a significant margin.

4.2 Impact of Communication Optimizations

We analyze the performance impact of the communication optimizations in Abelian (Sect. 3.3) by comparing three levels of communication optimization.

  1. Unoptimized (UO): the Gluon sync call is inserted for a field after an operator if it could be updated in that operator. The bit-vector as well as the abstract write and read locations are left unspecified, so all elements in the field are synchronized. Existing compilers for heterogeneous systems like Falcon [30], Dandelion [27], and Liquid Metal [3] do similar field-specific, coarse-grained synchronization.

  2. Fine-grained communication optimization (FG): the compiler instruments the code to use a bit-vector that dynamically tracks updates to fields. The Gluon sync call used is the same as in UO, but it only synchronizes the elements in the field that have been updated, using the bit-vector. This is similar to existing graph analytical frameworks [8, 12, 33] that synchronize only the updated elements.

  3. Fine-grained and on-demand communication optimization (FO): this (the default of the Abelian compiler) uses on-demand communication along with the fine-grained optimization. It instruments invalidation flags to track fields that have been updated and inserts Gluon sync calls before an operator for fields that could be read in the operator, thereby precisely identifying both the abstract write and read locations. This enables Gluon's communication optimization that exploits structural invariants in partitioning policies.

We compare these three communication optimization variants with hand-tuned (HT) programs written in D-Galois and D-IrGL on distributed CPUs and distributed GPUs respectively. In these programs, the programmer (with global control-flow knowledge) specified the precise communication using Gluon sync calls.

Fig. 5. 32 KNL hosts on Stampede: clueweb12 (left) and kron30 (right). Different variants are: UnOpt (UO), Fine-Grained opt (FG), Fine-grained+On-demand opt (FO), Hand-Tuned (HT)

Fig. 6. 16 GPU devices on Bridges: rmat28

Fig. 7. 32 KNL hosts on Stampede: partitionings for bc on clueweb12

Figures 5 and 6 present the comparison results on 32 KNL hosts of Stampede and 16 GPU devices of Bridges respectively. Each bar in the figures shows the execution time (maximum across hosts). We measure the maximum computation time across hosts in each round and take their sum, which is the total computation time (top). The rest of the execution time is non-overlapped communication time (bottom). We also measure the total communication volume across all rounds, shown in text on the bars.
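
In our notation (symbols are ours for exposition): if \(t_{r,h}\) is the computation time of host \(h\) in round \(r\), the reported computation time is \(T_{comp} = \sum _{r} \max _{h} t_{r,h}\), and the non-overlapped communication time is \(T_{exec} - T_{comp}\), where \(T_{exec}\) is the total execution time (maximum across hosts).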

The trends are clear in the figures. Each successive optimization reduces communication volume and time, further improving execution time. FG significantly reduces communication volume and time over UO, with the exception of pr: FG performs atomic updates to the bit-vector, which can add overhead when updates are dense, as in pr. FO optimizes communication volume and time further to match the performance of HT. FO reduces communication volume by 23\(\times \) over UO, yielding a geometric mean execution time speedup of 3.4\(\times \). Fine-grained and on-demand communication optimizations (FO) are thus essential to match the performance of HT on both CPUs and GPUs.

Abelian compiler-generated programs support different partitioning policies, and we study whether they can fully exploit Gluon's partition-aware optimizations as the hand-tuned programs do. Figure 7 presents the comparison for bc on clueweb12 using different partitioning policies, namely Cartesian vertex-cut [5] (cvc), hybrid vertex-cut [8] (hvc), and outgoing edge-cut (ec). FO matches the performance of HT, whereas FG does not, which shows that the compiler captures sufficient domain-specific knowledge to aid the Gluon runtime in performing partition-aware optimizations.

5 Related Work

Distributed Graph Processing Systems: Many frameworks [8, 10, 12, 15, 18, 31, 33] provide a runtime that simplifies writing distributed graph analytics algorithms. Like Abelian, these systems use a vertex programming model and bulk-synchronous parallel (BSP) execution. Abelian is the first compiler that synthesizes the required communication. Our evaluation shows that programs generated by the Abelian compiler, which use the Gluon [10] runtime, match hand-tuned programs in the Gluon system and outperform those in the Gemini [33] system.

Single-Host Heterogeneous Graph Processing Systems: There are several frameworks for graph processing on a single GPU [22], multiple GPUs [4, 23, 32], and multiple GPUs with a CPU [11]. Unlike our system, all of these are restricted to a single physical node that connects all the devices; consequently, they cannot handle graphs as large as the ones our system can. Abelian leverages the throughput optimizations in the IrGL [22] compiler that are essential for performance on power-law graphs. Unlike IrGL, which compiles an intermediate-level program representation to CUDA, the Abelian compiler not only generates this representation from a high-level C++ program but also synthesizes synchronization code to execute the compiled code on multiple devices across multiple hosts.

Compilers for Distributed or Heterogeneous Architectures: Liquid Metal [3] compiles the Lime language to heterogeneous CPUs, GPUs, and FPGAs. Dandelion [27] compiles high-level LINQ programs to distributed heterogeneous systems. Green-Marl [14] is a DSL that is compiled to Pregel. Brown et al. [6] compile a data-parallel intermediate language, DMLL, to multicores, clusters, and GPUs. Upadhyay et al. [30] compile a domain-specific language, Falcon, to Giraph code for CPU clusters and MPI+OpenCL code for GPU clusters, but it does not perform GPU-specific computation restructuring, such as exploiting nested parallelism, which the Abelian compiler does using IrGL. In all these compilers, the granularity of communication is an object or a field, whereas Abelian identifies fine-grained elements of a label array and communicates them precisely using the Gluon runtime. Moreover, none of the existing compilers use domain-specific analysis and computation restructuring for graph analytics applications as Abelian does.

6 Conclusions

Abelian is the first graph analytics compiler that can produce high-performance, distributed, heterogeneous implementations from a high-level, shared-memory, single address space specification of graph algorithms. It splits operators and eliminates work-lists to make programs bulk-synchronous. The fine-grained, on-demand communication optimizations in Abelian yield a speedup of 3.4\(\times \) over the field-specific, coarse-grained communication code generated by existing compilers. This enables the generated implementations to match the performance of hand-tuned implementations for distributed CPUs and distributed GPUs in the state-of-the-art Gluon system, using the same computation engines on the same hardware. The distributed-CPU implementations produced by Abelian also yield a geometric mean speedup of 2.4\(\times \) over programs in the distributed CPU-only system Gemini on the same hardware. This shows that the Abelian compiler can manage heterogeneity and distributed memory successfully while generating high-performance code, even in comparison to homogeneous systems.