

GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems

Published: 14 September 2024

Abstract

With the explosive growth of graph data, distributed graph processing has become popular, and many graph hardware accelerators use distributed frameworks. Graph partitioning is the foundation of distributed graph processing. However, dynamic changes to a graph shift an existing partitioning away from its optimized point and degrade system performance. Therefore, more efficient dynamic graph partitioning methods are needed.
In this work, we propose GraphSER, a dynamic graph partition method for many-core systems. To improve cross-node spatial locality and reduce the overhead of repartition, we propose a stream-based edge repartition in which each computing node sequentially traverses its local edge list in parallel and migrates edges based on distance and replica degree. GraphSER needs no costly searching, and by prioritizing which replicas to eliminate it avoids poor cross-node spatial locality.
Our evaluation shows that, compared to state-of-the-art edge repartition software methods, GraphSER achieves an average speedup of 1.52×, with a maximum of 2×. Compared to the previous many-core hardware repartition method, GraphSER improves performance by 40% on average, with a maximum of 117%.

1 Introduction

Graphs are important and powerful data structures in big-data applications. With the rapid development of the Internet, various types of graph data are generated every moment. Real-world graphs are usually large and growing rapidly: They now often contain hundreds of billions of nodes and trillions of edges [7, 11, 46]. For example, Twitter’s monthly active user count increased from 482 million to 2.5 billion within 10 years; the knowledge graph data stored by Google includes more than 57 billion data objects, each with complex semantic links; and the number of web pages in the World Wide Web has grown from 40 billion to 55 billion within five years [3].
With the rapid growth of graph data, the computational and storage capabilities of a single computer fall short of the escalating demand. Consequently, numerous distributed graph processing systems have been proposed, including Pregel [36], GraphLab [35], and PowerGraph [20], among others. Additionally, many graph hardware accelerators leverage distributed frameworks, such as Tesseract [1], GraphP [53], and GraphH [12]. Graph partitioning, which divides a graph into multiple parts, is the foundation of distributed graph processing. The execution time of a distributed graph processing system predominantly comprises two components: the computing time of the system and the communication time across nodes. The quality of partitioning significantly impacts both. First, the computing time is contingent on the slowest node; thus, a more balanced partitioning results in a shorter computing time. Second, communication time across nodes is influenced by the number of vertices spanning partitions, as vertices must transmit update values to remote neighbors. Therefore, minimizing the number of vertices across partitions reduces communication time.
However, balanced graph partitioning is an NP-hard problem [2, 6]. Graph partitioning is primarily categorized into two types: vertex partitioning and edge partitioning. Recent works [20, 43] have demonstrated theoretically and experimentally that edge partitioning yields better graph processing performance on natural power-law graphs. Despite considerable efforts to address the edge partitioning problem, the majority of existing methods concentrate on static graphs, encompassing offline (in-memory) and online (stream) approaches [3]. Offline methods load the entire graph into memory and allocate edges based on global graph information. Online methods sequentially load edge streams and directly allocate them to partitions. However, real-world graphs are dynamic. Reference [31] indicates that, as a real-world graph changes, its density gradually increases and its diameter gradually decreases. That is, starting from the initial partition, dynamic changes diminish spatial locality and thereby increase communication time across nodes. As shown in Figure 1, dynamic changes diminish the spatial locality of a graph, necessitating repartitioning. However, employing static partition algorithms to partition dynamic graphs from scratch is costly, so leveraging existing partitioning information becomes important to minimize dynamic graph partitioning costs. Figure 2 compares two methods on the dynamic graph dataset Twitter [8]: (1) partitioning the graph dynamically with our method and (2) repartitioning the entire graph with the static partition algorithm HDRF [42]. In Figure 2(a), as the graph’s scale gradually doubles, our method effectively maintains partition quality close to HDRF. In Figure 2(b), when the graph scale doubles, our method completes the repartitioning within 110 seconds, whereas HDRF requires 860 seconds.
Fig. 1. Example of dynamic graph repartitioning. (a) illustrates the initial partitioning. In (b), the dynamic graph is modified by deleting edges (b, c) and (e, f) and adding edges (b, f) and (c, e), and the partitioning quality decreases. In (c), after repartitioning, the partitioning quality increases.
Fig. 2. The comparison of static and dynamic methods on dynamic graph. (a) The average replicas per vertex at different graph sizes. (b) The partitioning time of dynamic and static methods.
Several prior studies have concentrated on edge repartition, exemplified by GR-DEP [33], SAGM-ER [34], and GraphSteel [29]. GraphSteel [29] only addresses load imbalance, while GR-DEP [33] and SAGM-ER [34] require costly searching to determine which edges are worth migrating, which is suboptimal, as dynamic partitioning demands minimal partition-time overhead. Notably, these studies uniformly assume equal communication delays between nodes, neglecting that in distributed systems communication delay varies with node distance. The impact of distance on communication delay is a crucial consideration in dynamic partitioning. DREDGE [39] is a vertex repartition method optimized for many-core systems; however, edge partitioning has better graph processing performance on natural power-law graphs. Because it monitors past vertex update messages, DREDGE may perform suboptimally for graph algorithms with random memory access patterns between iterations (e.g., SSSP and Random Walk). Furthermore, the vertex pressure in DREDGE is linked to the system size, yielding better dynamic partitioning quality on larger systems and worse on smaller ones. Hence, DREDGE’s performance varies across graph algorithms and system sizes.
To mitigate these shortcomings, we introduce a stream-based edge repartition method that considers both the overhead and the quality of repartition and exhibits stable performance across graph algorithms and system sizes. Termed GraphSER, our method is specifically tailored for many-core systems and surpasses the performance of state-of-the-art methods. Our main contributions are summarized as follows:
We propose a novel stream-based edge repartition method based on the structure of the dynamic graph itself, rather than on monitoring past memory access patterns, since many graph algorithms have different memory access patterns in different iterations.
In GraphSER, each computing node sequentially traverses its local edge list in parallel, eliminating the need to search for edges worthy of migration. Additionally, GraphSER is designed to be distance-aware in many-core systems, taking into account the communication delay between different nodes.
For the first time, we introduce the concept of replica degree and investigate its influence, along with vertex degree, on stream edge repartition. We then present optimization methods based on replica degree and vertex degree. Notably, these optimizations are unaffected by specific hardware architectures, ensuring consistent and stable dynamic partitioning improvements across system sizes.
We propose the concept of replica-hop and discuss how replica-hop measures the spatial locality of graph partitions on many-core systems. Furthermore, we illustrate the influence of replica-hop on the overall performance of many-core systems.
We evaluate GraphSER using four static graph datasets and four dynamic graph datasets. The evaluation results demonstrate the effectiveness of GraphSER in several aspects.
The rest of this article is organized as follows: Section 2 provides an introduction to the related work. Section 3 defines the dynamic edge partitioning problem, explores concepts such as replica degree and replica-hop, and introduces our proposed method, GraphSER. The experimental evaluation is presented in Section 4, and Section 5 concludes with a summary of the article.

2 Related Work

In this section, we commence by presenting an overview of related work in traditional distributed graph processing, streaming/dynamic graph processing, static graph partitioning, and dynamic graph partitioning. Subsequently, we focus on how our work differs from them.

2.1 Traditional Distributed Graph Processing

Traditional distributed graph processing frameworks are primarily categorized into two types: vertex-centric and edge-centric. Vertex-centric frameworks partition and compute graphs based on vertices, while in edge-centric frameworks, edges are the fundamental unit for computation and partitioning. In edge-centric frameworks, each edge of the graph is assigned to only one partition, but each vertex may exist in multiple partitions. Vertices that exist in multiple partitions are called replicas.
Google introduced the first vertex-centric distributed graph processing framework, known as Pregel [36]. This framework iteratively executes user-defined programs on the vertices of the graph, adopting a local, vertex-oriented computational perspective that makes programming intuitive and simple. However, for power-law graphs, the presence of high-degree vertices causes vertex-centric frameworks to exhibit greater partitioning imbalance and poorer memory access locality than edge-centric frameworks. X-Stream [45] first proposed an edge-centric system based on the insight that storage media, including disks, SSDs, and DRAM, perform significantly better under sequential data access. Because edges are the fundamental unit for computation and partitioning, edge-centric frameworks achieve balanced partitioning more easily than vertex-centric ones.
Recently, an increasing number of hardware graph accelerators utilize distributed graph processing frameworks. For instance, Tesseract [1] employs a vertex-centric framework, and GraphH [12] utilizes an edge-centric framework. Both employ the Hybrid Memory Cube (HMC) to implement many-core graph processing systems. The computing capacity, memory bandwidth, and storage capacity of an HMC system expand linearly with system scale, and its high memory bandwidth effectively eliminates local memory access bottlenecks. However, both vertex-centric and edge-centric frameworks necessitate cross-node communication. GraphP [53] highlights a critical bottleneck in this type of distributed system: the network latency resulting from communication between different nodes. This issue arises because, in many-core systems, the internal memory bandwidth of nodes can scale linearly as the system scales up, but the network bandwidth between nodes cannot expand because of the limited number of physical pins. As the system scales, the average distance between two nodes also increases, causing data transportation latency between nodes to become a performance bottleneck. DREDGE [39] introduced a method to enhance the spatial locality of graph partitions through vertex repartition, reducing cross-node memory accesses and mitigating this bottleneck. However, DREDGE has two drawbacks: First, it optimizes partitioning by monitoring previous memory access patterns, which proves challenging for algorithms with random memory access patterns; second, its vertex pressure is linked to the system size, resulting in performance fluctuations on smaller systems. Our work employs edge repartition to enhance the spatial locality of graph partitions based on the inherent structure of dynamic graphs.

2.2 Streaming/Dynamic Graph Processing

In this study, consistent with Reference [5], we adopt the terminology employed therein. Specifically, we employ “dynamic” to characterize modifications made to the graph dataset, reserving the term “streaming” to denote the method of accessing or updating the graph.
Dynamic graph research can be categorized into several classes [5], and different classes focus on different graph problems. First, dynamic graphs involve frequent updates to the dataset, encompassing the insertion and deletion of edges and vertices. Traditional CSR-format updates are notably inefficient in this context. Stinger [14] and its variants DISTINGER [17] and cuSTINGER [21] are well-known works addressing this issue. Stinger adapts and extends the CSR format to support quick insertions and deletions; DISTINGER [17] extends Stinger to distributed systems, and cuSTINGER [21] extends Stinger to CUDA GPUs.
As discussed in Section 1, the quality of graph partitions tends to degrade following updates to graph datasets. Our research, centered around dynamic graph partitioning, specifically addresses this issue, and this class of work will be detailed in Section 2.4.
Research beyond dynamic graph data structures and dynamic graph partitioning includes incremental computation. Typically, graph updates are small relative to the overall graph size and impact only a limited subset of vertices [16, 19]. Thus, incremental computation based on previous results can significantly improve the efficiency of dynamic graph computations. Kineograph [10] and GraphIn [48] are well-known works that employ incremental computation; however, they may generate incorrect results when edges are deleted. To address this, KickStarter [51] and GraphBolt [37] are specifically designed to trace dependencies, ensuring correct results. For better performance, JetStream [44] proposes the first graph accelerator that supports incremental computation.

2.3 Static Graph Partitioning

Static graph partitioning can be categorized into vertex partitioning and edge partitioning, as illustrated in Figure 3. Vertex partitioning, also known as edge-cut, produces edges that span nodes. Conversely, edge partitioning, or vertex-cut, produces vertices that span nodes. The term “replica” is used when a vertex is cut, resulting in two or more identical vertices. Subject to partition balance, the primary goal of vertex partitioning is to reduce the number of edges across nodes, while the primary goal of edge partitioning is to reduce the number of vertices across nodes.
Fig. 3. Example of vertex partitioning and edge partitioning. (a) is an example of an input graph. In the upper input graph, four edges (b, c), (a, c), (c, d), and (c, e) are cut, resulting in partitioning (b). In the lower input graph, vertex c is cut into three replicas, resulting in partitioning (c).
Static graph partitioning can also be categorized into offline (in-memory) [3] and online (stream) [42] methods. The subsequent discussion focuses primarily on edge partitioning because of its superior graph processing performance for power-law graphs. Offline methods load the entire graph into main memory and allocate edges based on global graph information. While this approach achieves higher partitioning quality, it incurs significant time overhead and requires the entire graph in main memory, rendering it impractical for large-scale graph partitioning. Online methods sequentially load edge streams and allocate edges based on partition rules. These methods have lower time overhead, do not require the entire graph in main memory, and can execute in streaming form, loading one edge at a time, making them suitable for large-scale partitioning; however, their partitioning quality is not as good as that of offline methods. Compared to our proposed stream edge repartition method, online methods for static graphs have two shortcomings. First, online methods can only place newly added edges into existing partitions; if the graph data contains newly deleted edges, online methods cannot handle them. Second, even if all dynamic changes are newly added edges that online methods can handle, the degrees and replica degrees of vertices change as the graph and its partitions evolve, leaving room for further optimization of partition quality; since online methods cannot repartition, they have no opportunity to improve partitioning quality.

2.4 Dynamic Graph Partitioning

Real-world graphs change dynamically, and partitioning quality usually decreases as a graph changes [31]. However, static graph partitioning from scratch is too costly, so dynamic graph partitioning has been proposed to reuse the initial partition information and reduce the cost of partitioning. Dynamic graph partitioning can also be divided into two main categories: vertex repartition and edge repartition. This work focuses on edge repartition, as edge partitioning performs better on power-law graphs [33].
GraphSteel [29] is an edge repartition algorithm, but it is used only to balance the running-time imbalance of partitions during graph computation: it moves edges from slower partitions to faster ones, improving computation and memory efficiency. GrapH [38] proposed the repartition algorithm H-adapt to improve partitioning quality: in a heterogeneous environment, it starts from high-traffic vertices and checks whether migrating their adjacent edges between two selected partitions can reduce communication costs. However, GrapH targets the load imbalance caused by graph computation; it cannot handle repartitioning after the structure of a dynamic graph changes. GR-DEP [33] and SAGM-ER [34] proposed methods based on edge groups, treating multiple connected edges as a group, and improve partitioning quality by migrating certain edge groups between two selected partitions; both propose edge-group search algorithms to determine which groups are worth migrating.
The works above require costly searching to determine which edges are worth repartitioning, but dynamic graph partitioning needs low partitioning overhead, as partitioning is carried out on-the-fly. In the following sections, we propose a stream-based edge repartition method that takes into account both the overhead and the quality of repartition while also considering the distance of data transfer.

3 Methodology

In this section, we first define the dynamic edge partitioning problem, then propose the concept of replica degree and discuss the impact of replica degree and vertex degree on edge repartition. Finally, we propose a distance-aware stream edge repartition method for many-core systems and give an overview of our proposed system.

3.1 Problem Definition

Let \(G = (V, E)\) be an undirected graph, where V is the vertex set, E is the edge set, \(|V|\) is the number of vertices, and \(|E|\) is the number of edges. For a subset \(E_s\subseteq E\), \(V(E_s)\) represents the vertices contained in the edge set \(E_s\). In vertex partitioning, V is divided into k partitions, while in edge partitioning, E is divided into k partitions. The edge partitions are defined as \(P = \lbrace P_1, P_2, \ldots, P_k\rbrace\), where \(E = \bigcup _{i=1}^{k} P_i\) and \(P_i \cap P_j = \emptyset\) for \(i \ne j\). In edge partitioning, each edge is allocated to exactly one partition, but each vertex can be copied to multiple partitions. Let \(R(V_i)\) be the set of partitions to which \(V_i\) is allocated, i.e., the partitions that contain \(V_i\); \(|R(V_i)|\) is the number of partitions containing \(V_i\) and is at least 1. If \(|R(V_i)| \gt 1\), then at least two partitions contain \(V_i\), and \(V_i\) can be regarded as a boundary vertex. All replicas of a vertex are identical, with no primary or secondary distinction.
In edge-centric distributed frameworks, the update value of a replicated vertex must be synchronized across nodes. Therefore, the fewer the vertex replicas, the lower the synchronization cost and the better the graph processing performance. Edge partitioning measures the number of vertex replicas using the replication factor RF [42], where \(RF(k)\) denotes the replication factor for a k-part partitioning, defined as follows:
\begin{equation} RF(k) = \frac{\sum _{v\in V} |R(v)|}{|V|}. \tag{1} \end{equation}
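To make Equation (1) concrete, the following minimal Python sketch computes \(R(v)\) and \(RF(k)\) from an edge partitioning; the data layout and names are illustrative assumptions, not part of the paper.

```python
# A minimal sketch of Equation (1). `partitions` maps a partition id to its
# edge list; these names are illustrative assumptions, not from the paper.
from collections import defaultdict

def replication_factor(partitions):
    """partitions: dict[int, list[tuple]] mapping partition id -> edges."""
    replicas = defaultdict(set)          # vertex v -> R(v), partitions holding v
    for pid, edges in partitions.items():
        for u, v in edges:
            replicas[u].add(pid)
            replicas[v].add(pid)
    # RF(k) = sum over v of |R(v)|, divided by |V|
    return sum(len(r) for r in replicas.values()) / len(replicas)

parts = {1: [("a", "b"), ("a", "c")], 2: [("a", "d"), ("c", "e"), ("d", "e")]}
print(replication_factor(parts))  # a and c are replicated: RF = 7/5 = 1.4
```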
The replication factor mainly affects the amount of cross-node communication in distributed graph computing, while the local computing time of nodes is mainly affected by load balancing. The load balance of an edge partitioning is measured by Equation (2), where \(\varepsilon\) is the imbalance coefficient satisfying \(1 \le \varepsilon \le 2\) [33, 34]:
\begin{equation} (2 - \varepsilon)\frac{|E|}{k} \le |P_i| \le \varepsilon \frac{|E|}{k}, \quad 1 \le i \le k. \tag{2} \end{equation}
In Equation (2), \(\varepsilon = 1\) represents the situation where partitions are strictly balanced, and the number of edges in each partition is \(|E| / k\); \(\varepsilon = 2\) represents the most imbalanced situation in partitioning, with a maximum of \(2|E| / k\) edges and a minimum of 0 edges per partition; if \(\varepsilon = 1.5\), then the maximum number of edges per partition is \(1.5|E| / k\), and the minimum is \(0.5|E| / k\).
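As a worked illustration of these bounds, the short sketch below checks Equation (2) for a set of partition sizes; the helper name is an assumption of this example.

```python
# A small sketch of the balance constraint in Equation (2): every partition
# size |P_i| must lie within [(2 - eps)|E|/k, eps|E|/k].
def is_balanced(partition_sizes, eps):
    e = sum(partition_sizes)                  # |E|: total number of edges
    k = len(partition_sizes)                  # k: number of partitions
    lo, hi = (2 - eps) * e / k, eps * e / k
    return all(lo <= p <= hi for p in partition_sizes)

# With eps = 1.5 and |E|/k = 100, sizes may range from 50 to 150 edges.
print(is_balanced([120, 95, 85, 100], 1.5))  # True
print(is_balanced([210, 30, 80, 80], 1.5))   # False: 210 > 150
```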

3.2 Replica Degree

The time required for edge-centric graph processing mainly consists of two parts: the cross-node communication time for vertex replicas and the computing time at each node. The ultimate goal of dynamic edge partitioning is to minimize the total graph processing time. First, the cross-node communication time depends on the number of vertex replicas: a smaller replication factor yields better spatial locality of partitions and shorter cross-node communication time. Second, the computing time of a multi-node system depends on the slowest node: an imbalance coefficient \(\varepsilon\) close to 1 means better workload balance across nodes and shorter computing time. In fact, the key idea of dynamic edge partitioning is to migrate edges from one partition to others so as to reach a smaller replication factor and a more balanced workload after dynamic changes to the graph. A smaller imbalance coefficient \(\varepsilon\) therefore allows fewer edges to be migrated to other partitions, leaving fewer edges available for repartition. According to Equation (2), the number of edges per partition has an upper limit \(\varepsilon |E| / k\) and a lower limit \((2 - \varepsilon) |E| / k\), so each partition can migrate up to \(2(\varepsilon - 1) |E| / k\) edges to other partitions; when \(\varepsilon = 1\), no edges can be migrated during repartition. According to this analysis, to minimize the execution time of graph processing, one would ideally minimize the imbalance coefficient and the replication factor simultaneously. However, these two parameters are tightly coupled in their impact on system performance, so dynamic edge partitioning must consider both to find the optimized point for graph processing performance.
To illustrate the impact of the imbalance coefficient and replication factor, an example is given in Section 4.2.1, where we evaluate the impact of the imbalance coefficient \(\varepsilon\) on the performance of dynamic partitioning. The experimental results, shown in Figure 7, are consistent with our analysis: a smaller imbalance coefficient \(\varepsilon\) means a smaller optimization space for dynamic partitioning and thus a longer cross-node communication time. However, the total execution time is smallest when \(\varepsilon\) is 1.05, because the local computing time of nodes is also part of the total execution time. The results indicate that dynamic edge partitioning must balance the imbalance coefficient and the replication factor to achieve the maximum performance improvement. Because reductions of these two parameters are mutually constrained, the optimized performance of the system in the parameter space \((\varepsilon , RF)\) is reached by migrating the fewest edges that eliminate the largest number of vertex replicas.
Fig. 4. Example of vertex replica degree. Vertices a, c, and d are boundary vertices. Vertex a connects two edges in the \(P_1\) partition, with a replica degree of 2, and one edge in the \(P_2\) partition, with a replica degree of 1. Vertices c and d have a replica degree of 1 in both partitions.
Fig. 5. Example of cutting a high-degree vertex. (a) An example graph containing two high-degree vertices a and e. (b) Cutting a and e only once yields five replicas after partitioning. (c) Cutting a and e twice each yields four replicas after partitioning, one fewer than (b).
Fig. 6. The running process of degree-based stream edge repartition. Each computing node sequentially traverses its local edge stream in parallel and migrates edges based on the degrees of each edge's vertices.
Fig. 7. The impact of the imbalance coefficient \(\varepsilon\) on the total execution time.
To reduce the replication factor, we need to eliminate vertex replicas in one partition without adding new vertex replicas in another. This requires that the destination partition to which an edge is migrated already contains the vertex, and that the source partition no longer contains the vertex after the edge is migrated away. As shown in Figure 4, to eliminate vertex replicas in the \(P_1\) partition, vertices a, c, and d meet the requirements, because they also have replicas in the \(P_2\) partition: after migration eliminates their replicas in \(P_1\), no new replicas are added in \(P_2\). To measure the efficiency of eliminating vertex replicas, we propose the replica degree, defined as follows: \(RD_i(v)\) denotes the replica degree of vertex v in partition i, i.e., the total number of edges incident to v in partition i. Because each edge is allocated to exactly one partition, the union over all partitions of v's incident edges is exactly v's incident edge set in the original graph, i.e., \(\sum _{i=1}^{k}RD_i(v) = D(v)\), where \(D(v)\) is the degree of v. Replica degree is a very intuitive measure of the efficiency of eliminating vertex replicas: replica v connects \(RD_i(v)\) edges in partition i, so migrating those \(RD_i(v)\) edges to another partition containing v eliminates the replica. The efficiency of replica elimination is thus \(1 / RD_i(v)\), inversely proportional to the replica degree, meaning that vertices with lower replica degree are more worth eliminating through edge repartition. As shown in Figure 4, vertex a needs to migrate two edges to eliminate its replica in the \(P_1\) partition, while vertices c and d each need to migrate only one edge; the replicas of c and d are therefore more worth eliminating.
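A minimal sketch of this bookkeeping follows: it computes \(RD_i(v)\) for every vertex and partition, from which the elimination efficiency \(1/RD_i(v)\) can be read off. The container layout is an illustrative assumption.

```python
# Replica degree: rd[v][pid] = number of edges incident to v inside partition
# pid, so that the sum over pid of rd[v][pid] equals D(v).
from collections import defaultdict

def replica_degrees(partitions):
    rd = defaultdict(lambda: defaultdict(int))
    for pid, edges in partitions.items():
        for u, v in edges:
            rd[u][pid] += 1
            rd[v][pid] += 1
    return rd

parts = {1: [("a", "b"), ("a", "c")], 2: [("a", "d"), ("c", "e")]}
rd = replica_degrees(parts)
# Eliminating a's replica in P1 requires migrating RD_1(a) = 2 edges, while
# eliminating c's replica in P1 requires only RD_1(c) = 1 edge.
print(rd["a"][1], rd["c"][1])  # 2 1
```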
Although replica degree intuitively measures the efficiency of vertex replica elimination, relying solely on it for dynamic partitioning is not enough. Edge repartition eliminates vertex replicas by migrating edges, but an edge connects two vertices, and deciding to migrate an edge based solely on the replica degree of one vertex may actually increase the replication factor. Yet it is difficult to decide a migration by checking the replica degrees of both vertices of the edge simultaneously. In the following section, we propose using vertex degree to account for both vertices of an edge, further improving the quality of dynamic partitioning.

3.3 Vertex Degree-based Stream Edge Repartition

To determine more clearly whether an edge is worth migrating, we classify the two vertices connected by an edge as follows: Assume an edge is migrated from a source partition \(P_s\) to a destination partition \(P_d\), and consider a vertex v of that edge. Before the migration, \(P_s\) necessarily contains v; after the migration, there are two cases: either \(P_s\) still contains v (v has other edges in \(P_s\)), or v's replica in \(P_s\) is eliminated. Symmetrically, after the migration \(P_d\) necessarily contains v, and before the migration there are two cases: either \(P_d\) already contained v, or it did not. Combining the two source cases with the two destination cases yields four categories of vertices, as shown in Table 1. Of these, category (1) increases the number of vertex replicas, categories (2) and (3) leave it unchanged, and only category (4) reduces it.
|  | \(v \in P_d\) before migration | \(v \notin P_d\) before migration |
| \(v \in P_s\) after migration | (2) \(|R(v)|\) unchanged | (1) \(|R(v)| + 1\) |
| \(v \notin P_s\) after migration | (4) \(|R(v)| - 1\) | (3) \(|R(v)|\) unchanged |
Table 1. Four Types of Vertices
The vertices we actively migrate based on replica degree must be type (4), which guarantees that the migration reduces the number of vertex replicas. However, an edge has two vertices, and when the edge is migrated, one vertex passively follows the actively migrated vertex. To maximize the benefit of dynamic partitioning, we must reduce the probability of the passively following vertex being type (1) and make it a type (2), (3), or (4) vertex as much as possible; if it is type (4), the efficiency of eliminating vertex replicas is maximized. We observe that the degree of a vertex is closely related to its probability of being a type (1) vertex: type (1) requires the destination partition to hold no replica of the vertex before the migration, so the more partitions a vertex is in, the less likely it is to be type (1). Since high-degree vertices tend to be cut across multiple partitions in an edge partitioning scheme due to their many incident edges, their probability of being a type (1) vertex is low; conversely, low-degree vertices have a higher probability of becoming type (1) vertices.
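The classification in Table 1 can be expressed as a small predicate; the sketch below reuses the `rd` map from the previous sketch and returns the change in \(|R(v)|\) for one endpoint of a migrated edge. The helper name is an assumption of this example.

```python
# Change in |R(v)| when one edge incident to v moves from P_s (src) to P_d
# (dst). rd[v][src] counts v's local edges including the migrating one.
def replica_delta(rd, v, src, dst):
    stays_in_src = rd[v][src] > 1     # v keeps other edges in P_s after migration
    already_in_dst = rd[v][dst] > 0   # P_d already holds a replica of v
    if stays_in_src and not already_in_dst:
        return +1                     # type (1): a new replica is created
    if not stays_in_src and already_in_dst:
        return -1                     # type (4): a replica is eliminated
    return 0                          # types (2) and (3): |R(v)| unchanged
```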
Eliminating vertex replicas essentially means concentrating the edges of a vertex into one partition as much as possible: if all edges of v are in the same partition, then \(|R(v)|\) reaches its minimum value of 1. High-degree vertices are more likely to drive type (1) vertices to follow the migration during this concentration process. Although a high-degree vertex eliminates its own replicas, the migration of type (1) vertices adds replicas and actually reduces partition quality. This is because high-degree vertices connect to many vertices, so eliminating a high-degree vertex's replicas drags many low-degree vertices along with the migration, and as noted above, low-degree vertices are more likely to become type (1) vertices. As shown in Figure 5, concentrating high-degree vertices actually increases the number of vertex replicas in the entire system. Therefore, low-degree vertices are better suited as actively migrated vertices, while high-degree vertices are better suited to follow, and whether an edge is worth migrating can be judged by the degrees of its vertices.
Taking the above considerations into account, our degree-based stream edge repartition method is illustrated in Figure 6. After the graph data undergoes dynamic changes, each computing node sequentially traverses its local edge list in parallel and selects the lower-degree vertex v of each edge. Based on whether v is a type (4) vertex and on its replica degree, the edge is judged worth migrating if its efficiency in eliminating vertex replicas exceeds a threshold \(T_R\).
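The following sketch outlines one plausible reading of this per-node streaming pass, reusing the `rd` map and `replica_delta` predicate from the sketches above; `degree` and `migrate` are assumed helpers, and details such as load-balance checks are omitted.

```python
# Degree-based stream edge repartition: the low-degree endpoint leads, and an
# edge migrates toward the partition where that vertex has the highest replica
# degree, if the elimination efficiency 1 / RD_local(v) exceeds T_R.
def repartition_local(local_pid, local_edges, degree, rd, t_r, migrate):
    for (u, v) in list(local_edges):
        active = u if degree[u] <= degree[v] else v   # low-degree vertex leads
        dst = max(rd[active], key=rd[active].get)     # partition with highest RD
        if dst == local_pid:
            continue                                  # already concentrated here
        # dst already holds `active`, so moving all of its local edges there
        # eliminates the local replica without creating a new one (type (4)).
        if 1.0 / rd[active][local_pid] > t_r:
            migrate((u, v), local_pid, dst)
            for w in (u, v):                          # update both endpoints
                rd[w][local_pid] -= 1
                rd[w][dst] += 1
```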

3.4 Distance-aware Optimization

Previous edge repartition works GR-DEP [33], SAGM-ER [34], and GrapH [38] used the replication factor to measure the communication cost during the synchronization phase. However, for many-core systems, the replication factor cannot reflect the exact cross-node communication cost, because it measures only the number of cross-node messages, not the distance those messages travel. In this work, we optimize the stream edge repartition algorithm for many-core systems and propose a new metric, replica-hop, to measure the communication cost of many-core systems.
Hop is a metric in networks-on-chip [13]: the number of routers a message passes through from source node to destination node, or equivalently the number of links it traverses. Hop count is a simple and useful proxy for network latency. The proposed replica-hop combines the concept of hops with the replication factor. It is defined as follows: if a vertex v has replicas in partitions \(P_i\) and \(P_j\), and messages from \(P_i\) to \(P_j\) require \(H_v\) hops, then \(\text{replica-hop} = \sum _{v\in V} H_v\). We can use replica-hop to measure the spatial locality of a graph stored in a many-core system: the smaller the replica-hop, the better the spatial locality and the lower the communication overhead during the synchronization phase.
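On the mesh topology used here, the hop count between two nodes under X-Y routing is their Manhattan distance. The sketch below computes replica-hop under the assumption that \(H_v\) sums the pairwise hop counts among all partitions holding a replica of v; `coord` is an assumed mapping from partition id to mesh coordinates.

```python
# Replica-hop on a 2D mesh: sum, over all vertices, of the pairwise hop
# counts between the partitions holding their replicas.
from itertools import combinations

def hops(a, b, coord):
    (ax, ay), (bx, by) = coord(a), coord(b)
    return abs(ax - bx) + abs(ay - by)    # Manhattan distance under X-Y routing

def replica_hop(replicas, coord):
    """replicas: dict vertex -> set of partition ids holding a replica."""
    return sum(hops(p, q, coord)
               for parts in replicas.values()
               for p, q in combinations(sorted(parts), 2))
```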
Both our work and DREDGE target mesh-based many-core architectures. DREDGE [39] pointed out that reducing the replication factor only reduces the total number of messages, while the average transmission distance of messages may increase. Therefore, for many-core systems, simply reducing vertex replicas may actually increase communication overhead. Vertex replicas with larger hop counts incur more communication overhead, so priority should be given to migrating edges that eliminate these replicas, which maximizes system efficiency. Combining this with the degree-based stream edge repartition algorithm of Section 3.3, the pseudocode of distance-aware stream edge repartition is shown in Algorithm 1. The algorithm runs in parallel on each computing node in the system, and its running process is consistent with Figure 6.
In Algorithm 1, we compute during initialization, for each vertex v, the \(node\_id\) of the node with the highest replica degree, because migrating the vertex to that node has the highest probability of eliminating vertex replicas. The function \(distance(a, b)\) calculates the communication distance (hops) between nodes a and b in a many-core system. Because vertex replicas with larger hop counts and lower replica degrees have higher priority for elimination, we use \(distance(L, Max\_ID(v)) / RD_L(v)\) to measure whether an edge is worth migrating; if it is greater than the threshold \(T_R\), then the edge is repartitioned.
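The migration test itself can be summarized as a one-line score, sketched below with the `hops` helper from the previous sketch; `max_id` stands for the precomputed \(Max\_ID(v)\) table, and the concrete plumbing is an assumption of this example.

```python
# Distance-aware migration test: replicas that are far away (large hop count)
# and cheap to eliminate (small RD_L(v)) score highest.
def worth_migrating(v, local_pid, max_id, rd, coord, t_r):
    score = hops(local_pid, max_id[v], coord) / rd[v][local_pid]
    return score > t_r    # repartition the edge if the score exceeds T_R
```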

3.5 System Overview

Although the system we simulate is a processing-in-memory (PIM) system, our proposed GraphSER is a general method that applies to any system requiring graph partitioning (such as general PIM systems, distributed systems, and multi-GPU systems), because GraphSER focuses on optimizing the quality of graph partitions, a stage decoupled from the other stages of graph processing. The simulated PIM system follows References [18, 24], and its details are shown in Table 2.
| Component | Configuration |
| Memory | 48 Gb total capacity, 384 DRAM banks; 128 Mb capacity and 4.25 GB/s bandwidth per DRAM bank |
| Core | 800 MHz RISC-V in-order core; one core per DRAM bank |
| Network | Mesh topology; each link provides 40 GB/s bandwidth [1]; one hop between adjacent nodes takes 50 clock cycles |
Table 2. System Configuration
In our system, each PE core may access only its own local memory, and PEs communicate with one another through messages. Message passing avoids coherence protocols and atomic operations among PE cores, so the system can scale up without coherence management, and we use no special atomic operations such as locks or barriers. One PE core serves as a centralized control point to collect synchronization status and issue synchronization commands.
We use an unsorted edge list as our data format because (1) most open-source graph datasets store edges as edge lists, (2) compressed formats such as CSR are very inefficient to update [5], and (3) our method traverses edges in streaming form, which an edge list naturally suits. When the graph is updated (insertion or deletion of edges and vertices), the update operations on the edge list differ, but repartitioning performance after updates is unchanged, because our method only traverses the edge list, regardless of the content stored in it.
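A minimal sketch of such an unsorted edge list follows; the class layout and the linear-scan deletion are illustrative assumptions rather than details given in the paper.

```python
# Unsorted edge list: O(1) appends for insertions (unlike CSR rebuilds) and a
# simple generator for the streaming traversal GraphSER performs.
class EdgeList:
    def __init__(self):
        self.edges = []                 # unsorted list of (u, v) tuples

    def insert(self, u, v):
        self.edges.append((u, v))       # constant-time append

    def delete(self, u, v):
        self.edges.remove((u, v))       # linear scan over the local list

    def stream(self):
        yield from self.edges           # sequential, streaming-form traversal
```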

4 Evaluation

In this section, we first evaluate the impact of various parameters on our system. Then, we evaluate the impact of replica degree, degree, and replica-hop on our GraphSER partitioning. Finally, we evaluate the scalability of GraphSER and compare GraphSER with the state-of-the-art works using an in-house simulator.

4.1 Experimental Setup

We evaluate several metrics of the GraphSER algorithm, such as replication factor, replica-hop, and memory overhead, on an Intel Xeon Gold 6254 CPU @ 3.10 GHz. We evaluate the overall performance of GraphSER on the many-core system using an in-house simulator that combines ZSsim [47], Ramulator [28, 49], and Booksim [23] to simulate a many-core system.
Ramulator is a cycle-accurate DRAM simulator that we use to simulate the main memory of the many-core system. Based on a generalized template for modeling DRAM systems, it provides out-of-the-box support for various DRAM standards such as DDR3/4, LPDDR3/4, GDDR5, HBM, and HMC. Ramulator replays the memory-access trace generated by ZSsim [47] to simulate memory access in the many-core system, including local and cross-node accesses. Booksim is a cycle-accurate network-on-chip simulator designed for simulation flexibility and precise modeling of network components; we use it to simulate the network of the many-core system. The network topology is a mesh with the X-Y routing algorithm.
We use a total of eight datasets, four static and four dynamic, as shown in Table 3. The four static datasets are the same as in DREDGE [39]: LiveJournal [4], Friendster [52], Amazon [30], and Roadnet [32]; we choose them for ease of comparison with DREDGE in Section 4.4. The four dynamic datasets, used for comparison with state-of-the-art works in Section 4.5, are Twitter [8], Facebook [50], Flickr [9], and Orkut [40].
| Name | Type | Number of Vertices | Number of Edges | Average Degree |
| LiveJournal | Static | 4.8M | 69M | 14.38 |
| Friendster | Static | 65.6M | 1.8B | 27.44 |
| Amazon | Static | 335k | 925k | 2.76 |
| Roadnet | Static | 2M | 2.8M | 1.4 |
| Twitter | Dynamic | 54M | 2B | 37.04 |
| Facebook | Dynamic | 60.3k | 1.55M | 25.64 |
| Flickr | Dynamic | 2.57M | 33.14M | 12.89 |
| Orkut | Dynamic | 3.07M | 223.5M | 72.8 |
Table 3. Graph Datasets
The four dynamic datasets can be used directly in our evaluation, but the four static datasets must be converted into dynamic datasets. Following the method of SAGM-ER [34], we divide each graph into an initial graph and a dynamic variation graph, with the split chosen at random. Metis [25, 26] and its parallel version ParMetis [27] are well-known static graph partitioners that achieve high quality in vertex partitioning, but in edge partitioning the quality of Metis is not as good as HDRF [42]. We therefore use HDRF [42] for the initial partitioning, with 70% of the dataset partitioned initially and the remaining 30% used as dynamic changes, including insertions and deletions of edges and vertices. According to Section 4.2.1, we set the imbalance coefficient of both the initial partitioning algorithm and the subsequent dynamic partitioning algorithm to \(\varepsilon = 1.05\). We run the repartition algorithm after the dynamic changes and evaluate the various metrics of the algorithm.

4.2 Parameter Analysis

In this section, we evaluate the impact of various parameters on our system, such as the imbalance coefficient \(\varepsilon\), the threshold \(T_R\), and the graph diameter.

4.2.1 Analysis of Imbalance Coefficient \(\varepsilon\).

The time required for graph processing mainly consists of two parts: the cross-node communication time and the local computing time. As discussed in Section 3.2, the imbalance coefficient \(\varepsilon\) affects both parts. In this subsection, we evaluate its impact on the total execution time.
According to Section 4.4, our system performs best at a system size of 20 × 20, so we simulate a 20 × 20 system. We use the eight datasets in Table 3 with PageRank as the algorithm, and we use Twitter [8] as an example to show the breakdown of total execution time in Figure 7. We divide the total execution time into four categories: communication time, computation time, synchronization time, and repartitioning time. The repartitioning time refers specifically to the time the system spends running our repartition algorithm, including the time to compute intermediate values (e.g., \(|R(v)|\)).
As the imbalance coefficient \(\varepsilon\) increases, the repartitioning optimization space grows, so the cross-node communication time shrinks, while the repartitioning time grows in exchange for better repartitioning quality. The local computation time depends mainly on load balancing, so it also grows as \(\varepsilon\) increases. The synchronization time is independent of \(\varepsilon\) and shows no significant change. The optimal imbalance coefficients for the different datasets are shown in Table 4. Combined with Figure 7, we can see that the total execution time is shortest when the imbalance coefficient is between 1.04 and 1.1, at which point it strikes the optimized balance between repartitioning quality and load balancing.
| Dataset | LiveJournal | Friendster | Amazon | Roadnet | Twitter | Facebook | Flickr | Orkut |
| Optimal \(\varepsilon\) | 1.05 | 1.07 | 1.08 | 1.06 | 1.05 | 1.04 | 1.07 | 1.05 |
Table 4. The Optimal Imbalance Coefficient \(\varepsilon\) of Different Datasets

4.2.2 Analysis of Threshold \(T_R\).

We introduced \(T_R\) in Section 3.4 as the minimum threshold for triggering edge migration. In this subsection, we evaluate the impact of the threshold \(T_R\) on system performance in detail. The configuration is the same as in Section 4.2.1: Twitter [8] as the dataset, PageRank as the algorithm, and a simulated system size of 20 × 20, with total execution time divided into communication time, computation time, synchronization time, and repartitioning time.
As the threshold \(T_R\) decreases, the number of edges that meet the migration condition increases. If \(T_R\) is 0, then all edges meet the migration condition, and the system migrates every edge it traverses until the load-balancing condition is met (i.e., random repartition). If \(T_R\) is sufficiently large, then no edges meet the migration condition (i.e., no repartition). The evaluation results, shown in Figure 8, are consistent with this analysis. The computation and synchronization times are independent of \(T_R\) and show no significant change. As \(T_R\) increases, fewer edges meet the migration condition, so the system must traverse more edges to complete the repartition and the repartitioning time grows. When \(T_R\) is small, a large number of edges meet the migration condition, making it difficult to migrate the edges that truly optimize partitioning quality; when \(T_R\) is large, only a few edges qualify, so the repartitioning optimization space is small. Hence, the communication time grows when \(T_R\) is either very small or very large. The total execution time is shortest when \(T_R\) is 8, at which point the threshold strikes the optimized balance between repartitioning quality and repartitioning time.
Fig. 8. The impact of threshold \(T_R\) on the total execution time.

4.2.3 Analysis of Dataset Parameters.

Graph datasets have many properties, such as average degree, maximum degree, and so on. In this subsection, we select the three most important properties of dynamic graph datasets, which we consider to be: graph diameter, graph change rate, and graph skewness.
The diameter of a graph is the maximum distance between any two vertices. The smaller the graph diameter, the tighter the connections between graph partitions and the more communication between partitions. As mentioned in Section 1, after updates to a graph dataset, the quality of its graph partitions deteriorates; the higher the rate of graph changes, the faster the quality of dynamic graph partitioning deteriorates. One of the landmark properties of real-world graphs is a skewed power-law degree distribution [15]: most vertices have few neighbors while a few have many. Under a power-law distribution, the probability that a vertex has degree d is given by:
\begin{equation} P(d) \propto d^{-\alpha }, \tag{3} \end{equation}
where d is the degree of the vertex and \(\alpha\) is a positive constant representing the “skewness” of the degree distribution. As \(\alpha\) increases, the number of high-degree vertices decreases, the density of the graph decreases, and the skewness of the graph increases; and as the skewness increases, load balancing of graph partitioning becomes harder to achieve. Most real-world graphs have a skewness constant around \(\alpha \approx 2\) [15].
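To illustrate Equation (3), the sketch below draws degrees from a power law by inverse-transform sampling and shows how a larger \(\alpha\) thins out high-degree vertices; it is purely illustrative, and the evaluation itself uses Graph500-generated graphs.

```python
# Inverse-transform sampling of P(d) ~ d^(-alpha) for d >= 1 (valid for
# alpha > 1): d = (1 - u)^(-1 / (alpha - 1)) with u uniform in [0, 1).
import random

def sample_powerlaw_degree(alpha, d_max=10**6):
    u = random.random()
    return min(int((1 - u) ** (-1 / (alpha - 1))), d_max)

for alpha in (1.8, 2.0, 2.5):
    degs = [sample_powerlaw_degree(alpha) for _ in range(100_000)]
    hubs = sum(d >= 100 for d in degs) / len(degs)
    print(f"alpha={alpha}: fraction of vertices with degree >= 100: {hubs:.4f}")
```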
In this experiment, we analyze three properties of the graph, so we use Graph500 [41] to generate the graph datasets, which makes the variables easy to control. The simulated system size is 20 × 20, and we set the unoptimized solution without dynamic partitioning as the baseline. The evaluation results are shown in Figure 9. As the graph diameter decreases, the connections between graph partitions become tighter, and GraphSER achieves a larger performance improvement by reducing communication between partitions. Figure 9(b) shows how GraphSER's performance changes with the update rate of the dynamic graph, which we represent by the number of edges/vertices updated between two repartitionings. As the update rate increases, the quality of dynamic graph partitioning deteriorates faster, and GraphSER achieves a larger improvement by optimizing partition quality. As the skewness constant \(\alpha\) increases, load balancing of graph partitioning becomes harder to achieve, and our edge repartitioning is better than vertex repartitioning in this respect; thus GraphSER performs better as \(\alpha\) increases.
Fig. 9. The impact of dataset parameters on system performance: (a) variation of system performance with graph diameter, (b) variation with dynamic graph change rate, and (c) variation with graph skewness \(\alpha\).

4.2.4 Analysis of Memory Overhead.

In this subsection, we evaluate the memory overhead of GraphSER. We use three datasets: Twitter [8], Facebook [50], and Flickr [9]. The graph algorithm we evaluate is PageRank. We choose Metis [25] and HDRF [42] as baselines and evaluate the memory overhead of GraphSER and the baselines on an Intel Xeon Gold 6254 CPU @ 3.10 GHz.
As detailed in Section 3.4, only the edges currently being traversed need to reside in memory, which significantly decreases the memory overhead of GraphSER. The evaluation results are shown in Figure 10. The memory consumption of GraphSER is significantly lower than that of Metis [25] and HDRF [42]: the memory consumption of Metis and HDRF increases linearly from \(t_{0}\) to \(t_{5}\), while the memory overhead of GraphSER grows smoothly, because GraphSER only stores the currently traversed edges in memory.
Fig. 10. Memory overhead of different methods. Each graph changes at timesteps \(t_{1}, \ldots , t_{10}\) and the initial partitioning is at \(t_0\).

4.3 Self-analysis

In Section 3, we proposed replica degree, replica-hop, and degree-based stream edge repartition. In this section, we evaluate the impact of replica degree, vertex degree, and replica-hop on GraphSER partitioning.

4.3.1 Impact of Replica Degree.

The partitioning quality and load balance of edge partitioning are mutually constrained: the more imbalanced the load is allowed to be, the easier it is to achieve higher partitioning quality. However, distributed graph processing time depends on both, so only by considering both can the performance of the entire system be maximized. We proposed replica degree in Section 3.2 to measure the efficiency of eliminating vertex replicas by migrating edges, balancing the replication factor against the imbalance coefficient. In this experiment, we evaluate the impact of replica degree on the edge repartition algorithm.
We use the LiveJournal dataset and divide the graph into 64, 144, 256, 400, and 576 partitions. We set the unoptimized solution without dynamic partitioning as the baseline and compare GraphSER against a variant of GraphSER that does not migrate edges based on replica degree. As mentioned in Section 4.1, we set the imbalance coefficient to \(\varepsilon = 1.05\), which means a partition can move at most 10% of its edges to optimize partition quality.
We evaluate the impact of replica degree using two metrics: the replication factor, which measures the quality of graph partitioning in general distributed systems and represents the number of cross-node communications, and replica-hop, which measures the spatial locality of data in the many-core system and represents the cross-node communication overhead. The results are shown in Figure 11. Compared to the variant of GraphSER that does not migrate edges based on replica degree, GraphSER reduces the replication factor by a maximum of 18% at 144 partitions, a minimum of 15% at 576 partitions, and 16% on average; it reduces replica-hop by a maximum of 18% at 256 partitions, a minimum of 12% at 400 partitions, and 15% on average. The similar average reductions in replication factor and replica-hop indicate that migrating edges based on replica degree yields similar improvements for both kinds of systems, so replica degree is universally useful for improving edge repartition algorithms, and its gain remains stable as the scale of the mesh architecture changes.
Fig. 11. Using LiveJournal to evaluate the impact of replica degree on edge repartitioning.

4.3.2 Impact of Degree.

The edge repartition algorithm optimizes partition quality by migrating edges between partitions. The replica degree proposed in Section 3.2 improves the efficiency of eliminating vertex replicas through edge migration. However, an edge contains two vertices, and migrating an edge based on the replica degree of only one vertex may actually reduce partition quality. Section 3.3 addresses this problem through vertex degree: higher-degree vertices follow lower-degree vertices, and an edge is judged worth migrating based on the replica degree of its lower-degree vertex. In this experiment, we evaluate the impact of vertex degree on the edge repartition algorithm.
In this experiment, we set the unoptimized solution without dynamic partitioning as the baseline and compare GraphSER with a variant of GraphSER in which an edge randomly follows one of its vertices when migrating. The other settings are the same as in Section 4.3.1: the LiveJournal dataset, 64, 144, 256, 400, and 576 partitions, and an imbalance coefficient of \(\varepsilon = 1.05\). This experiment also uses the replication factor and replica-hop to evaluate the impact of degree on the edge repartition algorithm. The results are shown in Figure 12. Compared to the variant with edges randomly following a vertex, GraphSER reduces the replication factor by a maximum of 26% at 64 partitions, a minimum of 5% at 576 partitions, and 15.5% on average; it reduces replica-hop by a maximum of 23% at 144 partitions, a minimum of 6% at 576 partitions, and 14% on average. The similar average reductions indicate that migrating edges according to the lower-degree vertex yields similar improvements for both kinds of systems, so vertex degree is universally useful for improving edge repartition algorithms.
Fig. 12. Using LiveJournal to evaluate the impact of degree on edge repartitioning.

4.3.3 Spatial Locality.

For a many-core system, the communication cost is determined not only by the number of cross-node communications, but also by their average distance (hops); the replication factor, however, measures only the number of cross-node communications. Therefore, in Section 3.4, we proposed the new metric replica-hop and a distance-aware stream edge repartition algorithm for many-core systems. In this experiment, we evaluate how well replica-hop measures communication overhead on a mesh topology, as well as the performance improvement of the distance-aware algorithm on a mesh topology.
In this experiment, we use an in-house simulator to simulate the many-core system, with a mesh NoC topology and the X-Y routing algorithm. We run GraphSER with and without distance awareness and set the unoptimized solution without dynamic partitioning as the baseline. We use the LiveJournal dataset and the PageRank algorithm to evaluate system sizes of 8 × 8, 12 × 12, 16 × 16, 20 × 20, and 24 × 24. We use the replication factor and replica-hop to evaluate partition quality and use the in-house simulator to measure the total execution time of the system, as shown in Figure 13 and Table 5. Distance-aware GraphSER shows an average reduction of 29% in replication factor, 79.6% in replica-hop, and 55% in total execution time compared to the baseline; GraphSER without distance awareness shows an average reduction of 29.6% in replication factor, 31% in replica-hop, and 21% in total execution time. From these data we draw two observations: First, replica-hop effectively measures communication overhead in many-core systems, since distance-aware GraphSER has a replication factor similar to the non-distance-aware version while its replica-hop and total execution time drop significantly; second, distance-aware GraphSER greatly improves data spatial locality in the many-core system, greatly reducing total execution time.
Table 5. Total Execution Time in Many-core Systems (normalized to the no-repartition baseline)

System Size | No Repartition | No Distance-aware | GraphSER
8 × 8       | 1×             | 0.79×             | 0.57×
12 × 12     | 1×             | 0.85×             | 0.49×
16 × 16     | 1×             | 0.80×             | 0.46×
20 × 20     | 1×             | 0.73×             | 0.38×
24 × 24     | 1×             | 0.76×             | 0.39×
Fig. 13. The impact of distance awareness on edge repartitioning.

4.4 Scalability

In this experiment, we use an in-house simulator to evaluate the scalability of GraphSER and compare it with DREDGE [39] and SAGM-ER [34]. We use four datasets, LiveJournal, Friendster, Amazon, and Roadnet, the same datasets used by DREDGE [39]. For ease of comparison with DREDGE, our experimental results report not total execution time but the number of graph algorithm iterations completed within 100M clock cycles. The graph algorithms we evaluate are PageRank and SSSP, and we use an unoptimized solution without dynamic partitioning as the baseline.
We first evaluate the scalability of the repartition time of GraphSER and SAGM-ER on an Intel(R) Xeon(R) Gold 6254 CPU @ 3.10 GHz, as shown in Table 6. We do not compare with DREDGE here, because its solution relies on a hardware implementation and has no corresponding software algorithm. At a system size of 4 × 4, GraphSER shows its minimum improvement over SAGM-ER, a 67% reduction in repartition time; at 24 × 24, it shows its maximum improvement, a 94.8% reduction. This is because, as the system size increases, the amount of data in each partition decreases, so the time for GraphSER to sequentially traverse the edges of its local partition also decreases. The results illustrate two points: First, the cost of GraphSER's edge repartition is significantly lower than that of the previous edge repartition algorithm SAGM-ER; second, GraphSER scales well with system size, as the repartition cost decreases while the system grows.
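This scaling argument can be stated compactly. Assuming each node only streams through its own edge list, the per-node repartition cost is roughly \(T_{\text{repartition}} \approx c \cdot |E| / P\), where \(|E|\) is the total number of edges, \(P\) is the number of partitions (computing nodes), and \(c\) is the per-edge streaming cost. Migration traffic and fixed per-node overheads are ignored in this estimate, which is why the measured times in Table 6 shrink more slowly than \(1/P\).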
Table 6. Repartition Time of GraphSER and SAGM-ER

System Size | SAGM-ER (ms) | GraphSER (ms)
4 × 4       | 987.63       | 322.59
8 × 8       | 1,001.27     | 163.75
12 × 12     | 1,222.43     | 100.98
16 × 16     | 1,173.55     | 69.63
20 × 20     | 1,031.31     | 57.09
24 × 24     | 1,020.98     | 53.07
Next, we evaluate the scalability of the overall performance of GraphSER, DREDGE, and SAGM-ER. The evaluation results are shown in Table 7. The system sizes we evaluate are 4 × 4, 8 × 8, 12 × 12, 16 × 16, 20 × 20, and 24 × 24. On Friendster at a system size of 20 × 20, GraphSER's performance peaks at 2× that of SAGM-ER; on Roadnet at 4 × 4, it bottoms out at 0.9× that of SAGM-ER; on average, GraphSER's performance is 1.52× that of SAGM-ER. On LiveJournal at 8 × 8, GraphSER's performance peaks at 2.17× that of DREDGE; on Roadnet at 24 × 24, it bottoms out at 1.04× that of DREDGE; on average, GraphSER's performance is 1.40× that of DREDGE. From the experimental data in Table 7, we can draw several observations:
Table 7. Overall Performance in Many-core Systems (speedup over the no-repartition baseline; each cell lists SAGM-ER [34] / DREDGE [39] / Ours)

Algorithm | System Size | LiveJournal            | Friendster             | Amazon                 | Roadnet
PageRank  | 4 × 4       | 1.33× / N.A. / 1.77×   | 1.41× / N.A. / 1.92×   | 1.15× / N.A. / 1.45×   | 1.16× / N.A. / 1.04×
PageRank  | 8 × 8       | 1.41× / 0.81× / 1.76×  | 1.33× / 1.12× / 2.01×  | 1.17× / 1.28× / 1.78×  | 1.02× / 0.96× / 1.21×
PageRank  | 12 × 12     | 1.24× / N.A. / 2.05×   | 1.49× / N.A. / 2.16×   | 1.26× / N.A. / 1.81×   | 1.13× / N.A. / 1.22×
PageRank  | 16 × 16     | 1.35× / 1.44× / 2.16×  | 1.36× / 1.55× / 2.63×  | 1.25× / 1.68× / 1.84×  | 1.12× / 1.21× / 1.41×
PageRank  | 20 × 20     | 1.36× / N.A. / 2.59×   | 1.39× / N.A. / 2.78×   | 1.24× / N.A. / 2.26×   | 1.07× / N.A. / 1.46×
PageRank  | 24 × 24     | 1.29× / 2.05× / 2.54×  | 1.47× / 2.20× / 2.80×  | 1.22× / 1.90× / 2.29×  | 1.09× / 1.38× / 1.43×
SSSP      | 4 × 4       | 1.24× / N.A. / 1.68×   | 1.33× / N.A. / 1.77×   | 1.12× / N.A. / 1.34×   | 1.04× / N.A. / 1.09×
SSSP      | 8 × 8       | 1.39× / N.A. / 1.64×   | 1.42× / N.A. / 2.07×   | 1.18× / N.A. / 1.76×   | 0.99× / N.A. / 1.19×
SSSP      | 12 × 12     | 1.22× / N.A. / 1.93×   | 1.47× / N.A. / 2.14×   | 1.20× / N.A. / 1.69×   | 1.02× / N.A. / 1.11×
SSSP      | 16 × 16     | 1.36× / N.A. / 2.00×   | 1.24× / N.A. / 2.68×   | 1.14× / N.A. / 1.93×   | 1.10× / N.A. / 1.33×
SSSP      | 20 × 20     | 1.30× / N.A. / 2.80×   | 1.43× / N.A. / 2.64×   | 1.22× / N.A. / 2.49×   | 1.11× / N.A. / 1.57×
SSSP      | 24 × 24     | 1.35× / N.A. / 2.77×   | 1.48× / N.A. / 3.00×   | 1.28× / N.A. / 2.36×   | 1.08× / N.A. / 1.55×
(1) In most cases, when the system size is below 20 × 20, GraphSER's performance on the many-core system increases with system size; once the system size reaches 20 × 20, the overall performance of GraphSER saturates. The improvement with size arises because, as the system grows, the average communication distance becomes longer and the room for spatial-locality optimization becomes larger, similar to DREDGE. However, the improvement does not grow indefinitely: As the system size increases, graph partitions become more dispersed, making it harder for GraphSER to place adjacent vertices on adjacent computing nodes.
(2) When the network size is relatively small, GraphSER improves performance more significantly than DREDGE, because DREDGE relies on vertex pressure to optimize partitioning: As the mesh size shrinks, the average vertex pressure shrinks with it, and so does the optimization space. The optimizations in GraphSER, which rely on replica degree and degree, are independent of network size, yielding a more stable performance improvement.
(3) Compared to the SAGM-ER edge repartition method, GraphSER achieves an average improvement of 52%, mainly because GraphSER scales better than SAGM-ER in both repartition time and overall performance.
In the experiments of Table 7, 30% of each dataset is used as the dynamically changing part. Although datasets such as LiveJournal, Friendster, and Amazon vary significantly in size, GraphSER's performance on them shows no notable difference. As Figure 14 also shows, GraphSER's performance is not tied to the size of the input graph.
In Table 7, the performance of SAGM-ER fluctuates across experiments. This is mainly because SAGM-ER starts its search from data in the dynamically changing part of the graph, and those data are randomly selected from the 30% dynamic portion of each dataset, so its performance varies from run to run.
Fig. 14. Comparison with the state-of-the-art works.
Finally, we evaluate the scalability of GraphSER across different graph algorithms. We run the Random Walk (5%) algorithm on the LiveJournal dataset to evaluate the performance of GraphSER; Random Walk (5%) accesses 5% of the edges, whereas PageRank accesses 100%. The experimental results are shown in Table 8. Combined with Table 7, it can be seen that GraphSER and SAGM-ER deliver relatively stable performance across algorithms, as both optimize directly on the structure of the dynamic graph. The repartition results of DREDGE, in contrast, depend on the program's memory access patterns, since DREDGE repartitions by monitoring past vertex update messages. Its performance therefore changes with the algorithm: For Random Walk (5%), whose memory access pattern is highly random across iterations, DREDGE achieves only 48% of the baseline performance.
Table 8. Overall Performance in Random Walk (5%) (speedup over the no-repartition baseline)

System Size | No Repartition | SAGM-ER | GraphSER
4 × 4       | 1×             | 1.18×   | 1.81×
8 × 8       | 1×             | 1.34×   | 1.77×
12 × 12     | 1×             | 1.28×   | 2.06×
16 × 16     | 1×             | 1.24×   | 2.26×
20 × 20     | 1×             | 1.16×   | 2.29×
24 × 24     | 1×             | 1.31×   | 2.05×

4.5 Comparison with the State-of-the-art

In this section, we compare GraphSER with several state-of-the-art works: PowerGraph [20], GraphSteal [29], GrapH [38], Lux [22], DREDGE [39], and SAGM-ER [34].
PowerGraph was the first static graph computing framework to propose edge partitioning and is widely known for it; we choose PowerGraph as the baseline.
GraphSteal is an edge repartition algorithm that moves edges from slower partitions to faster ones, thereby improving computation and memory efficiency.
GrapH proposed the repartition algorithm H-adapt to improve partitioning quality. In a heterogeneous environment, GrapH starts from high-traffic vertices and checks whether migrating their adjacent edges between two selected parts can reduce communication costs.
Lux proposed a novel dynamic partitioning strategy that achieves good load balancing across GPUs; to the best of our knowledge, Lux is the state-of-the-art edge repartitioning work on GPUs.
DREDGE proposed a novel hardware solution to improve partitioning quality; to the best of our knowledge, DREDGE is the state-of-the-art dynamic repartitioning hardware work.
SAGM-ER proposed edge groups and improves partitioning quality by migrating selected edge groups between two chosen parts; to the best of our knowledge, SAGM-ER is the state-of-the-art edge repartitioning work.
In this experiment, we use four dynamic datasets: Twitter [8], Facebook [50], Flickr [9], and Orkut [40]. The graph algorithms we evaluate are PageRank, SSSP, connected components (CC), and BFS. According to Section 4.4, our system performs best at a system size of 20 × 20, so that is the size we simulate. The experimental results are shown in Figure 14. GraphSteal balances the workload among nodes but does not optimize communication between them, and cross-node communication occupies most of the time in graph computation. GrapH cuts low-traffic vertices to decrease communication costs, but vertex traffic varies across iterations, and the vertex-traffic predictions GrapH relies on are not always accurate. Lux monitors runtime performance and balances workload among nodes, similar to GraphSteal, and, like GraphSteal, does not optimize communication between nodes. DREDGE repartitions the graph by monitoring past vertex update messages; however, many graph algorithms, such as SSSP, CC, and BFS, have random memory access patterns across iterations. As discussed in Section 4.4, GraphSER scales better than SAGM-ER in both repartition time and overall performance.

5 Conclusion

The main bottleneck of graph processing lies in irregular memory accesses. Some many-core systems utilize the high local bandwidth of HMC systems to alleviate this bottleneck. However, as highlighted by GraphP, the cross-node bandwidth emerges as a critical bottleneck. DREDGE was the first to introduce a method for enhancing the spatial locality of graph partitions through vertex repartition, thereby reducing cross-node memory accesses and mitigating this bottleneck. Nevertheless, the performance of DREDGE exhibits variability across different graph algorithms and system sizes.
In this work, we propose GraphSER, a stream-based edge repartition method designed to overcome the limitations of prior research. We optimize GraphSER to be distance-aware in many-core systems and achieve performance better than state-of-the-art approaches. Our evaluation demonstrates that, compared to SAGM-ER, GraphSER exhibits an average speedup of 1.52×, with a maximum of 2×. Additionally, in comparison to DREDGE, GraphSER achieves an average performance improvement of 40%, reaching a maximum of 117%.

References

[1]
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 105–117.
[2]
Konstantin Andreev and Harald Räcke. 2004. Balanced graph partitioning. In Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. 120–124.
[3]
Tewodros Alemu Ayall, Huawen Liu, Changjun Zhou, Abegaz Mohammed Seid, Fantahun Bogale Gereme, Hayla Nahom Abishu, and Yasin Habtamu Yacob. 2022. Graph computing systems and partitioning techniques: A survey. IEEE Access 10 (2022), 118523–118550.
[4]
Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. 2006. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 44–54.
[5]
Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, and Torsten Hoefler. 2021. Practice of streaming processing of dynamic graphs: Concepts, models, and systems. IEEE Trans. Parallel Distrib. Syst. 34, 6 (2021), 1860–1876.
[6]
Florian Bourse, Marc Lelarge, and Milan Vojnovic. 2014. Balanced graph edge partition. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1456–1465.
[7]
Huanqi Cao, Yuanwei Wang, Haojie Wang, Heng Lin, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling graph traversal to 281 trillion edges with 40 million cores. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 234–245.
[8]
Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. 2010. Measuring user influence in Twitter: The million follower fallacy. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM).
[9]
Meeyoung Cha, Alan Mislove, and Krishna P. Gummadi. 2009. A measurement-driven analysis of information propagation in the Flickr social network. In Proceedings of the 18th International Conference on World Wide Web. 721–730.
[10]
Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu, Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. 2012. Kineograph: Taking the pulse of a fast-changing and connected world. In Proceedings of the 7th ACM European Conference on Computer Systems. 85–98.
[11]
Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One trillion edges: Graph processing at Facebook-scale. Proc. VLDB Endow. 8, 12 (2015), 1804–1815.
[12]
Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong Yang. 2018. GraphH: A processing-in-memory architecture for large-scale graph processing. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 38, 4 (2018), 640–653.
[13]
William J. Dally and Charles L. Seitz. 1986. The torus routing chip. Distrib. Comput. 1 (1986), 187–196.
[14]
David Ediger, Rob McColl, Jason Riedy, and David A. Bader. 2012. Stinger: High performance data structure for streaming graphs. In Proceedings of the IEEE Conference on High Performance Extreme Computing. IEEE, 1–5.
[15]
Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. 1999. On power-law relationships of the internet topology. ACM SIGCOMM Comput. Commun. Rev. 29, 4 (1999), 251–262.
[16]
Wenfei Fan, Chunming Hu, and Chao Tian. 2017. Incremental graph computations: Doable and undoable. In Proceedings of the ACM International Conference on Management of Data. 155–169.
[17]
Guoyao Feng, Xiao Meng, and Khaled Ammar. 2015. DISTINGER: A distributed graph data structure for massive dynamic graph processing. In Proceedings of the IEEE International Conference on Big Data (Big Data’15). IEEE, 1814–1822.
[18]
Bai Fujun, Jiang Xiping, Wang Song, Yu Bing, Tan Jie, Zuo Fengguo, Wang Chunjuan, Wang Fan, Long Xiaodong, Yu Guoqing, Fu Ni, Li Qiannan, Li Hua, Wang Kexin, Duan Huifu, Bai Liang, Jia Xuerong, Li Jin, Li Mei, Wang Zhengwen, Hu Sheng, Zhou Jun, Zhan Qiong, Sun Peng, Yang Daohong, Cheichan Kau, David Yang, Ching-Sung Ho, Sun Hongbin, Lv Hangbing, Liu Ming, Kang Yi, and Ren Qiwei. 2020. A stacked embedded DRAM array for LPDDR4/4X using hybrid bonding 3D integration with 34GB/s/1Gb 0.88 pJ/b logic-to-memory interface. In Proceedings of the IEEE International Electron Devices Meeting (IEDM’20). IEEE, 6–6.
[19]
Shufeng Gong, Chao Tian, Qiang Yin, Wenyuan Yu, Yanfeng Zhang, Liang Geng, Song Yu, Ge Yu, and Jingren Zhou. 2021. Automating incremental graph processing with flexible memoization. Proc. VLDB Endow. 14, 9 (2021), 1613–1625.
[20]
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12). 17–30.
[21]
Oded Green and David A. Bader. 2016. cuSTINGER: Supporting dynamic graph algorithms for GPUs. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC’16). IEEE, 1–6.
[22]
Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez, and Alex Aiken. 2017. A distributed multi-GPU system for fast graph processing. Proc. VLDB Endow. 11, 3 (2017), 297–310.
[23]
Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, David E. Shaw, John Kim, and William J. Dally. 2013. A detailed and flexible cycle-accurate network-on-chip simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’13). IEEE, 86–96.
[24]
Xiping Jiang, Fengguo Zuo, Song Wang, Xiaofeng Zhou, Yubing Wang, Qi Liu, Qiwei Ren, and Ming Liu. 2022. A 1596-GB/s 48-Gb stacked embedded DRAM 384-core SoC with hybrid bonding integration. IEEE Solid-state Circ. Lett. 5 (2022), 110–113.
[25]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Scient. Comput. 20, 1 (1998), 359–392.
[26]
George Karypis and Vipin Kumar. 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1 (1998), 96–129.
[27]
George Karypis and Vipin Kumar. 1998. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. Parallel Distrib. Comput. 48, 1 (1998), 71–95.
[28]
Yoongu Kim, Weikun Yang, and Onur Mutlu. 2015. Ramulator: A fast and extensible DRAM simulator. IEEE Comput. Archit. Lett. 15, 1 (2015), 45–49.
[29]
Dinesh Kumar, Arun Raj, and Janakiram Dharanipragada. 2017. GraphSteal: Dynamic re-partitioning for efficient graph processing in heterogeneous clusters. In Proceedings of the IEEE 10th International Conference on Cloud Computing (CLOUD’17). IEEE, 439–446.
[30]
Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. 2007. The dynamics of viral marketing. ACM Trans. Web 1, 1 (2007), 5–es.
[31]
Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1, 1 (2007), 2–es.
[32]
Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6, 1 (2009), 29–123.
[33]
He Li, Hang Yuan, Jianbin Huang, Jiangtao Cui, Xiaoke Ma, Senzhang Wang, Jaesoo Yoo, and S Yu Philip. 2021. Group reassignment for dynamic edge partitioning. IEEE Trans. Parallel Distrib. Syst. 32, 10 (2021), 2477–2490.
[34]
He Li, Hang Yuan, Jianbin Huang, Xiaoke Ma, Jiangtao Cui, and Jaesoo Yoo. 2022. Edge repartitioning via structure-aware group migration. IEEE Trans. Computat. Soc. Syst. 9, 3 (2022), 751–760.
[35]
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078 (2012).
[36]
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 135–146.
[37]
Mugilan Mariappan and Keval Vora. 2019. GraphBolt: Dependency-driven synchronous processing of streaming graphs. In Proceedings of the 14th EuroSys Conference. 1–16.
[38]
Christian Mayer, Muhammad Adnan Tariq, Ruben Mayer, and Kurt Rothermel. 2018. GrapH: Traffic-aware graph processing. IEEE Trans. Parallel Distrib. Syst. 29, 6 (2018), 1289–1302.
[39]
Andrew McCrabb, Eric Winsor, and Valeria Bertacco. 2019. DREDGE: Dynamic repartitioning during dynamic graph execution. In Proceedings of the 56th Annual Design Automation Conference. 1–6.
[40]
Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. 29–42.
[41]
Richard C. Murphy, Kyle B. Wheeler, Brian W. Barrett, and James A. Ang. 2010. Introducing the Graph 500. Cray Users Group 19 (2010), 45–74.
[42]
Fabio Petroni, Leonardo Querzoni, Khuzaima Daudjee, Shahin Kamali, and Giorgio Iacoboni. 2015. HDRF: Stream-based partitioning for power-law graphs. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 243–252.
[43]
Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, and Seif Haridi. 2014. Distributed vertex-cut partitioning. In Distributed Applications and Interoperable Systems: 14th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems (DAIS’14), Held as Part of the 9th International Federated Conference on Distributed Computing Techniques (DisCoTec ’14). Springer, 186–200.
[44]
Shafiur Rahman, Mahbod Afarin, Nael Abu-Ghazaleh, and Rajiv Gupta. 2021. JetStream: Graph analytics on streaming data with event-driven hardware accelerator. In Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21). 1091–1105.
[45]
Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. 472–488.
[46]
Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid Aref, Marcelo Arenas, Maciej Besta, Peter A. Boncz, Khuzaima Daudjee, Emanuele Della Valle, Stefania Dumbrava, Olaf Hartig, Bernhard Haslhofer, Tim Hegeman, Jan Hidders, Katja Hose, Adriana Iamnitchi, Vasiliki Kalavri, Hugo Kapp, Wim Martens, M. Tamer Özsu, Eric Peukert, Stefan Plantikow, Mohamed Ragab, Matei R. Ripeanu, Semih Salihoglu, Christian Schulz, Petra Selmer, Juan F. Sequeda, Joshua Shinavier, Gábor Szárnyas, Riccardo Tommasini, Antonino Tumeo, Alexandru Uta, Ana Lucia Varbanescu, Hsiang-Yun Wu, Nikolay Yakovets, Da Yan, and Eiko Yoneki. 2021. The future is big graphs: A community view on graph processing systems. Commun. ACM 64, 9 (2021), 62–71.
[47]
Daniel Sanchez and Christos Kozyrakis. 2013. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Comput. Archit. News 41, 3 (2013), 475–486.
[48]
Dipanjan Sengupta, Narayanan Sundaram, Xia Zhu, Theodore L. Willke, Jeffrey Young, Matthew Wolf, and Karsten Schwan. 2016. GraphIn: An online high performance incremental graph processing framework. In Proceedings of the 22nd International Conference on Parallel and Distributed Computing: Parallel Processing (Euro-Par’16). Springer, 319–333.
[49]
Gagandeep Singh, Juan Gómez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, Sander Stuijk, Onur Mutlu, and Henk Corporaal. 2019. NAPEL: Near-memory computing application performance prediction via ensemble learning. In Proceedings of the 56th Annual Design Automation Conference 2019. 1–6.
[50]
Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks. 37–42.
[51]
Keval Vora, Rajiv Gupta, and Guoqing Xu. 2017. KickStarter: Fast and accurate computations on streaming graphs via trimmed approximations. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems. 237–251.
[52]
Jaewon Yang and Jure Leskovec. 2012. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics. 1–8.
[53]
Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, and Xuehai Qian. 2018. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’18). IEEE, 544–557.
