

GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems

Published: 14 September 2024

Abstract

With the explosive growth of graph data, distributed graph processing has become popular, and many graph hardware accelerators use distributed frameworks. Graph partitioning is the foundation of distributed graph processing. However, dynamic changes to a graph shift an existing partitioning away from its optimized point and degrade system performance. Therefore, more efficient dynamic graph partitioning methods are needed.
In this work, we propose GraphSER, a dynamic graph partition method for many-core systems. To improve cross-node spatial locality and reduce the overhead of repartition, we propose a stream-based edge repartition in which each computing node sequentially traverses its local edge list in parallel and migrates edges based on distance and replica degree. GraphSER needs no costly searching, and by prioritizing which replicas to eliminate it avoids poor cross-node spatial locality.
Our evaluation shows that, compared to state-of-the-art edge repartition software methods, GraphSER achieves an average speedup of 1.52×, with a maximum of 2×. Compared to the previous many-core hardware repartition method, GraphSER improves performance by 40% on average, with a maximum of 117%.

1 Introduction

Graphs are important and powerful data structures in big-data applications. With the rapid development of the Internet, various types of graph data are generated every moment. Real-world graphs are usually large and growing rapidly: They now often contain hundreds of billions of nodes and trillions of edges [7, 11, 46]. For example, Twitter’s monthly active user count increased from 482 million to 2.5 billion within 10 years; the knowledge graph data stored by Google includes more than 57 billion data objects, each with complex semantic links; and the number of web pages in the World Wide Web has grown from 40 billion to 55 billion within five years [3].
With the rapid growth of graph data, the computational and storage capabilities of a single computer fall short of the escalating demand. Consequently, numerous distributed graph processing systems have been proposed, including Pregel [36], GraphLab [35], and PowerGraph [20], among others. Additionally, many graph hardware accelerators leverage distributed frameworks, such as Tesseract [1], GraphP [53], and GraphH [12]. Graph partitioning, which divides a graph into multiple parts, is the foundation of distributed graph processing. The execution time of a distributed graph processing system predominantly comprises two components: the computing time of the system and the communication time across nodes. The quality of partitioning significantly impacts both. First, the computing time is contingent on the slowest node; thus, a more balanced partitioning results in a shorter computing time. Second, communication time across nodes is influenced by the number of vertices spanning partitions, as vertices must transmit update values to remote neighbors. Therefore, minimizing the number of vertices across partitions reduces communication time.
However, balanced graph partitioning is an NP-hard problem [2, 6]. Graph partitioning is primarily categorized into two types: vertex partitioning and edge partitioning. Recent works [20, 43] have demonstrated theoretically and experimentally that edge partitioning yields better graph processing performance on natural power-law graphs. Despite considerable efforts to address the edge partitioning problem, the majority of existing methods concentrate on static graphs, encompassing offline (in-memory) and online (stream) approaches [3]. Offline methods load the entire graph into memory and allocate edges based on global graph information. Online methods sequentially load edge streams and directly allocate them to partitions. However, real-world graphs are dynamic. Reference [31] indicates that, as a real-world graph changes, its density gradually increases and its diameter gradually decreases. That is, starting from the initial partition, dynamic changes diminish spatial locality and thereby increase communication time across nodes. As shown in Figure 1, dynamic changes diminish the spatial locality of a graph, necessitating repartitioning. However, employing static partition algorithms to partition dynamic graphs from scratch is costly, so leveraging existing partitioning information becomes important to minimize dynamic graph partitioning costs. Figure 2 compares two methods on the dynamic graph dataset Twitter [8]: (1) partitioning the graph dynamically with our method and (2) repartitioning the entire graph with the static partition algorithm HDRF [42]. In Figure 2(a), as the graph’s scale gradually doubles, our method effectively maintains partition quality close to HDRF. In Figure 2(b), when the graph scale doubles, our method completes the repartitioning within 110 seconds, whereas HDRF requires 860 seconds.
Fig. 1. Example of dynamic graph repartitioning. (a) illustrates the initial partitioning. In (b), the dynamic graph is modified by deleting edges (b, c) and (e, f) and adding edges (b, f) and (c, e), and the partitioning quality decreases. In (c), after repartitioning, the partitioning quality increases.
Fig. 2. The comparison of static and dynamic methods on dynamic graph. (a) The average replicas per vertex at different graph sizes. (b) The partitioning time of dynamic and static methods.
Several prior studies have concentrated on edge repartition, exemplified by GR-DEP [33], SAGM-ER [34], and GraphSteel [29]. GraphSteel [29] only addresses load imbalance, while GR-DEP [33] and SAGM-ER [34] require costly searching to determine which edges are worth migrating, which is suboptimal, as dynamic partitioning demands minimal partition-time overhead. Notably, these studies uniformly assume equal communication delays between nodes, neglecting that in distributed systems communication delay varies with node distance. The impact of distance on communication delay is a crucial consideration in dynamic partitioning. DREDGE [39] is a vertex repartition method optimized for many-core systems; however, edge partitioning has better graph processing performance on natural power-law graphs. Because it monitors past vertex update messages, DREDGE may perform suboptimally for graph algorithms with random memory access patterns between iterations (e.g., SSSP and Random Walk). Furthermore, the vertex pressure in DREDGE is linked to the system size, yielding better dynamic partitioning quality on larger systems and worse on smaller ones. Hence, DREDGE’s performance varies across graph algorithms and system sizes.
To mitigate these shortcomings, we introduce a stream-based edge repartition method that considers both the overhead and the quality of repartition and exhibits stable performance across graph algorithms and system sizes. Termed GraphSER, our method is specifically tailored for many-core systems and surpasses the performance of state-of-the-art methods. Our main contributions are summarized as follows:
We propose a novel stream-based edge repartition method based on the structure of the dynamic graph itself, rather than on monitoring past memory access patterns, since many graph algorithms have different memory access patterns in different iterations.
In GraphSER, each computing node sequentially traverses its local edge list in parallel, eliminating the need to search for edges worthy of migration. Additionally, GraphSER is designed to be distance-aware in many-core systems, taking into account the communication delay between different nodes.
For the first time, we introduce the concept of replica degree and investigate its influence, along with vertex degree, on stream edge repartition. We then present optimization methods based on replica degree and vertex degree. Notably, these optimizations are unaffected by specific hardware architectures, ensuring consistent and stable dynamic partitioning improvements across system sizes.
We propose the concept of replica-hop and discuss how replica-hop measures the spatial locality of graph partitions on many-core systems. Furthermore, we illustrate the influence of replica-hop on the overall performance of many-core systems.
We evaluate GraphSER using four static graph datasets and four dynamic graph datasets. The evaluation results demonstrate the effectiveness of GraphSER in several aspects.
The rest of this article is organized as follows: Section 2 provides an introduction to the related work. Section 3 defines the dynamic edge partitioning problem, explores concepts such as replica degree and replica-hop, and introduces our proposed method, GraphSER. The experimental evaluation is presented in Section 4, and Section 5 concludes with a summary of the article.

2 Related Work

In this section, we commence by presenting an overview of related work in traditional distributed graph processing, streaming/dynamic graph processing, static graph partitioning, and dynamic graph partitioning. Subsequently, we focus on how our work differs from them.

2.1 Traditional Distributed Graph Processing

Traditional distributed graph processing frameworks are primarily categorized into two types: vertex-centric and edge-centric. Vertex-centric frameworks partition and compute graphs based on vertices, while in edge-centric frameworks, edges are the fundamental unit for computation and partitioning. In edge-centric frameworks, each edge of the graph is assigned to only one partition, but each vertex may exist in multiple partitions. Vertices that exist in multiple partitions are called replicas.
Google introduced the first vertex-centric distributed graph processing framework, known as Pregel [36]. This framework iteratively executes user-defined programs on the vertices of the graph, adopting a local, vertex-oriented computational perspective that makes programming intuitive and simple. However, for power-law graphs, the presence of high-degree vertices causes vertex-centric frameworks to exhibit greater partitioning imbalance and poorer memory access locality than edge-centric frameworks. X-Stream [45] first proposed an edge-centric system based on the insight that storage media, including disks, SSDs, and DRAM, perform significantly better under sequential data access. Because edges are the fundamental unit for computation and partitioning, edge-centric frameworks achieve balanced partitioning more easily than vertex-centric ones.
Recently, an increasing number of hardware graph accelerators utilize distributed graph processing frameworks. For instance, Tesseract [1] employs a vertex-centric framework, and GraphH [12] utilizes an edge-centric framework. Both employ the Hybrid Memory Cube (HMC) to implement many-core graph processing systems. The computing capacity, memory bandwidth, and storage capacity of an HMC system expand linearly with system scale, and its high memory bandwidth effectively eliminates local memory access bottlenecks. However, both vertex-centric and edge-centric frameworks necessitate cross-node communication. GraphP [53] highlights a critical bottleneck in this type of distributed system: the network latency resulting from communication between different nodes. This issue arises because, in many-core systems, the internal memory bandwidth of nodes can scale linearly as the system scales up, but the network bandwidth between nodes cannot expand because of the limited number of physical pins. As the system scales, the average distance between two nodes also increases, causing data transportation latency between nodes to become a performance bottleneck. DREDGE [39] introduced a method to enhance the spatial locality of graph partitions through vertex repartition, reducing cross-node memory accesses and mitigating this bottleneck. However, DREDGE has two drawbacks: First, it optimizes partitioning by monitoring previous memory access patterns, which proves challenging for algorithms with random memory access patterns; second, its vertex pressure is linked to the system size, resulting in performance fluctuations on smaller systems. Our work employs edge repartition to enhance the spatial locality of graph partitions based on the inherent structure of dynamic graphs.

2.2 Streaming/Dynamic Graph Processing

In this study, consistent with Reference [5], we adopt the terminology employed therein. Specifically, we employ “dynamic” to characterize modifications made to the graph dataset, reserving the term “streaming” to denote the method of accessing or updating the graph.
Dynamic graph research can be categorized into several classes [5], and different classes focus on different graph problems. First, dynamic graphs involve frequent updates to the dataset, encompassing the insertion and deletion of edges and vertices. Traditional CSR-format updates are notably inefficient in this context. Stinger [14] and its variants DISTINGER [17] and cuSTINGER [21] are well-known works addressing this issue. Stinger adapts and extends the CSR format to support quick insertions and deletions; DISTINGER [17] extends Stinger to distributed systems, and cuSTINGER [21] extends Stinger to CUDA GPUs.
As discussed in Section 1, the quality of graph partitions tends to degrade following updates to graph datasets. Our research, centered around dynamic graph partitioning, specifically addresses this issue, and this class of work will be detailed in Section 2.4.
Research beyond dynamic graph data structures and dynamic graph partitioning includes incremental computation. Typically, graph updates are small relative to the overall graph size and impact only a limited subset of vertices [16, 19]. Thus, incremental computation based on previous results can significantly improve the efficiency of dynamic graph computations. Kineograph [10] and GraphIn [48] are well-known works that employ incremental computation; however, they may generate incorrect results when edges are deleted. To address this, KickStarter [51] and GraphBolt [37] are specifically designed to trace dependencies, ensuring correct results. For better performance, JetStream [44] proposes the first graph accelerator that supports incremental computation.

2.3 Static Graph Partitioning

Static graph partitioning can be categorized into vertex partitioning and edge partitioning, as illustrated in Figure 3. Vertex partitioning, also known as edge-cut, produces edges that span nodes. Conversely, edge partitioning, or vertex-cut, produces vertices that span nodes. The term “replica” is used when a vertex is cut, resulting in two or more identical vertices. Subject to partition balance, the primary goal of vertex partitioning is to reduce the number of edges across nodes, while the primary goal of edge partitioning is to reduce the number of vertices across nodes.
Fig. 3. Example of vertex partitioning and edge partitioning. (a) is an example of an input graph. In the upper input graph, four edges (b, c), (a, c), (c, d), and (c, e) are cut, resulting in partitioning (b). In the lower input graph, vertex c is cut into three replicas, resulting in partitioning (c).
Static graph partitioning can also be categorized into offline (in-memory) [3] and online (stream) [42] methods. The subsequent discussion focuses primarily on edge partitioning because of its superior graph processing performance for power-law graphs. Offline methods load the entire graph into main memory and allocate edges based on global graph information. While this approach achieves higher partitioning quality, it incurs significant time overhead and requires the entire graph in main memory, rendering it impractical for large-scale graph partitioning. Online methods sequentially load edge streams and allocate edges based on partition rules. These methods have lower time overhead, do not require the entire graph in main memory, and can execute in streaming form, loading one edge at a time, making them suitable for large-scale partitioning; however, their partitioning quality is not as good as that of offline methods. Compared to our proposed stream edge repartition method, online methods for static graphs have two shortcomings. First, online methods can only place newly added edges into existing partitions; if the graph data contains newly deleted edges, online methods cannot handle them. Second, even if all dynamic changes are newly added edges that online methods can handle, the degrees and replica degrees of vertices change as the graph and its partitions evolve, leaving room for further optimization of partition quality; since online methods cannot repartition, they have no opportunity to improve partitioning quality.

2.4 Dynamic Graph Partitioning

Real-world graphs change dynamically, and partitioning quality usually decreases as a graph changes [31]. However, static graph partitioning from scratch is too costly, so dynamic graph partitioning has been proposed to reuse the initial partition information and reduce the cost of partitioning. Dynamic graph partitioning can also be divided into two main categories: vertex repartition and edge repartition. This work focuses on edge repartition, as edge partitioning performs better on power-law graphs [33].
GraphSteel [29] is an edge repartition algorithm, but it is used only to balance the running-time imbalance of partitions during graph computation: it moves edges from slower partitions to faster ones, improving computation and memory efficiency. GrapH [38] proposed the repartition algorithm H-adapt to improve partitioning quality: in a heterogeneous environment, it starts from high-traffic vertices and checks whether migrating their adjacent edges between two selected partitions can reduce communication costs. However, GrapH targets the load imbalance caused by graph computation; it cannot handle repartitioning after the structure of a dynamic graph changes. GR-DEP [33] and SAGM-ER [34] proposed methods based on edge groups, treating multiple connected edges as a group, and improve partitioning quality by migrating certain edge groups between two selected partitions; both propose edge-group search algorithms to determine which groups are worth migrating.
The works above require costly searching to determine which edges are worth repartitioning, but dynamic graph partitioning needs low partitioning overhead, as partitioning is carried out on-the-fly. In the following sections, we propose a stream-based edge repartition method that takes into account both the overhead and the quality of repartition while also considering the distance of data transfer.

3 Methodology

In this section, we first define the dynamic edge partitioning problem, then propose the concept of replica degree and discuss the impact of replica degree and vertex degree on edge repartition. Finally, we propose a distance-aware stream edge repartition method for many-core systems and give an overview of our proposed system.

3.1 Problem Definition

Let \(G = (V, E)\) be an undirected graph, where V is the vertex set, E is the edge set, \(|V|\) is the number of vertices, and \(|E|\) is the number of edges. For a subset \(E_s\subseteq E\), \(V(E_s)\) represents the vertices contained in the edge set \(E_s\). In vertex partitioning, V is divided into k partitions, while in edge partitioning, E is divided into k partitions. The edge partitions are defined as \(P = \lbrace P_1, P_2, \ldots, P_k\rbrace\), where \(E = \bigcup _{i=1}^{k} P_i\) and \(P_i \cap P_j = \emptyset\) for \(i \ne j\). In edge partitioning, each edge is allocated to exactly one partition, but each vertex can be copied to multiple partitions. Let \(R(V_i)\) be the set of partitions to which \(V_i\) is allocated, i.e., the partitions that contain \(V_i\); \(|R(V_i)|\) is the number of partitions containing \(V_i\) and is at least 1. If \(|R(V_i)| \gt 1\), then at least two partitions contain \(V_i\), and \(V_i\) can be regarded as a boundary vertex. All replicas of a vertex are identical, with no primary or secondary distinction.
In edge-centric distributed frameworks, the update value of a replicated vertex must be synchronized across nodes. Therefore, the fewer the vertex replicas, the lower the synchronization cost and the better the graph processing performance. Edge partitioning measures the number of vertex replicas using the replication factor RF [42], where \(RF(k)\) denotes the replication factor for a k-part partitioning, defined as follows:
\begin{equation} RF(k) = \frac{\sum _{v\in V} |R(v)|}{|V|}. \tag{1} \end{equation}
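To make Equation (1) concrete, the following minimal Python sketch computes \(R(v)\) and \(RF(k)\) from an edge partitioning; the data layout and names are illustrative assumptions, not part of the paper.

```python
# A minimal sketch of Equation (1). `partitions` maps a partition id to its
# edge list; these names are illustrative assumptions, not from the paper.
from collections import defaultdict

def replication_factor(partitions):
    """partitions: dict[int, list[tuple]] mapping partition id -> edges."""
    replicas = defaultdict(set)          # vertex v -> R(v), partitions holding v
    for pid, edges in partitions.items():
        for u, v in edges:
            replicas[u].add(pid)
            replicas[v].add(pid)
    # RF(k) = sum over v of |R(v)|, divided by |V|
    return sum(len(r) for r in replicas.values()) / len(replicas)

parts = {1: [("a", "b"), ("a", "c")], 2: [("a", "d"), ("c", "e"), ("d", "e")]}
print(replication_factor(parts))  # a and c are replicated: RF = 7/5 = 1.4
```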
The replication factor mainly affects the amount of cross-node communication in distributed graph computing, while the local computing time of nodes is mainly affected by load balancing. The load balance of an edge partitioning is measured by Equation (2), where \(\varepsilon\) is the imbalance coefficient satisfying \(1 \le \varepsilon \le 2\) [33, 34]:
\begin{equation} (2 - \varepsilon)\frac{|E|}{k} \le |P_i| \le \varepsilon \frac{|E|}{k}, \quad 1 \le i \le k. \tag{2} \end{equation}
In Equation (2), \(\varepsilon = 1\) represents the situation where partitions are strictly balanced, and the number of edges in each partition is \(|E| / k\); \(\varepsilon = 2\) represents the most imbalanced situation in partitioning, with a maximum of \(2|E| / k\) edges and a minimum of 0 edges per partition; if \(\varepsilon = 1.5\), then the maximum number of edges per partition is \(1.5|E| / k\), and the minimum is \(0.5|E| / k\).
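As a worked illustration of these bounds, the short sketch below checks Equation (2) for a set of partition sizes; the helper name is an assumption of this example.

```python
# A small sketch of the balance constraint in Equation (2): every partition
# size |P_i| must lie within [(2 - eps)|E|/k, eps|E|/k].
def is_balanced(partition_sizes, eps):
    e = sum(partition_sizes)                  # |E|: total number of edges
    k = len(partition_sizes)                  # k: number of partitions
    lo, hi = (2 - eps) * e / k, eps * e / k
    return all(lo <= p <= hi for p in partition_sizes)

# With eps = 1.5 and |E|/k = 100, sizes may range from 50 to 150 edges.
print(is_balanced([120, 95, 85, 100], 1.5))  # True
print(is_balanced([210, 30, 80, 80], 1.5))   # False: 210 > 150
```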

3.2 Replica Degree

The time required for edge-centric graph processing mainly consists of two parts: the cross-node communication time for vertex replicas and the computing time at each node. The ultimate goal of dynamic edge partitioning is to minimize the total graph processing time. First, the cross-node communication time depends on the number of vertex replicas: a smaller replication factor yields better spatial locality of partitions and shorter cross-node communication time. Second, the computing time of a multi-node system depends on the slowest node: an imbalance coefficient \(\varepsilon\) close to 1 means better workload balance across nodes and shorter computing time. In fact, the key idea of dynamic edge partitioning is to migrate edges from one partition to others so as to reach a smaller replication factor and a more balanced workload after dynamic changes to the graph. A smaller imbalance coefficient \(\varepsilon\) therefore allows fewer edges to be migrated to other partitions, leaving fewer edges available for repartition. According to Equation (2), the number of edges per partition has an upper limit \(\varepsilon |E| / k\) and a lower limit \((2 - \varepsilon) |E| / k\), so each partition can migrate up to \(2(\varepsilon - 1) |E| / k\) edges to other partitions; when \(\varepsilon = 1\), no edges can be migrated during repartition. According to this analysis, to minimize the execution time of graph processing, one would ideally minimize the imbalance coefficient and the replication factor simultaneously. However, these two parameters are tightly coupled in their impact on system performance, so dynamic edge partitioning must consider both to find the optimized point for graph processing performance.
To illustrate the impact of the imbalance coefficient and replication factor, an example is given in Section 4.2.1, where we evaluate the impact of the imbalance coefficient \(\varepsilon\) on the performance of dynamic partitioning. The experimental results, shown in Figure 7, are consistent with our analysis: a smaller imbalance coefficient \(\varepsilon\) means a smaller optimization space for dynamic partitioning and thus a longer cross-node communication time. However, the total execution time is smallest when \(\varepsilon\) is 1.05, because the local computing time of nodes is also part of the total execution time. The results indicate that dynamic edge partitioning must balance the imbalance coefficient and the replication factor to achieve the maximum performance improvement. Because reductions of these two parameters are mutually constrained, the optimized performance of the system in the parameter space \((\varepsilon , RF)\) is reached by migrating the fewest edges that eliminate the largest number of vertex replicas.
Fig. 4. Example of vertex replica degree. Vertices a, c, and d are boundary vertices. Vertex a connects two edges in the \(P_1\) partition, with a replica degree of 2, and one edge in the \(P_2\) partition, with a replica degree of 1. Vertices c and d have a replica degree of 1 in both partitions.
Fig. 5. Example of cutting a high-degree vertex. (a) An example graph containing two high-degree vertices a and e. (b) Cutting a and e only once yields five replicas after partitioning. (c) Cutting a and e twice each yields four replicas after partitioning, one fewer than (b).
Fig. 6. The running process of degree-based stream edge repartition. Each computing node sequentially traverses its local edge stream in parallel and migrates edges based on the degrees of each edge's vertices.
Fig. 7. The impact of the imbalance coefficient \(\varepsilon\) on the total execution time.
To reduce the replication factor, we need to eliminate vertex replicas in one partition without adding new vertex replicas in another. This requires that the destination partition to which an edge is migrated already contains the vertex, and that the source partition no longer contains the vertex after the edge is migrated away. As shown in Figure 4, to eliminate vertex replicas in the \(P_1\) partition, vertices a, c, and d meet the requirements, because they also have replicas in the \(P_2\) partition: after migration eliminates their replicas in \(P_1\), no new replicas are added in \(P_2\). To measure the efficiency of eliminating vertex replicas, we propose the replica degree, defined as follows: \(RD_i(v)\) denotes the replica degree of vertex v in partition i, i.e., the total number of edges incident to v in partition i. Because each edge is allocated to exactly one partition, the union over all partitions of v's incident edges is exactly v's incident edge set in the original graph, i.e., \(\sum _{i=1}^{k}RD_i(v) = D(v)\), where \(D(v)\) is the degree of v. Replica degree is a very intuitive measure of the efficiency of eliminating vertex replicas: replica v connects \(RD_i(v)\) edges in partition i, so migrating those \(RD_i(v)\) edges to another partition containing v eliminates the replica. The efficiency of replica elimination is thus \(1 / RD_i(v)\), inversely proportional to the replica degree, meaning that vertices with lower replica degree are more worth eliminating through edge repartition. As shown in Figure 4, vertex a needs to migrate two edges to eliminate its replica in the \(P_1\) partition, while vertices c and d each need to migrate only one edge; the replicas of c and d are therefore more worth eliminating.
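A minimal sketch of this bookkeeping follows: it computes \(RD_i(v)\) for every vertex and partition, from which the elimination efficiency \(1/RD_i(v)\) can be read off. The container layout is an illustrative assumption.

```python
# Replica degree: rd[v][pid] = number of edges incident to v inside partition
# pid, so that the sum over pid of rd[v][pid] equals D(v).
from collections import defaultdict

def replica_degrees(partitions):
    rd = defaultdict(lambda: defaultdict(int))
    for pid, edges in partitions.items():
        for u, v in edges:
            rd[u][pid] += 1
            rd[v][pid] += 1
    return rd

parts = {1: [("a", "b"), ("a", "c")], 2: [("a", "d"), ("c", "e")]}
rd = replica_degrees(parts)
# Eliminating a's replica in P1 requires migrating RD_1(a) = 2 edges, while
# eliminating c's replica in P1 requires only RD_1(c) = 1 edge.
print(rd["a"][1], rd["c"][1])  # 2 1
```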
Although replica degree intuitively measures the efficiency of vertex replica elimination, relying solely on it for dynamic partitioning is not enough. Edge repartition eliminates vertex replicas by migrating edges, but an edge connects two vertices, and deciding to migrate an edge based solely on the replica degree of one vertex may actually increase the replication factor. Yet it is difficult to decide a migration by checking the replica degrees of both vertices of the edge simultaneously. In the following section, we propose using vertex degree to account for both vertices of an edge, further improving the quality of dynamic partitioning.

3.3 Vertex Degree-based Stream Edge Repartition

To determine more clearly whether an edge is worth migrating, we classify the two vertices connected by an edge as follows: Assume an edge is migrated from a source partition \(P_s\) to a destination partition \(P_d\), and consider a vertex v of that edge. Before the migration, \(P_s\) necessarily contains v; after the migration, there are two cases: either \(P_s\) still contains v (v has other edges in \(P_s\)), or v's replica in \(P_s\) is eliminated. Symmetrically, after the migration \(P_d\) necessarily contains v, and before the migration there are two cases: either \(P_d\) already contained v, or it did not. Combining the two source cases with the two destination cases yields four categories of vertices, as shown in Table 1. Of these, category (1) increases the number of vertex replicas, categories (2) and (3) leave it unchanged, and only category (4) reduces it.
|  | \(v \in P_d\) before migration | \(v \notin P_d\) before migration |
| \(v \in P_s\) after migration | (2) \(|R(v)|\) unchanged | (1) \(|R(v)| + 1\) |
| \(v \notin P_s\) after migration | (4) \(|R(v)| - 1\) | (3) \(|R(v)|\) unchanged |
Table 1. Four Types of Vertices
The vertices we actively migrate based on replica degree must be type (4), which guarantees that the migration reduces the number of vertex replicas. However, an edge has two vertices, and when the edge is migrated, one vertex passively follows the actively migrated vertex. To maximize the benefit of dynamic partitioning, we must reduce the probability of the passively following vertex being type (1) and make it a type (2), (3), or (4) vertex as much as possible; if it is type (4), the efficiency of eliminating vertex replicas is maximized. We observe that the degree of a vertex is closely related to its probability of being a type (1) vertex: type (1) requires the destination partition to hold no replica of the vertex before the migration, so the more partitions a vertex is in, the less likely it is to be type (1). Since high-degree vertices tend to be cut across multiple partitions in an edge partitioning scheme due to their many incident edges, their probability of being a type (1) vertex is low; conversely, low-degree vertices have a higher probability of becoming type (1) vertices.
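The classification in Table 1 can be expressed as a small predicate; the sketch below reuses the `rd` map from the previous sketch and returns the change in \(|R(v)|\) for one endpoint of a migrated edge. The helper name is an assumption of this example.

```python
# Change in |R(v)| when one edge incident to v moves from P_s (src) to P_d
# (dst). rd[v][src] counts v's local edges including the migrating one.
def replica_delta(rd, v, src, dst):
    stays_in_src = rd[v][src] > 1     # v keeps other edges in P_s after migration
    already_in_dst = rd[v][dst] > 0   # P_d already holds a replica of v
    if stays_in_src and not already_in_dst:
        return +1                     # type (1): a new replica is created
    if not stays_in_src and already_in_dst:
        return -1                     # type (4): a replica is eliminated
    return 0                          # types (2) and (3): |R(v)| unchanged
```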
Eliminating vertex replicas essentially means concentrating the edges of a vertex into one partition as much as possible: if all edges of v are in the same partition, then \(|R(v)|\) reaches its minimum value of 1. High-degree vertices are more likely to drive type (1) vertices to follow the migration during this concentration process. Although a high-degree vertex eliminates its own replicas, the migration of type (1) vertices adds replicas and actually reduces partition quality. This is because high-degree vertices connect to many vertices, so eliminating a high-degree vertex's replicas drags many low-degree vertices along with the migration, and as noted above, low-degree vertices are more likely to become type (1) vertices. As shown in Figure 5, concentrating high-degree vertices actually increases the number of vertex replicas in the entire system. Therefore, low-degree vertices are better suited as actively migrated vertices, while high-degree vertices are better suited to follow, and whether an edge is worth migrating can be judged by the degrees of its vertices.
Taking the above considerations into account, our degree-based stream edge repartition method is illustrated in Figure 6. After the graph data undergoes dynamic changes, each computing node sequentially traverses its local edge list in parallel and selects the lower-degree vertex v of each edge. Based on whether v is a type (4) vertex and on its replica degree, the edge is judged worth migrating if its efficiency in eliminating vertex replicas exceeds a threshold \(T_R\).
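The following sketch outlines one plausible reading of this per-node streaming pass, reusing the `rd` map and `replica_delta` predicate from the sketches above; `degree` and `migrate` are assumed helpers, and details such as load-balance checks are omitted.

```python
# Degree-based stream edge repartition: the low-degree endpoint leads, and an
# edge migrates toward the partition where that vertex has the highest replica
# degree, if the elimination efficiency 1 / RD_local(v) exceeds T_R.
def repartition_local(local_pid, local_edges, degree, rd, t_r, migrate):
    for (u, v) in list(local_edges):
        active = u if degree[u] <= degree[v] else v   # low-degree vertex leads
        dst = max(rd[active], key=rd[active].get)     # partition with highest RD
        if dst == local_pid:
            continue                                  # already concentrated here
        # dst already holds `active`, so moving all of its local edges there
        # eliminates the local replica without creating a new one (type (4)).
        if 1.0 / rd[active][local_pid] > t_r:
            migrate((u, v), local_pid, dst)
            for w in (u, v):                          # update both endpoints
                rd[w][local_pid] -= 1
                rd[w][dst] += 1
```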

3.4 Distance-aware Optimization

Previous edge repartition works GR-DEP [33], SAGM-ER [34], and GrapH [38] used the replication factor to measure the communication cost during the synchronization phase. However, for many-core systems, the replication factor cannot reflect the exact cross-node communication cost, because it measures only the number of cross-node messages, not the distance those messages travel. In this work, we optimize the stream edge repartition algorithm for many-core systems and propose a new metric, replica-hop, to measure the communication cost of many-core systems.
Hop is a metric in networks-on-chip [13]: the number of routers a message passes through from source node to destination node, or equivalently the number of links it traverses. Hop count is a simple and useful proxy for network latency. The proposed replica-hop combines the concept of hops with the replication factor. It is defined as follows: if a vertex v has replicas in partitions \(P_i\) and \(P_j\), and messages from \(P_i\) to \(P_j\) require \(H_v\) hops, then \(\text{replica-hop} = \sum _{v\in V} H_v\). We can use replica-hop to measure the spatial locality of a graph stored in a many-core system: the smaller the replica-hop, the better the spatial locality and the lower the communication overhead during the synchronization phase.
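On the mesh topology used here, the hop count between two nodes under X-Y routing is their Manhattan distance. The sketch below computes replica-hop under the assumption that \(H_v\) sums the pairwise hop counts among all partitions holding a replica of v; `coord` is an assumed mapping from partition id to mesh coordinates.

```python
# Replica-hop on a 2D mesh: sum, over all vertices, of the pairwise hop
# counts between the partitions holding their replicas.
from itertools import combinations

def hops(a, b, coord):
    (ax, ay), (bx, by) = coord(a), coord(b)
    return abs(ax - bx) + abs(ay - by)    # Manhattan distance under X-Y routing

def replica_hop(replicas, coord):
    """replicas: dict vertex -> set of partition ids holding a replica."""
    return sum(hops(p, q, coord)
               for parts in replicas.values()
               for p, q in combinations(sorted(parts), 2))
```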
Both our work and DREDGE target mesh-based many-core architectures. DREDGE [39] pointed out that reducing the replication factor only reduces the total number of messages, while the average transmission distance of messages may increase. Therefore, for many-core systems, simply reducing vertex replicas may actually increase communication overhead. Vertex replicas with larger hop counts incur more communication overhead, so priority should be given to migrating edges that eliminate these replicas, which maximizes system efficiency. Combining this with the degree-based stream edge repartition algorithm of Section 3.3, the pseudocode of distance-aware stream edge repartition is shown in Algorithm 1. The algorithm runs in parallel on each computing node in the system, and its running process is consistent with Figure 6.
In Algorithm 1, we compute during initialization, for each vertex v, the \(node\_id\) of the node with the highest replica degree, because migrating the vertex to that node has the highest probability of eliminating vertex replicas. The function \(distance(a, b)\) calculates the communication distance (hops) between nodes a and b in a many-core system. Because vertex replicas with larger hop counts and lower replica degrees have higher priority for elimination, we use \(distance(L, Max\_ID(v)) / RD_L(v)\) to measure whether an edge is worth migrating; if it is greater than the threshold \(T_R\), then the edge is repartitioned.
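The migration test itself can be summarized as a one-line score, sketched below with the `hops` helper from the previous sketch; `max_id` stands for the precomputed \(Max\_ID(v)\) table, and the concrete plumbing is an assumption of this example.

```python
# Distance-aware migration test: replicas that are far away (large hop count)
# and cheap to eliminate (small RD_L(v)) score highest.
def worth_migrating(v, local_pid, max_id, rd, coord, t_r):
    score = hops(local_pid, max_id[v], coord) / rd[v][local_pid]
    return score > t_r    # repartition the edge if the score exceeds T_R
```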

3.5 System Overview

Although the system we simulate is a processing-in-memory (PIM) system, our proposed GraphSER is a general method that applies to any system requiring graph partitioning (such as general PIM systems, distributed systems, and multi-GPU systems), because GraphSER focuses on optimizing the quality of graph partitions, a stage decoupled from the other stages of graph processing. The simulated PIM system follows References [18, 24], and its details are shown in Table 2.
| Component | Configuration |
| Memory | 48 Gb total capacity, 384 DRAM banks; 128 Mb capacity and 4.25 GB/s bandwidth per DRAM bank |
| Core | 800 MHz RISC-V in-order core; one core per DRAM bank |
| Network | Mesh topology; each link provides 40 GB/s bandwidth [1]; one hop between adjacent nodes takes 50 clock cycles |
Table 2. System Configuration
In our system, each PE core may access only its own local memory, and PEs communicate with one another through messages. Message passing avoids coherence protocols and atomic operations among PE cores, so the system can scale up without coherence management, and we use no special atomic operations such as locks or barriers. One PE core serves as a centralized control point to collect synchronization status and issue synchronization commands.
We use an unsorted edge list as our data format because (1) most open-source graph datasets store edges as edge lists, (2) compressed formats such as CSR are very inefficient to update [5], and (3) our method traverses edges in streaming form, which an edge list naturally suits. When the graph is updated (insertion or deletion of edges and vertices), the update operations on the edge list differ, but repartitioning performance after updates is unchanged, because our method only traverses the edge list, regardless of the content stored in it.
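A minimal sketch of such an unsorted edge list follows; the class layout and the linear-scan deletion are illustrative assumptions rather than details given in the paper.

```python
# Unsorted edge list: O(1) appends for insertions (unlike CSR rebuilds) and a
# simple generator for the streaming traversal GraphSER performs.
class EdgeList:
    def __init__(self):
        self.edges = []                 # unsorted list of (u, v) tuples

    def insert(self, u, v):
        self.edges.append((u, v))       # constant-time append

    def delete(self, u, v):
        self.edges.remove((u, v))       # linear scan over the local list

    def stream(self):
        yield from self.edges           # sequential, streaming-form traversal
```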

4 Evaluation

In this section, we first evaluate the impact of various parameters on our system. Then, we evaluate the impact of replica degree, degree, and replica-hop on our GraphSER partitioning. Finally, we evaluate the scalability of GraphSER and compare GraphSER with the state-of-the-art works using an in-house simulator.

4.1 Experimental Setup

We evaluate several metrics of the GraphSER algorithm, such as replication factor, replica-hop, and memory overhead, on an Intel Xeon Gold 6254 CPU @ 3.10 GHz. We evaluate the overall performance of GraphSER on the many-core system using an in-house simulator that combines ZSsim [47], Ramulator [28, 49], and Booksim [23] to simulate a many-core system.
Ramulator is a cycle-accurate DRAM simulator that we use to simulate the main memory of the many-core system. Based on a generalized template for modeling DRAM systems, it provides out-of-the-box support for various DRAM standards such as DDR3/4, LPDDR3/4, GDDR5, HBM, and HMC. Ramulator replays the memory-access trace generated by ZSsim [47] to simulate memory access in the many-core system, including local and cross-node accesses. Booksim is a cycle-accurate network-on-chip simulator designed for simulation flexibility and precise modeling of network components; we use it to simulate the network of the many-core system. The network topology is a mesh with the X-Y routing algorithm.
We use a total of eight datasets, four static and four dynamic, as shown in Table 3. The four static datasets are the same as in DREDGE [39]: LiveJournal [4], Friendster [52], Amazon [30], and Roadnet [32]; we choose them for ease of comparison with DREDGE in Section 4.4. The four dynamic datasets, used for comparison with state-of-the-art works in Section 4.5, are Twitter [8], Facebook [50], Flickr [9], and Orkut [40].
| Name | Type | Number of Vertices | Number of Edges | Average Degree |
| LiveJournal | Static | 4.8M | 69M | 14.38 |
| Friendster | Static | 65.6M | 1.8B | 27.44 |
| Amazon | Static | 335k | 925k | 2.76 |
| Roadnet | Static | 2M | 2.8M | 1.4 |
| Twitter | Dynamic | 54M | 2B | 37.04 |
| Facebook | Dynamic | 60.3k | 1.55M | 25.64 |
| Flickr | Dynamic | 2.57M | 33.14M | 12.89 |
| Orkut | Dynamic | 3.07M | 223.5M | 72.8 |
Table 3. Graph Datasets
The four dynamic datasets can be used directly in our evaluation, but the four static datasets must be converted into dynamic datasets. Following the method of SAGM-ER [34], we divide each graph into an initial graph and a dynamic variation graph, with the split chosen at random. Metis [25, 26] and its parallel version ParMetis [27] are well-known static graph partitioners that achieve high quality in vertex partitioning, but in edge partitioning the quality of Metis is not as good as HDRF [42]. We therefore use HDRF [42] for the initial partitioning, with 70% of the dataset partitioned initially and the remaining 30% used as dynamic changes, including insertions and deletions of edges and vertices. According to Section 4.2.1, we set the imbalance coefficient of both the initial partitioning algorithm and the subsequent dynamic partitioning algorithm to \(\varepsilon = 1.05\). We run the repartition algorithm after the dynamic changes and evaluate the various metrics of the algorithm.

4.2 Parameter Analysis

In this section, we evaluate the impact of various parameters on our system, such as the imbalance coefficient \(\varepsilon\), the threshold \(T_R\), and the graph diameter.

4.2.1 Analysis of Imbalance Coefficient \(\varepsilon\).

The time required for graph processing mainly consists of two parts: the cross-node communication time and the local computing time. As discussed in Section 3.2, the imbalance coefficient \(\varepsilon\) affects both parts. In this subsection, we evaluate its impact on the total execution time.
According to Section 4.4, our system performs best at a system size of 20 × 20, so we simulate a 20 × 20 system. We use the eight datasets in Table 3 with PageRank as the algorithm, and we use Twitter [8] as an example to show the breakdown of total execution time in Figure 7. We divide the total execution time into four categories: communication time, computation time, synchronization time, and repartitioning time. The repartitioning time refers specifically to the time the system spends running our repartition algorithm, including the time to compute intermediate values (e.g., \(|R(v)|\)).
As the imbalance coefficient \(\varepsilon\) increases, the repartitioning optimization space grows, so the cross-node communication time shrinks, while the repartitioning time grows in exchange for better repartitioning quality. The local computation time depends mainly on load balancing, so it also grows as \(\varepsilon\) increases. The synchronization time is independent of \(\varepsilon\) and shows no significant change. The optimal imbalance coefficients for the different datasets are shown in Table 4. Combined with Figure 7, we can see that the total execution time is shortest when the imbalance coefficient is between 1.04 and 1.1, at which point it strikes the optimized balance between repartitioning quality and load balancing.
| Dataset | LiveJournal | Friendster | Amazon | Roadnet | Twitter | Facebook | Flickr | Orkut |
| Optimal \(\varepsilon\) | 1.05 | 1.07 | 1.08 | 1.06 | 1.05 | 1.04 | 1.07 | 1.05 |
Table 4. The Optimal Imbalance Coefficient \(\varepsilon\) of Different Datasets

4.2.2 Analysis of Threshold \(T_R\).

We introduced \(T_R\) in Section 3.4 as the minimum threshold for triggering edge migration. In this subsection, we evaluate the impact of the threshold \(T_R\) on system performance in detail. The configuration is the same as in Section 4.2.1: Twitter [8] as the dataset, PageRank as the algorithm, and a simulated system size of 20 × 20, with total execution time divided into communication time, computation time, synchronization time, and repartitioning time.
As the threshold \(T_R\) decreases, the number of edges that meet the migration condition increases. If \(T_R\) is 0, then all edges meet the migration condition, and the system migrates every edge it traverses until the load-balancing condition is met (i.e., random repartition). If \(T_R\) is sufficiently large, then no edges meet the migration condition (i.e., no repartition). The evaluation results, shown in Figure 8, are consistent with this analysis. The computation and synchronization times are independent of \(T_R\) and show no significant change. As \(T_R\) increases, fewer edges meet the migration condition, so the system must traverse more edges to complete the repartition and the repartitioning time grows. When \(T_R\) is small, a large number of edges meet the migration condition, making it difficult to migrate the edges that truly optimize partitioning quality; when \(T_R\) is large, only a few edges qualify, so the repartitioning optimization space is small. Hence, the communication time grows when \(T_R\) is either very small or very large. The total execution time is shortest when \(T_R\) is 8, at which point the threshold strikes the optimized balance between repartitioning quality and repartitioning time.
Fig. 8. The impact of threshold \(T_R\) on the total execution time.

4.2.3 Analysis of Dataset Parameters.

Graph datasets have many properties, such as average degree, maximum degree, and so on. In this subsection, we select the three most important properties of dynamic graph datasets, which we consider to be: graph diameter, graph change rate, and graph skewness.
The diameter of a graph is the maximum distance between any two vertices. The smaller the graph diameter, the tighter the connections between graph partitions and the more communication between partitions. As mentioned in Section 1, after updates to a graph dataset, the quality of its graph partitions deteriorates; the higher the rate of graph changes, the faster the quality of dynamic graph partitioning deteriorates. One of the landmark properties of real-world graphs is a skewed power-law degree distribution [15]: most vertices have few neighbors while a few have many. Under a power-law distribution, the probability that a vertex has degree d is given by:
\begin{equation} P(d) \propto d^{-\alpha }, \tag{3} \end{equation}
where d is the degree of the vertex and \(\alpha\) is a positive constant representing the “skewness” of the degree distribution. As \(\alpha\) increases, the number of high-degree vertices decreases, the density of the graph decreases, and the skewness of the graph increases; and as the skewness increases, load balancing of graph partitioning becomes harder to achieve. Most real-world graphs have a skewness constant around \(\alpha \approx 2\) [15].
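To illustrate Equation (3), the sketch below draws degrees from a power law by inverse-transform sampling and shows how a larger \(\alpha\) thins out high-degree vertices; it is purely illustrative, and the evaluation itself uses Graph500-generated graphs.

```python
# Inverse-transform sampling of P(d) ~ d^(-alpha) for d >= 1 (valid for
# alpha > 1): d = (1 - u)^(-1 / (alpha - 1)) with u uniform in [0, 1).
import random

def sample_powerlaw_degree(alpha, d_max=10**6):
    u = random.random()
    return min(int((1 - u) ** (-1 / (alpha - 1))), d_max)

for alpha in (1.8, 2.0, 2.5):
    degs = [sample_powerlaw_degree(alpha) for _ in range(100_000)]
    hubs = sum(d >= 100 for d in degs) / len(degs)
    print(f"alpha={alpha}: fraction of vertices with degree >= 100: {hubs:.4f}")
```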
In this experiment, we analyze three properties of the graph, so we use Graph500 [41] to generate the graph datasets, which makes the variables easy to control. The simulated system size is 20 × 20, and we set the unoptimized solution without dynamic partitioning as the baseline. The evaluation results are shown in Figure 9. As the graph diameter decreases, the connections between graph partitions become tighter, and GraphSER achieves a larger performance improvement by reducing communication between partitions. Figure 9(b) shows how GraphSER's performance changes with the update rate of the dynamic graph, which we represent by the number of edges/vertices updated between two repartitionings. As the update rate increases, the quality of dynamic graph partitioning deteriorates faster, and GraphSER achieves a larger improvement by optimizing partition quality. As the skewness constant \(\alpha\) increases, load balancing of graph partitioning becomes harder to achieve, and our edge repartitioning is better than vertex repartitioning in this respect; thus GraphSER performs better as \(\alpha\) increases.
Fig. 9. The impact of dataset parameters on system performance: (a) variation of system performance with graph diameter, (b) variation with dynamic graph change rate, and (c) variation with graph skewness \(\alpha\).

4.2.4 Analysis of Memory Overhead.

In this subsection, we evaluate the memory overhead of GraphSER. We use three datasets: Twitter [8], Facebook [50], and Flickr [9]. The graph algorithm we evaluate is PageRank. We choose Metis [25] and HDRF [42] as baselines and evaluate the memory overhead of GraphSER and the baselines on an Intel Xeon Gold 6254 CPU @ 3.10 GHz.
As detailed in Section 3.4, only the edges currently being traversed need to reside in memory, which significantly decreases the memory overhead of GraphSER. The evaluation results are shown in Figure 10. The memory consumption of GraphSER is significantly lower than that of Metis [25] and HDRF [42]: the memory consumption of Metis and HDRF increases linearly from \(t_{0}\) to \(t_{5}\), while the memory overhead of GraphSER grows smoothly, because GraphSER only stores the currently traversed edges in memory.
Fig. 10. Memory overhead of different methods. Each graph changes at timesteps \(t_{1}, \ldots , t_{10}\) and the initial partitioning is at \(t_0\).

4.3 Self-analysis

In Section 3, we proposed replica degree, replica-hop, and degree-based stream edge repartition. In this section, we evaluate the impact of replica degree, vertex degree, and replica-hop on GraphSER partitioning.

4.3.1 Impact of Replica Degree.

The partitioning quality and load balance of edge partitioning are mutually constrained: the more imbalanced the load is allowed to be, the easier it is to achieve higher partitioning quality. However, distributed graph processing time depends on both, so only by considering both can the performance of the entire system be maximized. We proposed replica degree in Section 3.2 to measure the efficiency of eliminating vertex replicas by migrating edges, balancing the replication factor against the imbalance coefficient. In this experiment, we evaluate the impact of replica degree on the edge repartition algorithm.
We use the LiveJournal dataset and divide the graph into 64, 144, 256, 400, and 576 partitions. We set the unoptimized solution without dynamic partitioning as the baseline and compare GraphSER against a variant of GraphSER that does not migrate edges based on replica degree. As mentioned in Section 4.1, we set the imbalance coefficient to \(\varepsilon = 1.05\), which means a partition can move at most 10% of its edges to optimize partition quality.
We evaluate the impact of replica degree using two metrics: the replication factor, which measures the quality of graph partitioning in general distributed systems and represents the number of cross-node communications, and replica-hop, which measures the spatial locality of data in the many-core system and represents the cross-node communication overhead. The results are shown in Figure 11. Compared to the variant of GraphSER that does not migrate edges based on replica degree, GraphSER reduces the replication factor by a maximum of 18% at 144 partitions, a minimum of 15% at 576 partitions, and 16% on average; it reduces replica-hop by a maximum of 18% at 256 partitions, a minimum of 12% at 400 partitions, and 15% on average. The similar average reductions in replication factor and replica-hop indicate that migrating edges based on replica degree yields similar improvements for both kinds of systems, so replica degree is universally useful for improving edge repartition algorithms, and its gain remains stable as the scale of the mesh architecture changes.
Fig. 11. Using LiveJournal to evaluate the impact of replica degree on edge repartitioning.

4.3.2 Impact of Degree.

The edge repartition algorithm optimizes partition quality by migrating edges between partitions. The replica degree proposed in Section 3.2 improves the efficiency of eliminating vertex replicas through edge migration. However, an edge contains two vertices, and migrating an edge based on the replica degree of only one vertex may actually reduce partition quality. Section 3.3 addresses this problem through vertex degree: higher-degree vertices follow lower-degree vertices, and an edge is judged worth migrating based on the replica degree of its lower-degree vertex. In this experiment, we evaluate the impact of vertex degree on the edge repartition algorithm.
In this experiment, we set the unoptimized solution without dynamic partitioning as the baseline and compare GraphSER with a variant of GraphSER in which an edge randomly follows one of its vertices when migrating. The other settings are the same as in Section 4.3.1: the LiveJournal dataset, 64, 144, 256, 400, and 576 partitions, and an imbalance coefficient of \(\varepsilon = 1.05\). This experiment also uses the replication factor and replica-hop to evaluate the impact of degree on the edge repartition algorithm. The results are shown in Figure 12. Compared to the variant with edges randomly following a vertex, GraphSER reduces the replication factor by a maximum of 26% at 64 partitions, a minimum of 5% at 576 partitions, and 15.5% on average; it reduces replica-hop by a maximum of 23% at 144 partitions, a minimum of 6% at 576 partitions, and 14% on average. The similar average reductions indicate that migrating edges according to the lower-degree vertex yields similar improvements for both kinds of systems, so vertex degree is universally useful for improving edge repartition algorithms.
Fig. 12. Using LiveJournal to evaluate the impact of degree on edge repartitioning.

4.3.3 Spatial Locality.

For a many-core system, the communication cost is determined not only by the number of cross-node communications, but also by their average distance (hops); the replication factor, however, measures only the number of cross-node communications. Therefore, in Section 3.4, we proposed the new metric replica-hop and a distance-aware stream edge repartition algorithm for many-core systems. In this experiment, we evaluate how well replica-hop measures communication overhead on a mesh topology, as well as the performance improvement of the distance-aware algorithm on a mesh topology.
In this experiment, we use an in-house simulator to simulate the many-core system, with a mesh NoC topology and the X-Y routing algorithm. We run GraphSER with and without distance awareness and set the unoptimized solution without dynamic partitioning as the baseline. We use the LiveJournal dataset and the PageRank algorithm to evaluate system sizes of 8 × 8, 12 × 12, 16 × 16, 20 × 20, and 24 × 24. We use the replication factor and replica-hop to evaluate partition quality and use the in-house simulator to measure the total execution time of the system, as shown in Figure 13 and Table 5. Distance-aware GraphSER shows an average reduction of 29% in replication factor, 79.6% in replica-hop, and 55% in total execution time compared to the baseline; GraphSER without distance awareness shows an average reduction of 29.6% in replication factor, 31% in replica-hop, and 21% in total execution time. From these data we draw two observations: First, replica-hop effectively measures communication overhead in many-core systems, since distance-aware GraphSER has a replication factor similar to the non-distance-aware version while its replica-hop and total execution time drop significantly; second, distance-aware GraphSER greatly improves data spatial locality in the many-core system, greatly reducing total execution time.
Table 5. Total Execution Time in Many-core Systems (normalized to the no-repartition baseline)

System Size | No Repartition | No Distance-aware | GraphSER
8 × 8       | 1×             | 0.79×             | 0.57×
12 × 12     | 1×             | 0.85×             | 0.49×
16 × 16     | 1×             | 0.80×             | 0.46×
20 × 20     | 1×             | 0.73×             | 0.38×
24 × 24     | 1×             | 0.76×             | 0.39×
Fig. 13. The impact of distance awareness on edge repartitioning.

4.4 Scalability

In this experiment, we use an in-house simulator to evaluate the scalability of GraphSER and compare it with DREDGE [39] and SAGM-ER [34]. We use four datasets, LiveJournal, Friendster, Amazon, and Roadnet, the same datasets used by DREDGE [39]. For ease of comparison with DREDGE, our experimental results report not total execution time but the number of graph algorithm iterations completed within 100M clock cycles. The graph algorithms we evaluate are PageRank and SSSP, and we use an unoptimized solution without dynamic partitioning as the baseline.
We first evaluate the scalability of the repartition time of GraphSER and SAGM-ER on an Intel(R) Xeon(R) Gold 6254 CPU @ 3.10 GHz, as shown in Table 6. We do not compare with DREDGE here, because its solution relies on a hardware implementation and has no corresponding software algorithm. At a system size of 4 × 4, GraphSER shows its minimum improvement over SAGM-ER, a 67% reduction in repartition time; at 24 × 24, it shows its maximum improvement, a 94.8% reduction. This is because, as the system size increases, the amount of data in each partition decreases, so the time for GraphSER to sequentially traverse the edges of its local partition also decreases. The results illustrate two points: First, the cost of GraphSER's edge repartition is significantly lower than that of the previous edge repartition algorithm SAGM-ER; second, GraphSER scales well with system size, as the repartition cost decreases while the system grows.
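This scaling argument can be stated compactly. Assuming each node only streams through its own edge list, the per-node repartition cost is roughly \(T_{\text{repartition}} \approx c \cdot |E| / P\), where \(|E|\) is the total number of edges, \(P\) is the number of partitions (computing nodes), and \(c\) is the per-edge streaming cost. Migration traffic and fixed per-node overheads are ignored in this estimate, which is why the measured times in Table 6 shrink more slowly than \(1/P\).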
Table 6. Repartition Time of GraphSER and SAGM-ER

System Size | SAGM-ER (ms) | GraphSER (ms)
4 × 4       | 987.63       | 322.59
8 × 8       | 1,001.27     | 163.75
12 × 12     | 1,222.43     | 100.98
16 × 16     | 1,173.55     | 69.63
20 × 20     | 1,031.31     | 57.09
24 × 24     | 1,020.98     | 53.07
Next, we evaluate the scalability of the overall performance of GraphSER, DREDGE, and SAGM-ER. The evaluation results are shown in Table 7. The system sizes we evaluate are 4 × 4, 8 × 8, 12 × 12, 16 × 16, 20 × 20, and 24 × 24. On Friendster at a system size of 20 × 20, GraphSER's performance peaks at 2× that of SAGM-ER; on Roadnet at 4 × 4, it bottoms out at 0.9× that of SAGM-ER; on average, GraphSER's performance is 1.52× that of SAGM-ER. On LiveJournal at 8 × 8, GraphSER's performance peaks at 2.17× that of DREDGE; on Roadnet at 24 × 24, it bottoms out at 1.04× that of DREDGE; on average, GraphSER's performance is 1.40× that of DREDGE. From the experimental data in Table 7, we can draw several observations:
Table 7. Overall Performance in Many-core Systems (speedup over the no-repartition baseline; each cell lists SAGM-ER [34] / DREDGE [39] / Ours)

Algorithm | System Size | LiveJournal            | Friendster             | Amazon                 | Roadnet
PageRank  | 4 × 4       | 1.33× / N.A. / 1.77×   | 1.41× / N.A. / 1.92×   | 1.15× / N.A. / 1.45×   | 1.16× / N.A. / 1.04×
PageRank  | 8 × 8       | 1.41× / 0.81× / 1.76×  | 1.33× / 1.12× / 2.01×  | 1.17× / 1.28× / 1.78×  | 1.02× / 0.96× / 1.21×
PageRank  | 12 × 12     | 1.24× / N.A. / 2.05×   | 1.49× / N.A. / 2.16×   | 1.26× / N.A. / 1.81×   | 1.13× / N.A. / 1.22×
PageRank  | 16 × 16     | 1.35× / 1.44× / 2.16×  | 1.36× / 1.55× / 2.63×  | 1.25× / 1.68× / 1.84×  | 1.12× / 1.21× / 1.41×
PageRank  | 20 × 20     | 1.36× / N.A. / 2.59×   | 1.39× / N.A. / 2.78×   | 1.24× / N.A. / 2.26×   | 1.07× / N.A. / 1.46×
PageRank  | 24 × 24     | 1.29× / 2.05× / 2.54×  | 1.47× / 2.20× / 2.80×  | 1.22× / 1.90× / 2.29×  | 1.09× / 1.38× / 1.43×
SSSP      | 4 × 4       | 1.24× / N.A. / 1.68×   | 1.33× / N.A. / 1.77×   | 1.12× / N.A. / 1.34×   | 1.04× / N.A. / 1.09×
SSSP      | 8 × 8       | 1.39× / N.A. / 1.64×   | 1.42× / N.A. / 2.07×   | 1.18× / N.A. / 1.76×   | 0.99× / N.A. / 1.19×
SSSP      | 12 × 12     | 1.22× / N.A. / 1.93×   | 1.47× / N.A. / 2.14×   | 1.20× / N.A. / 1.69×   | 1.02× / N.A. / 1.11×
SSSP      | 16 × 16     | 1.36× / N.A. / 2.00×   | 1.24× / N.A. / 2.68×   | 1.14× / N.A. / 1.93×   | 1.10× / N.A. / 1.33×
SSSP      | 20 × 20     | 1.30× / N.A. / 2.80×   | 1.43× / N.A. / 2.64×   | 1.22× / N.A. / 2.49×   | 1.11× / N.A. / 1.57×
SSSP      | 24 × 24     | 1.35× / N.A. / 2.77×   | 1.48× / N.A. / 3.00×   | 1.28× / N.A. / 2.36×   | 1.08× / N.A. / 1.55×
(1) In most cases, when the system size is below 20 × 20, GraphSER's performance on the many-core system increases with system size; once the system size reaches 20 × 20, the overall performance of GraphSER saturates. The improvement with size arises because, as the system grows, the average communication distance becomes longer and the room for spatial-locality optimization becomes larger, similar to DREDGE. However, the improvement does not grow indefinitely: As the system size increases, graph partitions become more dispersed, making it harder for GraphSER to place adjacent vertices on adjacent computing nodes.
(2) When the network size is relatively small, GraphSER improves performance more significantly than DREDGE, because DREDGE relies on vertex pressure to optimize partitioning: As the mesh size shrinks, the average vertex pressure shrinks with it, and so does the optimization space. The optimizations in GraphSER, which rely on replica degree and degree, are independent of network size, yielding a more stable performance improvement.
(3) Compared to the SAGM-ER edge repartition method, GraphSER achieves an average improvement of 52%, mainly because GraphSER scales better than SAGM-ER in both repartition time and overall performance.
In the experiments of Table 7, 30% of each dataset is used as the dynamically changing part. Although datasets such as LiveJournal, Friendster, and Amazon vary significantly in size, GraphSER's performance on them shows no notable difference. As Figure 14 also shows, GraphSER's performance is not tied to the size of the input graph.
In Table 7, the performance of SAGM-ER fluctuates across experiments. This is mainly because SAGM-ER starts its search from data in the dynamically changing part of the graph, and those data are randomly selected from the 30% dynamic portion of each dataset, so its performance varies from run to run.
Fig. 14. Comparison with the state-of-the-art works.
Finally, we evaluate the scalability of GraphSER across different graph algorithms. We run the Random Walk (5%) algorithm on the LiveJournal dataset to evaluate the performance of GraphSER; Random Walk (5%) accesses 5% of the edges, whereas PageRank accesses 100%. The experimental results are shown in Table 8. Combined with Table 7, it can be seen that GraphSER and SAGM-ER deliver relatively stable performance across algorithms, as both optimize directly on the structure of the dynamic graph. The repartition results of DREDGE, in contrast, depend on the program's memory access patterns, since DREDGE repartitions by monitoring past vertex update messages. Its performance therefore changes with the algorithm: For Random Walk (5%), whose memory access pattern is highly random across iterations, DREDGE achieves only 48% of the baseline performance.
Table 8. Overall Performance in Random Walk (5%) (speedup over the no-repartition baseline)

System Size | No Repartition | SAGM-ER | GraphSER
4 × 4       | 1×             | 1.18×   | 1.81×
8 × 8       | 1×             | 1.34×   | 1.77×
12 × 12     | 1×             | 1.28×   | 2.06×
16 × 16     | 1×             | 1.24×   | 2.26×
20 × 20     | 1×             | 1.16×   | 2.29×
24 × 24     | 1×             | 1.31×   | 2.05×

4.5 Comparison with the State-of-the-art

In this section, we compare GraphSER with several state-of-the-art works: PowerGraph [20], GraphSteal [29], GrapH [38], Lux [22], DREDGE [39], and SAGM-ER [34].
PowerGraph was the first static graph computing framework to propose edge partitioning and is widely known for it; we choose PowerGraph as the baseline.
GraphSteal is an edge repartition algorithm that moves edges from slower partitions to faster ones, thereby improving computation and memory efficiency.
GrapH proposed the repartition algorithm H-adapt to improve partitioning quality. In a heterogeneous environment, GrapH starts from high-traffic vertices and checks whether migrating their adjacent edges between two selected parts can reduce communication costs.
Lux proposed a novel dynamic partitioning strategy that achieves good load balancing across GPUs; to the best of our knowledge, Lux is the state-of-the-art edge repartitioning work on GPUs.
DREDGE proposed a novel hardware solution to improve partitioning quality; to the best of our knowledge, DREDGE is the state-of-the-art dynamic repartitioning hardware work.
SAGM-ER proposed edge groups and improves partitioning quality by migrating selected edge groups between two chosen parts; to the best of our knowledge, SAGM-ER is the state-of-the-art edge repartitioning work.
In this experiment, we use four dynamic datasets: Twitter [8], Facebook [50], Flickr [9], and Orkut [40]. The graph algorithms we evaluate are PageRank, SSSP, connected components (CC), and BFS. According to Section 4.4, our system performs best at a system size of 20 × 20, so that is the size we simulate. The experimental results are shown in Figure 14. GraphSteal balances the workload among nodes but does not optimize communication between them, and cross-node communication occupies most of the time in graph computation. GrapH cuts low-traffic vertices to decrease communication costs, but vertex traffic varies across iterations, and the vertex-traffic predictions GrapH relies on are not always accurate. Lux monitors runtime performance and balances workload among nodes, similar to GraphSteal, and, like GraphSteal, does not optimize communication between nodes. DREDGE repartitions the graph by monitoring past vertex update messages; however, many graph algorithms, such as SSSP, CC, and BFS, have random memory access patterns across iterations. As discussed in Section 4.4, GraphSER scales better than SAGM-ER in both repartition time and overall performance.

5 Conclusion

The main bottleneck of graph processing lies in irregular memory accesses. Some many-core systems utilize the high local bandwidth of HMC systems to alleviate this bottleneck. However, as highlighted by GraphP, the cross-node bandwidth emerges as a critical bottleneck. DREDGE was the first to introduce a method for enhancing the spatial locality of graph partitions through vertex repartition, thereby reducing cross-node memory accesses and mitigating this bottleneck. Nevertheless, the performance of DREDGE exhibits variability across different graph algorithms and system sizes.
In this work, we propose GraphSER, a stream-based edge repartition method designed to overcome the limitations of prior research. We optimize GraphSER to be distance-aware in many-core systems and achieve performance better than state-of-the-art approaches. Our evaluation demonstrates that, compared to SAGM-ER, GraphSER exhibits an average speedup of 1.52×, with a maximum of 2×. Additionally, in comparison to DREDGE, GraphSER achieves an average performance improvement of 40%, reaching a maximum of 117%.

References

[1]
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 105–117.
[2]
Konstantin Andreev and Harald Räcke. 2004. Balanced graph partitioning. In Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. 120–124.
[3]
Tewodros Alemu Ayall, Huawen Liu, Changjun Zhou, Abegaz Mohammed Seid, Fantahun Bogale Gereme, Hayla Nahom Abishu, and Yasin Habtamu Yacob. 2022. Graph computing systems and partitioning techniques: A survey. IEEE Access 10 (2022), 118523–118550.
[4]
Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. 2006. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 44–54.
[5]
Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, and Torsten Hoefler. 2021. Practice of streaming processing of dynamic graphs: Concepts, models, and systems. IEEE Trans. Parallel Distrib. Syst. 34, 6 (2021), 1860–1876.
[6]
Florian Bourse, Marc Lelarge, and Milan Vojnovic. 2014. Balanced graph edge partition. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1456–1465.
[7]
Huanqi Cao, Yuanwei Wang, Haojie Wang, Heng Lin, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling graph traversal to 281 trillion edges with 40 million cores. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 234–245.
[8]
Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. 2010. Measuring user influence in Twitter: The million follower fallacy. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM).
[9]
Meeyoung Cha, Alan Mislove, and Krishna P. Gummadi. 2009. A measurement-driven analysis of information propagation in the Flickr social network. In Proceedings of the 18th International Conference on World Wide Web. 721–730.
[10]
Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu, Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. 2012. Kineograph: Taking the pulse of a fast-changing and connected world. In Proceedings of the 7th ACM European Conference on Computer Systems. 85–98.
[11]
Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One trillion edges: Graph processing at Facebook-scale. Proc. VLDB Endow. 8, 12 (2015), 1804–1815.
[12]
Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong Yang. 2018. GraphH: A processing-in-memory architecture for large-scale graph processing. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 38, 4 (2018), 640–653.
[13]
William J. Dally and Charles L. Seitz. 1986. The torus routing chip. Distrib. Comput. 1 (1986), 187–196.
[14]
David Ediger, Rob McColl, Jason Riedy, and David A. Bader. 2012. Stinger: High performance data structure for streaming graphs. In Proceedings of the IEEE Conference on High Performance Extreme Computing. IEEE, 1–5.
[15]
Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. 1999. On power-law relationships of the internet topology. ACM SIGCOMM Comput. Commun. Rev. 29, 4 (1999), 251–262.
[16]
Wenfei Fan, Chunming Hu, and Chao Tian. 2017. Incremental graph computations: Doable and undoable. In Proceedings of the ACM International Conference on Management of Data. 155–169.
[17]
Guoyao Feng, Xiao Meng, and Khaled Ammar. 2015. DISTINGER: A distributed graph data structure for massive dynamic graph processing. In Proceedings of the IEEE International Conference on Big Data (Big Data’15). IEEE, 1814–1822.
[18]
Bai Fujun, Jiang Xiping, Wang Song, Yu Bing, Tan Jie, Zuo Fengguo, Wang Chunjuan, Wang Fan, Long Xiaodong, Yu Guoqing, Fu Ni, Li Qiannan, Li Hua, Wang Kexin, Duan Huifu, Bai Liang, Jia Xuerong, Li Jin, Li Mei, Wang Zhengwen, Hu Sheng, Zhou Jun, Zhan Qiong, Sun Peng, Yang Daohong, Cheichan Kau, David Yang, Ching-Sung Ho, Sun Hongbin, Lv Hangbing, Liu Ming, Kang Yi, and Ren Qiwei. 2020. A stacked embedded DRAM array for LPDDR4/4X using hybrid bonding 3D integration with 34GB/s/1Gb 0.88 pJ/b logic-to-memory interface. In Proceedings of the IEEE International Electron Devices Meeting (IEDM’20). IEEE, 6–6.
[19]
Shufeng Gong, Chao Tian, Qiang Yin, Wenyuan Yu, Yanfeng Zhang, Liang Geng, Song Yu, Ge Yu, and Jingren Zhou. 2021. Automating incremental graph processing with flexible memoization. Proc. VLDB Endow. 14, 9 (2021), 1613–1625.
[20]
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12). 17–30.
[21]
Oded Green and David A. Bader. 2016. cuSTINGER: Supporting dynamic graph algorithms for GPUs. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC’16). IEEE, 1–6.
[22]
Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez, and Alex Aiken. 2017. A distributed multi-GPU system for fast graph processing. Proc. VLDB Endow. 11, 3 (2017), 297–310.
[23]
Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, David E. Shaw, John Kim, and William J. Dally. 2013. A detailed and flexible cycle-accurate network-on-chip simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’13). IEEE, 86–96.
[24]
Xiping Jiang, Fengguo Zuo, Song Wang, Xiaofeng Zhou, Yubing Wang, Qi Liu, Qiwei Ren, and Ming Liu. 2022. A 1596-GB/s 48-Gb stacked embedded DRAM 384-core SoC with hybrid bonding integration. IEEE Solid-state Circ. Lett. 5 (2022), 110–113.
[25]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Scient. Comput. 20, 1 (1998), 359–392.
[26]
George Karypis and Vipin Kumar. 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1 (1998), 96–129.
[27]
George Karypis and Vipin Kumar. 1998. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. Parallel Distrib. Comput. 48, 1 (1998), 71–95.
[28]
Yoongu Kim, Weikun Yang, and Onur Mutlu. 2015. Ramulator: A fast and extensible DRAM simulator. IEEE Comput. Archit. Lett. 15, 1 (2015), 45–49.
[29]
Dinesh Kumar, Arun Raj, and Janakiram Dharanipragada. 2017. GraphSteal: Dynamic re-partitioning for efficient graph processing in heterogeneous clusters. In Proceedings of the IEEE 10th International Conference on Cloud Computing (CLOUD’17). IEEE, 439–446.
[30]
Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. 2007. The dynamics of viral marketing. ACM Trans. Web 1, 1 (2007), 5–es.
[31]
Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1, 1 (2007), 2–es.
[32]
Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6, 1 (2009), 29–123.
[33]
He Li, Hang Yuan, Jianbin Huang, Jiangtao Cui, Xiaoke Ma, Senzhang Wang, Jaesoo Yoo, and S Yu Philip. 2021. Group reassignment for dynamic edge partitioning. IEEE Trans. Parallel Distrib. Syst. 32, 10 (2021), 2477–2490.
[34]
He Li, Hang Yuan, Jianbin Huang, Xiaoke Ma, Jiangtao Cui, and Jaesoo Yoo. 2022. Edge repartitioning via structure-aware group migration. IEEE Trans. Computat. Soc. Syst. 9, 3 (2022), 751–760.
[35]
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078 (2012).
[36]
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 135–146.
[37]
Mugilan Mariappan and Keval Vora. 2019. GraphBolt: Dependency-driven synchronous processing of streaming graphs. In Proceedings of the 14th EuroSys Conference. 1–16.
[38]
Christian Mayer, Muhammad Adnan Tariq, Ruben Mayer, and Kurt Rothermel. 2018. GrapH: Traffic-aware graph processing. IEEE Trans. Parallel Distrib. Syst. 29, 6 (2018), 1289–1302.
[39]
Andrew McCrabb, Eric Winsor, and Valeria Bertacco. 2019. DREDGE: Dynamic repartitioning during dynamic graph execution. In Proceedings of the 56th Annual Design Automation Conference. 1–6.
[40]
Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. 29–42.
[41]
Richard C. Murphy, Kyle B. Wheeler, Brian W. Barrett, and James A. Ang. 2010. Introducing the Graph 500. Cray Users Group 19 (2010), 45–74.
[42]
Fabio Petroni, Leonardo Querzoni, Khuzaima Daudjee, Shahin Kamali, and Giorgio Iacoboni. 2015. HDRF: Stream-based partitioning for power-law graphs. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 243–252.
[43]
Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, and Seif Haridi. 2014. Distributed vertex-cut partitioning. In Distributed Applications and Interoperable Systems: 14th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems (DAIS’14), Held as Part of the 9th International Federated Conference on Distributed Computing Techniques (DisCoTec ’14). Springer, 186–200.
[44]
Shafiur Rahman, Mahbod Afarin, Nael Abu-Ghazaleh, and Rajiv Gupta. 2021. JetStream: Graph analytics on streaming data with event-driven hardware accelerator. In Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21). 1091–1105.
[45]
Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. 472–488.
[46]
Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid Aref, Marcelo Arenas, Maciej Besta, Peter A. Boncz, Khuzaima Daudjee, Emanuele Della Valle, Stefania Dumbrava, Olaf Hartig, Bernhard Haslhofer, Tim Hegeman, Jan Hidders, Katja Hose, Adriana Iamnitchi, Vasiliki Kalavri, Hugo Kapp, Wim Martens, M. Tamer Özsu, Eric Peukert, Stefan Plantikow, Mohamed Ragab, Matei R. Ripeanu, Semih Salihoglu, Christian Schulz, Petra Selmer, Juan F. Sequeda, Joshua Shinavier, Gábor Szárnyas, Riccardo Tommasini, Antonino Tumeo, Alexandru Uta, Ana Lucia Varbanescu, Hsiang-Yun Wu, Nikolay Yakovets, Da Yan, and Eiko Yoneki. 2021. The future is big graphs: A community view on graph processing systems. Commun. ACM 64, 9 (2021), 62–71.
[47]
Daniel Sanchez and Christos Kozyrakis. 2013. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Comput. Archit. News 41, 3 (2013), 475–486.
[48]
Dipanjan Sengupta, Narayanan Sundaram, Xia Zhu, Theodore L. Willke, Jeffrey Young, Matthew Wolf, and Karsten Schwan. 2016. GraphIn: An online high performance incremental graph processing framework. In Proceedings of the 22nd International Conference on Parallel and Distributed Computing: Parallel Processing (Euro-Par’16). Springer, 319–333.
[49]
Gagandeep Singh, Juan Gómez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, Sander Stuijk, Onur Mutlu, and Henk Corporaal. 2019. NAPEL: Near-memory computing application performance prediction via ensemble learning. In Proceedings of the 56th Annual Design Automation Conference 2019. 1–6.
[50]
Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks. 37–42.
[51]
Keval Vora, Rajiv Gupta, and Guoqing Xu. 2017. KickStarter: Fast and accurate computations on streaming graphs via trimmed approximations. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems. 237–251.
[52]
Jaewon Yang and Jure Leskovec. 2012. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics. 1–8.
[53]
Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, and Xuehai Qian. 2018. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’18). IEEE, 544–557.
