1 Introduction
In the real world, graphs are used to represent complex relationships between things [13, 27, 39, 61]. If a graph is updated dynamically through a stream of updates, and query results are updated incrementally as new updates are applied, it is usually called a streaming graph [10, 17, 24, 31, 45]. Graph updates include edge addition, edge deletion, edge weight change, vertex addition, and so on [45, 58].
Concurrent graph queries (CGQs) in a streaming graph must be re-executed every time the graph is updated; each batch of update operations on the graph is called an update stream. There are often a large number of CGQs in a streaming graph [20, 29, 46, 47, 57, 60]. CGQs usually represent multiple independent graph queries executed simultaneously on the same graph starting from different source vertices [59]. CGQs appear in many application scenarios. (1) Complex algorithms that issue independent queries from different sources on the same graph. For example, betweenness centrality randomly selects source vertices and then calls breadth-first search (BFS) for graph queries [19]. (2) End-to-end scenarios where multiple users issue graph queries simultaneously on the same platform. For example, X (formerly known as Twitter) has as many as 237.8 million daily active users, and the platform receives about 500 million tweets (update streams) per day [1, 11, 18, 44]. X needs to retrieve specific information for numerous users from a large and evolving social network. This indicates that there is a large demand for CGQs in streaming graphs.
Processing CGQs in streaming graphs incurs large memory access overhead and many redundant computations. The reason is that the graph queries in CGQs are processed independently of each other, and each query traverses the same graph structure along its own path; the same data is therefore repeatedly loaded from memory into the cache by different queries, resulting in inefficient cache usage and large memory access overhead. In addition, there are a large number of redundant computations in each graph query. In streaming graphs, an update to CGQs only involves part of the graph structure, so starting from the directly affected vertices greatly reduces the runtime. However, an irregular processing order of these directly affected vertices causes subsequent vertices to be computed multiple times and delays convergence to a stable state, resulting in a large number of redundant computations. Furthermore, the directly affected vertices are random and their data accesses are usually irregular, which incurs a large memory access overhead.
However, none of the existing graph processing systems can efficiently address the above problems. On the one hand, many graph processing systems are designed for a single graph query in streaming graphs. They reduce random data access overhead and accelerate convergence by preprocessing update streams [4], reusing historical information [15, 31, 45], regularizing propagation paths [58], accelerating state propagation [28, 53, 54, 55, 56], and prefetching data [33, 51, 58]. These systems cannot efficiently address the problem that different graph queries in CGQs repeatedly access the same data. On the other hand, most of the existing systems for processing CGQs target static graphs; that is, each graph query is processed from its source vertices. In streaming graphs, however, queries are often computed starting from the vertices directly affected by the update stream, which affects the state propagation of vertices and results in a large number of redundant computations [29, 47, 60].
To address the above issues, we analyze the processing of CGQs in streaming graphs and make two observations. First, in a streaming graph, the data processed by different graph queries overlaps substantially. Second, regularizing the processing order of the vertices directly affected by update streams according to the vertex dependencies can reduce redundant computations in each graph query. Based on these two observations, we propose a fine-grained data-centric Load-Trigger-Filter-Propagate (LTFP) model based on vertex dependencies. The LTFP model centers on fine-grained data: it selects the graph queries that meet the requirements in each iteration through vertex data, and utilizes the similarity between different graph queries to merge the same vertices they need to process, addressing the problem that different graph queries in CGQs repeatedly access the same data. Furthermore, it regularizes the processing order of vertices in each graph query of CGQs according to the dependence chain of the vertices, thus effectively reducing the redundant computations in each graph query. The dependence chain of vertices was first proposed by KickStarter [45] for dependence tracking, computation, and adjustment of active vertices. We also propose a CGQs-oriented scheduling strategy that increases the overlap between the data processed by different graph queries, thereby improving performance. Based on the LTFP model, we propose a hardware-software collaborative accelerator, PMGraph, for efficient processing of CGQs in streaming graphs. PMGraph mainly contains two hardware units: a dependency-aware data prefetch unit (DADP) and a hot vertex state coalescing unit (HVSC). When the software system needs to process graph data, PMGraph is invoked. The DADP unit prefetches the graph data required by CGQs based on the dependencies between vertices, hiding memory access latency. It also issues a single prefetch for the same vertices that multiple graph queries need to process in each iteration, reducing the number of prefetches and thus the data access overhead. During processing, HVSC coalesces the state data of frequently accessed hot vertices so that this state can be served efficiently, further reducing data access overhead.
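As a concrete illustration, one LTFP iteration in software can be sketched as follows. This is a minimal sketch under simplifying assumptions: the `VertexData` and `Query` types and the BFS-style relaxation rule stand in for PMGraph's actual per-query update functions.

```cpp
// Sketch of one LTFP iteration: each active vertex is loaded once and served
// to all queries that need it, instead of each query re-loading it separately.
#include <cstdint>
#include <vector>

struct VertexData { std::vector<uint32_t> out_neighbors; };

struct Query {
    std::vector<float> state;   // per-vertex state for this query
    std::vector<bool>  active;  // per-vertex active flag for this query
    bool needs(uint32_t v) const { return active[v]; }
    void update_and_activate(uint32_t v, const VertexData& d) {
        active[v] = false;
        for (uint32_t u : d.out_neighbors) {
            float cand = state[v] + 1.0f;  // BFS-style rule stands in for the
            if (cand < state[u]) {         // real per-query update function
                state[u] = cand;
                active[u] = true;
            }
        }
    }
};

void ltfp_iteration(const std::vector<VertexData>& graph,
                    const std::vector<uint32_t>& frontier,  // global active vertex set
                    std::vector<Query>& queries) {
    for (uint32_t v : frontier) {
        const VertexData& data = graph[v];   // Load: fetch v's graph data once
        for (Query& q : queries) {           // Trigger: offer v to every query
            if (!q.needs(v)) continue;       // Filter: keep queries that v affects
            q.update_and_activate(v, data);  // Propagate: update along v's out-edges
        }
    }
}
```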
To evaluate the effectiveness of PMGraph, we conduct experiments on a simulated 64-core processor. Compared with the state-of-the-art concurrent graph query software systems Kickstarter-C and Tripoline [21], PMGraph achieves speedups of 3.18\(\times\)\(\sim\)8.37\(\times\) (5.57\(\times\) on average) and 1.54\(\times\)\(\sim\)7.96\(\times\) (4.58\(\times\) on average), respectively. Compared with the state-of-the-art hardware accelerators Minnow [51], HATS [33], LCCG [59], and JetStream [38], PMGraph achieves speedups of 2.57\(\times\)\(\sim\)4.46\(\times\) (3.65\(\times\) on average), 2.36\(\times\)\(\sim\)4.22\(\times\) (3.41\(\times\) on average), 1.33\(\times\)\(\sim\)2.14\(\times\) (1.73\(\times\) on average), and 1.3\(\times\)\(\sim\)1.52\(\times\) (1.38\(\times\) on average), respectively.
Our main contributions are as follows:
— We propose a software-hardware collaborative accelerator, PMGraph, which efficiently processes CGQs in streaming graphs. PMGraph centers on fine-grained data: it selects the graph queries that meet the requirements in each iteration through vertex data, and utilizes the similarity between different graph queries to merge the same vertices they need to process, addressing the problem that different graph queries repeatedly access the same data. Furthermore, it adopts an update strategy that regularizes the processing order of vertices in each graph query of CGQs according to the dependence chain of the vertices, thus effectively reducing the redundant computations in each graph query.
— We propose a CGQs-oriented scheduling strategy to increase the overlap between data processed by different graph queries, thereby further improving the performance of processing CGQs.
— PMGraph prefetches the vertex information to be processed by CGQs according to the global active vertex set Frontier of all graph queries, hiding memory access latency while processing CGQs. It also issues a single prefetch for the same vertices that multiple graph queries need to process in each iteration, reducing data access overhead.
— Our evaluation shows that, compared with the state-of-the-art software systems and hardware accelerators for CGQs on streaming graphs, PMGraph delivers better performance and fewer cache misses.
The remainder of this article is organized as follows. Section 2 reviews the background and our motivation. Section 3 introduces our proposed PMGraph. The evaluation is presented in Section 4. Section 5 describes related work. Finally, Section 6 concludes.
2 Background and Motivation
Existing streaming graph systems often determine the vertices directly affected by update streams according to the vertex dependencies, and traverse the graph structure from these vertices to update and propagate state until all vertices obtain correct results. The dependence relationship between vertices captures the order in which active vertices are tracked: for each vertex, dependence tracking relies on a single incoming neighbor, namely the in-neighbor whose value ultimately determines the state of that vertex. When processing CGQs, different graph queries independently traverse the same graph structure along different paths. Thus, different graph queries repeatedly access the same data, which results in a large amount of memory access overhead. Furthermore, the multi-source traversal mode of each graph query results in a large number of redundant computations.
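For concreteness, single-parent dependence tracking can be sketched as below; the field names and the shortest-path-style relaxation are illustrative assumptions in the spirit of KickStarter-like systems, not any system's exact code.

```cpp
// Sketch of single-parent dependence tracking: besides its value, each vertex
// records the one in-neighbor whose contribution currently determines it.
#include <cstdint>
#include <vector>

struct DependenceState {
    std::vector<float>    value;   // current state of each vertex
    std::vector<uint32_t> parent;  // in-neighbor that determined value[v]
};

// Relaxation along edge u -> v with weight w: u becomes v's dependence parent
// only when u's contribution actually determines v's new state.
inline bool relax(DependenceState& s, uint32_t u, uint32_t v, float w) {
    float cand = s.value[u] + w;
    if (cand < s.value[v]) {
        s.value[v] = cand;
        s.parent[v] = u;   // v now depends on u; chains of parents form the
        return true;       // dependence chains used throughout this article
    }
    return false;
}
```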
2.1 Large Memory Access Overhead in CGQs on Streaming Graphs
When processing CGQs in a streaming graph, each graph query obtains the vertices directly affected by update streams according to its own vertex dependencies. Because these dependencies differ between graph queries, the set of vertices directly affected by the update stream differs between queries. However, since all graph queries receive the same update stream, the sets of affected vertices of different graph queries overlap heavily. Processing CGQs in the streaming graph starts from the vertices directly affected by the update stream and traverses along each query's own path; the queries pass through many of the same vertices, reflecting the high degree of overlap between the data processed by different graph queries.
To evaluate the data similarity when traversing CGQs, we run 64 graph queries on five classical datasets [2, 5, 25, 49, 50]. The experimental results are shown in Figure 1, where the X-axis represents intervals of the number of queries. Figure 1 shows that 80%, 59%, 81%, 72%, and 92% of all vertices accessed by graph queries on the datasets Google [25], Topcats [49], LJ [2], WP [50], and UK [5], respectively, are accessed by the majority of the graph queries (i.e., by most of the 64 queries). Although graph queries access many of the same vertices, each graph query is independent and updates and propagates along its own path. Thus, the times at which they access the same vertices often differ, so the same vertices are loaded frequently, resulting in a large amount of memory access overhead.
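The overlap statistic above is straightforward to compute; below is a sketch under an assumed per-query access-bitmap layout (the function and the bucketing scheme are illustrative, not the exact measurement harness).

```cpp
// Sketch of computing the access-overlap statistic: count, per vertex, how
// many of the queries touched it, then bucket vertices by that count.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<size_t> overlap_histogram(
        const std::vector<std::vector<bool>>& accessed,  // [query][vertex]
        size_t num_buckets) {
    size_t num_queries = accessed.size();
    size_t num_vertices = accessed.empty() ? 0 : accessed[0].size();
    std::vector<size_t> hist(num_buckets, 0);
    for (size_t v = 0; v < num_vertices; ++v) {
        size_t count = 0;
        for (size_t q = 0; q < num_queries; ++q)
            if (accessed[q][v]) ++count;
        if (count == 0) continue;  // vertex never accessed by any query
        size_t bucket = (count - 1) * num_buckets / num_queries;
        hist[bucket] += 1;         // e.g., 8 buckets of 8 queries each for 64 queries
    }
    return hist;
}
```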
2.2 Large Numbers of Redundant Computations in CGQs on Streaming Graphs
There is very little work on CGQs in streaming graphs [20, 21, 29, 46, 57, 59, 60]. These systems reuse query results of high-degree vertices to speed up convergence [20, 21, 46, 59], but they are not optimized for the redundant computation problem. To improve the performance of processing CGQs, existing general-purpose static graph systems for CGQs partition the graph to manage the processing order of vertices and process them partition by partition [9, 29, 57, 60]. When the CGQs finish executing in one partition, they proceed to the next partition. In this manner, the data required by the CGQs currently being processed remains in the LLC and is not evicted. However, this kind of method for CGQs on static graphs prevents updated vertex values from propagating in real time, causing a large number of redundant computations in each graph query. The redundant computation problem is more prominent when processing CGQs in streaming graphs. The reason is that streaming graph processing takes the directly affected vertices as the sources of state propagation, so the update of each graph query in CGQs can be regarded as a multi-source graph query. The state propagations from the different sources within a graph query are not independent and may affect the query's final result. Because vertex states are constrained by the vertex dependencies, an irregular processing order in this single-query multi-source mode generates even more redundant computations.
We introduce an example of processing two graph queries in a streaming graph. There are nine vertices and two queries (Query(A) and Query(G)) in Figure 2(a), where A and G are the source vertices of the two graph queries, respectively. Assume that the vertices directly affected by the update stream are C and F in the first query and G and H in the second query. Without loss of generality, assume that active vertices are processed in priority order from A to I. The red arrows in the figure indicate the state determination relationships between the vertices of the query with A as the source vertex, and the black arrows indicate the state determination relationships between the vertices of the query with G as the source vertex. A state determination relationship between vertices means that the state of each vertex except the source is ultimately determined by its parent vertex in the vertex dependence chain. When processing the update stream, the first query (Query(A)) activates and updates vertices D and E from vertices C and F. The second query (Query(G)) activates and updates vertices B, H, and I from vertices G and H. According to this rule, the vertices computed in each round of the two queries are shown in Figure 3(a). It takes a total of 16 computations for the two graph queries to reach a stable state, and redundant computations occur. We next address this redundant computation problem using vertex dependencies and compare the numbers of computations.
The update strategy based on vertex dependencies addresses the redundant computation problem of each graph query in CGQs. The approach is to process vertices according to the vertex dependence chain. The vertex dependence chain [22, 45] is established through the dependence relationships between vertices: apart from the source vertex, the parent of a vertex on the dependence chain is the in-neighbor whose value ultimately determines that vertex. Vertices at the front of each dependence chain are processed with higher priority than vertices at the end, so the order of vertices on the dependence chain determines the processing order of vertices in each graph query. If each dependence chain updates vertices in order from front to end, then every vertex needs to be visited only once to reach a stable state, thereby reducing redundant computations. Thus, we regularize the processing order of vertices in each graph query according to the order of the dependence chain, effectively reducing redundant computations in each graph query. Furthermore, we utilize the dependence chain to determine the iteration round of each vertex, which is used to increase the overlap between the data processed by different graph queries, as described in Section 3. For the two graph queries in Figure 2(a), we build dependence chains following KickStarter [45]; the two resulting chains are shown in Figure 2(b). The left graph in Figure 2(b) is the dependence chain of the vertices for the query with A as the source vertex, and the right graph is the dependence chain for the query with G as the source vertex. We take Figures 2(b) and 3(b) as an example to briefly describe this update strategy. We process the same two queries as in the previous example and select the same vertices directly affected by the update stream on the same graph (Figure 2(a)): C and F in the first query, and G and H in the other. As shown in Figure 3(b), the strategy first selects the vertices C and G with the smallest levels among the directly affected vertices in the dependence chains of each graph query. It then processes the vertices of each level in sequence along the dependence chains, performing updates and activations. This approach takes 12 computations, 4 fewer than the method above. If vertices C and F in Figure 2(b) are not on the same path (neither is an ancestor of the other) or their propagation paths do not overlap, the state propagations from C and F generate no extra redundant computations; in such cases no extra operations are needed, and our update strategy proceeds as normal.
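A minimal sketch of this level-ordered update is given below, assuming a precomputed per-vertex `level` array (depth on the dependence chain) and a shortest-path-style update rule; the priority queue is one illustrative way to realize front-to-end chain order.

```cpp
// Sketch of the dependence-chain-ordered update: directly affected vertices
// enter a min-heap keyed by their level on the dependence chain, so vertices
// nearer the front of a chain are always processed before their descendants.
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Edge { uint32_t dst; float w; };

void chain_ordered_update(const std::vector<std::vector<Edge>>& out_edges,
                          const std::vector<uint32_t>& level,      // depth on chain
                          std::vector<float>& state,
                          const std::vector<uint32_t>& affected) { // directly affected
    using Item = std::pair<uint32_t, uint32_t>;                    // (level, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (uint32_t v : affected) pq.push({level[v], v});
    while (!pq.empty()) {
        uint32_t v = pq.top().second;
        pq.pop();
        for (const Edge& e : out_edges[v]) {
            float cand = state[v] + e.w;   // shortest-path-style relaxation
            if (cand < state[e.dst]) {     // improved value: update the target
                state[e.dst] = cand;       // and re-activate it at its own level
                pq.push({level[e.dst], e.dst});
            }
        }
    }
}
```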
2.3 Motivation
To address the above problems, we utilize the similarity between different graph queries to merge the same vertices they need to process, addressing the problem that different graph queries in CGQs repeatedly access the same data and thereby reducing memory access overhead. We also adopt an update strategy that regularizes the update order of vertices in each graph query through the vertex dependencies, thereby reducing redundant computations. Based on these strategies, we propose a fine-grained data-centric concurrent execution model to process CGQs in streaming graphs.
Concurrent Execution Model Centered on Fine-grained Data: Existing execution models for CGQs mainly fall into two categories: graph query-centric models [20, 36, 46] and block data-centric models [29, 47, 57]. A graph query-centric model loads each graph query together with its related data, which facilitates vertex scheduling for that query. However, because different graph queries are similar, processing CGQs this way repeatedly accesses the same vertices, resulting in a large data access overhead. A block data-centric model exploits the similarity between the data processed by different graph queries to partition the graph: graph data is loaded sequentially in units of partitions, and then all CGQs are triggered to process the data in the current partition. However, this partition-based execution mode causes a large number of redundant computations in CGQs on streaming graphs and is therefore unsuitable for them. Thus, we employ the similarity between different graph queries together with the update strategy based on vertex dependencies to reduce both the memory access overhead and the redundant computations in each graph query, in order to process CGQs in streaming graphs efficiently.
We propose a fine-grained data-centric LTFP model based on vertex dependencies, which uses the similarity between different graph queries to schedule CGQs around vertex data. It first loads the data of an active vertex, then triggers the CGQs, filters the qualified graph queries based on the vertex data, and reduces redundant computations in each graph query based on vertex dependencies. Our model provides a unified processing scheme for CGQs in streaming graphs. In the model, we employ the global active vertex set Frontier to prefetch the vertex information to be processed in the current iteration, hiding memory access latency. The model also provides prefetching for the same vertices processed by different graph queries in each iteration, reducing data access overhead.
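In software, the Frontier-driven prefetching can be approximated as follows; `PREFETCH_DIST` and the CSR-style array layout are assumptions of this sketch, and in PMGraph the DADP unit performs this in hardware rather than with explicit prefetch instructions.

```cpp
// Sketch of Frontier-driven prefetching: while processing frontier[i], issue
// software prefetches (GCC/Clang builtin) for a vertex a fixed distance ahead.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t PREFETCH_DIST = 8;  // assumed tuning knob

void process_with_prefetch(const std::vector<uint32_t>& frontier,
                           const std::vector<uint64_t>& row_offsets,  // CSR offsets
                           const std::vector<float>& vertex_state) {
    for (size_t i = 0; i < frontier.size(); ++i) {
        if (i + PREFETCH_DIST < frontier.size()) {
            uint32_t ahead = frontier[i + PREFETCH_DIST];
            __builtin_prefetch(&row_offsets[ahead]);   // edge offsets of a future vertex
            __builtin_prefetch(&vertex_state[ahead]);  // its state data
        }
        // ... process frontier[i] for all queries that need it (see the LTFP
        // sketch in Section 1) ...
    }
}
```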
4 Evaluation
In this section, we evaluate the effectiveness of our proposed PMGraph.
4.1 Experimental Setup
Since there is no concurrent graph query system for streaming graphs that starts from the vertices directly affected by the update stream, we build a concurrent graph query system, Kickstarter-C, as a baseline and compare it with our proposed PMGraph. Kickstarter-C extends the state-of-the-art single streaming graph query processing engine KickStarter [45] to CGQs in streaming graphs. Since the original KickStarter does not support CGQs on the same graph, we make a minor extension so that it supports CGQs in the streaming graph, with all engines sharing the same graph. We also compare PMGraph with Tripoline [21], the state-of-the-art concurrent graph query processing system for streaming graphs (retaining the graph query results of the top-4 high-degree vertices). Besides, we integrate the state-of-the-art graph processing hardware accelerators HATS [33] and Minnow [51] into Kickstarter-C for CGQs on streaming graphs. We also implement the two core functions of LCCG [59], the state-of-the-art hardware accelerator for concurrent queries on static graphs, namely regularizing the traversal paths of graph queries and prefetching data, and extend the state-of-the-art streaming graph hardware accelerator JetStream [38] to support concurrent queries. We then compare PMGraph with Minnow, HATS, LCCG, and JetStream, respectively.
Applications. We select five classic graph queries for the experiments: BFS [12], SSSP [32], single source widest path (SSWP) [45], single source narrowest path (SSNP) [21], and Radii [42]. These queries traverse the graph structure from a source and compute the state data of the vertices. Applications such as PageRank [35] do not have a specific source and are therefore outside the scope of this article.
Datasets. We evaluate PMGraph on five classic real-world graphs: Google [25], Topcats [49], LJ [2], WP [50], and UK [5]. Their basic properties are summarized in Table 1. The number of vertices ranges from 0.8 M to 18.5 M, and the number of edges from 5 M to 298 M. Similar to existing solutions [45, 58], we first load 70% of the edges as the initial graph. We then use the remaining 30% of the edges as added edges in the update stream, while the deleted edges in the update stream are selected randomly as 30% of the initial graph. For each dataset, we randomly select 64 sources for graph queries at runtime; the degree of each query's source vertex is larger than the average vertex degree of the dataset.
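A sketch of this setup follows; the text above fixes only the 70%/30% proportions and random selection, so the shuffle, seed, and helper name below are illustrative assumptions.

```cpp
// Sketch of building the initial graph and update stream from an edge list:
// 70% of edges form the initial graph, the remaining 30% become additions,
// and deletions are a random 30% sample of the initial graph.
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using EdgeList = std::vector<std::pair<uint32_t, uint32_t>>;

void split_edges(EdgeList all_edges, EdgeList& initial,
                 EdgeList& additions, EdgeList& deletions) {
    std::mt19937 rng(42);  // assumed fixed seed for reproducibility
    std::shuffle(all_edges.begin(), all_edges.end(), rng);
    size_t cut = all_edges.size() * 7 / 10;                     // 70% initial graph
    initial.assign(all_edges.begin(), all_edges.begin() + cut);
    additions.assign(all_edges.begin() + cut, all_edges.end()); // 30% added edges
    deletions = initial;                                        // random 30% of the
    std::shuffle(deletions.begin(), deletions.end(), rng);      // initial graph as
    deletions.resize(initial.size() * 3 / 10);                  // deleted edges
}
```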
Simulation Platform. Similar to previous methods [33, 59], we use Sniper [6] to simulate a 64-core processor with its default Beckton microarchitecture configuration; the specific parameters are listed in Table 2. Each out-of-order core has private L1 and L2 caches and shares the last-level cache. We implement PMGraph in the simulated processor. The program is compiled with GCC 7.2 with the O3 flag. Similar to previous methods [33, 58], we employ an EDA tool (Design Compiler 2016) and CACTI [23] to estimate the power and area overhead of the hardware units in PMGraph.
4.2 Comparison with Different Software Approaches
We compare the experimental results of the state-of-the-art concurrent graph query software systems Kickstarter-C and Tripoline, our pure-software approach (PMGraph-S), our approach containing only the DADP hardware component (PMGraph-D), and our proposed hardware accelerator (PMGraph) on the above five datasets. PMGraph-S is the pure software implementation of PMGraph; PMGraph-D enables only DADP, without HVSC; and PMGraph enables both DADP and HVSC.
Software Runtime Comparison: Figure 9 shows the runtime of the various methods normalized to Kickstarter-C. Our software approach PMGraph-S outperforms Kickstarter-C, gaining speedups of 1.01\(\times\)\(\sim\)3.12\(\times\) (1.71\(\times\) on average). Compared with Tripoline, PMGraph-S achieves an average speedup of 1.38\(\times\). The reason is that PMGraph-S utilizes the similarity between different graph queries to address the problem that different graph queries repeatedly access the same data, thereby reducing memory access overhead. We regularize the processing order of vertices in each graph query according to the dependence chain of vertices, reducing redundant computations. Our scheduling strategy also increases the data overlap when different graph queries are processed, further improving performance.
We also compare our software approach combined with the hardware components against the other software methods. PMGraph-D offloads the graph query filtering and graph traversal of PMGraph-S to the DADP hardware unit, thereby reducing the memory access overhead caused by the large number of judgment operations and irregular data accesses in the software-only PMGraph-S. The runtime of PMGraph-D is 23%\(\sim\)48% less than that of PMGraph-S. Compared with Kickstarter-C and Tripoline, PMGraph achieves speedups of 3.18\(\times\)\(\sim\)8.37\(\times\) (5.57\(\times\) on average) and 1.54\(\times\)\(\sim\)7.96\(\times\) (4.58\(\times\) on average), respectively, and outperforms PMGraph-S. PMGraph performs best on highly skewed power-law graphs: when the degree distribution of vertices is highly skewed, more vertex data is shared between different queries, reducing memory access overhead. These comparison results show that PMGraph outperforms the other methods, indicating that our proposed PMGraph is effective.
Cache Miss Comparison: Figure 10 shows the LLC misses of the different methods relative to Kickstarter-C. Compared with Kickstarter-C and Tripoline, PMGraph-S (the pure-software approach) reduces LLC misses by 11.4% and 1.56% on average, respectively. The reason is that PMGraph-S centers on fine-grained data so that different graph queries in CGQs share the vertex data that needs to be processed, reducing memory access overhead. Compared with PMGraph-S, PMGraph-D reduces LLC misses by 11.4% on average, because PMGraph-D adds dependency-aware data prefetching on top of PMGraph-S, loading the data to be processed by CGQs into the cache in advance and reducing the random memory accesses caused by frequent judgment operations in software. Compared with PMGraph-D, PMGraph reduces LLC misses by 16.88% on average, because PMGraph merges the state data of hot vertices across all queries into a contiguous memory region: when a hot vertex's data is loaded into the cache, the other data related to that hot vertex is loaded along with it, reducing the cache miss rate.
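Our reading of this coalesced layout can be sketched as follows; the slot mapping, sentinel handling, and the 64-query constant are illustrative assumptions about the memory layout, not the hardware's RTL.

```cpp
// Sketch of hot-vertex state coalescing: for each hot vertex, the states of
// all queries are stored contiguously, so one cache line brings in several
// queries' states for that vertex instead of one state per line.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t NUM_QUERIES = 64;

// Query-major layout: state[q][v]; two queries touching the same vertex v
// access two far-apart allocations, wasting cache capacity.
using QueryMajorState = std::vector<std::vector<float>>;

// Vertex-major (coalesced) layout for hot vertices only.
struct CoalescedHotState {
    std::vector<uint32_t> hot_slot;   // vertex id -> hot slot h (or a sentinel)
    std::vector<float>    hot_state;  // NUM_QUERIES consecutive entries per slot
    float& state(uint32_t v, size_t q) {
        return hot_state[size_t(hot_slot[v]) * NUM_QUERIES + q];
    }
};
```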
Comparison of the Total Number of Vertex Updates: Figure 11 shows the total number of state updates of all vertices during CGQs for Kickstarter-C, Tripoline, and PMGraph-S, normalized to Kickstarter-C. PMGraph-S reduces the total number of vertex updates by an average of 37% compared with Kickstarter-C and 41% compared with Tripoline. The reason is that PMGraph-S uses the order of the vertex dependence chains to regularize the processing order of vertices within each graph query in CGQs, which effectively reduces redundant computations within each query. In addition, when the graph is updated, Tripoline starts computation from the source vertices rather than from the vertices directly affected by the update stream, so its total number of vertex updates is large. These results show that our proposed PMGraph-S effectively reduces redundant computations in CGQs in the streaming graph by utilizing the update strategy based on vertex dependencies.
4.3 Comparison with Hardware Accelerators
We compare PMGraph with the state-of-the-art hardware accelerators and normalize their runtime to that of Minnow. The comparison results are shown in Figures 12 and 13. Compared with Minnow, HATS, LCCG, and JetStream, PMGraph achieves speedups of 2.57\(\times\)\(\sim\)4.46\(\times\) (3.65\(\times\) on average), 2.36\(\times\)\(\sim\)4.22\(\times\) (3.41\(\times\) on average), 1.33\(\times\)\(\sim\)2.14\(\times\) (1.73\(\times\) on average), and 1.3\(\times\)\(\sim\)1.52\(\times\) (1.38\(\times\) on average), respectively, and reduces LLC misses by 15.68%, 9.36%, 2.16%, and 3.76% on average, respectively. The reason is that Minnow, HATS, and JetStream are hardware accelerators designed for a single graph query and do not consider the problem that different graph queries in CGQs repeatedly access the same data; thus, they do not work well for CGQs. LCCG is designed for CGQs on static graphs and recomputes from the source vertices, which causes a large computation overhead.
Area Cost and Power Consumption: The area occupied by PMGraph is mainly that of its two hardware units (DADP and HVSC). We implement the hardware units in Verilog RTL and synthesize them with an EDA tool under a 40 nm process at the same target frequency. Experimental results show that the power consumption and area of PMGraph are 692.19 mW and 0.077 \(mm^2\), respectively, both of which are very small.
These experimental results show that PMGraph efficiently addresses the large numbers of redundant computations and the high memory access overhead of CGQs in streaming graphs with very small area overhead, demonstrating that our proposed PMGraph is effective.
4.4 Sensitivity Analysis
Stability: To evaluate the stability of our proposed PMGraph, we conduct experiments on the dataset LJ using the SSSP algorithm to explore the impact of update streams of different sizes (from 10 K to 10 M) on the runtime, and compare PMGraph with the existing classic software and hardware methods. The experimental results, normalized to Minnow, are shown in Figure 14. As the size of the update stream increases, the runtime of PMGraph remains better than that of the other methods and achieves a higher speedup. The reason is that PMGraph provides merged processing and prefetching for the vertices that different queries need to process; when the update stream grows, more vertex data is shared between different queries, increasing the reuse of data in the cache.
Scalability: To evaluate the scalability of PMGraph, we perform experiments on the dataset LJ using the SSSP algorithm to explore the impact of different LLC sizes (from 64 M to 256 M) on the runtime, and compare PMGraph with the classic hardware accelerators Minnow, HATS, LCCG, and JetStream. The experimental results, normalized to Minnow, are shown in Figure 15. As the LLC size increases, PMGraph remains superior to the other accelerators and achieves the best speedup. The reason is that a larger LLC caches more useful data, which can be fully utilized during concurrent queries, thereby improving performance.
5 Related Work
Previous work on graph queries can be roughly divided into three categories: single graph query processing, CGQs processing, and hardware schemes for graph queries.
Single Graph Query Processing. Many graph processing systems were proposed for a single graph query in streaming graphs. KickStarter [45] identified the vertices directly affected by update streams according to vertex dependencies and accelerated the iterative computations by approximating intermediate values. GraphBolt [31] also utilized intermediate values and vertex dependencies for incremental propagation and computation. RisGraph [16] proposed localized data access and inter-update parallelism to achieve high throughput and low latency simultaneously; however, this method mainly targets per-update incremental analysis, and its performance degrades when a large batch of updates is applied at once. DZiG [30] proposed a sparsity-aware incremental processing technique to optimize GraphBolt. GraPU [41] precomputed updates in the buffer to speed up convergence. However, all of these methods focus on processing a single graph query in streaming graphs; they do not consider data reuse and cannot efficiently process CGQs. Exploiting the similarity between the data processed by different graph queries, our PMGraph schedules CGQs centered on fine-grained data, enabling CGQs to share vertex data and reduce data access overhead.
CGQs Processing. Many graph processing systems have been proposed for CGQs. One category targets static graphs. SimGQ [46] and VRGQ [20] reused the query results of high-degree source vertices to speed up other graph queries. GraphM [60] proposed a storage system for CGQs that divided the graph into multiple partitions and gave the highest processing priority to the partition accessed by the largest total number of graph queries. ForkGraph [29] proposed a cache-efficient processing system for multiple graph queries that divided the graph into LLC-sized partitions and buffered and executed multiple queries per partition. CGraph [57] proposed a subgraph-based scheduling algorithm to effectively reduce the disk access cost. These methods of dividing a graph into partitions or subgraphs resulted in a large amount of redundant computation. Krill [8] treated all CGQs as a whole to share edge data and accelerate computation. Glign [50] improved on Krill [8] by aligning high-degree vertices, thereby increasing the overlap of the vertex data processed by different graph queries. However, all of these systems target queries on static graphs and process them from the source vertices, which causes a large computation overhead; thus, they are not suitable for CGQs in streaming graphs.
The other category targets streaming graphs. EGraph [52] proposed a load-processing-switching execution model that effectively reduced CPU-GPU data transmission overhead by exploiting the structural similarity of streaming graphs at different times, thereby accelerating graph queries; however, this method is optimized for GPUs. Tripoline [21] proposed a general processing mode for CGQs that utilized the results of high-degree standing queries to accelerate other queries; however, it is not optimized for the redundant computations within graph queries. Furthermore, some methods do not start from the affected vertices and therefore incur expensive computation costs. Different from these methods, our proposed PMGraph utilizes the dependencies of vertices, schedules CGQs centered on fine-grained data, and addresses both the repeated accesses to the same data by different graph queries in CGQs and the redundant computations within each graph query.
Hardware Schemes for Graph Query. Several hardware accelerators have been proposed for processing graph queries. GraphDynS [48] and TDGraph [58] adopted a hardware-software co-design approach to address the irregularity of graph queries. Some accelerators utilized in-memory computing to reduce the memory access and data transmission costs of graph queries [34, 62]. In addition, some accelerators used ReRAM to massively parallelize vector- or matrix-based graph processing [7, 26, 40, 43]. Furthermore, event-driven accelerators such as JetStream [38] and GraphPulse [37] have also been proposed to accelerate graph queries. However, these hardware accelerators mainly target a single graph query; when they are employed for multiple graph queries, different graph queries repeatedly access the same data. PMGraph schedules CGQs centered on fine-grained data according to the similarity between different graph queries, eliminating a large amount of repeated access to the same data. Accelerators for multiple queries have also been proposed. LCCG [59] provided a topology-aware hardware accelerator that processes CGQs by regularizing the traversal order of graph queries. However, it targets static graphs: it starts processing from the source vertices, which causes a large computation overhead, does not optimize the redundant computations within each graph query, and is therefore not suitable for CGQs in streaming graphs.
Besides, some hardware techniques have been proposed for graph queries. GRASP [14] maximized the reuse of vertices in graph queries by protecting frequently used high-degree vertices from cache thrashing. P-OPT [3] predicted when a vertex would next be processed, using the transpose of the graph, to guide cache replacement for graph queries. Some hardware prefetchers have also been proposed to hide data access latency in graph queries [33, 51]. However, these hardware techniques mainly target a single graph query and do not consider the problem that different graph queries in CGQs repeatedly access the same data. Different from previous accelerators, our proposed PMGraph is based on the LTFP model: it utilizes the dependence chain of the vertices and adopts the update strategy that regularizes the processing order of vertices in each graph query of CGQs, effectively reducing redundant computations in each graph query. PMGraph also provides shared access to the same vertices processed by different graph queries, reducing memory access overhead, and prefetches the same vertices processed by different graph queries, hiding data access latency.
6 Conclusion
We propose PMGraph, a software-hardware co-designed accelerator, to efficiently address the problems of massive redundant computation and high memory access overhead in CGQs on streaming graphs. First, PMGraph selects the graph queries that meet the requirements through vertex data and utilizes the similarity between different graph queries to merge the same vertices they need to process, reducing memory access overhead. Furthermore, it adopts the update strategy that regularizes the processing order of vertices in each graph query of CGQs according to the order of the vertex dependence chain, reducing redundant computations in each graph query. Second, PMGraph utilizes a CGQs-oriented scheduling strategy to increase the data overlap when different graph queries are processed, thereby improving performance. Finally, PMGraph prefetches the vertex information for CGQs, hiding data access latency, and provides prefetching for the same vertices processed by different graph queries, reducing data access overhead. Compared with the state-of-the-art streaming graph concurrent query software systems Kickstarter-C and Tripoline, and the state-of-the-art hardware accelerators Minnow, HATS, LCCG, and JetStream, PMGraph achieves average speedups of 5.57\(\times\), 4.58\(\times\), 3.65\(\times\), 3.41\(\times\), 1.73\(\times\), and 1.38\(\times\), respectively, which demonstrates the effectiveness of our proposed PMGraph for CGQs in streaming graphs.