1 Introduction
In the real world, graphs are used to represent complex relationships between things [13, 27, 39, 61]. If a graph is updated dynamically through a stream of updates, and query results are updated incrementally as new updates are applied, it is usually called a streaming graph [10, 17, 24, 31, 45]. Graph updates include edge addition, edge deletion, edge weight change, vertex addition, and so on [45, 58].
Concurrent graph queries (CGQs) in a streaming graph must be re-executed every time the graph is updated; each batch of update operations on the graph is called an update stream. There are often a large number of CGQs in a streaming graph [20, 29, 46, 47, 57, 60]. CGQs usually represent multiple independent graph queries executed simultaneously on the same graph starting from different source vertices [59]. CGQs appear in many application scenarios. (1) Complex algorithms that issue independent queries from different sources on the same graph. For example, betweenness centrality randomly selects source vertices and then calls breadth-first search (BFS) for graph queries [19]. (2) End-to-end scenarios where multiple users issue graph queries simultaneously on the same platform. For example, X (formerly known as Twitter) has as many as 237.8 million daily active users, and the platform receives about 500 million tweets (update streams) per day [1, 11, 18, 44]. X needs to retrieve specific information for numerous users from a large and evolving social network. This indicates that there is a large demand for CGQs in streaming graphs.
Processing CGQs in streaming graphs incurs large memory access overhead and many redundant computations. The reason is that the graph queries in CGQs are processed independently of each other, and each query traverses the same graph structure along its own path; the same data is therefore repeatedly loaded from memory into the cache by different queries, resulting in inefficient cache usage and large memory access overhead. In addition, there are a large number of redundant computations in each graph query. In streaming graphs, an update to CGQs only involves part of the graph structure, so starting from the directly affected vertices greatly reduces the runtime. However, an irregular processing order of these directly affected vertices causes subsequent vertices to be computed multiple times and delays convergence to a stable state, resulting in a large number of redundant computations. Furthermore, the directly affected vertices are random and their data accesses are usually irregular, which incurs a large memory access overhead.
However, none of the existing graph processing systems can efficiently address the above problems. On the one hand, many graph processing systems are designed for a single graph query in streaming graphs. They reduce random data access overhead and accelerate convergence by preprocessing update streams [4], reusing historical information [15, 31, 45], regularizing propagation paths [58], accelerating state propagation [28, 53, 54, 55, 56], and prefetching data [33, 51, 58]. These systems cannot efficiently address the problem that different graph queries in CGQs repeatedly access the same data. On the other hand, most of the existing systems for processing CGQs target static graphs; that is, each graph query is processed from its source vertices. In streaming graphs, however, queries are often computed starting from the vertices directly affected by the update stream, which affects the state propagation of vertices and results in a large number of redundant computations [29, 47, 60].
To address the above issues, we analyze the processing of CGQs in streaming graphs and make two observations. First, in a streaming graph, the data processed by different graph queries overlaps substantially. Second, regularizing the processing order of the vertices directly affected by update streams according to the vertex dependencies can reduce redundant computations in each graph query. Based on these two observations, we propose a fine-grained data-centric Load-Trigger-Filter-Propagate (LTFP) model based on vertex dependencies. The LTFP model centers on fine-grained data: it selects the graph queries that meet the requirements in each iteration through vertex data, and utilizes the similarity between different graph queries to merge the same vertices they need to process, addressing the problem that different graph queries in CGQs repeatedly access the same data. Furthermore, it regularizes the processing order of vertices in each graph query of CGQs according to the dependence chain of the vertices, thus effectively reducing the redundant computations in each graph query. The dependence chain of vertices was first proposed by KickStarter [45] for dependence tracking, computation, and adjustment of active vertices. We also propose a CGQs-oriented scheduling strategy that increases the overlap between the data processed by different graph queries, thereby improving performance. Based on the LTFP model, we propose a hardware-software collaborative accelerator, PMGraph, for efficient processing of CGQs in streaming graphs. PMGraph mainly contains two hardware units: a dependency-aware data prefetch unit (DADP) and a hot vertex state coalescing unit (HVSC). When the software system needs to process graph data, PMGraph is invoked. The DADP unit prefetches the graph data required by CGQs based on the dependencies between vertices, hiding memory access latency. It also issues a single prefetch for the same vertices that multiple graph queries need to process in each iteration, reducing the number of prefetches and thus the data access overhead. During processing, HVSC coalesces the state data of frequently accessed hot vertices so that this state can be served efficiently, further reducing data access overhead.
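As a concrete illustration, one LTFP iteration in software can be sketched as follows. This is a minimal sketch under simplifying assumptions: the `VertexData` and `Query` types and the BFS-style relaxation rule stand in for PMGraph's actual per-query update functions.

```cpp
// Sketch of one LTFP iteration: each active vertex is loaded once and served
// to all queries that need it, instead of each query re-loading it separately.
#include <cstdint>
#include <vector>

struct VertexData { std::vector<uint32_t> out_neighbors; };

struct Query {
    std::vector<float> state;   // per-vertex state for this query
    std::vector<bool>  active;  // per-vertex active flag for this query
    bool needs(uint32_t v) const { return active[v]; }
    void update_and_activate(uint32_t v, const VertexData& d) {
        active[v] = false;
        for (uint32_t u : d.out_neighbors) {
            float cand = state[v] + 1.0f;  // BFS-style rule stands in for the
            if (cand < state[u]) {         // real per-query update function
                state[u] = cand;
                active[u] = true;
            }
        }
    }
};

void ltfp_iteration(const std::vector<VertexData>& graph,
                    const std::vector<uint32_t>& frontier,  // global active vertex set
                    std::vector<Query>& queries) {
    for (uint32_t v : frontier) {
        const VertexData& data = graph[v];   // Load: fetch v's graph data once
        for (Query& q : queries) {           // Trigger: offer v to every query
            if (!q.needs(v)) continue;       // Filter: keep queries that v affects
            q.update_and_activate(v, data);  // Propagate: update along v's out-edges
        }
    }
}
```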
To evaluate the effectiveness of PMGraph, we conduct experiments on a simulated 64-core processor. Compared with the state-of-the-art concurrent graph query software systems Kickstarter-C and Tripoline [21], PMGraph achieves speedups of 3.18\(\times\)\(\sim\)8.37\(\times\) (5.57\(\times\) on average) and 1.54\(\times\)\(\sim\)7.96\(\times\) (4.58\(\times\) on average), respectively. Compared with the state-of-the-art hardware accelerators Minnow [51], HATS [33], LCCG [59], and JetStream [38], PMGraph achieves speedups of 2.57\(\times\)\(\sim\)4.46\(\times\) (3.65\(\times\) on average), 2.36\(\times\)\(\sim\)4.22\(\times\) (3.41\(\times\) on average), 1.33\(\times\)\(\sim\)2.14\(\times\) (1.73\(\times\) on average), and 1.3\(\times\)\(\sim\)1.52\(\times\) (1.38\(\times\) on average), respectively.
Our main contributions are as follows:
— We propose a software-hardware collaborative accelerator, PMGraph, which efficiently processes CGQs in streaming graphs. PMGraph centers on fine-grained data: it selects the graph queries that meet the requirements in each iteration through vertex data, and utilizes the similarity between different graph queries to merge the same vertices they need to process, addressing the problem that different graph queries repeatedly access the same data. Furthermore, it adopts an update strategy that regularizes the processing order of vertices in each graph query of CGQs according to the dependence chain of the vertices, thus effectively reducing the redundant computations in each graph query.
— We propose a CGQs-oriented scheduling strategy to increase the overlap between data processed by different graph queries, thereby further improving the performance of processing CGQs.
— PMGraph prefetches the vertex information to be processed by CGQs according to the global active vertex set Frontier of all graph queries, hiding memory access latency while processing CGQs. It also issues a single prefetch for the same vertices that multiple graph queries need to process in each iteration, reducing data access overhead.
— Our evaluation shows that, compared with the state-of-the-art software systems and hardware accelerators for CGQs on streaming graphs, PMGraph delivers better performance and fewer cache misses.
The remainder of this article is organized as follows. Section 2 reviews the background and our motivation. Section 3 introduces our proposed PMGraph. The evaluation is presented in Section 4. Section 5 describes related work. Finally, Section 6 concludes.
2 Background and Motivation
Existing streaming graph systems often determine the vertices directly affected by update streams according to the vertex dependencies, and traverse the graph structure from these vertices to update and propagate state until all vertices obtain correct results. The dependence relationship between vertices captures the order in which active vertices are tracked: for each vertex, dependence tracking relies on a single incoming neighbor, namely the in-neighbor whose value ultimately determines the state of that vertex. When processing CGQs, different graph queries independently traverse the same graph structure along different paths. Thus, different graph queries repeatedly access the same data, which results in a large amount of memory access overhead. Furthermore, the multi-source traversal mode of each graph query results in a large number of redundant computations.
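For concreteness, single-parent dependence tracking can be sketched as below; the field names and the shortest-path-style relaxation are illustrative assumptions in the spirit of KickStarter-like systems, not any system's exact code.

```cpp
// Sketch of single-parent dependence tracking: besides its value, each vertex
// records the one in-neighbor whose contribution currently determines it.
#include <cstdint>
#include <vector>

struct DependenceState {
    std::vector<float>    value;   // current state of each vertex
    std::vector<uint32_t> parent;  // in-neighbor that determined value[v]
};

// Relaxation along edge u -> v with weight w: u becomes v's dependence parent
// only when u's contribution actually determines v's new state.
inline bool relax(DependenceState& s, uint32_t u, uint32_t v, float w) {
    float cand = s.value[u] + w;
    if (cand < s.value[v]) {
        s.value[v] = cand;
        s.parent[v] = u;   // v now depends on u; chains of parents form the
        return true;       // dependence chains used throughout this article
    }
    return false;
}
```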
2.1 Large Memory Access Overhead in CGQs on Streaming Graphs
When processing CGQs in a streaming graph, each graph query obtains the vertices directly affected by update streams according to its own vertex dependencies. Because these dependencies differ between graph queries, the set of vertices directly affected by the update stream differs between queries. However, since all graph queries receive the same update stream, the sets of affected vertices of different graph queries overlap heavily. Processing CGQs in the streaming graph starts from the vertices directly affected by the update stream and traverses along each query's own path; the queries pass through many of the same vertices, reflecting the high degree of overlap between the data processed by different graph queries.
To evaluate the data similarity when traversing CGQs, we run 64 graph queries on five classical datasets [2, 5, 25, 49, 50]. The experimental results are shown in Figure 1, where the X-axis represents intervals of the number of queries. Figure 1 shows that 80%, 59%, 81%, 72%, and 92% of all vertices accessed by graph queries on the datasets Google [25], Topcats [49], LJ [2], WP [50], and UK [5], respectively, are accessed by the majority of the graph queries (i.e., by most of the 64 queries). Although graph queries access many of the same vertices, each graph query is independent and updates and propagates along its own path. Thus, the times at which they access the same vertices often differ, so the same vertices are loaded frequently, resulting in a large amount of memory access overhead.
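The overlap statistic above is straightforward to compute; below is a sketch under an assumed per-query access-bitmap layout (the function and the bucketing scheme are illustrative, not the exact measurement harness).

```cpp
// Sketch of computing the access-overlap statistic: count, per vertex, how
// many of the queries touched it, then bucket vertices by that count.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<size_t> overlap_histogram(
        const std::vector<std::vector<bool>>& accessed,  // [query][vertex]
        size_t num_buckets) {
    size_t num_queries = accessed.size();
    size_t num_vertices = accessed.empty() ? 0 : accessed[0].size();
    std::vector<size_t> hist(num_buckets, 0);
    for (size_t v = 0; v < num_vertices; ++v) {
        size_t count = 0;
        for (size_t q = 0; q < num_queries; ++q)
            if (accessed[q][v]) ++count;
        if (count == 0) continue;  // vertex never accessed by any query
        size_t bucket = (count - 1) * num_buckets / num_queries;
        hist[bucket] += 1;         // e.g., 8 buckets of 8 queries each for 64 queries
    }
    return hist;
}
```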
2.2 Large Numbers of Redundant Computations in CGQs on Streaming Graphs
There is very little work on CGQs in streaming graphs [20, 21, 29, 46, 57, 59, 60]. These systems reuse query results of high-degree vertices to speed up convergence [20, 21, 46, 59], but they are not optimized for the redundant computation problem. To improve the performance of processing CGQs, existing general-purpose static graph systems for CGQs partition the graph to manage the processing order of vertices and process them partition by partition [9, 29, 57, 60]. When the CGQs finish executing in one partition, they proceed to the next partition. In this manner, the data required by the CGQs currently being processed remains in the LLC and is not evicted. However, this kind of method for CGQs on static graphs prevents updated vertex values from propagating in real time, causing a large number of redundant computations in each graph query. The redundant computation problem is more prominent when processing CGQs in streaming graphs. The reason is that streaming graph processing takes the directly affected vertices as the sources of state propagation, so the update of each graph query in CGQs can be regarded as a multi-source graph query. The state propagations from the different sources within a graph query are not independent and may affect the query's final result. Because vertex states are constrained by the vertex dependencies, an irregular processing order in this single-query multi-source mode generates even more redundant computations.
We introduce an example of processing two graph queries in a streaming graph. There are nine vertices and two queries (Query(A) and Query(G)) in Figure 2(a), where A and G are the source vertices of the two graph queries, respectively. Assume that the vertices directly affected by the update stream are C and F in the first query and G and H in the second query. Without loss of generality, assume that active vertices are processed in priority order from A to I. The red arrows in the figure indicate the state determination relationships between the vertices of the query with A as the source vertex, and the black arrows indicate the state determination relationships between the vertices of the query with G as the source vertex. A state determination relationship between vertices means that the state of each vertex except the source is ultimately determined by its parent vertex in the vertex dependence chain. When processing the update stream, the first query (Query(A)) activates and updates vertices D and E from vertices C and F. The second query (Query(G)) activates and updates vertices B, H, and I from vertices G and H. According to this rule, the vertices computed in each round of the two queries are shown in Figure 3(a). It takes a total of 16 computations for the two graph queries to reach a stable state, and redundant computations occur. We next address this redundant computation problem using vertex dependencies and compare the numbers of computations.
The update strategy based on vertex dependencies addresses the redundant computation problem of each graph query in CGQs. The approach is to process vertices according to the vertex dependence chain. The vertex dependence chain [22, 45] is established through the dependence relationships between vertices: apart from the source vertex, the parent of a vertex on the dependence chain is the in-neighbor whose value ultimately determines that vertex. Vertices at the front of each dependence chain are processed with higher priority than vertices at the end, so the order of vertices on the dependence chain determines the processing order of vertices in each graph query. If each dependence chain updates vertices in order from front to end, then every vertex needs to be visited only once to reach a stable state, thereby reducing redundant computations. Thus, we regularize the processing order of vertices in each graph query according to the order of the dependence chain, effectively reducing redundant computations in each graph query. Furthermore, we utilize the dependence chain to determine the iteration round of each vertex, which is used to increase the overlap between the data processed by different graph queries, as described in Section 3. For the two graph queries in Figure 2(a), we build dependence chains following KickStarter [45]; the two resulting chains are shown in Figure 2(b). The left graph in Figure 2(b) is the dependence chain of the vertices for the query with A as the source vertex, and the right graph is the dependence chain for the query with G as the source vertex. We take Figures 2(b) and 3(b) as an example to briefly describe this update strategy. We process the same two queries as in the previous example and select the same vertices directly affected by the update stream on the same graph (Figure 2(a)): C and F in the first query, and G and H in the other. As shown in Figure 3(b), the strategy first selects the vertices C and G with the smallest levels among the directly affected vertices in the dependence chains of each graph query. It then processes the vertices of each level in sequence along the dependence chains, performing updates and activations. This approach takes 12 computations, 4 fewer than the method above. If vertices C and F in Figure 2(b) are not on the same path (neither is an ancestor of the other) or their propagation paths do not overlap, the state propagations from C and F generate no extra redundant computations; in such cases no extra operations are needed, and our update strategy proceeds as normal.
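A minimal sketch of this level-ordered update is given below, assuming a precomputed per-vertex `level` array (depth on the dependence chain) and a shortest-path-style update rule; the priority queue is one illustrative way to realize front-to-end chain order.

```cpp
// Sketch of the dependence-chain-ordered update: directly affected vertices
// enter a min-heap keyed by their level on the dependence chain, so vertices
// nearer the front of a chain are always processed before their descendants.
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Edge { uint32_t dst; float w; };

void chain_ordered_update(const std::vector<std::vector<Edge>>& out_edges,
                          const std::vector<uint32_t>& level,      // depth on chain
                          std::vector<float>& state,
                          const std::vector<uint32_t>& affected) { // directly affected
    using Item = std::pair<uint32_t, uint32_t>;                    // (level, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (uint32_t v : affected) pq.push({level[v], v});
    while (!pq.empty()) {
        uint32_t v = pq.top().second;
        pq.pop();
        for (const Edge& e : out_edges[v]) {
            float cand = state[v] + e.w;   // shortest-path-style relaxation
            if (cand < state[e.dst]) {     // improved value: update the target
                state[e.dst] = cand;       // and re-activate it at its own level
                pq.push({level[e.dst], e.dst});
            }
        }
    }
}
```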
2.3 Motivation
To address the above problems, we utilize the similarity between different graph queries to merge the same vertices they need to process, addressing the problem that different graph queries in CGQs repeatedly access the same data and thereby reducing memory access overhead. We also adopt an update strategy that regularizes the update order of vertices in each graph query through the vertex dependencies, thereby reducing redundant computations. Based on these strategies, we propose a fine-grained data-centric concurrent execution model to process CGQs in streaming graphs.
Concurrent Execution Model Centered on Fine-grained Data: Existing execution models for CGQs mainly fall into two categories: graph query-centric models [20, 36, 46] and block data-centric models [29, 47, 57]. A graph query-centric model loads each graph query together with its related data, which facilitates vertex scheduling for that query. However, because different graph queries are similar, processing CGQs this way repeatedly accesses the same vertices, resulting in a large data access overhead. A block data-centric model exploits the similarity between the data processed by different graph queries to partition the graph: graph data is loaded sequentially in units of partitions, and then all CGQs are triggered to process the data in the current partition. However, this partition-based execution mode causes a large number of redundant computations in CGQs on streaming graphs and is therefore unsuitable for them. Thus, we employ the similarity between different graph queries together with the update strategy based on vertex dependencies to reduce both the memory access overhead and the redundant computations in each graph query, in order to process CGQs in streaming graphs efficiently.
We propose a fine-grained data-centric LTFP model based on vertex dependencies, which uses the similarity between different graph queries to schedule CGQs around vertex data. It first loads the data of an active vertex, then triggers the CGQs, filters the qualified graph queries based on the vertex data, and reduces redundant computations in each graph query based on vertex dependencies. Our model provides a unified processing scheme for CGQs in streaming graphs. In the model, we employ the global active vertex set Frontier to prefetch the vertex information to be processed in the current iteration, hiding memory access latency. The model also provides prefetching for the same vertices processed by different graph queries in each iteration, reducing data access overhead.
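In software, the Frontier-driven prefetching can be approximated as follows; `PREFETCH_DIST` and the CSR-style array layout are assumptions of this sketch, and in PMGraph the DADP unit performs this in hardware rather than with explicit prefetch instructions.

```cpp
// Sketch of Frontier-driven prefetching: while processing frontier[i], issue
// software prefetches (GCC/Clang builtin) for a vertex a fixed distance ahead.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t PREFETCH_DIST = 8;  // assumed tuning knob

void process_with_prefetch(const std::vector<uint32_t>& frontier,
                           const std::vector<uint64_t>& row_offsets,  // CSR offsets
                           const std::vector<float>& vertex_state) {
    for (size_t i = 0; i < frontier.size(); ++i) {
        if (i + PREFETCH_DIST < frontier.size()) {
            uint32_t ahead = frontier[i + PREFETCH_DIST];
            __builtin_prefetch(&row_offsets[ahead]);   // edge offsets of a future vertex
            __builtin_prefetch(&vertex_state[ahead]);  // its state data
        }
        // ... process frontier[i] for all queries that need it (see the LTFP
        // sketch in Section 1) ...
    }
}
```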
4 Evaluation
In this section, we evaluate the effectiveness of our proposed PMGraph.
4.1 Experimental Setup
Since there is no concurrent graph query system for streaming graphs that starts from the vertices directly affected by the update stream, we build a concurrent graph query system, Kickstarter-C, as a baseline and compare it with our proposed PMGraph. Kickstarter-C extends the state-of-the-art single streaming graph query processing engine KickStarter [45] to CGQs in streaming graphs. Since the original KickStarter does not support CGQs on the same graph, we make a minor extension so that it supports CGQs in the streaming graph, with all engines sharing the same graph. We also compare PMGraph with Tripoline [21], the state-of-the-art concurrent graph query processing system for streaming graphs (retaining the graph query results of the top-4 high-degree vertices). Besides, we integrate the state-of-the-art graph processing hardware accelerators HATS [33] and Minnow [51] into Kickstarter-C for CGQs on streaming graphs. We also implement the two core functions of LCCG [59], the state-of-the-art hardware accelerator for concurrent queries on static graphs, namely regularizing the traversal paths of graph queries and prefetching data, and extend the state-of-the-art streaming graph hardware accelerator JetStream [38] to support concurrent queries. We then compare PMGraph with Minnow, HATS, LCCG, and JetStream, respectively.
Applications. We select five classic graph queries for the experiments: BFS [12], SSSP [32], single source widest path (SSWP) [45], single source narrowest path (SSNP) [21], and Radii [42]. These queries traverse the graph structure from a source and compute the state data of the vertices. Applications such as PageRank [35] do not have a specific source and are therefore outside the scope of this article.
Datasets. We evaluate PMGraph on five classic real-world graphs: Google [25], Topcats [49], LJ [2], WP [50], and UK [5]. Their basic properties are summarized in Table 1. The number of vertices ranges from 0.8 M to 18.5 M, and the number of edges from 5 M to 298 M. Similar to existing solutions [45, 58], we first load 70% of the edges as the initial graph. We then use the remaining 30% of the edges as added edges in the update stream, while the deleted edges in the update stream are selected randomly as 30% of the initial graph. For each dataset, we randomly select 64 sources for graph queries at runtime; the degree of each query's source vertex is larger than the average vertex degree of the dataset.
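A sketch of this setup follows; the text above fixes only the 70%/30% proportions and random selection, so the shuffle, seed, and helper name below are illustrative assumptions.

```cpp
// Sketch of building the initial graph and update stream from an edge list:
// 70% of edges form the initial graph, the remaining 30% become additions,
// and deletions are a random 30% sample of the initial graph.
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using EdgeList = std::vector<std::pair<uint32_t, uint32_t>>;

void split_edges(EdgeList all_edges, EdgeList& initial,
                 EdgeList& additions, EdgeList& deletions) {
    std::mt19937 rng(42);  // assumed fixed seed for reproducibility
    std::shuffle(all_edges.begin(), all_edges.end(), rng);
    size_t cut = all_edges.size() * 7 / 10;                     // 70% initial graph
    initial.assign(all_edges.begin(), all_edges.begin() + cut);
    additions.assign(all_edges.begin() + cut, all_edges.end()); // 30% added edges
    deletions = initial;                                        // random 30% of the
    std::shuffle(deletions.begin(), deletions.end(), rng);      // initial graph as
    deletions.resize(initial.size() * 3 / 10);                  // deleted edges
}
```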
Simulation Platform. Similar to previous methods [33, 59], we use Sniper [6] to simulate a 64-core processor with its default Beckton microarchitecture configuration; the specific parameters are listed in Table 2. Each out-of-order core has private L1 and L2 caches and shares the last-level cache. We implement PMGraph in the simulated processor. The program is compiled with GCC 7.2 with the O3 flag. Similar to previous methods [33, 58], we employ an EDA tool (Design Compiler 2016) and CACTI [23] to estimate the power and area overhead of the hardware units in PMGraph.
4.2 Comparison with Different Software Approaches
We compare the experimental results of the state-of-the-art concurrent graph query software systems Kickstarter-C and Tripoline, our pure-software approach (PMGraph-S), our approach containing only the DADP hardware component (PMGraph-D), and our proposed hardware accelerator (PMGraph) on the above five datasets. PMGraph-S is the pure software implementation of PMGraph; PMGraph-D enables only DADP, without HVSC; and PMGraph enables both DADP and HVSC.
Software Runtime Comparison: Figure 9 shows the runtime of the various methods normalized to Kickstarter-C. Our software approach PMGraph-S outperforms Kickstarter-C, gaining speedups of 1.01\(\times\)\(\sim\)3.12\(\times\) (1.71\(\times\) on average). Compared with Tripoline, PMGraph-S achieves an average speedup of 1.38\(\times\). The reason is that PMGraph-S utilizes the similarity between different graph queries to address the problem that different graph queries repeatedly access the same data, thereby reducing memory access overhead. We regularize the processing order of vertices in each graph query according to the dependence chain of vertices, reducing redundant computations. Our scheduling strategy also increases the data overlap when different graph queries are processed, further improving performance.
We also compare our software approach combined with the hardware components against the other software methods. PMGraph-D offloads the graph query filtering and graph traversal of PMGraph-S to the DADP hardware unit, thereby reducing the memory access overhead caused by the large number of judgment operations and irregular data accesses in the software-only PMGraph-S. The runtime of PMGraph-D is 23%\(\sim\)48% less than that of PMGraph-S. Compared with Kickstarter-C and Tripoline, PMGraph achieves speedups of 3.18\(\times\)\(\sim\)8.37\(\times\) (5.57\(\times\) on average) and 1.54\(\times\)\(\sim\)7.96\(\times\) (4.58\(\times\) on average), respectively, and outperforms PMGraph-S. PMGraph performs best on highly skewed power-law graphs: when the degree distribution of vertices is highly skewed, more vertex data is shared between different queries, reducing memory access overhead. These comparison results show that PMGraph outperforms the other methods, indicating that our proposed PMGraph is effective.
Cache Miss Comparison: Figure 10 shows the LLC misses of the different methods relative to Kickstarter-C. Compared with Kickstarter-C and Tripoline, PMGraph-S (the pure-software approach) reduces LLC misses by 11.4% and 1.56% on average, respectively. The reason is that PMGraph-S centers on fine-grained data so that different graph queries in CGQs share the vertex data that needs to be processed, reducing memory access overhead. Compared with PMGraph-S, PMGraph-D reduces LLC misses by 11.4% on average, because PMGraph-D adds dependency-aware data prefetching on top of PMGraph-S, loading the data to be processed by CGQs into the cache in advance and reducing the random memory accesses caused by frequent judgment operations in software. Compared with PMGraph-D, PMGraph reduces LLC misses by 16.88% on average, because PMGraph merges the state data of hot vertices across all queries into a contiguous memory region: when a hot vertex's data is loaded into the cache, the other data related to that hot vertex is loaded along with it, reducing the cache miss rate.
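Our reading of this coalesced layout can be sketched as follows; the slot mapping, sentinel handling, and the 64-query constant are illustrative assumptions about the memory layout, not the hardware's RTL.

```cpp
// Sketch of hot-vertex state coalescing: for each hot vertex, the states of
// all queries are stored contiguously, so one cache line brings in several
// queries' states for that vertex instead of one state per line.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t NUM_QUERIES = 64;

// Query-major layout: state[q][v]; two queries touching the same vertex v
// access two far-apart allocations, wasting cache capacity.
using QueryMajorState = std::vector<std::vector<float>>;

// Vertex-major (coalesced) layout for hot vertices only.
struct CoalescedHotState {
    std::vector<uint32_t> hot_slot;   // vertex id -> hot slot h (or a sentinel)
    std::vector<float>    hot_state;  // NUM_QUERIES consecutive entries per slot
    float& state(uint32_t v, size_t q) {
        return hot_state[size_t(hot_slot[v]) * NUM_QUERIES + q];
    }
};
```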
Comparison of the Total Number of Vertex Updates: Figure 11 shows the total number of state updates of all vertices during CGQs for Kickstarter-C, Tripoline, and PMGraph-S, normalized to Kickstarter-C. PMGraph-S reduces the total number of vertex updates by an average of 37% compared with Kickstarter-C and 41% compared with Tripoline. The reason is that PMGraph-S uses the order of the vertex dependence chains to regularize the processing order of vertices within each graph query in CGQs, which effectively reduces redundant computations within each query. In addition, when the graph is updated, Tripoline starts computation from the source vertices rather than from the vertices directly affected by the update stream, so its total number of vertex updates is large. These results show that our proposed PMGraph-S effectively reduces redundant computations in CGQs in the streaming graph by utilizing the update strategy based on vertex dependencies.
4.3 Comparison with Hardware Accelerators
We compare PMGraph with the state-of-the-art hardware accelerators and normalize their runtime to that of Minnow. The comparison results are shown in Figures 12 and 13. Compared with Minnow, HATS, LCCG, and JetStream, PMGraph achieves speedups of 2.57\(\times\)\(\sim\)4.46\(\times\) (3.65\(\times\) on average), 2.36\(\times\)\(\sim\)4.22\(\times\) (3.41\(\times\) on average), 1.33\(\times\)\(\sim\)2.14\(\times\) (1.73\(\times\) on average), and 1.3\(\times\)\(\sim\)1.52\(\times\) (1.38\(\times\) on average), respectively, and reduces LLC misses by 15.68%, 9.36%, 2.16%, and 3.76% on average, respectively. The reason is that Minnow, HATS, and JetStream are hardware accelerators designed for a single graph query and do not consider the problem that different graph queries in CGQs repeatedly access the same data; thus, they do not work well for CGQs. LCCG is designed for CGQs on static graphs and recomputes from the source vertices, which causes a large computation overhead.
Area Cost and Power Consumption: The area occupied by PMGraph is mainly that of its two hardware units (DADP and HVSC). We implement the hardware units in Verilog RTL and synthesize them with an EDA tool under a 40 nm process at the same target frequency. Experimental results show that the power consumption and area of PMGraph are 692.19 mW and 0.077 \(mm^2\), respectively, both of which are very small.
These experimental results show that PMGraph efficiently addresses the large numbers of redundant computations and the high memory access overhead of CGQs in streaming graphs with very small area overhead, demonstrating that our proposed PMGraph is effective.
4.4 Sensitivity Analysis
Stability: To evaluate the stability of our proposed PMGraph, we conduct experiments on the dataset LJ using the SSSP algorithm to explore the impact of update streams of different sizes (from 10 K to 10 M) on the runtime, and compare PMGraph with the existing classic software and hardware methods. The experimental results, normalized to Minnow, are shown in Figure 14. As the size of the update stream increases, the runtime of PMGraph remains better than that of the other methods and achieves a higher speedup. The reason is that PMGraph provides merged processing and prefetching for the vertices that different queries need to process; when the update stream grows, more vertex data is shared between different queries, increasing the reuse of data in the cache.
Scalability: To evaluate the scalability of PMGraph, we perform experiments on the dataset LJ using the SSSP algorithm to explore the impact of different LLC sizes (from 64 M to 256 M) on the runtime, and compare PMGraph with the classic hardware accelerators Minnow, HATS, LCCG, and JetStream. The experimental results, normalized to Minnow, are shown in Figure 15. As the LLC size increases, PMGraph remains superior to the other accelerators and achieves the best speedup. The reason is that a larger LLC caches more useful data, which can be fully utilized during concurrent queries, thereby improving performance.
5 Related Work
Previous work on graph queries can be roughly divided into three categories: single graph query processing, CGQs processing, and hardware schemes for graph queries.
Single Graph Query Processing. Many graph processing systems were proposed for a single graph query in streaming graphs. KickStarter [45] identified the vertices directly affected by update streams according to vertex dependencies and accelerated the iterative computations by approximating intermediate values. GraphBolt [31] also utilized intermediate values and vertex dependencies for incremental propagation and computation. RisGraph [16] proposed localized data access and inter-update parallelism to achieve high throughput and low latency simultaneously; however, this method mainly targets per-update incremental analysis, and its performance degrades when a large batch of updates is applied at once. DZiG [30] proposed a sparsity-aware incremental processing technique to optimize GraphBolt. GraPU [41] precomputed updates in the buffer to speed up convergence. However, all of these methods focus on processing a single graph query in streaming graphs; they do not consider data reuse and cannot efficiently process CGQs. Exploiting the similarity between the data processed by different graph queries, our PMGraph schedules CGQs centered on fine-grained data, enabling CGQs to share vertex data and reduce data access overhead.
CGQs Processing. Many graph processing systems have been proposed for CGQs. One category targets static graphs. SimGQ [46] and VRGQ [20] reused the query results of high-degree source vertices to speed up other graph queries. GraphM [60] proposed a storage system for CGQs that divided the graph into multiple partitions and gave the highest processing priority to the partition accessed by the largest total number of graph queries. ForkGraph [29] proposed a cache-efficient processing system for multiple graph queries that divided the graph into LLC-sized partitions and buffered and executed multiple queries per partition. CGraph [57] proposed a subgraph-based scheduling algorithm to effectively reduce the disk access cost. These methods of dividing a graph into partitions or subgraphs resulted in a large amount of redundant computation. Krill [8] treated all CGQs as a whole to share edge data and accelerate computation. Glign [50] improved on Krill [8] by aligning high-degree vertices, thereby increasing the overlap of the vertex data processed by different graph queries. However, all of these systems target queries on static graphs and process them from the source vertices, which causes a large computation overhead; thus, they are not suitable for CGQs in streaming graphs.
The other category targets streaming graphs. EGraph [52] proposed a load-processing-switching execution model that effectively reduced CPU-GPU data transmission overhead by exploiting the structural similarity of streaming graphs at different times, thereby accelerating graph queries; however, this method is optimized for GPUs. Tripoline [21] proposed a general processing mode for CGQs that utilized the results of high-degree standing queries to accelerate other queries; however, it is not optimized for the redundant computations within graph queries. Furthermore, some methods do not start from the affected vertices and therefore incur expensive computation costs. Different from these methods, our proposed PMGraph utilizes the dependencies of vertices, schedules CGQs centered on fine-grained data, and addresses both the repeated accesses to the same data by different graph queries in CGQs and the redundant computations within each graph query.
Hardware Schemes for Graph Query. Several hardware accelerators have been proposed for processing graph queries. GraphDynS [48] and TDGraph [58] adopted a hardware-software co-design approach to address the irregularity of graph queries. Some accelerators utilized in-memory computing to reduce the memory access and data transmission costs of graph queries [34, 62]. In addition, some accelerators used ReRAM to massively parallelize vector- or matrix-based graph processing [7, 26, 40, 43]. Furthermore, event-driven accelerators such as JetStream [38] and GraphPulse [37] have also been proposed to accelerate graph queries. However, these hardware accelerators mainly target a single graph query; when they are employed for multiple graph queries, different graph queries repeatedly access the same data. PMGraph schedules CGQs centered on fine-grained data according to the similarity between different graph queries, eliminating a large amount of repeated access to the same data. Accelerators for multiple queries have also been proposed. LCCG [59] provided a topology-aware hardware accelerator that processes CGQs by regularizing the traversal order of graph queries. However, it targets static graphs: it starts processing from the source vertices, which causes a large computation overhead, does not optimize the redundant computations within each graph query, and is therefore not suitable for CGQs in streaming graphs.
Besides, some hardware techniques have been proposed for graph queries. GRASP [14] maximized the reuse of vertices in graph queries by protecting frequently used high-degree vertices from cache thrashing. P-OPT [3] predicted when a vertex would next be processed, using the transpose of the graph, to guide cache replacement for graph queries. Some hardware prefetchers have also been proposed to hide data access latency in graph queries [33, 51]. However, these hardware techniques mainly target a single graph query and do not consider the problem that different graph queries in CGQs repeatedly access the same data. Different from previous accelerators, our proposed PMGraph is based on the LTFP model: it utilizes the dependence chain of the vertices and adopts the update strategy that regularizes the processing order of vertices in each graph query of CGQs, effectively reducing redundant computations in each graph query. PMGraph also provides shared access to the same vertices processed by different graph queries, reducing memory access overhead, and prefetches the same vertices processed by different graph queries, hiding data access latency.
6 Conclusion
We propose PMGraph, a software-hardware co-designed accelerator, to efficiently address the problems of massive redundant computation and high memory access overhead in CGQs on streaming graphs. First, PMGraph selects the graph queries that meet the requirements through vertex data and utilizes the similarity between different graph queries to merge the same vertices they need to process, reducing memory access overhead. Furthermore, it adopts the update strategy that regularizes the processing order of vertices in each graph query of CGQs according to the order of the vertex dependence chain, reducing redundant computations in each graph query. Second, PMGraph utilizes a CGQs-oriented scheduling strategy to increase the data overlap when different graph queries are processed, thereby improving performance. Finally, PMGraph prefetches the vertex information for CGQs, hiding data access latency, and provides prefetching for the same vertices processed by different graph queries, reducing data access overhead. Compared with the state-of-the-art streaming graph concurrent query software systems Kickstarter-C and Tripoline, and the state-of-the-art hardware accelerators Minnow, HATS, LCCG, and JetStream, PMGraph achieves average speedups of 5.57\(\times\), 4.58\(\times\), 3.65\(\times\), 3.41\(\times\), 1.73\(\times\), and 1.38\(\times\), respectively, which demonstrates the effectiveness of our proposed PMGraph for CGQs in streaming graphs.