1 Introduction

The reachability query, which asks whether there exists a path from one vertex to another in a directed graph, has numerous applications in graph and network analysis. Such queries can be answered without preprocessing by graph traversal, using either breadth-first or depth-first search, in \(O(|E|+|V|)\) time (where V and E are the vertex set and edge set, respectively), or in constant time if we pre-compute and store the transitive closure of every vertex, which takes O(|V||E|) time and \(O(|V|^2)\) space. Unfortunately, neither approach is feasible for applications that need to process large graphs with limited memory. Over the past decades, the problem has been extensively studied and many advanced algorithms have been proposed, most of which rely on building smart indexes that strike a balance between online query processing time and offline index construction time (and index size).
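For concreteness, the traversal baseline can be sketched as follows. This is a minimal BFS reachability check; the adjacency-list representation and function name are our own illustration:

```python
from collections import deque

def reachable(adj, u, v):
    """Answer a reachability query by BFS in O(|V| + |E|) time.

    adj maps each vertex to the list of its out-neighbors.
    """
    if u == v:
        return True
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, []):
            if y not in seen:
                if y == v:
                    return True
                seen.add(y)
                queue.append(y)
    return False
```

Pre-computing the transitive closure trades the \(O(|E|+|V|)\) traversal above for \(O(|V|^2)\) storage, which motivates the index-based and compression-based approaches surveyed next.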

More recently, researchers recognized that it is possible to reduce the graph size by graph compression without losing reachability information, and that the compressed graph can help speed up query processing as well as reduce index size and index construction time. Specifically, Fan et al. [7] define equivalence classes of vertices with respect to reachability queries, and compress a graph by merging all vertices in an equivalence class into a single vertex. However, finding all equivalence classes is very time-consuming. Zhou et al. [22] propose an efficient transitive reduction algorithm that turns a directed acyclic graph (DAG) into a DAG without redundant edges, after which the equivalence reduction of [7] can be done much more efficiently. The resulting graph \(G^\epsilon \), obtained by applying transitive reduction and equivalence reduction to the original graph G, can be much smaller while retaining all reachability information, and it was experimentally verified that for many real-world graphs, searching for reachability over \(G^\epsilon \) can be much faster than searching over G using state-of-the-art algorithms.

This paper builds on the work of [22]. We observe that after the removal of redundant edges, many linear chains emerge. Based on this, we propose a multilevel reachability-preserving compression method that can further reduce the size of the graph obtained by the method of [22]. Our compression utilizes a slightly modified concept of module [14] and constructs a modular decomposition tree. We show how to use the decomposition tree to answer reachability queries over the original graph efficiently. Furthermore, the decomposition tree usually takes very little space. We make the following contributions:

  1. We define a new concept of module, based on which we propose a multilevel graph compression scheme that compresses a graph into a smaller graph \(G_c\).

  2. We organize the modules into a hierarchical structure called the modular decomposition tree, and propose an efficient algorithm that utilizes the tree to answer reachability queries.

  3. We conduct extensive experiments with real-world as well as synthetic graphs that demonstrate the advantages of our proposed approach.

  4. We provide a discussion of similar compression strategies for k-reachability queries.

The remainder of this paper is organized as follows. We first discuss related work in Sect. 2 and present the preliminaries in Sect. 3. Then, we give an overview of our approach and provide its theoretical foundations in Sect. 4, followed by the detailed algorithms in Sect. 5. Our experimental results are given in Sect. 6. Section 7 discusses the possibility of similar compression for k-reachability queries. We conclude the paper in Sect. 8.

2 Related Work

As briefly mentioned in Sect. 1, existing approaches for answering reachability queries can be classified into index-based and compression-based approaches.

2.1 Index-Based Approach

The index-based algorithms create labels for the vertices, such that the labels contain reachability information. These algorithms can be divided into Label-only and Label+G methods [19]. The Label-only methods [1, 3, 4, 8, 10,11,12,13, 20] use the labels of the source and destination vertices only to answer reachability queries. Agrawal et al. [1] proposed the tree cover approach, which builds the index from an optimal spanning tree. Here, an interval is created for each vertex, and a reachability query is answered as true if the interval of the target vertex is contained in the interval of the source vertex. Both the index construction time and the index size are high in this approach. A chain cover approach was first proposed in [8], where the entire graph is divided into a number of pairwise disjoint chains to create the index. The label of each vertex contains a minimal successor list of pairs recording a chain number and a position in that chain. A vertex u can reach v if the label of u contains a pair (k, j) and v has index pair (i, j) such that \(i \ge k\). This chain cover approach was later improved in [3]. Path tree [10] uses a similar concept to chain cover, using paths to create the index, and has a smaller index size than chain cover. The recent approaches DL [11], PLL [20] and TF [4] use the concept of two-hop labeling proposed in [6]. In two-hop labeling, a label is created for each vertex containing a subset of the vertices it can reach (\(L_\mathrm{out}\)) as well as a subset of the vertices that can reach it (\(L_\mathrm{in}\)). Vertex u can reach vertex v if \(L_\mathrm{out}(u) \cap L_\mathrm{in}(v) \ne \emptyset \). [12] uses the concept of chain cover to improve two-hop labeling, proposing a three-hop labeling that creates a transitive closure contour (Con(G)) of graph G using chain decomposition and then applies two-hop techniques. Path-hop [2] improves three-hop by replacing the chain decomposition with a spanning tree.
TF [4] proposes a topological folding approach for two-hop labeling that can significantly reduce both the index size and the query time.
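As an illustration of the two-hop principle, a query is just a set intersection. The toy labeling below, for the path 1 → 2 → 3 with vertex 2 as the only hop vertex, is our own example, not taken from [6]:

```python
def two_hop_reachable(L_out, L_in, u, v):
    """Two-hop test: u reaches v iff some hop vertex lies in both
    L_out(u) and L_in(v)."""
    return bool(L_out[u] & L_in[v])

# A valid two-hop labeling for the path 1 -> 2 -> 3:
L_out = {1: {1, 2}, 2: {2}, 3: {3}}
L_in = {1: {1}, 2: {2}, 3: {2, 3}}
```

For example, `two_hop_reachable(L_out, L_in, 1, 3)` is true because hop vertex 2 appears in both labels, while `two_hop_reachable(L_out, L_in, 3, 1)` is false because the intersection is empty.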

The Label+G approaches [15,16,17,18,19, 21] require online searching of the data graph G if the query cannot be answered from the labels alone. Trißl and Leser [17] use interval labeling over a spanning tree and perform DFS online if needed. Grail [21] and Ferrari [15] use multiple intervals instead of a single interval label for each vertex over the spanning tree. Feline [18] creates coordinates \(i(u) = (X_u, Y_u)\) for each vertex u and answers reachability from u to v as true if the area of i(v) is contained in that of i(u). Feline also uses interval labeling over a spanning tree and compares the topological levels of u and v as an additional pruning strategy to reduce the DFS search. \(\hbox {IP}\) [19] uses independent permutation numbering to label each vertex. Feline and \(\hbox {IP}\) show significant improvements in query time and require less index construction time and smaller index size. BFL [16] proposes a Bloom-filter labeling to further improve the performance of \(\hbox {IP}\).

2.2 Compression-Based Approach

Graph compression-based works include Scarab [9], equivalence reduction [7] and DAG reduction [22]. Scarab [9] compresses the original graph by creating a reachability backbone that carries the major reachability information. To decide reachability from vertex u to vertex v, the algorithm needs access to a list of local outgoing backbone vertices of u and local incoming backbone vertices of v. The algorithm performs a forward BFS from u and a backward BFS from v on the original graph to answer reachability from u to v. If the answer is false, it then checks whether any outgoing backbone vertex of u can reach any incoming backbone vertex of v in the reachability backbone; if yes, then u can reach v. Scarab requires a large index with high time complexity. Equivalence reduction [7] reduces the graph by merging equivalent vertices into a single vertex, where two vertices are equivalent if they have the same ancestors and the same descendants. The algorithm requires high equivalence class construction time. DAG reduction [22] improves the construction time of the equivalence classes by first performing a transitive reduction on the graph.

Our work differs from the previous works in that we consider not only equivalence classes but also linear chains when compressing the graph, and to the best of our knowledge, none of the previous works uses multilevel compression and a modular decomposition tree for reachability queries.

3 Preliminaries

We consider directed graphs in this paper. For any directed graph G, we use \(V_G\) and \(E_G\) to denote the vertex set and the edge set of G, respectively. Given two vertices u and v in G, if there is a path from u to v, we say v is reachable from u, or equivalently, u can reach v. We use \(u\rightsquigarrow _G v\) to denote that u can reach v in graph G. Given a directed graph G and vertices u and v in G, a reachability query from u to v, denoted \(?u \rightsquigarrow _G v\), asks whether v is reachable from u in G.

A directed acyclic graph (DAG) is a directed graph without cycles. In the literature, most works on reachability queries assume the graph G is a DAG because, if it is not, it can be converted into one by merging all vertices of each strongly connected component into a single vertex, since the vertices in a strongly connected component can all reach each other. In this work, we also assume the graph G is a DAG.

If (u, v) is an edge in DAG G, we say u is a parent of v, and v is a child of u. For any vertex \(u\in V_G\), we use parent(u, G) and child(u, G), respectively, to denote the set of parents of u and the set of children of u in G. We also use anc(u, G) and des(u, G) to denote the set of ancestors of u and the set of descendants of u in G, respectively. When G is clear from the context, we use the abbreviations parent(u), child(u), anc(u), and des(u), respectively.

Let M be a subset of vertices in G. For any vertex \(u\in M\) and a parent vertex \(u'\) of u, we say \(u'\) is an external parent of u (with respect to M) if \(u'\in parent(u)-M\). Similarly, we define an external child (resp. ancestor, descendant) of u with respect to M as a vertex in \(child(u)-M\) (resp. \(anc(u)-M\), \(des(u)-M\)).

3.1 Redundant Edges

Suppose (u, v) is an edge in G. If there is a path of length greater than 1 from u to v, then (u, v) is redundant for reachability queries; that is, removing (u, v) from G will not affect the answer to any reachability query.
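For illustration only, redundant edges can be stripped by the quadratic-time sketch below. This is not the optimized transitive reduction algorithm of [22]; function names are our own. It relies on the fact that for a DAG the transitive reduction is unique, so greedily testing and dropping edges in any order reaches it:

```python
def _reach(adj, s, t):
    """Iterative DFS: is t reachable from s?"""
    stack, seen = [s], set()
    while stack:
        x = stack.pop()
        if x == t:
            return True
        if x in seen:
            continue
        seen.add(x)
        stack.extend(adj.get(x, []))
    return False

def remove_redundant_edges(adj):
    """Drop every edge (u, v) for which another u-to-v path exists."""
    reduced = {u: list(vs) for u, vs in adj.items()}
    for u in list(reduced):
        for v in list(adj[u]):
            reduced[u].remove(v)          # tentatively drop (u, v)
            if not _reach(reduced, u, v):
                reduced[u].append(v)      # (u, v) was essential; restore
    return reduced
```

On the triangle 1 → 2 → 3 with shortcut 1 → 3, the shortcut is removed because the path 1 → 2 → 3 survives.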

The redundant edges can be efficiently identified and removed by a transitive reduction algorithm proposed in [22]. The following lemma is shown in [22]:

Lemma 1

Suppose G is a DAG without redundant edges. Then for any two vertices u and v in G, \(parent(u) = parent(v)\) if and only if \(anc(u) = anc(v)\), and \(child(u) = child(v)\) if and only if \(des(u) = des(v)\).

3.2 Equivalence Class

Two vertices u and v are said to be equivalent if they have the same ancestors and the same descendants, that is, \(anc(u) = anc(v)\) and \(des(u) = des(v)\) [7]. By Lemma 1, if G does not have redundant edges, then u and v are equivalent if and only if they have the same parents and the same children. The equivalent vertices form an equivalence class. It is easy to see that all vertices in the same equivalence class have the same reachability properties: if u is in an equivalence class, then for any other vertex \(u'\), u can reach \(u'\) (resp. u is reachable from \(u'\)) if and only if every vertex v in the same equivalence class can reach \(u'\) (resp. is reachable from \(u'\)).
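For a DAG with no redundant edges, Lemma 1 lets us find the equivalence classes by grouping vertices on their (parent set, child set) signature instead of computing ancestors and descendants. The sketch below is our own illustration of this grouping:

```python
from collections import defaultdict

def equivalence_classes(adj):
    """Group the vertices of an irredundant DAG by identical parent
    and child sets; by Lemma 1 this equals grouping by identical
    ancestor and descendant sets."""
    parents = defaultdict(set)
    children = defaultdict(set)
    vertices = set(adj)
    for u, vs in adj.items():
        for v in vs:
            children[u].add(v)
            parents[v].add(u)
            vertices.add(v)
    groups = defaultdict(list)
    for v in sorted(vertices):
        groups[(frozenset(parents[v]), frozenset(children[v]))].append(v)
    return list(groups.values())
```

In the diamond 1 → 2 → 4, 1 → 3 → 4, vertices 2 and 3 share parents {1} and children {4}, so they land in one class.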

Also, as observed in [22], if G has no redundant edges, then all vertices in an equivalence class form an independent set; that is, there are no edges between vertices in the same equivalence class.

Lemma 2

Suppose G is a DAG without redundant edges. Then every equivalence class is an independent set.

3.3 Modular Decomposition

The modular decomposition [14] of a directed graph G partitions the vertex set into a hierarchy of modules, where a module is conventionally defined as follows.

Definition 1

Let M be a set of vertices in G. We call M a module of G if all vertices in M share the same external parents and the same external children. In other words, for any \(u,v\in M\), \(parent(u)-M =parent(v)-M\) and \(child(u)-M =child(v)-M\).

It is easy to see that a singleton set is a module and that the set of all vertices in G is also a module; these modules are called trivial modules. Let G be a DAG that has no redundant edges. By Lemma 1, an equivalence class is also a module, and by Lemma 2, such a module is an independent set. In the literature, modules that are independent sets are referred to as parallel modules.

4 Overview of Our Approach

The basic idea of our method is to compress the graph without losing reachability information. We use modular decomposition; however, the definition of modules has been slightly modified from that found in the literature, in order to better support reachability queries.

Definition 2

A module in a DAG G is a set of vertices \(M\subseteq V_G\) that have identical external ancestors and identical external descendants. In other words, for any two vertices \(u,v\in M\), \(anc(u)-M\) = \(anc(v)-M\), and \(des(u)-M\) = \(des(v)-M\).

Fig. 1 Example DAG

Figure 1 shows an example DAG, where the vertices 1, 2, 3, 4 have the same external ancestors and the same external descendants, but not the same external parents or the same external children.

In this work, we are interested in two special types of modules, referred to as parallel modules and linear modules, respectively. A parallel module is a module that is an independent set, and a linear module is one that consists of a chain of vertices \(v_1, \ldots , v_k\) such that there is an edge \((v_i, v_{i+1})\) for all \(i\in [1,k-1]\). These modules have the following properties.

Lemma 3

Suppose G is a DAG that does not have redundant edges. (1) If M is a parallel module of G, then all vertices in M have the same parents and same children. (2) If M is a linear module consisting of the chain \(v_1, \ldots , v_k\), then for each \(i\in [2,k]\), \(v_{i-1}\) is the only parent of \(v_i\), and \(v_i\) is the only child of \(v_{i-1}\).

Proof

(1) Let M be a parallel module. By definition, M is an independent set, and all its vertices have the same external ancestors and the same external descendants. Since M is an independent set, no vertex in M can have an ancestor or descendant in M; therefore, all the vertices have the same ancestors and the same descendants (both external and internal). By Lemma 1, all vertices in M have the same parents and the same children.

(2) Let M be a linear module consisting of the chain \(v_1, \ldots , v_k\). For any \(i\in [2,k]\), suppose \(v_i\) has a parent u other than \(v_{i-1}\); there are two possible cases. The first case is that u is also in M. If u is one of \(v_{i+1}, \ldots , v_k\), there is a cycle, contradicting the assumption that G is a DAG; if u is one of \(v_1, \ldots , v_{i-2}\), the edge \((u, v_i)\) is redundant because of the path from u to \(v_i\) along the chain, contradicting the assumption that G has no redundant edges. The second case is that u is not in M. In this case, by the definition of a module, u must be an ancestor of \(v_1\), so there is a path from u to \(v_i\) of length at least 2. Hence, the edge \((u, v_i)\) would be redundant, again a contradiction. This proves that \(v_{i-1}\) is the only parent of \(v_i\). Similarly, we can prove that \(v_i\) is the only child of \(v_{i-1}\). \(\square \)
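The characterization in Lemma 3 is straightforward to check directly. The predicates below are our own sketch; they take precomputed parent/child sets and test the two properties for a candidate vertex set in an irredundant DAG:

```python
def is_parallel_candidate(M, parents, children):
    """Lemma 3(1): all vertices share the same parent set and the
    same child set."""
    M = list(M)
    return all(parents[v] == parents[M[0]] and
               children[v] == children[M[0]] for v in M[1:])

def is_linear_candidate(chain, parents, children):
    """Lemma 3(2): consecutive chain vertices are linked by
    sole-parent / sole-child edges."""
    return all(parents[b] == {a} and children[a] == {b}
               for a, b in zip(chain, chain[1:]))
```

For instance, in the diamond 1 → {2, 3} → 4, the set {2, 3} passes the parallel test, while a path 1 → 2 → 3 passes the linear test for the chain (1, 2, 3) but not for any other ordering.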

In Fig. 2a, the vertices v1, v2, v3 form a parallel module. In Fig. 2b, the vertices v1, v2, v3 form a linear module. Note, however, that the set \(\{v4, v1, v2, v3, v6\}\) in Fig. 2b is not a linear module.

Fig. 2 Example of modules

It is worth noting that each single vertex forms a parallel module as well as a linear module. These, along with the module consisting of all of the vertices in G, are referred to as trivial modules. According to Lemma 3, if G is a DAG that has no redundant edges, a parallel module is an equivalence class.

Note that if two vertices are in the same linear module, then their reachability depends on their relative positions in the chain. If they are in the same parallel module, then they cannot reach each other, as shown in the lemma below.

Lemma 4

Let G be a DAG without redundant edges, and let u, v be vertices in the same parallel module of G. Then u cannot reach v in G.

Proof

Let M be the parallel module that contains u and v. If the lemma were not true, there would be a path \(u, v_1, \ldots , v_s, v\) from u to v. Since M is an independent set, \(v_1\) and \(v_s\) cannot be in M. Hence, \(v_1\) is an external child of u, and \(v_s\) is an external parent of v. By Lemma 3 and the definition of modules, \(v_1\) must also be a child of v, and \(v_s\) must also be a parent of u. Therefore, there would be a cycle, contradicting the assumption that G is a DAG. \(\square \)

A vertex may belong to multiple parallel (or linear) modules; e.g., in the graph shown in Fig. 2a, \(\{v1,v2\}\) and \(\{v1,v2,v3\}\) are both parallel modules. However, we are only interested in the maximal modules, as defined below.

Definition 3

A parallel (resp. linear) module M is said to be maximal if there is no other parallel (resp. linear) module \(M'\) such that \(M\subset M'\).

For example, \(\{v1,v2, v3\}\) is a maximal parallel module in Fig. 2a, and it is a maximal linear module in Fig. 2b.

Note that two distinct maximal parallel modules of G cannot overlap, and two distinct maximal linear modules cannot overlap. Furthermore, a non-trivial parallel module and a non-trivial linear module cannot share a common vertex. In other words, each vertex belongs to at most one non-trivial parallel or linear module.

4.1 Multilevel Compression and Modular Decomposition Tree

To utilize parallel and linear modules in reachability search, we perform a multilevel compression of the original graph G. First, we identify the maximal linear and parallel modules, and merge the vertices of each module into a single super vertex. We add an edge from super vertex \(s_1\) to super vertex \(s_2\) if and only if there exist \(u\in s_1\) and \(v\in s_2\) such that (u, v) is an edge in G. In this way, we obtain the first-level compressed graph \(G_1 = Compress(G)\). Clearly, \(G_1\) is also a DAG without redundant edges. Then, we apply the same compression process to \(G_1\) to obtain the next-level compressed graph \(G_2=Compress(G_1)\), and this process is repeated until we obtain a graph \(G_c\) that can no longer be compressed, i.e., \(G_c\) has no non-trivial parallel or linear modules.
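The level-by-level iteration can be sketched as follows, assuming helper routines find_modules (returning the maximal non-trivial parallel/linear modules of a graph) and compress_once (merging each module into a super vertex) are given; both names are our own:

```python
def multilevel_compress(g, find_modules, compress_once):
    """Produce the sequence G_0 = G, G_1, ..., G_c, stopping as soon
    as the current graph has no non-trivial parallel or linear
    module (the fixed point G_c)."""
    levels = [g]
    while True:
        modules = find_modules(levels[-1])
        if not modules:          # no non-trivial module: reached G_c
            return levels
        levels.append(compress_once(levels[-1], modules))
```

The last element of the returned list is the final compressed graph \(G_c\); the intermediate elements give the per-level modules needed to build the decomposition tree described next.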

Fig. 3 a A DAG G and b the DAG \({\tilde{G}}\) after transitive reduction

Example 1

Consider the DAG G in Fig. 3a, which consists of eleven vertices numbered 1 to 11. The graph is reduced to \({\tilde{G}}\) in Fig. 3b after transitive reduction. We will apply our compression to graph \({\tilde{G}}\).

There are no parallel modules in \({\tilde{G}}\). However, vertices 2, 3 and 4 form a maximal linear module, and another maximal linear module of \({\tilde{G}}\) consists of vertices 5, 6 and 7. So, vertices 2, 3, 4 and vertices 5, 6, 7 are compressed into the super nodes LS1 and LS2, respectively, in graph \(G_1\), shown in Fig. 4a, after the first-level compression. Then, \(G_1\) is compressed again to obtain \(G_2\), as shown in Fig. 4b, where the nodes LS1, LS2 and 8 of \(G_1\) are merged, as they form an equivalence class in \(G_1\). The third-level compression creates graph \(G_3\) in Fig. 4c by merging nodes 1 and IS1 of \(G_2\), which form a linear module. The graph \(G_3\) does not contain any non-trivial parallel or linear modules and thus cannot be compressed further. So, \(G_3\) is the final compressed graph of the data graph G.

Fig. 4 a Graph \(G_1\) after first-level compression, b graph \(G_2\) after second-level compression and c graph \(G_3\) after final compression. LS denotes linear module, and IS denotes parallel module

Fig. 5 The modular decomposition tree T of graph \({\tilde{G}}\)

We organize the modules at all levels of the compressed graphs into a tree structure, called the modular decomposition tree (or decomposition tree for brevity), as follows. The root of the tree is the final compressed graph \(G_c\). Each module in the previous-level compressed graph \(G_{c-1}\) is a child node of the root. Each child node of the root that corresponds to a non-trivial module of \(G_{c-1}\), in turn, has its own children, representing modules in the previous-level graph \(G_{c-2}\). This continues until we reach the nodes representing modules in the first-level compressed graph, where each non-trivial module points to its children, which are individual vertices of the original graph G. Note that the leaf nodes of the tree are individual vertices of the original graph G. Also, to help reachability detection, we record the positions of the vertices in the chain of each linear module in a compressed graph \(G_i\): if the starting vertex has position 1, the next vertex has position 2, and so on. We use pos(v, LS) to denote the position of node v in the chain of LS. Obviously, for \(u,v\in LS\), \(u\rightsquigarrow _{G_i} v\) if and only if \(pos(u,LS) < pos(v, LS)\).

Figure 5 shows the modular decomposition tree, T, of graph \({\tilde{G}}\) in Fig. 3b. Let M be a non-leaf node in the decomposition tree of G. By definition, M is either a parallel or linear module in some compressed graph \(G_i\) \((i<c)\), or it is the final compressed graph \(G_c\). M can be regarded as a set of the original vertices of G in the obvious way. Put another way, we say vertex \(v\in G\) belongs to (or is in) M if v is a descendant of M in the decomposition tree. For example, the vertices 2, 3, 4, 5, 6, 7, 8 belong to the module IS1 in Fig. 5.

We have the following observations about modules in the decomposition tree:

Lemma 5

The vertices of G that belong to each parallel or linear module M in \(G_i\) \((i<c)\) form a module of G. In other words, all vertices in M have the same external ancestors, as well as the same external descendants, in G.

The above lemma can be easily proved by induction on the compression level i. Using Lemma 5, we can easily see:

Lemma 6

Given two distinct nodes \(N_1\) and \(N_2\) in \(G_{i}\) \((i\le c)\), \(N_1\rightsquigarrow _{G_i} N_2\) iff \(u\rightsquigarrow _{G} v\) for every pair of vertices \(u\in N_1\), \(v\in N_2\).

4.2 Answering Reachability Queries Using Modular Decomposition Tree

Suppose we have the decomposition tree T of G. For ease of presentation, let us use \(G_0\) to denote the graph G. For any pair of vertices u, v in the original graph G, we use LCA(u, v) to denote the lowest common ancestor of u and v in T. Note that LCA(u, v) corresponds either to a module in some compressed graph \(G_i\) \((i\in [1, c-1])\), or to the final compressed graph \(G_c\) (i.e., the root of T). We have the following result.

Theorem 1

Given two vertices \(u,v\in V_G\), if LCA(u, v) corresponds to a parallel module of some graph \(G_i\) \((i\in [1,c-1])\), then u cannot reach v in G.

Proof

If LCA(u, v) corresponds to a parallel module M of \(G_i\), then suppose \(N_1\) and \(N_2\) are the two vertices of \(G_i\) that contain u and v, respectively. By Lemma 4, \(N_1\) cannot reach \(N_2\) in \(G_i\). Then, by Lemma 6, u cannot reach v in G. \(\square \)

With the above discussion, we are ready to present the method for answering reachability queries using the decomposition tree and the final compressed graph \(G_c\). Given two vertices \(u,v\in V_G\), to determine whether \(u \rightsquigarrow _G v\), we find the lowest common ancestor LCA(u, v) of u and v and check the following:

  1. If LCA(u, v) is a parallel module, then u cannot reach v in G, by Theorem 1.

  2. If LCA(u, v) is a linear module, say M, then we check the positions of \(N_1\) and \(N_2\) in the corresponding chain of vertices in M, where \(N_1\) is the child of LCA(u, v) in the decomposition tree that contains u, and \(N_2\) is the child of LCA(u, v) that contains v; u can reach v in G if and only if \(pos(N_1,M)< pos(N_2,M)\).

  3. If LCA(u, v) is the root of T, namely \(G_c\), then suppose \(N_1,N_2\) are the children of LCA(u, v) that contain u and v, respectively. Then, \(u\rightsquigarrow _G v\) if and only if \(N_1\rightsquigarrow _{G_c} N_2\). Thus, we only need to check whether \(N_1\rightsquigarrow _{G_c} N_2\), which can be done using any existing reachability algorithm. Since \(G_c\) is usually much smaller than G, checking \(N_1\rightsquigarrow _{G_c} N_2\) in \(G_c\) is likely to be faster than checking \(u \rightsquigarrow _G v\) in G.
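The three cases can be combined into a single query routine over the decomposition tree. The sketch below is our own illustration (u and v are assumed to be distinct leaves); it encodes the tree with a parent map, a node-kind map for internal nodes, and chain positions for the children of linear modules, and takes any off-the-shelf reachability tester for \(G_c\):

```python
def query(u, v, parent, kind, pos, reach_gc):
    """Answer ?u ~> v using the decomposition tree.

    parent[n]: parent node in the tree (None for the root)
    kind[n]  : 'parallel', 'linear' or 'root' (internal nodes only)
    pos[n]   : position of n in its parent's chain (linear parents only)
    reach_gc : any reachability test on the final compressed graph G_c
    """
    # Record, for every ancestor a of u, the child of a on u's path.
    child_on_u_path = {}
    n = u
    while n is not None:
        child_on_u_path[parent[n]] = n
        n = parent[n]
    # Climb from v until the parent is a common ancestor: that is the LCA.
    n = v
    while parent[n] not in child_on_u_path:
        n = parent[n]
    lca, n1, n2 = parent[n], child_on_u_path[parent[n]], n
    if kind[lca] == 'parallel':
        return False                  # case 1 (Theorem 1)
    if kind[lca] == 'linear':
        return pos[n1] < pos[n2]      # case 2: order along the chain
    return reach_gc(n1, n2)           # case 3: LCA is the root G_c
```

For a toy tree with root R over a linear module L (children a before b), a parallel module P (children c, d) and a plain vertex w, the routine returns true for a → b, false for b → a and c → d, and defers a → w to the \(G_c\) tester.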

Example 2

Consider the decomposition tree T shown in Fig. 5.

  (1) For the query \(?2 \rightsquigarrow _G 6\), we find that the lowest common ancestor of vertices 2 and 6 is a parallel module; therefore, vertex 2 cannot reach vertex 6.

  (2) For the query \(?2 \rightsquigarrow _G 4\), we find that LCA(2, 4) is a linear module, and the position of vertex 2 is before that of vertex 4. Therefore, we conclude that \(2\rightsquigarrow _G 4\).

  (3) For the query \(?2 \rightsquigarrow _G 9\), since 2 and 9 are in different children of the root, i.e., LS3 and 9, respectively, we only need to check whether \(LS3\rightsquigarrow _{G_3} 9\).

5 Algorithms

The previous section provides the main ideas of our approach. This section presents the detailed algorithms.

5.1 Building Modular Decomposition Tree

Algorithm 1 shows the process of creating the modular decomposition tree along with the final compressed graph. The algorithm takes as input a DAG G that has no redundant edges, and returns the modular decomposition tree and the final compressed graph. It first creates a tree with a root node r. Starting with a random vertex v, the algorithm first tries to find all other vertices that can form a linear module with v (Line 7). If no such module is found, it searches for a maximal parallel module containing v (Line 14). If no such module can be found either, v is added as a child of r (Line 21); otherwise, the found module M is added as a child of r, and each vertex in the module is added as a child of M (Lines 8–12, 16–19). We record all such modules in S (Lines 9, 16) and use them to compress the graph into a new graph (Line 24). Then, we recursively call the algorithm to compress the new graph (Line 27). If no non-trivial module is found in the current graph, the current tree T is returned, and the current graph is returned as \(G_c\).


The functions FindParallelModule() and FindLinearModule() used in Algorithm 1 are shown in Algorithm 2 and Algorithm 3, respectively. These algorithms find the relevant modules based on Lemma 3.

Algorithm 2 takes a vertex v and a DAG G as input, and finds the set M of vertices that share the same parents and the same children as v. To do so, it first finds the set \(M_1\) of vertices that share the same parents as v, and then the set \(M_2\) of vertices that share the same children as v. Then, M is the intersection of \(M_1\) and \(M_2\).
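A sketch of this step (our own Python rendering, not the authors' pseudocode) using precomputed parent/child sets; it assumes v has at least one parent and one child, since candidates are drawn only from the children of v's parents and the parents of v's children:

```python
def find_parallel_module(v, parents, children):
    """Vertices sharing both the parent set and the child set of v.

    The cost is bounded by sum_{u in parents(v)} |children(u)| +
    sum_{u in children(v)} |parents(u)|, matching the complexity
    analysis of Algorithm 2.
    """
    m1 = {w for p in parents[v] for w in children[p]
          if parents[w] == parents[v]}
    m2 = {w for c in children[v] for w in parents[c]
          if children[w] == children[v]}
    return m1 & m2  # includes v itself
```

In the diamond 1 → {2, 3} → 4, calling it on vertex 2 returns {2, 3}.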


Algorithm 3 takes a vertex v and a DAG G as input. It first searches for a possible chain of ancestors of v (lines 2–12), and then searches for a chain of descendants of v (lines 13–26). Both parts work iteratively. Specifically, lines 2 and 4 check whether v has a sole parent \(v'\) and \(v'\) has a sole child v; if so, \(v'\) is the parent of v in a linear module. After that, lines 7–12 try to extend the chain by finding a parent of its first vertex, one vertex at a time. In the process, a position number is assigned to each vertex found (line 10). Note that the positions do not have to be positive numbers; any numbering that provides the appropriate order of the vertices in the chain suffices. Lines 13–26 work similarly.
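The chain-growing procedure can be sketched as follows (our own illustration). Like Algorithm 3, it assigns positions relative to the starting vertex, so positions may be zero or negative, which is fine as long as the order is preserved:

```python
def find_linear_module(v, parents, children):
    """Grow the maximal chain through v along sole-parent/sole-child
    edges; return a map from each chain vertex to its position."""
    chain = {v: 0}
    head = tail = v
    # Extend upward while head has a sole parent whose sole child is head.
    while len(parents[head]) == 1:
        (p,) = parents[head]
        if len(children[p]) != 1:
            break
        chain[p] = chain[head] - 1
        head = p
    # Extend downward symmetrically.
    while len(children[tail]) == 1:
        (c,) = children[tail]
        if len(parents[c]) != 1:
            break
        chain[c] = chain[tail] + 1
        tail = c
    return chain
```

On the path 1 → 2 → 3 → 4, starting from vertex 2 yields positions {1: -1, 2: 0, 3: 1, 4: 2}; only the relative order of the positions matters.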

5.1.1 Complexity

Algorithm 3 takes L steps to find the linear module containing v (checking \(|parent(v)|=1\) amounts to checking the in-degree of v), where L is the size of that linear module. Algorithm 2 takes \(\Sigma _{u\in parent(v)} |child(u)|\) + \(\Sigma _{u\in child(v)} |parent(u)|\) steps to find the vertices that share the same parents and the same children with v. If we use \(I_{\max }\) and \(O_{\max }\) to denote the maximum in-degree and the maximum out-degree, respectively, then Algorithm 2 takes \(O(I_{\max }\times O_{\max })\) time. In Algorithm 1, for the first-level compression, we visit each vertex in \(V_G\) that has not been put into a module, and once a vertex is visited or put into a module, it is not visited again. In the worst case, no non-trivial module exists, so every vertex is visited. Therefore, the first-level compression takes \(O(|V|\times I_{\max } \times O_{\max })\) time. Each subsequent level takes no more time than the previous one. Therefore, Algorithm 1 takes \(O(|V|\times I_{\max } \times O_{\max }\times h)\) time, where h is the height of the decomposition tree. In practice, h is usually small; for instance, among the 12 datasets in our experiments, the highest decomposition tree has a height of only 26.

5.2 Finding Reachability Using the Decomposition Tree

As discussed in the previous section, to answer a reachability query \(?u\rightsquigarrow _G v\) using the decomposition tree, we only need to find LCA(u, v) and then take the appropriate action depending on its type. To save both the time for finding the LCA and storage space, we design a slightly modified algorithm, shown in Algorithm 4. Given vertices u and v, we first find the children of the root that u and v belong to; suppose \(u\in N_1\) and \(v\in N_2\). If \(N_1\) and \(N_2\) are different (which is equivalent to saying that LCA(u, v) is the root), we use an existing algorithm to check \(?N_1\rightsquigarrow _{G_c} N_2\). Otherwise, we find the lowest linear LCA of u and v (note that to do so we only need to record the linear-module ancestors of each vertex); if no such LCA exists, then u cannot reach v. Otherwise, let the lowest linear LCA be M; we check the relative positions of u and v in M to determine whether u can reach v. Here, the position of u is defined to be the position of the child of M that contains u. If LCA(u, v) is a parallel module, that module lies within a single child of M, so u and v have the same position in M and thus u cannot reach v. It is easy to see that the algorithm is equivalent to the process described in Sect. 4; hence, its correctness is guaranteed.


5.2.1 Size of the Decomposition Tree

To answer reachability queries on the original graph G, we only need the final compressed graph \(G_c\) and the decomposition tree T. The total number of nodes in the tree is \(|V|+m+1\), where m denotes the number of non-trivial modules, and the number of edges in T is \(|V|+m\). The tree T can be regarded as an additional index. As shown in Algorithm 4, in the implementation we do not need to store the entire modular decomposition tree; instead, we only need to store, for each vertex, the sequence of linear modules it belongs to and the child module of the root (i.e., the node of \(G_c\)) it belongs to.

6 Experiments

In this section, we present our experimental results. We compare our compression scheme with DAG reduction [22] on the performance of four state-of-the-art reachability query algorithms: Grail [21], Feline [18], \(\hbox {IP}^+\) [19] and \(\hbox {BFL}^+\) [16], measuring query time as well as index size and index construction time. We also report the compression time of our approach.

6.1 Experimental Setup

6.1.1 Implementation and Running Environment

We obtained the source code of DAG reduction, Grail, \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\) from the authors; all are written in C++. We implemented our reachability query processing algorithms, which connect with Grail, \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\), in C++ using the G++ 7.3.0 compiler. Our multilevel compression algorithm was implemented in C# using Visual Studio 2017 (since compression is done offline). The experiments were run on a PC with an Intel Core i7-7700 3.60 GHz CPU, 32 GB memory and the Windows 10 operating system.

6.1.2 Datasets and Queries

We tested our approach with 12 real datasets and eight synthetic datasets. For each dataset, we first applied the transitive reduction in [22] to find \({\tilde{G}}\), which is a DAG without redundant edges. Then, we applied our multilevel compression algorithm to get \(G_c\), and used DAG reduction to generate \(G^\epsilon \). We used Grail, \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\) to process reachability queries over \(G_c\) and over \(G^\epsilon \). We randomly generated 100,000 reachability queries for each data graph, and each query was run ten times using our compression scheme and that of [22]; the average time is recorded.

6.2 Experiments on Real Datasets

Table 1 Real-world datasets and their compression ratio after DAG reduction and multilevel compression

6.2.1 Datasets

We used 12 real datasets: Kegg\(^1\), arXiv\(^1\), XMark\(^1\), PubMed\(^1\), Patent\(^1\), Citeseerx\(^1\), soc-Epinions\(^2\), Web\(^2\), LJ\(^2\), 05Patent\(^3\), 05Citeseerx\(^3\) and DBpedia\(^4\). Among these datasets, Kegg and XMark are very small graphs. Datasets arXiv, PubMed, soc-Epinions, Web and LJ are of medium size, whereas the other five graphs can be considered large. Here, Kegg is a metabolic network, and XMark is an XML document. Datasets soc-Epinions and LJ are online social networks, and Web is the web graph from Google. arXiv, PubMed, Patent, 05Patent, 05Citeseerx and Citeseerx are all citation networks, and DBpedia is a knowledge graph. The statistics of these datasets are shown in the first two columns of Table 1.

Table 2 Graph size before and after compression

6.2.2 Compression Ratio

The compression ratios of transitive reduction, DAG reduction (i.e., transitive reduction plus equivalence reduction) and our multilevel compression are shown in Table 1. From the table, we can see that our approach achieves higher compression than DAG reduction for every graph. The dataset XMark has the best result, with 30.1% more vertex compression and 26% more edge compression than DAG reduction. Among the larger graphs, DBpedia shows the best result, with 6.6% more vertex compression and 2.8% more edge compression. On the other hand, our compression scheme is only slightly better than DAG reduction on the Citeseerx and Patent datasets, possibly because these datasets do not contain many linear modules. Generally, the reduction ratio depends on the structure of the graph. However, even a small percentage of additional compression can have a great impact on query processing for a large graph, since it corresponds to a large absolute number of vertices and edges (see the Patent dataset in Table 7, for example).

Table 3 Size and storage space for decomposition tree

Table 2 shows the size of G, \(G^\epsilon \) and \(G_c\). Here, we calculated the size of the graph as the sum of the number of vertices and the number of edges.

The second column of Table 3 shows the number of nodes in the modular decomposition tree for each data graph, which is calculated as \(|V| + m + 1\), where |V| is the number of vertices in the data graph (these are the leaf nodes of the tree) and m is the number of non-trivial modules in the tree. The number of edges is \(|V| + m\). As discussed in Sect. 5.2, we do not need to store the entire decomposition tree; we only need to store, for each vertex, its linear-module ancestors and the corresponding node in \(G_c\). The required storage space is shown in the third column. DAG reduction also needs to store the equivalence class each vertex belongs to; this storage size is shown in the fourth column. As can be seen, our approach needs more storage space, but the difference is small. If we add this space and the index size (see Table 6) together, our approach needs less space overall.

Table 4 Compression time (s)

Table 4 shows the time required for building the decomposition tree using our algorithms, which are implemented in C#; the DBpedia dataset took the most time. As the indexing is done offline, we consider these times viable in practice.

6.2.3 Index Construction Time

Table 5 compares the index construction times of the Grail, \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\) algorithms over \(G^\epsilon \) and \(G_c\); the better results are highlighted in bold in the table. For \(\hbox {IP}^+\) and \(\hbox {BFL}^+\), multilevel compression requires less index construction time on every graph. For Feline, we also obtain better results on every graph except 05Patent, and for Grail, multilevel compression performs better on every graph except 05Citeseerx and Citeseerx.

Table 5 Index construction time (ms)

6.2.4 Index Size

The index sizes of Grail, \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\) for \(G^\epsilon \) and \(G_c\) are shown in Table 6. From the table, we can see that for all four algorithms the index over \(G_c\) is smaller than over \(G^\epsilon \) for almost every graph, although for the arXiv, PubMed, Citeseerx and Patent datasets the difference is very small, and for PubMed the Feline index is actually smaller over \(G^\epsilon \) than over \(G_c\). This is not surprising, because the sizes of \(G_c\) and \(G^\epsilon \) are very close for these datasets.

Table 6 Index size (MB)

6.2.5 Query Performance

Table 7 compares the query times of Grail, \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\). We ran each query ten times, and the time shown is the average of the ten runs. We can see that our compression outperforms DAG reduction in query processing for almost every graph. Surprisingly, \(\hbox {IP}^+\) is slower using our approach than using DAG reduction on the Kegg dataset, Feline is slower on LJ, and \(\hbox {BFL}^+\) is slower on Patent. For all other datasets, the performance of multilevel compression is much better than that of DAG reduction.

Table 7 Query time (ms)

6.3 Experiments on Synthetic Datasets

We generated eight random graphs, where the smallest graph has one thousand vertices and two thousand edges, and the largest has ten million vertices and 15 million edges. The other graphs have vertex and edge counts between these two.

Table 8 Datasets and their compression ratio after ER reduction and multilevel compression along with the compression time for multilevel compression for synthetic datasets

Table 8 shows the graph profiles and their compression ratios for DAG reduction and multilevel compression. The table also shows the time required to compress each of the graphs using multilevel compression. Multilevel compression can process the largest graph, with ten million vertices and 15 million edges, within a few hours, which demonstrates the scalability of the method.

Fig. 6 Vertex and edge compression for \(G^\epsilon \) and \(G_c\)

6.3.1 Compression Ratio

Figure 6 compares the compression ratios of DAG reduction and multilevel compression. We can see from the figure that our method achieves more than 15% more vertex compression and more than 10% more edge compression on every graph except Data1 and Data3; even on these two, multilevel compression still yields a better compression ratio.

Fig. 7 Index construction time (ms) comparison for \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\)

Fig. 8 Index size (MB) comparison for \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\)

6.3.2 Index Construction Time

We tested the index construction time of the three reachability algorithms \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\). For Grail, we experienced errors, so we omit it here. Figure 7 shows the index construction time of each algorithm for both \(G^\epsilon \) and \(G_c\); index constructions that could not be completed are omitted from the figure. We can see from the figure that the index construction time is much better with multilevel compression for all three algorithms and all datasets, except for Feline on Data5.

6.3.3 Index Size

Figure 8 compares the index sizes of \(\hbox {IP}^+\), Feline and \(\hbox {BFL}^+\). The index size is much smaller for \(G_c\) than for \(G^\epsilon \) on every graph and for all three algorithms. This is expected, as \(G_c\) is smaller than \(G^\epsilon \).

6.3.4 Query Time

Table 9 shows the query time comparison, where the query set contains 100,000 random queries for each dataset. A "–" in the table indicates that the algorithm ran out of memory on that dataset. From the table, only \(IP^+\) was able to process all of the datasets, whereas Feline and \(\hbox {BFL}^+\) can process up to Data5 for \(G^\epsilon \) and up to Data6 for \(G_c\). For Feline, \(G_c\) shows significantly better performance than \(G^\epsilon \); surprisingly, Data4 is the only dataset on which \(G^\epsilon \) is better than \(G_c\), which may be due to the graph structure and the query set. For both \(IP^+\) and \(BFL^+\), the query time is much smaller over \(G_c\) than over \(G^\epsilon \) on all of the datasets.

Table 9 Query time (ms) for synthetic datasets

7 Discussion on k-Hop Reachability

Given two vertices \(v_1\) and \(v_2\) in a directed graph G and an integer k, a k-hop reachability query asks whether \(v_1\) can reach \(v_2\) within k hops, that is, whether there is a path from \(v_1\) to \(v_2\) of length k or less [5]. Due to the restriction on the length of the path, it is generally not beneficial to convert G into a DAG by merging the vertices in strongly connected components. Furthermore, an edge from \(v_1\) to \(v_2\) is no longer redundant even if there is a path from \(v_1\) to \(v_2\) of length 2 or more. Therefore, two vertices that share the same ancestors (resp. descendants) do not necessarily share the same parents (resp. children). Thus, it appears that neither DAG reduction [22] nor the compression scheme in Sect. 5 can be applied directly to k-hop reachability.

However, in the real world, some graphs are naturally DAGs. For example, a paper citation network should have no cycles (two papers cannot, in principle, cite each other). Moreover, linear chains and vertices that share the same parents and the same children may naturally exist. Therefore, we can still use the compression scheme described in Sect. 5 to compress the graph. To make the compression useful for k-hop queries, for each linear chain we need to record the length of the chain as well as the positions of the vertices in the chain. Given a k-hop reachability query, we can answer it using a slightly modified version of Algorithm 5:

  1.

    If LCA(uv) is an independent set, then u cannot reach v, let alone reach v in k-hops.

  2.

    If LCA(uv) is a linear module M, let \(N_1\) be the child of M that contains u and \(N_2\) the child of M that contains v. If \(pos(N_1, M)<pos(N_2, M)\), then u can reach v, but we still need to compute the number of hops from u to v. To do that, we find all the linear modules on the path from u to M in the decomposition tree and all the linear modules on the path from v to M, and use the positions of the nodes containing u and v to compute the number of hops. For example, consider the decomposition tree in Fig. 5 and the two-hop reachability query from vertex 1 to vertex 7. The LCA of the two vertices is LS3, which is a linear module. The two children of LS3 that contain vertices 1 and 7 are the leaf node 1 and IS1, respectively. On the path from the leaf node 7 to LS3 there is another linear module, LS2, and the position of 7 in LS2 is 2; therefore, the chain from the first vertex of LS2 to vertex 7 has length 2. Since vertex 1 and IS1 have positions 0 and 1 in LS3, respectively, vertex 1 can reach IS1 in one hop; therefore, the path from vertex 1 to vertex 7 has a length of 3, and 1 cannot reach 7 in two hops. Instead of computing the exact path length in this step, we may use straightforward pruning rules to prune paths longer than k.

  3.

    If LCA(uv) is the root of the decomposition tree, we can use an existing algorithm to check whether there is a path from \(N_1\) to \(N_2\) of length no larger than k, and use a method similar to the above step to determine whether there is a path from u to v of length \(\le k\).
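Under the simplifying assumption that only linear-module ancestors contribute hops, the hop-count check of case 2 might be sketched as follows; the function name, the parameterization, and the generalization from the Fig. 5 example to nested chains are our own, not the authors':

```python
# Hedged sketch of the hop-count test for case 2 (our own reading of the
# worked example, not the authors' code).  Both u and v lie under the same
# linear LCA M; only linear-module ancestors are assumed to add hops.

def khop_reaches(pos_u_in_M, pos_v_in_M, k, u_inner_chains, v_inner_chains):
    """Return True iff u can reach v within k hops.

    pos_u_in_M / pos_v_in_M : positions in M of the children containing u, v
    u_inner_chains : (length, pos) for each linear module strictly between
                     u and M; leaving such a chain costs length-1-pos hops
    v_inner_chains : (length, pos) for each linear module strictly between
                     v and M; reaching v from the head of such a chain
                     costs pos hops
    """
    if pos_u_in_M >= pos_v_in_M:
        return False                    # u cannot reach v at all
    hops = pos_v_in_M - pos_u_in_M      # steps along M's own chain
    hops += sum(length - 1 - pos for length, pos in u_inner_chains)
    hops += sum(pos for _length, pos in v_inner_chains)
    return hops <= k
```

On the example of Fig. 5, vertex 1 sits at position 0 of LS3 with no inner chains, and vertex 7 sits under position 1 of LS3 inside LS2 (length 3, position 2), giving 1 + 0 + 2 = 3 hops, so the two-hop query is answered negatively.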

Clearly, k-hop reachability is more complicated, and the compression ratio may not be as large as for plain reachability queries. Whether the above approach is helpful for real graphs needs to be verified experimentally, and we leave this as future work.

8 Conclusion

We presented a method to compress a DAG without redundant edges using two types of modules, obtaining a decomposition tree. We showed how to use the decomposition tree to answer reachability queries over the original graph. Experiments show that for many real-world graphs as well as synthetic graphs, our method compresses the graph to a much smaller size than DAG reduction, reachability queries can be answered faster, and the index size can be smaller as well.