Chapter 1
DENSE COMPONENTS IN GRAPHS
Victor E. Lee, Ning Ruan, Ruoming Jin
Department of Computer Science
Kent State University
Kent, OH 44242
{vlee,nruan, jin}@cs.kent.edu
Charu Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
charu@us.ibm.com
1. Introduction
In almost any network, density is an indication of importance. Just as someone reading a road map is interested in knowing the location of the larger
cities and towns, investigators who seek information from abstract graphs are
often interested in the dense components of the graph. Depending on what
properties are being modeled by the graph’s vertices and edges, dense regions
may indicate high degrees of interaction, mutual similarity and hence collective
characteristics, attractive forces, favorable environments, or critical mass.
From a theoretical perspective, dense regions have many interesting properties. Dense components naturally have small diameters (worst-case shortest path between any two members). Routing within these components is rapid.
A simple strategy also exists for global routing. If most vertices belong to
a dense component, only a few selected inter-hub links are needed to have a
short average distance between any two arbitrary vertices in the entire network.
Commercial airlines employ this hub-based routing scheme. Dense regions are
also robust, in the sense that many connections can be broken without splitting
the component. A less well-known but equally important property of dense
subgraphs comes from percolation theory. If a graph is sufficiently dense, or
equivalently, if messages are forwarded from one node to its neighbors with
higher than a certain probability, then there is very high probability of propagating a message across the diameter of the graph [19]. This fact is useful in
everything from epidemiology to marketing.
Not all graphs have dense components, however. A sparse graph may have
few or none. This raises the question: how are the words ‘dense’ and ‘sparse’ defined? We will address this question shortly. A uniform graph is either
entirely dense or not dense at all. Uniform graphs, however, are rare, usually
limited to either small or artificially created ones. Due to the usefulness of
dense components, it is generally accepted that their existence is the rule rather
than the exception in nature and in human-planned networks [35].
Dense components have been identified in and have enhanced understanding
of many types of networks; among the best-known are social networks [48, 39],
the World Wide Web [28, 17, 11], financial markets [5], and biological systems [25]. Much of the early motivation, research, and nomenclature regarding
dense components was in the field of social network analysis. Even before the
advent of computers, sociologists turned to graph theory to formulate models
for the concept of social cohesion. Clique, K-core, K-plex, and K-club are
metrics originally devised to measure social cohesiveness [48]. It is not surprising that we also see dense components in the World Wide Web. In many ways,
the Web is simply a virtual implementation of traditional direct human-human
social networks.
Today, the natural sciences, the social sciences, and technological fields are
all using network and graph analysis methods to better understand complex
systems. Dense component discovery and analysis is one important aspect
of network analysis. Therefore, readers from many different backgrounds will
benefit from understanding more about the characteristics of dense components
and some of the methods used to uncover them.
In the next section, we outline the graph terminology and define the fundamental measures of density to be used in the rest of the chapter. Section 3 categorizes the algorithmic approaches and presents representative implementations in more detail. Section 4 expands the topic to consider frequently-occurring
dense components in a set of graphs. Section 5 provides examples of how these
techniques have been applied in various scientific fields. Section 6 concludes
the chapter with a look to the future.
2. Types of Dense Components
Different applications find different definitions of dense component to be
useful. In this section, we outline the many ways to define a dense component,
categorizing them by their important features. Understanding the features of the various types of components is valuable for deciding which type of component to pursue.
2.1 Absolute vs. Relative Density
We can divide density definitions into two classes, absolute density and
relative density. An absolute density measure establishes rules and parameters
for what constitutes a dense component, independent of what is outside the
component. For example, we could say that we are only interested in cliques,
fully-connected subgraphs of maximum density. Absolute density measures
can be viewed as relaxations of the pure clique measure.
On the other hand, a relative density measure has no preset level for what is
sufficiently dense. It compares the density of one region to another, with the
goal of finding the densest regions. To establish the boundaries of components,
a metric typically looks to maximize the difference between intra-component
connectedness and inter-component connectedness. Often but not necessarily,
relative density techniques look for a user-defined number k densest regions.
The alert reader may have noticed that relative density discovery is closely
related to clustering and in fact shares many features with it.
Since this book contains another chapter dedicated to graph clustering, we
will focus our attention on absolute density measures. However, we will have
more to say about the relationship between clustering and density at the end of
this section.
2.2 Graph Terminology
Let G(V, E) be a graph with |V | vertices and |E| edges. If the edges are weighted, then w(u, v) is the weight of edge (u, v). We treat unweighted graphs as the special case where all weights are equal to 1. Let S and T be subsets of V . For an undirected graph, E(S) is the set of induced edges on S:
E(S) = {(u, v) ∈ E |u, v ∈ S}. Then, HS is the induced subgraph (S, E(S)).
Similarly, E(S, T ) designates the set of edges from S to T . HS,T is the induced
subgraph (S, T, E(S, T )). Note that S and T are not necessarily disjoint from
each other. If S ∩ T = ∅, HS,T is a bipartite graph. If S and T are not disjoint
(possibly S = T = V ), this notation can be used to represent a directed graph.
A dense component is a maximal induced subgraph which also satisfies some density constraint. A component HS is maximal if no other subgraph of G that is a superset of HS would satisfy the density constraints. Table 1.1 defines some basic graph concepts and measures that we will use to define density metrics.
Table 1.1. Graph Terminology

G(V, E): graph with vertex set V and edge set E
HS: subgraph with vertex set S and edge set E(S)
HS,T: subgraph with vertex set S ∪ T and edge set E(S, T)
w(u, v): weight of edge (u, v)
NG(u): neighbor set of vertex u in G: {v | (u, v) ∈ E}
NS(u): only those neighbors of vertex u that are in S: {v ∈ S | (u, v) ∈ E}
δG(u): (weighted) degree of u in G: Σv∈NG(u) w(u, v)
δS(u): (weighted) degree of u in S: Σv∈NS(u) w(u, v)
dG(u, v): shortest (weighted) path from u to v traversing any edges in G
dS(u, v): shortest (weighted) path from u to v traversing only edges in E(S)
We now formally define the density of S, den(S), as the ratio of the total
weight of edges in E(S) to the number of possible edges among |S| vertices.
If the graph is unweighted, then the numerator is simply the number of actual
edges, and the maximum possible density is 1. If the graph is weighted, the
maximum density is unbounded.
The number of possible edges in an undirected graph of size n is C(n, 2) = n(n − 1)/2. We give the formulas for an undirected graph; the formulas for a directed graph differ by a factor of 2.

den(S) = 2|E(S)| / (|S|(|S| − 1))

denW(S) = 2 Σu,v∈S w(u, v) / (|S|(|S| − 1))
Some authors define density as the ratio of the number of edges to the number of vertices: |E|/|V |. We will refer to this as the average degree of S.
Another important metric is the diameter of S, diam(S). Since we have
given two different distance measures, dS and dG , we accordingly offer two
different diameter measures. The first is the standard one, in which we consider
only paths within S. The second permits paths to stray outside S, if it offers a
shorter path.
diam(S) = max{dS (u, v)| u, v ∈ S}
diamG (S) = max{dG (u, v)| u, v ∈ S}
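For concreteness, these measures are easy to compute directly from an edge list. The following Python sketch (the function and variable names are ours, not from the chapter) computes den(S) and diam(S) for an unweighted, undirected graph:

```python
from collections import deque

def density(edges, S):
    """den(S): ratio of induced edges to possible edges among |S| vertices."""
    S = set(S)
    m = sum(1 for (u, v) in edges if u in S and v in S)
    n = len(S)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def diameter(edges, S):
    """diam(S): longest shortest path using only edges within E(S)."""
    S = set(S)
    adj = {v: set() for v in S}
    for u, v in edges:
        if u in S and v in S:
            adj[u].add(v)
            adj[v].add(u)
    best = 0
    for src in S:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        best = max(best, max(dist.values()))
    return best

# A triangle {1,2,3} with a pendant vertex 4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(density(edges, {1, 2, 3}))      # 1.0 (a clique)
print(density(edges, {1, 2, 3, 4}))   # 4 edges / 6 possible = 0.666...
print(diameter(edges, {1, 2, 3, 4}))  # 2
```

Note that diameter() computes diam(S); computing diamG(S) would instead run the same breadth-first searches over the whole graph while taking the maximum only over pairs in S.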
2.3 Definitions of Dense Components
We now present a collection of measures that have been used to define dense
components in the literature (Table 1.2). To focus on the fundamentals, we assume unweighted graphs. In a sense, all dense components are either cliques, which represent the ideal, or some relaxation of the ideal. These relaxations fall into three categories: density, degree, and distance. Each relaxation can be quantified as either a percentage factor or a subtractive amount. While most of these definitions are widely recognized standards, the name quasi-clique has been applied to any relaxation, with different authors giving different
formal definitions. Abello [1] first defined the term in terms of overall edge
density, without any constraint on individual vertices. This offers considerable
flexibility in the component topology. Several later authors [?] have opted to
define quasi-clique in terms of minimum degree of each vertex. Li et al. [30]
provide a brief overview and comparison of quasi-cliques. When the authorship
of a specific metric can be traced, it is given. Our list is not exhaustive; however,
the majority of definitions can be reduced to some combination of density,
degree, and diameter.
Table 1.2. Types of Dense Components

Clique: (i, j) ∈ E for all i ≠ j ∈ S. Every vertex connects to every other vertex in S.

Quasi-Clique (density-based) [1]: den(S) ≥ γ. S has at least γ|S|(|S| − 1)/2 edges. Density may be imbalanced within S.

Quasi-Clique (degree-based) [30]: δS(u) ≥ γ(|S| − 1). Each vertex has at least γ percent of the possible connections to other vertices. Local degree satisfies a minimum. Compare to K-core and K-plex.

K-core [40]: δS(u) ≥ k. Every vertex connects to at least k other vertices in S. A clique is a (k−1)-core.

K-plex [41]: δS(u) ≥ |S| − k. Each vertex is missing no more than k−1 edges to its neighbors. A clique is a 1-plex.

Kd-clique [32]: diamG(S) ≤ k. The shortest path from any vertex to any other vertex is not more than k. An ordinary clique is a 1d-clique. Paths may go outside S.

K-club [34]: diam(S) ≤ k. The shortest path from any vertex to any other vertex is not more than k. Paths may not go outside S. Therefore, every K-club is a Kd-clique.
Note that in unweighted graphs, cliques have a density of 1. Quasi-cliques
are only defined for unweighted graphs. We use the term Kd-clique instead
of Mokken’s original name K-clique, because K-clique is already defined in
the mathematics and computer science communities to mean a clique with k
vertices.
Figure 1.1. Example Graph to Illustrate Component Types
Figure 1.1, a superset of an illustration from Wasserman and Faust [48],
demonstrates each of the dense components that we have defined above.
Cliques: {1,2,3} and {2,3,4}
0.8-Quasi-clique: {1,2,3,4} (includes 5/6 ≈ 0.83 of the possible edges)
2-Core: {1,2,3,4,5,6,7}
3-Core: none
2-Plex: {1,2,3,4} (vertices 1 and 3 are missing one edge each)
2d-Cliques: {1,2,3,4,5,6} and {2,3,4,5,6,7} (In the first component, 5
connects to 6 via 7, which need not be a member of the component)
2-Clubs: {1,2,3,4,5}, {1,2,3,4,6}, and {2,3,5,6,7}
2.4 Dense Component Selection
When mining for dense components in a graph, a few additional questions
must be addressed:
1 Minimum size σ: What is the minimum number of vertices in a dense component S? That is, |S| ≥ σ for all S.
2 All or top-N ?: One of the following criteria should be applied.
Select all components which meet the size, density, degree, and
distance constraints.
Select the N highest ranking components that meet the minimum constraints. A ranking function must be established. This can be as simple as one of the same metrics used for minimum constraints (size, density, degree, distance, etc.) or a linear combination of them.
Select the N highest ranking components, with no minimum constraints.
3 Overlap: May two components share vertices?
2.5 Relationship between Clusters and Dense Components
The measures described above set an absolute standard for what constitutes
a dense component. Another approach is to find the most dense components on
a relative basis. This is the domain of clustering. It may seem that clustering,
a thoroughly-studied topic in data mining with many excellent methodologies,
would provide a solution to dense component discovery. However, clustering
is a very broad term. Readers interested in a survey on clustering may wish to
consult either Jain, Murty, and Flynn [23] or Berkhin [8]. In the data mining
community, clustering refers to the task of assigning similar or nearby items
to the same group while assigning dissimilar/distant items to different groups.
In most clustering algorithms, similarity is a relative concept; therefore it is
potentially suitable for relative density measures. However, not all clustering
algorithms are based on density, and not all types of dense components can be
discovered with clustering algorithms.
Partitioning refers to one class of clustering problem, where the objective
is to assign every item to exactly one group. A k-partitioning requires the
result to have k groups. K-partitioning is not a good approach for identifying
absolute dense components, because the objectives are at odds. Consider the
well-known k-Means algorithm applied to a uniform graph. It will generate k
partitions, because it must. However, the partitioning is arbitrary, changing as
the seed centroids change.
In hierarchical clustering, we construct a tree of clusters. Conceptually, as well as in actual implementation, this can be either agglomerative (bottom-up), where the closest clusters are merged together to form a parent cluster, or divisive (top-down), where a cluster is subdivided into relatively distant
child clusters. In basic greedy agglomerative clustering, the process starts by
grouping together the two closest items. The pair are now treated as a single
item, and the process is repeated. Here, pairwise distance is the density measure,
and the algorithm seeks to group together the densest pair. If we use divisive
clustering, we can choose to stop subdividing after finding k leaf clusters. A
drawback of both hierarchical clustering and partitioning is that even sparse
regions are forced to belong to some cluster, so they are lumped together with
the denser cores.
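The greedy agglomerative process described above can be sketched in a few lines. This is a generic single-linkage illustration; the function names and the toy distance table are ours, not from any cited algorithm:

```python
def agglomerate(points, dist, k):
    """Greedy agglomerative (single-linkage) clustering: repeatedly merge
    the two closest clusters until only k clusters remain."""
    def d(x, y):
        # distances are stored once per unordered pair
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]
    def linkage(a, b):
        return min(d(x, y) for x in a for y in b)
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > k:
        # find and merge the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters

# Two tight pairs separated by a large distance collapse into two clusters.
dist = {(1, 2): 1, (3, 4): 1, (1, 3): 9, (1, 4): 9, (2, 3): 9, (2, 4): 9}
print(sorted(sorted(c) for c in agglomerate([1, 2, 3, 4], dist, k=2)))
# [[1, 2], [3, 4]]
```

Note how even this tiny example forces every point into a cluster; a genuinely sparse point would simply be absorbed by whichever cluster it is least distant from.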
Spectral clustering describes a graph as an adjacency matrix W , from which is derived the Laplacian matrix L = D − W (unnormalized) or L = I − D^{−1/2}W D^{−1/2} (normalized), where D is the diagonal matrix containing each vertex’s degree. The eigenvectors of L can be used as cluster centroids, with
the corresponding eigenvalues giving an indication of the cut size between
clusters. Since we want minimum cut size, the smallest eigenvalues are chosen
first. This ranking of clusters is an appealing feature for dense component
discovery.
None of these clustering methods, however, are suited for an absolute density
criterion. Nor can they handle overlapping clusters. Therefore, some but not
all clustering criteria are dense component criteria. Most clustering methods
are suitable for relative dense component discovery, excluding k-partitioning
methods.
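As an illustration of the unnormalized case, the Laplacian L = D − W can be assembled directly from an edge list. This sketch uses plain nested lists (our construction; in practice an eigendecomposition, e.g. via numpy, would follow):

```python
def laplacian(n, edges):
    """Unnormalized graph Laplacian L = D - W for an unweighted,
    undirected graph on vertices 0..n-1."""
    W = [[0] * n for _ in range(n)]
    for u, v in edges:
        W[u][v] = W[v][u] = 1
    D = [sum(row) for row in W]  # vertex degrees
    return [[(D[i] if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]

# The path 0-1-2: vertex 1 has degree 2, the endpoints degree 1.
L = laplacian(3, [(0, 1), (1, 2)])
print(L)  # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```

Every row of a Laplacian sums to zero, so the all-ones vector is always an eigenvector with eigenvalue 0; the next-smallest eigenvalues indicate the cheapest cuts, which is why the smallest eigenvalues are chosen first.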
3. Algorithms for Detecting Dense Components in a Single Graph
In this section, we first examine some basic algorithms and then a representative sampling of algorithms for finding subgraphs based on different notions of dense component. The algorithms can be categorized as follows: Exact Enumeration (Section 3.2), Fast Heuristic Enumeration (Section 3.3), and Bounded Approximation Algorithms (Section 3.4). We review some recent work related to dense component discovery, concentrating on the details of several well-received algorithms.
3.1 Overview
Table 1.3 gives an overview of the major algorithmic approaches and cites a few representative examples.
3.2 Exact Enumeration Approach
The most natural way to discover dense components in a graph is to enumerate all possible subsets of vertices and check whether any of them satisfy the definition of a dense component. In the following, we investigate some algorithms that discover dense components by enumeration, several of which employ novel pruning techniques.
3.2.1 Enumeration Approach. One of the best-known enumeration algorithms for generating cliques was proposed by Bron and Kerbosch [12]. The algorithm uses branch-and-bound techniques to cut off branches that cannot generate a clique. The basic idea is to try to extend a subset of vertices which forms a clique by adding vertices that are in a candidate set but not in an exclusion set. Let C be the set of vertices which already forms a clique, Cand be the set of vertices which may potentially be used for extending
Table 1.3. Overview of Dense Component Algorithms

Enumeration:
Clique [12]
Biclique [33]
Quasi-clique [31] (min. degree for each vertex)
Quasi-biclique [42]
k-core [7]

Fast Heuristic Enumeration:
Maximal biclique [28] (nonoverlapping)
Quasi-clique/biclique [13] (spectral analysis)
Relative density [18] (shingling)
Maximal quasi-biclique [30] (balanced noise tolerance)
Quasi-clique, k-core [47] (pruned search; visual results with upper-bounded estimates)

Bounded Approximation:
Max. average degree [14] (undirected graph: 2-approx.; directed graph: (2 + ε)-approx.)
Densest subgraph, n ≥ k [4] (1/3-approx.)
Subgraph of known density θ [3] (finds subgraph with density Ω(θ/ log ∆))
C, and N Cand be the set of vertices which are not allowed to be candidates for C. N (v) denotes the neighbors of vertex v. Initially, C and N Cand are empty, and Cand contains all vertices in the graph. Given C, Cand, and N Cand, the Bron-Kerbosch algorithm proceeds as follows:
Algorithm 1 CliqueEnumeration(C, Cand, N Cand)
let Cand = {v1, v2, ..., vk};
if Cand = ∅ and N Cand = ∅ then
  output the clique induced by vertices C;
else
  for i = 1 to k do
    Cand ← Cand \ {vi};
    call CliqueEnumeration(C ∪ {vi}, Cand ∩ N(vi), N Cand ∩ N(vi));
    N Cand ← N Cand ∪ {vi};
  end for
end if
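A direct Python transcription of this recursion might look as follows. Variable names mirror the pseudocode; this is the basic variant, without the pivoting refinement that Bron and Kerbosch also describe:

```python
def bron_kerbosch(C, cand, ncand, adj, out):
    """Enumerate all maximal cliques.
    C: current clique; cand: candidate vertices; ncand: exclusion set;
    adj: dict mapping each vertex to its set of neighbors."""
    if not cand and not ncand:
        out.append(set(C))  # C is maximal
        return
    for v in list(cand):
        cand.remove(v)
        # restrict candidates and exclusions to neighbors of v
        bron_kerbosch(C | {v}, cand & adj[v], ncand & adj[v], adj, out)
        ncand.add(v)

# Two triangles sharing the edge (2, 3).
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(sorted(map(sorted, cliques)))  # [[1, 2, 3], [2, 3, 4]]
```

The exclusion set is what prevents the same maximal clique from being reported once per permutation of its vertices.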
Makino et al. [33] proposed new algorithms that make full use of efficient matrix multiplication to enumerate all maximal cliques in a general graph or bicliques in a bipartite graph. They developed different algorithms for different types of graphs (i.e., general, bipartite, dense, or sparse). In particular, for sparse graphs in which the degree of each vertex is bounded by ∆ ≪ |V |, they developed an algorithm with O(|V ||E|) preprocessing time, O(∆4) time delay (i.e., the bound on the running time between two consecutive outputs), and O(|V | + |E|) space to enumerate all maximal cliques. Experimental results demonstrate the good performance of this algorithm for sparse graphs.
3.2.2 K-core and Quasi-clique Enumeration. Batagelj et al. [7] developed an efficient algorithm running in O(m) time for discovering k-cores. The algorithm is based on the following observation: given a graph G = (V, E), if we recursively eliminate the vertices with degree less than k, along with their incident edges, the resulting graph is the k-core. The algorithm is quite simple and can be considered a variant of [27]. It assigns to each vertex the number of the core to which it belongs. At the beginning, the algorithm organizes all vertices in a queue ordered by degree. In each iteration, we eliminate the first vertex v (i.e., the vertex with the lowest degree) from the queue and assign the degree of v as its core number. For each of v’s neighbors whose degree is greater than that of v, we decrease its degree by one and immediately reorder the remaining vertices in the queue. We repeat this procedure until the queue is empty. Finally, we output the k-cores based on their assigned core numbers.
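A minimal Python sketch of this peeling procedure. We use a heap with stale-entry skipping in place of the reordered queue; that substitution is our implementation choice, not a detail from [7]:

```python
import heapq

def core_numbers(adj):
    """Core decomposition by peeling: repeatedly remove a minimum-degree
    vertex; its core number is its degree at removal (kept nondecreasing)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, core, k = set(), {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale entry from an earlier degree update
        k = max(k, d)  # core numbers never decrease during peeling
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return core

# Triangle {1,2,3} with a pendant vertex 4: the triangle is the 2-core.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(core_numbers(adj))  # {4: 1, 1: 2, 2: 2, 3: 2}
```

The k-core itself is then just the set of vertices whose core number is at least k.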
An enumeration approach incorporating pruning techniques is introduced in [31]. The authors study the problem of mining, from a single graph, maximal quasi-cliques with size at least min size, given a parameter γ ∈ [0, 1], such that the degree of each vertex within a quasi-clique S is at least ⌈γ(|S| − 1)⌉. An efficient algorithm, named Quick, is proposed to extract such maximal quasi-cliques. This algorithm integrates several novel pruning techniques, based on the degrees of vertices, into a traditional depth-first search framework to prune unqualified vertices as early as possible. These pruning techniques can also be combined with other existing algorithms to mine maximal quasi-cliques.
Some existing pruning techniques, based on the graph diameter, the minimum size threshold, and the degrees of vertices, are reviewed first. Let NkG(v) = {u | distG(u, v) ≤ k} be the set of vertices within distance k of vertex v, let indegX(u) denote the number of vertices in X that are adjacent to u, and let exdegX(u) denote the number of vertices in cand exts(X) that are adjacent to u. All vertices are sorted in lexicographic order, and cand exts(X) is the set of vertices after the last vertex in X which can be used to extend X. For the pruning technique based on graph diameter, vertices which are not in ∩v∈X NkG(v) can be removed from cand exts(X). Considering the minimum size threshold, vertices whose degree is less than ⌈γ(min size − 1)⌉ should be removed. Beyond these, five new pruning techniques are introduced. The first two take into consideration the lower and upper bounds on the number of vertices that can be used to extend the current X. 1) Technique 1: pruning based on the upper bound of the number of vertices that can be added to X concurrently
to form a γ-quasi-clique. In other words, given a vertex set X, the maximum number of vertices in cand exts(X) that can be added to X is bounded by the minimal degree of the vertices in X. 2) Technique 2: pruning based on the lower bound of the number of vertices that can be added to X concurrently to form a γ-quasi-clique. The third technique is based on critical vertices: if we can find critical vertices of X, then all vertices in cand exts(X) that are adjacent to the critical vertices are added to X. Technique 4 is based on the cover vertex u, which maximizes the size of CX(u) = cand exts(X) ∩ NG(u) ∩ (∩v∈X∧(u,v)∈E NG(v)).
Lemma 1.1 [31] Let X be a vertex set and u a vertex in cand exts(X) such that indegX(u) ≥ ⌈γ × |X|⌉. If for every vertex v ∈ X such that (u, v) ∈ E we have indegX(v) ≥ ⌈γ × |X|⌉, then for any vertex set Y such that G(Y ) is a γ-quasi-clique and Y ⊆ (X ∪ (cand exts(X) ∩ NG(u) ∩ (∩v∈X∧(u,v)∈E NG(v)))), G(Y ) cannot be a maximal γ-quasi-clique.
From the above lemma, we can prune CX(u), the cover vertex set of u, from cand exts(X) to reduce the search space. The last technique, the so-called lookahead technique, checks whether X ∪ cand exts(X) is itself a γ-quasi-clique. If so, we do not need to extend X any further, which saves some computational cost. The outline of Quick is depicted in Algorithm 2.
Algorithm 2 Quick Algorithm
Parameter: X is a vertex set;
Parameter: cand exts(X) is the valid candidate extensions of X;
Parameter: γ is the minimum degree threshold;
Parameter: min size is the minimum size threshold;
find the cover vertex u of X and sort the vertices in cand exts(X);
for each v ∈ cand exts(X) − CX(u) do
  apply the minimum size constraint on |X| + |cand exts(X)|;
  apply the lookahead technique (technique 5) to prune the search space;
  remove the vertices that are not in NkG(v);
  Y ← X ∪ {v};
  calculate the upper and lower bounds on the number of vertices to be added to Y in order to form a γ-quasi-clique;
  apply techniques 1 and 2 to recursively prune unqualified vertices;
  identify the critical vertices of Y and apply pruning technique 3;
  apply existing pruning techniques to further reduce the search space;
end for
return final results;
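As one concrete piece of the above, the lookahead technique needs a test of whether a vertex set is a γ-quasi-clique under the degree-based definition. A minimal sketch (function and variable names are ours):

```python
from math import ceil

def is_gamma_quasi_clique(adj, X, gamma):
    """True if every vertex of X has at least ceil(gamma * (|X|-1))
    neighbors inside X, i.e. X induces a degree-based gamma-quasi-clique."""
    X = set(X)
    need = ceil(gamma * (len(X) - 1))
    return all(len(adj[v] & X) >= need for v in X)

# A 4-clique {1,2,3,4} with a pendant vertex 5 attached to 4.
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
print(is_gamma_quasi_clique(adj, {1, 2, 3, 4}, 0.8))     # True: a clique
print(is_gamma_quasi_clique(adj, {1, 2, 3, 4, 5}, 0.8))  # False: vertex 5 fails
```

In Quick, this check applied to X ∪ cand exts(X) lets the search output that whole set at once instead of extending X one vertex at a time.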
3.3 Fast Heuristic Approach
As mentioned before, it is impractical to exactly enumerate all maximal cliques in very large graphs, such as those arising in real applications like protein-protein interaction networks. In this case, fast heuristic algorithms have been proposed to address the problem. These algorithms are able to efficiently identify subgraphs that are close to cliques (sometimes called quasi-cliques) in a very large graph.
3.3.1 Shingling Technique. Gibson et al. [18] propose a new algorithm based on the shingling technique for discovering large dense bipartite subgraphs in massive graphs. In their paper, a dense bipartite subgraph is considered to be a cohesive group of vertices which share many common neighbors. Since this algorithm uses shingling to convert each dense component of arbitrary size into shingles of constant size, it is very efficient and practical for a single large graph, and it can easily be extended to streaming graph data.
We first provide some background on the shingling algorithm. Shingling, first introduced in [11], is widely used to estimate the similarity of web pages via a particular feature extraction scheme. In this setting, the shingling technique generates two constant-size fingerprints for two subsets A and B of a universe U of elements, such that the similarity of A and B can be computed easily by comparing their fingerprints. Assume π is a random permutation of the elements in the ordered universe U containing A and B. Interestingly, the probability that the smallest elements of A and B under π coincide is exactly the Jaccard coefficient:

Pr[π−1(mina∈A {π(a)}) = π−1(minb∈B {π(b)})] = |A ∩ B| / |A ∪ B|    (3.1)
Given a constant number c of permutations π1, . . . , πc of U , we generate a fingerprint vector whose i-th element is mina∈A πi(a). The similarity between A and B is estimated by the number of positions whose elements agree in the corresponding fingerprint vectors. Furthermore, we can generalize this approach by considering every s-element subset of the entire set instead of single-element subsets. The similarity of two sets A and B can then be measured by the fraction of these s-element subsets that appear in both; this is in fact an agreement measure used in information retrieval. Along this line, it is not hard to see that this is equivalent to computing the Jaccard coefficient of the s-element subsets As and Bs. We call each s-element subset a shingle. The resulting feature extraction approach is thus named the (s, c) shingling algorithm. Given an n-element set A = {ai , 0 ≤ i < n} where each element ai is a string, the (s, c) shingling algorithm extracts c shingles, each of which is generated from exactly s elements. We start by converting
Figure 1.2. Simple example of web graph
Figure 1.3. Illustrative example of shingles
each string ai into an integer xi using a hash function. Then, given two random integer vectors R and S of size c, for the j-th repetition we generate an n-element temporary set Y = {yi , 0 ≤ i < n} where each element yi = Rj × xi + Sj. The s smallest elements of Y are selected and concatenated to form a new string y. Finally, we apply a hash function to string y to obtain one shingle. We repeat this procedure c times to obtain c shingles for the original set.
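The fingerprint-generation procedure just described can be sketched as follows. The specific hash function (MD5) and modulus are our assumptions, since the paper fixes neither:

```python
import hashlib
import random

def shingles(elements, s, c, seed=0):
    """(s, c) shingling sketch: hash each element to an integer, apply c
    random linear maps, and hash the s smallest mapped values each time."""
    M = (1 << 31) - 1  # a Mersenne prime, a common modulus choice
    xs = [int(hashlib.md5(e.encode()).hexdigest(), 16) % M for e in elements]
    rng = random.Random(seed)
    out = []
    for _ in range(c):
        r, t = rng.randrange(1, M), rng.randrange(M)  # one random linear map
        smallest = sorted((r * x + t) % M for x in xs)[:s]
        key = ",".join(map(str, smallest))
        out.append(hashlib.md5(key.encode()).hexdigest())  # one shingle
    return out

a = shingles(["p1", "p2", "p3"], s=2, c=4)
b = shingles(["p3", "p1", "p2"], s=2, c=4)  # same set, different order
print(a == b)  # True: the fingerprint depends only on set membership
```

Two vertices whose neighbor sets overlap heavily will, with probability governed by Equation (3.1), share some of these c shingles, which is what the clustering step below exploits.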
Recall that our goal is to discover dense bipartite subgraphs in which the vertices on one side share common neighbors on the other side. Figure 1.2 illustrates a simple web-community scenario where each web page in the upper part links to some web pages in the lower part. Clearly, for each vertex in the upper part, there is a list of outgoing links incident to it. In order to put vertices into the same group, we must measure the similarity of the vertices, which denotes to what extent they share common neighbors. With the help of the shingling approach, for each vertex in the upper part, we can generate a constant-size set of shingles describing its outlinks (i.e., its neighbors in the lower part). As shown in Figure 1.3, the outlinks are converted to a small set of shingles (s1 , s2 , s3 , s4 in the figure). Since the set of shingles can be significantly smaller than the original
Figure 1.4. Recursive Shingling Step
data, much computational cost can be saved in terms of both time and space.
In the paper, Gibson et al. repeatedly employ the shingling algorithm to convert dense components into constant-size shingles. The algorithm is a two-step procedure. 1) Recursive shingling step: the goal is to extract subsets of vertices such that the vertices in each subset share many common neighbors. Figure 1.4 illustrates the recursive shingling process for a graph (Γ(V ) denotes the outlinks of the vertices V ). After the first shingling process, for each vertex v ∈ V , its outlinks Γ(v) are converted into a constant-size set of first-level shingles v ′ . Then we can transpose the mapping relation E0 to E1 so that each shingle in v ′ corresponds to the set of vertices which share this shingle. In other words, a new bipartite graph is constructed where each vertex in one part represents a shingle and each vertex in the other part is an original vertex. If there is an edge from shingle v ′ to vertex v, then v ′ is one of the shingles generated for v’s outlinks by the shingling approach. From this point on, V is treated as Γ(V ′ ). Following the same procedure, we apply the shingling algorithm to V ′ and Γ(V ′ ). After the second shingling process, V ′ is converted into a constant-size set V ′′ of so-called second-level shingles. Similarly to the transposition in the first shingling process, we transpose E1 to E2 and obtain pairs < v ′′ , Γ(v ′′ ) > where v ′′ is a second-level shingle and Γ(v ′′ ) is the set of first-level shingles that share this second-level shingle. 2) Clustering step: the aim is to merge first-level shingles which share some second-level shingles. Essentially, this merges subsets of bicliques into
one dense component. In particular, given all pairs < v ′′ , Γ(v ′′ ) >, the traditional Union-Find algorithm is used to merge first-level shingles in Γ(V ′′ ) such that any two merged first-level shingles share at least one second-level shingle. Finally, we map the clustering results back to the vertices of the original graph and generate one dense bipartite subgraph for each cluster. The entire algorithm is presented in Algorithm 3.
Algorithm 3 DiscoverDenseSubgraph [18]
Parameter: c1 , s1 , c2 , s2 are the parameters for the two-step shingling algorithm;
apply the recursive shingling algorithm to obtain first-level and second-level shingles;
let S = < s, Γ(s) > be the first-level shingles;
let T = < t, Γ(t) > be the second-level shingles;
apply the clustering approach to obtain the clustering result C in terms of first-level shingles;
for each C ∈ C do
  output ∪s∈C Γ(s) as a dense subgraph;
end for
3.3.2 Visualization. Wang et al. [47] combine theoretical bounds, a greedy heuristic for graph traversal, and visual cues to develop a mining technique for clique, quasi-clique, and k-core components. Their approach is named CSV, for Cohesive Subgraph Visualization. Figure 1.5 shows a representative plot and how it is interpreted.
A key measure in CSV is the co-cluster size CC(v, x), meaning the (estimated) size of the largest clique containing both vertices v and x. Then, C(v) = max{CC(v, x) | x ∈ N (v)}.
At the top level of abstraction, the algorithm is not difficult. We maintain a
priority queue of vertices observed so far, sorted by C(v) value. We traverse
the graph and draw a density plot by iterating the following steps:
1 Remove the top vertex from the queue, making this the current vertex v.
2 Plot v.
3 Add v’s neighbors to the priority queue.
Now for some details. On the ith iteration, plot the point (i, C_seen(v_i)), where C_seen(v_i) is the largest value of C(v_i) observed so far. We say "seen so far" because we may not yet have observed all of v's neighbors, and even when we have, we are only estimating clique sizes. Next, some neighbors of v may already be in the queue; in that case, update their C values and reprioritize. Due to the estimation method described below, the new estimate is no worse than the previous one.
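The traversal loop above can be sketched with a standard max-heap (negated keys in `heapq`). The estimator is abstracted away: we assume a precomputed per-vertex bound `C_est[v]` and use `min(C_est[v], C_est[x])` as a crude stand-in for CC(v, x), so this is a sketch of the control flow, not of CSV's actual estimates:

```python
import heapq

def csv_traversal(graph, C_est):
    """Sketch of the CSV traversal loop: pop the vertex with the best
    clique-size estimate, plot it, and push/update its neighbors.
    Assumes a connected graph; CSV would restart from an unvisited
    vertex otherwise. Returns the (vertex, C_seen) plot sequence."""
    start = max(graph, key=lambda v: C_est[v])
    seen = {start: C_est[start]}        # best estimate seen so far
    heap = [(-C_est[start], start)]     # max-heap via negated keys
    plot, visited = [], set()
    while heap:
        negc, v = heapq.heappop(heap)
        if v in visited or -negc < seen[v]:
            continue                    # stale queue entry
        visited.add(v)
        plot.append((v, seen[v]))
        for x in graph[v]:
            if x in visited:
                continue
            c = min(C_est[v], C_est[x])  # crude stand-in for CC(v, x)
            if c > seen.get(x, -1):      # estimates only improve
                seen[x] = c
                heapq.heappush(heap, (-c, x))
    return plot
```

On a triangle {a, b, c} with a pendant vertex d, the plot descends from the triangle's estimate to the pendant's, which is exactly the plateau shape CSV looks for.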
[Plot: the x-axis is the traversal order and the y-axis is C_seen(v_i). A plateau of width w at height k contains w connected vertices with degree ≥ k, and may contain a clique of size up to min(k, w).]
Figure 1.5. Example of CSV Plot
Since an exact determination of CC(v, x) is computationally expensive,
CSV takes several steps to efficiently find a good estimate of the actual clique
size. First, to reduce the clique search space, the graph’s vertices and edges are
pre-processed to map them to a multi-dimensional space. A certain number of
vertices are selected as pivot points. Then each vertex is mapped to a vector:
v → M (v) = {d(v, p1 ), · · · , d(v, pp )}, where d(v, pi ) is the shortest distance
in the graph from v to pivot pi . The authors prove that all the vertices of a clique
map to the same unit cell, so we can search for cliques by searching individual
cells.
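The pivot mapping itself is just a handful of BFS runs; a minimal sketch (function names are our own):

```python
from collections import deque

def bfs_dist(graph, src):
    """Unweighted shortest-path distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def pivot_embedding(graph, pivots):
    """Map each vertex v to M(v) = (d(v, p1), ..., d(v, pk))."""
    tables = [bfs_dist(graph, p) for p in pivots]
    return {v: tuple(t.get(v, float("inf")) for t in tables)
            for v in graph}
```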
Second, CSV further prunes the vertices within each occupied cell. Do the
following for each vertex v in each occupied cell: For each neighbor x of v,
identify the set of vertices Y which connect to both v and x. Construct the
induced subgraph S(v, x, Y ). If there is a clique, it must be a subgraph of S.
Sort Y in decreasing order of degree in S. To be in a k-clique, a vertex must have degree ≥ k − 1. Consequently, we step through the sorted list Y and eliminate the remainder once δ_S(y_i) < i − 1. The size of the remaining list is an upper-bound estimate for C(v) and CC(v, x).
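A simplified variant of this pruning bound can be coded directly, using the observation that a vertex extending a candidate clique of size t within S must have degree ≥ t in S (our indexing, not necessarily CSV's exact cutoff):

```python
def clique_upper_bound(graph, v, x):
    """Upper-bound estimate of CC(v, x): induce S on v, x, and their
    common neighbors, sort the common neighbors by degree within S,
    and stop admitting vertices once the degree cannot support a
    clique of the current candidate size."""
    common = graph[v] & graph[x]
    nodes = {v, x} | common
    deg = {u: len(graph[u] & nodes) for u in nodes}  # degree within S
    kept = 2                                 # v and x themselves
    for y in sorted(common, key=lambda u: deg[u], reverse=True):
        if deg[y] < kept:    # cannot sit in a (kept+1)-clique within S
            break
        kept += 1
    return kept
```

For v and x inside a K4 the bound is 4 (exact); adding low-degree common neighbors does not inflate it.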
With relatively minor modification, the same general approach can be used for
quasi-cliques and k-cores.
The slowest step in CSV is searching the cells for pseudo-cliques, with overall time complexity O(|V|² log|V| · 2^d), where d is the dimension of the pivot embedding. This becomes exponential when the graph is a single large clique. However, when tested on two real-life datasets, DBLP co-authorship and SMD stock market networks, d ≪ |V|, so performance is polynomial.
3.3.3 Other Heuristic Approaches.
Li et al. [30] studied the problem of discovering dense bipartite subgraphs, so-called maximal quasi-bicliques with balanced noise tolerance, in which every vertex in one part may be disconnected from at most the same number, or the same percentage, of vertices in the other part. This definition avoids skewed connectivity in the discovered dense bipartite subgraphs. Li et al. observed that such maximal quasi-bicliques cannot be trivially expanded from traditional maximal bicliques. Useful properties such as the bounded closure property and the fixed point property are exploited to develop an efficient algorithm, µ-CompleteQB, for discovering maximal quasi-bicliques with balanced noise tolerance.
Given a bipartite graph, the algorithm finds maximal quasi-bicliques in which the number of vertices in each part exceeds a specified value ms ≥ µ. Two cases are considered in this framework. If ms ≥ 2µ, the problem reduces to extracting maximal µ-quasi-bicliques, which is well studied in [42]. Otherwise, if ms < 2µ, a depth-first search for µ-tolerance maximal quasi-bicliques whose vertex size is between ms and 2µ is conducted.
A spectral analysis method [13] is used to uncover the functionality of dense components. First, a similarity matrix for the protein-protein interaction network is defined, and its eigenvalues and eigenvectors are computed. Each eigenvector with a positive eigenvalue is identified as a quasi-clique, while each eigenvector with a negative eigenvalue is identified as a quasi-bipartite subgraph. Given these dense components, a statistical test based on p-values measures whether a component is enriched with proteins from a particular category more than would be expected by chance; in other words, the test checks that the existence of the component is significant with respect to a specific protein category. If so, the component is annotated with the corresponding protein functionality. Finally, the approach is applied to the protein-protein interaction network of budding yeast.
Kumar et al. [28] focus on enumerating emerging communities which have little or no representation in newsgroups or commercial web directories. They treat the (i, j)-biclique, with i and j vertices in its two parts, as the core of a community of interest. The paper therefore aims at extracting a non-overlapping maximal set of such cores. A stream-based algorithm combining several pruning techniques is presented to process huge raw web data and eventually generate the appropriate cores. Some open problems, such as how to automatically extract semantic information and organize it into a useful structure, are also mentioned.
3.4 Approximation Approach with Bound
3.4.1 Greedy Approximation Algorithm with Bound. In [14], Charikar described exact and greedy approximation algorithms to discover subgraphs that maximize two different notions of density, one for undirected graphs and one for directed graphs. For an undirected graph, the density of a subset S is essentially the average degree of its induced subgraph: f(S) = |E(S)|/|S|. For a directed graph, the criterion first proposed by Kannan and Vinay [26] is applied: given two subsets of vertices S ⊆ V and T ⊆ V (not necessarily disjoint), the density of the subgraph H_{S,T} is d(S, T) = |E(S, T)|/√(|S||T|). The paper studies the optimization problems of discovering a subgraph H_S induced by a subset S with maximum f(S), and a subgraph H_{S,T} induced by subsets S and T with maximum d(S, T), respectively.
In the paper, the author shows that finding a subgraph H_S of an undirected graph with maximum f(S) is equivalent to solving the following linear program (LP):

    max Σ_{ij∈E} x_ij                  (3.2)
    s.t. x_ij ≤ y_i,   ∀ij ∈ E         (3.3)
         x_ij ≤ y_j,   ∀ij ∈ E         (3.4)
         Σ_i y_i ≤ 1                   (3.5)
         x_ij, y_i ≥ 0                 (3.6)
From a graph viewpoint, each vertex v_i is assigned the weight Σ_j x_ij, and min(y_i, y_j) acts as a threshold on the weight of each edge (v_i, v_j) incident to vertex v_i. Then x_ij can be viewed as the portion of weight that vertex v_i distributes to edge (v_i, v_j). In normalized fashion, the sum of the thresholds for edges incident to vertex v_i, Σ_i y_i, is bounded by 1. In this sense, finding the optimal value of Σ_{ij} x_ij can be thought of as extracting a set of edges such that the weights of their incident vertices are mostly distributed to them. Charikar shows that the optimum of the above LP is exactly the density of the densest subgraph of an undirected graph.
Intuitively, the complexity of this LP depends heavily on the number of edges and vertices in the graph (i.e., the number of inequality constraints), making it impractical for large graphs. Therefore, Charikar proposed an efficient greedy algorithm and proved that it produces a 2-approximation of f(G). The greedy algorithm, a simple variant of [27], works as follows. Let S be a subset of V and H_S its induced subgraph with density f(H_S):
1 Initialize S to be the full vertex set V;
2 Let H_S be the subgraph induced by S;
3 In each iteration, eliminate the vertex with lowest degree in H_S from S and recompute the density;
4 Output the induced subgraph H_S with maximum density over all iterations as the resulting dense component.
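The four steps translate directly into code. Below is a minimal sketch of the peeling procedure (a linear-time version would maintain degree buckets instead of rescanning for the minimum):

```python
def densest_subgraph_greedy(graph):
    """Charikar-style greedy peeling: repeatedly delete a minimum-degree
    vertex and return the intermediate subgraph maximizing |E(S)|/|S|,
    whose density is at least half the optimum."""
    adj = {v: set(ns) for v, ns in graph.items()}
    m = sum(len(ns) for ns in adj.values()) // 2
    best_density, best_size = m / len(adj), len(adj)
    order = []                                   # deletion order
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))  # O(n) scan per step
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
        order.append(v)
        if adj and m / len(adj) > best_density:
            best_density, best_size = m / len(adj), len(adj)
    # Reconstruct the best intermediate subgraph from the deletion order.
    survivors = set(graph) - set(order[:len(graph) - best_size])
    return survivors, best_density
```

On a K4 with one pendant vertex, the full graph has density 7/5 = 1.4, but peeling the pendant exposes the K4 with density 6/4 = 1.5, which the procedure returns.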
Similar techniques apply to finding the densest subgraph in a directed graph, for which a greedy 2-approximation algorithm is also proposed. This greedy algorithm takes O(m + n) time. According to the analysis, the greedy algorithm must be run for O((log n)/ε) values of c in order to obtain a 2 + ε approximation, where c = |S|/|T| and S, T are two vertex subsets of the graph.
Another variant following a similar idea is presented in [24]. Jin et al. developed an approximation algorithm for discovering the densest subgraph by introducing the notion of a rank subgraph, defined as follows:

Definition 1 (Rank Subgraph) [24]. Given an undirected graph G = (V, E) and a positive integer d, remove from G all vertices with degree less than d, together with their incident edges, and repeat this procedure until no vertex can be eliminated, forming a new graph G_d. Each vertex in G_d is adjacent to at least d vertices of G_d. If no vertex is left, we call the result the empty graph, denoted G_∅. Given this, we construct a subgraph sequence G ⊇ G_1 ⊇ G_2 ⊇ · · · ⊇ G_l ⊃ G_{l+1} = G_∅, where G_l ≠ G_∅ and contains at least l + 1 vertices. We define l as the rank of the graph G, and G_l as the rank subgraph of G.
Lemma 1.2 Given an undirected graph G, let G_s be the densest subgraph of G with density d(G_s) and G_l its rank subgraph with density d(G_l). Then the density of G_l is no less than half the density of G_s:

    d(G_l) ≥ d(G_s)/2
The above lemma implies that we can use the rank subgraph G_l of G, the subgraph of highest rank, to approximate its densest subgraph. This technique is used to derive an efficient search algorithm for finding the densest subgraph in a sequence of bipartite graphs; the interested reader is referred to [24] for details.
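A direct, unoptimized way to compute the rank and rank subgraph from Definition 1 is to peel the graph for d = 1, 2, . . . until it vanishes:

```python
def rank_subgraph(graph):
    """Rank l and rank subgraph G_l: for d = 1, 2, ..., repeatedly strip
    vertices of degree < d until stable; G_l is the last nonempty graph."""
    adj = {v: set(ns) for v, ns in graph.items()}
    l, G_l = 0, set(adj)
    d = 1
    while adj:
        changed = True
        while changed:                   # peel down to the d-core
            low = [v for v in adj if len(adj[v]) < d]
            changed = bool(low)
            for v in low:
                for u in adj[v]:
                    if u in adj:         # neighbor may already be gone
                        adj[u].discard(v)
                del adj[v]
        if adj:
            l, G_l = d, set(adj)
        d += 1
    return l, G_l
```

On a K4 with one pendant vertex, G_1 is the whole graph, G_2 = G_3 = K4, and G_4 is empty, so the rank is 3 and the rank subgraph is the K4.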
3.4.2 Others.
Andersen et al. [4] consider the problem of discovering a dense subgraph whose size obeys a lower or upper bound. Three problems are formulated: dalks, damks, and dks. In detail, dalks is the densest at-least-k-subgraph problem, which seeks the induced subgraph with highest average degree among all subgraphs with at least k vertices. Similarly, damks is the densest at-most-k-subgraph problem, and dks seeks the densest subgraph with exactly k vertices. Clearly, dalks and damks can be viewed as relaxed versions of dks. Andersen et al. show that damks is approximately as hard as dks, which has been proved to be NP-complete. More importantly, an effective 1/3-approximation algorithm based on the core decomposition of a graph is proposed for dalks. It runs in O(m + n) and O(m + n log n) time for unweighted and weighted graphs, respectively.
We describe the algorithm for dalks as follows. Given a graph G = (V, E) with n vertices and a size lower bound k, let H_i be the subgraph induced by i vertices. Initially, i = n and H_i is the original graph G. At each step, we remove the vertex with minimum weighted degree from H_i to form H_{i−1}, updating the corresponding total weight W(H_{i−1}) and density d(H_{i−1}). Repeating this procedure yields a sequence of subgraphs H_n, H_{n−1}, · · · , H_1. Finally, we choose the subgraph of maximal density among H_n, . . . , H_k as the resulting dense component.
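The peeling loop, restricted so that it never drops below k vertices, might look like this for the unweighted case (the weighted version would remove the vertex of minimum weighted degree instead):

```python
def dalks(graph, k):
    """Peeling 1/3-approximation for densest-at-least-k-subgraph
    (unweighted): peel minimum-degree vertices, never going below k
    vertices, and keep the densest intermediate subgraph."""
    adj = {v: set(ns) for v, ns in graph.items()}
    m = sum(len(ns) for ns in adj.values()) // 2
    best_d, best_s = m / len(adj), frozenset(adj)
    while len(adj) > k:
        v = min(adj, key=lambda u: len(adj[u]))
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
        d = m / len(adj)
        if d > best_d:
            best_d, best_s = d, frozenset(adj)
    return best_s, best_d
```

Note how the lower bound changes the answer: on a K4 with a pendant vertex, k = 2 allows the peeling to expose the K4 (density 1.5), while k = 5 forces the whole graph (density 1.4).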
A local algorithm [3] finds a dense bipartite subgraph near a specified starting vertex of a bipartite graph. In particular, for any bipartite subgraph with K vertices and density θ (density defined as in [26]), the algorithm is guaranteed to generate a subgraph with density Ω(θ/log ∆) near any starting vertex v, where ∆ is the maximum degree in the graph. Its time complexity is O(∆K²), independent of the size of the graph, so it has the potential to scale to large graphs.
4. Frequent Dense Components
The dense component discovery problem can be extended to a dataset consisting of a set of graphs D = {G_1, · · · , G_n}. In this case, we have two criteria for components: they must be dense and they must occur frequently. The density requirement can be any of our earlier criteria. The frequency requirement says that a component satisfies a minimum support threshold; that is, it appears in at least a certain number of graphs. Obviously, if we are to say that we find the same component in different graphs, there must be a correspondence of vertices from one graph to another. If the graphs have exactly the same vertex sets, then we call this a relation graph set.
Many authors have considered the broader problem of frequent pattern mining in graphs [45, 22, 29]; however, only recently has there been a clear focus on patterns defined and restricted by density. Several recent papers have examined discovery methods for frequent dense subgraphs. We take a more detailed look at some of these papers.
4.1 Frequent Patterns with Density Constraints
One approach is to impose a density constraint on the patterns discovered by frequent pattern mining. In [50], Yan et al. use a minimum-cut clustering criterion: a component must have edge connectivity at least k. Note that this is closely related to the k-core criterion. Furthermore, each frequent pattern must be closed, meaning it has no supergraph with the same support level. They develop two approaches, pattern growth and pattern reduction. In pattern growth, we begin with a small subgraph (possibly a single vertex) that satisfies both the frequency and density requirements but may not be closed; the algorithm incrementally adds adjacent edges until the pattern is closed. In pattern reduction, we initialize the working set P_1 to be the first graph G_1 and update it by intersecting its edge set with the edges of the next graph:

    P_i = P_{i−1} ∩ G_i = (V, E(P_{i−1}) ∩ E(G_i))

This removes any edges that do not appear in both input graphs. We then decompose P_i into k-core subgraphs, recursively call pattern reduction for each dense subgraph, and record the dense subgraphs that survive enough intersections to be considered frequent.
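A flattened sketch of the intersection-and-decompose idea (the actual algorithm in [50] recurses into each k-core separately; here we simply intersect graph by graph, with edges encoded as consistently ordered tuples):

```python
def k_core(edges, k):
    """Vertex set of the k-core of an edge set (repeatedly drop deg < k)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        low = [v for v in adj if len(adj[v]) < k]
        changed = bool(low)
        for v in low:
            for u in adj[v]:
                if u in adj:             # neighbor may already be gone
                    adj[u].discard(v)
            del adj[v]
    return set(adj)

def pattern_reduction(graphs, k, min_support):
    """Intersect edge sets graph by graph; once at least min_support
    graphs have been intersected, record surviving k-core vertex sets."""
    found = []
    P = graphs[0]
    for i, G in enumerate(graphs[1:], start=2):
        P = P & G                 # drop edges absent from either graph
        core = k_core(P, k)
        if i >= min_support and core:
            found.append((i, core))
    return found
```

A triangle present in all input graphs survives every intersection and is reported once the support threshold is met; noise edges disappear at the first graph that lacks them.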
The greedy removal of edges at each iteration quickly reduces the working set size, leading to fast execution. The trade-off is that we may prune away edges that could have contributed to a frequent dense component. A consequence of edge intersection is that we only find components whose edges happen to appear in the first min_support graphs. Therefore, a useful heuristic is to order the graphs by decreasing overall density. In [50], the authors find that pattern reduction works better when targeting high connectivity but a low support threshold; conversely, pattern growth works better when targeting high support but only modest connectivity.
4.2 Dense Components with Frequency Constraint
Hu et al. [21] take a different perspective, providing a simple meta-algorithm on top of an existing dense component algorithm. From the input graphs, which must form a relation graph set, they derive two new graphs, the Summary Graph and the Second-Order Graph. The Summary Graph is Ĝ = (V, Ê), with an edge wherever that edge appears in at least k graphs of D. For the Second-Order Graph, each edge of D becomes a vertex, giving F = (V × V, E_F). An edge joins two vertices of F (i.e., two edges of G) if they have similar support patterns in D. An edge's support pattern is represented as the n-dimensional vector of its weights in each graph: w(e) = {w_{G_1}(e), · · · , w_{G_n}(e)}. A similarity measure such as Euclidean distance then determines whether two vertices of F should be connected.
Given these two secondary graphs, the problem is quite simple to state: find
coherent dense subgraphs, where a subgraph S qualifies if its vertices form a
dense component in Ĝ and if its edges form a dense component in F . Density
in Ĝ means that the component’s edges occur frequently, when considering the
whole relation graph set D. Density in F ensures that these frequent edges are
coherent, that is, they tend to appear in the same graphs.
To efficiently find dense subgraphs, Hu uses a modified version of Hartuv
and Shamir’s HCS mincut algorithm [20]. Because Hu’s approach converts
any n graphs into only 2 graphs, it scales well with the number of graphs. A
drawback, however, is the potentially large size of the second-order graph. The worst case occurs when all n graphs are identical: all edge support vectors are then identical, and the second-order graph becomes a clique on |E| vertices with O(|E|²) edges.
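Constructing the two secondary graphs is straightforward; in this sketch each input graph is a dict from edge to weight, and `threshold` is a hypothetical tuning knob for the Euclidean similarity test:

```python
from itertools import combinations

def summary_graph(weighted_graphs, k):
    """Edges that appear (with positive weight) in at least k graphs."""
    count = {}
    for g in weighted_graphs:
        for e in g:
            count[e] = count.get(e, 0) + 1
    return {e for e, c in count.items() if c >= k}

def second_order_graph(weighted_graphs, threshold):
    """Connect two edges of D when their support vectors (per-graph
    weights) lie within `threshold` in Euclidean distance."""
    all_edges = sorted({e for g in weighted_graphs for e in g})
    vec = {e: tuple(g.get(e, 0.0) for g in weighted_graphs)
           for e in all_edges}
    F = []
    for e1, e2 in combinations(all_edges, 2):
        d2 = sum((a - b) ** 2 for a, b in zip(vec[e1], vec[e2]))
        if d2 <= threshold ** 2:
            F.append((e1, e2))
    return F
```

The pairwise loop makes the quadratic blow-up discussed above explicit: with n identical graphs every pair of edges passes the test, and F holds O(|E|²) entries.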
4.3 Enumerating Cross-Graph Quasi-Cliques
Pei et al. [36] consider the problem of finding so-called cross-graph quasi-cliques (CGQC for short), using the balanced quasi-clique definition. Given a set of graphs D = {G_1, · · · , G_n} on the same vertex set U, corresponding parameters γ_1, · · · , γ_n for the completeness of vertex connectivity, and a minimum component size min_s, they seek all vertex subsets of cardinality ≥ min_s such that each subset, induced on graph G_i, forms a maximal γ_i-quasi-clique.
A complete enumeration is #P-complete. Therefore, they derive several graph-theoretic pruning methods that typically reduce the execution time. They employ a set enumeration tree [38] to list all possible subsets of vertices, taking advantage of tree-based concepts such as depth-first search and subtree pruning. An example of a set enumeration tree is shown in Figure ??. Below is a brief listing of some of the theoretical properties they utilize, followed by the main algorithm itself (Algorithm 4).
1 Given γ and graph size n, there exist upper bounds on the graph diameter diam(G). For example, diam(G) ≤ n − 1 if γ > 1/(n − 1).
2 Define N^k(u) = the set of vertices within distance k of u.
3 Reducing vertices: if δ(u) < γ_i(min_s − 1) or |N^k(u)| < (min_s − 1), then u cannot be in a CGQC.
4 Candidate projection: when traversing the tree, a child cannot be in a CGQC if it does not satisfy its parent's neighbor distance bounds N^{k_i}_{G_i}.
5 Subtree pruning: apply various rules based on min_s, redundancy, and monotonicity.
Algorithm 4 Crochet: Cross-Graph Quasi-Clique Algorithm
Parameter: graphs G_1, G_2; completeness factors γ_1, γ_2; minimum size min_s;
1: for each graph G_i do
2:   construct the set enumeration tree for all possible vertex subsets of G_i;
3: end for
4: apply Vertex and Edge Reduction to G_1 and G_2;
5: for each v ∈ V(G_1), using DFS and highest-degree-child-first order do
6:   recursive-mine({v}, G_1, G_2);
7: end for
8: Function recursive-mine(X, G_1, G_2):
9:   G_i ← G_i(P), P = {u | u ∈ ∩_{v∈X, i=1,2} N^{k_i}_{G_i}(v)}  {Candidate Projection}
10:  G_i ← G_i(P(X));
11:  apply Vertex Reduction;
12:  if a Subtree Pruning condition applies then return FALSE;
13:  continue ← FALSE;
14:  for each v ∈ P(X)\X, using DFS and highest-degree-child-first order do
15:    continue ← continue ∨ recursive-mine(X ∪ {v}, G_1, G_2);
16:  end for
17:  if not continue ∧ G_i(X) is a γ_i-quasi-complete graph then
18:    output X;
19:    return TRUE;
20:  else
21:    return continue;
22:  end if

5. Applications of Dense Components Analysis
In financial and economic analysis, dense components represent entities that
are highly correlated. For example, Boginski et al. define a market graph, where
each vertex is a financial instrument, and two vertices are connected if their behaviors (say, price change over time) are highly correlated [9, 10]. A dense component then indicates a set of instruments whose members are well-correlated
to one another. This information is valuable both for understanding market
dynamics and for predicting the behavior of individual instruments. Density
can also indicate strength and robustness. Du et al. [15] identify cliques in a financial grid space to assist in discovering price-value motifs. Some researchers have employed bipartite and multipartite networks: Sim et al. [42] correlate stocks with financial ratios using quasi-bicliques, and Alkemade et al. [2] find edge density in a tripartite graph of producers, consumers, and intermediaries to be an important factor in the dynamics of commerce.
In the first decade of the 21st century, the field that perhaps has shown the
greatest interest and benefitted the most from dense component analysis is biology. Molecular and systems biologists have formulated many types of networks:
signal transduction and gene regulation networks, protein interaction networks,
metabolic networks, phylogenetic networks, and ecological networks [25].
Proteins are so numerous that even simple organisms such as Saccharomyces
cerevisiae, a budding yeast, are believed to have over 6000 [46]. Understanding the function and interrelationships of each one is a daunting task. Fortunately, there is some organization among the proteins. Dense components in
protein-protein interaction networks have been shown to correlate to functional
units [44, 37, 49, 13, 6]. Finding these modules and complexes helps to explain metabolic processes and to annotate proteins whose functions are as yet
unknown.
Gene expression faces similar challenges. Microarray experiments can record
which of the thousands of genes in a genome are expressed under a set of test
conditions and over time. By compiling the expression results from several
trials and experiments, a network can be constructed. Clustering the genes into
dense groups can be used to identify not only healthy functional classes, but
also the expression pattern for genetic diseases [43].
Proteins interact with genes by activating and regulating gene transcription
and translation. Density in a protein-gene bipartite graph suggests which protein
groups or complexes operate on which genes. Everett et al. [16] have extended
this to a tripartite protein-gene-tissue graph.
Other biological systems are also being modeled as networks. Ecological
networks, famous for food chains and food webs, are receiving new attention as
more data becomes available for analysis and as the effects of climate change
become more apparent.
Today, the natural sciences, the social sciences, and technological fields are
all using network and graph analysis methods to better understand complex
systems. Dense component discovery and analysis is one important aspect
of network analysis. Therefore, readers from many different backgrounds will
benefit from understanding more about the characteristics of dense components
and some of the methods used to uncover them.
6. Conclusion
We have seen that...
How will dense component discovery research evolve in the future?
References
[1] J. Abello, M. G. C. Resende, and S. Sudarsky. Massive quasi-clique detection. In LATIN '02: Proceedings of the 5th Latin American Symposium on Theoretical Informatics, pages 598–612, London, UK, 2002. Springer-Verlag.
[2] F. Alkemade, H. A. La Poutré, and H. A. Amman. An Agent-Based Evolutionary Trade Network Simulation, chapter 11, pages 237–255. Edward
Elgar Publishing, 2004.
[3] R. Andersen. A local algorithm for finding dense subgraphs. In SODA ’08:
Proc. 19th annual ACM-SIAM symp. on discrete algorithms, pages 1003–
1009, Philadelphia, PA, USA, 2008. Society for Industrial and Applied
Mathematics.
[4] R. Andersen and K. Chellapilla. Finding dense subgraphs with size
bounds. In WAW ’09: Proceedings of the 6th International Workshop
on Algorithms and Models for the Web-Graph, pages 25–37, Berlin, Heidelberg, 2009. Springer-Verlag.
[5] A. Nagurney, editor. Innovations in Financial and Economic Networks (New Dimensions in Networks). Edward Elgar Publishing, 2004.
[6] G. Bader and C. Hogue. An automated method for finding molecular
complexes in large protein interaction networks. BMC Bioinformatics,
4(1):2, 2003.
[7] V. Batagelj and M. Zaversnik. An O(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049, 2003.
[8] P. Berkhin. Survey of clustering data mining techniques. In C. N. Jacob Kogan and M. Teboulle, editors, Grouping Multidimensional Data,
chapter 2, pages 25–71. Springer Berlin Heidelberg, 2006.
[9] V. Boginski, S. Butenko, and P. M. Pardalos. On Structural Properties
of the Market Graph, chapter 2, pages 29–45. Edward Elgar Publishing,
2004.
[10] V. Boginski, S. Butenko, and P. M. Pardalos. Mining market data: A
network approach. Computers and Operations Research, 33(11):3171–
3184, 2006.
[11] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic
clustering of the web. Comput. Netw. ISDN Syst., 29(8-13):1157–1166,
1997.
[12] C. Bron and J. Kerbosch. Algorithm 457: finding all cliques of an undirected graph. Commun. ACM, 16(9):575–577, 1973.
[13] D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, and H. Lu. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucl. Acids Res., 31(9):2443–2450, 2003.
[14] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In K. Jansen and S. Khuller, editors, APPROX ’00, volume
1913 of Lecture Notes in Computer Science, pages 84–95. Springer, 2000.
[15] X. Du, J. H. Thornton, R. Jin, L. Ding, and V. E. Lee. Migration motif: A
spatial-temporal pattern mining approach for financial markets. In KDD
’09: Proc. 15th ACM SIGKDD Intl. Conf. on knowledge discovery and
data mining. ACM, 2009.
[16] L. Everett, L.-S. Wang, and S. Hannenhalli. Dense subgraph computation via stochastic search: application to detect transcriptional modules.
Bioinformatics, 22(14), July 2006.
[17] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web
communities. In Proc. 6th ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining, pages 150 – 160, Boston, Massachusetts, United
States, 2000.
[18] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs
in massive graphs. In K. Böhm, C. S. Jensen, L. M. Haas, M. L. Kersten,
P.-Å. Larson, and B. C. Ooi, editors, VLDB, pages 721–732. ACM, 2005.
[19] G. Grimmett. Percolation. Springer Verlag, 2nd edition, 1999.
[20] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Inf. Process. Lett., 76(4-6):175–181, 2000.
[21] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs across massive biological networks for functional discovery. In ISMB (Supplement of Bioinformatics), pages 213–221, 2005.
[22] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for
mining frequent substructures from graph data. In PKDD ’00: Proceedings of the 4th European Conference on Principles of Data Mining and
Knowledge Discovery, pages 13–23, London, UK, 2000. Springer-Verlag.
[23] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM
Comput. Surv., 31(3):264–323, 1999.
[24] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry. 3-hop: A high-compression indexing scheme for reachability query. In SIGMOD '09: Proceedings of the 2009 ACM SIGMOD international conference on Management of data, New York, NY, USA, 2009. ACM.
[25] B. H. Junker and F. Schreiber. Analysis of Biological Networks. Wiley-Interscience, 2008.
[26] R. Kannan and V. Vinay. Analyzing the structure of large graphs. Manuscript, August 1999.
[27] G. Kortsarz and D. Peleg. Generating sparse 2-spanners. J. Algorithms,
17(2):222–236, 1994.
[28] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web
for emerging cyber-communities. Computer Networks, 31(11-16):1481–
1493, 1999.
[29] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM
’01: Proceedings of the 2001 IEEE International Conference on Data
Mining, pages 313–320, Washington, DC, USA, 2001. IEEE Computer
Society.
[30] J. Li, K. Sim, G. Liu, and L. Wong. Maximal quasi-bicliques with balanced
noise tolerance: Concepts and co-clustering applications. In SIAM Intl.
Conf. on Data Mining, pages 72–83. SIAM, 2008.
[31] G. Liu and L. Wong. Effective pruning techniques for mining quasicliques. In W. Daelemans, B. Goethals, and K. Morik, editors,
ECML/PKDD (2), volume 5212 of Lecture Notes in Computer Science,
pages 33–49. Springer, 2008.
[32] R. Luce. Connectivity and generalized cliques in sociometric group structure. Psychometrika, 15(2):169–190, 1950.
[33] K. Makino and T. Uno. New algorithms for enumerating all maximal
cliques. Algorithm Theory - SWAT 2004, pages 260–272, 2004.
[34] R. Mokken. Cliques, clubs and clans. Quality and Quantity, 13(2):161–
173, 1979.
[35] M. E. J. Newman. The structure and function of complex networks. SIAM
REVIEW, 45:167–256, 2003.
[36] J. Pei, D. Jiang, and A. Zhang. On mining cross-graph quasi-cliques.
In R. Grossman, R. J. Bayardo, and K. P. Bennett, editors, KDD, pages
228–238. ACM, 2005.
[37] N. Pržulj, D. Wigle, and I. Jurisica. Functional topology in a network of
protein interactions. Bioinformatics, 20(3):340–348, 2004.
[38] R. Rymon. Search through systematic set enumeration. In Proc. Third Int. Conf. on Knowledge Representation and Reasoning, 1992.
[39] J. P. Scott. Social Network Analysis: A Handbook. Sage Publications Ltd.,
2nd edition, 2000.
[40] S. B. Seidman. Network structure and minimum degree. Social Networks,
5(3):269–287, 1983.
[41] S. B. Seidman and B. Foster. A graph theoretic generalization of the clique
concept. J. Math. Soc., 6(1):139–154, 1978.
[42] K. Sim, J. Li, V. Gopalkrishnan, and G. Liu. Mining maximal quasibicliques to co-cluster stocks and financial ratios for value investment. In
ICDM ’06: Proceedings of the Sixth International Conference on Data
Mining, pages 1059–1063, Washington, DC, USA, 2006. IEEE Computer
Society.
[43] D. K. Slonim. From patterns to pathways: gene expression data analysis
comes of age. Nature Genetics, 32:502–508, 2002.
[44] V. Spirin and L. Mirny. Protein complexes and functional modules in
molecular networks. Proc. Natl. Academy of Sci., 100(21):1123–1128,
2003.
[45] Y. Takahashi, Y. Sato, H. Suzuki, and S.-i. Sasaki. Recognition of largest
common structural fragment among a variety of chemical structures. Analytical Sciences, 3(1):23–28, 1987.
[46] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight,
D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili,
Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang,
M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in saccharomyces cerevisiae. Nature,
403:623–631, 2000.
[47] N. Wang, S. Parthasarathy, K.-L. Tan, and A. K. H. Tung. Csv: visualizing
and mining cohesive subgraphs. In SIGMOD ’08: Proceedings of the 2008
ACM SIGMOD international conference on Management of data, pages
445–458, New York, NY, USA, 2008. ACM.
[48] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[49] S. Wuchty and E. Almaas. Peeling the yeast interaction network. Proteomics, 5(2):444–449, 2005.
[50] X. Yan, X. J. Zhou, and J. Han. Mining closed relational graphs with
connectivity constraints. In KDD ’05: Proc. 11th ACM SIGKDD Intl.
Conf. on Knowledge discovery in data mining, pages 324–333, New York,
NY, USA, 2005. ACM.