Article

Generating Realistic Labelled, Weighted Random Graphs †

1 Organisation Européenne pour la Recherche Nucléaire (CERN), Route de Meyrin 385, 1217 Meyrin, Switzerland
2 Pattern Recognition and Intelligent Systems (PRIS) Lab, Beijing University of Posts and Telecommunications (BUPT), 100876 Beijing, China
3 School of Electrical and Electronic Engineering and Computer Science, Queen’s University Belfast, University Road, Belfast BT7 1NN, UK
4 Centre for Public Health, Queen’s University Belfast, University Road, Belfast BT7 1NN, UK
* Authors to whom correspondence should be addressed.
† This paper is an extended version of our paper published in New Frontiers in Mining Complex Patterns—Second International Workshop (NFMCP 2013), held in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), Prague, Czech Republic, 27 September 2013.
Algorithms 2015, 8(4), 1143-1174; https://doi.org/10.3390/a8041143
Submission received: 1 June 2015 / Revised: 16 November 2015 / Accepted: 20 November 2015 / Published: 8 December 2015
(This article belongs to the Special Issue Graph Drawing and Experimental Algorithms)

Abstract

Generative algorithms for random graphs have yielded insights into the structure and evolution of real-world networks. Most networks exhibit a well-known set of properties, such as heavy-tailed degree distributions, clustering and community formation. Usually, random graph models consider only structural information, but many real-world networks also have labelled vertices and weighted edges. In this paper, we present a generative model for random graphs with discrete vertex labels and numeric edge weights. The weights are represented as a set of Beta Mixture Models (BMMs) with an arbitrary number of mixtures, which are learned from real-world networks. We propose a Bayesian Variational Inference (VI) approach, which yields an accurate estimation while keeping computation times tractable. We compare our approach to state-of-the-art random labelled graph generators and an earlier approach based on Gaussian Mixture Models (GMMs). Our results allow us to draw conclusions about the contribution of vertex labels and edge weights to graph structure.


1. Introduction

Network analysis is concerned with finding patterns and anomalies in real-world graphs, such as social networks, computer and communication networks, or biological and ecological processes. Real graphs exhibit a number of interesting structural and evolutionary properties, such as the formation of a giant component, heavy-tailed degree distribution, small diameter, shrinking diameter, and the Densification Power Law (DPL) [1,2,3,4,5].
Besides discovering network properties, researchers are interested in the mechanisms of network formation. Generative graph models provide an abstraction of how graphs form: if the model is accurate, generated graphs will obey the same properties as real graphs. Generated graphs are also useful for simulation experiments, hypothesis testing and making predictions about graph evolution or missing graph elements. Most generative models are for unlabelled, unweighted graphs [1,3,4,6,7], although a few models take discrete vertex labels into account [8,9,10].
In this paper, we develop a generative model for labelled, weighted graphs. Weights are commonly used to represent the number of occurrences of each edge: e-mails sent between individuals in a social network [11]; calls to a subroutine in a software call graph [12]; or people walking between a pair of sensors in a building access control network [13]. In other applications, the edge weight may represent continuous values: donation amounts in a bipartite graph of donors and political candidates [11]; distance or speed in a transportation network [12]; or elapsed time to walk between the sensors in the building network [13]. In some cases, the weight has more than one dimension [12,13].
Our main motivation for this work is to create “realistic” random graphs for evaluating graph mining algorithms. Some interesting graph datasets are very small; our approach can generate large graphs with the same characteristics as a smaller input graph. Random graphs can also ameliorate concerns about privacy. A second motivation is to better understand the laws governing the relationship between graph structure and attributes. Our experiments show the extent to which various graph properties are related to labels and weights.
Our model, Agwan (Attribute Graph: Weighted and Numeric), represents the distribution of edge weights as Beta Mixture Models (BMMs) which are learned from weighted input graphs. Learning BMM parameters using Bayesian estimation is analytically intractable. Numerical solutions to simulate the posterior distribution are available, but these incur high computational cost. In Section 3, we introduce an approximation to the prior/posterior distribution of the parameters in the beta distribution and propose an analytically tractable (closed-form) Bayesian approach to parameter estimation, based on the Variational Inference (VI) framework. Following the principles of VI and utilising the relative convexity bound, the extended factorised approximation method is applied to approximate the distribution of the BMM parameters. In a fully Bayesian model, where all the parameters of the BMM are treated as variables and assigned proper distributions, our approach can asymptotically find the optimal estimate of the parameters of the posterior distribution. In addition, the model complexity depends on the empirical distribution of the data. A closed-form solution is proposed to avoid the need for numerical methods in each iteration. Our approach also avoids the overfitting exhibited by the conventional Expectation Maximisation (EM) algorithm.
This paper is arranged as follows: Section 2 is an overview of generative graph models and approaches to estimating mixture model parameters; Section 3 presents Agwan, our generative model for weighted and numeric labelled graphs, including our algorithm for fitting Agwan’s parameters to real input graphs. Section 4 gives an overview of the datasets used in the experiments, and outlines the statistical measures and tests used to evaluate the generated graphs. The experiments in Section 5 demonstrate that the vertex labels and edge weights of a graph can predict the graph structure with high accuracy. Conclusions are in Section 6.

2. Related Work

Our understanding of the mathematical properties of graph structure was pioneered by Paul Erdős and Alfréd Rényi [3]. Graph formation is modelled as a Bernoulli process, parameterised by the number of vertices and a wiring probability between each vertex pair. While it has been essential to our understanding of component sizes and expected diameter, the Erdős-Rényi model does not explain other important properties of real-world graphs such as degree distribution, transitivity and clustering [2,5].
Barabási and Albert’s Preferential Attachment model [1] uses the “rich get richer” principle to grow graphs from a few vertices up to the desired size. The probability of an edge is proportional to the number of edges already connected to a vertex. This generates graphs with power-law degree distributions. A number of variants of Preferential Attachment have been proposed [2,5]. Still, Preferential Attachment models lack some desired properties, such as community structure.
The RMat algorithm [6] solves the community structure problem with its recursive matrix approach. RMat graphs consist of $2^n$ vertices and $E$ edges, with four probabilities $a, b, c, d$ determining in which quadrant of the adjacency matrix each edge falls. These parameters allow the specification of power-law or log-normal degree distributions; if $a = b = c = d$, the result is an Erdős-Rényi graph.
Kronecker Graphs [7] fulfil all the properties mentioned above, as well as the DPL (Densification Power Law) and shrinking diameter effect. The model starts with an initiator matrix; Kronecker multiplication is applied recursively to yield the final adjacency matrix of the desired size. This work synthesises the previous work in random graphs in a very elegant way and proves that RMat graphs are a special case of Stochastic Kronecker graphs.
The models above tend to have a small number of parameters and are analytically tractable, with simple and elegant proofs of the desired properties. However, graph labels are not taken into consideration. Stochastic models are another class of generative algorithm which may not be amenable to analytical proofs but can be fit to real-world labelled graphs and used to learn the properties of those graphs. Models in this category include the Stochastic Block Model [10] and Latent Space approaches [8].
The Multiplicative Attribute Graph (MAG) model [9] draws on both of the above strands of research. MAG is parameterised by the number of vertices, a set of prior probabilities for vertex label values and a set of affinity matrices specifying the probability of an edge conditioned on the vertex labels. The affinity matrices can be learned from real graphs using Maximum Likelihood Estimation [14]. The authors of [9] prove that Kronecker Graphs are a special case of MAG graphs, and that suitably-parameterised MAG graphs fulfil all the desired properties: log-normal or power-law degree distribution, small diameter, the existence of a unique giant component and the DPL. However, the MAG model can only generate unweighted graphs. We believe that our method, described in the next section, is the first generative model for labelled, weighted graphs.
The Agwan model requires a suitable probability distribution to model the edge weights accurately and efficiently. The Gaussian distribution is popular as it has an analytically tractable Probability Density Function (PDF); a weighted mixture of Gaussian components provides a reasonable approximation to any general probability distribution [15]. The resulting Gaussian Mixture Model (GMM) is quite flexible and is used extensively in statistical pattern recognition [16]. If the number of mixture components is known in advance, the GMM parameters can be estimated using EM (Expectation Maximisation) [17]. However, if the number of mixtures is unknown, EM can result in overfitting. The problem of knowing the “correct” number of components can be avoided by using a non-parametric model: the approach in [18] assumes an infinite number of components and uses VI (Variational Inference) to determine the optimal number for a given dataset. The variational algorithm can be accelerated for higher-dimensional data using a kd-tree [19].
In [20], the edge weight parameters are specified as a GMM. However, the Gaussian distribution may not be the best choice where the weight is a countable quantity representing the number of occurrences of an edge. In this case, the weights are bounded by $(0, +\infty)$, while the Gaussian distribution is bounded by $(-\infty, +\infty)$. Although the weights can be modelled as a GMM, a large number of mixture components are required to describe the data close to the boundary [21]. Alternatives to the GMM include the truncated GMM [21] and the BMM [22]. We investigate these options in this paper.
The most central task in modelling the data with a BMM is parameter estimation. Since the normalisation constant (the beta function) in the beta distribution is defined as a fraction of integrals, it is difficult to obtain a closed-form expression for estimating the parameters. For maximum likelihood estimation of the BMM parameters, [23,24] proposed an EM algorithm [25] with iterative numerical calculations in the maximisation step. As with GMMs, the EM algorithm for BMMs has some disadvantages: it can lead to overfitting when the mixture model is excessively complex, and the iterative numerical calculation in the maximisation step (e.g., with the Newton–Raphson method) has a high computational cost.
For Bayesian estimation, we can formally find the prior distribution and the conjugate posterior distribution of the parameters of the beta distribution. However, this posterior distribution is still defined with an integration expression in the denominator, so its closed form is analytically intractable. [22] proposed a practical Bayesian estimation algorithm based on Gibbs sampling, which simulates the posterior distribution approximately rather than computing it. This method prevents the overfitting problem but still suffers from high computational cost because of the Gibbs sampling, especially when the data is in a high-dimensional space. To overcome this problem, [26,27] proposed a full Bayesian estimation method for BMM parameter estimation within the VI framework, yielding an analytically tractable solution that makes parameter estimation practical.

3. Agwan: A Generative Model for Labelled, Weighted Graphs

In this section, we present our generative model, Agwan (Attribute Graph: Weighted and Numeric). The model is illustrated in Figure 1.
Figure 1. Agwan parameters. (a) Vertex Labels; (b) Edge Weights.
Consider a graph $G = (V, E)$ with discrete vertex label values drawn from a set $L$ (Figure 1a). For each combination of vertex attributes $i, j$, the corresponding mixture model $\Omega_{ij}$ parameterises the distribution of edge weights, with an edge weight of 0 indicating no edge (Figure 1b). $u, v \in V$ are vertices and $w_{uv}, w_{vu} \in \mathbb{N}$ are edge weights. Edges $e \in E$ are specified as a 3-tuple $\langle u, v, w_{uv} \rangle$. Thus, the Agwan model is parameterised by $\pi$, the set of prior probabilities over $L$, and the set of edge weight mixture parameters $\Theta = \{\Omega_{ij} \mid i, j \in L\}$. For directed graphs, $|\Theta| = |L|^2$ and we need to generate both $w_{uv}$ and $w_{vu}$. For undirected graphs, $\Omega_{ij} = \Omega_{ji}$, so $|\Theta| = O(|L|^2 / 2)$ and $w_{vu} = w_{uv}$.
$\Omega_{ij}$ is specified as a BMM:

$$\Omega_{ij} = \sum_{m=1}^{M} \omega_m^{ij} \cdot \mathrm{Beta}(a_m^{ij}, b_m^{ij}) \qquad (1)$$

where $\omega_m^{ij}$ is the weight of each component and $\mathrm{Beta}(a_m^{ij}, b_m^{ij})$ is the beta distribution with shape parameters $(a_m^{ij}, b_m^{ij})$. The PDF of the beta distribution is:

$$\mathrm{Beta}(x; a, b) = \frac{1}{\mathrm{beta}(a, b)}\, x^{a-1} (1 - x)^{b-1}, \quad a, b > 0 \qquad (2)$$

where $\mathrm{beta}(a, b)$ is the beta function $\mathrm{beta}(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$ and $\Gamma(\cdot)$ is the gamma function, defined as $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$. As the support of the BMM is $(0, 1)$, we use a constant $s_{ij}$ to scale the data into this range before fitting the BMM parameters. During graph generation, we draw values from the BMM and multiply by $s_{ij}$.
We specify $\Omega_{ij}$ such that the weight of the first component ($m = 0$) encodes the probability of no edge: $\omega_0^{ij} = 1 - P(e_{ij})$, where $P(e_{ij})$ is the probability of an edge between pairs of vertices with labels $i, j$ and is learned from the input graph. The weights of the BMM components ($m \in [1, M]$) are normalised by $P(e_{ij})$, so the weights of all the components form a probability distribution: $\sum_{m=0}^{M} \omega_m^{ij} = 1$.
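To make the weight model concrete, the following minimal Python sketch (illustrative only; the function name and parameter layout are our own, not from the released implementation) draws a single edge weight from a scaled BMM whose zeroth component encodes the no-edge probability:

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_edge_weight(omega, a, b, s, rng=rng):
    """Illustrative sketch: draw one edge weight from a scaled BMM.

    omega -- component weights [omega_0, ..., omega_M]; omega_0 is the
             probability of *no* edge, so sum(omega) == 1.
    a, b  -- beta shape parameters for components 1..M.
    s     -- scaling constant s_ij (weights were divided by s when fitting).
    Returns 0.0 if the no-edge component is drawn, else a weight in (0, s].
    """
    m = rng.choice(len(omega), p=omega)   # pick a mixture component
    if m == 0:
        return 0.0                        # component 0 encodes "no edge"
    return s * rng.beta(a[m - 1], b[m - 1])

# Example: a two-component BMM with an 80% chance of no edge.
omega = np.array([0.8, 0.15, 0.05])
a, b = np.array([1.2, 4.0]), np.array([8.0, 2.0])
print(draw_edge_weight(omega, a, b, s=50.0))
```

Drawing the component first and the beta variate second mirrors how Algorithm 1 below decides edge existence and edge weight in a single step.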

3.1. Graph Generation

Algorithm 1 describes how to generate a random graph using Agwan($N, L, \pi, \Theta$). The number of vertices in the generated graph is specified by $N$. After assigning discrete label values to each vertex (lines 2–3), the algorithm checks each vertex pair $u, v$ for the occurrence of an edge (lines 4–7). If there is an edge, we assign its weight from mixture component $m$ (lines 8–9). The generated graph is returned as $G = (V, E)$.
Algorithm 1 Agwan Graph Generation
Require: $N$ (no. of vertices), $L$ (set of discrete label values), $\pi$ (prior distribution over $L$), $\Theta = \{\Omega_{ij}\}$ (set of mixture models)
 1: Create vertex set $V$ of cardinality $N$, edge set $E = \emptyset$
 2: for all $u \in V$ do
 3:     Randomly draw discrete label $l_u \in L$ according to prior $\pi$
 4: for all $u, v \in V : u \neq v$ do
 5:     $i = l_u$, $j = l_v$
 6:     Randomly draw mixture component $m$ according to mixture weights $\omega^{ij}$
 7:     if $m \neq 0$ then
 8:         Randomly draw edge weight $w_{uv}$ from $s_{ij} \cdot \mathrm{Beta}(a_m^{ij}, b_m^{ij})$
 9:         Create edge $e = \langle u, v, w_{uv} \rangle$, $E = E \cup \{e\}$
10: return $G = (V, E)$
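For illustration, a minimal runnable Python sketch of Algorithm 1 (directed case) follows; representing $\Theta$ as a dictionary keyed by label pairs, with each entry holding $(\omega, a, b, s)$, is an assumption of this sketch rather than the authors' data structure:

```python
import numpy as np

def agwan_generate(N, labels, pi, theta, rng=None):
    """Minimal sketch of Algorithm 1 (directed case).

    labels -- list of discrete label values L
    pi     -- prior probabilities over L, same order as `labels`
    theta  -- dict mapping (i, j) -> (omega, a, b, s), where omega[0] is
              the no-edge probability and a, b, s are as in the text
    Returns (vertex_labels, edges) with edges as (u, v, w_uv) tuples.
    """
    rng = rng or np.random.default_rng()
    # Lines 2-3: assign a discrete label to each vertex from the prior pi
    l = rng.choice(len(labels), size=N, p=pi)
    edges = []
    # Lines 4-9: test every ordered vertex pair for an edge
    for u in range(N):
        for v in range(N):
            if u == v:
                continue
            omega, a, b, s = theta[(labels[l[u]], labels[l[v]])]
            m = rng.choice(len(omega), p=omega)       # line 6: draw component
            if m != 0:                                # line 7: m == 0 means no edge
                w = s * rng.beta(a[m - 1], b[m - 1])  # line 8: draw weight
                edges.append((u, v, w))               # line 9: create edge
    return [labels[i] for i in l], edges
```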

3.2. Parameter Fitting

To create realistic random graphs, we need to learn the parameters $\pi, \Theta$ from a real-world input graph $G$. For each $i, j \in L$, the edge weights $W_{ij} = \{w_{ij}\}$ follow an unknown, arbitrary probability distribution. For each set of edge weights $W_{ij}$, we choose the scaling parameter $s_{ij} = \max W_{ij}$, then estimate the BMM parameters for the empirical distribution $\frac{1}{s_{ij}} \cdot W_{ij}$, which has support $(0, 1]$.
In this section, we describe the Bayesian estimation of BMMs within a VI framework, followed by our algorithms for Agwan parameter fitting.
The beta distribution has a conjugate prior in the exponential family. The conjugate prior density function can be written as [15,28]:

$$f(a, b) = \frac{1}{C(\alpha_0, \beta_0, \nu_0)} \left[ \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \right]^{\nu_0} e^{-\alpha_0 (a - 1)}\, e^{-\beta_0 (b - 1)} \qquad (3)$$

where $\alpha_0, \beta_0, \nu_0$ are free positive parameters and $C(\alpha_0, \beta_0, \nu_0)$ is a normalisation factor such that $\int_0^\infty \int_0^\infty f(a, b)\, da\, db = 1$. Then, given $N$ independent and identically distributed (i.i.d.) scalar observations $X = \{x_1, \ldots, x_N\}$, we obtain the posterior distribution of $a, b$ as:

$$f(a, b \mid X) = \frac{f(X \mid a, b)\, f(a, b)}{\int_0^\infty \int_0^\infty f(X \mid a, b)\, f(a, b)\, da\, db} = \frac{1}{C(\alpha_N, \beta_N, \nu_N)} \left[ \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \right]^{\nu_N} e^{-\alpha_N (a - 1)}\, e^{-\beta_N (b - 1)} \qquad (4)$$

where $\nu_N = \nu_0 + N$, $\alpha_N = \alpha_0 - \sum_{n=1}^{N} \ln x_n$ and $\beta_N = \beta_0 - \sum_{n=1}^{N} \ln(1 - x_n)$. To avoid infinity in the practical implementation, we assign $\epsilon_1$ to $x_n$ when $x_n = 0$ and $1 - \epsilon_2$ to $x_n$ when $x_n = 1$, where $\epsilon_1$ and $\epsilon_2$ are small positive real numbers.
We introduce an approximation to both the conjugate prior and the posterior distributions of the beta distribution and attempt to solve the Bayesian estimation problem via the factorised approximation method. A distribution can be used as the factorised distribution of the true posterior distribution if the optimal solution to this factorised distribution has exactly the same form as its initialisation. This requirement guarantees that the estimation works iteratively. Given the non-negativity of the parameters of the beta distribution, and assuming that $a$ and $b$ are statistically independent, we can use a well-defined distribution with support $(0, +\infty)$ to approximate the conjugate prior. One possible way is to assign gamma distributions to $a$ and $b$:

$$f(a; \mu, \alpha) = \frac{\alpha^\mu}{\Gamma(\mu)}\, a^{\mu - 1} e^{-\alpha a}; \quad f(b; \nu, \beta) = \frac{\beta^\nu}{\Gamma(\nu)}\, b^{\nu - 1} e^{-\beta b} \qquad (5)$$

The conjugate prior is then approximated as:

$$f(a, b) \approx f(a)\, f(b) \qquad (6)$$

The same form of approximation applies to the posterior distribution:

$$f(a, b \mid X) \approx f(a \mid X)\, f(b \mid X) \qquad (7)$$

For each observation $x_n$, the corresponding $\mathbf{z}_n = [z_{n1}, \ldots, z_{nM}]^T$ is the indication vector defined with respect to the $M$ components in the mixture model. One element of $\mathbf{z}_n$ is equal to 1 and the rest are equal to 0, indicating which mixture component $x_n$ belongs to. Denoting $Z = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ and assuming the indication vectors are independent given the mixing coefficients, the conditional distribution of $Z$ given $P$ is:

$$f(Z \mid P) = \prod_{n=1}^{N} \prod_{m=1}^{M} p_m^{z_{nm}} \qquad (8)$$

Introducing the Dirichlet distribution as the prior distribution of the mixing coefficients, the probability function of $P$ can be written as:

$$f(P) = \mathrm{Dir}(\mathbf{p} \mid \mathbf{c}) = C(\mathbf{c}) \prod_{m=1}^{M} p_m^{c_m - 1} \qquad (9)$$

where $C(\mathbf{c}) = \frac{\Gamma(\hat{c})}{\Gamma(c_1) \cdots \Gamma(c_M)}$ and $\hat{c} = \sum_{m=1}^{M} c_m$. We consider the observation $x_n$ and the unobserved indication vector $\mathbf{z}_n$ as the complete data. The conditional distribution of $X = \{x_1, \ldots, x_N\}$ and $Z = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ given the latent variables $\{A, B, P\}$ is:

$$f(X, Z \mid A, B, P) = f(X \mid A, B, P, Z)\, f(Z \mid P) = f(X \mid A, B, Z)\, f(Z \mid P) = \prod_{n=1}^{N} \prod_{m=1}^{M} \left[ p_m\, \mathrm{Beta}(x_n \mid \mathbf{a}_m, \mathbf{b}_m) \right]^{z_{nm}} \qquad (10)$$
The algorithm for Bayesian estimation of a BMM model is presented in Algorithm 2. An overview of the derivation of the algorithm from Equations (5)–(10) can be found in Appendix A. For full details of the derivations, refer to [26].
Algorithm 2 Bayesian estimation of a Beta Mixture Model (BMM)
Require: Number of components $M$; initial parameters for the Dirichlet distribution; initial parameters (element-wise) $\alpha_0 > 0$, $\beta_0 > 0$, $\mu_0 > 0.6156$, $\nu_0 > 0.6156$. Select the initial parameters such that $\mu_0 / \alpha_0 > 1$ and $\nu_0 / \beta_0 > 1$.
1: Initialise $r_{nm}$ with k-means
2: Calculate the initial guess of $\bar{a}$ and $\bar{b}$ from $\alpha_0, \beta_0, \mu_0, \nu_0$
3: repeat
4:     Update the hyperparameters $c_m^*$, $\mu_{lm}^*$, $\alpha_{lm}^*$, $\nu_{lm}^*$ and $\beta_{lm}^*$
5: until convergence
6: return the current estimated hyperparameters
After initialisation (lines 1–2), Algorithm 2 iterates until the current estimated model and the previous estimated model are sufficiently close (lines 3–5). The order of updating (line 4) does not matter, but each hyperparameter should be updated exactly once during each iteration. Refer to Appendix A for details of how the intermediate quantities are calculated: $c_m^*$ is updated following Equation (A3), $\mu_{lm}^*$ from Equation (A4), $\alpha_{lm}^*$ from Equation (A5), $\nu_{lm}^*$ from Equation (A6) and $\beta_{lm}^*$ from Equation (A7); the expectations for these quantities are given in Equation (A8). The algorithm returns the hyperparameters estimated at the last iteration, which are used to obtain the approximating posterior distribution. The joint posterior distribution of $a_{lm}, b_{lm}$ (Equation (4)) is approximated by the product of two gamma distributions with parameters $\mu_{lm}^*, \alpha_{lm}^*$ and $\nu_{lm}^*, \beta_{lm}^*$ (Equations (5) and (7)).
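The Python sketch below illustrates the structure of Algorithm 2 for scalar data in $(0, 1)$. It applies update Equations (A3)–(A7) with the expectations of Equation (A8), but recomputes the responsibilities $r_{nm}$ from the posterior-mean beta densities as a simplification, so it should be read as a structural illustration rather than a reproduction of the authors' implementation:

```python
import numpy as np
from scipy.special import betaln, digamma, polygamma

def fit_bmm_vi(x, M=3, n_iter=200):
    """Structural sketch of Algorithm 2 for scalar data x in (0, 1)."""
    x = np.clip(np.asarray(x, dtype=float), 1e-6, 1 - 1e-6)  # epsilon_1/epsilon_2 guard
    N = len(x)
    c0, mu0, alpha0, nu0, beta0 = 1.0, 1.0, 0.5, 1.0, 0.5    # mu0/alpha0 > 1, nu0/beta0 > 1
    # Line 1: initialise responsibilities r_nm with a 1-D k-means-style assignment
    centres = np.quantile(x, (np.arange(M) + 0.5) / M)
    r = np.zeros((N, M))
    r[np.arange(N), np.abs(x[:, None] - centres).argmin(axis=1)] = 1.0
    mu, alpha = np.full(M, mu0), np.full(M, alpha0)
    nu, beta = np.full(M, nu0), np.full(M, beta0)
    c = np.full(M, c0)
    for _ in range(n_iter):                       # lines 3-5: iterate to convergence
        a_bar, b_bar = mu / alpha, nu / beta      # expectations from Equation (A8)
        e_ln_a = digamma(mu) - np.log(alpha)      # E_a[ln a]
        e_ln_b = digamma(nu) - np.log(beta)       # E_b[ln b]
        psi_ab, dpsi_ab = digamma(a_bar + b_bar), polygamma(1, a_bar + b_bar)
        Nm = r.sum(axis=0)
        c = c0 + Nm                                                             # (A3)
        mu = mu0 + Nm * a_bar * (psi_ab - digamma(a_bar)
                                 + b_bar * dpsi_ab * (e_ln_b - np.log(b_bar)))  # (A4)
        alpha = alpha0 - r.T @ np.log(x)                                        # (A5)
        nu = nu0 + Nm * b_bar * (psi_ab - digamma(b_bar)
                                 + a_bar * dpsi_ab * (e_ln_a - np.log(a_bar)))  # (A6)
        beta = beta0 - r.T @ np.log(1 - x)                                      # (A7)
        # Simplified responsibility step: score each point under the
        # posterior-mean beta density plus E[ln p_m] (not the full
        # variational expectation used in the paper).
        a_bar, b_bar = mu / alpha, nu / beta
        log_rho = (digamma(c) - digamma(c.sum()) - betaln(a_bar, b_bar)
                   + (a_bar - 1) * np.log(x)[:, None]
                   + (b_bar - 1) * np.log1p(-x)[:, None])
        r = np.exp(log_rho - log_rho.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
    return c, mu, alpha, nu, beta
```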
Agwan parameter fitting is performed by Algorithm 3. First, we estimate the vertex label priors (lines 1–3). Next, we sample the edge weights for each possible combination of vertex label values (lines 4–10); the proportion of non-edges is used to estimate the weight of mixture 0 (lines 10 and 12). We then estimate each scaled BMM $\Omega_{ij}$ from the appropriate set of samples $W_{ij}$ using Algorithm 2 as described above (lines 13–14). Finally, the mixture weights $\omega_m^{ij}$ are normalised so that they sum to 1 (line 15).
Algorithm 3 Agwan Parameter Fitting
Require: Input graph $G = (V, E)$
 1: $L = \{\text{discrete vertex label values}\}$, $d = |L|$
 2: Calculate vertex label priors, applying Laplace smoothing: $\forall l \in L: P(l) = \frac{count(l) + \alpha}{N + \alpha d}$
 3: $\pi$ = the normalised probability distribution over $L$ such that $\sum_{i=1}^{d} P(l_i) = 1$
 4: $\forall i, j \in L: W_{ij} = \emptyset$
 5: for all $u, v \in V : u \neq v$ do
 6:     $i = l_u$, $j = l_v$
 7:     if $\langle u, v \rangle \in E$ then
 8:         $W_{ij} = W_{ij} \cup \{w_{uv}\}$
 9:     else
10:         Increment $\omega_0^{ij}$
11: for all $i, j \in L$ do
12:     $\omega_0^{ij} = 1 - P(e_{ij}\ \text{exists})$
13:     $s_{ij} = \max W_{ij}$
14:     Estimate $\Omega_{ij}$ from $\frac{1}{s_{ij}} \cdot W_{ij}$ using Algorithm 2
15:     Normalise all $\omega_m^{ij}$
16: return $\pi$, $\Theta = \{\Omega_{ij}\}$
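A corresponding Python sketch of Algorithm 3 (directed case) is given below. It reuses the fit_bmm_vi sketch above; the dictionary-based graph representation and the use of Dirichlet posterior means for the mixture weights are assumptions of the sketch, not the authors' implementation:

```python
import numpy as np
from collections import defaultdict

def agwan_fit(vertex_labels, edges, laplace_alpha=1.0):
    """Hedged sketch of Algorithm 3: learn pi and Theta from an input graph.

    vertex_labels -- dict u -> discrete label
    edges         -- dict (u, v) -> weight w_uv
    """
    V = list(vertex_labels)
    L = sorted(set(vertex_labels.values()))
    N, d = len(V), len(L)
    # Lines 1-3: Laplace-smoothed vertex label priors
    counts = {l: sum(1 for u in V if vertex_labels[u] == l) for l in L}
    pi = np.array([(counts[l] + laplace_alpha) / (N + laplace_alpha * d) for l in L])
    pi /= pi.sum()
    # Lines 4-10: collect edge weights (and count non-edges) per label pair
    W = defaultdict(list)
    non_edges = defaultdict(int)
    for u in V:
        for v in V:
            if u == v:
                continue
            i, j = vertex_labels[u], vertex_labels[v]
            if (u, v) in edges:
                W[(i, j)].append(edges[(u, v)])
            else:
                non_edges[(i, j)] += 1
    # Lines 11-15: fit a scaled BMM per label pair, with omega_0 = P(no edge)
    theta = {}
    for (i, j), w in W.items():
        s = max(w)                                 # line 13: scaling constant s_ij
        p_edge = len(w) / (len(w) + non_edges[(i, j)])
        c, mu, alpha, nu, beta = fit_bmm_vi(np.array(w) / s)      # line 14
        omega = np.concatenate(([1.0 - p_edge], p_edge * c / c.sum()))  # lines 12, 15
        theta[(i, j)] = (omega, mu / alpha, nu / beta, s)         # posterior means
    return pi, theta
```

The returned theta has the same layout assumed by the agwan_generate sketch in Section 3.1, so the two sketches can be composed for a round-trip test on a small graph.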

3.3. Extending Agwan to Multiple Attributes

Agwan can be extended to multiple numeric edge labels by generalising the concept of edge weight to $K$ dimensions. In this case, the weight is parameterised as the multivariate beta distribution. For any random vector $\mathbf{x}$ consisting of $K$ elements, the dependencies among the elements $x_1, \ldots, x_K$ can be represented by a mixture model, even if each specific mixture component can only generate vectors with statistically independent elements. Therefore, we define the multivariate BMM as:

$$f(\mathbf{x}; P, A, B) = \sum_{m=1}^{M} p_m \cdot \mathrm{Beta}(\mathbf{x}; \mathbf{a}_m, \mathbf{b}_m) = \sum_{m=1}^{M} p_m \cdot \prod_{k=1}^{K} \mathrm{Beta}(x_k; a_{km}, b_{km}) \qquad (11)$$

where $\mathbf{x} = \{x_1, \ldots, x_K\}$, $P = \{p_1, \ldots, p_M\}$, $A = \{\mathbf{a}_1, \ldots, \mathbf{a}_M\}$ and $B = \{\mathbf{b}_1, \ldots, \mathbf{b}_M\}$. $\mathbf{a}_m, \mathbf{b}_m$ denote the parameter vectors of the $m$th mixture component and $a_{km}, b_{km}$ are the (scalar) parameters of the beta distribution for element $x_k$. Using this representation, we can apply Algorithm 2 to $K$-dimensional edge weights.
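As an illustration of Equation (11), the following sketch (parameter layout hypothetical) draws one $K$-dimensional weight vector; within a single component the $K$ elements are drawn independently, so dependencies between elements arise only through the mixture:

```python
import numpy as np

def draw_multivariate_weight(p, A, B, rng=None):
    """Sketch of Equation (11): draw a K-dimensional weight vector from a
    multivariate BMM. A[m], B[m] hold the K shape-parameter pairs of
    component m."""
    rng = rng or np.random.default_rng()
    m = rng.choice(len(p), p=p)          # pick mixture component m
    return rng.beta(A[m], B[m])          # K independent beta draws, one per element

# Example: M = 2 components over K = 3 weight dimensions.
p = np.array([0.6, 0.4])
A = np.array([[2.0, 5.0, 1.5], [8.0, 1.2, 3.0]])
B = np.array([[5.0, 2.0, 4.0], [2.0, 6.0, 3.0]])
print(draw_multivariate_weight(p, A, B))
```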

4. Experimental Section

We evaluate our approach by comparing Agwan with the state-of-the-art in labelled graph generation, represented by the MAG model [9,14]. Agwan and MAG parameters are learned from real-world graphs. We generate random graphs from each model and calculate a series of statistics on each graph. As MAG does not generate weighted graphs, we fixed the weight of the edges in the generated graphs to the mean edge weight from the input graphs. This ensures that statistics such as average vertex strength are maintained.
We used two datasets for the experiments. The first is a graph of “who exercises with whom” from a behavioural health study (Figure 2a, $|V| = 279$, $|E| = 1308$) [29]. Vertices represent participants and are labelled with 28 attributes denoting demographic information and health markers obtained from questionnaire data. Participants were tracked during periods of physical activity; when two people frequently register at the same sensor at the same time, an undirected edge is created, weighted with the number of mutual coincidences. Our second dataset is the “who communicates with whom” graph of the Enron e-mail corpus (Figure 2b, $|V| = 159$, $|E| = 2667$) [11]. Vertices are labelled with the job role of the employee. Edges are weighted with the number of e-mails exchanged between sender and recipient. As e-mail communications are not symmetric, edges in the Enron graph are directed. Both graphs exhibit a core-periphery structure which is typical of many real-world networks [7].
Figure 2. Input Graph Datasets, from (a) a health study and (b) the Enron e-mail corpus. (a) Undirected graph of who exercised with whom; (b) Directed graph of who e-mailed whom.
We evaluated Agwan against the following models:
Erdős-Rényi random graph (ER): The ER model $G(n, p)$ has two parameters. We set the number of vertices $n$ and the edge probability $p$ to match the input graphs as closely as possible. We do not expect a very close fit, but the ER model provides a useful baseline.
MAG model with real attributes (MAG-R1): The MAG model with a single real attribute has a set of binary edge probabilities $\Theta = \{p_{ij}\}$ in place of a set of BMMs $\Theta = \{\Omega_{ij}\}$.
MAG model with latent attributes (MAG-Lx): The MAG model also allows for modelling the graph structure using latent attributes. The discrete labels provided in the input graph are ignored; instead, MagFit [14] learns the values of a set of latent attributes to describe the graph structure. We vary the number of synthetic attributes $x$ to investigate the relative contributions of vertex labels and edge weights to graph structure.
Agwan model with truncated GMMs: We compare our BMM model to an alternative approach using GMMs [20]. One drawback of using GMMs is that it is possible to draw edge weights $w_{uv} < 0$. To avoid negative edge weights, we implement a rejection rule during graph generation, redrawing values from $\Omega_{ij}$ until $w_{uv} \geq 0$.
To evaluate the closeness of fit of each model, we use the following statistics:
Vertex Strength: For an unweighted graph, one of the most important measures is the degree distribution (the number of in-edges and out-edges of each vertex). Real-world graphs tend to have heavy-tailed power-law or log-normal degree distributions [2,5]. For a weighted graph, we generalise the concept of vertex degree to vertex strength [30]:

$$s_u = \sum_{v \neq u} w_{uv} \qquad (12)$$

We compare using the Complementary Cumulative Distribution Function (CCDF) of the strength of each vertex (both in-strength and out-strength in the case of the directed graph). For a Cumulative Distribution Function (CDF) $F(x) = P(X \leq x)$, the CCDF is defined as $\bar{F}(x) = P(X > x) = 1 - F(x)$. We show the unnormalised CCDFs in our plots; the normalised value can be obtained by scaling the area under the curve to 1. The CCDF of a power-law function appears linear when plotted on log-log axes, while the CCDF of a log-normal function appears parabolic.
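As a concrete illustration (not taken from the original evaluation code), vertex strengths and the unnormalised CCDF can be computed from a weighted adjacency matrix as follows:

```python
import numpy as np

def strength_ccdf(weights):
    """Sketch: out-strengths s_u = sum_v w_uv from a weighted adjacency
    matrix (zero diagonal assumed), and the unnormalised CCDF used in
    the plots."""
    s = weights.sum(axis=1)                       # out-strength per vertex
    xs = np.sort(np.unique(s))
    ccdf = np.array([(s > x).sum() for x in xs])  # F_bar(x) = #{u : s_u > x}
    return xs, ccdf

# Plotting xs vs ccdf on log-log axes reveals power-law (linear) or
# log-normal (parabolic) behaviour.
```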
Spectral Properties: We use Singular Value Decomposition (SVD) to calculate the singular values and singular vectors of the graph’s adjacency matrix, which act as a signature of the graph structure. In an unweighted graph, the adjacency matrix contains binary values, for “edge” or “no edge”. In a weighted graph, the adjacency matrix contains the edge weights (with 0 indicating no edge). For the SVD $U \Sigma V^T$, we plot the CCDFs of the singular values $\Sigma$ and of the components of the left singular vector $U$ corresponding to the highest singular value.
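A matching sketch for the spectral signature, using numpy's SVD routine:

```python
import numpy as np

def spectral_signature(weights):
    """Sketch: singular values and the primary left singular vector of a
    (weighted) adjacency matrix, as used for the spectral comparisons."""
    U, S, Vt = np.linalg.svd(weights)
    return S, np.abs(U[:, 0])   # singular values; primary left singular vector
```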
Clustering Coefficients: The clustering coefficient $C$ is an important measure of community structure. It measures the density of triangles in the graph, or the probability that two neighbours of a vertex are themselves neighbours [5]. We extend the notion of clustering coefficients to weighted, directed graphs by defining $C_u$, the weighted clustering coefficient for vertex $u$ [30]:

$$C_u = \frac{\left[ \mathbf{W}_u^{[1/3]} + (\mathbf{W}_u^T)^{[1/3]} \right]^3_{uu}}{2 \left[ d_u^{tot} (d_u^{tot} - 1) - 2 d_u^{\leftrightarrow} \right]} \qquad (13)$$

where $\mathbf{W}_u$ is the weighted adjacency matrix for $u$ and its neighbours, $\mathbf{W}^T$ is the transpose of $\mathbf{W}$, $d_u^{tot}$ is the total degree of vertex $u$ (the sum of its in- and out-degrees) and $d_u^{\leftrightarrow}$ is the number of bilateral edges at $u$ (the number of neighbours of $u$ which have both an in-edge and an out-edge between themselves and $u$). The notation $\mathbf{A}_{uu}$ denotes the $u$th element of the main diagonal of $\mathbf{A}$.
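The following sketch is one reading of Equation (13), assuming the weights have been pre-scaled to $[0, 1]$ so that the element-wise cube root stays bounded; it is illustrative, not the authors' code:

```python
import numpy as np

def weighted_clustering(W):
    """Sketch of Equation (13) for a weighted, directed adjacency matrix W
    (weights assumed pre-scaled to [0, 1], zero diagonal)."""
    A = (W > 0).astype(float)                 # binary adjacency
    What = np.cbrt(W)                         # element-wise W^[1/3]
    num = np.diagonal(np.linalg.matrix_power(What + What.T, 3))
    d_tot = A.sum(axis=0) + A.sum(axis=1)     # in-degree + out-degree
    d_bi = np.diagonal(A @ A)                 # bilateral (reciprocated) edges
    denom = 2 * (d_tot * (d_tot - 1) - 2 * d_bi)
    return np.where(denom > 0, num / denom, 0.0)
```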
Figure 3. Triad Patterns in a Directed Graph. (a) Cycle; (b) Middleman; (c) In; (d) Out.
Triad Participation: Closely related to the clustering coefficient is the concept of triangle or triad participation. The number of triangles that a vertex is connected to is a measure of transitivity [5]. For directed graphs, the triangles have a different interpretation depending on the edge directions; there are four types of triangle pattern [30], as shown in Figure 3. To generalise the concept of triad participation to weighted, directed graphs, we consider each of the four triangle types separately, and sum the total strength of the edges in each triad:

$$t_u^y = \sum_{v, z \in V \setminus \{u\}} W_{uvz}^y \qquad (14)$$

where $y \in \{cycle, middleman, in, out\}$ is the triangle type and $W_{uvz}^y$ is calculated as shown in Figure 3 for each triangle type $y$.
To give a more objective measure of the closeness of fit between the generated graphs and the input graph, we use a Kolmogorov-Smirnov (KS) test and the L2 (Euclidean) distance between the CCDFs for each statistic. Details are in Appendix B.

5. Results and Discussion

In the experimental data, most of the edge weights follow a heavy-tailed distribution. The BMM achieves a very close fit with its primary mixture component (Figure 4a). The GMM would need several Gaussian components to achieve a similar fit. In practice, the VI algorithm for GMMs [18] tries to fit power-law or log-normal distributions using a single Gaussian component, resulting in probability mass being assigned to the region $x < 0$, as shown in Figure 4a. This results in a fit that is not as close as the BMM’s (Figure 4b). To compensate for this effect, we used a truncated GMM for graph generation, as discussed in Section 4.
Figure 4. Fitting BMM and GMM models to edge weights. (a) Probability Mass Function; (b) Cumulative Distribution Function.
For our experiments, we generated 10 random graphs for each Agwan model. For each graph, we calculated the statistics for vertex strength, spectral properties, clustering and triad participation, as described in the previous section. We calculated the CCDFs for each set of statistics, and averaged the CCDF scores at each x-coordinate value across the 10 graphs, to smooth any random fluctuations. The plots of the averaged CCDFs for each model are shown in Figures 5–12. Tables for the closeness of fit of each CCDF (KS and L2 statistics) are in Appendix B.
Figure 5. Vertex Strength Distribution—Real Attributes. (a) Undirected; (b) Directed (In-strength); (c) Directed (Out-strength).
Figure 6. Spectral Properties—Real Attributes. (a) Undirected—Singular Values; (b) Undirected—Primary Left Singular Vector; (c) Directed—Singular Values; (d) Directed—Primary Left Singular Vector.
Figure 7. Clustering Coefficients—Real Attributes. (a) Undirected; (b) Directed (In-edges); (c) Directed (Out-edges).
Figure 8. Triad Participation—Real Attributes. (a) Undirected; (b) Directed (Cycles); (c) Directed (Middlemen); (d) Directed (Ins); (e) Directed (Outs).
Figure 9. Vertex Strength Distribution—Synthetic Attributes. (a) Undirected; (b) Directed (In-strength); (c) Directed (Out-strength).
Figure 10. Spectral Properties—Synthetic Attributes. (a) Undirected—Singular Values; (b) Undirected—Primary Left Singular Vector; (c) Directed—Singular Values; (d) Directed—Primary Left Singular Vector.
Figure 11. Clustering Coefficients—Synthetic Attributes. (a) Undirected; (b) Directed (In-edges); (c) Directed (Out-edges).
Figure 12. Triad Participation—Synthetic Attributes. (a) Undirected; (b) Directed (Cycles); (c) Directed (Middlemen); (d) Directed (Ins); (e) Directed (Outs).

5.1. Real Attributes

For the undirected graph (Health Study), we show results for two vertex attributes: Age and Floor (the building and floor number where the person works; people who work on the same floor were highly likely to exercise together). For the directed graph (Enron), there is one vertex attribute, the person’s job role.
Vertex Strength (Figure 5): The graphs generated from Agwan have vertex strength distributions which map very closely to the input graphs. The graphs generated from MAG-R1 are better than random (ER), but the vertex strength distribution is compressed into the middle part of the range, with too few high- and low-strength vertices. This indicates that vertex strength depends on both the label distribution and the edge weight distribution; Agwan models both of these effectively.
Spectral Properties (Figure 6): The spectral properties of the Agwan graphs map very closely to the input graphs. The singular values follow the same curve as the input graphs. This is significant as it indicates that graphs generated with Agwan have similar connectivity to the input graph: the primary singular value has been shown to be the most significant measure of graph connectivity [2]. In contrast, MAG-R1 does not preserve the singular value curve, showing a linear relationship instead. MAG-R1 cannot model the singular vector components at all without taking the edge weights into account. These results show that the spectral properties of the graph are dependent on both the vertex label distribution and the edge weight distribution.
Clustering Coefficients (Figure 7): The accuracy of Agwan and MAG-R1 is similar. For the undirected graph, performance is little better than random. For the directed graph, Agwan gives reasonable performance for the low-degree vertices, but this drops away for the high-degree vertices. In all cases, the clustering is independent of the edge weight distribution. The accuracy of the results depends on which vertex attribute was chosen, implying that some attributes can predict community formation better than others. We hypothesise that cluster formation may in fact be (conditionally) independent of the vertex label values and may be better explained by triadic closure [5]: links are likely to form between two people who share a mutual friend, independently of their vertex attributes. The apparent dependency on vertex labels may be an artefact of the connection to the mutual friend, rather than the true explanation of why clustering arises. This aspect of network formation requires further investigation.
Triad Participation (Figure 8): Similar to clustering, triad participation appears to be dependent to some extent on vertex label values but independent of the edge weight distribution. We hypothesise that like clustering, the apparent dependency between vertex label and triad participation values may be due to triadic closure, which is not currently modelled by either MAG or Agwan.

5.2. Synthetic Attributes

The second set of experiments ignore the real attributes of the graph, replacing them with a set of randomly generated synthetic attributes. The purpose is to evaluate the relative contribution of the discrete vertex labels and numeric edge weights to the graph structure. By reducing the informativeness of the discrete attributes to zero, we can evaluate the contribution of the numeric edge weights in isolation.
We replaced the real labels in the input graph with a synthetic vertex attribute taking $2^0$ to $2^9$ values allocated uniformly at random, then learned the edge weight distributions using VI as normal. We compare our results with an alternative interpretation of the MAG model, which ignores the true attribute values from the input graph and represents attributes as latent variables, learned using a variational EM approach [14]. We also show Agwan with one real attribute for comparison.
Vertex Strength (Figure 9): Agwan with synthetic attributes significantly outperforms MAG for the accuracy of the vertex strength distribution, and has similar accuracy to Agwan-R1. Varying the number of synthetic attributes has a small effect on the accuracy. We conclude that vertex strength is dependent on both edge weight and vertex label distribution, but the edge weights play a more important role.
Spectral Properties (Figure 10): Agwan has a very close fit to the spectral properties of the input graphs. Varying the number of attributes has a moderate effect on the closeness of fit. On the undirected graph, MAG-L9 matches the singular vector very closely but performs poorly with the singular values; in general, MAG is a poor fit as the edge weight distribution is not taken into account. These results confirm that the spectral properties are dependent on both the vertex label distribution and the edge weight distribution, and the degrees of freedom of the vertex labels are important.
Clustering Coefficients (Figure 11): Agwan and MAG are both significantly more accurate using synthetic attributes than with real attributes. This (somewhat surprising) result implies that clustering is not closely related to the real attribute values. While real labels appear to be influenced by the (unobserved) process which gives rise to clustering, synthetic labels with more degrees of freedom can model it more accurately. We note that Agwan performs better where there are more degrees of freedom, while MAG performs better with few degrees of freedom (due to the independence assumption mentioned in Section 3.3). As before, clustering appears to be independent of the edge weight distribution.
Triad Participation (Figure 12): Similarly to clustering, Agwan and MAG are both more accurate using synthetic attributes than they were with real attributes. The effect of the degrees of freedom of the synthetic attributes is even more pronounced than it was for clustering: Agwan-L9 has a good fit whereas Agwan-L1 is not so good. Edge weights appear to have little influence.

5.3. Graph Evolution

One of the goals of our model is to generate synthetic graphs which are larger than the input graph. We conducted experiments to measure how Agwan graph properties change as the number of vertices increases. We generated sets of random graphs from the same model, with increasing numbers of vertices $N = \{10, 20, \ldots, 100, 200, \ldots, 3000\}$, and measured the size of the giant component, the density and the diameter.
Giant Component: We measured the proportion of vertices in the largest component of the graph as the graph grows in size. Figure 13 shows that a giant component forms quickly, as in real graphs [5].
Figure 13. Giant Component. (a) Undirected; (b) Directed.
Densification Power Law: We measured graph densification as the relationship between the number of edges and the number of vertices as the graph grows in size. Figure 14 shows that Agwan graphs densify as they grow, as in real graphs. Real-world graphs have $e(t) \propto v(t)^\gamma$, with power-law exponent $\gamma$ typically in the range $(1, 2)$ [4]. When $\gamma = 1$, the rate of edge growth is linear in the number of vertices; $\gamma = 2$ means that, on average, each vertex has edges to a constant fraction of all vertices. As Figure 14 shows, Agwan graphs have $\gamma \approx 2$, because the model learns its edge probabilities independently of the size of the input graph. This means that Agwan overestimates density when generating graphs larger than the input graph.
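For example, $\gamma$ can be estimated from the recorded growth data by least squares on log-log axes (illustrative sketch; the data below is synthetic):

```python
import numpy as np

def densification_exponent(v, e):
    """Sketch: estimate gamma in e(t) ~ v(t)^gamma by least-squares
    regression on log-log axes, given vertex counts v and edge counts e
    recorded as the generated graph grows."""
    gamma, _ = np.polyfit(np.log(v), np.log(e), 1)
    return gamma

# Example: gamma ~ 2 indicates edge count growing quadratically in vertices.
v = np.array([100, 200, 400, 800])
print(densification_exponent(v, 0.05 * v.astype(float) ** 2))
```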
Figure 14. Densification Power Law. (a) Undirected, unweighted; (b) Undirected, weighted; (c) Directed, unweighted; (d) Directed, weighted.
Shrinking Diameter Effect (Figure 15): Agwan graphs exhibit the Small World Effect [5] and the unweighted diameter shrinks as the graphs grow, as in real graphs [4]. As the weighted diameter was calculated using Johnson’s algorithm [31], which treats the edge weight as a cost function, we used the reciprocal of the edge weights. With this interpretation, the weighted diameter also shrinks as the graphs grow. It is interesting to note that the weighted diameters decay following a power law (with exponent $\gamma = -0.72$ for the undirected graph and $\gamma = -0.65$ for the directed graph).
Figure 15. Shrinking Diameter Effect. (a) Undirected, unweighted; (b) Undirected, weighted; (c) Directed, unweighted; (d) Directed, weighted.

6. Conclusions and Future Work

In this paper, we presented Agwan, a generative model for random graphs with discrete labels and weighted edges. We included a fitting algorithm to learn a model of edge weights from real-world graphs, and an algorithm to generate random labelled, weighted graphs with similar characteristics to the input graph.
We measured the closeness of fit of our generated graphs to the input graph over a range of graph statistics, and compared our approach to the state-of-the-art in random graph generative algorithms. Our results and findings are summarised in Table 1.
Table 1. Summary of results and findings: dependencies between graph labels and weights and structural properties (* hypothesised); relative accuracy of the properties of weighted graphs generated using Agwan and MAG models.

Statistic | Vertex Labels | Edge Weights | Agwan | MAG
Vertex Strength | Dependent | Dependent | Accurate | Less accurate
Singular Values | Dependent | Dependent | Accurate | Less accurate
Primary Singular Vector | Partially Dependent | Dependent | Accurate | Poor
Clustering Coefficients | Conditionally Independent * | Independent | Less accurate with real attributes; more accurate with synthetic attributes (both models)
Triad Participation | Conditionally Independent * | Independent | Less accurate with real attributes; more accurate with synthetic attributes (both models)
Our results demonstrate that Agwan produces an accurate model of the weighted input graphs. The Agwan graphs reproduce many of the properties of the input graphs, including formation of a giant component, heavy-tailed degree distribution, Small World property, Densification Power Law and shrinking diameter effect. Vertex strength and spectral properties were modelled very closely, indicating that these properties are indeed dependent on vertex labels and edge weights. For clustering and triad participation, it appears that these properties are independent of the edge weight distribution. While there appears to be a relationship with the vertex label distribution, we suggest that this may be an artefact of the true process giving rise to these properties, triadic closure. Further research is required into the relationship between vertex attributes and triangle formation in graphs.
We investigated the accuracy of modelling edge weights using BMMs and compared to a previous approach which used GMMs. The edge weights often follow a power-law distribution, which the BMM approach fits more closely, particularly in the lower range of edge weight values. In general, BMMs are more suitable for bounded data; probability mass can only be assigned to the valid range of weight values. For the GMM model, we compensated for this by truncating the GMM during graph generation. We expected that Agwan (BMM) would consistently outperform Agwan (GMM), but the truncated GMM performed surprisingly well. We propose to conduct experiments on a wider range of datasets to investigate this further. It would also be interesting to compare with Poisson Mixture Models to model discrete edge weights.

Supplementary Materials

Source code and datasets used for the experiments (Gzipped TAR archive). See Datasets/README and src/README for details. Supplementary materials can be accessed at: https://www.mdpi.com/1999-4893/8/4/1143/s1.

Acknowledgments

This research was supported by funding from the Engineering and Physical Sciences Research Council (grant No. EP/H049606/1), National Natural Science Foundation of China (NSFC) grant Nos. 61402047 and 61511130081, the National Prevention Research Initiative (grant No. G0802045) and their funding partners, and the Department for Employment and Learning (grant No. M6003CPH).

Author Contributions

Michael Charles Davis is the principal author, responsible for the design of the study, conducting the experiments and analysing the results. Zhanyu Ma contributed the variational inference approach used for Beta Mixture Modelling. Weiru Liu and Paul Miller supervised the study and provided substantial critique and guidance. Ruth Hunter and Frank Kee designed the social network study for the health intervention and provided data for the experiments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix

A. Derivation of Approximate Inference Algorithm

The derivation of the algorithm for Bayesian estimation of a BMM model (Algorithm 2) from Equations (5)–(10) is outlined in this appendix.
We combine Equations (5)–(10) using Bayesian rules to obtain the joint density function of the observation $X$ and all the i.i.d. latent variables $\mathcal{Z} = \{A, B, P, Z\}$. We assume that $X$ is conditionally independent of $P$, given $Z$. The joint density function is given by:

$$f(X, \mathcal{Z}) = f(X, A, B, P, Z) = f(X \mid A, B, Z)\, f(Z \mid P)\, f(P)\, f(A)\, f(B) = \prod_{n=1}^{N} \prod_{m=1}^{M} \left[ p_m\, \mathrm{Beta}(x_n \mid \mathbf{a}_m, \mathbf{b}_m) \right]^{z_{nm}} \cdot C(\mathbf{c}) \prod_{m=1}^{M} p_m^{c_m - 1} \cdot \prod_{l=1}^{L} \prod_{m=1}^{M} \frac{\alpha_{lm}^{\mu_{lm}}}{\Gamma(\mu_{lm})}\, a_{lm}^{\mu_{lm} - 1} e^{-\alpha_{lm} a_{lm}} \cdot \frac{\beta_{lm}^{\nu_{lm}}}{\Gamma(\nu_{lm})}\, b_{lm}^{\nu_{lm} - 1} e^{-\beta_{lm} b_{lm}} \qquad (A1)$$

and the logarithm of Equation (A1) is:

$$\mathcal{L}(X, \mathcal{Z}) = \ln f(X, Z, A, B, P) = \text{const.} + \sum_{n=1}^{N} \sum_{m=1}^{M} z_{nm} \left\{ \ln p_m + \sum_{l=1}^{L} \ln \frac{\Gamma(a_{lm} + b_{lm})}{\Gamma(a_{lm})\Gamma(b_{lm})} + \sum_{l=1}^{L} \left[ (a_{lm} - 1) \ln x_{ln} + (b_{lm} - 1) \ln(1 - x_{ln}) \right] \right\} + \sum_{l=1}^{L} \sum_{m=1}^{M} \left[ (\mu_{lm} - 1) \ln a_{lm} - \alpha_{lm} a_{lm} \right] + \sum_{l=1}^{L} \sum_{m=1}^{M} \left[ (\nu_{lm} - 1) \ln b_{lm} - \beta_{lm} b_{lm} \right] + \sum_{m=1}^{M} (c_m - 1) \ln p_m \qquad (A2)$$
The i.i.d. latent variables are now $A$, $B$, $P$ and $Z$, with hyperparameters $\alpha$, $\beta$, $\mu$, $\nu$ and $\mathbf{c}$. The updating scheme of VI can be used to estimate these hyperparameters. Following the principles of the VI framework [15,32,33,34,35], we have:

$$c_m^* = c_m^0 + \sum_{n=1}^{N} \mathbb{E}[z_{nm}] = c_m^0 + \sum_{n=1}^{N} r_{nm} \qquad (A3)$$

where $c_m^0$ denotes the initial value of $c_m$.
We can also obtain the closed-form updating scheme for the hyperparameters $\mu_{lm}$ and $\alpha_{lm}$:

$$\mu_{lm}^* = \mu_{lm}^0 + \sum_{n=1}^{N} \mathbb{E}[z_{nm}]\, \bar{a}_{lm} \left[ \psi(\bar{a}_{lm} + \bar{b}_{lm}) - \psi(\bar{a}_{lm}) + \bar{b}_{lm}\, \psi'(\bar{a}_{lm} + \bar{b}_{lm}) \left( \mathbb{E}_b[\ln b_{lm}] - \ln \bar{b}_{lm} \right) \right] \qquad (A4)$$

$$\alpha_{lm}^* = \alpha_{lm}^0 - \sum_{n=1}^{N} \mathbb{E}[z_{nm}] \ln x_{ln} \qquad (A5)$$

For the same reasons, the update schemes for the hyperparameters $\nu_{lm}$ and $\beta_{lm}$ are:

$$\nu_{lm}^* = \nu_{lm}^0 + \sum_{n=1}^{N} \mathbb{E}[z_{nm}]\, \bar{b}_{lm} \left[ \psi(\bar{a}_{lm} + \bar{b}_{lm}) - \psi(\bar{b}_{lm}) + \bar{a}_{lm}\, \psi'(\bar{a}_{lm} + \bar{b}_{lm}) \left( \mathbb{E}_a[\ln a_{lm}] - \ln \bar{a}_{lm} \right) \right] \qquad (A6)$$

$$\beta_{lm}^* = \beta_{lm}^0 - \sum_{n=1}^{N} \mathbb{E}[z_{nm}] \ln(1 - x_{ln}) \qquad (A7)$$

where $\psi(\cdot)$ is the digamma function and $\psi'(\cdot)$ its first derivative (the trigamma function). Further details of the derivations are given in [26,27].
The update Equations (A3)–(A7) are calculated through the following expectations:

$$\bar{a} = \frac{\mu}{\alpha}, \quad \bar{b} = \frac{\nu}{\beta}, \quad \mathbb{E}[z_{nm}] = r_{nm}, \quad \mathbb{E}[\ln p_m] = \psi(c_m) - \psi(\hat{c}), \quad \mathbb{E}_a[\ln a] = \psi(\mu) - \ln \alpha, \quad \mathbb{E}_b[\ln b] = \psi(\nu) - \ln \beta,$$
$$\mathbb{E}_a\left[ (\ln a - \ln \bar{a})^2 \right] = \left[ \psi(\mu) - \ln \mu \right]^2 + \psi'(\mu), \quad \mathbb{E}_b\left[ (\ln b - \ln \bar{b})^2 \right] = \left[ \psi(\nu) - \ln \nu \right]^2 + \psi'(\nu) \qquad (A8)$$

B. KS and L2 Statistics

To measure the closeness of fit between the generated graphs and the input graph, we use a Kolmogorov-Smirnov (KS) test and the L2 (Euclidean) distance between the CCDFs for each of the statistics presented in Section 4. The model that generates graphs with the lowest KS and L2 values has the closest fit to the real-world graph for that graph property.
The KS test is commonly used for goodness of fit and is based on the following statistic:
$$KS = \sup_x \left| F^*(x) - S(x) \right|$$

where $F^*(x)$ is the hypothesised Cumulative Distribution Function (CDF) and $S(x)$ is the empirical distribution function based on the sampled data [36]. As trying to perform a linear fit on a heavy-tailed distribution is biased [36], we use the logarithmic variant of the KS test [14]:

$$KS = \sup_x \left| \log F^*(x) - \log S(x) \right|$$

We also calculate the L2 (Euclidean) distance for each statistic in the logarithmic scale:

$$L2 = \sqrt{\frac{1}{\log b - \log a} \sum_{x=a}^{b} \left( \log F^*(x) - \log S(x) \right)^2}$$

where $[a, b]$ is the support of the distributions $F^*(x)$ and $S(x)$.
Note that for the Singular Values and Primary Left Singular Vectors, we use the standard (linear) version of KS and L2, as these statistics do not follow a heavy-tailed distribution.
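An illustrative computation of the logarithmic KS and L2 statistics (the function name and vector-based interface are hypothetical):

```python
import numpy as np

def log_ks_l2(F_star, S, a, b):
    """Sketch: logarithmic KS and L2 statistics between two distribution
    vectors F_star and S evaluated on a common support [a, b]
    (zero entries excluded beforehand)."""
    lf, ls = np.log(F_star), np.log(S)
    ks = np.abs(lf - ls).max()              # KS = sup |log F* - log S|
    l2 = np.sqrt(np.sum((lf - ls) ** 2) / (np.log(b) - np.log(a)))
    return ks, l2
```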

B1. Real Attributes (Figures 5–8)

Table B1. KS Statistic for Undirected Graph, Real Attributes.

Statistic | Erdős-Rényi | MAG-R1 (Age) | MAG-R1 (Floor) | Agwan-R1 GMM (Age) | Agwan-R1 GMM (Floor) | Agwan-R1 BMM (Age) | Agwan-R1 BMM (Floor)
Vertex Strength | 6.064 | 5.940 | 5.799 | 2.303 | 2.303 | 1.609 | 2.303
Singular Values | 3752.439 | 3809.375 | 3829.967 | 1284.796 | 507.622 | 263.853 | 262.647
Singular Vector | 10.142 | 9.513 | 9.437 | 1.039 | 1.599 | 2.401 | 1.871
Clustering Coefficient | 5.224 | 5.048 | 4.895 | 4.917 | 4.994 | 4.249 | 5.196
Triad Participation | 7.012 | 6.877 | 6.685 | 6.646 | 6.646 | 6.898 | 6.685
Table B2. L2 Statistic for Undirected Graph, Real Attributes.

Statistic | Erdős-Rényi | MAG-R1 (Age) | MAG-R1 (Floor) | Agwan-R1 GMM (Age) | Agwan-R1 GMM (Floor) | Agwan-R1 BMM (Age) | Agwan-R1 BMM (Floor)
Vertex Strength | 9.686 | 7.281 | 10.039 | 2.572 | 3.292 | 3.291 | 4.623
Singular Values | 14,813.827 | 14,734.707 | 14,926.969 | 12,370.046 | 5624.668 | 2448.545 | 2772.618
Singular Vector | 97.408 | 96.356 | 95.162 | 14.688 | 23.921 | 36.761 | 28.416
Clustering Coefficient | 55.422 | 48.889 | 48.524 | 50.844 | 52.051 | 50.267 | 51.982
Triad Participation | 16.879 | 17.334 | 17.101 | 17.419 | 19.601 | 19.596 | 19.726
Table B3. KS Statistic for Directed Graph, Real Attributes.

| Statistic | Erdős-Rényi | MAG-R1 | Agwan-R1 (GMM) | Agwan-R1 (BMM) |
|---|---|---|---|---|
| In-Vertex Strength | 2.469 | 4.700 | 1.609 | 2.303 |
| Out-Vertex Strength | 2.708 | 2.659 | 1.204 | 2.303 |
| Singular Values | 40,564.590 | 7220.030 | 22,750.790 | 11,944.700 |
| Singular Vector | 8.927 | 9.154 | 0.714 | 0.619 |
| Clustering Coefficient (In-Edges) | 3.444 | 2.208 | 1.784 | 3.355 |
| Clustering Coefficient (Out-Edges) | 3.728 | 0.769 | 3.167 | 0.814 |
| Clustering Coefficient | 4.347 | 1.651 | 0.593 | 0.835 |
| Triad Participation (Cycles) | 4.787 | 4.248 | 3.219 | 4.248 |
| Triad Participation (Middlemen) | 4.382 | 4.500 | 2.639 | 2.503 |
| Triad Participation (Ins) | 4.700 | 4.500 | 3.401 | 4.248 |
| Triad Participation (Outs) | 4.382 | 4.094 | 1.851 | 4.248 |
Table B4. L2 Statistic for Directed Graph, Real Attributes.

| Statistic | Erdős-Rényi | MAG-R1 | Agwan-R1 (GMM) | Agwan-R1 (BMM) |
|---|---|---|---|---|
| In-Vertex Strength | 5.679 | 4.912 | 2.244 | 1.507 |
| Out-Vertex Strength | 5.100 | 3.534 | 2.463 | 2.510 |
| Singular Values | 262,699.994 | 32,990.793 | 104,552.681 | 60,843.087 |
| Singular Vector | 69.775 | 74.606 | 8.842 | 4.884 |
| Clustering Coefficient (In-Edges) | 3.528 | 1.607 | 1.215 | 2.231 |
| Clustering Coefficient (Out-Edges) | 3.145 | 1.191 | 1.891 | 1.399 |
| Clustering Coefficient | 6.949 | 1.438 | 0.813 | 1.033 |
| Triad Participation (Cycles) | 3.823 | 3.000 | 3.027 | 5.088 |
| Triad Participation (Middlemen) | 5.144 | 4.178 | 4.101 | 6.660 |
| Triad Participation (Ins) | 4.630 | 4.826 | 4.380 | 7.747 |
| Triad Participation (Outs) | 3.727 | 3.295 | 2.869 | 4.984 |

B2. Synthetic Attributes (Figures 9–12)

Table B5. KS Statistic for Undirected Graph, Synthetic Attributes.

| Statistic | MAG (L1) | MAG (L5) | MAG (L9) | Agwan GMM (L1) | Agwan GMM (L5) | Agwan GMM (L9) |
|---|---|---|---|---|---|---|
| Vertex Strength | 2.243 | 5.670 | 6.234 | 2.207 | 1.354 | 0.722 |
| Singular Values | 6519.191 | 6580.987 | 8801.651 | 284.193 | 1518.976 | 4169.982 |
| Singular Vector | 5.037 | 3.447 | 0.362 | 3.514 | 2.686 | 2.082 |
| Clustering Coefficient | 1.283 | 4.401 | 4.773 | 4.627 | 4.422 | 2.319 |
| Triad Participation | 3.829 | 6.016 | 6.877 | 7.115 | 6.205 | 4.007 |

| Statistic | Agwan BMM (L1) | Agwan BMM (L5) | Agwan BMM (L9) | Agwan BMM R1 (Age) | Agwan BMM R1 (Floor) |
|---|---|---|---|---|---|
| Vertex Strength | 2.996 | 3.555 | 1.609 | 1.609 | 2.303 |
| Singular Values | 626.352 | 3206.064 | 4602.751 | 263.853 | 262.647 |
| Singular Vector | 3.212 | 2.334 | 2.408 | 2.401 | 1.871 |
| Clustering Coefficient | 4.784 | 4.960 | 3.257 | 4.249 | 5.196 |
| Triad Participation | 7.065 | 6.966 | 4.200 | 6.898 | 6.685 |
Table B6. L2 Statistic for Undirected Graph, Synthetic Attributes.

| Statistic | MAG (L1) | MAG (L5) | MAG (L9) | Agwan GMM (L1) | Agwan GMM (L5) | Agwan GMM (L9) |
|---|---|---|---|---|---|---|
| Vertex Strength | 7.944 | 10.103 | 21.027 | 4.980 | 3.611 | 2.570 |
| Singular Values | 32,272.685 | 32,524.858 | 47,155.221 | 2831.411 | 9885.424 | 30,238.426 |
| Singular Vector | 56.664 | 46.483 | 3.524 | 53.375 | 40.520 | 30.481 |
| Clustering Coefficient | 25.795 | 40.622 | 61.933 | 57.583 | 55.496 | 34.061 |
| Triad Participation | 12.047 | 11.038 | 29.136 | 22.053 | 20.686 | 9.375 |

| Statistic | Agwan BMM (L1) | Agwan BMM (L5) | Agwan BMM (L9) | Agwan BMM R1 (Age) | Agwan BMM R1 (Floor) |
|---|---|---|---|---|---|
| Vertex Strength | 5.610 | 8.036 | 3.812 | 3.291 | 4.623 |
| Singular Values | 6001.378 | 19,322.594 | 32,632.837 | 2448.545 | 2772.618 |
| Singular Vector | 49.353 | 35.109 | 35.913 | 36.761 | 28.416 |
| Clustering Coefficient | 57.318 | 54.340 | 36.950 | 50.267 | 51.982 |
| Triad Participation | 20.349 | 23.206 | 10.758 | 19.596 | 19.726 |
Table B7. KS Statistic for Directed Graph, Synthetic Attributes.

| Statistic | MAG (L1) | MAG (L5) | MAG (L9) | Agwan GMM (L1) | Agwan GMM (L5) | Agwan GMM (L9) |
|---|---|---|---|---|---|---|
| In-Vertex Strength | 4.700 | 6.142 | 5.193 | 1.897 | 0.836 | 0.418 |
| Out-Vertex Strength | 4.942 | 6.234 | 3.401 | 1.157 | 0.691 | 2.303 |
| Singular Values | 7951.240 | 17,978.152 | 12,036.091 | 23,196.640 | 14,303.690 | 3181.910 |
| Singular Vector | 7.485 | 2.066 | 6.737 | 0.645 | 0.534 | 0.485 |
| Clustering Coefficient (In-Edges) | 2.961 | 4.578 | 4.512 | 3.298 | 2.535 | 2.370 |
| Clustering Coefficient (Out-Edges) | 3.164 | 5.463 | 2.865 | 4.830 | 3.140 | 1.717 |
| Clustering Coefficient | 3.278 | 5.839 | 4.000 | 4.996 | 2.949 | 2.558 |
| Triad Participation (Cycles) | 3.912 | 6.867 | 5.704 | 6.292 | 3.602 | 2.639 |
| Triad Participation (Middlemen) | 4.248 | 6.319 | 5.858 | 5.438 | 4.787 | 3.401 |
| Triad Participation (Ins) | 3.912 | 7.170 | 6.153 | 6.507 | 4.248 | 2.996 |
| Triad Participation (Outs) | 1.476 | 6.768 | 4.745 | 6.292 | 4.007 | 2.590 |

| Statistic | Agwan BMM (L1) | Agwan BMM (L5) | Agwan BMM (L9) | Agwan BMM (R1) |
|---|---|---|---|---|
| In-Vertex Strength | 0.811 | 1.455 | 0.511 | 2.303 |
| Out-Vertex Strength | 2.148 | 0.693 | 2.303 | 2.303 |
| Singular Values | 15,219.730 | 12,480.590 | 1238.550 | 11,944.700 |
| Singular Vector | 0.775 | 0.451 | 0.451 | 0.619 |
| Clustering Coefficient (In-Edges) | 3.904 | 2.861 | 3.426 | 3.355 |
| Clustering Coefficient (Out-Edges) | 4.971 | 3.182 | 1.535 | 0.814 |
| Clustering Coefficient | 4.869 | 3.300 | 2.956 | 0.835 |
| Triad Participation (Cycles) | 6.380 | 5.561 | 3.912 | 4.248 |
| Triad Participation (Middlemen) | 6.620 | 5.768 | 3.401 | 2.503 |
| Triad Participation (Ins) | 6.579 | 5.075 | 2.526 | 4.248 |
| Triad Participation (Outs) | 6.492 | 5.394 | 2.303 | 4.248 |
Table B8. L2 Statistic for Directed Graph, Synthetic Attributes.

| Statistic | MAG (L1) | MAG (L5) | MAG (L9) | Agwan GMM (L1) | Agwan GMM (L5) | Agwan GMM (L9) |
|---|---|---|---|---|---|---|
| In-Vertex Strength | 5.023 | 15.718 | 7.066 | 2.109 | 2.189 | 0.649 |
| Out-Vertex Strength | 3.001 | 10.882 | 3.737 | 2.483 | 2.060 | 0.988 |
| Singular Values | 29,286.029 | 44,305.501 | 22,485.397 | 111,282.298 | 72,433.178 | 5554.271 |
| Singular Vector | 68.345 | 24.388 | 66.319 | 7.932 | 5.224 | 5.763 |
| Clustering Coefficient (In-Edges) | 2.507 | 7.692 | 4.819 | 3.731 | 1.710 | 1.594 |
| Clustering Coefficient (Out-Edges) | 2.771 | — | — | 5.642 | 2.785 | 1.501 |
| Clustering Coefficient | 2.450 | 13.922 | 7.653 | 8.296 | 2.617 | 1.827 |
| Triad Participation (Cycles) | 2.060 | 16.270 | 8.763 | 9.561 | 3.990 | 1.777 |
| Triad Participation (Middlemen) | 2.828 | 18.575 | 11.150 | 10.101 | 6.094 | 2.476 |
| Triad Participation (Ins) | 3.293 | 16.361 | 13.740 | 14.168 | 6.562 | 3.074 |
| Triad Participation (Outs) | 1.459 | 14.603 | 6.315 | 9.218 | 3.960 | 2.077 |

| Statistic | Agwan BMM (L1) | Agwan BMM (L5) | Agwan BMM (L9) | Agwan BMM (R1) |
|---|---|---|---|---|
| In-Vertex Strength | 1.524 | 1.278 | 0.356 | 1.507 |
| Out-Vertex Strength | 2.731 | 1.524 | 0.907 | 2.510 |
| Singular Values | 78,886.496 | 54,111.356 | 2224.587 | 60,843.087 |
| Singular Vector | 9.494 | 5.531 | 5.153 | 4.884 |
| Clustering Coefficient (In-Edges) | 5.466 | 1.822 | 2.178 | 2.231 |
| Clustering Coefficient (Out-Edges) | 6.487 | 3.401 | 1.425 | 1.399 |
| Clustering Coefficient | 9.937 | 3.086 | 2.053 | 1.033 |
| Triad Participation (Cycles) | 11.684 | 7.300 | 3.946 | 5.088 |
| Triad Participation (Middlemen) | 13.301 | 11.218 | 3.997 | 6.660 |
| Triad Participation (Ins) | 14.616 | 11.028 | 3.766 | 7.747 |
| Triad Participation (Outs) | 9.419 | 8.347 | 2.527 | 4.984 |

References

1. Barabási, A.L.; Albert, R. Emergence of Scaling in Random Networks. Science 1999, 286, 509–512.
2. Chakrabarti, D.; Faloutsos, C. Graph Mining: Laws, Tools, and Case Studies; Morgan & Claypool Publishers: San Rafael, CA, USA, 2012.
3. Erdős, P.; Rényi, A. On the Evolution of Random Graphs. Publ. Math. Inst. Hung. Acad. Sci. 1960, 5, 17–61.
4. Leskovec, J.; Kleinberg, J.; Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 2007.
5. Newman, M. Networks: An Introduction; OUP: New York, NY, USA, 2010.
6. Chakrabarti, D.; Zhan, Y.; Faloutsos, C. R-MAT: A Recursive Model for Graph Mining. In Proceedings of the 2004 SIAM International Conference on Data Mining, Lake Buena Vista, FL, USA, 2004; pp. 442–446.
7. Leskovec, J.; Chakrabarti, D.; Kleinberg, J.M.; Faloutsos, C.; Ghahramani, Z. Kronecker Graphs: An Approach to Modeling Networks. J. Mach. Learn. Res. 2010, 11, 985–1042.
8. Hoff, P.D.; Raftery, A.E.; Handcock, M.S. Latent Space Approaches to Social Network Analysis. J. Am. Stat. Assoc. 2002, 97, 1090–1098.
9. Kim, M.; Leskovec, J. Multiplicative Attribute Graph Model of Real-World Networks. Internet Math. 2012, 8, 113–160.
10. Wang, Y.; Wong, G. Stochastic Block Models for Directed Graphs. J. Am. Stat. Assoc. 1987, 82, 8–19.
11. Akoglu, L.; McGlohon, M.; Faloutsos, C. OddBall: Spotting Anomalies in Weighted Graphs; Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V., Eds.; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2010; Volume 6119, pp. 410–421.
12. Eichinger, F.; Huber, M.; Böhm, K. On the Usefulness of Weight-Based Constraints in Frequent Subgraph Mining; Bramer, M., Petridis, M., Hopgood, A., Eds.; Springer: Berlin, Germany, 2010; pp. 65–78.
13. Davis, M.; Liu, W.; Miller, P. Finding the most descriptive substructures in graphs with discrete and numeric labels. J. Intell. Inf. Syst. 2014, 42, 307–332.
14. Kim, M.; Leskovec, J. Modeling Social Networks with Node Attributes Using the Multiplicative Attribute Graph Model; Cozman, F.G., Pfeffer, A., Eds.; AUAI Press: Arlington, VA, USA, 2011; pp. 400–409.
15. Bishop, C.M. Pattern Recognition and Machine Learning, 3rd ed.; Springer: New York, NY, USA, 2011.
16. Jain, A.K.; Duin, R.P.W.; Mao, J. Statistical Pattern Recognition: A Review. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 4–37.
17. Figueiredo, M.A.T.; Jain, A.K. Unsupervised Learning of Finite Mixture Models. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 381–396.
18. Blei, D.M.; Jordan, M.I. Variational inference for Dirichlet process mixtures. Bayesian Anal. 2005, 1, 121–144.
19. Kurihara, K.; Welling, M.; Vlassis, N.A. Accelerated Variational Dirichlet Process Mixtures; Schölkopf, B., Platt, J.C., Hoffman, T., Eds.; MIT Press: Cambridge, MA, USA, 2006; pp. 761–768.
20. Davis, M.; Liu, W.; Miller, P. Agwan: A Generative Model for Labelled, Weighted Graphs. In New Frontiers in Mining Complex Patterns; Springer: Berlin, Germany, 2014; pp. 181–200.
21. Lindblom, J.; Samuelsson, J. Bounded support Gaussian mixture modeling of speech spectra. IEEE Trans. Speech Audio Process. 2003, 11, 88–99.
22. Bouguila, N.; Ziou, D.; Monga, E. Practical Bayesian estimation of a finite Beta mixture through Gibbs sampling and its applications. Stat. Comput. 2006, 16, 215–225.
23. Ji, Y.; Wu, C.; Liu, P.; Wang, J.; Coombes, K.R. Applications of beta-mixture models in bioinformatics. Bioinformatics 2005, 21, 2118–2122.
24. Ma, Z.; Leijon, A. Beta mixture models and the application to image classification. In Proceedings of the 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 2045–2048.
25. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions, 1st ed.; Wiley: New York, NY, USA, 1997.
26. Ma, Z.; Leijon, A. Bayesian Estimation of Beta Mixture Models with Variational Inference. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2160–2173.
27. Ma, Z.; Rana, P.K.; Taghia, J.; Flierl, M.; Leijon, A. Bayesian Estimation of Dirichlet Mixture Model with Variational Inference. Pattern Recognit. 2014, 47, 3143–3157.
28. Diaconis, P.; Ylvisaker, D. Conjugate Priors for Exponential Families. Ann. Stat. 1979, 7, 269–281.
29. Hunter, R.F.; Davis, M.; Tully, M.A.; Kee, F. The Physical Activity Loyalty Card Scheme: Development and Application of a Novel System for Incentivizing Behaviour Change; Kostkova, P., Szomszor, M., Fowler, D., Eds.; Springer: Berlin, Germany, 2011; Volume 91, pp. 170–177.
30. Fagiolo, G. Clustering in complex directed networks. Phys. Rev. E 2007.
31. Johnson, D.B. Efficient Algorithms for Shortest Paths in Sparse Networks. J. ACM 1977, 24, 1–13.
32. Attias, H. A Variational Bayesian Framework for Graphical Models; Solla, S.A., Leen, T.K., Müller, K.R., Eds.; MIT Press: Cambridge, MA, USA, 1999; pp. 209–215.
33. Jaakkola, T.S. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice; Opper, M., Saad, D., Eds.; MIT Press: Cambridge, MA, USA, 2001; pp. 129–159.
34. Jaakkola, T.S.; Jordan, M.I. Bayesian parameter estimation via variational methods. Stat. Comput. 2000, 10, 25–37.
35. Ueda, N.; Ghahramani, Z. Bayesian model search for mixture models based on optimizing variational bounds. Neural Netw. 2002, 15, 1223–1241.
36. Goldstein, M.; Morris, S.; Yen, G. Problems with fitting to the power-law distribution. Eur. Phys. J. B Condens. Matter Complex Syst. 2004, 41, 255–258.
