Figures
Abstract
Inferring the gene regulatory network (GRN) is crucial to understanding the working of the cell. Many computational methods attempt to infer the GRN from time series expression data, instead of through expensive and time-consuming experiments. However, existing methods make the convenient but unrealistic assumption of causal sufficiency, i.e. all the relevant factors in the causal network have been observed and there are no unobserved common cause. In principle, in the real world, it is impossible to be certain that all relevant factors or common causes have been observed, because some factors may not have been conceived of, and therefore are impossible to measure. In view of this, we have developed a novel algorithm named HCC-CLINDE to infer an GRN from time series data allowing the presence of hidden common cause(s). We assume there is a sparse causal graph (possibly with cycles) of interest, where the variables are continuous and each causal link has a delay (possibly more than one time step). A small but unknown number of variables are not observed. Each unobserved variable has only observed variables as children and parents, with at least two children, and the children are not linked to each other. Since it is difficult to obtain very long time series, our algorithm is also capable of utilizing multiple short time series, which is more realistic. To our knowledge, our algorithm is far less restrictive than previous works. We have performed extensive experiments using synthetic data on GRNs of size up to 100, with up to 10 hidden nodes. The results show that our algorithm can adequately recover the true causal GRN and is robust to slight deviation from Gaussian distribution in the error terms. We have also demonstrated the potential of our algorithm on small YEASTRACT subnetworks using limited real data.
Citation: Lo L-Y, Wong M-L, Lee K-H, Leung K-S (2015) Time Delayed Causal Gene Regulatory Network Inference with Hidden Common Causes. PLoS ONE 10(9): e0138596. https://doi.org/10.1371/journal.pone.0138596
Editor: Lars Kaderali, University Medicine Greifswald, GERMANY
Received: March 3, 2015; Accepted: September 1, 2015; Published: September 22, 2015
Copyright: © 2015 Lo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Data Availability: The data for YEASTRACT is obtained from http://www.yeastract.com/formorftogene.php.The yeast cell cycle data is obtained from http://genome-www.stanford.edu/cellcycle/. All other relevant data are within the paper and its Supporting Information files.
Funding: This work was supported by The Research Grants Council of the Hong Kong Special Administrative Region (http://www.ugc.edu.hk/eng/rgc/index.htm), Project References 414413: KSL and LU310111: MLW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Knowing the gene regulatory network (GRN) in the cell is crucial to understanding the working of the cell. In the cell, some proteins are transcription factors (TFs) which trigger or inhibit the transcription of other gene(s), which are then translated into proteins with a delay. These delays have been known to affect the network stability, or cause oscillations [1–4]. A simplistic view of the GRN is a directed network resulting from the complex causal interactions between genes, where each directed link is labeled with the delay. Since experimentally determining the regulatory targets of each TF is expensive and time-consuming, there have been many computational methods that attempt to utilize high-throughput microarray and RNA-seq gene expression data to infer the GRN. High-throughput technology such as microarray or RNA-seq allows the expression of thousands of genes to be measured at the same time, and allows time series expression data to be obtained when this is done for a number of time points.
Even though many GRN inference methods have been developed, to our knowledge, they all implicitly make the assumption of causal sufficiency, i.e. all the relevant factors in the causal network have been observed and there are no unobserved common cause. This assumption is convenient, but very unrealistic. For example, miRNAs were previously not thought to take important roles in gene regulation. In principle, in the real world, it is impossible to be certain that all relevant factors or common causes have been observed, because some factors may not have been conceived of, and therefore are impossible to measure.
Gene Network Inference
There have been many GRN inference algorithms and models, with different levels of details, see [5, 6] for surveys of GRN modelling and [7] for survey on GRN inference algorithms for microarray expression data.
Due to the nature of GRN, most models of GRN could be described as a graph, where the vertices are the genes under consideration, and the edges represent the regulatory relationships. Different levels of details could be achieved by labeling the edges with extra information. In the simplest case, an undirected graph could be used, in which case only an association network of the genes is captured. ARACNE [8] infers undirected network using mutual information, but also uses Data Processing Inequality to try to eliminate indirect interactions. C3NET [9] first identify gene pairs with significant mutual information, then link each gene to the neighbor (if any) with highest mutual information, and output an undirected and conservative network, with no other explicit means of eliminating indirect effect.
But without direction in the edges, there is no causal interpretation. On the other hand, directed edges could be used, as in [10], which uses genetic algorithm to optimize a score based on partial correlation, estimated regulatory direction and effect, but the output edges are not labeled with time delays. Since the process of gene transcription and mRNA translation both take time, and non-negligible translational time delays have been observed [11, 12]. Moreover, RNA polymerase, a main working protein in transcription, has been observed to pause during transcription, adding a cumulative of 204–307s over a 2.3kb region [13]. It is known that these delays affect the network stability, or cause oscillations [1–4]. Therefore, these delays should also be taken into account for a more accurate GRN model.
Some algorithms consider delay of only one time step, as in [14], which considers discretized expression data, and uses association rule mining to find frequent regulatory patterns, but without eliminating indirection association. Boolean network, e.g. in [15], is a classic model of GRN where the expression of each gene is discretized to only “on” or “off”, and the expression of each gene at the next time step is a boolean function of expression of its regulators at the current time step. Another popular class of GRN model is Ordinary Differential Equations (ODE), where the rate of change of expression of each gene is a (linear or nonlinear) function of the expression of the gene and its regulators. When discretized in time, it reduces to one time step model. Examples are [16], which uses Gaussian process for Bayesian inference of an ODE model; and DELDBN [17], which combines ODE model with local Bayesian analysis, and uses estimated Markov blanket as the regulators of each gene. There are also Dynamic Bayesian Network (DBN) based models, which avoids the limitation of plain Bayesian network that no cycles are allowed. An example is [18], which utilizes Bayesian structural expectation maximization to infer a one time step DBN model.
There are relatively few algorithms that infer multiple time delays. [19] first estimates the possible delays from pairwise mutual information from discretized expression data, then infer multiple time step DBN by minimizing MDL score using genetic algorithm. Banjo [20] also optimizes a score metric on DBN using discretized expression data by MCMC based method, and the program later allows multiple delay. TD-ARACNE [21] is an extension of ARACNE with time delays. But these algorithms do not label the edges of GRN with regulatory effect. In contrast, in DD-lasso [22], the expression of a gene is a linear combination of expression of its regulators at (possibly different) previous time steps. It first estimates the delays between each gene pairs by maximum likelihood, then uses lasso [23] to remove indirect effects and estimate the coefficients, therefore the edges are labeled with the delays as well as the regulatory effect. CLINDE [24] uses a similar model, but uses conditional independence of the shifted time series to estimate the delays and eliminate indirect effects.
Some other algorithms use perturbation data, or use a combination of perturbation data and time series expression. [25] is a parallel implementation of the Network Identification by multiple Regression (NIR) algorithm utilizing perturbation data. [26] needs promoter sequence and TF binding site information in addition to (non-time series) expression data. [27] is an Inductive Causation (IC) [28, 29] based method, which uses steady state data, with partial prior knowledge of ordering of regulatory relationship, and uses entropy to test conditional independence, giving an acyclic network where some edges may remain undirected. [30] uses convex programming on an ODE model using perturbation data. TSNI [31] solves a discretized linear ODE model using time series expression data after each gene is perturbed. [32] uses a Dynamic Nested Effects Model using perturbation data, where the delays are assumed to have exponential distribution. [33] uses knockdown data for Boolean network with exponentially distributed time delays.
Causality Inference
Outside of GRN inference, there have been quite a lot of works in the last two decades on inferring a causal network from observational data. Under the framework, the unknown causal relationships are represented by a directed graph, and the observed correlations and partial correlations are related to d-separations among the variables (through statistical tests), and therefore constrain the possible structures of the graph. In the basic framework, correlations are used to give possible causal links (without direction), and partial correlations are used to remove indirect links, finally the assumption of acyclicity gives rise to orientation rules to give directions to the links. Therefore, sometimes some links would remain undirected, as both directions are possible. These researches have been well summarized by [28] and [29]. Most works focused on causal structures represented as directed acyclic graph (DAG), because time delays are not considered, and the data used are not time series data. [34] attempted to extend the method for acyclic graph to cyclic graph, with more complex orientation rules. More recently, LiNGAM [35] formulated the causality inference as Independent Component Analysis (ICA) for the linear non-Gaussian acyclic structural equation model, and [36] extended it to the cyclic case. [37] extended LiNGAM to add time delays by first fitting an autoregressive model for the time delayed effects, and then fit the residuals with LiNGAM. [38] revisited the concept of Granger Causality. [39] generalized Bayesian network structure learning to cyclic structure.
Hidden Common Cause
Hidden (latent) variables have been an important topic in causality inference. The problem is that when hidden common causes are ignored, the causality inferred could be misleading. This is illustrated in Fig 1, where some nodes are wrongly thought to be causally linked.
The number on the link is the delay, and + or − is the sign of the effect.
[40] is an early work that formulates the problem as determining the constraints on the variance-covariance matrix of observed data, then searching for a causal structure that would explain the constraints. Some works assume the presence of hidden common cause of observed variables, but focus only on the relationship of observed variables, with indication that some may have hidden common cause. For example, [41] is a Granger-causality based method that learns a mixed graph from time series data, where directed edges represent direct causal relationship, and dashed edges represent relationship due to hidden common cause. [42] is an extension of the FCI algorithm [29] and also outputs a mixed graph, but does not need time series data. [43] is another extension of the FCI algorithm, and learns an acyclic network with no time delays, and with no consideration that the latent variables may have observed variables as parents. In [44], a stochastic differential equation model (discretized in time) is used, where hidden variables are assumed, but only to more accurately estimate the relationship between observed variables, by a convex optimization based method. In [45], the d-separation and d-connection information, which are in practice provided by conditional tests, are encoded as a Satisfiability (SAT) problem, which is then incrementally solved to attempt to recover the dependency structure between observed variables, with indication that some have latent variables, but some edges may be marked as “unknown” if the given information cannot determine whether it is present or absent. [46] uses nested effects models using perturbation data with no time delays, and hidden common effect of two observed variables may be predicted, and some edges indicate possible presence of hidden nodes.
Some works assume that hidden variables may have other hidden variables as parents, but cannot have observed variables as parents. [47] and [48] are works in this direction, where observed variables are linear function of its parents (either hidden or observed), and hidden variables can be nonlinear function of its parents (only hidden variables). In [49], a linear Bayesian network is learnt, but it is assumed that there are no edges among observed variables, and that the hidden variables are linearly independent.
On the other hand, some works are more general and allow the hidden variables to have observed variables as parents. [50] uses a dynamic Bayesian network (DBN) model, and is based on the observation that in DBN, ignoring hidden variable usually results in violation of Markov property. Based on this, the algorithm tries to find non-markovian correlations (those across more than one time step) and introduce hidden variable to explain them. However, this work does not evaluate how close the resulting dependency structure is to the assumed causal structure, but only focuses on the likelihood on the testing set. [51] learns a discrete Bayesian network with hidden variable without time delays. It assumes that a hidden variable has observed variables as parents and children. It is motivated with the observation that when the hidden variable is not taken into account, the inferred dependency of the observed variables (the parents and children of the hidden variable) will usually be overly complicated, with many connections. Based on this, they find structural signature (semi-clique) to guess the position of the hidden variable(s) and its local structure, then make adjustment by explicitly linking the variables of the semi-clique with the introduced hidden variable. This local structure may then be fine-tuned by Structural-EM (SEM). This work also focuses on the likelihood in the evaluation, but not the inferred structure. This also places some restrictions on the subnetwork surrounding the hidden variable(s), e.g. a hidden variable must have parent(s), which are observed variable(s), also the total number of parents and children of a hidden variable must be at least four, as the smallest semi-clique has size four. [52] complements [51] and focuses on learning the dimensionality (the number of states) of hidden variables.
Objective
First of all, it is important to emphasize that even inferring the causal relationships of observed variables is highly non-trivial, so inferring causal relationships of hidden variable(s) is obviously much more difficult and it is not possible to recover all possible cases of hidden variables. In some cases, the hidden variable is not very interesting, and in some cases, it would be too difficult to recover. For instance, if a hidden variable has no children, or one child with zero or one parent, it is not very interesting and is difficult to detect and estimate. The case that is feasible to handle is a hidden variable with two or more observed children, with or without parents, so we focus on this case.
In this paper, we assume that there is a sparse causal graph (possibly with cycles) of interest, where the variables are continuous and each causal link has a delay (possibly more than one time step). A small but unknown number of variables are not observed. Each unobserved (hidden) variable has only observed variables as children and parents, with at least two children and possibly no parents, and the children of unobserved variable(s) are not linked to each other. Although it is conceivable that two children of a hidden variable may be linked, so that a child has both the other child and the hidden variable as parents, it is difficult to differentiate whether the high correlation between two children are solely due to the hidden common cause, or also due to the presence of direct link. Therefore, we make the simplifying assumption that the children of hidden variable(s) are not linked to each other. Our objective is to infer the causal graph with the delays, given the time series of the observed variables. Since it is difficult to obtain very long time series, it is more realistic that the algorithm be also capable of utilizing multiple short time series, which we call segments in this paper. The segments are not necessarily of the same length (e.g. obtained from replicate experiments). To our knowledge, previous works either make much more restrictive assumptions or give different types of output (e.g. no delays in the links).
Methods
The overall flow of the proposed method is given in Fig 2. The steps are 1) infer an initial GRN of the observed genes, 2) determine the genes with hidden common cause(s) by the variance of the error terms, 3) estimate the hidden common cause(s), 4) infer the parents and children of the hidden common causes. The program code is written in C, any parameters described in this paper has a default value, and can be changed as needed. Also, sample running time is provided to give an idea of the running time needed. The program can be obtained at https://github.com/peter19852001/hcc_clinde. Below we first describe the data and model assumed in this paper, then describe each step in more details, where we first describe the case with one segment, and later we describe the case of multiple segments in a separate subsection.
The steps are 1) infer an initial GRN of the observed genes, 2) determine the genes with hidden common cause(s) by the variance of the error terms, 3) estimate the hidden common cause(s), 4) infer the parents and children of the hidden common causes.
Data and Model
The GRN model assumed here is: where
- xi(t) is the expression value of gene i at time t, i = 1,…,n + nh, t = 1,…,m, and there are n observed genes, nh hidden variable(s) and m equidistant time points.
- nh is unknown, but .
- aij is the regulatory effect of gene i on gene j, where the regulatory effect is repressive if aij is negative, activatory if positive, and absent if zero.
- aij = 0 for , i.e. there are no causal links between hidden variables.
- for , ∣{j : aij ≠ 0}∣ ≥ 2, i.e. each hidden variable has at least two observed genes as children.
- if a gene has a hidden parent, it has no other parents.
- τij is the positive time delay of the edge i → j (if aij ≠ 0).
- ϵj(t) is the error term for gene j at time t. We assume that E(ϵj(t)) = 0 and Var(ϵj(t)) = σ2, i.e. the error terms are zero-mean and have fixed variance. They are also assumed to be mutually independent, but otherwise we do not make stringent assumptions on the distribution of the error terms.
Note that this model does not preclude self-regulation or cycles in the GRN, though any cycles must have positive delays. The given data is {xi(t)}, for i = 1,…,n. If the raw input data does not have equidistant time points, interpolation (e.g. spline interpolation) could be performed as preprocessing before using this algorithm.
Initial GRN
The first step is to obtain an initial GRN. There are not many GRN inference methods that handles multiple time delays, CLINDE [24] and DD-lasso [22] are two of them, and both handles multiple short time series. We choose CLINDE to infer the initial GRN, based on the comparison in [24], which shows that CLINDE outperforms DD-lasso for smaller number of time points relative to the number of genes. Also because CLINDE does not restrict the multiple time series to be of the same length, unlike DD-lasso.
CLINDE is based on the PC algorithm [29], and consists of two stages. Stage 1 considers all (directed) pairs of genes x and y, and considers all possible delays d up to a maximum allowed delay, to determine if x → y is significant with the delay d based either on a correlation test, or mutual information test. The test is considered significant if the score of the test is larger than a score threshold. In the correlation test and partial correlation test below, the score is −log10(p-value), and in the mutual information test and conditional mutual information test below, the score is the (conditional) mutual information. So a higher score threshold means a more stringent test. And for correlation test, the regulatory effect (positive or negative) is also estimated if the edge is significant. After stage 1, there may be multiple edges from x to y, but with different delays. Also note that there may be cycles, but any cycle must have positive delays. Stage 2 attempts to prune the edges by partial correlation test or conditional mutual information test. Iteratively, the remaining edges are considered for pruning by conditioning first on h = 1 neighbor, then h = 2 neighbors, and so on up to h = N0, for a given parameter N0. In each iteration, each remaining edge is tested by conditioning on h neighbors, shifted properly using the delays estimated in stage 1. If the conditional test is not significant, the edge is pruned. After stage 2, the remaining edges are output as the GRN. Note that the pruning removes only indirect effects, and with sufficient data (so that the conditional tests truely reflects conditional independence relationships) any genuine cycles should remain, because each edge in the cycle is a direct effect. In this paper, we use (partial) correlation tests, which is the default for CLINDE. And after stage 2, the links with zero estimated delays are discarded to get the initial GRN for the following steps, as we assume the delays are positive.
Since CLINDE can infer cyclic GRN, so by using CLINDE for the initial GRN, HCC-CLINDE can handle cyclic GRN without further special handling.
Identification and Estimation of Hidden Common Cause
After the initial GRN of the observed gene is obtained, for each gene, we can regress its expression on the shifted expression of its parents and obtain the corresponding error terms. Therefore we can calculate the empirical variance of the error terms. The idea is that for genes with hidden common causes, in the initial GRN, its parents would be determined incorrectly (likely the parent(s) or other children of the hidden variable) because the hidden variable is not observed, and consequently the variance of the error term is likely to differ from the expected fixed variance. On the other hand, if a gene has no hidden variable as parent, its parents would hopefully be determined correctly, and therefore the error term would have the expected variance. The variance of the error terms therefore provides a clue to whether a gene has hidden variable as parent.
To illustrate, consider the case depicted in Fig 3, where X and Y are independent and H is hidden: Suppose that in the initial GRN, the parents of Z are X and Y, so that Z = c(aX + bY + eh) + ez, and the variance of the error terms would be . Similarly, if the parent of Z was incorrectly determined to be W, whereas we have , and the variance of the error terms would be . In both cases, the variance of the fitted error terms would be larger than expected. On the other hand, in the initial GRN, if too many false parents were predicted for a gene, its error terms may have smaller variance than expected, but this is not common in CLINDE. In this paper, we use a simple thresholding to decide if a gene may have hidden parents: , where Var(e) is the observed variance of the error terms of a gene, ρ ≥ 0 (with a default value of 0.1) is the tolerance, and σ2 is the expected variance. Those genes determined to have hidden parents are called candidates.
If the number of observed genes n is small, we assume that the expected variance σ2 is known and given. In contrast, if n is larger, we could instead estimate σ2 from the observed variances of the error terms of the genes. For each observed gene, we can calculate the variance of the error terms based on the initial GRN. Based on the assumption that there are only a small number of hidden variables relative to the number of observed genes, in this paper we simply use the median of these variances as the estimate of the expected variance.
If there are no candidates, we output the initial GRN as the final GRN. Otherwise, we cluster them to estimate the hidden common cause for each cluster, based on the fact that genes with common parent are correlated. From our preliminary tests, we found that even a simple greedy clustering algorithm works adequately:
- Let the k candidates be {g1,g2,…,gk}
- Set nc ← 1, c1 ← g1, τ1 ← 0, C1 ← {g1}
- Set ρ0 ← (1+ρ)e2, where e2 is the expected variance of the error terms, and ρ is the tolerance
- For i = 2,…,k
- Let di = argmax1 ≤ j ≤ nc d(cj,gi), and set τi be the associated time shift of gi relative to cj
- If , update Cdi ← Cdi∪{gi}
- Otherwise, set nc ← nc + 1, then set Cnc ← {gi}, τi ← 0
- Output the nc clusters {Cj : 1 ≤ j ≤ nc}, and the time shifts {τi : 1 ≤ i ≤ k}
The basic idea is that for each series, find out which cluster is the most similar to it, and if the similarity is high enough, it is included in that cluster; otherwise it becomes a new cluster. The similarity threshold is obtained from simple calculations as follows. Consider Fig 3 for Z and W, where Z = cH + ez and W = dH + ew, so we have: Since our threshold of the variance for ez and ew is ρ0 = (1 + ρ)e2, for two genes z and w having a hidden common cause, we have the corresponding lower threshold of correlation as
After the clustering, if there are no clusters with size larger than one, there is no need to estimate the hidden common causes, and we simply output the initial GRN as the final GRN. Otherwise, we would estimate a hidden common cause for each cluster Ci where . Suppose Ci = {gi,1,gi,2…,gi,∣Ci∣}, we first center the expression of these genes to get . Since each gene gi,j has the associated time shift τi,j relative to the cluster center, we wish to estimate the coefficients and the hidden common cause h(t) where . Note that the scale of and h(t) is undetermined as for any β ≠ 0. Also, since the time series may not be aligned (see Fig 4), so we first estimate the overlapping part, then estimate the prefix and suffix.
Suppose that relative to the cluster center, the overlapping parts are ts ≤ t ≤ te, let Yi = [ykt] for 1 ≤ k ≤ ∣Ci∣ and ts ≤ t ≤ te and , and let for ts ≤ t ≤ te, we want , as illustrated in Fig 4. This is conveniently solved by Singular Value Decomposition of Yi as low rank approximation to get where σ is the largest singular value and the Frobenius norm of E is minimized. We then arbitrarily take and . Having estimated the coefficients , we still need to estimate the non-overlapping prefix () and suffix () of hi(t), if any. For or , we estimate hi(t) by minimizing where the sum is over k for which 1 ≤ t−τi,k ≤ m. This is readily solved to be .
Lastly, we take for 1 ≤ t ≤ m−1 and as the estimate of the hidden common cause of cluster Ci, i.e. take the suffix of hi(t) and shift by 1 so that precedes all the genes in Ci in time.
Inferring Causal Relationships of Estimated Hidden Common Cause
After estimating the hidden common cause(s), we attempt to re-estimate the subnetwork of the introduced hidden common cause(s), the candidates, and the parents of the candidates. For this, we apply CLINDE, with the restriction that in stage 1 of CLINDE, we do not consider the links in-between the candidates and links into the parents of the candidates, and consequently the resulting subnetwork will only consists of links from parents of candidates to hidden common causes and candidates, and from hidden common causes to candidates. Having obtained the subnetwork, we remove the links into the candidates in the initial GRN, and then union the resulting GRN with the subnetwork to get the final GRN. The rationale is that we expect the candidates to have hidden common causes as the real parents, where their apparent parents in the initial GRN may really be parents of the hidden common cause(s). Also, as the hidden common cause is estimated from the observed children, it is difficult to differentiate whether the high correlation between two children of a hidden variable is solely due to the hidden common cause, or that there is also a direct link between the two. In this paper, we therefore make the simplifying assumption that the children of a hidden variable are not linked to each other.
Handling Multiple Segments of Time Series Data
The above describes the steps of HCC-CLINDE when one segment of time series data is provided, we now consider the case where multiple segments of time series data are provided. We do not assume that the segments have the same lengths, but we assume that the time steps of different segments are the same. The main idea of handling multiple segments is that when the expression of genes have to be shifted by a delay (e.g. for calculating correlation), all the segments are shifted, and the overlapping parts are concatenated for the calculation. This is illustrated in Fig 5.
For inferring the initial GRN, we use CLINDE, which can handle multiple segments, in the same manner illustrated in Fig 5 for calculating correlation and partial correlation.
For identifying the candidate genes with hidden common cause, we need to first regress the expression of a gene on its parents (from the initial GRN), which is done by shifting multiple segments by the estimated delays using the overlapping parts for regression. After that, the clustering poses no difficulty as it only needs the correlation, which can be calculated in the same manner as in the initial GRN.
For estimating the hidden common cause(s) using SVD, we first shift and take the overlapping parts of the segments, and use SVD to estimate the center part of the hidden common cause for each segment and estimate the coefficients, then we go through each segment and estimate the prefix and suffix. This is illustrated in Fig 4.
Lastly, after the estimation of hidden common cause(s), we infer the causal links for the hidden common cause(s) by CLINDE on subnetwork, which poses no difficulty as CLINDE can handle multiple segments.
Experiment Results
This section evaluates the effectiveness of our proposed algorithm HCC-CLINDE. We mainly rely on synthetic data, where we know the underlying gene network and the hidden variables, and the lack of sufficient expression data is not a concern. Since to our knowledge, there is no previous work that is similar to ours (infer hidden common cause with time delays), we can only compare our algorithm on incomplete data with CLINDE on incomplete and complete data, to show the improvement over ignoring hidden common cause.
We have tested on three types of synthetic data: 1) small GRN with one hidden node and the variance of error terms is known; 2) small GRN without hidden node and the variance of error terms is known; 3) large GRN with more than one hidden node and the variance of error terms is unknown. For all three cases, we try the score thresholds (st) 2, 3 and 4 for CLINDE, and use the default value for other CLINDE parameters. And for each case, we generate two types of data: one long segment of time series where we take prefixes of different lengths; and multiple segments where we use different number of segments for different total number of time points.
Due to the lack of long time series expression real data, we are unable to test our algorithm on large GRN, but we could demonstrate our algorithm on small GRNs. A dataset with relatively large number of time points is [53], which measures the expression of over 6,000 genes of Saccharomyces cerevisiae using DNA microarrays, with three different methods of synchronization for studying yeast cell cycle. Together with previous data from [54] (also included in [53]), we have 4 time series with information shown in Table 1. For the GRN, we use YEASTRACT [55], which is a curated database of over 200,000 transcription regulatory associations in Saccharomyces cerevisiae (including 113 TFs). Since the GRN is too large for the available expression data, we extract a small number of subnetworks for testing instead.
In the following, we first describe the performance metrics, and then the generation of synthetic expression data once the GRN is given, and then describe the generation of the synthetic GRN for different cases, and the results of the three types of synthetic data. After that, we describe the preprocessing of the YEASTRACT subnetworks and the expression data, and then present the results of our algorithm on the real data.
Performance Metrics
It is more appropriate to focus on correctly predicting the presence of links rather than its absence, due to the sparse nature of GRNs. We assess the performance of the inference algorithm on three aspects, namely Links (which is considered correct if and only if both the gene pair and the direction are correct), Delays (which is considered correct if and only if both the link and the time delay τij are correct) and Effects (which is considered correct if and only if both the link and the sign of the effect aij are correct). For each aspect, we look at three metrics respectively, namely Recall , Precision and F-score , where TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives. But since F-score is an overall measurement of performance, we focus on it.
We assume that in the true GRN, the hidden nodes are labeled. But in the predicted GRN, the number and the indices of the predicted hidden nodes may not be the same as the true GRN. Therefore, we need to map the predicted hidden nodes to the true GRN before doing the above performance calculations. Also, for the links to/from the hidden common nodes, the delays and the effects cannot be completely determined because there is no observed time series data for the hidden nodes to constrain them. This is illustrated in Fig 6, where the sign of a hidden node can be flipped, and be compensated by a flipping the signs of all its links; and the delays of links out of a hidden node can be shifted, and be compensated by the same shift of links into the hidden node. Therefore, in mapping the predicted hidden nodes to true hidden nodes, we may need to shift the delays, and flip the signs of the links appropriately, for useful calculation of the performance.
The number on the link is the delay, and + or − is the sign of the effect. Consistently flipping the signs and shifting the delays results in equivalent predicted GRN, as the hidden node is not observed.
For each predicted hidden node, we try to align it to each of the true hidden nodes, and choose the one with the most matched links and delays (after shifting). And in case of ties, we arbitrarily choose true hidden node with the lowest index. When aligning a predicted hidden node, we consider only the links to/from observed genes, and consider all possible shifts in delay (for a shift of d, the delays in parent links will be increased by d, while that in children links will be decreased by d), and also whether to flip the signs of the links. How well it is aligned is measured by the sum of number of matched links and shifted delays. After the mapping of predicted hidden nodes (and shifting in delays and flipping of signs), the performance of the predicted GRN is calculated as described above.
Generation of Synthetic Expression Data
One Segment.
Given a network as (aij,τij) where aij is the effect of gene i to gene j, and τij is the associated delay if aij ≠ 0. Given the parameters σ2 which is the variance of the error terms, and α which controls the gaussianity of error terms, we then generate the expression data with m time points as follows:
- For , 1 ≤ j ≤ n, set xj(t) = ej(t), where each ej(t) is generated from N(0,1)
- For 1 ≤ t ≤ m, 1 ≤ j ≤ n, set , where ej(t) = sign(zjt)s∣zjt∣α, where each zjt is generated from N(0,1) and s is used for scaling such that Var(ej(t)) = σ2
- Output {xj(t)}1 ≤ t ≤ m,1 ≤ j ≤ n
Multiple Segments.
In order to generate K segments of time series data, we first randomly pick the lengths of each segment uniformly between 20 and 30 (inclusive), emulating the situation of multiple short time series, possibly with different lengths. Then we generate each segment of specified length as above.
Synthetic Small GRN with One Hidden Node
We first test our proposed algorithm on small GRN where there is only one hidden node, and the variance of the error terms is assumed known.
Network Generation.
In this small GRN case, each GRN has one hidden node, p observed parents and c observed children, as illustrated in Fig 7, but when applying the algorithm, we do not assume the presence of hidden node. For each link, the delay is uniformly chosen from {1,…,τ0}, where τ0 = 4; and the coefficient is aiu = ρiu ziu, where ρiu is uniformly chosen from {−1,1} and ziu is uniformly chosen from (0.5,1.5). Then we permute the indices of the observed genes.
There are p parents, c children and one hidden node.
We test a few values of the parameters, as shown in the column Small GRN of Table 2. For each setting of p, c, σ2 and α, we generate 20 replicates, for a total of 6,400 GRNs. For the one segment case, for each replicate, we generate expression data with 200 time points, and then take prefix to get m time points, and output only the expression of the observed genes, for a total of 25,600 time series. And for the multiple segments case, for each replicate, we generate 8 segments, for a total of 51,200 segments, and for each setting, we test using K segments at a time.
Results.
First we show that the performances of Links, Delays and Effects are usually very consistent, with occasional discrepancies. Fig 8 shows the profiles of F-scores of Links, Delays and Effects for different settings for small hidden case with one segment time series. The three F-scores are mostly consistent with each other, though occasionally there are greater discrepancies for some records. In order to illustrate that these cases are very small, we plot the histogram of the absolute difference between Links and Effects F-scores in Fig 9. The histogram for difference of F-scores between Delays and Effects and between Links and Effects are similar and are shown in Figures A to B in S1 File. This suggests that the Links, Delays and Effects are usually inferred correctly at the same time, rather than getting one correct but the other incorrect. Therefore, in the following we mainly show the results of Effects, for the lack of space, but the full results including Links and Delays are provided in additional files.
Next we consider the F-scores under different σ2. If we repeat the above method of showing the profiles of F-score for different σ2, we found that the F-score of individual record for different σ2 show great discrepancies for some records. However, Fig 10 shows the boxplot of Effects F-score for different σ2 for one segment case, which shows that the overall distribution of F-scores are quite similar. The boxplot for the multiple segment case is similar and is shown in Figure L in S1 File. Therefore, we mainly show the results for Effects F-score for σ2 = 2 in Table 3 for one segment case and in Table 4 for multiple segments case, and the full results of median performance for small hidden case are shown in Table A in S1 File for one segment case and in Table D in S1 File for multiple segment case.
The results of the small case with one hidden node have no clear trend for either score threshold, α or the number of time points m. We would expect that the performance is better with larger m, but there are some cases where this is not the case. One possible reason is that for this case, the algorithm need to correctly detect the presence of hidden common node, and infer the link(s) between the hidden node and the observed genes, so the performance has more variability. For example, if the algorithm incorrectly predicted that there are no hidden node, then the F-score will surely be 0, as in the true GRN, all links are to/from the hidden node. For α, there are some differences between α = 0.5,1 and 2, but the differences are not great. For α = 3, the algorithm seems to have worse performance when m is small. However, note that when m is large, the F-scores are quite satisfactory for one segment case, and in many settings are over 0.9 or even 1. For multiple segments case, for p = 2,3, the results are a bit worse, especially for α = 3, but can still reach 0.6 in F-score. The results show that the proposed algorithm can adequately detect and estimate the hidden node in a simple setting, and is robust towards slight deviation of gaussianity in the error terms.
Synthetic Small GRN without Hidden Node
In this case, we test the effectiveness of the algorithm when there are in fact no hidden node. And in applying the algorithm, we do not assume the number of hidden nodes is known, but the variance of the error terms is known. The parameters are same as above, which is shown in the column Small GRN of Table 2.
Network Generation.
For this purpose, we generate some “confusing” cases as follows. For each GRN (p, c, σ2, α and replicate) in the above cases of small GRN with one hidden node, we use the (incomplete) data of 200 time points, and use CLINDE (with score threshold of 2 using PCor) to infer an GRN, which is definitely wrong, as CLINDE does not handle hidden nodes, and there are no true links among observed genes. If this GRN is non-empty, it is used; otherwise, a small GRN of a node in the middle with p parents, and c−1 children is generated as above, but all genes are labeled as observed. Having obtained the 6,400 GRNs without hidden nodes, the same method as above is used to generate the 25,600 time series data for one segment case, and 51,200 segments for multiple segments case.
Results.
Similar to the small case with hidden node, the F-scores of Links, Delays and Effects are fairly consistent, so we omit the profiles and histograms here, but they are in Figures C to F in S1 File. Also, the situation is similar for the F-scores with different σ2, and we show the boxplot in Figure J in S1 File for one segment case, and in Figure M in S1 File for multiple segments case. Therefore, we show the results of Effects for σ2 = 2 in Table 5 for one segment case, and in Table 6 for multiple segments case, and the full results of median performance is shown in Table B in S1 File for one segment case and in Table E in S1 File for multiple segments case.
For this case, when m is larger, the F-score is usually higher, although there are some exceptions, especially for α = 3 and p = 0. Also, small score thresholds usually work well, especially when m is small. The F-score is often 0 for small m with high score threshold, which means the score threshold is too stringent. When m is larger, using a more stringent score threshold also gives good F-score, but is only occasionally better than small score threshold. The F-scores can reach over 0.9 or even 1 in many settings, and over 0.6 in all settings. Similar to the above case, the performance does not show great difference for different α, though small α is usually slightly better. The results show that the proposed algorithm works satisfactorily even when there are no hidden node in a simple setting.
Synthetic Large GRN with More than One Hidden Node
The above two cases are for small GRN, and where the variance of the error terms is known. We also test the more realistic case of larger GRN with more than one hidden node (but the number is unknown), and that the variance of the error terms is unknown, but estimated by the algorithm. For a network with n observed genes, we would generate hidden nodes.
Network Generation.
For n observed genes and nh hidden nodes, a maximum of M0 parents for observed genes, maximum of τ0 as delay, we generate a GRN with the structure shown in Fig 11 as follows:
- Choose the number of parents (all distinct) for the hidden nodes: first for each i ∈ {0,1,2,3}, assign nodes to have i parents; for each of the remaining nodes, randomly choose i from {0,1,2,3} as the number of parents.
- Similarly choose the number of children (all distinct and distinct from the parents) for the hidden nodes: first for each i ∈ {2,3,4,5}, assign nodes to have i children; for each of the remaining nodes, randomly choose i from {2,3,4,5} as the number of children.
- Let Np and Nc be the total number of parents and children of hidden nodes respectively.
- Let No = n−Np−Nc (n should be large enough relative to nh so that )
- For 1 ≤ i ≤ No, randomly choose pi from {0,…,M0}, and randomly link pi parents from {1,…,n} to node i.
- For , randomly choose pi from {0,…,M0}, and randomly link pi parents from {1,…,n−Nc} to node i.
- For the hidden nodes, link from distinct parents from {No + 1,…,No + Np}, respecting the number of parents chosen above.
- Similarly link to distinct children for the hidden nodes from {No + Np + 1,…,n}, respecting the number of children chosen above.
- For each link L, randomly set the coefficient as aL = ρL zL, where ρL is uniformly chosen from {−1,1} and zL is uniformly chosen from (0.5,1.5).
- For each link L, uniformly choose τL from {1,…,τ0}.
- Since the GRN may have cycles, scale the coefficients to make the GRN stable.
- Lastly, randomly permute the indices of the observed genes {1,…,n}.
Each hidden node has up to 3 distinct parents, and up to 5 distinct children.
The parameters that we have tested are listed in column Large GRN of Table 2. For each setting of n, σ2, α, we randomly generate 40 replicates as described above, for a total of 1,600 GRNs. Then for each GRN, for the one segment case, we generate expression data with m = 800 time points, and take prefixes to get different number of time points. So there are a total of 9,600 time series. And for the multiple segments case, we generate 32 segments for each replicate, for a total of 51,200 segments. We test more time points because now the number of genes is larger than the above two small cases.
Results.
Fig 12 shows the profiles of F-scores of Links, Delays and Effects for different settings for large case, which shows that the three are very consistent, so we omit the histograms of absolute differences here, but they are in Figures G to I in S1 File. In addition, Fig 13 shows the profiles of Effects F-scores for different σ2 and Figure K in S1 File shows the corresponding boxplot for one segment case, and Figure N in S1 File for multiple segments case. Both show that for large case, the performance is very consistent for different σ2. Therefore, in the following we show the results of Effects F-scores for σ2 = 2 in Table 7 for one segment case, and in Table 8 for multiple segments case, where we show the performances of complete: CLINDE on the complete data (i.e. with also the expression of the “hidden” nodes), hidden: the proposed algorithm on the incomplete data (i.e. only the expression of observed genes), and hiddenCL: CLINDE on the incomplete data. Note that the complete case is mainly for comparison purpose, as in real-life, the hidden variables cannot be observed. It is unsurprising if the complete case has better performance, as CLINDE has more data in this case, so it is the comparison between hidden and hiddenCL that is of main interest here. The full results of median performance are shown in Table L in S1 File for one segment case and Table F in S1 File for multiple segments case. In Tables 7 and 8, we also show the ratio of F-score of hidden to that of complete, in percentages.
Generally, the F-score is better for larger m, however, note that even with m = 800 or K = 32 on the complete data, CLINDE cannot infer the whole GRN, because the GRNs are not small. For each of complete, hidden and hiddenCL, for α = 0.5,1 and 2, the F-scores are quite similar, though the F-scores are usually worse for α = 3 for one segment case; for multiple segments, the results are similar for α = 0.5,1,2, and 3. This shows the robustness of all three methods towards slight deviation from gaussianity for the error terms, though for larger deviation, the difference in performance is more noticeable. As for the score threshold, when m is small, smaller score threshold is better, and for larger m, larger score threshold is better.
When comparing between complete, hidden and hiddenCL, we see that in almost all settings, complete is better than hidden, which in turn is better than hiddenCL. The exceptions are all for m ≤ 100 or K ≤ 4. Note that hidden, using incomplete data, is usually able to achieve 70% to 80% of the performance of complete, and can get up to about 90% with large m. We have also performed one-sided Wilcoxon signed-rank test to test whether the median Effects F-score of hidden is better than hiddenCL. The p-values for σ2 = 2 are shown in Table 9 for one segment case and in Table 10 for multiple segments case. And the p-values for other σ2 are shown in Table C in S1 File for one segment case and in Table G in S1 File for multiple segments case. We see that for most settings, the p-values are very small, but occasionally the p-values are a bit larger for m ≤ 200 or K ≤ 4.
The results show the effectiveness of the proposed algorithm in detecting and estimating hidden nodes in large GRN, without knowing the number of hidden nodes and the variance of error terms.
Heterogeneous Variance in Error Terms in Synthetic Large GRN
In previous sections, the error terms of all genes have the same constant variance in the synthetic data. This is an admittedly restrictive assumption. In this section, we test our algorithm on heterogeneous variance in the error terms for the synthetic large GRNs generated in the previous section, to test how robust HCC-CLINDE is towards violation of the assumption of constant variance in error terms. We simulate data as above, but instead of a constant, the variance of error terms for gene i is generated as , where zi ∼ N(2,δ2), and we test different values of δ2:0.05, 0.1, 0.2, 0.5, 0.7, 0.9, 1.
Results.
Figs 14 and 15 show the median Effects F-scores against different δ2 for different α and different number of time points for one segment case with score threshold st = 2 for n = 50 and n = 100 respectively; while Figs 16 and 17 show the results for multiple segment case with st = 2 for n = 50 and n = 100 respectively. The full table of median F-scores (with different st, and also Links and Delays performances including Recall, Precision and F-score) are given in Tables H and I in S1 File for one segment case and multiple segments case respectively.
complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.
complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.
complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.
complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.
First, note that the median Effects F-scores for complete (which is CLINDE on the complete data) and hiddenCL (which is CLINDE on the incomplete data) are mostly stable for different values of δ2, which is reasonable because CLINDE ignores hidden common cause, and makes no assumption on the variances of the error terms. We have also performed one-sided Wilcoxon signed-rank test to test whether the median Effects F-score of hidden is better than hiddenCL. The p-values are shown in Table 11 for st = 2 for both one segment and multiple segments cases, and the p-values for st = 3 and st = 4 are in Tables J and K in S1 File for one segment and multiple segments cases respectively. With sufficient time points (or segments), for small δ2, the F-scores of hidden (which is HCC-CLINDE on the incomplete data) is smaller than complete but larger than hiddenCL, which is qualitatively the same as the constant variance case in previous section. Up to δ2 = 0.2, the p-values are less than 0.001 for 400 or 800 time points, and 16 and 32 segments. As δ2 increases, the F-scores decrease, and usually at δ2 = 0.5, the performance of hidden would be close to hiddenCL and that hidden is no longer significantly better than hiddenCL, indicating that HCC-CLINDE cannot effectively recover the hidden common causes and has no advantage over ignoring hidden common causes. However, note that when δ2 = 0.2, the variances of most genes are between 1 and 3, and when δ2 = 0.5, most variances are between 0.1 and 4, which are moderately heterogeneous. Therefore, the results show that HCC-CLINDE can tolerate slight violation of the assumption of constant variances in the error terms of the genes.
Small YEASTRACT Subnetworks with Real Data
Preprocessing of Subnetworks.
We accessed YEASTRACT [55] (http://www.yeastract.com/formfindregulators.php) on 7th Feb, 2015 to get the regulating TFs of a list of 149 TFs. We chose “DNA binding and expression evidence” (which gives more confidence than having only binding evidence or expression evidence alone) and queried twice, once with “TF acting as activator” and once with “TF acting as inhibitor” to try to get the regulatory effects (positive or negative) of the regulatory relationships. The obtained 392 links involve only 129 TFs, and we used “ORF List ⇔ Gene List” utility of YEASTRACT (http://www.yeastract.com/formorftogene.php) to convert the gene names into ORF id’s, and all these 129 id’s appear in the yeast cell cycle [53] data.
However, the GRN is too large for the limited data, so we have chosen 22 subnetworks with sizes and constituent TFs shown in Table 12. For each subnetwork, a TF (which has children in the subnetwork) is chosen to be the hidden node. As the delays in the links are not known, and for a TF pair, some links may be both positive and negative in the YEASTRACT network, so we focus on the performance on Links.
Preprocessing of Expression Data.
We downloaded the yeast cell cycle [53] data from http://genome-www.stanford.edu/cellcycle/. The expression data contains 4 time series: alpha, cdc15, cdc28 and elu, with different lengths and time points, as shown in the second column of Table 1. Since the time steps of the 4 time series are not all the same, for each TF, we perform spline interpolation (using the spline() function in R) to the time points shown in the third column of Table 1. Also, some TFs in some series are entirely missing (this seems more often in cdc15), and we fill in with zero. For other missing values, we rely on the spline interpolation to fill in the value.
After this, for each subnetwork we use the converted ORF id of each TF to retrieve expression data of the TFs on the 4 time series, to obtain one set labeled complete which contains the expression data of all TFs of the subnetwork, and another labeled incomplete which contains the expression data of all but the chosen TF of the subnetwork. Therefore, for each subnetwork, we have 8 expression datasets.
Results.
The Links F-scores of CLINDE with complete data, our proposed algorithm (with normalization of data, with unknown σ2 and unknown number of hidden nodes) on incomplete data, and CLINDE on incomplete data on the subnetworks are shown in Table 13, where we set the maximum delay τ0 to be 4, and use the score threshold of 1 as the number of time points is not large. We run the algorithms (ours and CLINDE) separately on one time series and used the 4 time series together (all), because both CLINDE and our proposed algorithm can handle multiple segments of time series. We show which segment(s) have the best performance.
First, as the 4 time series segments are from different experimental conditions, so it is possible that different genes have better responses in different segments. Consequently, using all 4 segments (all) may not give the best results, even though the total number of time points is larger.
Second, our proposed algorithm on incomplete data has better F-scores than CLINDE on complete data in 14 out of 22 subnetworks, with 3 ties. There are two possible reasons. One is that after estimating the initial GRN, the subsequent steps make some assumptions on the structure of the GRN around the hidden common cause(s). This may help the GRN inference, especially in case of limited data. The other is that after estimating the hidden common cause(s), the re-estimation of the links around the hidden common cause(s) works only on a subnetwork. However, we expect that as the amount of (quality) data increases, the situation would be more like the synthetic large case, where CLINDE on complete data has better performance, but our algorithm is close to it.
Third, our proposed algorithm on incomplete data outperforms CLINDE on incomplete data in 21 out of 22 subnetworks. This is as expected, as CLINDE is unaware of the presence of hidden node(s), so there could be misleading links (as illustrated in Fig 1).
In short, due to limited real expression data, we cannot draw very strong conclusions, but the results indicate the potential of our proposed algorithm in recovering hidden common cause(s) using real data, when the error variance and the number of hidden node(s) is unknown.
Discussions
In this section, we discuss possible extensions to the basic algorithm introduced above to relax some assumptions made.
Variance of Error Terms
In the proposed algorithm, we assume that the variance of the error terms is a constant. This assumption is used in detection of genes with hidden common cause, and in calculation of correlation threshold in clustering.
This assumption can be relaxed, for example to assume that the error terms of some genes have variance and some have variance . After inferring an initial GRN and calculating the empirical variances of the error terms, instead of using the median as estimate of the true variance, we may cluster the empirical variances and use the centers of the two largest clusters as estimates of and . Having obtained the estimates (or assumed given if the number of observed genes is too small to allow good estimation), those observed genes with empirical variances too far away from the estimates are predicted to have hidden common cause.
The variance of the error terms is also used in calculating the threshold of absolute correlation in clustering the genes having the same hidden common cause. If we assume more than one possible variance, we may use the larger one to calculate a conservative threshold. Alternatively, for each observed gene, we may choose the estimate closer to the empirical variance as its true variance, and calculate the threshold accordingly. At this point, it is unclear which method gives better clustering, and further study is needed.
However, it is difficult to handle a very general and spread-out distribution of the variances of the error terms. If the distribution is unknown and we desire to estimate it from the empirical variances, a large number of observed genes is needed for meaningful estimation, but this increases the difficulty of inference of the initial GRN substantially because of the increased number of genes. Even if we assume the distribution is known or well estimated, given an empirical variance, we need to decide whether this is as expected, which is essentially a statistical test, which is more difficult for less concentrated distribution.
Discrete Data
The proposed algorithm handles continuous data, here we discuss the possibility of extending it to discrete time series data. For this purpose, a few parts would need to be adapted.
Firstly, the data model would be different. Instead of a linear combination of its parents’ values (with time delay) plus an additive error, the value of a node would depend on its parents through a conditional distribution. In addition, we assume that given each configuration of the parents, the conditional distribution has most probability (e.g. 1−pe) concentrated on a value, and the remaining pe is spread over other values, i.e. the value of the node is roughly a function of its parents, with a “noise level” of pe.
Secondly, the inference of initial GRN needs adaptation. This is simple because similar to the PC algorithm [29], CLINDE can use any type of independence and conditional independence test suitable for the data. For example, χ2 tests could be used instead of the correlation test.
Thirdly, we need to predict the nodes with hidden common cause based on the initial GRN. Given an initial GRN, we could estimate the maximum likelihood conditional distribution of each node from the data, and analogous to variance of the error terms, the empirical “noise level” could be calculated and compared with the expected level to predict whether that node has hidden common cause. If we assume a constant “noise level” for all the nodes, it could be given (for small number of nodes) or estimated from empirical “noise levels” (for large number of nodes).
Fourthly, for the clustering of nodes having the same hidden common cause, since two nodes having the same parent would be associated, so it is only necessary to determine an appropriate threshold for the association measure based on the “noise levels”.
Lastly, instead of using Singular Value Decomposition to estimate hidden common cause(s), a different method has to be used. For example, Structural-EM could be used as in [51], and the method in [52] could be used for the related problem of determining the number of states for the hidden common cause(s).
Relax Structural Assumptions
Nodes with More than One Hidden Common Cause.
In the basic algorithm, we assume that if a gene has a hidden parent, it has no other parents. Here we consider the possibility that a gene has two or more hidden parents (both hidden common causes). This should pose little difficulty for predicting the genes with hidden common cause(s), as they would have the wrong parent(s) in the initial GRN, and consequently the empirical variance of they error terms are likely different from expected. On the other hand, the clustering and estimation of hidden common causes would need some adaptation. For clustering, the determination of the correlation threshold is not as straightforward. One simple strategy is to use a fixed conservative threshold, and hope that genes without the same hidden common cause(s) would not be close enough to have high correlation. And for the estimation of the hidden common cause(s), we may still apply SVD, but use more singular vectors corresponding to the largest singular values. The main difficulty is that a way is needed to determine how many singular vectors to take, i.e. how many hidden common causes the set of genes have. One possible strategy is to successively take more singular vectors, until the drop in residual error is small. But this is a sort of model selection problem, which is not straightforward. Therefore, more study is needed to determine whether the strategy would work well.
Multiple Layers of Hidden Nodes.
The current model and algorithms assumes that any hidden common cause does not have other hidden common cause as parents, i.e. the hidden common causes are not connected directly to each other. This simplifying assumption may be restrictive in some cases, where multiple layers of hidden common causes may exist and may have meaningful interpretation. One possible strategy is to first infer possible hidden common cause(s) of observed genes, then treat them as observed (because our algorithm also estimates the time series of the predicted hidden common cause(s)), and repeat the process to see if further hidden common cause(s) could be inferred. However, at this stage, it is unclear whether this strategy would work well, because the estimated time series of the hidden common cause(s) may not be accurate enough. Moreover, it is unclear that whether it is even feasible to infer rich structures among hidden nodes without prior knowledge or assumptions, given only limited data of observed nodes. We leave these issues as future work.
Relative Error Model
The proposed algorithm assumes a constant variance for the additive error terms, and the variance could be known or estimated from empirical variances. Alternatively, we may assume a model where the variance of the error terms is a constant proportion of the variance of the gene. This could be easily handled by (centering and) normalizing the input expression data such that each gene has a variance of one (except those with constant expression). The proportion of variance of error terms to the variance of the gene would remain unchanged, but now it is also the variance of the error terms. This normalization does not affect the inference of the initial GRN, as CLINDE uses correlation and conditional correlation tests, which are unaffected by centering and scaling.
Delays with Distribution
In our current model, we assume the delay between two genes is deterministic (but unknown and is to be found). While this assumption makes the model and algorithm simpler, in reality, due to the stochastic operations of the cell, it is more realistic to model the delays as random variable, e.g. with exponential distribution as in [32] and [33]. We leave this as future work.
Not Using Time Series Data
The basic algorithm uses time series data, here we briefly consider the possibility of using non-time series data. To begin with, without time series data, it would be impossible to estimate the delays in the links, but it may still be possible to infer the directions of the links, just as in the PC algorithm [29]. However, without time series data, inferring the directions of the links is much more difficult, especially when we allow the presence of cycles in the causal network. Also, the directions of some links may still remain undetermined given the data, because both directions are consistent with the data. The clustering and estimation of hidden common cause, on the other hand, does not pose great difficulty when using non-time series data, assuming the initial GRN is well estimated.
Conclusion
To summarize, we have developed an algorithm to recover the causal GRN with delays in the regulatory links from time series expression data, where a small but unknown number of nodes are hidden, i.e. without expression data. We have tested our algorithm on 3 types of synthetic data: small GRN with one hidden node, small GRN with no hidden node, and large GRN with a small but unknown number of hidden nodes. Results on synthetic data show that our algorithm can effectively recover the causal GRN. We have also demonstrated our algorithm on small subnetworks of YEASTRACT with real expression data, and the results show the potential of our algorithm to recover hidden nodes using real data.
Supporting Information
S1 File. Containing various supplementary figures and tables, listed as follows.
Figure A, Histogram of Absolute Differences of F-scores of Delays and Effects for One Segment Small Hidden Case. Figure B, Histogram of Absolute Differences of F-scores of Links and Delays for One Segment Small Hidden Case. Figure C, Profiles of F-scores of Links, Delays and Effects for Different Settings for One Segment Small Non-Hidden Case. Figure D, Histogram of Absolute Differences of F-scores of Links and Effects for One Segment Small Non-Hidden Case. Figure E, Histogram of Absolute Differences of F-scores of Delays and Effects for One Segment Small Non-Hidden Case. Figure F, Histogram of Absolute Differences of F-scores of Links and Delays for One Segment Small Non-Hidden Case. Figure G, Histogram of Absolute Differences of F-scores of Links and Effects for One Segment Large Case. Figure H, Histogram of Absolute Differences of F-scores of Delays and Effects for One Segment Large Case. Figure I, Histogram of Absolute Differences of F-scores of Links and Delays for One Segment Large Case. Figure J, Boxplot of Effects F-score with Different σ2 for One Segment Small Non-Hidden Case. Figure K, Boxplot of Effects F-score with Different σ2 for One Segment Large Case. Figure L, Boxplot of Effects F-score with Different σ2 for Multiple Segments Small Hidden Case. Figure M, Boxplot of Effects F-score with Different σ2 for Multiple Segments Small Non-Hidden Case. Figure N, Boxplot of Effects F-score with Different σ2 for Multiple Segments Large Case. Table A, Full Results of Median Performance for One Segment Small GRN with One Hidden Node.e2 is σ2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table B, Full Results of Median Performance for One Segment Small GRN without Hidden Node. e2 is σ2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table C, P-values of one-sided Wilcoxon signed-rank test on whether the medians Effects F-scores of hidden is better than hiddenCL for the One Segment Large Case. Table D, Full Results of Median Performance for Multiple Segments Small GRN with One Hidden Node. e2 is σ2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table E, Full Results of Median Performance for Multiple Segments Small GRN without Hidden Node. e2 is σ2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table F, Full Results of Median Performance for Multiple Segments Large GRN with More than One Hidden Node with n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes nh, e2 is σ2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 40 replicates. Table G, P-values of one-sided Wilcoxon signed-rank test on whether the medians Effects F-scores of hidden is better than hiddenCL for the Multiple Segments Large Case. Table H, Full Results of Median Performance for One Segments Large GRN with Different δ2 with n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes nh, e2 is σ2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. d2 is δ2. Each median is taken over 40 replicates. Table I, Full Results of Median Performance for Multiple Segments Large GRN with More than One Hidden Node with n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes nh, e2 is σ2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. d2 is δ2. Each median is taken over 40 replicates. Table J, P-values of one-sided Wilcoxon signed-rank test on whether the medians Effects F-scores of hidden is better than hiddenCL for the Heterogeneous Variance One Segment Large Case. Table K, P-values of one-sided Wilcoxon signed-rank test on whether the medians Effects F-scores of hidden is better than hiddenCL for the Heterogeneous Variance Multiple Segments Large Case. Table L, Full Results of Median Performance for One Segment Large GRN with More than One Hidden Node with n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes nh, e2 is σ2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 40 replicates.
https://doi.org/10.1371/journal.pone.0138596.s001
(ZIP)
Author Contributions
Conceived and designed the experiments: LYL. Performed the experiments: LYL. Analyzed the data: LYL MLW KHL KSL. Contributed reagents/materials/analysis tools: LYL. Wrote the paper: LYL MLW KHL KSL. Implementation of the algorithm: LYL.
References
- 1. Chen L, Aihara K (2002) Stability of genetic regulatory networks with time delay. Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on 49: 602–608.
- 2. Lewis J (2003) Autoinhibition with transcriptional delay: a simple mechanism for the zebrafish somitogenesis oscillator. Curr Biol 13: 1398–1408. pmid:12932323
- 3. Mather W, Bennett MR, Hasty J, Tsimring LS (2009) Delay-induced degrade-and-fire oscillations in small genetic circuits. Phys Rev Lett 102: 068105.
- 4. Tigges M, Marquez-Lago TT, Stelling J, Fussenegger M (2009) A tunable synthetic mammalian oscillator. Nature 457: 309–312. pmid:19148099
- 5. Karlebach G, Shamir R (2008) Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology 9: 770–780. pmid:18797474
- 6. Schlitt T, Brazma A (2007) Current approaches to gene regulatory network modelling. BMC Bioinformatics 8: S9. pmid:17903290
- 7. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D (2007) How to infer gene networks from expression profiles. Molecular Systems Biology 3. pmid:17299415
- 8. Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, et al. (2006) Aracne: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7: S7. pmid:16723010
- 9. Altay G, Emmert-Streib F (2010) Inferring the conservative causal core of gene regulatory networks. BMC Systems Biology 4: 132. pmid:20920161
- 10.
Ram R, Chetty M, Dix T (2006) Causal modeling of gene regulatory network. In: Computational Intelligence and Bioinformatics and Computational Biology, 2006. CIBCB’06. 2006 IEEE Symposium on. pp. 1 -8.
- 11. Pavlov MY, Ehrenberg M (1996) Rate of translation of natural mrnas in an optimized in vitro system. Arch Biochem Biophys 328: 9–16.
- 12. Mier-y Terán-Romero L, Silber M, Hatzimanikatis V (2010) The Origins of Time-Delay in Template Biopolymerization Processes. PLoS Comput Biol 6: e1000726+.
- 13. Swinburne IA, Silver PA (2008) Intron delays and transcriptional timing during development. Developmental cell 14: 324–330. pmid:18331713
- 14.
Kuo HC, Tsai PC, Huang JP (2009) Finding time-delayed gene regulation patterns from microarray data. In: Hybrid Intelligent Systems, 2009. HIS’09. Ninth International Conference on. volume 1, pp. 117 -122.
- 15. Maucher M, Kracher B, Kühl M, Kestler HA (2011) Inferring Boolean network structure via correlation. Bioinformatics 27: 1529–1536. pmid:21471013
- 16. Äijö T, Lähdesmäki H (2009) Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics. Bioinformatics 25: 2937–2944.
- 17. Li Z, Li P, Krishnan A, Liu J (2011) Large-scale dynamic gene regulatory network inference combining differential equation models with local dynamic Bayesian network analysis. Bioinformatics (Oxford, England) 27: 2686–2691.
- 18. Tienda-Luna IM, Huang Y, Yin Y, Padillo DPR, Perez MCC (2007) Uncovering gene regulatory networks from time-series microarray data with variational bayesian structural expectation maximization. EURASIP J Bioinformatics and Systems Biology 2007.
- 19.
Xing Z, Wu D (2006) Modeling multiple time units delayed gene regulatory network using dynamic bayesian network. In: Proceedings of the Sixth IEEE International Conference on Data Mining—Workshops. Washington, DC, USA: IEEE Computer Society, ICDMW’06, pp. 190–195.
- 20. Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED (2004) Advances to bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20: 3594–3603. pmid:15284094
- 21. Zoppoli P, Morganella S, Ceccarelli M (2010) Timedelay-aracne: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics 11: 154. pmid:20338053
- 22. ElBakry O, Ahmad M, Swamy M (2013) Inference of gene regulatory networks with variable time delay from time-series microarray data. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 10: 671–687.
- 23. Tibshirani R (1994) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58: 267–288.
- 24. Lo LY, Leung KS, Lee KH (2015) Inferring time-delayed causal gene network using time-series expression data. Computational Biology and Bioinformatics, IEEE/ACM Transactions on PP: 1–1.
- 25. Gregoretti F, Belcastro V, di Bernardo D, Oliva G (2010) A parallel implementation of the network identification by multiple regression (nir) algorithm to reverse-engineer regulatory gene networks. PLoS ONE 5: e10179. pmid:20422008
- 26. Xing B, Van Der Laan MJ (2005) A causal inference approach for constructing transcriptional regulatory networks. Bioinformatics 21: 4007–4013. pmid:16131521
- 27. Zhang X, Baral C, Kim S (2005) An Algorithm to Learn Causal Relations Between Genes from Steady State Data: Simulation and Its Application to Melanoma Dataset. Artificial Intelligence in Medicine: 524–534.
- 28.
Pearl J (2000) Causality: Models, Reasoning, and Inference. Cambridge University Press.
- 29.
Spirtes P, Glymour C, Scheines R (2001) Causation, Prediction, and Search. Cambridge, MA, USA: The MIT Press, second edition.
- 30. Julius A, Zavlanos M, Boyd S, Pappas GJ (2009) Genetic network identification using convex programming. IET systems biology 3: 155–166. pmid:19449976
- 31. Bansal M, Gatta GD, di Bernardo D (2006) Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22: 815–822. pmid:16418235
- 32. Anchang B, Sadeh MJ, Jacob J, Tresch A, Vlad MO, et al. (2009) Modeling the temporal interplay of molecular signaling and gene expression by using dynamic nested effects models. Proceedings of the National Academy of Sciences 106: 6447–6452.
- 33. Dümcke S, Bräuer J, Anchang B, Spang R, Beerenwinkel N, et al. (2013) Exact likelihood computation in boolean networks with probabilistic time delays, and its application in signal network reconstruction. Bioinformatics: btt696.
- 34.
Richardson T (1996) Discovering Cyclic Causal Structure. Technical report, Carnegie Mellon University.
- 35. Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A (2006) A linear non-gaussian acyclic model for causal discovery. J Mach Learn Res 7: 2003–2030.
- 36.
Lacerda G, Spirtes P, Ramsey J, Hoyer PO (2008) Discovering cyclic causal models by independent components analysis. In: UAI. pp. 366–374.
- 37.
Hyvärinen A, Shimizu S, Hoyer PO (2008) Causal modelling combining instantaneous and lagged effects: an identifiable model based on non-gaussianity. In: Proceedings of the 25th international conference on Machine learning. New York, NY, USA: ACM, ICML’08, pp. 424–431. http://doi.acm.org/10.1145/1390156.1390210.
- 38. Mukhopadhyay ND, Chatterjee S (2007) Causality and pathway search in microarray time series experiment. Bioinformatics 23: 442–449. pmid:17158516
- 39. Itani S, Ohannessian MI, Sachs K, Nolan GP, Dahleh MA (2010) Structure learning in causal cyclic networks. Journal of Machine Learning Research—Proceedings Track 6: 165–176.
- 40. Glymour C, Spirtes P (1988) Latent variables, causal models and overidentifying constraints. journal of Econometrics 39: 175–198.
- 41.
Eichler M (2005) Graphical modelling of multivariate time series with latent variables. Technical report.
- 42.
Pellet JP, Elisseeff A (2009) Finding latent causes in causal networks: an efficient approach based on markov blankets. In: Advances in Neural Information Processing Systems. pp. 1249–1256.
- 43. Colombo D, Maathuis MH, Kalisch M, Richardson TS, et al. (2012) Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics 40: 294–321.
- 44.
Jalali A, Sanghavi S (2012) Learning the dependence graph of time series with latent factors. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). pp. 473–480.
- 45.
Hyttinen A, Hoyer PO, Eberhardt F, Järvisalo M (2013) Discovering cyclic causal models with latent variables: A general sat-based procedure. In: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, Bellevue, WA, USA, August 11–15, 2013. URL https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu = 1&smnu = 2&article_id = 2391&proceeding_id = 29.
- 46. Sadeh MJ, Moffa G, Spang R (2013) Considering unknown unknowns: Reconstruction of nonconfoundable causal relations in biological networks. Journal of computational biology 20: 920–932. pmid:24195708
- 47.
Silva R (2005) Automatic discovery of latent variable models. Ph.D. thesis, University of West Florida.
- 48. Silva R, Scheines R, Glymour C, Spirtes P (2006) Learning the structure of linear latent variable models. The Journal of Machine Learning Research 7: 191–246.
- 49.
Anandkumar A, Hsu D, Javanmard A, Kakade S (2013) Learning linear bayesian networks with latent variables. In: Proceedings of The 30th International Conference on Machine Learning. pp. 249–257.
- 50.
Boyen X, Friedman N, Koller D (1999) Discovering the hidden structure of complex dynamic systems. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp. 91–100.
- 51.
Elidan G, Lotner N, Friedman N, Koller D, et al. (2000) Discovering hidden variables: A structure-based approach. In: NIPS. volume 13, pp. 479–485.
- 52.
Elidan G, Friedman N (2001) Learning the dimensionality of hidden variables. In: Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp. 144–151.
- 53. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9: 3273–3297. pmid:9843569
- 54. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular cell 2: 65–73. pmid:9702192
- 55. Teixeira MC, Monteiro PT, Guerreiro JF, Gonalves JP, Mira NP, et al. (2014) The yeastract database: an upgraded information system for the analysis of gene and genomic transcription regulation in saccharomyces cerevisiae. Nucleic Acids Research 42: D161–D166. pmid:24170807