REVIEWS
Advantages and limitations of current
network inference methods
Riet De Smet and Kathleen Marchal
Abstract | Network inference, which is the reconstruction of biological networks from
high-throughput data, can provide valuable information about the regulation of gene
expression in cells. However, it is an underdetermined problem, as the number of interactions
that can be inferred exceeds the number of independent measurements. Different
state-of-the-art tools for network inference use specific assumptions and simplifications to
deal with underdetermination, and these influence the inferences. The outcome of network
inference therefore varies between tools and can be highly complementary. Here we
categorize the available tools according to the strategies that they use to deal with the
problem of underdetermination. Such categorization allows an insight into why a certain
tool is more appropriate for the specific research question or data set at hand.
Module inference
Identifying groups of
co-expressed genes from
gene expression data using
clustering or biclustering
algorithms.
Guilt-by-association
principle
The assumption that genes
with similar functions exhibit
similar expression patterns.
This allows the function of an
unknown gene to be inferred
from the function of annotated
genes that are co-expressed
with the unknown gene.
Centre of Microbial and Plant
Genetics/Bioinformatics,
Department of Microbial and
Molecular Systems,
Katholieke Universiteit Leuven,
Kasteelpark Arenberg 20,
3001 Leuven, Belgium.
Correspondence to K.M.
e-mail: kathleen.marchal@
biw.kuleuven.be
doi:10.1038/nrmicro2419
Published online
31 August 2010
The insight that genes and proteins do not work in isolation but act together in intricate networks has launched
the era of systems biology 1,2. In bacteria, regulation at the
transcriptional level is pivotal to guaranteeing metabolic
flexibility and cellular integrity 1,2. In Escherichia coli the
transcription-regulatory network (TRN) was shown to
be composed of basic modular components that contribute to the specificities of global response dynamics, for example by speeding up cellular responses or
making them more robust (that is, able to respond to
a wide range of environmental signals)3,4. Deciphering
the gene co-expression network and the TRN (BOX 1)
is therefore crucial to understanding bacterial cellular
behaviour. The number of computational methods that
are being developed to reconstruct TRNs from genome-wide expression data is rapidly increasing; here, these
methods are referred to as expression-centred methods. Module inference methods, which focus on the
co-expression network, rely on the guilt-by-association
principle to identify functional relationships between
genes, searching for gene sets or modules that exhibit a
similar expression behaviour across experimental conditions (BOX 1). Methods that infer TRNs go one step
beyond and infer causality relationships in the network
by also identifying the transcriptional programmes of
the genes or modules, to describe how transcription factors (TFs) cause the observed changes in expression of
their cognate target genes (BOX 1).
Applying these inference procedures to public data
sets of well-studied model organisms has considerably
improved our global understanding of TRNs. In bacteria,
simple regulons that comprise only a few operons show
expression modularity. The operon organization seems
crucial for preserving this modular level of co-expression
under some conditions, whereas under other conditions
the presence of intra-operonic promoters breaks up the
modularity5–7. In addition, complex regulation involving multiple regulators generally results in single genes
showing highly specific expression behaviour that is not
shared with other genes8. By focusing on the role of the
regulatory programme in E. coli, it was observed that
not only global TFs but also local regulators respond to a
range of conditions9. In addition, many TFs are active in
similar conditions and thus trigger similar sets of genes,
suggesting either redundancy in their function or an
intricate cooperation between different TFs to mediate
a common response9.
Several notable examples have set the stage for adopting inference methods in daily laboratory practice. The
unprecedented link between protein mistranslation
and the reaction to reactive oxygen species in response
to antibiotic treatment was unveiled by combining network inference with experimental evidence in
E. coli 10. Similar approaches were used to unravel the
complex network regulating host–pathogen interactions in Salmonella enterica subsp. enterica serovar
Typhimurium11 and to chart the transcriptional network
of the archaeon Halobacterium salinarum for the first
time12. Computationally inferred interactions therefore
offer a useful resource for putting experimental findings
into a more global context, by finding novel interactions
that have yet to be unveiled, by unfolding links between
NATURE REVIEWS | Microbiology
VOLUME 8 | OCTOBER 2010 | 717
© 2010 Macmillan Publishers Limited. All rights reserved
Box 1 | Co-expression networks versus transcription-regulatory networks
Co-expression network
This is a network representation
in which the nodes represent the
genes and the edges represent
the degree of similarity in the
expression profiles of the genes (see
the figure, part a). Cliques or highly
connected subgraphs correspond to
modules of co-expressed genes. The
edges are undirected, indicating that
they represent only a correlation or
dependency relationship between
the nodes and do not reveal the cause of the relationship.
Transcription-regulatory network
This is a bipartite graphical network
representation in which the nodes
represent either transcription factors
(TFs) or target genes (or modules) (see
the figure, part b). Edges are directed,
as they reflect a causal relationship:
they indicate that an observed
correlation in the expression patterns
of two nodes is caused by
a node corresponding to a TF
regulating a node corresponding to
a target gene. A transcriptional
programme corresponds to a set of
TFs sharing the same set of target
genes, ideally under a similar subset
of conditions.
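The two representations in this box can be sketched as simple data structures. The following is a minimal illustration, not the procedure of any tool discussed here: it assumes Pearson correlation as the similarity measure and an arbitrary threshold of 0.9, and the gene names, TF name and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(profiles, threshold=0.9):
    """Undirected edges between genes whose profiles are strongly correlated."""
    edges = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            # frozenset: the edge has no direction
            edges.add(frozenset((g1, g2)))
    return edges


# Hypothetical toy data: one expression profile (four conditions) per gene.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
coexpression = coexpression_edges(profiles)

# A TRN edge is an ordered pair (transcription factor, target gene),
# so the direction of the causal claim is part of the data structure.
trn = {("tfX", "geneA"), ("tfX", "geneB")}
```

Storing co-expression edges as unordered sets and regulatory edges as ordered pairs mirrors the distinction drawn above between correlation and causation.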
Expression modularity
Refers to the modular structure
of the co-expression network.
This network can be broken
down into modules, or groups
of co-expressed genes, the
function of which can be
separated from that of other
modules.
Top-down network inference
Reverse engineering or de novo
reconstruction of the structure
of biological networks on
a genome-wide scale by
exploiting high-throughput
data. By contrast, bottom-up
regulatory network inference
is the construction of a
quantitative model from
the data using a known,
mathematically formalized
connectivity network as
input; estimating the kinetic
parameters of this model
from the data allows the
dynamic behaviour of the
network to be modelled.
[Box 1 figure: part a, co-expression network; part b, transcription-regulatory network. Legend: gene, transcription factor, co-expression edge, regulation edge.]
the pathway under investigation and other cellular processes or by identifying the conditions under which a
regulator of interest is active.
To guide users in choosing the most appropriate network inference tool for their own application, we provide a scheme that allows state-of-the-art transcriptional
inference methods to be classified on the basis of the
strategies used to solve the inference problem, focusing
mainly on top-down network inference methods. In contrast to previous categorizations, our classification uses
a combination of criteria that relates directly to the biological interpretation of the outcome rather than being
merely data set related13 or computationally focused14,15.
We use representative tools of each class to show how
using different strategies results in inferring different
types of interactions. We also describe how to interpret
benchmark studies. Finally, we give a perspective on
the future of these inference tools in light of novel data
generation procedures.
Inferring TRNs is an underdetermined problem
Under the assumption that each gene is regulated by
only one regulator, inferring the interaction network
in E. coli would require the individual links between
approximately 4,500 genes and each of the 300 known
and predicted regulators to be assessed16, a total of
1,350,000 (that is, 4,500 × 300) tests. When also taking into account the existence of combinatorial regulation and feedback loops, the theoretical number of
combinations can no longer be exhaustively enumerated.
This means that the number of possible solutions is
prohibitively large, and clever algorithms or optimization
strategies are needed to screen them in a time-efficient
way. In addition, module inference (finding the best
combination of genes and conditions that define a
co-expressed gene set according to preset criteria) is
prohibitively complex. The large number of possible
solutions (the large search space), together with the
restricted number of independent data points and
the low information content of the available data17–19, turns
TRN and module inference into an underdetermined
computational problem with many possible solutions,
all of which explain the data equally well but only one of
which can be the biologically true solution.
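The growth of the search space can be made concrete with a short calculation. The gene and regulator counts below are those quoted in the text; the cap of three co-acting regulators per gene is an illustrative assumption, chosen only to show how quickly the count grows once combinatorial regulation is allowed.

```python
from math import comb

n_genes = 4500       # approximate number of E. coli genes (from the text)
n_regulators = 300   # known and predicted regulators (from the text)

# One regulator per gene: every gene-regulator pair is tested once.
single_regulator_tests = n_genes * n_regulators


def regulator_sets_per_gene(n_regs, max_size):
    """Number of non-empty regulator sets with at most max_size members."""
    return sum(comb(n_regs, k) for k in range(1, max_size + 1))


# Assumed cap: at most three regulators act together on a gene.
sets_up_to_three = regulator_sets_per_gene(n_regulators, 3)
combinatorial_tests = n_genes * sets_up_to_three

print(single_regulator_tests)  # 1350000
print(combinatorial_tests)     # 20251125000
```

Even this restricted form of combinatorial regulation raises the number of candidate programmes from about 1.35 million to about 20 billion, before binding modes or feedback loops are considered.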
Extracting general tendencies from inference results
(for example, assessing the number of genes that exhibit
a modular expression behaviour or the differences in
regulon size) can be better supported, statistically, than
strongly emphasizing the individually inferred interactions. However, it is exactly these individual interactions
that wet-laboratory researchers are interested in.
Strategies to deal with underdetermination
The problem of underdetermination relates to the size
of the search space: the larger the search space, the
larger the complexity of the inference problem and
the more difficult it will be to find the unique solution
that approximates the biological truth. To tackle this
problem of underdetermination, module and network
inference methods adopt different strategies to reduce
the search space and/or extend the amount of independent
information (FIG. 1).
‘Conceptualization by simplifying biological reality’
is a commonly used strategy that renders network
inference more tractable. TRNs have been shown to be
modular in structure20, which implies that the network
consists of overlapping modules of functionally related
genes. Genes belonging to the same module act in concert under certain environmental cues21–23, explaining
their coordinated expression behaviour. Modules
are identified by methods that rely on clustering or
biclustering24. Module-based network inference procedures, which are primarily designed to infer transcriptional programmes, assign a regulatory programme to
these modules, rather than assigning an individual programme to each single gene, as is the case with direct
network inference methods. This drastically lowers the
number of interactions that must be evaluated during
the inference process. Another simplification relates
to the definition of combinatorial regulation, in which
multiple regulators act together to mediate specific
condition-dependent responses. Inferring a transcriptional programme that uses combinatorial regulation
means that all possible combinations of regulators and
binding modes (that is, cooperative, synergistic and so
on) must be evaluated in order to explain the observed
expression behaviour. As this is computationally intractable for large data sets, all large-scale network inference methods make an approximation of combinatorial
regulation.
www.nature.com/reviews/micro
[Figure 1 diagram: methods are arranged by class (module inference methods, module-based NI, direct NI), by data use (integrative, non-integrative), by learning mode (supervised and semi-supervised, unsupervised) and by output (global, query-driven). Methods shown: SIRENE, SEREND, de Hoon et al., Sabatti et al., DISTILLER, COALESCE, GPS, cMonkey, CLR, Stochastic LeMoNe, clustering or biclustering, Gat-Viks et al., Inferelator and time-lagged correlation.]
Figure 1 | Categorization of different state-of-the-art methods for module and
network inference. Module inference methods search for sets of co-expressed genes.
The major goal of network inference (NI) methods, on the other hand, is to search for a
regulatory programme that explains an observed expression behaviour. NI methods can
be categorized according to the strategies that they use to cope with the problem of
underdetermination. Direct NI methods consider all genes on an individual basis,
whereas module-based NI methods conceptualize the network by treating sets
of co-expressed genes as single entities (modules). NI and module inference
methods can be further divided according to whether they complement expression
data with additional data sources (integrative methods) or use expression data only
(non-integrative methods). Supervised and semi-supervised methods treat the inference
problem as a classification problem, whereas unsupervised methods do not. The output
of the methods can be global, indicating that they search for global patterns in the data,
or query-driven, starting from a predefined set of core genes or core pathways and
expanding on those. Most of the available programs can be used in either a query-driven
or a global mode. The methods indicated in pink are specifically designed to be query
driven. CLR, context likelihood of relatedness; COALESCE, combinatorial algorithm for
expression- and sequence-based cluster extraction; DISTILLER, data integration system
to identify links in expression regulation; GPS, gene promoter scan; LeMoNe, learning
module networks; SEREND, semi-supervised regulatory-network discoverer; SIRENE,
supervised inference of regulatory networks.
Optimization strategy
A strategy used to screen
the search space so that the
optimal (or almost optimal)
solution can be found without
having to evaluate all possible
solutions.
A second strategy relates to extending the expression
data with other available information. Integrative methods combine the expression data with complementary
data describing the TRN from a different angle, such as
chromatin immunoprecipitation-on-chip (ChIP-chip)
data or motif data, and these methods often obtain more
reliable interactions and a more complete picture of the
network. Moreover, during the search, prioritizing predictions for which the independent data sources agree
allows the search space to be traversed more efficiently.
As a third strategy, query-driven methods reduce the
search space by intentionally restricting the number of
interactions that needs to be evaluated: instead of searching for a global pattern, as global inference methods do,
query-driven methods concentrate their search on a predefined set of core genes or on a subnetwork of interest,
and they then expand on this core gene set or subnetwork
given the available data.
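The query-driven idea of expanding a core gene set can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any tool named in this Review: the seed set is summarized by its mean expression profile, candidates are recruited by Pearson correlation with that profile, and the threshold of 0.8 and all gene names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def expand_query(profiles, seed, threshold=0.8):
    """Add genes whose profile correlates with the mean profile of the seed set."""
    n_conditions = len(next(iter(profiles.values())))
    centroid = [
        sum(profiles[g][c] for g in seed) / len(seed)
        for c in range(n_conditions)
    ]
    recruited = {
        g for g in profiles
        if g not in seed and pearson(profiles[g], centroid) >= threshold
    }
    return set(seed) | recruited


# Hypothetical toy data: geneA and geneB form the query (core gene set).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
    "geneD": [0.9, 2.2, 2.8, 4.1],
}
expanded = expand_query(profiles, seed={"geneA", "geneB"})
```

Only genes are scored against the query, so the number of comparisons grows linearly with the number of genes rather than with the number of gene pairs.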
A fourth strategy is to use supervised (and semi-supervised) methods, which treat network inference
as a classification problem and can be considered to be
a specific way of exploiting known information in a
query-driven manner.
As each strategy uses different assumptions and poses
different constraints, the specific strategy or combination of strategies that is adopted determines the type
of interactions that can be found. This is shown below,
using results obtained from state-of-the-art inference
tools that have successfully been applied to microbial
data sets. For an algorithmic description of the inference
tools mentioned below, see BOXES 2,3.
Module-based versus direct network inference. Usually,
a biclustering method is used for module inference24.
Most module-based network inference methods also use
module inference based on biclustering as a first step,
before the assignment of the transcriptional programme.
Exploiting the concept of modularity offers advantages
from both the biological and the statistical points of
view. Most module-based approaches not only predict
regulatory interactions, but also identify the experimental conditions under which the predicted interactions
take place. This information can be helpful for designing the appropriate conditions under which experimental validation of the predicted interactions should
be performed8,25. Assuming that modularity exists also
contributes to the statistical robustness of the inferred
interactions: each of the co-expressed genes in a module
confirms the data for the other genes in that module by
providing evidence for a certain regulatory programme,
whereas for direct methods the evidence for a particular
regulator–target interaction is based on only a single-gene
observation.
A comparison of the results from the direct network inference method CLR (context likelihood of
relatedness) and the module-based method Stochastic
LeMoNe (learning module networks) shows how adopting the concept of modularity determines the interactions that can be inferred (BOX 2; FIG. 2). By exploiting
modularity, LeMoNe and related methods26 can assign
regulators with expression profiles that are less similar
to those of their target genes than is the case with CLR
or similar methods27,28. Indeed, LeMoNe performs better than CLR at inferring regulatory programmes for
genes that are grouped in coarse-grained modules which
correspond to larger pathways (for example, Fis, RNA
polymerase σ-factor S (RpoS) and PurR) and for which
the genes show an overall low degree of co-expression
Box 2 | Expression-based and integrative network inference methods
CLR (context likelihood of relatedness)32 is an unsupervised, direct, expression-based
network inference method that reconstructs an interaction between a transcription
factor (TF) and a target gene based on a correlation in their expression behaviour, as
assessed by the mutual information measure.
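One way to turn pairwise mutual information (MI) into a context-aware score is sketched below. The text states only that CLR relies on mutual information; the background correction used here (a z-score of each MI value against all other MI values of both genes, combined as a Euclidean norm) is an assumption made for illustration, and the MI values and gene names are hypothetical rather than estimated from data.

```python
from math import sqrt
from statistics import mean, pstdev


def context_scores(mi):
    """Score each pair by how far its MI stands out from both genes' backgrounds.

    mi: symmetric dict of dicts, mi[a][b] = mutual information of a and b.
    """
    genes = sorted(mi)
    stats = {}
    for g in genes:
        background = [mi[g][h] for h in genes if h != g]
        stats[g] = (mean(background), pstdev(background))

    def z(g, value):
        mu, sigma = stats[g]
        # Negative deviations are clipped to zero.
        return max(0.0, (value - mu) / sigma) if sigma > 0 else 0.0

    scores = {}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            scores[(a, b)] = sqrt(z(a, mi[a][b]) ** 2 + z(b, mi[a][b]) ** 2)
    return scores


# Hypothetical MI values for one TF and three candidate targets.
pairs = {
    ("tf1", "g1"): 0.90, ("tf1", "g2"): 0.20, ("tf1", "g3"): 0.10,
    ("g1", "g2"): 0.20, ("g1", "g3"): 0.10, ("g2", "g3"): 0.15,
}
mi = {g: {} for pair in pairs for g in pair}
for (a, b), value in pairs.items():
    mi[a][b] = value
    mi[b][a] = value

scores = context_scores(mi)
top_pair = max(scores, key=scores.get)
```

Scoring against each gene's own background penalizes genes that have moderately high MI with everything, which a single global MI threshold would not do.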
Stochastic LeMoNe34 is an unsupervised, module-based method that infers the
transcription-regulatory network (TRN) from expression data. It uses a fuzzy, two-way
clustering approach to assign genes and conditions to modules and subsequently
assigns a regulatory programme to these pregrouped gene sets. Each module contains
the genes for which the expression profiles best fit the same multivariate normal
distribution, which simultaneously partitions the conditions within the module
according to overexpression or underexpression. The transcriptional programme
assigned to each module consists of the set of regulators for which the expression
profiles best explain all or part of the condition partitions in the module.
Inferelator25 assigns a transcriptional programme to either individual genes or
predefined modules of co-expressed genes that are obtained by the integrative
module inference method cMonkey47. Multiparametric logistic regression is used to
search for tightly co-expressed modules that are enriched for genes that make up
highly connected subgraphs in metabolic and functional association networks and/or
that contain statistically over-represented de novo-detected motifs. Inferelator uses
standard regression with model shrinkage to build a parsimonious, predictive model for
the expression behaviour of the module or the gene, using changes in environmental
influences and TF expression levels as predictors. The design matrix can capture binary
interactions (AND, OR or XOR interactions) between TFs.
DISTILLER (data integration system to identify links in expression regulation)8 is an
integrative, module-based network inference method that combines expression data
with interaction data (for example, motif or chromatin immunoprecipitation-on-chip
(ChIP-chip) data) to search for co-regulated modules. It uses an unsupervised strategy
based on itemset mining to exhaustively enumerate all gene sets that are co-expressed
under a subset of conditions and that share the same motifs. A probabilistic filtering
step is used to identify the most relevant set of non-redundant modules from this
exhaustive list.
The method described by Sabatti et al.43 is a hidden-component model9 that is related
to the original network component analysis (NCA)112,113 strategy. This method uses
a linear model to decompose E, the measured expression data, into the product of a
sparse connectivity matrix (A), which contains the interactions between all TFs and their
targets, and P, the hidden condition-dependent activities of the TFs112. Methods
differ in the way they use constraints to uniquely identify A and P. Liao et al.112 constrain
A using the known network, whereas Sabatti et al.43 use motif information as a prior in
a Bayesian framework to guide the reconstruction of the unobserved TF activities and
interactions. As these methods exploit known information to constrain their search
space, they can be considered direct, integrative, unsupervised network inference
methods.
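The decomposition described above can be written out explicitly. In the sketch below, the index symbols (g for genes, t for TFs, c for conditions), the dimensions G, T and C, and the residual term are notational assumptions added for illustration; only E, A and P come from the text.

```latex
E \approx A\,P,
\qquad
E \in \mathbb{R}^{G \times C},\;
A \in \mathbb{R}^{G \times T},\;
P \in \mathbb{R}^{T \times C},
\qquad
E_{gc} = \sum_{t=1}^{T} A_{gt}\,P_{tc} + \varepsilon_{gc}
```

Because A and P are both unknown, the product is not unique without further constraints: for any invertible T × T matrix X, the pair (AX, X⁻¹P) fits E equally well. Fixing which entries of A are zero, or placing a prior on them, is what removes this ambiguity.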
COALESCE (combinatorial algorithm for expression- and sequence-based cluster
extraction)48 is an integrative, unsupervised module inference procedure that uses
a Bayesian framework to integrate sequence and expression data. De novo motif
detection occurs concurrently with the biclustering of the genes and conditions.
Motifs, represented by probabilistic suffix trees, are assigned to a developing bicluster
if their occurrence in the module is sufficiently enriched compared with their presence
in the genomic background. Additional information on sequence conservation or
nucleosome placement can be used to guide the motif and module inference.
Methods that explicitly use time series gene expression data to infer causal
relationships are known as time-lagged correlation analysis methods. They generally
consist of two steps37,38. In the first step, genes with similar expression profiles across
multiple time points (by Pearson correlation) are grouped in a module or cluster. In
the second step, causal effectors such as the regulators, the modules that contain the
regulators37, or environmental inputs38 are related to the target modules using
time-lagged correlation, a measure that is related to the Pearson correlation coefficient
but that takes into account shifts in time between the expression of the causal effector
and the target module.
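The second step can be illustrated with a small sketch of time-lagged correlation. It assumes equally spaced time points and non-negative integer lags, and it simply takes the lag that maximizes the Pearson correlation; the function names and the toy time series are hypothetical and are not taken from the cited methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_correlation(effector, target, lag):
    """Correlate effector[t] with target[t + lag] over the overlapping window."""
    if lag < 0 or lag > len(effector) - 2:
        raise ValueError("lag must leave at least two overlapping time points")
    return pearson(effector[:len(effector) - lag], target[lag:])


def best_lag(effector, target, max_lag):
    """Lag (in time points) at which the effector best predicts the target."""
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_correlation(effector, target, lag),
    )


# Hypothetical series: the target repeats the effector two time points later.
effector = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 1.0, 0.0]
target = [0.0, 0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0]

lag = best_lag(effector, target, max_lag=3)
r_at_lag = lagged_correlation(effector, target, lag)
```

Each additional lag shortens the overlapping window by one time point, so with sparsely sampled series the estimates at larger lags rest on very few observations.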
with each other or with their transcriptional regulator29.
Conversely, CLR has a higher precision than LeMoNe for
identifying targets of those bacterial regulators that are
dedicated to one or, at most, a few operons, because in
bacteria such operonic regulators are tightly co-expressed
with their targets (for example, glucitol operon repressor (GutR), IscR, BetI and arabinose operon regulatory
protein (AraC)). A direct method such as CLR also
covers interactions for a larger range of regulators than
a module-based method such as LeMoNe, as module-based inference methods lose interactions with target
genes that are not co-expressed with a sufficient number
of other target genes29.
Modelling combinatorial regulation. Inferring combinatorial regulation in its full complexity is also computationally intensive. Most direct methods, both supervised
(such as SEREND (semi-supervised regulatory-network
discoverer)30 and SIRENE (supervised inference of regulatory networks)31; see below and BOX 3) and unsupervised (such as CLR32), simplify the problem by assigning
regulators to their target genes one by one and composing the combinatorial regulatory programmes in a
post-processing step that finds sets of regulators which
control the same target genes. Although this substantially
reduces the complexity of the network inference problem, such a stepwise approach renders it impossible to
distinguish between truly complex combinatorial regulation, in which the signals of multiple TFs are integrated
to trigger the observed gene expression pattern, and
condition-dependent regulation, in which different TFs
act independently to mediate expression of their target
genes under different subsets of conditions. For example, applying CLR to data from E. coli resulted in the
correct assignment of the regulators GadE, GadW and
GadX to several genes involved in the acid response32.
However, the true, more complex relationship of GadW
and GadX with GadE, which is the main regulator of the
acid response and is under the control of both GadW
and GadX33, could not be unveiled.
Module-based inference methods such as Stochastic
LeMoNe34 and DISTILLER (data integration system to
identify links in expression regulation)8 (BOX 2) automatically take into account the condition dependency
of the inferred transcriptional programmes: regulators
that are assigned to the same genes under different subsets of conditions are assumed to act independently, as
each of them is responsible for triggering a different
condition-dependent response. Regulators that are
predicted to regulate the same target genes in similar
conditions, on the other hand, are presumed to act combinatorially, as they are needed simultaneously to trigger the observed co-expression response. For example,
using DISTILLER, it was predicted that the E. coli global
regulator cyclic AMP regulatory protein (Crp) interacts,
depending on the conditions, with different specific regulators8. Neither DISTILLER nor Stochastic LeMoNe can
infer the mode of the combinatorial interactions between
the assigned regulators — that is, whether the assigned
TFs act together in an additive or synergistic way (AND),
in a combinatorial interaction, such that the presence of
one of the regulators is sufficient to trigger expression
of the target gene (OR), or in a mutually exclusive manner
(XOR). By combining the expression profiles of the regulators according to these different possible interactions
Box 3 | Query-driven and supervised network inference methods
The method proposed by Gat-Viks et al.64 is a query-driven, expression-based inference
method. Qualitative knowledge of a pathway of interest is formalized as a Bayesian
network, in which the nodes represent different molecular entities (genes, proteins or
metabolites) and the edges represent the interactions between them, with their
corresponding connection logics. Such a probabilistic formulation of the network
allows uncertainty to be included in the model. In a first model refinement step,
possible model improvements (changes in topology and interaction logics) are
evaluated. Refinements resulting in a model that better predicts the observed
expression values are withheld. In a second expansion step, transcription factor (TF)
activities are predicted from the network model, and a likelihood score is used to assign
additional target genes for which the expression can be predicted by the TF activity
profile. Thus, the method identifies sets of genes that are regulated by the same set
of regulators and according to a common logic.
GPS (gene promoter scan)62 is a query-driven, integrative network inference method
that starts from a set of genes regulated by a common TF. Each gene is represented by
a list of features, consisting of its expression profile and a detailed description of its
promoter elements. The set of query genes is separated into distinct clusters according
to these features, resulting in these genes being grouped according to their specific
regulation patterns. A fuzzy k-nearest-neighbour classifier is used to extend the
obtained clusters with new targets on the basis of the similarity between the feature
profile of the new gene and that of a cluster.
SIRENE (supervised inference of regulatory networks)31 is a supervised, expression-based, direct network inference method that splits network inference into multiple
binary classification problems for each TF. One SVM (support vector machine)-based
classifier is trained per TF, according to similarities in the expression profiles of target
and non-target genes: genes regulated by a TF are likely to be co-expressed, whereas
non-targets are not. This TF-specific classifier is then used to predict which genes are
regulated by the TF, resulting in a ranked list of potential targets.
SEREND (semi-supervised regulatory-network discoverer)30 and the method
described by de Hoon et al.68 are supervised (or semi-supervised), integrative network
inference methods. A training set of known targets and non-targets is used to
determine the parameters of two separate logistic regression functions that map the
expression values and motif scores in the training set to their predictor variables (which
determine whether the gene is activated, repressed or not regulated by the TF). The
targets of a TF are thus expected to have a similar motif and expression profile. Motif
and expression data are treated separately to guarantee proper balancing of the
unequal number of features in each data set. A metaclassifier, also based on logistic
regression, combines the outcomes of these separate expression-based and
motif-based classifiers. The complete classifier is subsequently used to predict the
probability that genes belong to the same regulon.
Search space
All possible solutions that need
to be evaluated to find the
one that is optimal
according to preset criteria. In
most inference problems, the
number of possible solutions is
prohibitively large and cannot
be enumerated exhaustively.
Clustering
Grouping of genes that have
similar expression patterns
across all conditions.
Biclustering
Combining the selection of
co-expressed gene sets with
a condition selection step to
infer the set of conditions that
is relevant to the clustered
genes.
(AND, OR or XOR) before assessing how well they
explain the target’s expression behaviour, Inferelator 25
can infer these more complex modes of transcriptional
interactions. Recently, CLR was also extended to account
for synergistic relationships (synergy-augmented CLR)
— that is, when the expression value of a third gene can
be better explained by two genes together than by each
of them separately 35,36. Using this approach, novel links
were uncovered in the original E. coli CLR network, such
as the fact that the expression of fecA, which encodes an
Fe3+ dicitrate transport protein, depends on both fecI and
aceK (encoding isocitrate dehydrogenase kinase/phosphatase), with aceK presumably acting as an indirect
inhibitor of ferric citrate transport36.
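The idea of combining regulator profiles before scoring them against a target can be sketched as follows. The encodings used here (element-wise minimum for AND, maximum for OR and absolute difference for XOR) and the use of Pearson correlation as the score are assumptions made for illustration; they are not presented as the implementation of Inferelator or of synergy-augmented CLR, and the profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def combine(tf1, tf2, mode):
    """Combine two regulator profiles under an assumed interaction logic."""
    if mode == "AND":   # both regulators must be high
        return [min(a, b) for a, b in zip(tf1, tf2)]
    if mode == "OR":    # either regulator suffices
        return [max(a, b) for a, b in zip(tf1, tf2)]
    if mode == "XOR":   # exactly one regulator is high
        return [abs(a - b) for a, b in zip(tf1, tf2)]
    raise ValueError(f"unknown mode: {mode}")


def score_modes(tf1, tf2, target):
    """Correlation of the target with each combined regulator profile."""
    return {
        mode: pearson(combine(tf1, tf2, mode), target)
        for mode in ("AND", "OR", "XOR")
    }


# Hypothetical profiles: the target is high only when both TFs are high.
tf1 = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
tf2 = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
target = [0.1, 0.0, 0.2, 1.0, 0.1, 0.0]

mode_scores = score_modes(tf1, tf2, target)
best_mode = max(mode_scores, key=mode_scores.get)
```

In this toy case neither regulator alone tracks the target well, whereas the AND combination does, which is the kind of relationship that a one-regulator-at-a-time assignment cannot distinguish.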
Integrative versus expression-based approaches. Non-integrative expression-based network inference methods
extract information about regulator–target interactions from the expression data itself. Except for those
supervised expression-based methods that exploit the
observed co-expression behaviour of known targets of a
particular TF, such as SIRENE31 (see below), most non-integrative methods assume that the expression profile
of the regulator is a proxy for its activity; for example,
this assumption is made by Stochastic LeMoNe34, CLR32,
Inferelator 25 and correlation-based methods37,38. This
assumption disregards the important role of regulation
mechanisms acting at levels other than transcription39
and restricts the interactions that can be inferred to
those of regulators that are either co-expressed or
inversely correlated with their targets40 (FIG. 3). As a
result, expression-based inference methods such as CLR,
Stochastic LeMoNe and other related methods26–28 are
biased towards inferring interactions of autoregulators
or operonic regulators that are tightly co-expressed with
their targets29. Moreover, most expression-based inference methods cannot distinguish between regulators
that actually regulate a gene (that is, that have a direct
causal effect) and regulators that are simply co-expressed
with a gene (that is, mere correlation). This problem
can be partially alleviated by inferring networks from
dynamic data instead of from static data, as time series
inherently contain information about causal effects, if
one assumes that the expression of the TF needs to be
altered before it can affect its targets (in a direct way or
through a regulatory cascade). Inferring networks from
dynamic data requires special techniques that capture
the expression dynamics (for example, the lag in expression profiles between genes). Time-lagged correlation
analysis (BOX 2) was used to infer the regulatory network
that mediates the response to alternating light conditions in the cyanobacterium Synechocystis 38, and the
Bacillus subtilis regulatory network was inferred using
the same technique37. In practice, inference of networks
from dynamic data is restricted by the insufficient time
resolution of the available samples, which makes it harder
to distinguish true signals from noise and
causes fast responses to be missed.
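As a minimal sketch of the time-lagged correlation idea, the following code shifts the target profile against the TF profile and reports the lag at which the absolute Pearson correlation is highest. The function names, the evenly spaced sampling and the toy profiles are assumptions made for illustration; this is not the procedure used in the cited Synechocystis or B. subtilis studies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_lag(tf, target, max_lag):
    """Return (lag, r) for the lag in 0..max_lag that maximizes |r|.

    At a given lag, tf[t] is compared with target[t + lag], so a positive
    lag means the target follows the TF.
    """
    scores = []
    for lag in range(max_lag + 1):
        x = tf[: len(tf) - lag]
        y = target[lag:]
        scores.append((lag, pearson(x, y)))
    return max(scores, key=lambda s: abs(s[1]))


# Invented time series: the target repeats the TF profile two time points later.
tf = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.2, 1.5, 2.5, 1.0]
target = [0.0, 0.0] + tf[:-2]

lag, r = best_lag(tf, target, max_lag=3)
print(lag, round(r, 3))
```

Each extra lag removes one time point from the comparison, which is one reason why sparse sampling limits what can be inferred from short time series.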
By complementing gene expression with additional
transcriptional information (such as motif data or DNA–
protein interaction data), integrative network inference
methods8,30,41–45 can extend the scope of their predictions beyond interactions that can be inferred from co-expression behaviour and usually result in more reliable
predictions (FIG. 3). Sabatti et al.43 proposed a direct integrative approach based on hidden component analysis
(BOX 2) that overlays a network topology derived from
known and motif-based interactions with expression
data. This method was used to infer the transcriptionally active edges in the E. coli network. By exploiting
the known information on regulatory motifs and transcriptional interactions in the EcoCyc database46 in a
supervised way, the direct integrative method SEREND
inferred novel interactions for previously characterized
regulators of E. coli (see below).
Module-based network inference methods
such as DISTILLER8 (BOX 2) rely on an integrative
module detection step to derive regulatory programmes. Integrative module inference (DISTILLER,
cMonkey47 and COALESCE (combinatorial algorithm
for expression- and sequence-based cluster extraction)48 (BOX 2)) searches for genes that not only show
[Figure 2 graphic. Panel a: difference in precision (precision of CLR − precision of LeMoNe, from −1 to 1) per regulator, separating regulators for which the precision of CLR exceeds that of LeMoNe from those for which the precision of LeMoNe exceeds that of CLR. Panel b: precision (0 to 1) of LeMoNe and of CLR per regulator. Panel c: regulators (axis 0 to 20) grouped by regulon size (≥10 operons versus <10 operons) for LeMoNe and CLR. Regulators shown in panels a and b: RcsA, Fis, GadE, RpoS, PurR, LldR, FlhC, Fnr, CpxR, LexA, FadR, CusR, GutM, GadW, GatR, DnaA, Lrp, FliA, GadX, AraC, IscR, GalS, GutR, TdcA, CsgD, BetI and NikR.]
Figure 2 | Complementarity in the type of interactions inferred by direct
and module-based inference methods. CLR (context likelihood of relatedness) and
Stochastic LeMoNe (learning module networks), as representatives of direct and
module-based inference methods, respectively, were applied to the same Escherichia coli
compendium32. The precision of the inferred interactions was calculated as described in
Faith et al.32, using experimentally documented interactions in RegulonDB69 as a standard.
a | A comparison of the precision with which true interactions were inferred for both
methods; the difference in the precision obtained with CLR and with LeMoNe was
calculated for each regulator. Regulators are ranked according to this difference in
precision. A high negative value indicates a higher precision for LeMoNe than for CLR,
and high positive values indicate the opposite. b | The values of the regulator-specific
precision for LeMoNe and CLR. c | The size distribution of the known regulon
membership, according to RegulonDB, for the regulators for which either LeMoNe or
CLR show a higher precision. Parts a and b illustrate the complementarity between both
methods in retrieving interactions for different regulators. Part c shows that LeMoNe
predicts, on average, correct targets for more global regulators (with a larger regulon
size), whereas CLR typically predicts targets for regulators with fewer known targets.
Note that predictions for regulators that are not documented in RegulonDB are not
included in this plot.
Motif
TF-binding site or specific
sequence tag that is recognized
by a TF and is located in the
promoter region of a gene.
co-expression, but also share a common regulatory
binding site (identified by de novo motif detection or ChIP
analysis). Exploiting complementary data sources to confirm expression-based module assignments reduces the
assignment of false members to true modules and
the detection of spurious modules. As the observed co-expression in a module also implies true co-regulation
when using integrative module inference methods, the
module inherently contains information from which
the transcriptional programme can be inferred: for example, each
module is assigned the regulator that is known to
recognize the motif or binding site associated with
the module. Applying DISTILLER to a cross-platform E. coli expression compendium and motif data
for 67 known regulators resulted in the prediction
of 278 new interactions for 29 different regulators8.
Of the 11 new interactions for fumarate and nitrate
reduction regulatory protein (Fnr) that were experimentally verified by ChIP–quantitative real-time PCR,
none were predicted by the non-integrative methods CLR 32 and Stochastic LeMoNe 29. When using
integrative approaches in combination with de novo-detected motifs, the assignment of a cognate regulator
will be based on additional, computationally derived
criteria (for example, the genomic proximity of the genes
encoding the regulator and its targets)5 or on a concomitant expression-based inference of the regulatory programme25. In the future, mapping of cognate regulators
to novel motifs will be further facilitated by integration
with data resulting from protocols that globally survey an
organism’s proteome for sequence-specific interactions
with putative DNA regulatory elements49,50.
So, inference methods that use only expression data
are useful for organisms for which there is little additional information available. Integrative methods, on the
other hand, provide a more complete view of the network and are more likely to predict true-positive interactions. However, the added value of integrative methods
depends on the quality of both the additional data51 and
the algorithm used.
Global versus query-driven inference. Global module
inference methods22,52–59 search for the modules that
explain most of the data. This usually corresponds to
identifying large pathways that consist of many genes
and that are usually responsible for the general responses
to major metabolic or condition shifts, such as the pathways that regulate flagellar synthesis, amino acid biosynthesis and the DNA damage response. As such, global
approaches provide a general view of the active TRN
and the resulting physiological state in the cell. Query-driven module detection methods60,61, on the other hand,
search for genes that are co-expressed, in a condition-dependent way, with a predefined set of genes (also called
query genes). These algorithms are deliberately biased
towards finding a specific local solution in the search
space according to the particular interests of the user.
This solution is usually not easy to find using a global
approach, as either the expression signals of the query
genes are too low to be significant or the local solution
is obscured by a more global one. For example, searching
an E. coli compendium for a PurR-related module using
a known PurR target as a query returns a module that is
indeed significantly enriched for previously known PurR
targets (P < 1 × 10^-15), whereas with a global approach
the module that contains the most PurR-related genes
(under default conditions) is much larger and enriched
for more general functions related to amino acid biosynthesis and translation (R.D.S., unpublished observations). Query-driven approaches are thus typically used
to expand or curate a particular pathway or process either
by searching for additional genes that are co-expressed
with genes known to be involved in the pathway or by
filtering out genes that are not co-expressed with the
majority of the so-called pathway genes. For instance,
the query-driven Signature Algorithm (SA) refined the
gene set involved in the tricarboxylic acid (TCA) cycle
in Saccharomyces cerevisiae using the homologues of
37 E. coli TCA cycle genes as queries61.
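The query-driven search described above can be sketched as a two-step procedure: first select the conditions in which the query genes respond, then collect the genes that follow the query in those conditions. The code is a simplified illustration loosely modelled on that idea; the function name, the thresholds, the assumption that expression values are already normalized per gene, and the toy data are all hypothetical, and it is not the published Signature Algorithm.

```python
def query_module(expr, query, cond_threshold=1.0, gene_threshold=1.0):
    """Find genes co-expressed with the query genes in query-relevant conditions.

    expr:  dict mapping gene -> list of normalized expression values,
           one value per condition (same order for all genes).
    query: list of gene names used as the query.
    Returns (sorted list of module genes, list of selected condition indices).
    """
    n_cond = len(next(iter(expr.values())))

    # Step 1: score each condition by the average expression of the query genes
    # and keep the conditions in which the query genes respond strongly.
    cond_score = [sum(expr[g][c] for g in query) / len(query) for c in range(n_cond)]
    selected = [c for c in range(n_cond) if abs(cond_score[c]) >= cond_threshold]
    if not selected:
        return [], []

    # Step 2: score each gene over the selected conditions only, flipping the
    # sign where the query genes are repressed, and keep high-scoring genes.
    module = []
    for gene, values in expr.items():
        signed = [values[c] * (1 if cond_score[c] > 0 else -1) for c in selected]
        if sum(signed) / len(signed) >= gene_threshold:
            module.append(gene)
    return sorted(module), selected


# Invented, already normalized expression values across five conditions.
expr = {
    "queryA": [2.0, 1.8, 0.1, -0.2, -1.9],
    "queryB": [1.9, 2.1, -0.1, 0.0, -2.0],
    "geneC": [1.7, 1.6, 0.2, 0.1, -1.8],   # follows the query genes
    "geneD": [0.1, -0.2, 2.0, 1.9, 0.0],   # responds in other conditions
    "geneE": [-1.8, -2.0, 0.0, 0.1, 1.9],  # anti-correlated with the query
}

module, conditions = query_module(expr, ["queryA", "queryB"])
print(module, conditions)
```

Because genes are scored only over the conditions selected by the query, a gene such as geneD, which varies strongly elsewhere in the compendium, does not enter the module.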
Most of the global network inference methods
described above can be applied in a query-driven setting by restricting their input data sets. In some cases
[Figure 3 graphic. Panels a and b: frequency of interactions (0 to 0.14) against correlation (−1.0 to 1) for CLR and RegulonDB (a) and for DISTILLER and RegulonDB (b). Panel c: precision (0 to 0.08) against recall (0 to 0.014) for SIRENE and SEREND. Panel d: area under the precision–recall curve (× 10^-3, 0 to 3) for SIRENE and SEREND, for Crp, H-NS, IHF, Fnr and Fis.]
Classification problem
A problem that can be
solved by a system whereby
properties or features of known
targets and non-targets of a
regulator are derived from
high-throughput data and used
to construct a classifier function
— that is, a mathematical
function that describes the
relationship between the class
labels (being a target versus
being a non-target) and the
corresponding properties of
the high-throughput data.
These classifier functions
can then be used to predict
whether or not a gene of
interest is a target of the
studied TF on the basis of
its data properties.
Operonic regulator
Regulator dedicated to one
specific operon.
De novo motif detection
Computational strategy to
identify TF binding sites
without any prior information
on the sequence of the site.
Such a strategy relies on
certain subsequences being
statistically over-represented in
a set of co-regulated genes.
Figure 3 | The different characteristics of interactions inferred by expression-based and integrative network
inference methods. a,b | Expression-based methods that estimate the activity levels of the regulators from their
expression profiles are biased towards predicting interactions for regulators that are tightly positively or negatively
correlated with their targets. For methods that infer regulatory programmes from complementary data sources, this is
not the case. The expression-based method CLR (context likelihood of relatedness; part a) and the integrative method
DISTILLER (data integration system to identify links in expression regulation; part b) were applied to the same
Escherichia coli expression compendium (results were taken from Lemmens et al.8). The histograms display the number
of predicted pairwise TF–target interactions as a function of their mutual co-expression. As a reference, the same
distribution is shown for all interactions documented in RegulonDB69. A correlation coefficient of 1 corresponds to
the situation in which the profiles of the regulator and the target gene are exactly the same, which is the case for
autoregulators. c,d | Integrative methods result in more reliable predictions than methods that use only expression
information. The performances of an expression-based network inference method (SIRENE; supervised inference of
regulatory networks) and an integrative (SEREND; semi-supervised regulatory-network discoverer) network inference
method are compared using chromatin immunoprecipitation-on-chip (ChIP-chip) data as an external standard. Part c
displays the precision–recall curve for SEREND and SIRENE predictions made for cyclic AMP regulatory protein (Crp).
The area under the precision–recall curve, indicated by shading, is used as an estimate of the overall performance of the
network inference method. Part d compares the areas under the precision–recall curves for SIRENE and SEREND for five
different regulators for which ChIP-chip data114–116 is available: Crp, H-NS, integration host factor (IHF), fumarate and
nitrate reduction regulatory protein (Fnr) and Fis. SEREND, the integrative method, outperforms SIRENE in retrieving
ChIP-chip targets for each of the regulators considered.
this can be advantageous; for example, methods such
as CLR, Stochastic LeMoNe and Inferelator perform
better if the transcriptional programme can be inferred
from a prespecified list of regulators rather than from
a full gene list, because erroneous interactions with
non-regulators will be eliminated a priori. Algorithms
specifically designed for query-driven network searches
focus on one or a few core pathways62–65. By constraining the search space to only those solutions that contain
the query, these methods can make more detailed network models than would be possible in a global setting.
Gat-Viks et al.64 (BOX 3) formalized the existing qualitative knowledge about the yeast osmotic response as
a probabilistic model. Interrogating this model with
expression data allows both refinement of the model,
by correcting erroneous interactions, and extension of
the original network with novel targets that are affected
by components of the original network. Alternatively,
kinetic approaches for modelling the dynamics between
TFs and target genes from time series expression data,
which are still intractable on a genome-wide scale, have
been successfully applied in a query-driven mode to validate the outcome of a ChIP-chip experiment. So far, these
approaches have only been applied to higher eukaryotes66. The GPS (gene promoter scan) algorithm62 (BOX 3)
is another query-driven network inference method
that takes advantage of detailed promoter descriptions
in combination with expression data from mutants to
extend the regulon of a predefined regulator. More specifically, GPS identified four additional PhoP targets in
S. Typhimurium that were previously thought to be only
indirectly PhoP dependent. Furthermore, the identified
PhoP targets in E. coli were assigned to different modules, one of which primarily contained genes that are
involved in acid resistance. This allowed a novel link
between PhoP regulation and bacterial acid resistance
to be established62,67.
[Figure 4 graphic. Panel a: number of TFs for CLR (in RegulonDB), CLR (not in RegulonDB) and SIRENE (in RegulonDB). Panel b: average number of targets per TF for CLR and SIRENE. Panel c: number of targets per regulator for RegulonDB, CLR and SIRENE, with regulators listed from YhdM to Crp.]
Supervised versus unsupervised inference of the regulatory programme. Supervised methods treat inference as
a classification problem. They start from a set of known
TF–target interactions and, on the basis of this predefined training set, characteristic features are derived,
such as TF binding sites (SEREND 30 and de Hoon
et al.68) or the degree of co-expression between TF targets (SIRENE31, SEREND30 and de Hoon et al.68). These
characteristics are subsequently used to evaluate a new
candidate gene as a potential target of a TF. Genes that
share many characteristics with the known targets of the
TF are classified as true targets, and the others as non-targets. Such a classification strategy depends on the quality of the training set of true-positive and true-negative
interactions. It is straightforward to extract examples
of positive interactions from curated databases, such as
RegulonDB69 (E. coli), EcoCyc46 (E. coli) and DBTBS70
(B. subtilis) (see Supplementary information S1 (table) for
further information about databases), but the definition
of true-negative interactions is much less trivial. Genes
that are not known to interact with a specific regulator
— that is, ‘unknowns’ — are often treated as negatives.
However, our knowledge of TRNs is still limited and there
is therefore a good chance that such a set of ‘unknowns’
contains as-yet-uncharacterized true-positive interactors
for a given TF, in which case the classification results will
deteriorate.
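The supervised idea can be sketched with a deliberately simple scoring rule: each gene without a documented interaction is ranked by how closely its expression profile resembles the profiles of the known targets of the TF. This is an assumption-laden illustration rather than the classifier used by SIRENE or SEREND; the function names and toy profiles are hypothetical. Ranking the 'unknowns' instead of labelling them as negatives sidesteps, but does not solve, the problem of undiscovered true targets in the training set.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_candidates(expr, known_targets):
    """Rank genes that are not known targets by mean correlation with the known targets.

    expr:          dict mapping gene -> expression profile (list of floats).
    known_targets: list of genes documented as targets of the TF (the training set).
    Returns a list of (gene, score) pairs, best candidate first.
    """
    scores = {}
    for gene, profile in expr.items():
        if gene in known_targets:
            continue
        sims = [pearson(profile, expr[t]) for t in known_targets]
        scores[gene] = sum(sims) / len(sims)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented profiles: two documented targets and two uncharacterized genes.
expr = {
    "target1": [1.0, 2.0, 3.0, 4.0],
    "target2": [2.0, 4.1, 5.9, 8.0],
    "cand_a": [0.5, 1.0, 1.6, 2.0],
    "cand_b": [4.0, 1.0, 3.5, 0.5],
}

ranking = rank_candidates(expr, ["target1", "target2"])
print(ranking)
```

With only one or two known targets the mean profile similarity becomes unreliable, which mirrors the restriction of supervised methods to regulators with sufficient previously known targets.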
By extrapolating from previously known information,
interactions that are predicted with supervised methods are generally reliable but are restricted to regulators
with sufficient previously known targets, such as global
regulators and σ-factors from well-characterized model
organisms (such as E. coli 30,31 and B. subtilis 68) (FIG. 4).
SEREND was shown to be very useful for extending the
Figure 4 | Complementarity in the type of interactions
inferred by supervised versus unsupervised network
inference methods. SIRENE and CLR (context likelihood
of relatedness), as representative supervised and
unsupervised network inference methods, respectively,
were applied to the same Escherichia coli compendium32.
For both methods, the top 1,422 interactions were
considered. a | The number of transcription factors (TFs)
for which interactions could be inferred by each method.
b | The average number of targets inferred per TF by
each method. As they exploit known information,
supervised methods are more comprehensive than
unsupervised methods for predicting targets for a specific
regulator. c | The number of documented targets for all
of the regulators reported in RegulonDB69, ranked
accordingly, is shown on the right side of the graph. The
regulators for which most targets have been described
to date correspond to global regulators and σ-factors.
For each inference method, the number of inferred
interactions per regulator is indicated on the left side
of the graph. Supervised methods are biased towards
predicting targets for those regulators that have a
sufficiently high number of previously known targets.
Precision–recall curve
Customary method of
comparing the precision and
recall of a network inference
method in order to evaluate
the performance of an inference
algorithm. The precision is the
proportion of correctly inferred
interactions, according to an
external standard, out of the
total number of predictions
made. The recall is the degree
to which the total number of
existing interactions in the real
network has been covered by
the predictions.
repertoire of interactions of the E. coli global regulators
integration host factor (IHF), H-NS, Crp, Fnr and Fis30.
To infer interactions in less studied organisms,
unsupervised approaches are more suitable (such as
Stochastic LeMoNe, CLR, DISTILLER and Inferelator),
as they do not necessarily depend on previously known
information and they can also infer interactions for
regulators for which there is little or no prior knowledge (FIG. 4). Unsupervised methods that can infer transcriptional programmes from only expression data, such
as CLR and Inferelator, have been shown to be useful
for providing a first, global view of the TRNs of, for
example, S. Typhimurium71,72, Shewanella oneidensis 73,
Halobacterium salinarum12 and Cyanobacteria74.
Cross-validation
Statistical technique that
assesses the extent to which a
model fitted on a certain data
set can also predict the
observations made on an
independent data set.
Choosing benchmark data sets
Benchmarking is important for being able to understand
the reliability of the reconstructed network. It is based
on the precision–recall curve, which is calculated according to a predefined external standard. This standard is
generated by collecting all curated interactions for a
particular organism and treating them as true positives,
and treating as false positives all predicted interactions
between a gene and a TF that are not documented in the
curated database. Using such a standard tends to overestimate the false-positive prediction rate, as most genes
probably interact with many more TFs than is currently
documented. Moreover, the assessment ignores all new
interactions with those TFs for which no interactions are
documented yet. As a result, use of an external standard
rewards methods that merely reproduce current knowledge but penalizes those that perform well in finding
new results. To compensate for this, most current studies combine validation based on an external standard
with medium-throughput experiments to also validate
the new results8,9,32.
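The way such an external standard shapes the precision–recall curve can be made concrete with a short sketch. It assumes that predictions are supplied as a ranked list of (TF, gene) pairs and that the standard is a set of documented pairs; the names and toy interactions are hypothetical. Predictions for TFs that are absent from the standard are skipped, and every undocumented prediction for a documented TF is counted as a false positive, which is exactly the source of the pessimistic bias discussed above.

```python
def precision_recall(predicted, gold):
    """Precision and recall after each prediction in a ranked list.

    predicted: list of (tf, gene) pairs, most confident first.
    gold:      set of documented (tf, gene) pairs (the external standard).
    Returns a list of (precision, recall) tuples, one per scored prediction.
    """
    gold_tfs = {tf for tf, _ in gold}
    # Predictions for TFs without any documented interaction cannot be scored.
    scored = [pair for pair in predicted if pair[0] in gold_tfs]

    curve, hits = [], 0
    for k, pair in enumerate(scored, start=1):
        if pair in gold:
            hits += 1
        # Undocumented pairs lower the precision even if they are real interactions.
        curve.append((hits / k, hits / len(gold)))
    return curve


# Invented standard and predictions.
gold = {("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3"), ("TF2", "g4")}
predicted = [
    ("TF1", "g1"),  # documented
    ("TF1", "g9"),  # undocumented, counted as a false positive
    ("TF3", "g2"),  # TF3 is absent from the standard, so this is ignored
    ("TF2", "g3"),  # documented
]

curve = precision_recall(predicted, gold)
for precision, recall in curve:
    print(round(precision, 3), round(recall, 3))
```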
Medium-throughput experiments avoid the unfeasible task of testing all new predictions by sampling a set of
predicted interactions that is representative of the whole
analysis. In practice this set usually consists of both high-confidence and low-confidence interactions for one or a
subset of the assessed TFs. For E. coli, mainly global regulators were chosen, such as Fnr8 and leucine-responsive
regulatory protein (Lrp)9,32, as for these regulators there
is a good balance between undiscovered and already
known interactions, which favours benchmarking. For
example, by combining performance analysis using
RegulonDB with a ChIP-based medium-throughput
experiment, three groups showed that their respective
methods each had a good sensitivity for detecting known
interactions but also that high-scoring new predictions
usually corresponded to true interactions8,9,32.
For network inference methods that use predictive
models, cross-validation can be used to validate the reliability of the inferred model; this method assesses the ability
of the model to predict the expression behaviour of genes
in experiments that were not used to build the model25,34.
In several studies, ChIP-chip-derived interactions have also been used as an alternative standard to
benchmark algorithms but, like any high-throughput
data source, they contain many false-positive (or
non-functional) and false-negative interactions. This
explains the low performances that are often observed
in benchmark studies using ChIP-chip data (FIG. 5).
Obtaining insight into the behaviour of the algorithm
requires a more objective validation strategy that uses
perfect standards, made in silico by simulating data that
mimic real data75,76. Simulated data are very useful for
unveiling the qualitative properties of the algorithm
under all kinds of test conditions that can never be
obtained with real experimental data (for example, they
can be used to test noise robustness, the sensitivity of
the parameter settings and the optimality of the proposed solution)77. Their drawback is that they can never
grasp the full biological complexity of real data (such
as the exact properties of the experimental noise or the
multilayered aspect of gene regulation78). To further
bridge the gap between in silico and real data, the use of
synthetic gene networks has been proposed79. These are
engineered circuits with well-defined network topologies and interaction structures. The dynamic behaviour
of such circuits is fully characterized using real measurements, and the resulting models are used to simulate
data on which inference methods can be tested.
Benchmark studies are extremely useful for guiding
both users and developers. However, relying on a benchmark study to find out which algorithm is ‘the best’ is
difficult, as the choice of an appropriate inference tool
depends on the research question posed. Fair benchmark studies should describe not only in what respect
an algorithm is the best, but also where it fails. The quality of a benchmark study also depends on the extent to
which parameter tuning is performed to guarantee that
each of the applied tools performs optimally in the setting in which they are used. In this regard, the DREAM
(Dialogue on Reverse Engineering Assessments and
Methods) initiative78,80 offers a platform for the unbiased
assessment of network inference methods. They organize a yearly competition in which developers can participate with their own method to infer networks from
blinded data sets.
Exploiting the complementarity
The overlap between inferred results from different
methods can be very low, as illustrated in FIG. 5. This,
together with the observation that the results of each
of the tested methods show a similar degree of overlap with an external validation standard (RegulonDB69
or ChIP data), indicates that this discrepancy in predicted interactions is not due to the failure of one of
the methods to infer biologically relevant interactions, but is rather due to the complementarity of the
different methods.
It is likely that no single best method exists, and different methods highlight different interaction types, so
aggregating the outcomes of complementary methods
offers a means of improving the breadth and the accuracy of the predictions. This idea of combining the outcomes of different methods has already been suggested
in various contexts81, and a ‘reverse-engineering by
consensus’ approach has been advocated recently 80,82, as
a result of the outcomes of the DREAM2 and DREAM3
conferences. At these meetings, it was shown that an
ensemble of the predictions made by the best performing methods of the DREAM contest more closely
approximated the true interaction network than did the
predictions made by each method separately.
To construct an ensemble solution that reflects an
overall statistical confidence in each of the predicted
interactions, inference methods are required that provide an explicit ranking of the predicted interactions
according to the scoring scheme they use; such methods
include Stochastic LeMoNe, CLR, DISTILLER, SEREND
and SIRENE. These individual rankings can then be
combined into a ranked ensemble solution that assigns
a higher confidence to interactions that are repeatedly
retrieved by the different methods.
As well as being useful for combining the outcomes
of different methods, an ensemble solution can be used
to integrate different results from a single method.
Because of the large search space, finding the most
optimal solution to a network inference problem is
non-trivial, and optimization algorithms often result in
suboptimal solutions that all approximate the true global optimal solution but differ slightly from each other.
For methods that can capture different possible solutions, a consensus solution from interactions that are
repeatedly inferred from the data34,83 allows the accuracy
of the predicted interactions to be increased by better
approximating the global solution.
At this stage, only tentative steps have been taken
to improve on TRN reconstruction through ensemble methods. Much more work is needed to assess
whether ensemble solutions will succeed in simultaneously increasing precision and recall of the predicted
interactions.
[Figure 5 graphic: three Venn diagrams of predicted interaction sets. Panel a: RegulonDB, CLR and LeMoNe. Panel b: ChIP-chip, CLR and SIRENE. Panel c: ChIP-chip, SIRENE and SEREND.]
Figure 5 | The low overlap of the predictions made by different network inference
methods that rely on different strategies. Various network inference methods were
run on the same Escherichia coli gene expression compendium32 and their results were
compared. The proportion of shared predictions out of the total number of predictions
ranges from 5.7% to, at most, 24%. The overlap with RegulonDB ((number of interactions
in common with the external standard / total number of predicted interactions) × 100)
ranges from 15% to 18%, and the overlap with chromatin immunoprecipitation-on-chip
(ChIP-chip) data ranges from 2% to 3%, with a very low performance for CLR (context
likelihood of relatedness) predictions compared with ChIP-chip data (<1%). a | A mutual
comparison between the results of the module-based approach Stochastic LeMoNe
(learning module networks) and the direct method CLR, both of which are non-integrative
and unsupervised, using the known network data in RegulonDB69 as an external standard.
b | A comparison between the results obtained using CLR and the supervised method
SIRENE (supervised inference of regulatory networks; both methods are non-integrative
and direct). Available ChIP-chip data for several E. coli regulators was used as an external
validation standard114–116, as SIRENE uses the information in RegulonDB to make its
predictions. c | A comparison between the results of the non-integrative method SIRENE
and the integrative method SEREND (semi-supervised regulatory-network discoverer),
which combines expression data with motif data (both methods are supervised and direct).
Available ChIP-chip data was used as an external standard, as in part b.
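One simple way to combine individual rankings into a ranked ensemble solution, as described in this section, is to average the rank that each method assigns to an interaction. The sketch below assumes this average-rank scheme, together with the convention that an interaction missing from a method's list receives the rank just past the end of that list. Both choices, the function name and the toy rankings are illustrative assumptions, not the aggregation used in the DREAM analyses.

```python
def ensemble_ranking(rankings):
    """Combine ranked prediction lists into one list ordered by mean rank.

    rankings: dict mapping method name -> list of (tf, gene) pairs, best first.
    Returns the union of all predicted pairs, most confident first.
    """
    all_pairs = {pair for ranked in rankings.values() for pair in ranked}
    mean_rank = {}
    for pair in all_pairs:
        ranks = []
        for ranked in rankings.values():
            # A pair that a method did not report gets the rank just past its list.
            ranks.append(ranked.index(pair) + 1 if pair in ranked else len(ranked) + 1)
        mean_rank[pair] = sum(ranks) / len(ranks)
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(all_pairs, key=lambda pair: (mean_rank[pair], pair))


# Invented rankings from three hypothetical methods.
rankings = {
    "method_a": [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3")],
    "method_b": [("TF1", "g2"), ("TF3", "g5"), ("TF1", "g1")],
    "method_c": [("TF1", "g2"), ("TF1", "g1")],
}

consensus = ensemble_ranking(rankings)
print(consensus)
```

Interactions retrieved by all three methods rise to the top of the consensus list, whereas those reported by a single method are kept but ranked lower, so breadth is retained while repeated support is rewarded.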
Conclusions and future directions
To make sense of the flood of high-throughput data
that is being generated, it is necessary to integrate the
use of inference methods into daily laboratory practice
to assist researchers in grasping higher-level biological insights or in prediction-based hypothesis testing.
State-of-the-art inference tools rely on a unique combination of strategies to solve the inference problem. Because
each strategy applies different assumptions, they each
have different strengths and limitations and highlight
complementary aspects of the network. Categorizing the
tools according to their strategies allows users to gain
insights into the settings under which they can most
optimally be applied. The tool that is most appropriate
for a certain researcher depends on the available data
and the research purpose.
The nature of the expression data generally determines whether a direct or module-based inference
method will be more appropriate. When the set of available expression data is large and/or heterogeneous in
the assessed conditions, module-based inference methods are to be preferred over direct inference methods.
When aiming to reconstruct the complete TRN, global inference methods are more suitable than query-driven approaches. For less studied microorganisms
for which only expression data is available, expression-based network inference methods are ideal for making
a first-draft reconstruction of the TRN. Integrating
high-throughput data on TF–target interactions along
with the expression data will generally allow for a more
accurate (that is, with fewer false-positive interactions)
and more complete picture of the TRN — including the
prediction of combinatorial control, for example. But
this method might become restrictive, inferring interactions for only those TFs for which the additional information is available. This is a disadvantage if one wants
to derive global network properties. When a researcher
is interested in expanding our knowledge of a particular part of the regulatory network rather than gaining a
complete network view, query-driven methods are to be
chosen over global approaches. When a reconstructed
network is to be used as a starting point for the generation of further biological hypotheses, methods that provide an explicit ranking of the inferred interactions are
advantageous, as this allows the researcher to prioritize
candidates for further experimental work. Moreover, in
such cases researchers benefit from using an integrative or supervised approach that exploits the properties of existing interactions to infer highly reliable new
interactions. However, the more the method is biased
towards existing knowledge, the more it will be blind to
novelty. To take full advantage of the complementarity
between the different methods, a ‘reverse-engineering
by consensus’ approach seems to be the ideal option,
combining the knowledge gained from multiple inference methods or from multiple outcomes from a single
computational approach80,82.
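As an illustration of the 'reverse-engineering by consensus' idea, the following Python sketch merges the ranked predictions of several methods by averaging, for each candidate interaction, its rank across the methods. Rank averaging is only one possible combination scheme and is assumed here for illustration; it is not taken from any specific tool discussed in this Review. The function name consensus_ranking, the score dictionaries and the regulator and gene labels (TF1, geneA and so on) are hypothetical toy values.

```python
from collections import defaultdict


def consensus_ranking(score_tables):
    """Combine edge scores from several inference methods by average rank.

    score_tables: list of dicts mapping (regulator, target) -> confidence
    score, where a higher score means a more confident prediction.
    An edge that a method does not report receives that method's
    worst rank plus one (an assumption made for this sketch).
    Returns a list of ((regulator, target), mean_rank), best first.
    """
    # Collect every edge predicted by at least one method.
    all_edges = set()
    for table in score_tables:
        all_edges.update(table)

    rank_sums = defaultdict(float)
    for table in score_tables:
        # Rank 1 is the most confident edge; ties are broken by edge name.
        ordered = sorted(table, key=lambda edge: (-table[edge], edge))
        ranks = {edge: i + 1 for i, edge in enumerate(ordered)}
        missing_rank = len(ordered) + 1
        for edge in all_edges:
            rank_sums[edge] += ranks.get(edge, missing_rank)

    n_methods = len(score_tables)
    mean_ranks = {edge: total / n_methods for edge, total in rank_sums.items()}
    return sorted(mean_ranks.items(), key=lambda item: (item[1], item[0]))


# Toy scores from three hypothetical inference methods.
method_a = {("TF1", "geneA"): 0.9, ("TF2", "geneB"): 0.4, ("TF3", "geneC"): 0.2}
method_b = {("TF2", "geneB"): 0.8, ("TF1", "geneA"): 0.7}
method_c = {("TF1", "geneA"): 0.6, ("TF3", "geneC"): 0.5, ("TF2", "geneB"): 0.1}

consensus = consensus_ranking([method_a, method_b, method_c])
for edge, mean_rank in consensus:
    print(edge, round(mean_rank, 2))
```

With these toy scores, the interaction that ranks highly in all three methods comes first, and an interaction that one method fails to report is penalized rather than discarded.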
The advent of novel technologies such as tiling arrays
and, more recently, deep-sequencing techniques84,85 gives
further importance to network inference. Although
most inference methods can be readily applied to these
new types of expression data, as they are insensitive to
the type of technology used to generate the data, they will
have to be adapted to account for the more detailed level
of information that results from these novel technologies, including the presence of trans-acting small RNAs86
and riboswitches87, the non-static structures of operons
with multiple intra-operonic transcription sites6,7 and so
on. As well as the increased level of detail, these novel
technologies provide information that was not accessible before: re-sequencing the genomes of individual
bacterial strains pinpoints strain-specific mutations and
copy number variations in both coding and non-coding
regions, and ChIP-seq (ChIP followed by sequencing) or
ChIP-tiling (ChIP followed by microarray analysis) provides more detailed mapping of the genomic regions in
which cis-acting regulators or nucleoid proteins bind88.
The regulation of transcription can be described from
multiple angles using this new data, and so integrative
methods are now further challenged to provide a more
accurate and detailed picture of the TRN and to consider
the full dynamics of the system89.
Although most inference studies carried out to
date have focused on understanding the condition-dependent behaviour of a TRN in one specific model
bacterial strain, these new types of information that are
available have opened a new application field, called
‘individualized, expression-centred’ network inference.
Expression-centred inference uses the premise that most
of the mutations or changes occurring in the regulatory
network at levels other than transcription will eventually lead to an altered expression profile. This assumption allows the expression profiles of individual strains
to be considered as specific phenotypes or traits90–95.
Additional sequence-derived genomic information
can then be used to explain individually observed variations in expression behaviour, similarly to the identification of eQTLs (expression quantitative trait loci)
in higher eukaryotes. Inference methods that generate
an explicit explanatory model for the observed expression profiles (for example, Inferelator and Stochastic
LeMoNe) can easily be extended for this purpose96–99.
Linking adaptive changes of microbial genomes100–102
to altered expression behaviour will unveil fundamental insights into microbial evolution and will identify
the multifactorial changes that underlie industrially
relevant properties of naturally occurring bacterial or
yeast strains103. Moreover, a better fundamental understanding of how expression behaviour is encoded in
the genome will help further rationalize synthetic
biology104,105. Most of the inferences from such an
expression-centred approach will provide only an indirect link between the observed genomic or epigenetic
alteration and the observed strain-specific expression
profiles. Future inference tools should focus on finding the hidden path between a genomic change and an
alteration in gene expression, by exploiting information that is available about all levels of regulation, such
as the transcriptional, post-transcriptional, signalling
and metabolic levels97,106–111.
Individualized expression-centred inference studies will not only complete, but also revolutionize our
understanding of bacterial gene regulation.
1. Jacob, F. & Monod, J. Genetic regulatory mechanisms
in the synthesis of proteins. J. Mol. Biol. 3, 318–356
(1961).
2. Ptashne, M. & Gilbert, W. Genetic repressors. Sci. Am.
222, 36–44 (1970).
3. Alon, U. Network motifs: theory and experimental
approaches. Nature Rev. Genet. 8, 450–461 (2007).
4. Shen-Orr, S. S., Milo, R., Mangan, S. & Alon, U.
Network motifs in the transcriptional regulation
network of Escherichia coli. Nature Genet. 31, 64–68
(2002).
5. Fadda, A. et al. Inferring the transcriptional network of
Bacillus subtilis. Mol. Biosyst. 5, 1840–1852 (2009).
6. Cho, B. K. et al. The transcription unit architecture of
the Escherichia coli genome. Nature Biotech. 27,
1043–1049 (2009).
7. Mendoza-Vargas, A. et al. Genome-wide identification
of transcription start sites, promoters and
transcription factor binding sites in E. coli. PLoS ONE
4, e7526 (2009).
8. Lemmens, K. et al. DISTILLER: a data integration
framework to reveal condition dependency of complex
regulons in Escherichia coli. Genome Biol. 10, R27
(2009).
A description of the integrative reconstruction of
the E. coli TRN using a cross-platform expression
compendium and motif information, followed by
experimental validation of the predicted network.
9. Zare, H., Sangurdekar, D., Srivastava, P., Kaveh, M. &
Khodursky, A. Reconstruction of Escherichia coli
transcriptional regulatory networks via regulon-based
associations. BMC Syst. Biol. 3, 39 (2009).
10. Kohanski, M. A., Dwyer, D. J., Wierzbowski, J.,
Cottarel, G. & Collins, J. J. Mistranslation of
membrane proteins and two-component system
activation trigger antibiotic-mediated cell death. Cell
135, 679–690 (2008).
11. Yoon, H., McDermott, J. E., Porwollik, S., McClelland, M.
& Heffron, F. Coordinated regulation of virulence
during systemic infection of Salmonella enterica
serovar Typhimurium. PLoS Pathog. 5, e1000306
(2009).
12. Bonneau, R. et al. A predictive model for
transcriptional control of physiology in a free living
cell. Cell 131, 1354–1365 (2007).
An example of the use of an integrated computational–experimental approach to chart the
regulatory network of a largely uncharacterized
archaeon, including experimental validation of the
predicted network.
13. Bansal, M., Belcastro, V., Ambesi-Impiombato, A. &
di Bernardo, D. How to infer gene networks from
expression profiles. Mol. Syst. Biol. 3, 78 (2007).
14. Bonneau, R. Learning biological networks: from
modules to dynamics. Nature Chem. Biol. 4, 658–664
(2008).
15. Karlebach, G. & Shamir, R. Modelling and analysis of
gene regulatory networks. Nature Rev. Mol. Cell Biol.
9, 770–780 (2008).
16. Babu, M. M. & Teichmann, S. A. Evolution of
transcription factors and the gene regulatory network
in Escherichia coli. Nucleic Acids Res. 31, 1234–1244
(2003).
17. Draghici, S., Khatri, P., Eklund, A. C. & Szallasi, Z.
Reliability and reproducibility issues in DNA
microarray measurements. Trends Genet. 22,
101–109 (2006).
18. Marshall, E. Getting the noise out of gene arrays.
Science 306, 630–631 (2004).
19. Johnson, D. S. et al. Systematic evaluation of
variability in ChIP-chip experiments using predefined
DNA targets. Genome Res. 18, 393–403 (2008).
20. Ma, H. W., Buer, J. & Zeng, A. P. Hierarchical structure
and modules in the Escherichia coli transcriptional
regulatory network revealed by a new top-down
approach. BMC Bioinformatics 5, 199 (2004).
21. Hartwell, L. H., Hopfield, J. J., Leibler, S. & Murray,
A. W. From molecular to modular cell biology. Nature
402, C47–C52 (1999).
22. Ihmels, J., Bergmann, S. & Barkai, N. Defining
transcription modules using large-scale gene
expression data. Bioinformatics 20, 1993–2003
(2004).
23. Qi, Y. & Ge, H. Modularity and dynamics of cellular
networks. PLoS Comput. Biol. 2, e174 (2006).
24. Madeira, S. C. & Oliveira, A. L. Biclustering
algorithms for biological data analysis: a survey.
IEEE/ACM Trans. Comput. Biol. Bioinform. 1, 24–45
(2004).
25. Bonneau, R. et al. The Inferelator: an algorithm for
learning parsimonious regulatory networks from
systems-biology data sets de novo. Genome Biol. 7,
R36 (2006).
26. Segal, E. et al. Module networks: identifying
regulatory modules and their condition-specific
regulators from gene expression data. Nature Genet.
34, 166–176 (2003).
Pioneering work introducing module-based network
inference.
27. Margolin, A. A. et al. ARACNE: an algorithm for the
reconstruction of gene regulatory networks in a
mammalian cellular context. BMC Bioinformatics 7,
S7 (2006).
28. Basso, K. et al. Reverse engineering of regulatory
networks in human B cells. Nature Genet. 37,
382–390 (2005).
29. Michoel, T., De Smet, R., Joshi, A., Van de Peer, Y. &
Marchal, K. Comparative analysis of module-based
versus direct methods for reverse-engineering
transcriptional regulatory networks. BMC Syst. Biol. 3,
49 (2009).
30. Ernst, J. et al. A semi-supervised method for
predicting transcription factor–gene interactions in
Escherichia coli. PLoS Comput. Biol. 4, e1000044
(2008).
The first integrative reconstruction of the E. coli
TRN using a supervised method, combining motif
information and the expression compendium from
reference 31.
31. Mordelet, F. & Vert, J. P. SIRENE: supervised inference
of regulatory networks. Bioinformatics 24, i76–i82
(2008).
32. Faith, J. J. et al. Large-scale mapping and validation of
Escherichia coli transcriptional regulation from a
compendium of expression profiles. PLoS Biol. 5, e8
(2007).
The first global reconstruction of the E. coli TRN
from an Affymetrix gene expression compendium,
along with experimental validation of the predicted
network.
33. Foster, J. W. Escherichia coli acid resistance: tales of
an amateur acidophile. Nature Rev. Microbiol. 2,
898–907 (2004).
34. Joshi, A., De Smet, R., Marchal, K., Van de Peer, Y. &
Michoel, T. Module networks revisited: computational
assessment and prioritization of model predictions.
Bioinformatics 25, 490–496 (2009).
35. Anastassiou, D. Computational analysis of the synergy
among multiple interacting genes. Mol. Syst. Biol. 3,
83 (2007).
36. Watkinson, J., Liang, K. C., Wang, X., Zheng, T. &
Anastassiou, D. Inference of regulatory gene
interactions from expression data using three-way
mutual information. Ann. NY Acad. Sci. 1158,
302–313 (2009).
37. Shaw, O. J., Harwood, C., Steggles, L. J. & Wipat, A.
SARGE: a tool for creation of putative genetic
networks. Bioinformatics 20, 3638–3640
(2004).
38. Schmitt, W. A. Jr, Raab, R. M. & Stephanopoulos, G.
Elucidation of gene interaction networks through
time-lagged correlation analysis of transcriptional
data. Genome Res. 14, 1654–1663 (2004).
39. Gutierrez-Rios, R. M. et al. Regulatory network of
Escherichia coli: consistency between literature
knowledge and microarray profiles. Genome Res. 13,
2435–2443 (2003).
40. Herrgard, M. J., Covert, M. W. & Palsson, B. O.
Reconciling gene expression data with known genome-scale regulatory network structures. Genome Res. 13,
2423–2434 (2003).
An informative study illustrating the limitations of
expression-based network inference for E. coli and
S. cerevisiae.
41. Bar-Joseph, Z. et al. Computational discovery of gene
modules and regulatory networks. Nature Biotech. 21,
1337–1342 (2003).
The first large-scale integration of ChIP-chip and
expression data, applied to yeast (including
experimental validation).
42. Lemmens, K. et al. Inferring transcriptional modules
from ChIP-chip, motif and microarray data. Genome
Biol. 7, R37 (2006).
43. Sabatti, C. & James, G. M. Bayesian sparse hidden
components analysis for transcription regulation
networks. Bioinformatics 22, 739–746 (2006).
44. Tanay, A., Sharan, R., Kupiec, M. & Shamir, R.
Revealing modularity and organization in the yeast
molecular network by integrated analysis of highly
heterogeneous genomewide data. Proc. Natl Acad.
Sci. USA 101, 2981–2986 (2004).
45. Myers, C. L. & Troyanskaya, O. G. Context-sensitive
data integration and prediction of biological networks.
Bioinformatics 23, 2322–2330 (2007).
46. Keseler, I. M. et al. EcoCyc: a comprehensive view of
Escherichia coli biology. Nucleic Acids Res. 37,
D464–D470 (2009).
47. Reiss, D. J., Baliga, N. S. & Bonneau, R. Integrated
biclustering of heterogeneous genome-wide datasets
for the inference of global regulatory networks.
BMC Bioinformatics 7, 280 (2006).
48. Huttenhower, C. et al. Detailing regulatory networks
through large scale data integration. Bioinformatics
25, 3267–3274 (2009).
49. Freckleton, G., Lippman, S. I., Broach, J. R. &
Tavazoie, S. Microarray profiling of phage-display
selections for rapid mapping of transcription factor–
DNA interactions. PLoS Genet. 5, e1000449 (2009).
50. Butala, M., Busby, S. J. & Lee, D. J. DNA sampling: a
method for probing protein binding at specific loci on
bacterial chromosomes. Nucleic Acids Res. 37, e37
(2009).
51. Lu, L. J., Xia, Y., Paccanaro, A., Yu, H. & Gerstein, M.
Assessing the limits of genomic data integration for
predicting protein networks. Genome Res. 15,
945–953 (2005).
52. Sheng, Q., Moreau, Y. & De Moor, B. Biclustering
microarray data by Gibbs sampling. Bioinformatics
19, ii196–ii205 (2003).
53. Getz, G., Levine, E. & Domany, E. Coupled two-way
clustering analysis of gene microarray data. Proc. Natl
Acad. Sci. USA 97, 12079–12084 (2000).
54. Tanay, A., Sharan, R. & Shamir, R. Discovering
statistically significant biclusters in gene expression
data. Bioinformatics 18, S136–S144 (2002).
55. Lazzeroni, L. & Owen, A. Plaid models for gene
expression data. Stat. Sin. 2, 61–86 (2002).
56. Murali, T. M. & Kasif, S. Extracting conserved gene
expression motifs from gene expression data. Pac.
Symp. Biocomput. 2003, 77–88 (2003).
57. Cheng, Y. & Church, G. M. Biclustering of expression
data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 93–103
(2000).
58. Ben-Dor, A., Chor, B., Karp, R. & Yakhini, Z.
Discovering local structure in gene expression data:
the order-preserving submatrix problem. J. Comput.
Biol. 10, 373–384 (2003).
59. Kluger, Y., Basri, R., Chang, J. T. & Gerstein, M.
Spectral biclustering of microarray data: coclustering
genes and conditions. Genome Res. 13, 703–716
(2003).
60. Dhollander, T. et al. Query-driven module discovery in
microarray data. Bioinformatics 23, 2573–2580
(2007).
61. Ihmels, J. et al. Revealing modular organization in the
yeast transcriptional network. Nature Genet. 31,
370–377 (2002).
62. Zwir, I., Huang, H. & Groisman, E. A. Analysis of
differentially-regulated genes within a regulatory
network by GPS genome navigation. Bioinformatics
21, 4073–4083 (2005).
63. Pena, J. M., Bjorkegren, J. & Tegner, J. Growing
Bayesian network models of gene networks from seed
genes. Bioinformatics 21, ii224–ii229 (2005).
64. Gat-Viks, I. & Shamir, R. Refinement and expansion of
signaling pathways: the osmotic response network in
yeast. Genome Res. 17, 358–367 (2007).
65. Tanay, A. & Shamir, R. Computational expansion of
genetic networks. Bioinformatics 17, S270–S278
(2001).
66. Honkela, A. et al. Model-based method for
transcription factor target identification with limited
data. Proc. Natl Acad. Sci. USA 107, 7793–7798
(2010).
67. Zwir, I. et al. Dissecting the PhoP regulatory network
of Escherichia coli and Salmonella enterica. Proc. Natl
Acad. Sci. USA 102, 2862–2867 (2005).
68. de Hoon, M. J. et al. Predicting gene regulation by
sigma factors in Bacillus subtilis from genome-wide
data. Bioinformatics 20, i101–i108 (2004).
69. Gama-Castro, S. et al. RegulonDB (version 6.0): gene
regulation model of Escherichia coli K-12 beyond
transcription, active (experimental) annotated
promoters and Textpresso navigation. Nucleic Acids
Res. 36, D120–D124 (2008).
70. Sierro, N., Makita, Y., de Hoon, M. & Nakai, K.
DBTBS: a database of transcriptional regulation in
Bacillus subtilis containing upstream intergenic
conservation information. Nucleic Acids Res. 36,
D93–D96 (2008).
71. McDermott, J. E., Taylor, R. C., Yoon, H. & Heffron, F.
Bottlenecks and hubs in inferred networks are
important for virulence in Salmonella typhimurium.
J. Comput. Biol. 16, 169–180 (2009).
72. Taylor, R. C. et al. A network inference workflow
applied to virulence-related processes in Salmonella
typhimurium. Ann. NY Acad. Sci. 1158, 143–158
(2009).
73. Fredrickson, J. K. et al. Towards environmental
systems biology of Shewanella. Nature Rev. Microbiol.
6, 592–603 (2008).
74. Toepel, J., McDermott, J. E., Summerfield, T. C. &
Sherman, L. A. Transcriptional analysis of the
unicellular, diazotrophic cyanobacterium Cyanothece
sp. ATCC 51142 grown under short day/night cycles.
J. Phycol. 45, 610–620 (2009).
75. Mendes, P., Sha, W. & Ye, K. Artificial gene networks
for objective comparison of analysis algorithms.
Bioinformatics 19, ii122–ii129 (2003).
76. Van den Bulcke, T. et al. SynTReN: a generator of
synthetic gene expression data for design and analysis
of structure learning algorithms. BMC Bioinformatics
7, 43 (2006).
77. Van den Bulcke, T., Lemmens, K., Van de Peer, Y. &
Marchal, K. Inferring transcriptional networks by
mining ‘omics’ data. Curr. Bioinform. 1, 301–331
(2006).
78. Stolovitzky, G., Monroe, D. & Califano, A. Dialogue on
reverse-engineering assessment and methods: the
DREAM of high-throughput pathway inference. Ann.
NY Acad. Sci. 1115, 1–22 (2007).
79. Cantone, I. et al. A yeast synthetic network for in vivo
assessment of reverse-engineering and modeling
approaches. Cell 137, 172–181 (2009).
80. Marbach, D. et al. Revealing strengths and
weaknesses of methods for gene network inference.
Proc. Natl Acad. Sci. USA 107, 6286–6291 (2010).
A discussion about the current limitations of
network inference methods based on submissions
to the DREAM3 in silico challenge.
81. Hibbs, M. A. et al. Directing experimental biology: a
case study in mitochondrial biogenesis. PLoS Comput.
Biol. 5, e1000322 (2009).
82. Stolovitzky, G., Prill, R. J. & Califano, A. Lessons from
the DREAM2 Challenges. Ann. NY Acad. Sci. 1158,
159–195 (2009).
83. Nachman, I. & Regev, A. BRNI: modular analysis
of transcriptional regulatory programs. BMC
Bioinformatics 10, 155 (2009).
84. Sorek, R. & Cossart, P. Prokaryotic transcriptomics: a
new view on regulation, physiology and pathogenicity.
Nature Rev. Genet. 11, 9–16 (2010).
85. MacLean, D., Jones, J. D. & Studholme, D. J.
Application of ‘next-generation’ sequencing
technologies to microbial genetics. Nature Rev.
Microbiol. 7, 287–296 (2009).
86. Sharma, C. M. & Vogel, J. Experimental approaches
for the discovery and characterization of regulatory
small RNA. Curr. Opin. Microbiol. 12, 536–546
(2009).
87. Coppins, R. L., Hall, K. B. & Groisman, E. A. The
intricate world of riboswitches. Curr. Opin. Microbiol.
10, 176–181 (2007).
88. Vora, T., Hottes, A. K. & Tavazoie, S. Protein occupancy
landscape of a bacterial genome. Mol. Cell 35,
247–253 (2009).
89. Madar, A., Greenfield, A., Ostrer, H., Vanden Eijnden,
E. & Bonneau, R. The Inferelator 2.0: a scalable
framework for reconstruction of dynamic regulatory
network models. Conf. Proc. IEEE Eng. Med. Biol. Soc.
2009, 5448–5451 (2009).
90. Rockman, M. V. & Kruglyak, L. Genetics of global gene
expression. Nature Rev. Genet. 7, 862–872 (2006).
91. Cookson, W., Liang, L., Abecasis, G., Moffatt, M. &
Lathrop, M. Mapping complex disease traits with
global gene expression. Nature Rev. Genet. 10,
184–194 (2009).
92. Cooper, T. F., Remold, S. K., Lenski, R. E. & Schneider, D.
Expression profiles reveal parallel evolution of
epistatic interactions involving the CRP regulon in
Escherichia coli. PLoS Genet. 4, e35 (2008).
93. Fong, S. S., Joyce, A. R. & Palsson, B. O. Parallel
adaptive evolution cultures of Escherichia coli lead to
convergent growth phenotypes with different gene
expression states. Genome Res. 15, 1365–1372 (2005).
94. Mitchell, A. et al. Adaptive prediction of environmental
changes by microorganisms. Nature 460, 220–224
(2009).
95. Tagkopoulos, I., Liu, Y. C. & Tavazoie, S. Predictive
behavior within microbial genetic networks. Science
320, 1313–1317 (2008).
96. Litvin, O., Causton, H. C., Chen, B. J. & Pe’er, D.
Modularity and interactions in the genetics of gene
expression. Proc. Natl Acad. Sci. USA 106,
6441–6446 (2009).
97. Lee, S. I. et al. Learning a prior on regulatory
potential from eQTL data. PLoS Genet. 5,
e1000358 (2009).
98. Lee, S. I., Pe’er, D., Dudley, A. M., Church, G. M. &
Koller, D. Identifying regulatory mechanisms using
individual variation reveals key role for chromatin
modification. Proc. Natl Acad. Sci. USA 103,
14062–14067 (2006).
99. Gat-Viks, I., Meller, R., Kupiec, M. & Shamir, R.
Understanding gene sequence variation in the context
of transcription regulation in yeast. PLoS Genet. 6,
e1000800 (2010).
100. Herring, C. D. et al. Comparative genome sequencing
of Escherichia coli allows observation of bacterial
evolution on a laboratory timescale. Nature Genet. 38,
1406–1412 (2006).
101. Barrick, J. E. et al. Genome evolution and adaptation
in a long-term experiment with Escherichia coli.
Nature 461, 1243–1247 (2009).
102. Conrad, T. M. et al. Whole-genome resequencing of
Escherichia coli K-12 MG1655 undergoing short-term
laboratory evolution in lactate minimal media reveals
flexible selection of adaptive mutations. Genome Biol.
10, R118 (2009).
103. Brem, R. B. & Kruglyak, L. The landscape of genetic
complexity across 5,700 gene expression traits in yeast.
Proc. Natl Acad. Sci. USA 102, 1572–1577 (2005).
104. Isalan, M. et al. Evolvability and hierarchy in rewired
bacterial gene networks. Nature 452, 840–845 (2008).
105. Barrett, C. L., Kim, T. Y., Kim, H. U., Palsson, B. O. &
Lee, S. Y. Systems biology as a foundation for genome-scale synthetic biology. Curr. Opin. Biotechnol. 17,
488–492 (2006).
106. Joshi, A., Van, P. T., Van de Peer, Y. & Michoel, T.
Characterizing regulatory path motifs in integrated
networks using perturbational data. Genome Biol. 11,
R32 (2010).
107. Ye, C., Galbraith, S. J., Liao, J. C. & Eskin, E. Using
network component analysis to dissect regulatory
networks mediated by transcription factors in yeast.
PLoS Comput. Biol. 5, e1000311 (2009).
One of the pioneering methods that tries to explain
mechanistically how genomic variations result in
observed expression changes.
108. Zhu, J. et al. Integrating large-scale functional genomic
data to dissect the complexity of yeast regulatory
networks. Nature Genet. 40, 854–861 (2008).
109. Hwang, D. et al. A data integration methodology for
systems biology: experimental verification. Proc. Natl
Acad. Sci. USA 102, 17302–17307 (2005).
110. Lee, I., Date, S. V., Adai, A. T. & Marcotte, E. M.
A probabilistic functional network of yeast genes.
Science 306, 1555–1558 (2004).
111. Suthram, S., Beyer, A., Karp, R. M., Eldar, Y. &
Ideker, T. eQED: an efficient method for interpreting
eQTL associations using protein networks. Mol. Syst.
Biol. 4, 162 (2008).
112. Liao, J. C. et al. Network component analysis:
reconstruction of regulatory signals in biological
systems. Proc. Natl Acad. Sci. USA 100,
15522–15527 (2003).
113. Gardner, T. S., di Bernardo, D., Lorenz, D. & Collins,
J. J. Inferring genetic networks and identifying
compound mode of action via expression profiling.
Science 301, 102–105 (2003).
114. Grainger, D. C., Hurd, D., Harrison, M., Holdstock, J.
& Busby, S. J. Studies of the distribution of
Escherichia coli cAMP-receptor protein and RNA
polymerase along the E. coli chromosome.
Proc. Natl Acad. Sci. USA 102, 17693–17698
(2005).
115. Grainger, D. C., Hurd, D., Goldberg, M. D. & Busby,
S. J. Association of nucleoid proteins with coding
and non-coding segments of the Escherichia coli
genome. Nucleic Acids Res. 34, 4642–4652
(2006).
116. Grainger, D. C., Aiba, H., Hurd, D., Browning, D. F.
& Busby, S. J. Transcription factor distribution in
Escherichia coli: studies with FNR protein. Nucleic
Acids Res. 35, 269–278 (2007).
Acknowledgements
We thank the anonymous reviewers as well as Y. Van de Peer
and J. Vanderleyden for their useful comments on the manuscript. R.D.S. is a research assistant of the agency for
Innovation by Science and Technology (IWT, Belgium). This
work is further supported by the Katholieke Universiteit
Leuven (GOA AMBioRICS, GOA/08/011, CoE EF/05/007,
SymBioSys and CREA/08/023), by the IWT through the SBO-BioFrame project, by the Interuniversity Attraction Poles
(IUAP, Belgium) (BioMaGNet grant P6/25), by the National
Fund for Scientific Research (FWO, Belgium) (grant
IOK-B9725-G.0329.09) and by the Human Frontier Science
Program (grant HFSP-RGY0079/2007C).
Competing interests statement
The authors declare no competing financial interests.
DATABASES
Entrez Genome Project: http://www.ncbi.nlm.nih.gov/genomeprj
Bacillus subtilis | Escherichia coli | Halobacterium salinarum |
Saccharomyces cerevisiae | Salmonella enterica subsp.
enterica serovar Typhimurium | Shewanella oneidensis
UniProtKB: http://www.uniprot.org
AraC | BetI | Crp | Fis | Fnr | GadE | GadW | GadX | GutR | H-NS |
IscR | Lrp | PhoP | PurR | RpoS
FURTHER INFORMATION
Kathleen Marchal’s homepage:
http://homes.esat.kuleuven.be/~kmarchal/
EcoCyc: http://ecocyc.org
DBTBS: http://dbtbs.hgc.jp
RegulonDB: http://regulondb.ccg.unam.mx
SUPPLEMENTARY INFORMATION
See online article: S1 (table)
All links are active in the online PDF
VOLUME 8 | OCTOBER 2010 | 729
© 2010 Macmillan Publishers Limited. All rights reserved