
The Annals of Statistics

2014, Vol. 42, No. 6, 2526–2556


DOI: 10.1214/14-AOS1260
© Institute of Mathematical Statistics, 2014

CAM: CAUSAL ADDITIVE MODELS, HIGH-DIMENSIONAL
ORDER SEARCH AND PENALIZED REGRESSION

BY PETER BÜHLMANN, JONAS PETERS1 AND JAN ERNEST2

ETH Zürich
We develop estimation for potentially high-dimensional additive struc-
tural equation models. A key component of our approach is to decouple or-
der search among the variables from feature or edge selection in a directed
acyclic graph encoding the causal structure. We show that the former can be
done with nonregularized (restricted) maximum likelihood estimation while
the latter can be efficiently addressed using sparse regression techniques.
Thus, we substantially simplify the problem of structure search and esti-
mation for an important class of causal models. We establish consistency of
the (restricted) maximum likelihood estimator for low- and high-dimensional
scenarios, and we also allow for misspecification of the error distribution.
Furthermore, we develop an efficient computational algorithm which can deal with many variables, and the new method’s accuracy and performance are illustrated on simulated and real data.

1. Introduction. Inferring causal relations and effects is an ambitious but important task in virtually all areas of science. In the absence of prior information about the underlying structure, the problem is plagued, among other things, by identifiability issues (cf. [23, 34]) and by the sheer size of the space of possible models, which grows super-exponentially in the number of variables, leading to major challenges with respect to computation and statistical accuracy. Our approach is generic, taking advantage of the tools from sparse regression techniques (cf. [4, 8]) which have been successfully established in recent years.
More precisely, we consider p random variables X1 , . . . , Xp whose distribu-
tion is Markov with respect to an underlying causal directed acyclic graph (causal
DAG). We assume that all variables are observed, that is, there are no hidden vari-
ables, and that the causal influence diagram does not allow for directed cycles.
Generalizations to include hidden variables, for example, unobserved confounders,
or directed cycles are briefly discussed in Section 7.1. To formalize a model, one can use the concepts of graphical modeling (cf. [12]) or structural equation models (cf. [23]). The approaches are equivalent in the nonparametric or multivariate

Received October 2013; revised July 2014.


1 Supported by the People Programme (Marie Curie Actions) of the European Union’s Seventh
Framework Programme (FP7/2007-2013) under REA Grant agreement no. 326496.
2 Supported in part by the Swiss National Science Foundation Grant no. 20PA20E-134493.
MSC2010 subject classifications. Primary 62G99, 62H99; secondary 68T99.
Key words and phrases. Graphical modeling, intervention calculus, nonparametric regression,
regularized estimation, sparsity, structural equation model.

Gaussian case, but this is not true anymore when placing additional restrictions
which can be very useful [25, 26, 32]. We use here the framework of structural
equation models.

1.1. Problem and main idea. Our goal is estimation and structure learning for
structural equation models, or of the corresponding Markov equivalence class of
an underlying DAG. In particular, we focus on causal additive models, that is, the
structural equations are additive in the variables and error terms. The model has the
nice property that the underlying structure and the corresponding parameters are
identifiable from the observational distribution. Furthermore, we can view it as an
extension of linear Gaussian structural equation models by allowing for nonlinear
additive functions.
In general, the problem of structure learning (and estimation of the corresponding parameters) can be addressed by a variety of algorithms and methods. In the frequentist setting, the most widely used procedures are greedy equivalence search for computing the BIC-regularized maximum likelihood estimator [6] and the PC-algorithm using multiple conditional independence testing [34]. However, for the latter, the constraint of additive structural equations cannot be (easily) respected; and regarding the former, maximum likelihood estimation among all (e.g., linear Gaussian) DAG models is computationally challenging, and statistical guarantees for high-dimensional cases (and for uniform convergence with respect to a class of distributions) are only available under rather strong assumptions [38].
Our proposed approach for estimation and selection of additive structural equation models is based on the following simple idea, which is briefly mentioned and discussed in [35] and [31]. If the order among the variables were known, the problem would boil down to variable selection in multivariate (potentially nonlinear) regression; see formula (5). The latter is very well understood: for example, we can follow the route of hypothesis testing in additive models, or sparse regression for additive models can be used [16, 28, 44]. Thus, the only remaining task is to estimate the order among the variables. We show here that this can be done via the maximum likelihood principle, and we establish its consistency. In particular, for low- or “mid”-dimensional problems, there is no need to consider a penalized likelihood approach. The same holds true for high-dimensional settings when using a preliminary neighborhood selection and then employing a corresponding restricted maximum likelihood estimator. Therefore, we can entirely decouple order estimation without regularization from variable selection in sparse regression with appropriate regularization. This makes our approach very generic, at least within the framework where the underlying DAG and a corresponding order of the variables are identifiable from the joint distribution. Empirical results in Section 6 support that we can achieve much more accurate estimation than for nonidentifiable models such as the popular linear Gaussian structural equation model. At a superficial level, our approach can be summarized as follows:

1. Mainly for high-dimensional settings: preliminary neighborhood selection for estimating a superset of the skeleton of the underlying DAG. This is done by additive regression of one variable against all others. See Section 3.1.
2. Order search for the variables (or best permutation for the indices of the
variables) using (restricted) maximum likelihood estimation based on an additive
structural equation model with Gaussian errors: the restricted version is employed
if the preliminary neighborhood selection in step 1 is used, and the order search
is then restricted to the structure of the superset of the skeleton. See Sections 2.4
and 3.2.
3. Based on the estimated order of the variables in step 2, sparse additive regres-
sion is used for estimating the functions in an additive structural equation model.
See Section 2.5.
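As a minimal illustration of steps 2 and 3 (a sketch in R, not the authors' implementation; the two-variable model, the edge function 2·tanh(·) and all object names below are our own choices), the correct order can be found by comparing the sum of log residual standard deviations of the two possible autoregressive fits, and the edge function is then estimated by one further additive regression:

library(mgcv)
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- 2 * tanh(x1) + 0.3 * rnorm(n)          # nonlinear additive SEM: X1 -> X2

## score of an order = sum over nodes of log(residual SD); a source node contributes its marginal SD
score_12 <- log(sd(x1)) + log(sqrt(mean(residuals(gam(x2 ~ s(x1)))^2)))
score_21 <- log(sd(x2)) + log(sqrt(mean(residuals(gam(x1 ~ s(x2)))^2)))
score_12 < score_21                           # typically TRUE: the causal order X1 -> X2 wins

fit_edge <- gam(x2 ~ s(x1))                   # step 3: estimate the additive edge function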

1.2. Related work. We consider (nonlinear) additive structural equation models. As natural extensions of linear structural equation models, they are attractive
for many applications; see Imoto, Goto and Miyano [10]. Identifiability results
for this model class have been recently derived [21, 26]. The approach in [21] is
based on conditional independence testing and is limited to small dimensions with
a few variables only. Instead of multiple testing of conditional independencies, we
propose and develop maximum likelihood estimation in a semiparametric addi-
tive structural equation model with Gaussian noise variables: fitting such a model
is often appropriate in situations where the sample size is not too large, and we
present here for the first time the practical feasibility of fitting additive models in
the presence of many variables. An extension of our additive structural equation
model with Gaussian errors to the case with a nonparametric specification of the
error distribution is presented in [22], but the corresponding maximum likelihood
estimator is analyzed (and feasible) for problems with a small number of variables
only. When the order of the variables is known, which is a much simpler and dif-
ferent problem than what we consider here, [40] provide consistency results for
additive structural equation models.
A key aspect of our method is that we decouple regularization for feature selection from order estimation with nonregularized (restricted) maximum likelihood. The former is a well-understood subject thanks to the broad literature on sparse regression and related techniques (cf. [17, 36, 41, 44, 47, 48]). Regarding the latter
issue about order selection, a recent analysis in [37] extends our low-dimensional
consistency result for the (nonrestricted) maximum likelihood estimator to the sce-
nario where the number of variables can grow with sample size, in the best case
essentially as fast as p = p(n) = o(n). The treatment of the high-dimensional case
with a restricted maximum likelihood approach is new here, and we also present
the first algorithm and empirical results for fitting low- and high-dimensional
causal additive models (CAMs).
All proofs are provided in the supplemental article [3].

2. Additive structural equation models. Consider the general structural equation model (SEM):

X_j = f_j(X_{pa_D(j)}, ε_j),    ε_1, ..., ε_p (mutually) independent,

where pa_D(j) denotes the set of parents of node j in the DAG D and f_j is a function from R^{|pa_D(j)|+1} to R. Thus, a SEM is specified by an underlying (causal) structure in terms of a DAG D, the functions f_j(·) (j = 1, ..., p) and the distributions of ε_j (j = 1, ..., p). Most parts of this paper can be interpreted in absence of causal inference issues: clearly though, the main motivations are understanding models and developing novel procedures allowing for causal or interventional statements, and if we do so, we always assume that the structural equations remain unchanged under interventions at one or several variables (cf. [23]). The model
above is often too general, due to problems of identifiability and the difficulty of
estimation (curse of dimensionality) of functions in several variables.
Our main focus is on a special (and more practical) case of the model above, namely the additive SEM with potentially misspecified Gaussian errors:

(1)   X_j = Σ_{k∈pa_D(j)} f_{j,k}(X_k) + ε_j,
      ε_1, ..., ε_p independent with ε_j ∼ N(0, σ_j^2), σ_j^2 > 0 (j = 1, ..., p),
      E[f_{j,k}(X_k)] = 0 for all j, k,

where the f_{j,k}(·) are smooth functions from R to R. A special case thereof is the
linear Gaussian SEM

(2)   X_j = Σ_{k∈pa_D(j)} β_{j,k} X_k + ε_j,
      ε_1, ..., ε_p independent with ε_j ∼ N(0, σ_j^2), σ_j^2 > 0 (j = 1, ..., p).

Although model (2) is a special case of (1), there are interesting differences with respect to identifiability. If all functions f_{j,k}(·) are nonlinear, the DAG is identifiable from the distribution P of X_1, ..., X_p ([26], Corollary 31). We explicitly state this result as a lemma since we will make use of it later on.

LEMMA 1 (Corollary 31 in [26]3). Consider a distribution P that is generated by model (1) with DAG D and nonlinear, three times differentiable functions f_{j,k}. Then any distribution Q that is generated by (1) with a different DAG D′ ≠ D and nonconstant, three times differentiable functions f′_{j,k} is different from P: we have Q ≠ P.

3 Corollary 31 in [26] contains a slightly different statement using “nonlinear” instead of “noncon-
stant”. The proof, however, stays exactly the same.

This result does not hold, however, for a general SEM or for a linear Gaussian SEM as in (2); one can then only identify the Markov equivalence class of the DAG D^0, assuming faithfulness. An exception arises when assuming the same error variance σ_j^2 ≡ σ^2 for all j in (2), which again implies identifiability of the DAG D^0 from P [25]. In the sequel, we consider the fully identifiable case of model (1).

2.1. The likelihood. We slightly re-write model (1) as

(3)   X_j = Σ_{k∈pa_D(j)} f_{j,k}(X_k) + ε_j = Σ_{k≠j} f_{j,k}(X_k) + ε_j   (j = 1, ..., p),
      f_{j,k}(·) ≢ 0 if and only if there is a directed edge k → j in D,
      E[f_{j,k}(X_k)] = 0 for all j, k,
      ε_1, ..., ε_p independent and ε_j ∼ N(0, σ_j^2), σ_j^2 > 0.
Note that the structure of the model, or the so-called active set {(j, k); f_{j,k} ≢ 0}, is identifiable from the distribution P ([26], Corollary 31). Denote by θ the infinite-dimensional parameter with additive functions and error variances, that is,

θ = (f_{1,2}, ..., f_{1,p}, f_{2,1}, ..., f_{p,p−1}, σ_1, ..., σ_p).

Furthermore, we denote by D^0 the true DAG and by θ^0 (and {f^0_{j,k}}, {σ^0_j}) the true infinite-dimensional parameter(s) corresponding to the data-generating true distribution. We use this notation whenever it is appropriate to make statements about the true underlying DAG or parameter(s).
The density p_θ(·) for the model (3) is of the form

log p_θ(x) = Σ_{j=1}^p log( (1/σ_j) ϕ( (x_j − Σ_{k≠j} f_{j,k}(x_k)) / σ_j ) ),

where ϕ(·) is the density of a standard normal distribution. Furthermore,

σ_j^2 = E[ (X_j − Σ_{k≠j} f_{j,k}(X_k))^2 ],

and the expected negative log-likelihood is

E_θ[− log p_θ(X)] = Σ_{j=1}^p log(σ_j) + C,   C = p log(2π)^{1/2} + p/2.

2.2. The function class. We assume that the functions in model (1) or (3) are from a class of smooth functions: F is a subset of L_2(P_j), where P_j is the marginal distribution for any j = 1, ..., p; assume that it is closed with respect to the L_2(P_j) norm. Furthermore,

F ⊆ { f : R → R;  f ∈ C^α,  E[f(X)] = 0 },

where C^α denotes the space of α-times differentiable functions and the random variable X is a placeholder for the variables X_j, j = 1, ..., p. Note that this is a slight abuse of notation since F does not specify the variable X; it becomes clear from the context.
Consider also basis functions {b_r(·); r = 1, ..., a_n} with a_n → ∞ sufficiently slowly, for example, B-splines or regression splines. Consider further the space

(4)   F_n = { f ∈ F;  f = c + Σ_{r=1}^{a_n} α_r b_r(·)  with c, α_r ∈ R (r = 1, ..., a_n) }.

We allow for constants c to enforce mean zero for the whole function. Furthermore, the basis functions can be the same for all variables X_j, j = 1, ..., p.
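As a small illustration (our own sketch, not part of the paper), a space F_n with a_n B-spline basis functions per variable can be built with the R function splines::bs, and an additive fit within F_n^{⊕2} is then an ordinary least-squares problem; predict(..., type = "terms") returns the mean-centered additive components:

library(splines)
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- sin(2 * x1) + 0.5 * x2^2 + 0.2 * rnorm(n)
an  <- 5                                        # a_n basis functions per variable
fit <- lm(y ~ bs(x1, df = an) + bs(x2, df = an))
f1  <- predict(fit, type = "terms")[, 1]        # fitted additive component of x1
round(mean(f1), 10)                             # centered to mean zero over the sample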
For the theoretical analysis, we assume that F_n is deterministic and does not depend on the data. Then, F_n is closed. Furthermore, the spaces of additive functions are denoted by

F^{⊕ℓ} = { f : R^ℓ → R;  f(x) = Σ_{k=1}^{ℓ} f_k(x_k),  f_k ∈ F },
F_n^{⊕ℓ} = { f : R^ℓ → R;  f(x) = Σ_{k=1}^{ℓ} f_k(x_k),  f_k ∈ F_n },

where ℓ = 2, ..., p. Clearly, F_n^{⊕ℓ} ⊆ F^{⊕ℓ}. For f ∈ F^{⊕ℓ}, we denote by f_k its kth additive function.
In our definitions, we assume that the functions in F and Fn have expectation
zero. Of course, this depends on the variables in the arguments of the functions.
For example, when requiring E[f (Xj )] = 0 for f ∈ F , the function class F =
Fj depends on the index j due to the mean zero requirement; and likewise F ⊕
depends on the indices of the variables occurring in the corresponding additive
function terms. We drop this additional dependence on the index of variables as it
does not cause any problems in methodology or theory.
Later, we consider projections of distributions onto the spaces F^{⊕ℓ} and F_n^{⊕ℓ}; see (6). We assume throughout the paper that these spaces are closed with respect to the L_2 norm. The following Lemma 2 guarantees this condition by requiring an analogue of a minimal eigenvalue assumption.

LEMMA 2. Let the distribution P be generated according to (1) and assume that there is a φ^2 > 0 such that for all γ ∈ R^p

‖ Σ_{j=1}^p γ_j f_j(X_j) ‖_{L_2}^2 ≥ φ^2 ‖γ‖^2   for all f_j ∈ F with ‖f_j(X_j)‖_{L_2} = 1.

For any subset I ⊆ {1, ..., p} of variables, the spaces F^⊕ and F_n^⊕ are then closed with respect to the L_2(P_I) norm. Here, P_I denotes the marginal distribution over all variables in I.

The question of closedness of additive models has also been studied in [1], for
example; see also [29].

2.3. Order of variables and the likelihood. We can permute the variables, inducing a different ordering; in the sequel, we use both terminologies, permutations and order search, which mean the same thing. For a permutation π on {1, ..., p}, define

X^π,   X^π_j = X_{π(j)}.

There is a canonical correspondence between permutations and fully connected DAGs: for any permutation π, we can construct a DAG D^π in which each variable π(k) has a directed arrow to all π(j) with j > k. The node π(1) has no parents and is called the source node. For a given DAG D^0, we define the set of true permutations as

Π^0 = { π^0;  the fully connected DAG D^{π^0} is a super-DAG of D^0 },
where a super-DAG of D^0 is a DAG whose set of directed edges is a superset of the one corresponding to D^0. If the true DAG D^0 is not fully connected, there is typically more than one true order or permutation, that is, the true order is typically not unique. It is apparent that any true ordering or permutation π^0 allows for a lower-triangular (or autoregressive) representation of the model in (3):

(5)   X^{π^0}_j = Σ_{k=1}^{j−1} f^{π^0}_{j,k}(X^{π^0}_k) + ε^{π^0}_j   (j = 1, ..., p),

where f^{π^0}_{j,k}(·) = f^0_{π^0(j),π^0(k)}(·) and ε^{π^0}_j = ε_{π^0(j)}, that is, with permuted indices in terms of the original quantities in (3). If all functions f^0_{j,k}(·) are nonlinear, the set of true permutations is identifiable from the distribution ([26], Corollary 33), and Π^0 consists of all orderings of the variables which allow for a lower-triangular representation (5). We will exploit this fact in order to provide a consistent estimator π̂_n of the ordering: under suitable assumptions, the probability that π̂_n ∈ Π^0 converges to one.

REMARK 1. For the linear Gaussian SEM (2), all orderings allow for a lower-triangular representation (5), even those that are not in Π^0. Thus, we cannot construct a consistent estimator in the above sense. However, assuming faithfulness of the true distribution, the orderings of variables which are consistent with the arrow directions in a DAG of the Markov equivalence class of the true DAG D^0 lead to the sparsest representations with the fewest number of nonzero coefficients.
In principle, one can check whether the data come from a linear Gaussian SEM. Lemma 1 guarantees that if this is the case, there is no CAM with nonlinear functions yielding the same distribution. Thus, if the structural equations of the estimated

DAG look linear with Gaussian noise, one could decide to output the Markov
equivalence class instead of the DAG. One would need to quantify closeness to
linearity and Gaussianity with, for example, a test: this would be important for
practical applications, but its precise implementation lies beyond the scope of this
work.

In the sequel, it is helpful to consider the true underlying parameter θ^0 with corresponding nonlinear functions f^0_{j,k} and error variances (σ^0_j)^2. For any permutation π ∉ Π^0, we consider the projected parameters, defined as

θ^{π,0} = argmin_{θ^π} E_{θ^0}[ − log p^π_{θ^π}(X) ],

where the density p^π_{θ^π} is of the form

log p^π_{θ^π}(x) = log p_{θ^π}(x^π) = Σ_{j=1}^p log( (1/σ^π_j) ϕ( (x^π_j − Σ_{k=1}^{j−1} f^π_{j,k}(x^π_k)) / σ^π_j ) ).

(Note that if π ∈ Π^0, then θ^{π,0} = θ^0.) For such a misspecified model with wrong order π ∉ Π^0, we have

(6)   {f^{π,0}_{j,k}}_{k=1,...,j−1} = argmin_{g_{j,k}∈F, k=1,...,j−1} E_{θ^0}[ (X^π_j − Σ_{k=1}^{j−1} g_{j,k}(X^π_k))^2 ]
                                  = argmin_{g_j∈F^{⊕(j−1)}} E_{θ^0}[ (X^π_j − g_j(X^π_1, ..., X^π_{j−1}))^2 ].

It holds that

(7)   (σ^{π,0}_j)^2 = argmin_{σ^2} { log(σ) + (1/(2σ^2)) E_{θ^0}[ (X^π_j − Σ_{k=1}^{j−1} f^{π,0}_{j,k}(X^π_k))^2 ] }
                    = E_{θ^0}[ (X^π_j − Σ_{k=1}^{j−1} f^{π,0}_{j,k}(X^π_k))^2 ].

The two displayed formulae above show that autoregression with the wrong order π leads to the projected parameters {f^{π,0}_{j,k}} and {(σ^{π,0}_j)^2}. Finally, we obtain

E_{θ^0}[ − log p^π_{θ^{π,0}}(X) ] = Σ_{j=1}^p log(σ^{π,0}_j) + C,   C = p log(2π)^{1/2} + p/2.

All true permutations π ∈ Π^0 correspond to super-DAGs of the true DAG and, therefore, all of them lead to the minimal expected negative log-likelihood E_{θ^0}[− log p^π_{θ^{π,0}}(X)] = E_{θ^0}[− log p_{θ^0}(X)]. The permutations π ∉ Π^0, however, cannot lead to a smaller expected negative log-likelihood (since this would lead to a negative KL-divergence between the true and best projected distribution). Let us therefore define

(8)   ξ_p := min_{π∉Π^0} p^{−1} ( E_{θ^0}[− log p^π_{θ^{π,0}}(X)] − E_{θ^0}[− log p_{θ^0}(X)] ) ≥ 0.

If all true functions f^0_{j,k} are nonlinear, we obtain ξ_p > 0 as follows.

LEMMA 3. Consider a distribution P that allows for a density p with respect to the Lebesgue measure and is generated by model (1) with DAG D^0 and nonlinear, three times differentiable functions f^0_{j,k}. Assume further the condition from Lemma 2. Then ξ_p > 0.

PROOF. Because of the closedness of F^{⊕(j−1)} (Lemma 2), the minimum in (6) is obtained for some functions f_{j,k}. Without loss of generality, we can assume that all constant additive components are zero. But then ξ_p = 0 would contradict Lemma 1. □

The number ξ_p describes the degree of separation between the true model and its misspecification when using a wrong permutation. As discussed in Remark 1, ξ_p = 0 for the case of linear Gaussian SEMs. Formula (8) can be expressed as

(9)   ξ_p = min_{π∉Π^0} p^{−1} Σ_{j=1}^p ( log(σ^{π,0}_j) − log(σ^0_j) ) ≥ 0.

REMARK 2. Especially for situations where p is very large so that the factor p^{−1} is small, requiring a lower bound ξ_p > 0 can be overly restrictive. Instead of requiring a gap with the factor p^{−1} between the likelihood scores of the true distribution and all distributions corresponding to permutations, one can weaken this as follows. Consider H(D, D^0) = {j; pa_{D^0}(j) ⊈ pa_D(j)}. We require that

(10)   ξ_p := min_{D≠D^0} |H(D, D^0)|^{−1} Σ_{j∈H(D,D^0)} ( log(σ^{D,0}_j) − log(σ^0_j) ) ≥ 0,

where (σ^{D,0}_j)^2 is the error variance in the best additive approximation of X_j based on {X_k; k ∈ pa_D(j)}. Such a weaker gap condition is proposed in [13], Section 5.2. All our theoretical results still hold when replacing statements involving ξ_p in (9) by the corresponding statements with ξ_p in (10).

2.4. Maximum likelihood estimation for order: Low-dimensional setting. We assume having n i.i.d. realizations X^{(1)}, ..., X^{(n)} from model (3). For an n × 1 vector x = (x^{(1)}, ..., x^{(n)})^T, we denote by ‖x‖^2_{(n)} = n^{−1} Σ_{i=1}^n (x^{(i)})^2. Depending on the context, we sometimes denote by f̂ a function and sometimes an n × 1 vector evaluated at (the components of) the data points X^{(1)}, ..., X^{(n)}; and similarly for X^π_j. We consider the unpenalized maximum likelihood estimator:

f̂^π_j = argmin_{g_j∈F_n^{⊕(j−1)}} ‖ X^π_j − Σ_{k=1}^{j−1} g_{j,k}(X^π_k) ‖^2_{(n)},      (σ̂^π_j)^2 = ‖ X^π_j − Σ_{k=1}^{j−1} f̂^π_{j,k}(X^π_k) ‖^2_{(n)}.

Denote by π̂ a permutation which minimizes the unpenalized negative log-likelihood:

(11)   π̂ ∈ argmin_π Σ_{j=1}^p log(σ̂^π_j).

Estimation of f̂^π_j is based on F_n with pre-specified basis functions b_r(·), r = 1, ..., a_n. In practice, the basis functions could depend on the predictor variable or on the order of variables, for example, when choosing the knots in regression splines. The classical choice for the number of basis functions is a_n ≍ n^{1/5} for twice differentiable functions; here, however, and as explained in Section 4, a smaller number such as a_n = O(1) to detect some nonlinearity might be sufficient for estimation of the true underlying order.
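A brute-force sketch of the estimator in (11) for small p (our own helper functions; a small spline basis with a_n = 4 stands in for F_n) regresses X_{π(j)} on all of its predecessors and sums the log residual standard deviations over all nodes:

library(splines)

log_sigma_hat <- function(y, Z, an = 4) {       # log residual SD of y regressed additively on Z
  if (is.null(Z) || ncol(Z) == 0) return(log(sd(y)))
  B <- do.call(cbind, lapply(seq_len(ncol(Z)), function(k) bs(Z[, k], df = an)))
  log(sqrt(mean(residuals(lm(y ~ B))^2)))
}

score_order <- function(perm, X, an = 4) {      # criterion in (11) for one permutation
  sum(sapply(seq_along(perm), function(j) {
    log_sigma_hat(X[, perm[j]], X[, perm[seq_len(j - 1)], drop = FALSE], an)
  }))
}

all_perms <- function(v) {                      # tiny permutation generator (feasible for small p only)
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    out <- c(out, lapply(all_perms(v[-i]), function(q) c(v[i], q)))
  out
}

## usage, for a data matrix X with few columns:
## perms <- all_perms(seq_len(ncol(X)))
## pihat <- perms[[which.min(sapply(perms, score_order, X = X))]]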

2.5. Sparse regression for feature selection. Section 4 presents assumptions and results ensuring that, with high probability, π̂ = π^0 for some π^0 ∈ Π^0. With such an estimated order π̂, we obtain a complete super-DAG (super-graph) D^π̂ of the underlying DAG D^0 in (3), where the parents of a node π̂(j) are defined as pa_{D^π̂}(π̂(j)) = {π̂(k); k < j} for all j. We can pursue consistent estimation of intervention distributions based on D^π̂ without any additional need to find the true underlying DAG D^0; see Section 2.6.
However, we can improve statistical efficiency for estimating the intervention
distribution when it is ideally based on the true DAG D 0 or realistically a not too
large super-DAG D̂ π̂ ⊇ D 0 . The task of estimating such a super-DAG D̂ π̂ ⊇ D 0
is conceptually straightforward: starting from the complete super-DAG D π̂ of D 0
as discussed above, we can use model selection or a penalized multivariate (auto-)
regression technique in the model representation (5). For additive model fitting, we
can either use hypothesis testing for additive models [15] or the Group Lasso [28],
or its improved version with a sparsity-smoothness penalty proposed in [16]. All
the techniques mentioned above perform variable selection, where we denote by

D̂^π̂ = { (π̂(k), π̂(j));  f̂^π̂_{j,k} ≢ c }

(the constant c = 0 when assuming that the f̂^π̂_{j,k} have mean zero when evaluated over all data points) the selected variables indexed in the original order [we obtain estimates f̂^π̂_{j,k} in the representation (5) with correspondence to the indices π̂(k), π̂(j) in the original order]; we identify these selected variables in D̂^π̂ as the edge set of a DAG. For example, with the Group Lasso, assuming some condition avoiding near collinearity of functions, that is, a compatibility condition for the Group Lasso ([4], Chapter 5.6, Theorem 8.2), and that the ℓ_2-norms of the nonzero functions are sufficiently large, we obtain the screening property (since we implicitly assume that π̂ ∈ Π^0 with high probability): with high probability and asymptotically tending to one,

(12)   D̂^π̂ ⊇ D^0 = { (k, j);  f^0_{j,k} ≢ 0 },

saying that all relevant variables (i.e., edges) are selected. Similarly with hypothesis testing, assuming that the nonzero f^0_{j,k} have sufficiently large ℓ_2-norms, we also obtain that (12) holds with high probability.
The same argumentation applies if we use D^π̂_{restr} from Section 3.2 instead of D^π̂ as an initial estimate. This then results in D̂^π̂_{restr}, replacing D̂^π̂ above.

2.6. Consistent estimation of causal effects. The property in (12) has an important implication for causal inference:4 all estimated causal effects and estimated intervention distributions based on the estimated DAG D̂^π̂ are consistent. In fact, using the do-calculus (cf. [23], (3.10)), we have for the single-intervention (at variable X_k) distribution of X_j, for all j ≠ k:

p_{D^0}(x_j | do(X_k = x)) = p_{D̂^π̂}(x_j | do(X_k = x))   for all x,

where p_D(· | do(·)) denotes the intervention density based on a DAG D.
We note that the screening property (12) also holds when replacing D̂ π̂ with the
full DAG induced by π̂ , denoted by D π̂ . Thus, the feature selection step in Sec-
tion 2.5 is not needed to achieve consistent estimation of causal effects. However, a
smaller DAG D 0 ⊆ D̂ π̂ ⊆ D π̂ typically leads to better (more statistically efficient)
estimates of the interventional distributions than the full DAG D π̂ .
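The following toy sketch (ours, not from the paper) illustrates the point for a three-variable chain: once a (super-)DAG and its additive structural equations are fitted, an intervention quantity such as E[X_3 | do(X_1 = x)] is obtained by forward simulation through the fitted model.

library(mgcv)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 2 * tanh(x1) + 0.4 * rnorm(n)
x3 <- sin(2 * x2) + 0.4 * rnorm(n)            # true DAG: X1 -> X2 -> X3

fit2 <- gam(x2 ~ s(x1)); sd2 <- sqrt(mean(residuals(fit2)^2))   # fitted structural equations
fit3 <- gam(x3 ~ s(x2)); sd3 <- sqrt(mean(residuals(fit3)^2))

Edo <- function(x, m = 2000) {                # estimate E[X3 | do(X1 = x)] by forward sampling
  x2s <- as.numeric(predict(fit2, data.frame(x1 = x))) + rnorm(m, 0, sd2)
  x3s <- as.numeric(predict(fit3, data.frame(x2 = x2s))) + rnorm(m, 0, sd3)
  mean(x3s)
}
Edo(1) - Edo(0)                               # estimated total causal effect of setting X1 = 1 vs 0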

3. Restricted maximum likelihood estimation: Computational and statistical benefits. We present here maximum likelihood estimation where we restrict the permutations, instead of searching over all permutations in (11). Such a restriction makes the computation more tractable, and it is also statistically crucial when dealing with high-dimensional settings where p > n.

3.1. Preliminary neighborhood selection. We first perform neighborhood se-


lection with additive models, following the general idea in [17] for the linear
Gaussian case. We pursue variable selection in an additive model of Xj versus all
other variables X{−j } = {Xk ; k = j }: a natural method for such a feature selection
is the Group Lasso for additive models [28], ideally with a sparsity-smoothness
penalty [16]; see also [40]. This provides us with a set of variables
Âj ⊆ {1, . . . , p} \ j

4 We assume that interventions at variables do not change the other structural equations, and that
there are no unobserved hidden (e.g., confounder) variables.

which denotes the selected variables in the estimated conditional expectation

Ê_add[X_j | X_{−j}] = Σ_{k∈Â_j} ĥ_{jk}(X_k)

with functions ĥ_{jk} satisfying n^{−1} Σ_{i=1}^n ĥ_{jk}(X^{(i)}_k) = 0 (i.e., a possible intercept is subtracted already); that is,

Â_j = { k;  k ≠ j,  ĥ_{j,k} ≢ 0 }.

We emphasize that the functions ĥ_{j,k}(·) are different from the f̂^π_{j,k}(·) in Section 2.4 because for the former, the additive regression is against all other variables.
We give conditions in Section 4.2 (see Lemma 4) ensuring that the neighborhood selection set contains the parental variables from the structural equation model in (1) or (3), that is, Â_j ⊇ pa(j).

3.2. Restricted maximum likelihood estimator. We restrict the space of permutations in the definition of (11) such that they are “compatible” with the neighborhood selection sets Â_j. Note that for the estimator σ̂^π_j in (11), we regress X_{π(j)} against {X_k; k ∈ {π(j − 1), ..., π(1)}}. We restrict here the set of regressors to the indices R_{π,j} = {π(j − 1), ..., π(1)} ∩ Â_{π(j)}. We then calculate the π(j)th term of the log-likelihood using the set of regressors X_{R_{π,j}} = {X_k; k ∈ R_{π,j}}. More precisely, we estimate

f̂^{π,R}_j = argmin_{g_{j,k}∈F_n} ‖ X^π_j − Σ_{k: π(k)∈R_{π,j}} g_{j,k}(X^π_k) ‖^2_{(n)},
(σ̂^{π,R}_j)^2 = ‖ X^π_j − Σ_{k: π(k)∈R_{π,j}} f̂^{π,R}_{j,k}(X^π_k) ‖^2_{(n)},

and the restricted maximum likelihood estimator is

(13)   π̂ ∈ argmin_π Σ_{j=1}^p log(σ̂^{π,R}_j).

If max_j |Â_j| < n, the estimators σ̂^{π,R}_j are well defined.
The computation of the restricted maximum likelihood estimator in (13) is substantially easier than for the unrestricted MLE (11) if max_j |Â_j| is small (which is ensured if the true neighborhoods are sparse). The set of all permutations can be partitioned into equivalence classes ∪_r R_r, and the minimization in (13) can be restricted to single representatives of each equivalence class R_r. The equivalence relation can be formulated with a restricted DAG D^π_{restr} whose parental set for node π(j) equals pa_{D^π_{restr}}(π(j)) = R_{π,j}. We then have that

π ∼ π′   if and only if   D^π_{restr} = D^{π′}_{restr}.
Computational details are described in Section 5.
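As a minimal sketch (our own code, with hypothetical helper names), the restricted criterion in (13) can be evaluated for a given permutation once the candidate sets Â_j from Section 3.1 are available as a list Ahat:

library(splines)

restricted_score <- function(perm, X, Ahat, an = 4) {
  sum(sapply(seq_along(perm), function(j) {
    R <- intersect(perm[seq_len(j - 1)], Ahat[[perm[j]]])   # predecessors restricted to \hat A_{pi(j)}
    if (length(R) == 0) return(log(sd(X[, perm[j]])))
    B <- do.call(cbind, lapply(R, function(k) bs(X[, k], df = an)))
    log(sqrt(mean(residuals(lm(X[, perm[j]] ~ B))^2)))
  }))
}
## Two permutations inducing the same restricted DAG give the same value,
## so the minimization only needs one representative per equivalence class.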

4. Consistency in correct and misspecified models. We prove consistency for the ordering among variables in additive structural equation models, and, under an additional identifiability assumption, even for the case where the model is misspecified with respect to the error distribution or when using highly biased function estimation.

4.1. Unrestricted MLE for low-dimensional settings. We first consider the low-dimensional setting where p < ∞ is fixed and n → ∞, and we establish consistency of the unrestricted MLE in (11). We assume the following:

(A1) Consider a partition of the real line R = ∪_{m=1}^∞ I_m into disjoint intervals I_m. The individual functions in F are α-times differentiable, with α ≥ 1, whose derivatives up to order α are bounded in absolute value by M_m on I_m.
(A2) Tail and moment conditions:
(i) For V = 1/α and M_m as in (A1):

Σ_{m=1}^∞ M_m^2 (P[X_j ∈ I_m])^{V/(V+2)} < ∞,   j = 1, ..., p.

(ii)

E|X_j|^4 < ∞,   j = 1, ..., p,
sup_{f∈F} E[f(X_j)^4] < ∞,   j = 1, ..., p.

(A3) The error variances satisfy (σ^{π,0}_j)^2 > 0 for all j = 1, ..., p and all π.
(A4) The true functions f^0_{j,k} can be approximated on any compact set C ⊂ R: for all k ∈ pa_{D^0}(j), j = 1, ..., p,

E[ (f^0_{j,k}(X_k) − f^0_{n;j,k}(X_k))^2 I(X_k ∈ C) ] = o(1),

where

f^0_{n;j} = argmin_{g_j∈F_n^{⊕(j−1)}} E[ (X_j − Σ_{k∈pa_{D^0}(j)} g_{j,k}(X_k))^2 ].

None of these assumptions is very restrictive. The second part of assumption (A2)(ii) holds if we assume, for example, a bounded function class F, or if |f(x)| ≲ |x| as |x| → ∞ for all f ∈ F.

THEOREM 1. Consider an additive structural equation model as in (3). Assume (A1)–(A4) and ξ_p > 0 in (8) (see also Lemma 3 and Remark 2). Then we have

P[π̂ ∈ Π^0] → 1   (n → ∞).

A proof is given in the supplemental article [3]. Theorem 1 says that one can
find a correct order among the variables without pursuing feature or edge selection
for the structure in the SEM.

REMARK 3. Studying nearly nonidentifiable models, for example, near linearity in a Gaussian structural equation model, can be modelled by allowing ξ_p = ξ_{n,p} to converge to zero as n → ∞. If one requires ξ_{n,p} ≫ n^{−1/2}, the statement of Theorem 1 still holds. We note that Theorem 3 for the high-dimensional case implicitly allows ξ_p = ξ_{p_n} to change with sample size n. However, it is a nontrivial issue to translate such a condition in terms of closeness of one or several nonlinear functions f^0_{j,k} to their closest linear approximations. Similarly, if some error variances σ^{π,0}_j were close to zero (e.g., converged to zero as n → ∞), this could cause identifiability problems such that ξ_p might be close to (e.g., converge fast to) zero.

Related to Remark 3 is the question of uniform convergence in the statement of Theorem 1, over a whole class of structural equation models. This can be ensured by strengthening the assumptions to hold uniformly:
(U1) The quantities in (A2)(i) and (ii) are upper-bounded by positive constants C_1 < ∞, C_2 < ∞ and C_3 < ∞.
(U2) The error variances in (A3) are lower-bounded by a constant L > 0.
(U3) The approximation in (A4) holds uniformly over a class of functions F: for any compact set C and any j, k:

sup_{f^0∈F} E[ (f^0_{j,k}(X_k) − f^0_{n;j,k}(X_k))^2 I(X_k ∈ C) ] = o(1).

(U4) The constant ξ_p ≥ B > 0 for some constant B > 0.

Denote the class of distributions of additive SEMs which satisfy (U1)–(U4) by P(C_1, C_2, C_3, F, L, B). We then obtain the uniform convergence result

(14)   inf_{P∈P(C_1,C_2,C_3,F,L,B)} P_P[π̂ ∈ Π^0] → 1   (n → ∞).

This can be shown exactly along the lines of the proof of Theorem 1 in the supplemental article [3].

4.1.1. Misspecified error distribution and biased function estimation. Theorem 1 generalizes to the situation where the model in (3) is misspecified and the truth has independent, non-Gaussian errors ε_1, ..., ε_p with E[ε_j] = 0. As in Theorem 1, we make the assumption ξ_p > 0 in (9): its justification, however, is somewhat less backed up because the identifiability results from [26] and Lemma 3 do not carry over immediately. The latter results say that the set of correct orderings Π^0 can be identified from the distribution of X_1, ..., X_p, but we require in (9) that identifiability is given in terms of all the error variances, that is, involving only second moments. It is an open problem whether (or for which subclass of models) identifiability from the distribution carries over to automatically ensure that ξ_p > 0 in (9).
Furthermore, assume that the number of basis functions a_n for functions in F_n is small, such that assumption (A4) does not hold, for example, a_n = O(1). We denote by

(σ^{π,0,a_n}_j)^2 = min_{g_j∈F_n^{⊕(j−1)}} E_{θ^0}[ (X^π_j − Σ_{k=1}^{j−1} g_{j,k}(X^π_k))^2 ],

which is larger than (σ^{π,0}_j)^2 in (7). Instead of (9), we then consider

(15)   ξ_p^{a_n} := min_{π∉Π^0, π^0∈Π^0} p^{−1} Σ_{j=1}^p ( log(σ^{π,0,a_n}_j) − log(σ^{π^0,0,a_n}_j) ).

Requiring

lim inf_{n→∞} ξ_p^{a_n} > 0

is still reasonable: if (9) with ξ_p > 0 holds because of nonlinearity of the additive functions [26] (and see the interpretation above for non-Gaussian errors), we believe that it typically also holds for the best projected additive functions in F_n^{⊕} as long as some nonlinearity is present when using a_n basis functions; here, the best projected additive function for the jth variable X^π_j is defined as f^π_{n;j} = argmin_{g_j∈F_n^{⊕(j−1)}} E[(X^π_j − Σ_{k=1}^{j−1} g_{j,k}(X^π_k))^2]. We also note that for a_n → ∞, even when diverging very slowly, and assuming (A4), we have that ξ_p^{a_n} → ξ_p and thus lim inf_{n→∞} ξ_p^{a_n} > 0. In general, the choice of the number of basis functions a_n is a trade-off between identifiability (due to nonlinearity) and estimation accuracy: for a_n small we might have a smaller value in (15), that is, it might be that ξ_p^{a_n′} ≤ ξ_p^{a_n} for a_n′ ≤ a_n, which makes identifiability harder but exhibits less variability in estimation, and vice versa. In particular, the trade-off between identifiability and variance might be rather different from the classical bias–variance trade-off with respect to prediction in classical function estimation. A low complexity (with a_n small) might be better than a prediction-optimal number of basis functions.
Theorem 2 below establishes the consistency for order estimation in an additive
structural equation model with potentially non-Gaussian errors, even when the ex-
pansion for function estimation is truncated at few basis functions.

THEOREM 2. Consider an additive structural equation model as in (3) but with independent, potentially non-Gaussian errors ε_1, ..., ε_p having E[ε_j] = 0 (j = 1, ..., p). Assume either of the following:
1. (A1)–(A4) hold, and ξ_p > 0 in formula (9) (see also Remark 2).
2. (A1)–(A3) hold, and lim inf_{n→∞} ξ_p^{a_n} > 0 in formula (15).
Then

P[π̂ ∈ Π^0] → 1   (n → ∞).

A proof is given in the supplemental article [3]. Again, as in the discussion of Theorem 1, one can obtain uniform convergence by strengthening the assumptions to hold uniformly over a class of distributions.

4.2. Restricted MLE for sparse high-dimensional setting. We consider here the restricted MLE in (13) and show that it can cope with high-dimensional settings where p ≫ n.
The model in (1) is now assumed to change with sample size n: the dimension is p = p_n and the parameter θ = θ_n depends on n. We consider the limit as n → ∞, allowing a diverging dimension p_n → ∞ where p_n ≫ n. For notational simplicity, we often drop the sub-index n.
We make a few additional assumptions. When fitting an additive model of X_j versus all other variables X_{−j}, the target of such an estimation is the best approximating additive function:

E_add[X_j | X_{−j}] = Σ_{k∈{−j}} h*_{jk}(X_k),
{h*_{jk}; k ∈ {−j}} = argmin_{h_j∈F^{⊕(p−1)}} E[ (X_j − Σ_{k∈{−j}} h_{jk}(X_k))^2 ].

In general, some variables are irrelevant, and we denote the set of relevant variables by A_j: A_j ⊆ {1, ..., p} \ {j} is the (or a) smallest set5 such that

E_add[X_j | X_{−j}] = E_add[X_j | X_{A_j}].

We assume the following:
(B1) For all j = 1, ..., p and all k ∈ pa(j),

E_add[ X_j − E_add[X_j | X_{A_j\{k}}] | X_k ] ≢ 0.

Assumption (B1) requires, for each j = 1, ..., p, that X_k [k ∈ pa(j)] has an additive influence on X_j given all additive effects from X_{A_j\{k}}.

L EMMA 4. Assume that (B1) holds. Then, for all j = 1, . . . , p: pa(j ) ⊆ Aj .

5 Uniqueness of such a set is not a requirement but implicitly ensured by the compatibility condition
and sparsity which we invoke to guarantee (B2)(ii).

A proof is given in the supplemental article [3]. Lemma 4 justifies, for the population case, pursuing preliminary neighborhood selection followed by restricted maximum likelihood estimation: because pa(j) ⊆ A_j, the restriction in the maximum likelihood estimator is appropriate, and a true permutation π^0 ∈ Π^0 leads to a valid restriction R_{π^0,j} ⊇ pa(π^0(j)) (when defined with the population sets A_j).
For estimation, we assume the following:
(B2) The selected variables in Âj from neighborhood selection satisfy: with
probability tending to 1 as n → ∞,
(i) Âj ⊇ Aj (j = 1, . . . , p),
(ii) maxj =1,...,p |Âj | ≤ M < ∞ for some positive constant M < ∞.
Assumption (B2)(i) is a rather standard screening assumption. It holds for the
Group Lasso with sparsity-smoothness penalty: using a basis expansion as in (4),
the condition is implied by a sparsity assumption, a group compatibility condi-
tion (for the basis vectors), and a beta-min condition about the minimal size of the ℓ_2-norm of the coefficients for the basis functions of the active variables in
Aj ; see [4], Chapter 5.6, Theorem 8.2. The sparsity and group compatibility con-
dition ensure identifiability of the active set, and hence, they exclude concurvity
(or collinearity) among the additive functions in the structural equation model.
Assumption (B2)(ii) can be ensured by assuming maxj |Aj | ≤ M1 < ∞ for some
positive constant M1 < ∞ and, for example, group restricted eigenvalue assump-
tions for the design matrix (with the given basis); see [39, 45] for the case without
groups.
Finally, we need to strengthen assumptions (A2) and (A3).
(B3) (i) For B ⊆ {1, ..., p} \ {j} with |B| ≤ M, with M as in (B2), denote by h^B_{j,g} = (X_j − Σ_{k∈B} g_k(X_k))^2. For some 0 < K < ∞, it holds that

max_{j=1,...,p}  max_{B⊆{1,...,p}\{j}, |B|≤M}  sup_{g∈F^{⊕|B|}} ρ_K(h^B_{j,g}) ≤ D_1 < ∞,

where

ρ_K^2(h^B_{j,g}) = 2K^2 E_{θ^0}[ exp(|h^B_{j,g}(X)|/K) − 1 − |h^B_{j,g}(X)|/K ].

(ii) For V = 1/α,

max_{j=1,...,p} ( Σ_{m=1}^∞ M_m^2 (P[X_j ∈ I_m])^{V/(V+4)} )^{(V+4)/8} ≤ D_2 < ∞.

This assumption is typically weaker than what we require in (B3)(i), when assuming that the values M_m are reasonable (e.g., bounded).
(iii)

max_j E|X_j|^4 ≤ D_3 < ∞,   max_j sup_{f∈F} E[f(X_j)^4] ≤ D_4 < ∞.

(B4) The error variances satisfy min_π min_j (σ^{π,0}_j)^2 ≥ L > 0.

Assumption (B3)(i) requires exponential moments. We note that the sum of additive functions over the set B is finite. Thus, we essentially require exponential moments for the square of finite sums of additive functions.

THEOREM 3. Consider an additive structural equation model as in (3) with independent, potentially non-Gaussian errors ε_1, ..., ε_p having E[ε_j] = 0 (j = 1, ..., p). Assume either of the following:
1. (A1), (A4) and (B1)–(B4) hold, and for ξ_p in (9) (see also Remark 2):

max{ log(p)/n,  max_{j,k} E[ (f^0_{j,k}(X_k) − f^0_{n;j,k}(X_k))^2 ] } = o(ξ_p).

2. (A1), (A4) and (B1)–(B4) hold, and for ξ_p^{a_n} in (15):

max{ log(p)/n,  max_{j,k} E[ (f^0_{j,k}(X_k) − f^0_{n;j,k}(X_k))^2 ] } = o(ξ_p^{a_n}).

Then, for the restricted maximum likelihood estimator in (13):

P[π̂ ∈ Π^0] → 1   (n → ∞).

A proof is given in the supplemental article [3]. The assumption that E[(f^0_{j,k}(X_k) − f^0_{n;j,k}(X_k))^2] is of sufficiently small order can be ensured by the following condition.
(B_add) Consider the basis functions b_r(·) appearing in F_n: for the true functions f^0_{j,k} ∈ F, we assume an expansion

f^0_{j,k}(x) = Σ_{r=1}^∞ α_{f^0_{j,k};r} b_r(x)

with the smoothness condition

Σ_{r=k}^∞ |α_{f^0_{j,k};r}| ≤ C k^{−β}.

Assuming (B_add), we have that E[(f^0_{j,k}(X_k) − f^0_{n;j,k}(X_k))^2] = O(a_n^{−(β−1−κ)}) for any κ > 0; for example, when using a_n → ∞ and for β > 1, E[(f^0_{j,k}(X_k) − f^0_{n;j,k}(X_k))^2] → 0.
Uniform convergence can be obtained exactly as described after the discussion of Theorem 1: when requiring the additional uniform versions (U3)–(U4) [since (B3) and (B4) already invoke uniform bounds, we do not need (U1) and (U2)], and requiring uniform convergence of the probability in (B2), we obtain uniform convergence over the corresponding class of distributions analogously to (14).

FIG. 1. Step PNS. For each variable the set of possible parents is reduced (in this plot, a directed edge from X_k to X_j indicates that X_k is a selected variable in Â_j and a possible parent of X_j). This reduction leads to a considerable computational gain in the remaining steps of the procedure.

5. Computation and implementation. In Section 2, we have decomposed the problem of learning DAGs from observational data into two main parts: finding
the correct order (Section 2.4) and feature selection (Section 2.5). Our algorithm
and implementation consists of two corresponding parts: IncEdge is a greedy pro-
cedure providing an estimate π̂ for equation (11) and Prune performs the feature
selection. Section 3.1 discusses the benefits of performing a preliminary neighbor-
hood selection before estimating the causal order, and we call the corresponding
part PNS. The combination PNS + IncEdge provides an estimate for equation (13).
The three components of our implementation are described in the following
subsections, Figures 1, 2 and 3 present the steps graphically. We regard the modular
structure of the implementation as an advantage; each of the three parts could be
replaced by an alternative method (as indicated in the subsections below).

5.1. Preliminary neighborhood selection: PNS. As described in Section 3.1, we fit an additive model for each variable X_j against all other variables X_{−j}. We implement this with a boosting method for additive model fitting [2, 5], using the R function gamboost from the package mboost [9]. We select the ten variables that have been picked most often during 100 iterations of the boosting method; hereby, we only consider variables that have been picked at least three times during the iterations. The sets Â_j obtained by this procedure estimate A_j ⊇ pa(j) as shown in Lemma 4. We construct a graph in which, for each j, the set Â_j is the parental set for node j corresponding to the variable X_j. Figure 1 summarizes this step. We say that the set of “possible parents” of node j has been reduced to the set Â_j. Importantly, we do not disregard true parents if the sample size is large enough (Section 4.2, Lemma 4). Instead of the boosting method, we could alternatively use additive model fitting with a sparsity- or sparsity-smoothness penalty [16, 28].

FIG. 2. Step IncEdge. At each iteration the edge leading to the largest decrease of the negative log-likelihood is included.

FIG. 3. Step Prune. For each node, variable selection techniques are exploited to remove nonrelevant edges.
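A minimal stand-in for the PNS step (our own componentwise L2-boosting with smoothing-spline base learners; the paper's implementation uses mboost::gamboost, and the function name and defaults below are ours) selects, for one response node, the candidate parents picked most often over the boosting iterations:

pns_one_node <- function(j, X, mstop = 100, nu = 0.1, df = 4,
                         max_parents = 10, min_picks = 3) {
  r <- X[, j] - mean(X[, j])                  # current residual (response is centered)
  cand  <- setdiff(seq_len(ncol(X)), j)
  picks <- integer(0)
  for (m in seq_len(mstop)) {
    rss <- sapply(cand, function(k) {         # best-fitting single smooth term
      f <- smooth.spline(X[, k], r, df = df)
      sum((r - predict(f, X[, k])$y)^2)
    })
    k <- cand[which.min(rss)]
    f <- smooth.spline(X[, k], r, df = df)
    r <- r - nu * predict(f, X[, k])$y        # componentwise boosting update with step size nu
    picks <- c(picks, k)
  }
  tab  <- sort(table(picks), decreasing = TRUE)
  keep <- as.integer(names(tab)[tab >= min_picks])
  head(keep, max_parents)                     # estimated candidate-parent set \hat A_j
}
## Ahat <- lapply(seq_len(ncol(X)), pns_one_node, X = X)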

5.2. Estimating the correct order by greedy search: IncEdge. Let us first consider the situation without PNS. Searching over all permutations π for finding π̂ in (11) is computationally infeasible if the number of variables p is large. We propose a greedy estimation procedure that starts with an empty DAG and adds at each iteration the edge k → j between nodes k and j that corresponds to the largest gain in log-likelihood. We therefore compute the score function in (11), with π corresponding to the current DAG,

Σ_{j=1}^p log(σ̂^π_j) = Σ_{j=1}^p log( ‖ X^π_j − Σ_{k=1}^{j−1} f̂^π_{j,k}(X^π_k) ‖_{(n)} ),

and construct a matrix whose entry (k, j) specifies by how much this score is reduced after adding the edge k → j and, therefore, allowing a nonconstant function f_{j,k} (see Figure 2). For the implementation, we employ additive model fitting with penalized regression splines (with ten basis functions per variable), using the R function gam from the R package mgcv, in order to obtain estimates f̂_{j,k} and σ̂_j. After the addition of an edge, we only need to recompute the jth column of the score matrix (see Figure 2) since the score decomposes over all nodes. In order to avoid cycles, we remove further entries of the score matrix. After p(p − 1)/2 iterations, the graph has been completed to a fully connected DAG. The latter corresponds to a unique permutation π̂. This algorithm is computationally rather efficient and can easily handle graphs of up to 30 nodes without PNS (see Section 6.1.2).
If we have performed PNS as in Section 5.1, we sparsify the score matrix from the beginning: we only consider entries (k, j) for which k is considered to be a possible parent of j. This way, the algorithm is feasible for up to a few thousand nodes (see Section 6.1.3).
Alternative methods for (low-dimensional) additive model fitting include backfitting (cf. [14]).
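For concreteness, a simplified version of IncEdge could look as follows (our own code; unlike the paper's implementation, it recomputes scores instead of caching the score matrix, and it stops as soon as no edge improves the score rather than completing a fully connected DAG):

library(mgcv)

node_score <- function(j, parents, X) {            # log residual SD of X_j given its parents
  if (length(parents) == 0) return(log(sd(X[, j])))
  d <- as.data.frame(X[, parents, drop = FALSE])
  names(d) <- paste0("x", seq_along(parents))
  d$y <- X[, j]
  rhs <- paste0("s(", names(d)[seq_along(parents)], ")", collapse = " + ")
  fit <- gam(as.formula(paste("y ~", rhs)), data = d)
  log(sqrt(mean(residuals(fit)^2)))
}

creates_cycle <- function(k, j, pa) {              # would adding k -> j close a directed cycle,
  frontier <- j; seen <- j                         # i.e., is k already reachable from j?
  while (length(frontier) > 0) {
    kids <- setdiff(which(sapply(pa, function(s) any(s %in% frontier))), seen)
    if (k %in% kids) return(TRUE)
    seen <- c(seen, kids); frontier <- kids
  }
  FALSE
}

inc_edge <- function(X, allowed = NULL) {          # allowed[[j]]: candidate parents of j (from PNS)
  p  <- ncol(X)
  pa <- replicate(p, integer(0), simplify = FALSE)
  cur <- sapply(seq_len(p), function(j) node_score(j, integer(0), X))
  repeat {
    best <- NULL; best_gain <- 1e-6
    for (j in seq_len(p)) {
      cand <- setdiff(if (is.null(allowed)) seq_len(p) else allowed[[j]], c(j, pa[[j]]))
      for (k in cand) {
        if (creates_cycle(k, j, pa)) next
        gain <- cur[j] - node_score(j, c(pa[[j]], k), X)
        if (gain > best_gain) { best_gain <- gain; best <- c(k, j) }
      }
    }
    if (is.null(best)) return(pa)                  # parent sets; any topological order gives pi-hat
    pa[[best[2]]] <- c(pa[[best[2]]], best[1])
    cur[best[2]]  <- node_score(best[2], pa[[best[2]]], X)
  }
}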

5.3. Pruning of the DAG by feature selection: Prune. Section 2.5 describes sparse regression techniques for pruning the DAG that has been estimated by step IncEdge; see Figure 3. We implement this task by applying significance testing of covariates, based on the R function gam from the R package mgcv, and declaring significance if the reported p-values are lower than or equal to 0.001, independently of the sample size (for problems with small sample size, the p-value threshold should be increased).
If the DAG estimated by (PNS and) IncEdge is a super-DAG of the true DAG, the estimated interventional distributions are correct; see Section 2.6. This does not change if Prune removes additional “superfluous” edges. The structural Hamming distance to the true graph, however, may be reduced significantly by performing Prune; see Section 6.1.2. Alternative methods for hypothesis testing in (low-dimensional) additive models are possible (cf. [43]), or one could use penalized additive model fitting for variable selection [16, 28, 44].
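A sketch of Prune along these lines (our own helper; it relies on the per-term p-values reported by summary.gam) refits each node on its current parents and keeps only the significant edges:

library(mgcv)

prune_node <- function(j, parents, X, alpha = 0.001) {
  if (length(parents) == 0) return(integer(0))
  d <- as.data.frame(X[, parents, drop = FALSE])
  names(d) <- paste0("x", seq_along(parents))
  d$y <- X[, j]
  rhs <- paste0("s(", names(d)[seq_along(parents)], ")", collapse = " + ")
  fit <- gam(as.formula(paste("y ~", rhs)), data = d)
  pv  <- summary(fit)$s.table[, "p-value"]          # p-values of the smooth terms
  parents[pv <= alpha]
}
## pruned_pa <- lapply(seq_len(ncol(X)), function(j) prune_node(j, pa[[j]], X))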

6. Numerical results.

6.1. Simulated data. We show the effectiveness of each step in our algorithm
(Section 6.1.2) and compare the whole procedure to other state of the art methods
(Section 6.1.3). We investigate empirically the role of noninjective functions (Sec-
tion 6.1.4) and discuss the linear Gaussian case (Section 6.1.5). In Section 6.1.6,
we further check the robustness of our method against model misspecification, that
is, in the case of non-Gaussian noise or nonadditive functions. For evaluation, we
compute the structural intervention distance that we introduce in Section 6.1.1.
For simulating data, we randomly choose a correct ordering π^0 and connect each pair of variables (nodes) with a probability p_conn. If not stated otherwise, each of the possible p(p − 1)/2 connections is included with a probability of p_conn = 2/(p − 1), resulting in a sparse DAG with an expected number of p edges. Given the structure, we draw the functions f_{j,k} from a Gaussian process with a Gaussian (or RBF) kernel with bandwidth one and add Gaussian noise with standard deviation uniformly sampled between 1/5 and √2/5. All nodes without parents have a standard deviation between 1 and √2. The experiments are based on 100 repetitions if the description does not say differently.
All code is provided on the second author’s homepage.
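A condensed version of this simulation design might look as follows (our own sketch; the Gaussian-process draws are taken at the observed data points and the noise levels follow the description above):

set.seed(1)
p <- 10; n <- 200
perm0 <- sample(p)                                   # a correct causal order
A <- matrix(0, p, p)                                 # adjacency: A[k, j] = 1 means k -> j
for (a in 1:(p - 1)) for (b in (a + 1):p)
  if (runif(1) < 2 / (p - 1)) A[perm0[a], perm0[b]] <- 1

gp_draw <- function(x) {                             # one GP sample path, evaluated at x
  K <- exp(-outer(x, x, "-")^2 / 2)                  # RBF kernel with bandwidth one
  as.vector(t(chol(K + 1e-6 * diag(length(x)))) %*% rnorm(length(x)))
}

X <- matrix(0, n, p)
for (b in 1:p) {                                     # generate variables in causal order
  j   <- perm0[b]
  pa  <- which(A[, j] == 1)
  sdj <- if (length(pa) == 0) runif(1, 1, sqrt(2)) else runif(1, 1/5, sqrt(2)/5)
  X[, j] <- rnorm(n, 0, sdj)                         # noise (or source variable)
  for (k in pa) X[, j] <- X[, j] + gp_draw(X[, k])   # add nonlinear edge functions
}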

6.1.1. Structural intervention distance. As a performance measure, we con-


sider the recently proposed structural intervention distance (SID); see [24]. The
SID is well suited for quantifying the correctness of an order among variables,
mainly in terms of inferring causal effects afterward. It counts the number of
wrongly estimated causal effects. Thus, the SID between the true DAG D 0 and
the fully connected DAGs corresponding to the true permutations π 0 ∈ 0 is zero;
see Section 2.6.

6.1.2. Effectiveness of preliminary neighborhood selection and pruning. We


demonstrate the effect of the individual steps of our algorithm. Figure 4 shows the
performance (in terms of SID and SHD) of our method and the corresponding time
consumption (using eight cores) depending on which of the steps are performed.
If only IncEdge is used, the SHD is usually large because the output is a fully con-
nected graph. Only after the step Prune the SHD becomes small. As discussed in
Section 2.6 the pruning does not make a big difference for the SID. Performing
these two steps is not feasible for large p. The time consumption is reduced signif-
icantly if we first apply the preliminary neighborhood selection PNS. In particular,
this first step is required in the case of p > n in order to avoid a degeneration of
the score function.

6.1.3. Comparison to existing methods. Different procedures have been pro-


posed to address the problem of inferring causal graphs from a joint observa-
tional distribution. We compare the performance of our method to greedy equiv-
alence search (GES) [6], the PC algorithm [34], the conservative PC algorithm
(CPC) [27], LiNGAM [32] and regression with subsequent independence tests
(RESIT) [21, 26]. The latter has been used with a significance level of α = 0,
such that the method does not remain undecided. Both PC methods are equipped

with α = 0.01 and partial correlation as independence test. GES is used with a linear Gaussian score function. Thus, only RESIT is able to model the class of nonlinear additive functions. We apply the methods to DAGs of size p = 10 and p = 100, whereas in both cases the sample size is n = 200. RESIT is not applicable for graphs with p = 100 due to computational reasons. Figure 5 shows that our proposed method outperforms the other approaches both in terms of SID and SHD.

FIG. 4. The plots show the effect of the individual steps of our method. Prune reduces the SHD to the true DAG but leaves the SID almost unchanged. PNS reduces the computation time, especially for large p.

FIG. 5. SHD (left) and SID (right) for different methods on sparse DAGs with p = 10 (top) and p = 100 (bottom); the sample size is n = 200.
The difference between the methods becomes even larger for dense graphs with
an expected number of 4p edges and strong varying degree of nodes (results not
shown).
Only the PC methods and the proposed method CAM scale to high-dimensional
data with p = 1000 and n = 200. Keeping the same (sparse) setting as above re-
sults in SHDs of 1214 ± 37, 1330 ± 40 and 477 ± 19 for PC, CPC and CAM,
respectively. These results are based on five experiments.

6.1.4. Injectivity of model functions. In general, the nonlinear functions that are generated by Gaussian processes are not injective. We therefore test CAM for the case where every function in (1) is injective. Correctly directing the edges (j, k) is a more difficult task in this setting. We sample sigmoid-type functions of the form

f_{j,k}(x_k) = a · (b(x_k + c)) / (1 + |b(x_k + c)|)

with a ∼ Exp(4) + 1, b ∼ U([−2, −0.5] ∪ [0.5, 2]) and c ∼ U([−2, 2]); as before, we use Gaussian noise. Note that some of these functions may be very close to linear functions, which makes the direction of the corresponding edges difficult to identify. Figure 6 shows a comparison of the performance of CAM in the previously applied setting with Gaussian processes and in the new setting with sigmoid-type functions. As expected, the performance of CAM decreases in this more difficult setting but is still better than that of competitors such as RESIT, LiNGAM, PC, CPC and GES (not shown).

FIG. 6. SHD (left) and SID (right) for various values of p and n = 300. The plots compare the performances of CAM for the additive SEM (1) with functions generated by Gaussian processes (noninjective in general) and sigmoid-type functions (injective).
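For completeness, one such sigmoid-type edge function (with a ∼ Exp(4) + 1, |b| ∼ U(0.5, 2) with random sign, and c ∼ U(−2, 2)) can be drawn as follows (a small sketch of ours):

draw_sigmoid <- function() {
  a  <- rexp(1, rate = 4) + 1                  # a ~ Exp(4) + 1
  b  <- sample(c(-1, 1), 1) * runif(1, 0.5, 2) # |b| ~ U(0.5, 2) with random sign
  cc <- runif(1, -2, 2)                        # c ~ U(-2, 2)
  function(x) a * b * (x + cc) / (1 + abs(b * (x + cc)))
}
f <- draw_sigmoid()
curve(f, -3, 3)                                # injective; nearly linear when |b| is small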

6.1.5. Linear Gaussian SEMs. In the linear Gaussian setting, we can only
identify the Markov equivalence class of the true graph (if we assume faithful-
ness). We now sample data from a linear Gaussian SEM and expand the DAGs
that are estimated by CAM and LiNGAM to CPDAGs, that is, we consider the
corresponding Markov equivalence classes. The two plots in Figure 7 compare
the different methods for p = 10 variables and n = 200. They show the structural
Hamming distance (SHD) between the estimated and the true Markov equivalence
class (left), as well as lower and upper bounds for the SID (right). (By the defini-
tion of lower and upper bounds of the SID, the SID between the true and estimated
DAG lies in between those values.) The proposed method has a disadvantage be-
cause it uses nonlinear regression instead of linear regression. The performance

is nevertheless comparable. Remark 1 discusses that, at least in principle, this scenario is detectable.

FIG. 7. Data are generated by a linear Gaussian SEM. SHD between true and estimated CPDAG (left), lower and upper bounds for SID between true DAG and estimated CPDAG (right).

6.1.6. Robustness against nonadditive functions and non-Gaussian errors.


This work focuses on the additive model (1) and Gaussian noise. The score func-
tions (11) and (13) and their corresponding optimization problems depend on these
model assumptions. The DAG remains identifiable (under weak assumptions) even
if the functions of the data generating process are not additive or the noise variables
are non-Gaussian [26], cf. The following experiments analyze the empirical per-
formance of our method under these misspecifications. The case of misspecified
error distributions is discussed in Section 4.1.1.
As a first experiment, we examine deviations from the Gaussian noise assump-
tion by setting εj = sign(Nj )|Nj |γ with Nj ∼ N (0, σj2 ) for different exponents
0.1 ≤ γ ≤ 4. Only γ = 1 corresponds to normally distributed noise. Figure 8
shows the change in SHD and SID when varying γ .
As a second experiment, we examine deviations from additivity by simulating
from the model

X_j = ω · Σ_{k∈pa_D(j)} f_{j,k}(X_k) + (1 − ω) · f_j(X_{pa_D(j)}) + ε_j

for different values of ω ∈ [0, 1] and Gaussian noise. Both, fj,k and fj are drawn
from a Gaussian process using an RBF kernel with bandwidth one. Note that ω = 1
corresponds to the fully additive model (3), whereas for ω = 0, the value of Xj is
given as a nonadditive function of all its parents.

FIG. 8. SHD (top) and SID (bottom) for p = 25 and n = 300 in the case of misspecified models. The plot shows deviations of the noise from a normal distribution (only γ = 1 corresponds to Gaussian noise).

FIG. 9. SHD (left) and SID (right) for p = 25 and n = 300 in the case of misspecified models. The plot shows deviations from additivity for sparse (top) and nonsparse (bottom) truths, respectively (only ω = 1 corresponds to a fully additive model).

Figure 9 shows the result for a
sparse truth with an expected number of p edges (top) and a nonsparse truth with an expected number of 4p edges (bottom). In sparse DAGs, many nodes have a small number of parents and our algorithm yields a comparably small SID even if the model contains nonadditive functions. If the underlying truth is nonsparse, the performance of our algorithm becomes worse, but it is still slightly better than that of PC, which achieves average lower bounds of SID values of roughly 520, both for ω = 1 and for ω = 0 (not shown).
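
The following Python sketch illustrates this convex combination for a single node with two parents; the paper draws fj,k and fj from a Gaussian process with an RBF kernel of bandwidth one, whereas the tanh and sine terms below are merely hypothetical stand-ins for such draws.

```python
import numpy as np

def simulate_node(parents, omega, rng):
    """X_j = omega * sum_k f_jk(X_k) + (1 - omega) * f_j(X_pa) + eps_j, Gaussian eps_j.

    The nonlinearities below are illustrative stand-ins for Gaussian process draws.
    """
    additive_part = np.tanh(parents).sum(axis=1)                     # role of sum_k f_jk(X_k)
    joint_part = np.sin(parents.sum(axis=1) + parents.prod(axis=1))  # role of f_j(X_pa), nonadditive
    eps = rng.normal(scale=0.5, size=parents.shape[0])
    return omega * additive_part + (1 - omega) * joint_part + eps

rng = np.random.default_rng(0)
X_parents = rng.normal(size=(300, 2))            # two parent variables
for omega in (0.0, 0.5, 1.0):                    # nonadditive ... fully additive generation
    print(omega, round(float(simulate_node(X_parents, omega, rng).var()), 2))
```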
6.2. Real data. We apply our methodology to microarray data described
in [42]. The authors concentrate on 39 genes (118 observed samples) from two iso-
prenoid pathways in Arabidopsis thaliana. The dashed edges in Figures 10 and 11
indicate the causal direction within each pathway. While graphical Gaussian mod-
els are applied to estimate the underlying interaction network by an undirected
model in [42], our CAM procedure estimates the structure by a directed acyclic
graph.
Given a graph structure, we can compute p-value scores as described in Sec-
tion 5.3. Figure 10 shows the twenty best scoring edges of the graph estimated
by our proposed method CAM (the scores should not be interpreted as p-values
anymore since the graph has been estimated from data). We also apply stability
selection [18] to this data set. We therefore consider 100 different subsamples of
size 59 and record the edges that appear at least 57 times among the 20 best scoring edges. Under suitable assumptions, this leads to an expected number of false positives of less than two [18]. These edges are shown
in Figure 11. They connect genes within one of the two pathways and their direc-
tions agree with the overall direction of the pathways. Our findings are therefore
consistent with the prior knowledge available. The link MCT → CMK does not
appear in Figure 10 since it was ranked as the 22nd best scoring edge.
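
As a plausibility check of the false positive statement above, one can evaluate the stability selection bound E[V] ≤ q²/((2π_thr − 1) · N_cand) from [18], assuming q = 20 edges kept per subsample, a selection threshold of 57/100 and N_cand = 39 · 38 candidate directed edges (the last number is an assumption, not stated in the text):

```python
# Stability selection bound from [18]: E[V] <= q**2 / ((2 * pi_thr - 1) * n_cand).
q = 20              # edges kept per subsample (the 20 best scoring edges)
pi_thr = 57 / 100   # an edge is selected if it appears in at least 57 of 100 subsamples
n_cand = 39 * 38    # candidate directed edges among 39 genes (assumption)
print(q**2 / ((2 * pi_thr - 1) * n_cand))   # roughly 1.93, i.e., less than two
```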
2552 P. BÜHLMANN, J. PETERS AND J. ERNEST

F IG . 10. Gene expressions in isoprenoid pathways. The twenty best scoring edges provided by the
method CAM.

7. Conclusions and extensions. We have proposed maximum likelihood esti-
mation and its restricted version for the class of additive structural equation models
(i.e., causal additive models, CAMs) with Gaussian errors where the causal struc-
ture (underlying DAG) is identifiable from the observational probability distribu-
tion [26]. A key component of our approach is to decouple order search among the
variables from feature or edge selection in DAGs. Regularization is only necessary
for the latter while estimation of an order can be done with a nonregularized (re-
stricted) maximum likelihood principle. Thus, we have substantially simplified the
problem of structure search and estimation for an important class of causal models.
We established consistency of the (restricted) maximum likelihood estimator for
low- and high-dimensional scenarios, and we also allow for misspecification of the
error distribution. Furthermore, we developed an efficient computational algorithm
which can deal with many variables, and the new method’s accuracy and performance are illustrated with a variety of empirical results for simulated and real data.
We found that we can do much more accurate estimation for identifiable, nonlinear
CAMs than for nonidentifiable linear Gaussian structural equation models.
FIG. 11. Gene expressions in isoprenoid pathways. Edges estimated by stability selection: all directions are in correspondence with the direction of the pathways.

7.1. Extensions. The estimation principle of first pursuing order search based
on nonregularized maximum likelihood and then using penalized regression for
feature selection works with other structural equation models where the underly-
ing DAG is identifiable from the observational distribution. Closely related exam-
ples include nonlinearly transformed additive structural equation models [46] or
Gaussian structural equation models with equal error variances [25].
If the DAG D is nonidentifiable from the distribution P , the methodology needs
to be adapted; see also Remark 1, which considers the linear Gaussian SEM. The true
orders can then be defined as the set of permutations which lead to the sparsest
autoregressive representations as in (5): assuming faithfulness, these orders cor-
respond to the Markov equivalence class of the underlying DAG. Therefore, for
estimation, we should use regularized maximum likelihood estimation leading to
sparse solutions with, for example, the ℓ0-penalty [6, 38].
Finally, it would be very interesting to extend (sparse) permutation search to
(possibly nonidentifiable) models with hidden variables [7, 11, 23, 34] or with
graph structures allowing for cycles [19, 20, 30, 33]. Note that unlike linear Gaus-
sian models, CAMs are not closed under marginalization: if X, Y and Z follow a
CAM (1), then X and Y do not necessarily remain in the class of CAMs.
Acknowledgments. The authors thank Richard Samworth for fruitful discussions regarding the issue of closedness of subspaces, which allows one to construct proper projections.
SUPPLEMENTARY MATERIAL
Supplement to “CAM: Causal additive models, high-dimensional order
search and penalized regression” (DOI: 10.1214/14-AOS1260SUPP; .pdf). This
supplemental article [3] contains all proofs.
REFERENCES
[1] BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80 580–619. With discussion and with a reply by the authors. MR0803258
[2] BÜHLMANN, P. and HOTHORN, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22 477–505. MR2420454
[3] BÜHLMANN, P., PETERS, J. and ERNEST, J. (2014). Supplement to “CAM: Causal additive models, high-dimensional order search and penalized regression.” DOI:10.1214/14-AOS1260SUPP.
[4] BÜHLMANN, P. and VAN DE GEER, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg. MR2807761
[5] BÜHLMANN, P. and YU, B. (2003). Boosting with the L2 loss: Regression and classification. J. Amer. Statist. Assoc. 98 324–339. MR1995709
[6] CHICKERING, D. M. (2002). Optimal structure identification with greedy search. J. Mach. Learn. Res. 3 507–554. MR1991085
[7] COLOMBO, D., MAATHUIS, M. H., KALISCH, M. and RICHARDSON, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Statist. 40 294–321. MR3014308
[8] HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
[9] HOTHORN, T., BÜHLMANN, P., KNEIB, T., SCHMID, M. and HOFNER, B. (2010). Model-based boosting 2.0. J. Mach. Learn. Res. 11 2109–2113. MR2719848
[10] IMOTO, S., GOTO, T. and MIYANO, S. (2002). Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. In Proceedings of the Pacific Symposium on Biocomputing (PSB) 175–186. Lihue, HI.
[11] JANZING, D., PETERS, J., MOOIJ, J. M. and SCHÖLKOPF, B. (2009). Identifying confounders using additive noise models. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI) 249–257. AUAI Press, Corvallis, OR.
[12] LAURITZEN, S. L. (1996). Graphical Models. Oxford Univ. Press, New York. MR1419991
[13] LOH, P. and BÜHLMANN, P. (2013). High-dimensional learning of linear causal networks via inverse covariance estimation. J. Mach. Learn. Res. To appear. Available at arXiv:1311.3492.
[14] MAMMEN, E. and PARK, B. U. (2006). A simple smooth backfitting method for additive models. Ann. Statist. 34 2252–2271. MR2291499
[15] MARRA, G. and WOOD, S. N. (2011). Practical variable selection for generalized additive models. Comput. Statist. Data Anal. 55 2372–2387. MR2786996
[16] MEIER, L., VAN DE GEER, S. and BÜHLMANN, P. (2009). High-dimensional additive modeling. Ann. Statist. 37 3779–3821. MR2572443
[17] MEINSHAUSEN, N. and BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[18] MEINSHAUSEN, N. and BÜHLMANN, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 72 417–473. MR2758523
[19] MOOIJ, J. and HESKES, T. (2013). Cyclic causal discovery from continuous equilibrium data. In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI) 431–439. AUAI Press, Corvallis, OR.
[20] MOOIJ, J., JANZING, D., HESKES, T. and SCHÖLKOPF, B. (2011). On causal discovery with cyclic additive noise models. In Advances in Neural Information Processing Systems 24 (NIPS) 639–647. Curran, Red Hook, NY.
[21] MOOIJ, J., JANZING, D., PETERS, J. and SCHÖLKOPF, B. (2009). Regression by dependence minimization and its application to causal inference. In Proceedings of the 26th International Conference on Machine Learning (ICML) 745–752. ACM, New York.
[22] NOWZOHOUR, C. and BÜHLMANN, P. (2013). Score-based causal learning in additive noise models. Available at arXiv:1311.6359.
[23] PEARL, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge Univ. Press, Cambridge. MR1744773
[24] PETERS, J. and BÜHLMANN, P. (2013). Structural intervention distance (SID) for evaluating causal graphs. Neural Comput. To appear. Available at arXiv:1306.1043.
[25] PETERS, J. and BÜHLMANN, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101 219–228. MR3180667
[26] PETERS, J., MOOIJ, J., JANZING, D. and SCHÖLKOPF, B. (2014). Causal discovery with continuous additive noise models. J. Mach. Learn. Res. 15 2009–2053.
[27] RAMSEY, J., ZHANG, J. and SPIRTES, P. (2006). Adjacency-faithfulness and conservative causal inference. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI) 401–408. AUAI Press, Corvallis, OR.
[28] RAVIKUMAR, P., LAFFERTY, J., LIU, H. and WASSERMAN, L. (2009). Sparse additive models. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 1009–1030. MR2750255
[29] RÉNYI, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hung. 10 441–451 (unbound insert). MR0115203
[30] RICHARDSON, T. (1996). A discovery algorithm for directed cyclic graphs. In Uncertainty in Artificial Intelligence (Portland, OR, 1996) 454–461. Morgan Kaufmann, San Francisco, CA. MR1617227
[31] SCHMIDT, M., NICULESCU-MIZIL, A. and MURPHY, K. (2007). Learning graphical model structure using L1-regularization paths. In Proceedings of the National Conference on Artificial Intelligence 22 1278. AAAI Press, Menlo Park, CA.
[32] SHIMIZU, S., HOYER, P. O., HYVÄRINEN, A. and KERMINEN, A. (2006). A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7 2003–2030. MR2274431
[33] SPIRTES, P. (1995). Directed cyclic graphical representations of feedback models. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI) 491–499. Morgan Kaufmann, San Francisco, CA.
[34] SPIRTES, P., GLYMOUR, C. and SCHEINES, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA. MR1815675
[35] TEYSSIER, M. and KOLLER, D. (2005). Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI) 584–590. AUAI Press, Corvallis, OR.
[36] TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
[37] VAN DE GEER, S. (2014). On the uniform convergence of empirical norms and inner products, with application to causal inference. Electron. J. Stat. 8 543–574. MR3211024
[38] VAN DE GEER, S. and BÜHLMANN, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Statist. 41 536–567. MR3099113
[39] VAN DE GEER, S., BÜHLMANN, P. and ZHOU, S. (2011). The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5 688–749. MR2820636
[40] VOORMAN, A., SHOJAIE, A. and WITTEN, D. (2014). Graph estimation with joint additive models. Biometrika 101 85–101. MR3180659
[41] WAINWRIGHT, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202. MR2729873
[42] WILLE, A., ZIMMERMANN, P., VRANOVÁ, E., FÜRHOLZ, A., LAULE, O., BLEULER, S., HENNIG, L., PRELIC, A., VON ROHR, P., THIELE, L., ZITZLER, E., GRUISSEM, W. and BÜHLMANN, P. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol. 5 R92.
[43] WOOD, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, Boca Raton, FL. MR2206355
[44] YUAN, M. and LIN, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
[45] ZHANG, C.-H. and HUANG, J. (2008). The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594. MR2435448
[46] ZHANG, K. and HYVÄRINEN, A. (2009). On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI) 647–655. AUAI Press, Corvallis, OR.
[47] ZHAO, P. and YU, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449
[48] ZOU, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469
SEMINAR FOR STATISTICS
ETH ZÜRICH
RÄMISTRASSE 101
8092 ZÜRICH
SWITZERLAND
E-MAIL: buhlmann@stat.math.ethz.ch
        peters@stat.math.ethz.ch
        ernest@stat.math.ethz.ch
URL: http://stat.ethz.ch