14 Aos1260
Gaussian case, but this is not true anymore when placing additional restrictions
which can be very useful [25, 26, 32]. We use here the framework of structural
equation models.
1.1. Problem and main idea. Our goal is estimation and structure learning for
structural equation models, or of the corresponding Markov equivalence class of
an underlying DAG. In particular, we focus on causal additive models, that is, the
structural equations are additive in the variables and error terms. The model has the
nice property that the underlying structure and the corresponding parameters are
identifiable from the observational distribution. Furthermore, we can view it as an
extension of linear Gaussian structural equation models by allowing for nonlinear
additive functions.
In general, the problem of structure learning (and estimation of correspond-
ing parameters) can be addressed by a variety of algorithms and methods: in
the frequentist setting, the most widely used procedures for structure learning
(and corresponding parameters) are greedy equivalence search for computing the
BIC-regularized maximum likelihood estimator [6] or the PC-algorithm using
multiple conditional independence testing [34]. However, for the latter, the con-
straint of additive structural equations cannot be (easily) respected, and regard-
ing the former, maximum likelihood estimation among all (e.g., linear Gaussian)
DAG models is computationally challenging and statistical guarantees for high-
dimensional cases (and for uniform convergence with respect to a class of distri-
butions) are only available under rather strong assumptions [38].
Our proposed approach for estimation and selection of additive structural equa-
tion models is based on the following simple idea which is briefly mentioned and
discussed in [35] and [31]. If the order among the variables were known, the problem would boil down to variable selection in multivariate (potentially nonlinear) regression; see formula (5). The latter is very well understood: for example, we can
follow the route of hypothesis testing in additive models, or sparse regression can
be used for additive models [16, 28, 44]. Thus, the only remaining task is to es-
timate the order among the variables. We show here that this can be done via the
maximum likelihood principle, and we establish its consistency. In particular, for
low or “mid”-dimensional problems, there is no need to consider a penalized like-
lihood approach. The same holds true for high-dimensional settings when using a
preliminary neighborhood selection and then employing a corresponding restricted
maximum likelihood estimator. Therefore, we can entirely decouple order estimation (without regularization) from variable selection in sparse regression (with appropriate regularization). This makes our approach very generic, at least
within the framework where the underlying DAG and a corresponding order of the
variables are identifiable from the joint distribution. Empirical results in Section 6
support that we can achieve much more accurate estimation than for nonidentifiable models such as the popular linear Gaussian structural equation model. At a superficial level, our approach can be summarized as follows:
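The decoupling of the two stages can be sketched schematically; the helper names below (`order_search`, `sparse_additive_select`) are hypothetical placeholders for the two procedures discussed in the text, not part of the paper's implementation:

```python
# Schematic sketch of the two-stage idea (helper names are hypothetical):
# (1) estimate an order by unpenalized maximum likelihood,
# (2) select edges by sparse additive regression along that order.
def estimate_cam(X, order_search, sparse_additive_select):
    """X: data; the two callables stand in for the order-search and
    sparse-regression procedures discussed in the text."""
    pi_hat = order_search(X)               # estimated permutation of variables
    edges = []
    for pos, j in enumerate(pi_hat):
        candidates = pi_hat[:pos]          # predecessors of j in the order
        for k in sparse_additive_select(X, target=j, candidates=candidates):
            edges.append((k, j))           # keep directed edge k -> j
    return pi_hat, edges
```

The point of the sketch is only that the second stage never influences the first: any regression-based variable selection can be plugged in once the order is fixed.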
2528 P. BÜHLMANN, J. PETERS AND J. ERNEST
3 Corollary 31 in [26] contains a slightly different statement using “nonlinear” instead of “nonconstant”. The proof, however, stays exactly the same.
This result does not hold, however, for a general SEM or for a linear Gaussian SEM as in (2); one can then only identify the Markov equivalence class of the DAG $D^0$, assuming faithfulness. An exception arises when assuming same error variances $\sigma_j^2 \equiv \sigma^2$ for all $j$ in (2), which again implies identifiability of the DAG $D^0$ from $P$ [25]. In the sequel, we consider the fully identifiable case of model (1).
2.2. The function class. We assume that the functions in model (1) or (3) are from a class of smooth functions: $\mathcal{F}$ is a subset of $L^2(P_j)$, where $P_j$ is the marginal distribution for any $j = 1, \ldots, p$; assume that it is closed with respect to the $L^2(P_j)$ norm. Furthermore,
$$\mathcal{F} \subseteq \big\{ f : \mathbb{R} \to \mathbb{R};\ f \in C^{\alpha},\ \mathbb{E}[f(X)] = 0 \big\},$$
where $C^{\alpha}$ denotes the space of $\alpha$-times differentiable functions and the random variable $X$ is a placeholder for the variables $X_j$, $j = 1, \ldots, p$. Note that this is a slight abuse of notation since $\mathcal{F}$ does not specify the variable $X$; it becomes clear from the context.
Consider also basis functions $\{b_r(\cdot);\ r = 1, \ldots, a_n\}$ with $a_n \to \infty$ sufficiently slowly, for example, B-splines or regression splines. Consider further the space
$$\mathcal{F}_n = \Big\{ f \in \mathcal{F};\ f = c + \sum_{r=1}^{a_n} \alpha_r b_r(\cdot) \text{ with } c, \alpha_r \in \mathbb{R}\ (r = 1, \ldots, a_n) \Big\}. \tag{4}$$
We allow for constants $c$ to enforce mean zero for the whole function. Furthermore, the basis functions can be the same for all variables $X_j$, $j = 1, \ldots, p$.
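As a concrete illustration of (4), the following sketch builds a small centered regression-spline basis; a truncated-power basis serves here as a simple stand-in for B-splines, and all numerical choices are illustrative:

```python
import numpy as np

def regression_spline_basis(x, a_n=6, degree=2):
    """Evaluate a_n regression-spline basis functions b_1, ..., b_{a_n} at x,
    using a truncated-power construction (a simple stand-in for B-splines)."""
    x = np.asarray(x, dtype=float)
    cols = [x**d for d in range(1, degree + 1)]          # polynomial part
    knots = np.quantile(x, np.linspace(0.1, 0.9, a_n - degree))
    cols += [np.clip(x - t, 0.0, None)**degree for t in knots]
    B = np.column_stack(cols)
    return B - B.mean(axis=0)   # centering mimics the constraint E[f(X)] = 0

rng = np.random.default_rng(0)
x = rng.normal(size=500)
B = regression_spline_basis(x)
alpha = rng.normal(size=B.shape[1])
f_x = B @ alpha                 # one element f = sum_r alpha_r b_r of F_n
```

Centering the design matrix empirically enforces the mean-zero constraint in (4) without carrying an explicit constant $c$.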
For the theoretical analysis, we assume that $\mathcal{F}_n$ is deterministic and does not depend on the data. Then, $\mathcal{F}_n$ is closed. Furthermore, the spaces of additive functions are denoted by
$$\mathcal{F}^{\oplus} = \Big\{ f : \mathbb{R}^p \to \mathbb{R};\ f(x) = \sum_{k=1}^{p} f_k(x_k),\ f_k \in \mathcal{F} \Big\},$$
$$\mathcal{F}_n^{\oplus} = \Big\{ f : \mathbb{R}^p \to \mathbb{R};\ f(x) = \sum_{k=1}^{p} f_k(x_k),\ f_k \in \mathcal{F}_n \Big\}.$$
For any subset $I \subseteq \{1, \ldots, p\}$ of variables, the spaces $\mathcal{F}^{\oplus}$ and $\mathcal{F}_n^{\oplus}$ are then closed with respect to the $L^2(P_I)$ norm. Here, $P_I$ denotes the marginal distribution over all variables in $I$.
The question of closedness of additive models has also been studied in [1], for
example; see also [29].
2.3. Order of variables and the likelihood. We can permute the variables, inducing a different ordering; in the sequel, we use both terminologies, permutations and order search, which mean the same thing. For a permutation $\pi$ on $\{1, \ldots, p\}$, define
$$X^{\pi}, \qquad X_j^{\pi} = X_{\pi(j)}.$$
There is a canonical correspondence between permutations and fully connected DAGs: for any permutation $\pi$, we can construct a DAG $D^{\pi}$, in which each variable $\pi(k)$ has a directed arrow to all $\pi(j)$ with $j > k$. The node $\pi(1)$ has no parents and is called the source node. For a given DAG $D^0$, we define the set of true permutations as
$$\Pi^0 = \big\{ \pi^0;\ \text{the fully connected DAG } D^{\pi^0} \text{ is a super-DAG of } D^0 \big\},$$
where a super-DAG of $D^0$ is a DAG whose set of directed edges is a superset of the one corresponding to $D^0$. If the true DAG $D^0$ is not fully connected, there is typically more than one true order or permutation, that is, the true order is typically not unique. It is apparent that any true ordering or permutation $\pi^0$ allows for a lower-triangular (or autoregressive) representation of the model in (3):
$$X_j^{\pi^0} = \sum_{k=1}^{j-1} f_{j,k}^{\pi^0}\big(X_k^{\pi^0}\big) + \varepsilon_j^{\pi^0} \quad (j = 1, \ldots, p), \tag{5}$$
where $f_{j,k}^{\pi^0}(\cdot) = f^0_{\pi^0(j), \pi^0(k)}(\cdot)$ and $\varepsilon_j^{\pi^0} = \varepsilon^0_{\pi^0(j)}$, that is, with permuted indices in terms of the original quantities in (3). If all functions $f_{j,k}(\cdot)$ are nonlinear, the set of true permutations is identifiable from the distribution ([26], Corollary 33), and $\Pi^0$ consists of all orderings of the variables which allow for a lower-triangular representation (5). We will exploit this fact in order to provide a consistent estimator $\hat{\pi}_n$ of the ordering: under suitable assumptions, the probability that $\hat{\pi}_n \in \Pi^0$ converges to one.
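The correspondence between permutations, fully connected DAGs $D^\pi$, and the set of true permutations can be made concrete in a few lines (a toy illustration, not the paper's code):

```python
from itertools import permutations

def full_dag(pi):
    """Fully connected DAG D^pi: an edge pi[k] -> pi[j] for all k < j."""
    return {(pi[k], pi[j]) for k in range(len(pi)) for j in range(k + 1, len(pi))}

def is_true_permutation(pi, true_edges):
    """pi is a true permutation iff D^pi is a super-DAG of the true DAG D^0."""
    return set(true_edges) <= full_dag(pi)

# toy DAG D^0 on three nodes: 0 -> 1 <- 2 (not fully connected)
true_edges = {(0, 1), (2, 1)}
true_perms = [pi for pi in permutations(range(3))
              if is_true_permutation(pi, true_edges)]
# both (0, 2, 1) and (2, 0, 1) are true orders: the true order is not unique
```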
REMARK 1. For the linear Gaussian SEM (2), all orderings allow for a lower-triangular representation (5), even those that are not in $\Pi^0$. Thus, we cannot construct a consistent estimator in the above sense. However, assuming faithfulness of the true distribution, the orderings of variables which are consistent with the arrow directions in a DAG of the Markov equivalence class of the true DAG $D^0$ lead to the sparsest representations, that is, those with the fewest nonzero coefficients.
In principle, one can check whether the data come from a linear Gaussian SEM. Lemma 1 guarantees that if this is the case, there is no CAM with nonlinear functions yielding the same distribution. Thus, if the structural equations of the estimated DAG look linear with Gaussian noise, one could decide to output the Markov equivalence class instead of the DAG. One would need to quantify closeness to linearity and Gaussianity with, for example, a test: this would be important for practical applications, but its precise implementation lies beyond the scope of this work.
In the sequel, it is helpful to consider the true underlying parameter $\theta^0$ with corresponding nonlinear functions $f^0_{j,k}$ and error variances $(\sigma_j^0)^2$. For any permutation $\pi \notin \Pi^0$, we consider the projected parameters, defined as
$$\theta^{\pi,0} = \operatorname*{argmin}_{\theta^{\pi}} E_{\theta^0}\big[-\log p_{\theta^{\pi}}(X)\big].$$
(Note that if $\pi \in \Pi^0$, then $\theta^{\pi,0} = \theta^0$.) For such a misspecified model with wrong order $\pi \notin \Pi^0$, we have
$$\big(f_{j,k}^{\pi,0}\big)_{k=1,\ldots,j-1} = \operatorname*{argmin}_{g_{j,k} \in \mathcal{F},\, k=1,\ldots,j-1} E_{\theta^0}\Big[\Big(X_j^{\pi} - \sum_{k=1}^{j-1} g_{j,k}\big(X_k^{\pi}\big)\Big)^2\Big]
= \operatorname*{argmin}_{g_j \in \mathcal{F}^{\oplus(j-1)}} E_{\theta^0}\Big[\big(X_j^{\pi} - g_j\big(X_1^{\pi}, \ldots, X_{j-1}^{\pi}\big)\big)^2\Big]. \tag{6}$$
It holds that
$$\big(\sigma_j^{\pi,0}\big)^2 = \operatorname*{argmin}_{\sigma^2}\Big\{ \log(\sigma) + \frac{1}{2\sigma^2}\, E_{\theta^0}\Big[\Big(X_j^{\pi} - \sum_{k=1}^{j-1} f_{j,k}^{\pi,0}\big(X_k^{\pi}\big)\Big)^2\Big] \Big\}
= E_{\theta^0}\Big[\Big(X_j^{\pi} - \sum_{k=1}^{j-1} f_{j,k}^{\pi,0}\big(X_k^{\pi}\big)\Big)^2\Big]. \tag{7}$$
The two displayed formulae above show that autoregression with the wrong order $\pi$ leads to the projected parameters $\{f_{j,k}^{\pi,0}\}$ and $\{(\sigma_j^{\pi,0})^2\}$. Finally, we obtain
$$E_{\theta^0}\big[-\log p_{\theta^{\pi,0}}(X)\big] = \sum_{j=1}^{p} \log\big(\sigma_j^{\pi,0}\big) + C, \qquad C = p \log\big((2\pi)^{1/2}\big) + p/2.$$
negative KL-divergence between the true and the best projected distribution). Let us therefore define
$$\xi_p := \min_{\pi \notin \Pi^0} p^{-1}\Big( E_{\theta^0}\big[-\log p_{\theta^{\pi,0}}(X)\big] - E_{\theta^0}\big[-\log p_{\theta^0}(X)\big] \Big) \ge 0. \tag{8}$$
The number $\xi_p$ describes the degree of separation between the true model and misspecification when using a wrong permutation. As discussed in Remark 1, $\xi_p = 0$ for the case of linear Gaussian SEMs. Formula (8) can be expressed as
$$\xi_p = \min_{\pi \notin \Pi^0} p^{-1} \sum_{j=1}^{p} \big( \log\big(\sigma_j^{\pi,0}\big) - \log\big(\sigma_j^0\big) \big) \ge 0. \tag{9}$$
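To illustrate that the gap in (9) is strictly positive for a nonlinear additive SEM, consider a toy two-variable model $X_1 \sim \mathcal{N}(0,1)$, $X_2 = X_1^2 - 1 + \varepsilon$. The following simulation estimates $\xi_2$; polynomial regression serves as a crude stand-in for nonparametric additive fitting, and the model and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
x1 = rng.normal(size=n)
x2 = (x1**2 - 1) + 0.5 * rng.normal(size=n)   # nonlinear CAM: X1 -> X2

def resid_var(y, x, deg=3):
    """Residual variance of y after polynomial regression on x
    (a crude stand-in for nonparametric additive regression)."""
    coef = np.polyfit(x, y, deg)
    return np.mean((y - np.polyval(coef, x))**2)

def order_score(first, second):
    """sum_j log(sigma_j) for the two-variable order first -> second."""
    return 0.5 * np.log(np.var(first)) + 0.5 * np.log(resid_var(second, first))

# gap in (9) with p = 2: wrong order (X2 -> X1) versus the true order
xi_2 = 0.5 * (order_score(x2, x1) - order_score(x1, x2))
```

For this toy model, the population value of the gap works out to $(1/4)\log 9 \approx 0.55$: regressing $X_1$ on $X_2$ leaves the full variance of $X_1$ unexplained (by symmetry), while the correct order explains $X_2$ up to the small noise variance.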
REMARK 2. Especially for situations where $p$ is very large so that the factor $p^{-1}$ is small, requiring a lower bound $\xi_p > 0$ can be overly restrictive. Instead of requiring a gap with the factor $p^{-1}$ between the likelihood scores of the true distribution and all distributions corresponding to permutations, one can weaken this as follows. Consider $H(D, D^0) = \{j;\ \mathrm{pa}_{D^0}(j) \neq \mathrm{pa}_D(j)\}$. We require that
$$\xi_p := \min_{D \neq D^0} \big|H\big(D, D^0\big)\big|^{-1} \sum_{j \in H(D, D^0)} \big( \log\big(\sigma_j^{D,0}\big) - \log\big(\sigma_j^0\big) \big) \ge 0, \tag{10}$$
where $(\sigma_j^{D,0})^2$ is the error variance in the best additive approximation of $X_j$ based on $\{X_k;\ k \in \mathrm{pa}_D(j)\}$. Such a weaker gap condition is proposed in [13], Section 5.2. All our theoretical results still hold when replacing statements involving $\xi_p$ in (9) by the corresponding statements with $\xi_p$ in (10).
all data-points) the selected variables indexed in the original order [we obtain estimates $\hat{f}_{j,k}^{\hat{\pi}}$ in the representation (5) with correspondence to the indices $\hat{\pi}(k)$, $\hat{\pi}(j)$ in the original order]; we identify these selected variables in $\hat{D}^{\hat{\pi}}$ as the edge set of a DAG. For example, with the Group Lasso, assuming some condition avoiding near collinearity of functions, that is, a compatibility condition for the Group Lasso ([4], Chapter 5.6, Theorem 8.2), and that the $\ell_2$-norms of the nonzero functions are sufficiently large, we obtain the screening property (since we implicitly assume that $\hat{\pi} \in \Pi^0$ with high probability): with high probability, asymptotically tending to one,
$$\hat{D}^{\hat{\pi}} \supseteq D^0 = \big\{ (k, j);\ f_{j,k}^0 \not\equiv 0 \big\}, \tag{12}$$
saying that all relevant variables (i.e., edges) are selected. Similarly, with hypothesis testing, assuming that the nonzero $f_{j,k}^0$ have sufficiently large $\ell_2$-norms, we also obtain that (12) holds with high probability.
The same argumentation applies if we use $\hat{D}_{\mathrm{restr}}^{\hat{\pi}}$ from Section 3.2 instead of $\hat{D}^{\hat{\pi}}$.
2.6. Consistent estimation of causal effects. The property in (12) has an important implication for causal inference:4 all estimated causal effects and estimated intervention distributions based on the estimated DAG $\hat{D}^{\hat{\pi}}$ are consistent. In fact, using the do-calculus ([23], cf. (3.10)), we have for the single intervention (at variable $X_k$) distribution for $X_j$, for all $j \neq k$:
$$p_{D^0}\big(x_j \mid \mathrm{do}(X_k = x)\big) = p_{\hat{D}^{\hat{\pi}}}\big(x_j \mid \mathrm{do}(X_k = x)\big) \quad \text{for all } x,$$
where $p_D(\cdot \mid \mathrm{do}(\cdot))$ denotes the intervention density based on a DAG $D$.
We note that the screening property (12) also holds when replacing $\hat{D}^{\hat{\pi}}$ with the full DAG induced by $\hat{\pi}$, denoted by $D^{\hat{\pi}}$. Thus, the feature selection step in Section 2.5 is not needed to achieve consistent estimation of causal effects. However, a smaller DAG $D^0 \subseteq \hat{D}^{\hat{\pi}} \subseteq D^{\hat{\pi}}$ typically leads to better (more statistically efficient) estimates of the interventional distributions than the full DAG $D^{\hat{\pi}}$.
4 We assume that interventions at variables do not change the other structural equations, and that
there are no unobserved hidden (e.g., confounder) variables.
because for the former, the additive regression is against all other variables.
We give conditions in Section 4.2 (see Lemma 4) ensuring that the neighbor-
hood selection set contains the parental variables from the structural equation
model in (1) or (3), that is, Âj ⊇ pa(j ).
(A3) The error variances satisfy $(\sigma_j^{\pi,0})^2 > 0$ for all $j = 1, \ldots, p$ and all $\pi$.
(A4) The true functions $f^0_{j,k}$ can be approximated on any compact set $\mathcal{C} \subset \mathbb{R}$: for all $k \in \mathrm{pa}_{D^0}(j)$, $j = 1, \ldots, p$,
$$E\big[\big(f_{j,k}^0(X_k) - f_{n;j,k}^0(X_k)\big)^2 I(X_k \in \mathcal{C})\big] = o(1),$$
where
$$f_{n;j}^0 = \operatorname*{argmin}_{g_j \in \mathcal{F}_n^{\oplus(j-1)}} E\Big[\Big(X_j - \sum_{k \in \mathrm{pa}_{D^0}(j)} g_{j,k}(X_k)\Big)^2\Big].$$
None of these assumptions is very restrictive. The second part of assumption (A2)(ii) holds if we assume, for example, a bounded function class $\mathcal{F}$, or if $|f(x)| \lesssim |x|$ as $|x| \to \infty$ for all $f \in \mathcal{F}$.
A proof is given in the supplemental article [3]. Theorem 1 says that one can
find a correct order among the variables without pursuing feature or edge selection
for the structure in the SEM.
This can be shown exactly along the lines of the proof of Theorem 1 in the supple-
mental article [3].
identifiability is given in terms of all the error variances, that is, involving only
second moments. It is an open problem whether (or for which subclass of models)
identifiability from the distribution carries over to automatically ensure that ξp > 0
in (9).
Furthermore, assume that the number $a_n$ of basis functions for functions in $\mathcal{F}_n$ is small such that assumption (A4) does not hold, for example, $a_n = O(1)$. We denote by
$$\big(\sigma_j^{\pi,0,a_n}\big)^2 = \min_{g_j \in \mathcal{F}_n^{\oplus(j-1)}} E_{\theta^0}\Big[\Big(X_j^{\pi} - \sum_{k=1}^{j-1} g_{j,k}\big(X_k^{\pi}\big)\Big)^2\Big],$$
Requiring
$$\liminf_{n \to \infty} \xi_p^{a_n} > 0$$
is still reasonable: if (9) with $\xi_p > 0$ holds because of nonlinearity of the additive functions [26] (and see the interpretation above for non-Gaussian errors), we believe that it typically also holds for the best projected additive functions in $\mathcal{F}_n^{\oplus}$ as long as some nonlinearity is present when using $a_n$ basis functions; here, the best projected additive function for the $j$th variable $X_j^{\pi}$ is defined as
$$f_{n;j}^{\pi} = \operatorname*{argmin}_{g_j \in \mathcal{F}_n^{\oplus(j-1)}} E\Big[\Big(X_j^{\pi} - \sum_{k=1}^{j-1} g_{j,k}\big(X_k^{\pi}\big)\Big)^2\Big].$$
We also note that for $a_n \to \infty$, even when diverging very slowly, and assuming (A4), we have that $\xi_p^{a_n} \to \xi_p$ and thus $\liminf_{n \to \infty} \xi_p^{a_n} > 0$. In general, the choice of the number of basis functions $a_n$ is a
trade-off between identifiability (due to nonlinearity) and estimation accuracy: for $a_n$ small we might have a smaller value in (15), that is, it might be that $\xi_p^{a_n'} \le \xi_p^{a_n}$ for $a_n' \le a_n$, which makes identifiability harder but exhibits less variability in estimation, and vice versa. In particular, the trade-off between identifiability and variance might be rather different from the classical bias-variance trade-off with respect to prediction in classical function estimation. A low complexity (with $a_n$ small) might be better than a prediction-optimal number of basis functions.
Theorem 2 below establishes the consistency of order estimation in an additive structural equation model with potentially non-Gaussian errors, even when the expansion for function estimation is truncated at few basis functions.
1. (A1)–(A4) hold, and $\xi_p > 0$ in formula (9) (see also Remark 2).
2. (A1)–(A3) hold, and $\liminf_{n \to \infty} \xi_p^{a_n} > 0$ in formula (15).
Then
$$P\big[\hat{\pi} \in \Pi^0\big] \to 1 \quad (n \to \infty).$$
In general, some variables are irrelevant, and we denote the set of relevant variables by $A_j$: $A_j \subseteq \{1, \ldots, p\} \setminus \{j\}$ is the (or a) smallest set5 such that
$$E_{\mathrm{add}}[X_j \mid X_{\{-j\}}] = E_{\mathrm{add}}[X_j \mid X_{A_j}].$$
We assume the following:
(B1) For all $j = 1, \ldots, p$: for all $k \in \mathrm{pa}(j)$,
$$E_{\mathrm{add}}\big[ X_j - E_{\mathrm{add}}[X_j \mid X_{A_j \setminus \{k\}}] \,\big|\, X_k \big] \not\equiv 0.$$
Assumption (B1) requires that, for each $j = 1, \ldots, p$, $X_k$ [$k \in \mathrm{pa}(j)$] has an additive influence on $X_j$ given all additive effects from $X_{A_j \setminus \{k\}}$.
5 Uniqueness of such a set is not a requirement but implicitly ensured by the compatibility condition
and sparsity which we invoke to guarantee (B2)(ii).
A proof is given in the supplemental article [3]. Lemma 4 justifies, for the population case, pursuing preliminary neighborhood selection followed by restricted maximum likelihood estimation: because $\mathrm{pa}(j) \subseteq A_j$, the restriction in the maximum likelihood estimator is appropriate, and a true permutation $\pi^0 \in \Pi^0$ leads to a valid restriction $R_{\pi^0, j} \supseteq \mathrm{pa}(\pi^0(j))$ (when defined with the population sets $A_j$).
For estimation, we assume the following:
(B2) The selected variables in $\hat{A}_j$ from neighborhood selection satisfy, with probability tending to 1 as $n \to \infty$:
(i) $\hat{A}_j \supseteq A_j$ ($j = 1, \ldots, p$),
(ii) $\max_{j=1,\ldots,p} |\hat{A}_j| \le M$ for some positive constant $M < \infty$.
Assumption (B2)(i) is a rather standard screening assumption. It holds for the Group Lasso with sparsity-smoothness penalty: using a basis expansion as in (4), the condition is implied by a sparsity assumption, a group compatibility condition (for the basis vectors), and a beta-min condition on the minimal size of the $\ell_2$-norm of the coefficients for the basis functions of the active variables in $A_j$; see [4], Chapter 5.6, Theorem 8.2. The sparsity and group compatibility condition ensure identifiability of the active set, and hence, they exclude concurvity (or collinearity) among the additive functions in the structural equation model. Assumption (B2)(ii) can be ensured by assuming $\max_j |A_j| \le M_1 < \infty$ for some positive constant $M_1$ and, for example, group restricted eigenvalue assumptions for the design matrix (with the given basis); see [39, 45] for the case without groups.
Finally, we need to strengthen assumptions (A2) and (A3).
(B3) (i) For $B \subseteq \{1, \ldots, p\} \setminus \{j\}$ with $|B| \le M$, with $M$ as in (B2), denote by $h^B_{j,g} = (X_j - \sum_{k \in B} g_k(X_k))^2$. For some $0 < K < \infty$, it holds that
$$\max_{j=1,\ldots,p}\ \max_{B \subseteq \{1,\ldots,p\} \setminus \{j\},\, |B| \le M}\ \sup_{g \in \mathcal{F}^{\oplus|B|}} \rho_K\big(h^B_{j,g}\big) \le D_1 < \infty,$$
where
$$\rho_K^2\big(h^B_{j,g}\big) = 2K^2\, E_{\theta^0}\big[ \exp\big(\big|h^B_{j,g}(X)\big|/K\big) - 1 - \big|h^B_{j,g}(X)\big|/K \big].$$
(ii) For $V = 1/\alpha$,
$$\max_{j=1,\ldots,p} \Big( \sum_{m=1}^{\infty} M_m^2\, P[X_j \in \mathcal{I}_m]^{V/(V+4)} \Big)^{(V+4)/8} \le D_2 < \infty.$$
This assumption is typically weaker than what we require in (B3)(i), when assuming that the values $M_m$ are reasonable (e.g., bounded).
(iii)
$$\max_j E|X_j|^4 \le D_3 < \infty, \qquad \max_j \sup_{f \in \mathcal{F}} E\big[f(X_j)^4\big] \le D_4 < \infty.$$
following condition.
$(B_{\mathrm{add}})$ Consider the basis functions $b_r(\cdot)$ appearing in $\mathcal{F}_n$: for the true functions $f_{j,k}^0 \in \mathcal{F}$, we assume an expansion
$$f_{j,k}^0(x) = \sum_{r=1}^{\infty} \alpha_{f_{j,k}^0; r}\, b_r(x)$$
with smoothness condition:
$$\sum_{r=k}^{\infty} \big|\alpha_{f_{j,k}^0; r}\big| \le C k^{-\beta}.$$
Assuming $(B_{\mathrm{add}})$, we have that $E[(f_{j,k}^0(X_k) - f_{n;j,k}^0(X_k))^2] = O(a_n^{-(\beta - 1 - \kappa)})$ for any $\kappa > 0$: for example, when using $a_n \to \infty$ and for $\beta > 1$, $E[(f_{j,k}^0(X_k) - f_{n;j,k}^0(X_k))^2] \to 0$.
Uniform convergence can be obtained exactly as described after the discussion of Theorem 1: when requiring the additional uniform versions (U3)–(U4) [since (B3) and (B4) already invoke uniform bounds, we do not need (U1) and (U2)], and requiring uniform convergence of the probability in (B2), we obtain uniform convergence over the corresponding class of distributions analogously as in (14).
FIG. 1. Step PNS. For each variable the set of possible parents is reduced (in this plot, a directed edge from Xk to Xj indicates that Xk is a selected variable in Âj and a possible parent of Xj). This reduction leads to a considerable computational gain in the remaining steps of the procedure.
FIG. 2. Step IncEdge. At each iteration, the edge leading to the largest decrease of the negative log-likelihood is included.
FIG. 3. Step Prune. For each node, variable selection techniques are exploited to remove nonrelevant edges.
hereby, we only consider variables that have been picked at least three times during the iterations. The sets $\hat{A}_j$ obtained by this procedure estimate $A_j \supseteq \mathrm{pa}(j)$ as shown in Lemma 4. We construct a graph in which, for each $j$, the set $\hat{A}_j$ is the parental set for node $j$ corresponding to the variable $X_j$. Figure 1 summarizes this step. We say that the set of “possible parents” of node $j$ has been reduced to the set $\hat{A}_j$. Importantly, we do not disregard true parents if the sample size is large enough (Section 4.2, Lemma 4). Instead of the boosting method, we could alternatively use additive model fitting with a sparsity or sparsity-smoothness penalty [16, 28].
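A minimal sketch of such a preliminary neighborhood selection via componentwise L2-boosting is given below; polynomial base learners replace the spline-based learners of the actual implementation, and the step size, iteration count, and threshold are illustrative:

```python
import numpy as np

def pns_boosting(X, j, n_iter=80, nu=0.1, min_picks=3, deg=3):
    """Estimate the set A_hat_j of possible parents of variable j by
    componentwise L2-boosting: in each iteration, fit the current residuals
    by the single best predictor and make a shrunken update."""
    n, p = X.shape
    resid = X[:, j] - X[:, j].mean()
    picks = np.zeros(p, dtype=int)
    candidates = [k for k in range(p) if k != j]
    for _ in range(n_iter):
        best_k, best_fit, best_sse = None, None, np.inf
        for k in candidates:
            coef = np.polyfit(X[:, k], resid, deg)
            fit = np.polyval(coef, X[:, k])
            sse = np.sum((resid - fit)**2)
            if sse < best_sse:
                best_k, best_fit, best_sse = k, fit, sse
        picks[best_k] += 1
        resid = resid - nu * best_fit        # shrunken boosting update
    # keep variables picked at least min_picks times during the iterations
    return {k for k in candidates if picks[k] >= min_picks}
```

For data with, say, $X_2 = X_0^2 - 1 + \varepsilon$ and an irrelevant $X_1$, the set returned for $j = 2$ contains the true parent $0$ with high probability, since it is picked in nearly every early iteration.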
5.2. Estimating the correct order by greedy search: IncEdge. Let us first consider the situation without PNS. Searching over all permutations $\pi$ for finding $\hat{\pi}$ in (11) is computationally infeasible if the number of variables $p$ is large. We propose a greedy estimation procedure that starts with an empty DAG and adds at each iteration the edge $k \to j$ between nodes $k$ and $j$ that corresponds to the largest gain in log-likelihood. We therefore compute the score function in (11), with $\pi$ corresponding to the current DAG,
$$\sum_{j=1}^{p} \log \hat{\sigma}_j^{\pi} = \sum_{j=1}^{p} \log\Big( \Big\| X_j^{\pi} - \sum_{k=1}^{j-1} \hat{f}_{j,k}^{\pi}\big(X_k^{\pi}\big) \Big\|_{(n)} \Big)$$
and construct a matrix whose entry $(k, j)$ specifies by how much this score is reduced after adding the edge $k \to j$ and, therefore, allowing a nonconstant function $f_{j,k}$ (see Figure 2). For the implementation, we employ additive model fitting with penalized regression splines (with ten basis functions per variable), using the R function gam from the R package mgcv, in order to obtain estimates $\hat{f}_{j,k}$ and $\hat{\sigma}_j$. After the addition of an edge, we only need to recompute the $j$th column of the score matrix (see Figure 2) since the score decomposes over all nodes. In order to avoid cycles, we remove further entries of the score matrix. After $p(p-1)/2$ iterations, the graph has been completed to a fully connected DAG. The latter corresponds to a unique permutation $\hat{\pi}$. This algorithm is computationally rather efficient and can easily handle graphs of up to 30 nodes without PNS (see Section 6.1.2).
If we have performed PNS as in Section 5.1, we sparsify the score matrix from the beginning: we only consider entries $(k, j)$ for which $k$ is considered a possible parent of $j$. This way, the algorithm is feasible for up to a few thousand nodes (see Section 6.1.3).
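The greedy IncEdge step can be sketched as follows. This is an illustrative reimplementation, not the authors' code: additive polynomial regression replaces gam/mgcv, and for simplicity all candidate gains are recomputed in every iteration rather than only the affected column:

```python
import numpy as np

def resid_var(y, parent_cols, deg=3):
    """sigma_hat_j^2: residual variance of y after an additive polynomial
    regression on its current parents (a stand-in for gam from mgcv)."""
    B = np.column_stack([np.ones(len(y))] +
                        [x**d for x in parent_cols for d in range(1, deg + 1)])
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return np.var(y - B @ beta)

def inc_edge(X):
    """Greedy IncEdge: start from the empty DAG and repeatedly add the edge
    k -> j with the largest decrease of sum_j log sigma_hat_j, skipping
    edges that would create a cycle, until the DAG is fully connected."""
    n, p = X.shape
    parents = {j: [] for j in range(p)}
    log_sigma = [0.5 * np.log(np.var(X[:, j])) for j in range(p)]

    def creates_cycle(k, j):            # is k reachable from j along edges?
        stack, seen = [j], set()
        while stack:
            u = stack.pop()
            if u == k:
                return True
            if u not in seen:
                seen.add(u)
                stack.extend(v for v in range(p) if u in parents[v])
        return False

    for _ in range(p * (p - 1) // 2):   # until fully connected
        best = None
        for j in range(p):
            for k in range(p):
                if k == j or k in parents[j] or creates_cycle(k, j):
                    continue
                new = 0.5 * np.log(resid_var(
                    X[:, j], [X[:, c] for c in parents[j] + [k]]))
                gain = log_sigma[j] - new
                if best is None or gain > best[0]:
                    best = (gain, k, j, new)
        gain, k, j, new = best
        parents[j].append(k)
        log_sigma[j] = new
    return parents
```

The cycle check keeps the current graph acyclic, so the fully connected DAG obtained after $p(p-1)/2$ additions corresponds to a unique permutation, as in the text.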
Alternative methods for (low-dimensional) additive model fitting include, for example, backfitting [14].
5.3. Pruning of the DAG by feature selection: Prune. Section 2.5 describes sparse regression techniques for pruning the DAG that has been estimated by step IncEdge; see Figure 3. We implement this task by applying significance testing of covariates, based on the R function gam from the R package mgcv, and declaring significance if the reported p-values are less than or equal to 0.001, independently of the sample size (for problems with small sample size, the p-value threshold should be increased).
If the DAG estimated by (PNS and) IncEdge is a super-DAG of the true DAG, the estimated interventional distributions are correct; see Section 2.6. This does not change if Prune removes additional “superfluous” edges. The structural Hamming distance to the true graph, however, may be reduced significantly after performing Prune; see Section 6.1.2. Alternative methods for hypothesis testing in (low-dimensional) additive models are possible (cf. [43]), or one could use penalized additive model fitting for variable selection [16, 28, 44].
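The Prune step can be sketched with a nested-model F-test per candidate parent; additive polynomial fits act as a stand-in for the covariate tests reported by gam, the level 0.001 follows the text, and everything else is illustrative:

```python
import numpy as np
from scipy import stats

def prune(X, parents, level=0.001, deg=3):
    """For each node j, keep only those parents whose additive (polynomial)
    component is significant at the given level in a nested-model F-test."""
    n = X.shape[0]

    def rss(y, cols):
        B = np.column_stack([np.ones(n)] +
                            [x**d for x in cols for d in range(1, deg + 1)])
        beta, *_ = np.linalg.lstsq(B, y, rcond=None)
        return np.sum((y - B @ beta)**2), B.shape[1]

    pruned = {}
    for j, pa in parents.items():
        y = X[:, j]
        rss_full, q_full = rss(y, [X[:, c] for c in pa])
        keep = []
        for k in pa:
            rss_red, _ = rss(y, [X[:, c] for c in pa if c != k])
            df1, df2 = deg, n - q_full
            F = ((rss_red - rss_full) / df1) / (rss_full / df2)
            if stats.f.sf(F, df1, df2) <= level:
                keep.append(k)           # k has a significant additive effect
        pruned[j] = keep
    return pruned
```

Removing an edge only shrinks the estimated DAG, so the screening property (12) is what guarantees that a truly relevant parent survives this test with high probability.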
6. Numerical results.
6.1. Simulated data. We show the effectiveness of each step in our algorithm
(Section 6.1.2) and compare the whole procedure to other state of the art methods
(Section 6.1.3). We investigate empirically the role of noninjective functions (Sec-
tion 6.1.4) and discuss the linear Gaussian case (Section 6.1.5). In Section 6.1.6,
we further check the robustness of our method against model misspecification, that
is, in the case of non-Gaussian noise or nonadditive functions. For evaluation, we
compute the structural intervention distance that we introduce in Section 6.1.1.
For simulating data, we randomly choose a correct ordering $\pi^0$ and connect each pair of variables (nodes) with a probability $p_{\mathrm{conn}}$. If not stated otherwise, each of the possible $p(p-1)/2$ connections is included with a probability of $p_{\mathrm{conn}} = 2/(p-1)$, resulting in a sparse DAG with an expected number of $p$ edges. Given the structure, we draw the functions $f_{j,k}$ from a Gaussian process with a Gaussian (or RBF) kernel with bandwidth one and add Gaussian noise with standard deviation uniformly sampled between $1/5$ and $\sqrt{2}/5$. All nodes without parents have a standard deviation between $1$ and $\sqrt{2}$. The experiments are based on 100 repetitions unless stated otherwise.
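The data-generating mechanism just described can be sketched as follows, drawing each edge function at the observed inputs from a GP with one common RBF parameterization; we assume the standard deviations are drawn uniformly, and the small numerical choices (jitter, sizes) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4

def sample_gp_function(x, bandwidth=1.0, jitter=1e-8):
    """Draw f(x) at the sample points from a mean-zero Gaussian process
    with an RBF kernel of the given bandwidth."""
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * bandwidth**2))
    return rng.multivariate_normal(np.zeros(len(x)), K + jitter * np.eye(len(x)))

order = rng.permutation(p)               # a correct causal ordering pi^0
X = np.zeros((n, p))
for pos in range(p):
    j = order[pos]
    # include each possible edge k -> j with probability 2/(p - 1)
    pa = [order[t] for t in range(pos) if rng.random() < 2 / (p - 1)]
    if pa:
        sd = rng.uniform(1 / 5, np.sqrt(2) / 5)   # noise sd in [1/5, sqrt(2)/5]
        X[:, j] = (sum(sample_gp_function(X[:, k]) for k in pa)
                   + sd * rng.normal(size=n))
    else:
        sd = rng.uniform(1, np.sqrt(2))           # source-node sd in [1, sqrt(2)]
        X[:, j] = sd * rng.normal(size=n)
```

Sampling the nodes in the order $\pi^0$ guarantees that every parent's values are already available when a child is generated.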
All code is provided on the second author’s homepage.
FIG. 4. The plots show the effect of the individual steps of our method. Prune reduces the SHD to the true DAG but leaves the SID almost unchanged. PNS reduces the computation time, especially for large p.
FIG. 5. SHD (left) and SID (right) for different methods on sparse DAGs with p = 10 (top) and p = 100 (bottom); the sample size is n = 200.
with α = 0.01 and partial correlation as independence test. GES is used with a linear Gaussian score function. Thus, only RESIT is able to model the class of nonlinear additive functions. We apply the methods to DAGs of size p = 10 and p = 100, where in both cases the sample size is n = 200. RESIT is not applicable to graphs with p = 100 for computational reasons. Figure 5 shows that our proposed method outperforms the other approaches both in terms of SID and SHD.
The difference between the methods becomes even larger for dense graphs with an expected number of 4p edges and strongly varying node degrees (results not shown).
Only the PC methods and the proposed method CAM scale to high-dimensional
data with p = 1000 and n = 200. Keeping the same (sparse) setting as above re-
sults in SHDs of 1214 ± 37, 1330 ± 40 and 477 ± 19 for PC, CPC and CAM,
respectively. These results are based on five experiments.
FIG. 6. SHD (left) and SID (right) for various values of p and n = 300. The plots compare the performances of CAM for the additive SEM (1) with functions generated by Gaussian processes (noninjective in general) and sigmoid-type functions (injective).
to linear functions which makes the direction of the corresponding edges diffi-
cult to identify. Figure 6 shows a comparison of the performance of CAM in the
previously applied setting with Gaussian processes and in the new setting with
sigmoid-type functions. As expected, the performance of CAM decreases in this
more difficult setting but is still better than for the competitors such as RESIT,
LiNGAM, PC, CPC and GES (not shown).
6.1.5. Linear Gaussian SEMs. In the linear Gaussian setting, we can only
identify the Markov equivalence class of the true graph (if we assume faithful-
ness). We now sample data from a linear Gaussian SEM and expand the DAGs
that are estimated by CAM and LiNGAM to CPDAGs, that is, we consider the
corresponding Markov equivalence classes. The two plots in Figure 7 compare
the different methods for p = 10 variables and n = 200. They show the structural
Hamming distance (SHD) between the estimated and the true Markov equivalence
class (left), as well as lower and upper bounds for the SID (right). (By the defini-
tion of lower and upper bounds of the SID, the SID between the true and estimated
DAG lies in between those values.) The proposed method has a disadvantage be-
cause it uses nonlinear regression instead of linear regression. The performance
FIG. 7. Data are generated by a linear Gaussian SEM. SHD between true and estimated CPDAG (left), lower and upper bounds for SID between true DAG and estimated CPDAG (right).
for different values of ω ∈ [0, 1] and Gaussian noise. Both $f_{j,k}$ and $f_j$ are drawn from a Gaussian process using an RBF kernel with bandwidth one. Note that ω = 1 corresponds to the fully additive model (3), whereas for ω = 0, the value of $X_j$ is given as a nonadditive function of all its parents. Figure 9 shows the result for a
FIG. 8. SHD (top) and SID (bottom) for p = 25 and n = 300 in the case of misspecified models. The plot shows deviations of the noise from a normal distribution (only γ = 1 corresponds to Gaussian noise).
FIG. 9. SHD (left) and SID (right) for p = 25 and n = 300 in the case of misspecified models. The plot shows deviations from additivity for sparse (top) and nonsparse (bottom) truths, respectively (only ω = 1 corresponds to a fully additive model).
sparse truth with an expected number of p edges (top) and a nonsparse truth with an expected number of 4p edges (bottom). In sparse DAGs, many nodes have a small number of parents, and our algorithm yields a comparably small SID even if the model contains nonadditive functions. If the underlying truth is nonsparse, the performance of our algorithm becomes worse, but it is still slightly better than PC, which achieves average lower bounds of SID values of roughly 520, both for ω = 1 and for ω = 0 (not shown).
FIG. 10. Gene expressions in isoprenoid pathways. The twenty best scoring edges provided by the method CAM.
7.1. Extensions. The estimation principle of first pursuing order search based
on nonregularized maximum likelihood and then using penalized regression for
FIG. 11. Gene expressions in isoprenoid pathways. Edges estimated by stability selection: all directions are in correspondence with the direction of the pathways.
feature selection works with other structural equation models where the underly-
ing DAG is identifiable from the observational distribution. Closely related exam-
ples include nonlinearly transformed additive structural equation models [46] or
Gaussian structural equation models with same error variances [25].
If the DAG D is nonidentifiable from the distribution P, the methodology needs to be adapted; see also Remark 1 considering the linear Gaussian SEM. The true orders $\Pi^0$ can be defined as the set of permutations which lead to the sparsest autoregressive representations as in (5): assuming faithfulness, these orders correspond to the Markov equivalence class of the underlying DAG. Therefore, for estimation, we should use regularized maximum likelihood estimation leading to sparse solutions with, for example, the $\ell_0$-penalty [6, 38].
Finally, it would be very interesting to extend (sparse) permutation search to
(possibly nonidentifiable) models with hidden variables [7, 11, 23, 34] or with
graph structures allowing for cycles [19, 20, 30, 33]. Note that unlike linear Gaussian models, CAMs are not closed under marginalization: if (X, Y, Z) follows a CAM (1), then the marginal distribution of (X, Y) does not necessarily remain in the class of CAMs.
SUPPLEMENTARY MATERIAL
Supplement to “CAM: Causal additive models, high-dimensional order
search and penalized regression” (DOI: 10.1214/14-AOS1260SUPP; .pdf). This
supplemental article [3] contains all proofs.
REFERENCES
[1] Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple
regression and correlation. J. Amer. Statist. Assoc. 80 580–619. With discussion and with
a reply by the authors. MR0803258
[2] Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction
and model fitting. Statist. Sci. 22 477–505. MR2420454
[3] Bühlmann, P., Peters, J. and Ernest, J. (2014). Supplement to “CAM: Causal addi-
tive models, high-dimensional order search and penalized regression.” DOI:10.1214/14-
AOS1260SUPP.
[4] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods,
Theory and Applications. Springer, Heidelberg. MR2807761
[5] Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification.
J. Amer. Statist. Assoc. 98 324–339. MR1995709
[6] Chickering, D. M. (2002). Optimal structure identification with greedy search. J. Mach.
Learn. Res. 3 507–554. MR1991085
[7] Colombo, D., Maathuis, M. H., Kalisch, M. and Richardson, T. S. (2012). Learning
high-dimensional directed acyclic graphs with latent and selection variables. Ann. Statist.
40 294–321. MR3014308
[8] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
[9] Hothorn, T., Bühlmann, P., Kneib, T., Schmid, M. and Hofner, B. (2010). Model-
based boosting 2.0. J. Mach. Learn. Res. 11 2109–2113. MR2719848
[10] Imoto, S., Goto, T. and Miyano, S. (2002). Estimation of genetic networks and functional
structures between genes by using Bayesian network and nonparametric regression. In
Proceedings of the Pacific Symposium on Biocomputing (PSB) 175–186. Lihue, HI.
[11] Janzing, D., Peters, J., Mooij, J. M. and Schölkopf, B. (2009). Identifying confounders
using additive noise models. In Proceedings of the 25th Annual Conference on Uncer-
tainty in Artificial Intelligence (UAI) 249–257. AUAI Press, Corvallis, OR.
[12] Lauritzen, S. L. (1996). Graphical Models. Oxford Univ. Press, New York. MR1419991
[13] Loh, P. and Bühlmann, P. (2013). High-dimensional learning of linear causal net-
works via inverse covariance estimation. J. Mach. Learn. Res. To appear. Available at
arXiv:1311.3492.
[14] Mammen, E. and Park, B. U. (2006). A simple smooth backfitting method for additive
models. Ann. Statist. 34 2252–2271. MR2291499
[15] Marra, G. and Wood, S. N. (2011). Practical variable selection for generalized additive
models. Comput. Statist. Data Anal. 55 2372–2387. MR2786996
[16] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive model-
ing. Ann. Statist. 37 3779–3821. MR2572443
[17] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selec-
tion with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[18] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat.
Methodol. 72 417–473. MR2758523
[19] Mooij, J. and Heskes, T. (2013). Cyclic causal discovery from continuous equilibrium data.
In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence
(UAI) 431–439. AUAI Press, Corvallis, OR.
[20] Mooij, J., Janzing, D., Heskes, T. and Schölkopf, B. (2011). On causal discovery with
cyclic additive noise models. In Advances in Neural Information Processing Systems 24
(NIPS) 639–647. Curran, Red Hook, NY.
[21] Mooij, J., Janzing, D., Peters, J. and Schölkopf, B. (2009). Regression by dependence
minimization and its application to causal inference. In Proceedings of the 26th Interna-
tional Conference on Machine Learning (ICML) 745–752. ACM, New York.
[22] Nowzohour, C. and Bühlmann, P. (2013). Score-based causal learning in additive noise
models. Available at arXiv:1311.6359.
[23] Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge Univ. Press, Cam-
bridge. MR1744773
[24] Peters, J. and Bühlmann, P. (2013). Structural intervention distance (SID) for evaluating
causal graphs. Neural Comput. To appear. Available at arXiv:1306.1043.
[25] Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models
with equal error variances. Biometrika 101 219–228. MR3180667
[26] Peters, J., Mooij, J., Janzing, D. and Schölkopf, B. (2014). Causal discovery with
continuous additive noise models. J. Mach. Learn. Res. 15 2009–2053.
[27] Ramsey, J., Zhang, J. and Spirtes, P. (2006). Adjacency-faithfulness and conservative
causal inference. In Proceedings of the 22nd Annual Conference on Uncertainty in Artifi-
cial Intelligence (UAI) 401–408. AUAI Press, Corvallis, OR.
[28] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive mod-
els. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 1009–1030. MR2750255
[29] Rényi, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hung. 10 441–451
(unbound insert). MR0115203
[30] Richardson, T. (1996). A discovery algorithm for directed cyclic graphs. In Uncertainty in
Artificial Intelligence (Portland, OR, 1996) 454–461. Morgan Kaufmann, San Francisco,
CA. MR1617227
[31] Schmidt, M., Niculescu-Mizil, A. and Murphy, K. (2007). Learning graphical model
structure using L1-regularization paths. In Proceedings of the National Conference on
Artificial Intelligence 22 1278. AAAI Press, Menlo Park, CA.
[32] Shimizu, S., Hoyer, P. O., Hyvärinen, A. and Kerminen, A. (2006). A linear non-
Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7 2003–2030.
MR2274431
[33] Spirtes, P. (1995). Directed cyclic graphical representations of feedback models. In Proceed-
ings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI) 491–499. Mor-
gan Kaufmann, San Francisco, CA.
[34] Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search,
2nd ed. MIT Press, Cambridge, MA. MR1815675
[35] Teyssier, M. and Koller, D. (2005). Ordering-based search: A simple and effective algo-
rithm for learning Bayesian networks. In Proceedings of the 21st Conference on Uncer-
tainty in Artificial Intelligence (UAI) 584–590. AUAI Press, Corvallis, OR.
[36] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc.
Ser. B 58 267–288. MR1379242
[37] van de Geer, S. (2014). On the uniform convergence of empirical norms and inner products,
with application to causal inference. Electron. J. Stat. 8 543–574. MR3211024
[38] van de Geer, S. and Bühlmann, P. (2013). ℓ0-penalized maximum likelihood for sparse
directed acyclic graphs. Ann. Statist. 41 536–567. MR3099113
[39] van de Geer, S., Bühlmann, P. and Zhou, S. (2011). The adaptive and the thresholded
Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J.
Stat. 5 688–749. MR2820636
[40] Voorman, A., Shojaie, A. and Witten, D. (2014). Graph estimation with joint additive
models. Biometrika 101 85–101. MR3180659
[41] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recov-
ery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55
2183–2202. MR2729873
[42] Wille, A., Zimmermann, P., Vranová, E., Fürholz, A., Laule, O., Bleuler, S.,
Hennig, L., Prelic, A., von Rohr, P., Thiele, L., Zitzler, E., Gruissem, W.
and Bühlmann, P. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene
network in Arabidopsis thaliana. Genome Biol. 5 R92.
[43] Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman &
Hall/CRC, Boca Raton, FL. MR2206355
[44] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
[45] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the Lasso selection in high-
dimensional linear regression. Ann. Statist. 36 1567–1594. MR2435448
[46] Zhang, K. and Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal
model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence
(UAI) 647–655. AUAI Press, Corvallis, OR.
[47] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res.
7 2541–2563. MR2274449
[48] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–
1429. MR2279469