Causal Discovery with Attention-Based Convolutional Neural Networks
Figure 1. A temporal causal graph learnt from multivariate observational time series data. A graph node models one time series. A directed edge denotes a causal relationship and is annotated with the time delay between cause and effect.
Figure 2. Temporal causal graphs showing causal relationships and delays between cause and effect.
Figure 3. Overview of the Temporal Causal Discovery Framework (TCDF). With time series data as input, TCDF performs four steps (gray boxes) using the technique described in the white box, and outputs a temporal causal graph.
Figure 4. TCDF with $N$ independent CNNs $\mathcal{N}_1, \dots, \mathcal{N}_n$, all having time series $\mathbf{X}_1, \dots, \mathbf{X}_n$ of length $T$ as input ($N$ is equal to the number of time series in the input data set). $\mathcal{N}_j$ predicts $\mathbf{X}_j$ and outputs, besides the prediction $\hat{\mathbf{X}}_j$, the kernel weights $\mathcal{W}_j$ and attention scores $\mathbf{a}_j$. After attention interpretation, causal validation and delay discovery, TCDF constructs a temporal causal graph.
Figure 5. Dilated TCN to predict $\mathbf{X}_2$, with $L = 3$ hidden layers, kernel size $K = 2$ (shown as arrows) and dilation coefficient $c = 2$, leading to a receptive field $R = 16$. A PReLU activation function is applied after each convolution. To predict the first values (shown as dashed arrows), zero padding is added to the left of the sequence. Weights are shared across layers, indicated by identical colors.
Figure 6. Attention-based Dilated Depthwise Separable Temporal Convolutional Network $\mathcal{N}_2$ to predict target time series $\mathbf{X}_2$. The $N$ channels have $T = 13$ time steps, $L = 1$ hidden layer in the depthwise convolution and $N \times 2$ kernels with kernel size $K = 2$ (denoted by colored blocks). The attention scores $\mathbf{a}$ are multiplied element-wise with the input time series, followed by an element-wise multiplication with the kernel. In the pointwise convolution, all channel outputs are combined to construct the prediction $\hat{\mathbf{X}}_2$.
Figure 7. Threshold $\tau_j$ is set equal to the attention score at the left side of the largest gap $g_k$ where $k \neq 0$ and $k < |\mathbf{G}|/2$. In this example, $\tau_j$ is set equal to the third largest attention score.
Figure 8. How TCDF deals, in theory, with hidden confounders (denoted by squares). A black square indicates that the hidden confounder is discovered by TCDF; a grey square indicates that it is not discovered. Black edges indicate causal relationships that will be included in the learnt temporal causal graph $\mathcal{G}_L$; grey edges will not be included in $\mathcal{G}_L$.
Figure 9. Discovering the delay between cause $\mathbf{X}_1$ and target $\mathbf{X}_2$, both having $T = 16$. Starting from the top convolutional layer, the algorithm traverses the path with the highest kernel weights. Eventually, the algorithm ends in input value $X_1^{10}$, indicating a delay of $16 - 10 = 6$ time steps.
Figure 10. Example datasets and causal graphs: simulation 17 from FMRI (top), graph 20-1A from FINANCE (bottom). A colored line corresponds to one time series (node) in the causal graph.
Figure 11. Adapted ground truth for the hidden confounder experiment, showing graphs 20-1A (left) and 40-1-3 (right) from FINANCE. Only one grey node was removed per experiment.
Figure 12. Example with three variables showing that $\mathcal{G}_L$ has TP = 0, FP = 1 ($e_{1,3}$), TP′ = 1 ($e_{1,3}$), FP′ = 0 and FN = 2 ($e_{1,2}$ and $e_{2,3}$). Therefore, F1 = 0 and F1′ = 0.5.
Abstract
1. Introduction
- We present a new temporal causal discovery method (TCDF) that uses attention-based CNNs to discover causal relationships in time series data, to discover the time delay between each cause and effect, and to construct a temporal causal graph of causal relationships with delays.
- We evaluate TCDF and several other temporal causal discovery methods on two benchmarks: financial data describing stock returns, and FMRI data measuring brain blood flow.
2. Problem Statement
- The method should distinguish direct from indirect causes. A vertex $v_i$ is seen as an indirect cause of $v_j$ if there is no direct edge from $v_i$ to $v_j$ and if there is a two-edge path from $v_i$ to $v_j$ through an intermediate vertex (Figure 2a). Pairwise methods, i.e., methods that only find causal relationships between two variables, are often unable to make this distinction [10]. In contrast, multivariate methods take all variables into account to distinguish between direct and indirect causality [11].
- The method should learn instantaneous causal effects, where the delay between cause and effect is 0 time steps. Neglecting instantaneous influences can lead to misleading interpretations [13]. In practice, instantaneous effects mostly occur when cause and effect refer to the same time step and cannot be causally ordered a priori, because the measured time scale is too coarse.
- The presence of a confounder, a common cause of at least two variables, is a well-known challenge for causal discovery methods (Figure 2b). Although confounders are quite common in real-world situations, they complicate causal discovery: the confounder's effects (the two affected variables in Figure 2b) are correlated, but not causally related. Especially when the delays between the confounder and its effects are not equal, one should be careful not to incorrectly include a causal relationship between the confounder's effects (the grey edge in Figure 2b).
- A particular challenge occurs when a confounder is not observed (a hidden (or latent) confounder). Although it might not even be known how many hidden confounders exist, it is important that a causal discovery method can hypothesise the existence of a hidden confounder to prevent learning an incorrect causal relation between its effects.
3. Related Work
3.1. Temporal Causal Discovery
3.2. Deep Learning for Non-Temporal Causal Discovery
3.3. Time Series Prediction
3.4. Attention Mechanism in Neural Networks
4. TCDF—Temporal Causal Discovery Framework
4.1. The Architecture for Time Series Prediction
4.1.1. Dilations
4.1.2. Adaptation for Discovering Self-Causation
4.1.3. Adaptation for Multivariate Causal Discovery
4.1.4. The Attention Mechanism
4.1.5. Residual Connections
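Taken together, the components of Section 4.1 (dilated causal convolutions, depthwise separability, channel-wise attention, PReLU) can be sketched in a few lines of PyTorch. This is a hedged reconstruction, not the authors' reference implementation: weight sharing across layers (Figure 5) and the residual connections of Section 4.1.5 are omitted for brevity, and all names are illustrative. The final two lines verify the receptive field claimed in Figure 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADDSTCN(nn.Module):
    """Sketch of one attention-based dilated depthwise separable TCN N_j."""
    def __init__(self, n_series, n_hidden=1, kernel_size=2, dilation_c=2):
        super().__init__()
        self.attention = nn.Parameter(torch.ones(n_series, 1))  # scores start at 1
        # depthwise causal convolutions: one kernel stack per input channel
        self.depthwise = nn.ModuleList([
            nn.Conv1d(n_series, n_series, kernel_size,
                      dilation=dilation_c ** l, groups=n_series)
            for l in range(n_hidden + 1)])                      # input + hidden layers
        self.prelu = nn.PReLU()
        self.pointwise = nn.Conv1d(n_series, 1, kernel_size=1)  # combine channels
        self.K, self.c = kernel_size, dilation_c

    def forward(self, x):                         # x: (batch, n_series, T)
        h = x * self.attention                    # element-wise attention on input
        for l, conv in enumerate(self.depthwise):
            pad = (self.K - 1) * self.c ** l      # left zero padding => causal
            h = self.prelu(conv(F.pad(h, (pad, 0))))
        return self.pointwise(h)                  # prediction of X_j: (batch, 1, T)

net = ADDSTCN(n_series=3, n_hidden=3)
print(net(torch.randn(1, 3, 16)).shape)           # torch.Size([1, 1, 16])

# Receptive field of Figure 5: K = 2, c = 2, L = 3 hidden layers
K, c, L = 2, 2, 3
print(1 + (K - 1) * sum(c ** l for l in range(L + 1)))  # R = 16
```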
4.2. Attention Interpretation
- We require that $\tau_j \geq 1$, since all attention scores are initialized at 1 and a score will only be increased through backpropagation if the network attends to that time series.
- Since a temporal causal graph is usually sparse, we require that the gap selected for $\tau_j$ lies in the first half of the sorted gap list $\mathbf{G}$ (i.e., $k < |\mathbf{G}|/2$), to ensure that the algorithm does not include low attention scores in the selection. At most 50% of the input time series can thus be labeled a potential cause of target $\mathbf{X}_j$. Although this fraction can be configured, we experimentally estimated that 50% gives good results.
- We require that the gap for $\tau_j$ cannot be in the first position, i.e., $k \neq 0$ (between the highest and second-highest attention score). This ensures that the algorithm does not truncate to zero the scores of time series that were actually a cause of the target time series, but were weaker than the top scorer. Thus, the set of potential causes of target $\mathbf{X}_j$ will contain at least two time series. A sketch of this gap-based selection follows this list.
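A minimal sketch of the gap-based threshold selection described above (the $\tau_j \geq 1$ requirement is omitted for brevity; the function name, tie-breaking, and example scores are illustrative assumptions, not the paper's implementation):

```python
def attention_threshold(scores):
    """Pick tau_j: the score at the left of the largest gap g_k, k != 0, k < |G|/2."""
    s = sorted(scores, reverse=True)                      # descending attention scores
    gaps = [s[k] - s[k + 1] for k in range(len(s) - 1)]   # g_k between ranks k, k+1
    # admissible gaps: not the first gap, and in the first half of G
    admissible = [k for k in range(1, len(gaps)) if k < len(gaps) / 2]
    if not admissible:                                    # too few series: keep all
        return min(s)
    k_star = max(admissible, key=lambda k: gaps[k])       # largest admissible gap
    return s[k_star]                                      # score left of that gap

scores = [2.1, 1.9, 1.8, 1.1, 1.0, 0.9]   # hypothetical attention scores of N_j
tau = attention_threshold(scores)          # 1.8: the third largest, as in Figure 7
print([i for i, a in enumerate(scores) if a >= tau])  # indices of potential causes
```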
Writing $\mathcal{P}_j$ for the set of potential causes of target $\mathbf{X}_j$ selected this way, TCDF distinguishes four cases for each pair of time series $\mathbf{X}_i$ and $\mathbf{X}_j$:

- $\mathbf{X}_i \notin \mathcal{P}_j$ and $\mathbf{X}_j \notin \mathcal{P}_i$: $\mathbf{X}_i$ is not correlated with $\mathbf{X}_j$ and vice versa.
- $\mathbf{X}_i \in \mathcal{P}_j$ and $\mathbf{X}_j \notin \mathcal{P}_i$: edge $e_{i,j}$ is added to the candidate relationships, since $\mathbf{X}_i$ is a potential cause of $\mathbf{X}_j$, because of:
- (a)
- An (in)direct causal relation from $\mathbf{X}_i$ to $\mathbf{X}_j$, or
- (b)
- Presence of a (hidden) confounder of $\mathbf{X}_i$ and $\mathbf{X}_j$ where the delay from the confounder to $\mathbf{X}_i$ is smaller than the delay to $\mathbf{X}_j$.
- $\mathbf{X}_j \in \mathcal{P}_i$ and $\mathbf{X}_i \notin \mathcal{P}_j$: edge $e_{j,i}$ is added, since $\mathbf{X}_j$ is a potential cause of $\mathbf{X}_i$, because of:
- (a)
- An (in)direct causal relation from $\mathbf{X}_j$ to $\mathbf{X}_i$, or
- (b)
- Presence of a (hidden) confounder of $\mathbf{X}_j$ and $\mathbf{X}_i$ where the delay from the confounder to $\mathbf{X}_j$ is smaller than the delay to $\mathbf{X}_i$.
- $\mathbf{X}_i \in \mathcal{P}_j$ and $\mathbf{X}_j \in \mathcal{P}_i$: both $e_{i,j}$ and $e_{j,i}$ are added, because of:
- (a)
- Presence of a 2-cycle where $\mathbf{X}_i$ causes $\mathbf{X}_j$ and $\mathbf{X}_j$ causes $\mathbf{X}_i$, or
- (b)
- Presence of a (hidden) confounder with equal delays to its effects $\mathbf{X}_i$ and $\mathbf{X}_j$.
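The bookkeeping of this case analysis is small; the following is a hedged sketch under the notation above (the set name $\mathcal{P}_j$ is ours, and the function is illustrative, not TCDF's code):

```python
def candidate_edges(potential_causes):
    """potential_causes[j]: indices whose attention score in N_j exceeded tau_j.

    Returns directed candidate edges (i, j), meaning 'i potentially causes j'.
    Mutual membership yields both (i, j) and (j, i): a possible 2-cycle or an
    equal-delay confounder, to be resolved by causal validation.
    """
    return {(i, j) for j, causes in potential_causes.items()
            for i in causes if i != j}

# Hypothetical outcome for three series: N_1 attends to 0 and 2, N_2 to 1
print(sorted(candidate_edges({0: set(), 1: {0, 2}, 2: {1}})))
# [(0, 1), (1, 2), (2, 1)]  -> 1 and 2 are mutual: 2-cycle or confounder
```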
4.3. Causal Validation
- Temporal precedence: the cause precedes its effect,
- Physical influence: manipulation of the cause changes its effect.
4.3.1. Permutation Importance Validation Method
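As a hedged sketch of permutation-importance-style validation (in the spirit of Breiman [53]): a potential cause is kept only if randomly permuting its time series, which preserves the value distribution but destroys the temporal structure, clearly worsens the prediction of the target. The acceptance margin and all names below are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def pivm_validate(predict, loss_fn, X, target, cause, rel_increase=0.1, seed=0):
    """Keep `cause` only if permuting it increases the target's prediction loss.

    predict: callable mapping an array (n_series, T) to a predicted target (T,).
    """
    base = loss_fn(predict(X), X[target])               # loss with intact input
    Xp = X.copy()
    Xp[cause] = np.random.default_rng(seed).permutation(X[cause])  # shuffle cause
    permuted = loss_fn(predict(Xp), X[target])          # loss with shuffled cause
    return permuted > base * (1.0 + rel_increase)       # illustrative margin

# Toy usage with a hypothetical predictor: series 2 lags series 0 by one step
X = np.random.default_rng(1).normal(size=(3, 100))
X[2] = np.roll(X[0], 1) + 0.1 * np.random.default_rng(2).normal(size=100)
predict = lambda Z: np.roll(Z[0], 1)                    # stand-in for network N_2
mse = lambda yhat, y: float(np.mean((yhat - y) ** 2))
print(pivm_validate(predict, mse, X, target=2, cause=0))  # True: 0 matters for 2
```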
4.3.2. Dealing with Hidden Confounders
4.4. Delay Discovery
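The traversal in Figure 9 can be sketched as follows. This is a hedged reconstruction under simplifying assumptions (one kernel weight vector per layer, kernel size $K = 2$, dilation doubling per layer), whereas TCDF traverses the actual kernel weights of network $\mathcal{N}_j$ per input channel:

```python
import numpy as np

def discover_delay(kernel_weights, kernel_size=2, dilation_c=2):
    """From the top layer down, follow the kernel tap with the highest weight;
    each step onto a 'past' tap at a layer with dilation d moves d steps back.

    kernel_weights: one weight vector per layer, top layer first; tap index
    K-1 is the current time step, lower indices reach further into the past.
    """
    n_layers = len(kernel_weights)
    delay = 0
    for depth, w in enumerate(kernel_weights):
        d = dilation_c ** (n_layers - 1 - depth)   # top layer has largest dilation
        k = int(np.argmax(w))                      # heaviest tap on this layer
        delay += (kernel_size - 1 - k) * d         # past taps add d per step back
    return delay

# Figure 9 example: T = 16, four layers with dilations 8, 4, 2, 1 (top to bottom);
# the heaviest path takes the past tap at dilations 4 and 2 -> delay 4 + 2 = 6,
# i.e., the path ends at input position 16 - 6 = 10.
w = [np.array([0.1, 0.9]), np.array([0.8, 0.2]),
     np.array([0.7, 0.3]), np.array([0.2, 0.8])]
print(discover_delay(w))  # 6
```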
5. Experiments
5.1. Data Sets
5.2. Experimental Setup
5.3. Evaluation Measures
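Figure 12 makes the distinction between the two scores concrete. A worked sketch of that example, assuming the ground truth is the chain $1 \to 2 \to 3$ (which the stated FN and TP′ counts imply): the learnt edge $e_{1,3}$ is a false positive under F1 but, since it matches the indirect two-edge path, a true positive under F1′.

```python
def f1(tp, fp, fn):
    """Standard F1 = 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

# Figure 12: learnt graph = {e_{1,3}}; assumed ground truth = {e_{1,2}, e_{2,3}}
tp, fp, fn = 0, 1, 2           # e_{1,3} wrong as a direct edge; two edges missed
tp2, fp2 = tp + 1, fp - 1      # F1': e_{1,3} matches the indirect path 1->2->3
print(f1(tp, fp, fn), f1(tp2, fp2, fn))  # 0.0 0.5
```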
5.4. Results
5.4.1. Overall Performance
5.4.2. Impact of the Causal Validation
5.4.3. Case Study: Detection of Hidden Confounders
5.5. Summary
6. Discussion
6.1. Hyperparameters
6.2. Limitations of Experiments
7. Summary and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Kleinberg, S. Why: A Guide to Finding and Using Causes; O'Reilly: Springfield, MA, USA, 2015.
2. Kleinberg, S. Causality, Probability, and Time; Cambridge University Press: Cambridge, UK, 2013.
3. Zorzi, M.; Sepulchre, R. AR Identification of Latent-Variable Graphical Models. IEEE Trans. Autom. Control 2016, 61, 2327–2340.
4. Spirtes, P. Introduction to causal inference. J. Mach. Learn. Res. 2010, 11, 1643–1662.
5. Zhang, K.; Schölkopf, B.; Spirtes, P.; Glymour, C. Learning causality and causality-related learning: Some recent progress. Natl. Sci. Rev. 2017, 5, 26–29.
6. Danks, D. The Psychology of Causal Perception and Reasoning. In The Oxford Handbook of Causation; Beebee, H., Hitchcock, C., Menzies, P., Eds.; Oxford University Press: Oxford, UK, 2009; Chapter 21; pp. 447–470.
7. Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B.Y.; Kankanhalli, M. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; ACM: New York, NY, USA, 2018; p. 582.
8. Runge, J.; Sejdinovic, D.; Flaxman, S. Detecting causal associations in large nonlinear time series datasets. arXiv 2017, arXiv:1702.07007.
9. Huang, Y.; Kleinberg, S. Fast and Accurate Causal Inference from Time Series Data. In Proceedings of the FLAIRS Conference, Hollywood, FL, USA, 18–20 May 2015; pp. 49–54.
10. Hu, M.; Liang, H. A copula approach to assessing Granger causality. NeuroImage 2014, 100, 125–134.
11. Papana, A.; Kyrtsou, C.; Kugiumtzis, D.; Diks, C. Detecting causality in non-stationary time series using partial symbolic transfer entropy: Evidence in financial data. Comput. Econ. 2016, 47, 341–365.
12. Müller, B.; Reinhardt, J.; Strickland, M.T. Neural Networks: An Introduction; Springer: Berlin/Heidelberg, Germany, 2012.
13. Hyvärinen, A.; Shimizu, S.; Hoyer, P.O. Causal modelling combining instantaneous and lagged effects: An identifiable model based on non-Gaussianity. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 424–431.
14. Malinsky, D.; Danks, D. Causal discovery algorithms: A practical guide. Philos. Compass 2018, 13, e12470.
15. Quinn, C.J.; Coleman, T.P.; Kiyavash, N.; Hatsopoulos, N.G. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J. Comput. Neurosci. 2011, 30, 17–44.
16. Gevers, M.; Bazanella, A.S.; Parraga, A. On the identifiability of dynamical networks. IFAC-PapersOnLine 2017, 50, 10580–10585.
17. Friston, K.; Moran, R.; Seth, A.K. Analysing connectivity with Granger causality and dynamic causal modelling. Curr. Opin. Neurobiol. 2013, 23, 172–178.
18. Peters, J.; Janzing, D.; Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms; MIT Press: Cambridge, MA, USA, 2017.
19. Papana, A.; Kyrtsou, K.; Kugiumtzis, D.; Diks, C. Identifying Causal Relationships in Case of Non-Stationary Time Series; Technical Report; Universiteit van Amsterdam: Amsterdam, The Netherlands, 2014.
20. Chu, T.; Glymour, C. Search for additive nonlinear time series causal models. J. Mach. Learn. Res. 2008, 9, 967–991.
21. Entner, D.; Hoyer, P.O. On causal discovery from time series data using FCI. In Proceedings of the Fifth European Workshop on Probabilistic Graphical Models, Helsinki, Finland, 13–15 September 2010; pp. 121–128.
22. Peters, J.; Janzing, D.; Schölkopf, B. Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2013; pp. 154–162.
23. Jiao, J.; Permuter, H.H.; Zhao, L.; Kim, Y.H.; Weissman, T. Universal estimation of directed information. IEEE Trans. Inf. Theory 2013, 59, 6220–6242.
24. Granger, C.W. Investigating causal relations by econometric models and cross-spectral methods. Econom. J. Econom. Soc. 1969, 37, 424–438.
25. Chen, Y.; Bressler, S.L.; Ding, M. Frequency decomposition of conditional Granger causality and application to multivariate neural field potential data. J. Neurosci. Methods 2006, 150, 228–237.
26. Zorzi, M.; Chiuso, A. Sparse plus low rank network identification: A nonparametric approach. Automatica 2017, 76, 355–366.
27. Marinazzo, D.; Pellicoro, M.; Stramaglia, S. Kernel method for nonlinear Granger causality. Phys. Rev. Lett. 2008, 100, 144103.
28. Luo, Q.; Ge, T.; Grabenhorst, F.; Feng, J.; Rolls, E.T. Attention-dependent modulation of cortical taste circuits revealed by Granger causality with signal-dependent noise. PLoS Comput. Biol. 2013, 9, e1003265.
29. Spirtes, P.; Zhang, K. Causal discovery and inference: Concepts and recent methodological advances. In Applied Informatics; Springer: Berlin, Germany, 2016; Volume 3, p. 3.
30. Spirtes, P.; Glymour, C.N.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2000.
31. Liu, Y.; Aviyente, S. The relationship between transfer entropy and directed information. In Proceedings of the Statistical Signal Processing Workshop (SSP), Ann Arbor, MI, USA, 5–8 August 2012; pp. 73–76.
32. Guo, T.; Lin, T.; Lu, Y. An Interpretable LSTM Neural Network for Autoregressive Exogenous Model. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
33. Louizos, C.; Shalit, U.; Mooij, J.M.; Sontag, D.; Zemel, R.; Welling, M. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 6446–6456.
34. Goudet, O.; Kalainathan, D.; Caillou, P.; Guyon, I.; Lopez-Paz, D.; Sebag, M. Causal Generative Neural Networks. arXiv 2018, arXiv:1711.08936v2.
35. Kalainathan, D.; Goudet, O.; Guyon, I.; Lopez-Paz, D.; Sebag, M. SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning. arXiv 2018, arXiv:1803.04929.
36. Bai, S.; Kolter, J.Z.; Koltun, V. Convolutional Sequence Modeling Revisited. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
37. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166.
38. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1243–1252.
39. Van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 4790–4798.
40. Borovykh, A.; Bohte, S.; Oosterlee, C.W. Conditional time series forecasting with convolutional neural networks. In Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence; Springer: Berlin, Germany, 2017; pp. 729–730.
41. Binkowski, M.; Marti, G.; Donnat, P. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. arXiv 2017, arXiv:1703.04122.
42. Walther, D.; Rutishauser, U.; Koch, C.; Perona, P. On the usefulness of attention for object recognition. In Proceedings of the Workshop on Attention and Performance in Computational Vision at ECCV, Prague, Czech Republic, 15 May 2004; pp. 96–103.
43. Yin, W.; Schütze, H.; Xiang, B.; Zhou, B. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Trans. Assoc. Comput. Linguist. 2016, 4, 259–272.
44. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
45. Van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
46. Sifre, L.; Mallat, S. Rigid-Motion Scattering for Image Classification. 2014. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.672.7091&rep=rep1&type=pdf (accessed on 15 October 2018).
47. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
49. Martins, A.; Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1614–1623.
50. Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Wang, S.; Zhang, C. Reinforced Self-Attention Network: A Hybrid of Hard and Soft Attention for Sequence Modeling. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, 13–19 July 2018; pp. 4345–4352.
51. Eichler, M. Causal inference in time series analysis. In Causality: Statistical Perspectives and Applications; Wiley: Hoboken, NJ, USA, 2012; pp. 327–354.
52. Woodward, J. Making Things Happen: A Theory of Causal Explanation; Oxford University Press: Oxford, UK, 2005.
53. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
54. Van der Laan, M.J. Statistical inference for variable importance. Int. J. Biostat. 2006, 2.
55. Datta, A.; Sen, S.; Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 23–25 May 2016; pp. 598–617.
56. Janzing, D.; Balduzzi, D.; Grosse-Wentrup, M.; Schölkopf, B. Quantifying causal influences. Ann. Stat. 2013, 41, 2324–2358.
57. Fama, E.F.; French, K.R. The cross-section of expected stock returns. J. Financ. 1992, 47, 427–465.
58. Smith, S.M.; Miller, K.L.; Salimi-Khorshidi, G.; Webster, M.; Beckmann, C.F.; Nichols, T.E.; Ramsey, J.D.; Woolrich, M.W. Network modelling methods for FMRI. NeuroImage 2011, 54, 875–891.
59. Buxton, R.B.; Wong, E.C.; Frank, L.R. Dynamics of blood flow and oxygenation changes during brain activation: The balloon model. Magn. Reson. Med. 1998, 39, 855–864.
60. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
61. Hyndman, R.; Khandakar, Y. Automatic Time Series Forecasting: The forecast Package for R. J. Stat. Softw. 2008, 27, 1–22.
62. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. Available online: http://www.deeplearningbook.org (accessed on 3 December 2018).
63. Rohrer, J.M. Thinking clearly about correlations and causation: Graphical causal models for observational data. Adv. Methods Pract. Psychol. Sci. 2018, 1, 27–42.
| | FINANCE | FMRI | FMRI |
| --- | --- | --- | --- |
| #datasets | 9 | 27 | 6 |
| #non-stationary datasets | 0 | 1 | 0 |
| #variables (time series) | 25 | {5, 10} | |
| #causal relationships | | | |
| time series length | 4000 | 50–5000 (mean: 774) | 1000–5000 (mean: 2867) |
| delays [time steps] | 1–3 | n.a. | n.a. |
| self-causation | ✓ | ✓ | ✓ |
| confounders | ✓ | ✓ | ✓ |
| type of relationship | linear | non-linear | non-linear |
| | FINANCE TEST | FMRI TEST | FMRI TEST |
| --- | --- | --- | --- |
| TCDF () | 0.38 ± 0.09 | 0.84 ± 0.38 | 0.71 ± 0.05 |
| TCDF () | 0.38 ± 0.10 | 1.06 ± 0.49 | 0.72 ± 0.07 |
| TCDF () | 0.40 ± 0.10 | 1.13 ± 0.45 | 0.74 ± 0.08 |
| | FINANCE (9 Data Sets) F1 | F1′ | FMRI (27 Data Sets) F1 | F1′ | FMRI (6 Data Sets) F1 | F1′ |
| --- | --- | --- | --- | --- | --- | --- |
| TCDF () | 0.64 ± 0.06 | 0.77 ± 0.08 | 0.60 ± 0.09 | 0.63 ± 0.09 | 0.68 ± 0.05 | 0.68 ± 0.05 |
| TCDF () | 0.65 ± 0.09 | 0.78 ± 0.10 | 0.58 ± 0.15 | 0.62 ± 0.14 | 0.65 ± 0.13 | 0.68 ± 0.11 |
| TCDF () | 0.64 ± 0.09 | 0.77 ± 0.09 | 0.55 ± 0.13 | 0.63 ± 0.11 | 0.70 ± 0.09 | 0.73 ± 0.08 |
| PCMCI | 0.55 ± 0.22 | 0.56 ± 0.22 | 0.63 ± 0.10 | 0.67 ± 0.11 | 0.67 ± 0.04 | 0.67 ± 0.04 |
| tsFCI | 0.37 ± 0.11 | 0.37 ± 0.12 | 0.49 ± 0.22 | 0.49 ± 0.22 | 0.48 ± 0.28 | 0.48 ± 0.28 |
| TiMINo | 0.13 ± 0.05 | 0.21 ± 0.10 | 0.23 ± 0.12 | 0.37 ± 0.14 | 0.23 ± 0.11 | 0.37 ± 0.15 |
| | TCDF () | PCMCI | tsFCI | TiMINo |
| --- | --- | --- | --- | --- |
| FINANCE | 318 s | 10 s | 93 s | 499 s |
| FMRI | 74 s | 1 s | 1 s | 14 s |
| | TCDF () | TCDF () | TCDF () | PCMCI | tsFCI | TiMINo |
| --- | --- | --- | --- | --- | --- | --- |
| FINANCE | 97.79% ± 2.56 | 96.42% ± 3.68 | 95.49% ± 4.15 | 100.00% ± 0.00 | 98.77% ± 3.49 | n.a. |
| | FINANCE (9 Data Sets) F1 | F1′ | FMRI (27 Data Sets) F1 | F1′ | FMRI (6 Data Sets) F1 | F1′ |
| --- | --- | --- | --- | --- | --- | --- |
| TCDF () | 0.64 ± 0.06 | 0.77 ± 0.08 | 0.60 ± 0.09 | 0.63 ± 0.09 | 0.68 ± 0.05 | 0.68 ± 0.05 |
| TCDF () w/o PIVM | 0.22 ± 0.09 | 0.30 ± 0.13 | 0.60 ± 0.09 | 0.63 ± 0.09 | 0.68 ± 0.05 | 0.68 ± 0.05 |
| Δ (PIVM) | −66% | −61% | 0% | 0% | 0% | 0% |
| Dataset | Hidden Conf. | Effects | Equal Delays | Conf. Discovered | Learnt Causal Relationships |
| --- | --- | --- | --- | --- | --- |
| 20-1A | | | ✓ | ✓ | |
| 40-1-3 | | | ✓ | ✓ | |
| 40-1-3 | | | ✗ | ✗ | |
| 40-1-3 | | | ✓ | ✓ | |
| 40-1-3 | | | ✗ | ✗ | - |
| 40-1-3 | | | ✗ | ✗ | - |
| 40-1-3 | | | ✗ | ✗ | - |
| 40-1-3 | | | ✗ | ✗ | |
| 40-1-3 | | | ✗ | ✗ | - |
| FINANCE HIDDEN | TCDF () | PCMCI | tsFCI | TiMINo |
| --- | --- | --- | --- | --- |
| # Incorrect Causal Relationships | 2 | 0 | 3 | 8 |
| # Discovered Hidden Confounders | 3 | 0 | 0 | 0 |