Review Article
Published: 09 July 2019

Towards algorithmic analytics for large-scale datasets

Danilo Bzdok^1,2,3, Thomas E. Nichols^4,5 & Stephen M. Smith^4

Nature Machine Intelligence volume 1, pages 296–306 (2019)

Abstract

The traditional goal of quantitative analytics is to find simple, transparent models that generate explainable insights. In recent years, large-scale data acquisition, enabled for instance by brain scanning and genomic profiling with microarray-type techniques, has prompted a wave of statistical inventions and innovative applications. Here we review some of the main trends in learning from ‘big data’ and provide examples from imaging neuroscience. Our main messages are that modern analysis approaches (1) tame complex data with parameter regularization and dimensionality-reduction strategies, (2) are increasingly backed up by empirical model validations rather than justified by mathematical proofs, (3) will compare against and build on open data and consortium repositories, and (4) often embrace more elaborate, less interpretable models to maximize prediction accuracy.

**Fig. 2: Strongest population mode that links intra-network connectivity patterns and inter-network connectivity patterns.**

**Fig. 3: Relevance of population associations between six brain-imaging modalities and thousands of behavioural phenotypes.**

Author information

Authors and Affiliations

Department of Psychiatry, Psychotherapy and Psychosomatics, RWTH Aachen University, Aachen, Germany
Danilo Bzdok
JARA, Translational Brain Medicine, Aachen, Germany
Danilo Bzdok
Parietal Team, INRIA, Neurospin, CEA Saclay, Gif-sur-Yvette, France
Danilo Bzdok
Wellcome Trust Centre for Integrative Neuroimaging (WIN-FMRIB), University of Oxford, Oxford, UK
Thomas E. Nichols & Stephen M. Smith
Big Data Institute, University of Oxford, Oxford, UK
Thomas E. Nichols

Corresponding author

Correspondence to Danilo Bzdok.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bzdok, D., Nichols, T.E. & Smith, S.M. Towards algorithmic analytics for large-scale datasets. Nat Mach Intell 1, 296–306 (2019). https://doi.org/10.1038/s42256-019-0069-5

Received: 19 April 2018
Accepted: 05 June 2019
Published: 09 July 2019
Issue Date: July 2019
DOI: https://doi.org/10.1038/s42256-019-0069-5
