Abstract
Young children develop sophisticated internal models of the world based on their visual experience. Can such models be learned from a child’s visual experience without strong inductive biases? To investigate this, we train state-of-the-art neural networks on a realistic proxy of a child’s visual experience without any explicit supervision or domain-specific inductive biases. Specifically, we train both embedding models and generative models on 200 hours of headcam video from a single child collected over two years and comprehensively evaluate their performance in downstream tasks using various reference models as yardsticks. On average, the best embedding models perform at a respectable 70% of a high-performance ImageNet-trained model, despite substantial differences in training data. They also learn broad semantic categories and object localization capabilities without explicit supervision, but they are less object-centric than models trained on all of ImageNet. Generative models trained with the same data successfully extrapolate simple properties of partially masked objects, like their rough outline, texture, colour or orientation, but struggle with finer object details. We replicate our experiments with two other children and find remarkably consistent results. Broadly useful high-level visual representations are thus robustly learnable from a sample of a child’s visual experience without strong inductive biases.
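The full evaluation protocol is described in Methods; as a minimal, illustrative sketch of the linear-probing idea used to evaluate frozen embedding models, the Python snippet below freezes a publicly available DINO encoder (a stand-in for the paper's headcam-trained models, which are available from the repository listed under Code availability) and trains only a linear readout. The hub entry point, the 26-way class count (assumed here to match the 'Labeled S' evaluation set) and the hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Illustrative linear probe: a frozen, pretrained encoder followed by a
# trainable linear classifier. The public DINO ViT-S/16 checkpoint stands
# in for the paper's headcam-trained embedding models.
encoder = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # only the linear head is trained

num_classes = 26  # assumed here to match the 'Labeled S' evaluation set
probe = nn.Linear(encoder.embed_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One probe update on a batch of preprocessed images and labels."""
    with torch.no_grad():
        feats = encoder(images)  # (batch, embed_dim) CLS-token embeddings
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the encoder is frozen, probe accuracy directly measures how linearly decodable object categories are from the learned embeddings, which is the sense in which the abstract compares models against ImageNet-trained yardsticks.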
Data availability
Except for SAYCam, all data used in this study are publicly available. Instructions for accessing the public datasets are detailed in Methods. The SAYCam dataset can be accessed by authorized users with an institutional affiliation from the following Databrary repository: https://doi.org/10.17910/b7.564. The ‘Labeled S’ evaluation dataset, which is a subset of SAYCam, is also available from the same repository under the session name ‘Labeled S’.
Code availability
All of our pretrained models (over 70 different models), as well as a variety of tools to use and analyse them, are available from the following public repository: https://github.com/eminorhan/silicon-menagerie (ref. 63). The repository also contains further examples of (1) attention and class activation maps, (2) t-SNE visualizations of embeddings, (3) nearest neighbour retrievals from the embedding models and (4) unconditional and conditional samples from the generative models. The code used for training and evaluating all the models is also publicly available from the same repository.
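As a minimal sketch of the kind of t-SNE embedding visualization listed above (our own illustration, not the repository's tooling), precomputed embeddings can be projected with scikit-learn; the file names below are hypothetical placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Illustrative t-SNE projection of precomputed image embeddings.
# 'embeddings.npy' and 'labels.npy' are hypothetical placeholders for an
# (n_images, embed_dim) embedding array and its integer class labels.
embeddings = np.load('embeddings.npy')
labels = np.load('labels.npy')

coords = TSNE(n_components=2, perplexity=30, init='pca',
              random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap='tab20')
plt.title('t-SNE of frozen-encoder embeddings')
plt.savefig('tsne_embeddings.png', dpi=200)
```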
Change history
11 June 2024
In the version of the article initially published, the name of Cliona O’Doherty was not included in the peer review information for this article, which has now been amended.
References
Bomba, P. & Siqueland, E. The nature and structure of infant form categories. J. Exp. Child Psychol. 35, 294–328 (1983).
Murphy, G. The Big Book of Concepts (MIT, 2002).
Kellman, P. & Spelke, E. Perception of partly occluded objects in infancy. Cogn. Psychol. 15, 483–524 (1983).
Spelke, E., Breinlinger, K., Macomber, J. & Jacobson, K. Origins of knowledge. Psychol. Rev. 99, 605–632 (1992).
Ayzenberg, V. & Lourenco, S. Young children outperform feed-forward and recurrent neural networks on challenging object recognition tasks. J. Vis. 20, 310 (2020).
Huber, L. S., Geirhos, R. & Wichmann, F. A. The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. J. Vis. 23, 4 (2023).
Locke, J. An Essay Concerning Human Understanding (ed. Fraser, A. C.) (Clarendon Press, 1894).
Leibniz, G. New Essays on Human Understanding 2nd edn (eds Remnant, P. & Bennett, J.) (Cambridge Univ. Press, 1996).
Spelke, E. Initial knowledge: six suggestions. Cognition 50, 431–445 (1994).
Markman, E. Categorization and Naming in Children (MIT, 1989).
Merriman, W., Bowman, L. & MacWhinney, B. The mutual exclusivity bias in children’s word learning. Monogr. Soc. Res. Child Dev. 54, 1–132 (1989).
Elman, J., Bates, E. & Johnson, M. Rethinking Innateness: A Connectionist Perspective on Development (MIT, 1996).
Sullivan, J., Mei, M., Perfors, A., Wojcik, E. & Frank, M. SAYCam: a large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind 5, 20–29 (2022).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650–9660 (IEEE, 2021).
Zhou, P. et al. Mugs: a multi-granular self-supervised learning framework. Preprint at https://arxiv.org/abs/2203.14415 (2022).
He, K. et al. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition 15979–15988 (IEEE, 2022).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2020).
Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1492–1500 (IEEE, 2017).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Smaira, L. et al. A short note on the Kinetics-700-2020 human action dataset. Preprint at https://arxiv.org/abs/2010.10864 (2020).
Grauman, K. et al. Ego4D: around the world in 3,000 hours of egocentric video. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 18995–19012 (IEEE, 2022).
Esser, P., Rombach, R. & Ommer, B. Taming transformers for high-resolution image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 12873–12883 (IEEE, 2021).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2921–2929 (IEEE, 2016).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Kuznetsova, A. et al. The Open Images Dataset V4. Int. J. Comput. Vis. 128, 1956–1981 (2020).
Smith, L. & Slone, L. A developmental approach to machine learning? Front. Psychol. 8, 2124 (2017).
Bambach, S., Crandall, D., Smith, L. & Yu, C. Toddler-inspired visual object learning. Adv. Neural Inf. Process. Syst. 31, 1209–1218 (2018).
Zaadnoordijk, L., Besold, T. & Cusack, R. Lessons from infant learning for unsupervised machine learning. Nat. Mach. Intell. 4, 510–520 (2022).
Orhan, E., Gupta, V. & Lake, B. Self-supervised learning through the eyes of a child. Adv. Neural Inf. Process. Syst. 33, 9960–9971 (2020).
Lee, D., Gujarathi, P. & Wood, J. Controlled-rearing studies of newborn chicks and deep neural networks. Preprint at https://arxiv.org/abs/2112.06106 (2021).
Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl Acad. Sci. USA 118, e2014196118 (2021).
Zhuang, C. et al. How well do unsupervised learning algorithms model human real-time and life-long learning? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).
Vong, W. K., Wang, W., Orhan, A. E. & Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science 383, 504–511 (2024).
Locatello, F. et al. Object-centric learning with slot attention. Adv. Neural Inf. Process. Syst. 33, 11525–11538 (2020).
Lillicrap, T., Santoro, A., Marris, L., Akerman, C. & Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 21, 335–346 (2020).
Gureckis, T. & Markant, D. Self-directed learning: a cognitive and computational perspective. Perspect. Psychol. Sci. 7, 464–481 (2012).
Long, B. et al. The BabyView camera: designing a new head-mounted camera to capture children’s early social and visual environments. Behav. Res. Methods https://doi.org/10.3758/s13428-023-02206-1 (2023).
Moore, D., Oakes, L., Romero, V. & McCrink, K. Leveraging developmental psychology to evaluate artificial intelligence. In 2022 IEEE International Conference on Development and Learning (ICDL) 36–41 (IEEE, 2022).
Frank, M. C. Bridging the data gap between children and large language models. Trends Cogn. Sci. 27, 990–992 (2023).
Object stimuli. Brady Lab https://bradylab.ucsd.edu/stimuli/ObjectCategories.zip
Konkle, T., Brady, T., Alvarez, G. & Oliva, A. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. J. Exp. Psychol. Gen. 139, 558 (2010).
Lomonaco, V. & Maltoni, D. CORe50 Dataset. GitHub https://vlomonaco.github.io/core50 (2017).
Lomonaco, V. & Maltoni, D. CORe50: a new dataset and benchmark for continuous object recognition. In Proc. 1st Annual Conference on Robot Learning (eds Levine, S. et al.) 17–26 (PMLR, 2017).
Russakovsky, O. et al. ImageNet Dataset. https://www.image-net.org/download.php (2015).
Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).
Geirhos, R. et al. Partial success in closing the gap between human and machine vision. Adv. Neural Inf. Process. Syst. 34, 23885–23899 (2021).
Geirhos, R. et al. ImageNet OOD Dataset. GitHub https://github.com/bethgelab/model-vs-human (2021).
Mehrer, J., Spoerer, C., Jones, E., Kriegeskorte, N. & Kietzmann, T. An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl Acad. Sci. USA 118, e2011417118 (2021).
Mehrer, J., Spoerer, C., Jones, E., Kriegeskorte, N. & Kietzmann, T. Ecoset Dataset. Hugging Face https://huggingface.co/datasets/kietzmannlab/ecoset (2021).
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2017).
Zhou, B. et al. Places365 Dataset. http://places2.csail.mit.edu (2017).
Pont-Tuset, J. et al. The 2017 DAVIS challenge on video object segmentation. Preprint at https://arxiv.org/abs/1704.00675 (2017).
Pont-Tuset, J. et al. DAVIS-2017 evaluation code, dataset and results. https://davischallenge.org/davis2017/code.html (2017).
Lin, T. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 (eds Fleet, D. et al.) 740–755 (Springer, 2014).
COCO Dataset. https://cocodataset.org/#download (2014).
Jabri, A., Owens, A. & Efros, A. Space-time correspondence as a contrastive random walk. Adv. Neural Inf. Process. Syst. 33, 19545–19560 (2020).
Kinetics-700-2020 Dataset. https://github.com/cvdfoundation/kinetics-dataset#kinetics-700-2020 (2020).
Ego4D Dataset. https://ego4d-data.org/ (2022).
Kingma, D. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
VQGAN resources. GitHub https://github.com/CompVis/taming-transformers (2021).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 30, 6629–6640 (2017).
Orhan, A. E. eminorhan/silicon-menagerie: v1.0.0-alpha. Zenodo https://doi.org/10.5281/zenodo.8322408 (2023).
Acknowledgements
We thank W. K. Vong, A. Tartaglini and M. Ren for helpful discussions and comments on an earlier version of this paper. This work was supported by the DARPA Machine Common Sense program (B.M.L.) and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation and Responsibility for Data Science (B.M.L.).
Author information
Authors and Affiliations
Contributions
A.E.O. and B.M.L. conceptualized and designed the study. A.E.O. implemented the experiments. A.E.O. analysed the results with feedback from B.M.L. A.E.O. wrote the first draft. B.M.L. reviewed and edited the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Rhodri Cusack, Cliona O’Doherty, Masataka Sawayama and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–8 and Tables 1 and 2.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Orhan, A.E., Lake, B.M. Learning high-level visual representations from a child’s perspective without strong inductive biases. Nat Mach Intell 6, 271–283 (2024). https://doi.org/10.1038/s42256-024-00802-0
This article is cited by
- Artificial intelligence tackles the nature–nurture debate. Nature Machine Intelligence (2024)