Abstract
Young children develop sophisticated internal models of the world based on their visual experience. Can such models be learned from a child’s visual experience without strong inductive biases? To investigate this, we train state-of-the-art neural networks on a realistic proxy of a child’s visual experience without any explicit supervision or domain-specific inductive biases. Specifically, we train both embedding models and generative models on 200 hours of headcam video from a single child collected over two years and comprehensively evaluate their performance in downstream tasks using various reference models as yardsticks. On average, the best embedding models perform at a respectable 70% of a high-performance ImageNet-trained model, despite substantial differences in training data. They also learn broad semantic categories and object localization capabilities without explicit supervision, but they are less object-centric than models trained on all of ImageNet. Generative models trained with the same data successfully extrapolate simple properties of partially masked objects, like their rough outline, texture, colour or orientation, but struggle with finer object details. We replicate our experiments with two other children and find remarkably consistent results. Broadly useful high-level visual representations are thus robustly learnable from a sample of a child’s visual experience without strong inductive biases.
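The full evaluation protocol is described in Methods; as a minimal, illustrative sketch of the linear-probing idea used to evaluate frozen embedding models, the Python snippet below freezes a publicly available DINO encoder (a stand-in for the paper's headcam-trained models, which are available from the repository listed under Code availability) and trains only a linear readout. The hub entry point, the 26-way class count (assumed here to match the 'Labeled S' evaluation set) and the hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Illustrative linear probe: a frozen, pretrained encoder followed by a
# trainable linear classifier. The public DINO ViT-S/16 checkpoint stands
# in for the paper's headcam-trained embedding models.
encoder = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # only the linear head is trained

num_classes = 26  # assumed here to match the 'Labeled S' evaluation set
probe = nn.Linear(encoder.embed_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One probe update on a batch of preprocessed images and labels."""
    with torch.no_grad():
        feats = encoder(images)  # (batch, embed_dim) CLS-token embeddings
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the encoder is frozen, probe accuracy directly measures how linearly decodable object categories are from the learned embeddings, which is the sense in which the abstract compares models against ImageNet-trained yardsticks.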
Data availability
Except for SAYCam, all data used in this study are publicly available. Instructions for accessing the public datasets are detailed in Methods. The SAYCam dataset can be accessed by authorized users with an institutional affiliation from the following Databrary repository: https://doi.org/10.17910/b7.564. The ‘Labeled S’ evaluation dataset, which is a subset of SAYCam, is also available from the same repository under the session name ‘Labeled S’.
Code availability
All of our pretrained models (over 70 different models), as well as a variety of tools to use and analyse them, are available from the following public repository: https://github.com/eminorhan/silicon-menagerie (ref. 63). The repository also contains further examples of (1) attention and class activation maps, (2) t-SNE visualizations of embeddings, (3) nearest neighbour retrievals from the embedding models and (4) unconditional and conditional samples from the generative models. The code used for training and evaluating all the models is also publicly available from the same repository.
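As a minimal sketch of the kind of t-SNE embedding visualization listed above (our own illustration, not the repository's tooling), precomputed embeddings can be projected with scikit-learn; the file names below are hypothetical placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Illustrative t-SNE projection of precomputed image embeddings.
# 'embeddings.npy' and 'labels.npy' are hypothetical placeholders for an
# (n_images, embed_dim) embedding array and its integer class labels.
embeddings = np.load('embeddings.npy')
labels = np.load('labels.npy')

coords = TSNE(n_components=2, perplexity=30, init='pca',
              random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap='tab20')
plt.title('t-SNE of frozen-encoder embeddings')
plt.savefig('tsne_embeddings.png', dpi=200)
```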
Change history
11 June 2024
In the version of the article initially published, the name of Cliona O’Doherty was not included in the peer review information for this article, which has now been amended.
References
Bomba, P. & Siqueland, E. The nature and structure of infant form categories. J. Exp. Child Psychol. 35, 294–328 (1983).
Murphy, G. The Big Book of Concepts (MIT, 2002).
Kellman, P. & Spelke, E. Perception of partly occluded objects in infancy. Cogn. Psychol. 15, 483–524 (1983).
Spelke, E., Breinlinger, K., Macomber, J. & Jacobson, K. Origins of knowledge. Psychol. Rev. 99, 605–632 (1992).
Ayzenberg, V. & Lourenco, S. Young children outperform feed-forward and recurrent neural networks on challenging object recognition tasks. J. Vis. 20, 310 (2020).
Huber, L. S., Geirhos, R. & Wichmann, F. A. The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. J. Vis. 23, 4 (2023).
Locke, J. An Essay Concerning Human Understanding (ed. Fraser, A. C.) (Clarendon Press, 1894).
Leibniz, G. New Essays on Human Understanding 2nd edn (eds Remnant, P. & Bennett, J.) (Cambridge Univ. Press, 1996).
Spelke, E. Initial knowledge: six suggestions. Cognition 50, 431–445 (1994).
Markman, E. Categorization and Naming in Children (MIT, 1989).
Merriman, W., Bowman, L. & MacWhinney, B. The mutual exclusivity bias in children’s word learning. Monogr. Soc. Res. Child Dev. 54, 1–132 (1989).
Elman, J., Bates, E. & Johnson, M. Rethinking Innateness: A Connectionist Perspective on Development (MIT, 1996).
Sullivan, J., Mei, M., Perfors, A., Wojcik, E. & Frank, M. SAYCam: a large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind 5, 20–29 (2022).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650–9660 (IEEE, 2021).
Zhou, P. et al. Mugs: a multi-granular self-supervised learning framework. Preprint at https://arxiv.org/abs/2203.14415 (2022).
He, K. et al. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition 15979–15988 (IEEE, 2022).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2020).
Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1492–1500 (IEEE, 2017).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Smaira, L. et al. A short note on the Kinetics-700-2020 human action dataset. Preprint at https://arxiv.org/abs/2010.10864 (2020).
Grauman, K. et al. Ego4D: around the world in 3,000 hours of egocentric video. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 18995–19012 (IEEE, 2022).
Esser, P., Rombach, R. & Ommer, B. Taming transformers for high-resolution image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 12873–12883 (IEEE, 2021).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2921–2929 (IEEE, 2016).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Kuznetsova, A. et al. The Open Images Dataset V4. Int. J. Comput. Vis. 128, 1956–1981 (2020).
Smith, L. & Slone, L. A developmental approach to machine learning? Front. Psychol. 8, 2124 (2017).
Bambach, S., Crandall, D., Smith, L. & Yu, C. Toddler-inspired visual object learning. Adv. Neural Inf. Process. Syst. 31, 1209–1218 (2018).
Zaadnoordijk, L., Besold, T. & Cusack, R. Lessons from infant learning for unsupervised machine learning. Nat. Mach. Intell. 4, 510–520 (2022).
Orhan, E., Gupta, V. & Lake, B. Self-supervised learning through the eyes of a child. Adv. Neural Inf. Process. Syst. 33, 9960–9971 (2020).
Lee, D., Gujarathi, P. & Wood, J. Controlled-rearing studies of newborn chicks and deep neural networks. Preprint at https://arxiv.org/abs/2112.06106 (2021).
Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl Acad. Sci. USA 118, e2014196118 (2021).
Zhuang, C. et al. How well do unsupervised learning algorithms model human real-time and life-long learning? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).
Vong, W. K., Wang, W., Orhan, A. E. & Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science 383, 504–511 (2024).
Locatello, F. et al. Object-centric learning with slot attention. Adv. Neural Inf. Process. Syst. 33, 11525–11538 (2020).
Lillicrap, T., Santoro, A., Marris, L., Akerman, C. & Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 21, 335–346 (2020).
Gureckis, T. & Markant, D. Self-directed learning: a cognitive and computational perspective. Perspect. Psychol. Sci. 7, 464–481 (2012).
Long, B. et al. The BabyView camera: designing a new head-mounted camera to capture children’s early social and visual environments. Behav. Res. Methods https://doi.org/10.3758/s13428-023-02206-1 (2023).
Moore, D., Oakes, L., Romero, V. & McCrink, K. Leveraging developmental psychology to evaluate artificial intelligence. In 2022 IEEE International Conference on Development and Learning (ICDL) 36–41 (IEEE, 2022).
Frank, M. C. Bridging the data gap between children and large language models. Trends Cogn. Sci. 27, 990–992 (2023).
Object stimuli. Brady Lab https://bradylab.ucsd.edu/stimuli/ObjectCategories.zip
Konkle, T., Brady, T., Alvarez, G. & Oliva, A. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. J. Exp. Psychol. Gen. 139, 558 (2010).
Lomonaco, V. & Maltoni, D. CORe50 Dataset. GitHub https://vlomonaco.github.io/core50 (2017).
Lomonaco, V. & Maltoni, D. CORe50: a new dataset and benchmark for continuous object recognition. In Proc. 1st Annual Conference on Robot Learning (eds Levine, S. et al.) 17–26 (PMLR, 2017).
Russakovsky, O. et al. ImageNet Dataset. https://www.image-net.org/download.php (2015).
Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).
Geirhos, R. et al. Partial success in closing the gap between human and machine vision. Adv. Neural Inf. Process. Syst. 34, 23885–23899 (2021).
Geirhos, R. et al. ImageNet OOD Dataset. GitHub https://github.com/bethgelab/model-vs-human (2021).
Mehrer, J., Spoerer, C., Jones, E., Kriegeskorte, N. & Kietzmann, T. An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl Acad. Sci. USA 118, e2011417118 (2021).
Mehrer, J., Spoerer, C., Jones, E., Kriegeskorte, N. & Kietzmann, T. Ecoset Dataset. Hugging Face https://huggingface.co/datasets/kietzmannlab/ecoset (2021).
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2017).
Zhou, B. et al. Places365 Dataset. http://places2.csail.mit.edu (2017).
Pont-Tuset, J. et al. The 2017 DAVIS challenge on video object segmentation. Preprint at https://arxiv.org/abs/1704.00675 (2017).
Pont-Tuset, J. et al. DAVIS-2017 evaluation code, dataset and results. https://davischallenge.org/davis2017/code.html (2017).
Lin, T. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 (eds Fleet, D. et al.) 740–755 (Springer, 2014).
COCO Dataset. https://cocodataset.org/#download (2014).
Jabri, A., Owens, A. & Efros, A. Space-time correspondence as a contrastive random walk. Adv. Neural Inf. Process. Syst. 33, 19545–19560 (2020).
Kinetics-700-2020 Dataset. https://github.com/cvdfoundation/kinetics-dataset#kinetics-700-2020 (2020).
Ego4D Dataset. https://ego4d-data.org/ (2022).
Kingma, D. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
VQGAN resources. GitHub https://github.com/CompVis/taming-transformers (2021).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 30, 6629–6640 (2017).
Orhan, A. E. eminorhan/silicon-menagerie: v1.0.0-alpha. Zenodo https://doi.org/10.5281/zenodo.8322408 (2023).
Acknowledgements
We thank W. K. Vong, A. Tartaglini and M. Ren for helpful discussions and comments on an earlier version of this paper. This work was supported by the DARPA Machine Common Sense program (B.M.L.) and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation and Responsibility for Data Science (B.M.L.).
Author information
Authors and Affiliations
Contributions
A.E.O. and B.M.L. conceptualized and designed the study. A.E.O. implemented the experiments. A.E.O. analysed the results with feedback from B.M.L. A.E.O. wrote the first draft. B.M.L. reviewed and edited the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Rhodri Cusack, Cliona O’Doherty, Masataka Sawayama and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–8 and Tables 1 and 2.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Orhan, A.E., Lake, B.M. Learning high-level visual representations from a child’s perspective without strong inductive biases. Nat Mach Intell 6, 271–283 (2024). https://doi.org/10.1038/s42256-024-00802-0
This article is cited by
- Artificial intelligence tackles the nature–nurture debate. Nature Machine Intelligence (2024)