Abstract The rapid development and open-source release of highly performant computer vision models offers new potential for examining how different inductive biases impact representation learning and emergent alignment with the high-level human ventral visual system. Here, we assess a diverse set of 224 models, curated to enable controlled comparison of different model properties, testing their brain predictivity using large-scale functional magnetic resonance imaging data. We find that models with qualitatively different architectures (e.g. CNNs versus Transformers) and markedly different task objectives (e.g. purely visual contrastive learning versus vision-language alignment) achieve near equivalent degrees of brain predictivity, when other factors are held constant. Instead, variation across model visual training diets yields the largest, most consistent effect on emergent brain predictivity. Overarching model properties commonly suspected to increase brain predictivity (e.g. greater effective dimensionality; learnable parameter count) were not robust indicators across this more extensive survey. We highlight that standard model-to-brain linear re-weighting methods may be too flexible, as most performant models have very similar brain-predictivity scores, despite significant variation in their underlying representations. Broadly, our findings point to the importance of visual diet, challenge common assumptions about the methods used to link models to brains, and more concretely outline future directions for leveraging the full diversity of existing open-source models as tools to probe the common computational principles underlying biological and artificial visual systems.
This paper's license is marked as closed access or non-commercial and cannot be viewed on ResearchHub. Visit the paper's external site.