Incorporating natural language into vision models improves prediction and understanding of higher visual cortex

Authors

Aria Wang

Published

September 29, 2022

Abstract

Advances in neural networks have been catalyzed by joint training on images and natural language, increased dataset sizes, and data diversity. We explored whether the same factors support similar improvements in predicting visual responses in the human brain. We used models pre-trained with Contrastive Language-Image Pre-training (CLIP) - which learns image embeddings that best match text embeddings of image captions from diverse, large-scale datasets - to study visual representations. We built voxelwise encoding models based on CLIP image features to predict brain responses to real-world images. ResNet50 with CLIP explained up to R2 = 79% of variance in individual voxel responses in held-out test data, a significant increase from models trained only with image/label pairs (ImageNet trained ResNet) or text (BERT). Comparisons across different model backbones ruled out network architecture as a factor in performance improvements. Comparisons across models that controlled for dataset size and data diversity demonstrated that language feedback along with data diversity in larger datasets are important factors in explaining neural responses in high-level visual brain regions. Visualizations of model embeddings and Principal Component Analysis (PCA) revealed that our models capture both global and fine-grained semantic dimensions represented within human visual cortex.