Official repo for "Brain Mapping with Dense Features: Grounding Cortical Semantic Selectivity in Natural Images With Vision Transformers" (ICLR 2025)
We introduce BrainSAIL (Semantic Attribution and Image Localization), a method for linking neural selectivity with spatially distributed semantic visual concepts in natural scenes. BrainSAIL leverages recent advances in large-scale artificial neural networks, using them to provide insights into the functional topology of the brain. To overcome the challenge presented by the co-occurrence of multiple categories in natural images, BrainSAIL exploits semantically consistent, dense spatial features from pre-trained vision models, building upon their demonstrated ability to robustly predict neural activity. This method derives clean, spatially dense embeddings without requiring any additional training, and employs a novel denoising process that leverages the semantic consistency of images under random augmentations. By unifying the space of whole-image embeddings and dense visual features and then applying voxel-wise encoding models to these features, we enable the identification of specific subregions of each image which drive selectivity patterns in different areas of the higher visual cortex. This provides a powerful tool for dissecting the neural mechanisms that underlie semantic visual processing for natural images. We validate BrainSAIL on cortical regions with known category selectivity, demonstrating its ability to accurately localize and disentangle selectivity to diverse visual concepts. Next, we demonstrate BrainSAIL's ability to characterize high-level visual selectivity to scene properties and low-level visual features such as depth, luminance, and saturation, providing insights into the encoding of complex visual information. Finally, we use BrainSAIL to directly compare the feature selectivity of different brain encoding models across different regions of interest in visual cortex. Our innovative method paves the way for significant advances in mapping and decomposing high-level visual representations in the human brain.
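Below is a minimal, illustrative sketch of the two ideas in the paragraph above: averaging dense patch features over randomly augmented views of an image, and projecting those features onto a voxel-wise linear encoder to obtain an attribution map. It is a toy approximation under stated assumptions (a model that returns a patch-feature grid, a linear voxel encoder), not the repository's actual implementation; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def denoise_dense_features(model, image, n_aug=32, max_shift=8, patch=16):
    """Average dense (per-patch) features over randomly shifted views of one image.

    Assumes `model` maps a (1, 3, h, w) tensor to an (H*W, D) grid of patch
    embeddings with patch size `patch`. The averaging exploits the semantic
    consistency of the image content under small spatial augmentations.
    """
    _, _, h, w = image.shape
    acc = None
    for _ in range(n_aug):
        dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
        shifted = torch.roll(image, shifts=(dy, dx), dims=(2, 3))   # toy augmentation
        feats = model(shifted)                                      # (H*W, D) patch features
        grid = feats.view(h // patch, w // patch, -1)
        # Undo the shift (approximately, in patch units) before accumulating.
        grid = torch.roll(grid, shifts=(round(-dy / patch), round(-dx / patch)), dims=(0, 1))
        acc = grid if acc is None else acc + grid
    return acc / n_aug                                              # denoised dense features

def voxel_attribution_map(dense_feats, voxel_weights):
    """Project per-patch features onto one voxel's linear encoding-model weights.

    dense_feats:   (H, W, D) denoised patch embeddings
    voxel_weights: (D,) weight vector of a linear voxel-wise encoder
    Returns an (H, W) map of how strongly each image location drives the voxel.
    """
    patches = F.normalize(dense_feats, dim=-1)          # unit norm per patch
    return patches @ F.normalize(voxel_weights, dim=0)  # cosine-style attribution
```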
We provide three possible backbones here -- OpenAI CLIP ViT-B/16, Meta's DINOv2 with registers (dinov2_vitb14_reg), and NVIDIA's RADIO v2.5 implementation of SigLIP. Your choice of dense representation should match the backbone you use. For CLIP, we apply the NACLIP adapter. For DINOv2 and SigLIP, we apply the MaskCLIP adapter ("Extract Free Dense Labels from CLIP") modification. A loading sketch for the three backbones is shown below.
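The snippet below is a hedged sketch of how the three backbones can be instantiated. The CLIP and DINOv2 calls follow the public `clip` and `torch.hub` entry points; the RADIO `version` and `adaptor_names` arguments are assumptions on our part, so consult the RADIO repository for the exact signature.

```python
import torch
import clip  # OpenAI CLIP, https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"

# OpenAI CLIP ViT-B/16 (dense features are extracted via the NACLIP adapter, see below).
clip_model, clip_preprocess = clip.load("ViT-B/16", device=device)

# Meta DINOv2 ViT-B/14 with registers.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg").to(device)

# NVIDIA RADIO v2.5, used here for its SigLIP readout; `version` and
# `adaptor_names` are assumed values -- check the RADIO repo for current options.
radio = torch.hub.load("NVlabs/RADIO", "radio_model",
                       version="radio_v2.5-b", adaptor_names="siglip").to(device)
```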
For the CLIP-based pipeline, you will need to download NACLIP and include it in the local path. For SigLIP, we use the RADIO v2.5 implementation since it provides a conventional self-attention readout. For DINOv2, we use the version with registers. When the dense representations are saved, they are multiplied by 10 to reduce quantization error and cast to float16. In practice, you may want to train your brain encoder on unit-norm [CLS] embeddings and apply the same normalization to the dense embeddings (per patch) before generating the attribution maps. If you absolutely need to use non-unit-norm embeddings, we recommend the gap-closing procedure proposed in BrainSCUBA (ICLR 2024).
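A minimal sketch of the storage convention and normalization described above (function and variable names are illustrative, not the repository's API):

```python
import torch
import torch.nn.functional as F

def save_dense_features(dense_feats, path):
    # Scale by 10 before the float16 cast to reduce quantization error.
    torch.save((dense_feats * 10).to(torch.float16), path)

def load_dense_features(path):
    # Undo the scaling and return float32 features.
    return torch.load(path).to(torch.float32) / 10

def normalize_for_attribution(dense_feats, cls_embedding):
    # If the brain encoder was trained on unit-norm [CLS] embeddings, apply the
    # matching normalization per patch before computing attribution maps.
    patches = F.normalize(dense_feats, dim=-1)    # unit norm per patch
    cls_vec = F.normalize(cls_embedding, dim=-1)  # unit norm [CLS]
    return patches, cls_vec
```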