Computational Similarities Between Visual and Auditory Cortex Studied With Convolutional Neural Networks and Functional MRI

Alex Kell, Massachusetts Institute of Technology


Visual and auditory cortex both support impressively robust invariant recognition abilities, but operate on distinct classes of signals. To what extent are similar computations used across modalities? We examined this question by comparing state-of-the-art computational models to neural data from visual and auditory cortex.

Using recent “deep learning” techniques, we built two hierarchical convolutional neural networks: an auditory network optimized to recognize words from spectrograms and a visual network optimized to categorize objects from images. Each network performed as well as humans on the difficult recognition task on which it was trained. Independently, we measured neural responses to (i) a broad set of natural sounds in human auditory cortex (using fMRI); and (ii) diverse naturalistic images in macaque V4 and IT (using multi-array electrophysiology). We then computed the responses of each network to these same sounds and images and used cross-validated linear regression to determine how well each layer of each model predicted the measured neural responses.
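The abstract specifies only cross-validated linear regression from layer activations to measured responses. The sketch below illustrates one common instantiation of that procedure, using ridge regression evaluated on held-out stimulus folds; the ridge penalty, array shapes, and variable names are assumptions for illustration, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-ins: activations of one network layer to a set of
# stimuli, and fMRI voxel responses to the same stimuli.
n_stimuli, n_features, n_voxels = 165, 1024, 500
layer_activations = rng.standard_normal((n_stimuli, n_features))
voxel_responses = rng.standard_normal((n_stimuli, n_voxels))

def cross_validated_r2(X, Y, n_splits=10):
    """Predict each voxel's responses from one layer's activations with
    ridge regression, evaluated on held-out stimulus splits."""
    preds = np.zeros_like(Y)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = RidgeCV(alphas=np.logspace(-2, 4, 13))
        model.fit(X[train], Y[train])
        preds[test] = model.predict(X[test])
    # Squared correlation between prediction and measurement, per voxel
    r = [np.corrcoef(preds[:, v], Y[:, v])[0, 1] for v in range(Y.shape[1])]
    return np.square(r)

# Running this per layer (and per cortical region) yields the layer-wise
# predictivity profiles compared in the results below.
r2_per_voxel = cross_validated_r2(layer_activations, voxel_responses)
print(f"median held-out r^2 across voxels: {np.median(r2_per_voxel):.3f}")
```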

Each network predicted the cortical responses in its modality well, explaining substantially more variance than alternative leading models. Moreover, for each modality, lower layers of the network better predicted primary cortical responses while higher layers better predicted non-primary cortical responses, suggestive of hierarchical functional organization. Our key finding is that the visual and auditory networks predicted responses equally well in primary auditory cortex and in some nearby non-primary regions (including regions implicated in pitch perception). In contrast, in areas more distant from primary auditory cortex, the auditory network predicted responses substantially better than the visual network.
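One plausible way to formalize "predicted substantially better" is a paired comparison of the two networks' held-out predictivity across voxels within a region, sketched below. The Wilcoxon signed-rank test, the voxel count, and the synthetic r^2 values are all illustrative assumptions; the abstract does not specify the statistical procedure used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in held-out r^2 values per voxel in one non-primary region,
# for the auditory and visual networks (illustrative, not measured data).
r2_auditory_net = rng.uniform(0.15, 0.50, size=300)
r2_visual_net = r2_auditory_net - rng.uniform(0.0, 0.10, size=300)

# Paired nonparametric test of the per-voxel difference in predictivity
stat, p = stats.wilcoxon(r2_auditory_net, r2_visual_net)
print(f"median advantage of auditory net: "
      f"{np.median(r2_auditory_net - r2_visual_net):.3f} (p = {p:.2g})")
```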

Our findings suggest that early stages of sensory cortex could instantiate similar computations across modalities, potentially providing input to subsequent, modality-specific stages of processing. We are currently analyzing how well the auditory network predicts visual cortical responses.

Abstract Author(s): Alexander J.E. Kell, Daniel L.K. Yamins, Sam Norman-Haignere, Darren Seibert, Ha Hong, Jim J. DiCarlo, Josh H. McDermott