Slide Slam I3
A unifying computational account of temporal processing in natural speech across cortex
Shailee Jain1, Vy A. Vo2, Nicole M. Beckage2, Hsiang-Yun Sherry Chien3, Chiadika Obinwa1, Alexander G. Huth1; 1The University of Texas at Austin, 2Intel Labs, 3Johns Hopkins University
In order to understand natural speech, the human cortex must process information at several different timescales. Prior work suggests that cortex processes temporal information through a hierarchy of representations, from short-timescale regions like auditory cortex to longer-timescale regions such as prefrontal cortex. While some computational accounts of timescale phenomena have been proposed, none directly model and predict responses to natural stimuli. Moreover, these accounts collapse data across individuals and brain regions, limiting the models' predictive power and resolution. We propose a computational account that predicts responses to ecologically valid natural language stimuli and implicitly captures the timescale hierarchy, while providing all the benefits of single-subject, single-voxel predictions across cortex. Interpreting the estimated models further suggests a diversity of voxel function across cortex, tied to each region's timescale. We first built a multi-timescale recurrent neural network (MT-RNN) trained as a self-supervised language model. Each unit in the MT-RNN was constrained to integrate temporal information at a fixed timescale. We then extracted representations of the stimulus from each unit of the MT-RNN in order to build multi-timescale encoding models (MT-EMs). MT-EMs were trained with fMRI data from 7 subjects listening to 5 hours of narrative English stories. The encoding model for each voxel learned to predict BOLD responses from the MT-RNN representations using ridge regression. We found that MT-EMs significantly predicted brain responses on a held-out dataset in much of the temporal lobe, precuneus, and prefrontal cortex. We then investigated whether MT-EMs could capture known timescale differences across regions. First, we estimated every voxel's processing timescale from its regression weights on MT-RNN units. This produced fine-grained, single-subject maps of voxel timescales that corroborated previous reports of a temporal hierarchy.
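The pipeline above can be sketched schematically. In this minimal sketch, each fixed-timescale unit is stood in for by a leaky integrator with time constant T (an illustrative simplification, not the MT-RNN's actual architecture), a voxel's encoding model is closed-form ridge regression from unit activations to simulated BOLD, and the voxel's timescale is read out as a weight-magnitude-weighted average of unit timescales. All function names, timescale values, and the specific timescale readout are assumptions for illustration.

```python
import numpy as np

def leaky_units(x, timescales):
    """Integrate a 1-D stimulus with one leaky integrator per timescale.

    Each unit keeps an exponential moving average with decay exp(-1/T),
    so a unit with timescale T (in words) forgets input over ~T steps.
    Stand-in for fixed-timescale RNN units, not the actual MT-RNN.
    """
    decays = np.exp(-1.0 / np.asarray(timescales))
    h = np.zeros(len(decays))
    states = np.empty((len(x), len(decays)))
    for t, xt in enumerate(x):
        h = decays * h + (1.0 - decays) * xt
        states[t] = h
    return states

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression weights: (X'X + alpha*I)^-1 X'y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ y)

def voxel_timescale(weights, timescales):
    """Illustrative readout: unit timescales averaged by |weight|."""
    w = np.abs(weights)
    return np.sum(w * timescales) / np.sum(w)

rng = np.random.default_rng(0)
timescales = np.array([1.0, 4.0, 16.0, 64.0])
stim = rng.standard_normal(2000)              # toy word-level stimulus
feats = leaky_units(stim, timescales)

# Simulate a "slow" voxel driven mostly by the 64-word unit, plus noise.
true_w = np.array([0.1, 0.1, 0.2, 1.0])
bold = feats @ true_w + 0.1 * rng.standard_normal(len(stim))

w_hat = ridge_fit(feats, bold, alpha=1.0)
# The recovered timescale is pulled toward the long-timescale unit.
print(round(voxel_timescale(w_hat, timescales), 1))
```

The same readout applied across voxels would yield a timescale map; in the actual study the features come from MT-RNN units rather than leaky integrators.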
Next, we simulated experimental manipulations from previous neuroimaging studies that also investigated temporal context effects [Lerner 2011; Yeshurun 2017; Chien 2020]. These in silico experiments tested timescale properties both of the MT-RNN itself and of cortical voxels via their MT-EMs. In all experiments, the MT-RNN units exhibited temporal context effects in accordance with their assigned timescale: short-timescale units responded to local, word-level information and retained information for short durations, while long-timescale units encoded global, paragraph-level information and retained information over long durations. To measure temporal context effects in single voxels, we used the MT-EMs to simulate fMRI responses to each experimental manipulation. Despite being trained only to predict BOLD responses during passive listening, the MT-EMs successfully replicated temporal context effects across cortex in nearly every test. While auditory cortex was sensitive to word-level manipulations and integrated information quickly, precuneus and prefrontal cortex integrated information slowly and were sensitive to long-timescale manipulations. Lastly, we probed the MT-RNN representations to infer what linguistic information was encoded in units of different timescales. We found that short-timescale units captured token-level information like part-of-speech or entity type, while longer-timescale units captured discourse-level information like narrative topic. Combined with the implied computational isomorphism between the MT-RNN and cortical processing, these results provide new evidence that brain regions at different timescales may preferentially process different linguistic features.
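The logic of a context-swap test of this kind can be sketched with the same kind of leaky-integrator units (again an illustrative stand-in for the MT-RNN, not the authors' architecture or stimuli): present two stimuli that share their final words but differ in earlier context, and ask how much each unit's final state still reflects the swapped context. All timescale values, trial counts, and the normalization are assumptions for illustration.

```python
import numpy as np

def final_state(x, decays):
    """Leaky integrators: unit with decay d keeps an exponential
    moving average of the input (timescale ~ -1/log(d) words)."""
    h = np.zeros(len(decays))
    for xt in x:
        h = decays * h + (1.0 - decays) * xt
    return h

rng = np.random.default_rng(1)
timescales = np.array([2.0, 8.0, 32.0, 128.0])
decays = np.exp(-1.0 / timescales)
# Each unit's steady-state std under white-noise input, for normalization.
unit_std = np.sqrt((1 - decays) / (1 + decays))

n_trials = 50
effects = np.zeros(len(timescales))
for _ in range(n_trials):
    tail = rng.standard_normal(10)       # last 10 "words" are shared
    ctx_a = rng.standard_normal(500)     # two different prior contexts
    ctx_b = rng.standard_normal(500)
    h_a = final_state(np.concatenate([ctx_a, tail]), decays)
    h_b = final_state(np.concatenate([ctx_b, tail]), decays)
    effects += np.abs(h_a - h_b) / unit_std
effects /= n_trials

# Short-timescale units have forgotten the swapped context by the end of
# the shared tail; long-timescale units still carry it.
print(dict(zip(timescales, effects.round(3))))
```

In the study itself, the analogous comparison is run on MT-RNN unit activations and on MT-EM-simulated voxel responses, which is what lets the manipulation be evaluated separately for every voxel.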