
Evaluating effects of phoneme-level and word-level surprisal in continuous speech processing

Poster E22 in Poster Session E, Saturday, October 8, 3:15 - 5:00 pm EDT, Millennium Hall

Anne Marie Crinnion, Christian Brodbeck; University of Connecticut

Effects of sentential context in speech processing have long been thought to reflect sensitivity to word-level surprisal (i.e., how unlikely a word is in context). Evidence for this comes from research on semantic congruency effects using EEG and MEG, in which peaks about 400 ms after word onset (usually referred to as the N400) reflect contextual processing. Surprisal effects reflect the updating of a contextual representation of the input (i.e., a word that is contextually consistent and can be predicted leads to a smaller update of internal representations than a word that is highly surprising). Recent work has shown that brain responses track phoneme-level surprisal, reflecting a finer-grained timescale of sensitivity to sentential and lexical context (Donhauser & Baillet, 2020). It remains an open question, however, whether sensitivity to surprisal in continuous speech is completely incremental (occurring only at the phoneme level) or whether a separate level of representation is updated at a slower rate (i.e., the word level). To answer this question, we used MEG data from Brodbeck et al. (2022), in which participants listened to an audiobook (i.e., continuous speech). We used a multivariate temporal response function (mTRF) approach to predict neural responses from different combinations of predictor variables, evaluating model fit on held-out data. To obtain a rich, human-like estimate of contextual word likelihood, we used predictions from GPT-2, a state-of-the-art language model. First, we aimed to replicate N400 effects in continuous speech with a model including only word-onset and word-surprisal predictors. We observed typical N400 peaks, with surprisal predicting activity in temporal and frontal areas. Next, we modeled phoneme-level and word-level surprisal jointly. Even when accounting for acoustic- and phoneme-level predictors (including phoneme-by-phoneme entropy and surprisal), a model including word-level surprisal still explained more variation in temporal lobe activity. However, the contribution of phoneme-level surprisal was significantly greater than that of word-level surprisal, suggesting that incremental, phoneme-level representations contribute more to the typically studied N400 responses. In other words, accounting for incremental (i.e., phoneme-level) updates of representations captures more variability in the neural data than accounting for word-level updates. However, because a model with both incremental and more global (i.e., word-level) updates outperforms a model with either timescale alone, representations are likely updated at multiple timescales. We then compared the contribution of phoneme predictors derived from GPT-2 and from a 5-gram model. Phoneme predictors from both models together account for more variation in neural activity than either set of predictors alone (i.e., just GPT-2 or just 5-gram). Because GPT-2 predictors reflect a more global context and 5-gram predictors reflect a more constrained local context, these findings suggest that multiple types of information constrain phoneme-level predictions. Overall, this work suggests that incremental phoneme predictors explain patterns of neural responses better than word-level predictors, yet there is evidence for contributions of both. This simultaneous contribution to prediction, along with evidence that both local and global context influence phoneme-level predictions, supports the idea that multiple representations of continuous speech are maintained in parallel.
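
For illustration, the word-level predictor here is surprisal, i.e. -log2 P(word | preceding words), estimated from a language model. The following is a minimal sketch of how per-word surprisal values could be extracted from GPT-2 using the Hugging Face transformers library; it is not the authors' exact pipeline, and the example sentence and variable names are placeholders.

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    text = "the quick brown fox jumps over the lazy dog"  # placeholder, not the audiobook text
    ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, n_tokens)

    with torch.no_grad():
        logits = model(ids).logits  # shape (1, n_tokens, vocab_size)

    # Each position predicts the *next* token, so tokens 1..n-1 receive a
    # surprisal value; the first token has no left context in this snippet.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    tok_surprisal = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]] / math.log(2)  # bits

    # GPT-2 uses subword tokens; a word's surprisal is the sum over its subwords
    # (log-probabilities add where probabilities multiply). "Ġ" marks a word start.
    words, surprisals = [], []
    for i in range(1, ids.shape[1]):
        tok = tokenizer.convert_ids_to_tokens(ids[0, i].item())
        s = tok_surprisal[i - 1].item()
        if tok.startswith("Ġ") or not words:
            words.append(tok.lstrip("Ġ"))
            surprisals.append(s)
        else:
            words[-1] += tok       # continuation subword: same word
            surprisals[-1] += s

    for w, s in zip(words, surprisals):
        print(f"{w}\t{s:.2f} bits")

In an analysis of this kind, such values would then be aligned to word onsets in the speech stimulus and entered as impulse predictors in the mTRF model.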
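The model-comparison logic (does adding a word-surprisal predictor explain held-out variance beyond word onsets alone?) can be sketched with a lagged ridge regression on simulated data. This is an illustrative stand-in with made-up dimensions, using a ridge estimator rather than whatever mTRF estimator the study actually used.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20000                    # samples (e.g., 100 Hz for 200 s); illustrative
    lags = np.arange(0, 60)      # 0-590 ms of TRF lags at 100 Hz

    # Impulse predictors: word onsets, and onsets scaled by (fake) surprisal.
    word_onset = (rng.random(n) < 0.01).astype(float)
    surprisal = word_onset * rng.gamma(2.0, 2.0, n)

    def lagged(x, lags):
        """Stack time-shifted copies of x into a design matrix (n, len(lags))."""
        X = np.zeros((len(x), len(lags)))
        for j, L in enumerate(lags):
            X[L:, j] = x[: len(x) - L]
        return X

    # Simulated sensor: responds to onsets and, more weakly, to surprisal.
    kernel = np.exp(-lags / 15.0)
    y = lagged(word_onset, lags) @ kernel + 0.5 * lagged(surprisal, lags) @ kernel
    y += rng.normal(0, 1.0, n)

    def fit_predict(X, y, split, alpha=1e2):
        """Ridge-regression TRF fit on the training split; predict the rest."""
        Xtr, Xte, ytr = X[:split], X[split:], y[:split]
        beta = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(X.shape[1]), Xtr.T @ ytr)
        return Xte @ beta

    split = int(0.8 * n)
    X_base = lagged(word_onset, lags)
    X_full = np.hstack([X_base, lagged(surprisal, lags)])

    # Compare predictive accuracy on held-out data: the gain of the full model
    # over the base model is the unique contribution of the surprisal predictor.
    for name, X in [("onset only", X_base), ("onset + surprisal", X_full)]:
        pred = fit_predict(X, y, split)
        r = np.corrcoef(pred, y[split:])[0, 1]
        print(f"{name}: held-out r = {r:.3f}")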

Topic Areas: Speech Perception, Computational Approaches