Slide Slam O11
A computational investigation of the transformation from talker-specific detail to talker-invariant lexical representations
Sahil Luthra1, James S. Magnuson1,2,3, Jay G. Rueckl1,4; 1University of Connecticut, 2Basque Center on Cognition, Brain and Language, 3Ikerbasque - Basque Foundation for Science, 4Haskins Laboratories
Listeners accommodate a tremendous amount of talker variability during spoken word recognition, readily recognizing speech produced by myriad talkers despite significant differences in acoustic-phonetic patterns (Joos, 1948). However, a key barrier to explaining listeners’ robust performance across talkers is the fact that most computational models of spoken word recognition represent speech inputs in terms of abstract phonetic features, thereby sidestepping the issue of talker variability. Recently, our team introduced EARSHOT, a neural network model of human speech recognition that works on real speech as it unfolds over time (Magnuson et al., 2020). EARSHOT maps spectrogram-based input patterns to a lexical-semantic output layer via a hidden layer of long short-term memory (LSTM) nodes. We observed human-like patterns of lexical competition and moderate generalization to untrained words and talkers. Furthermore, despite not being trained explicitly on phonetic targets, EARSHOT’s hidden units exhibited phonetically organized responses resembling those observed in human superior temporal gyrus.

In this follow-up work, we conducted a series of Representational Similarity Analyses (RSAs; Kriegeskorte et al., 2008) to characterize how talker and lexical information are represented in EARSHOT. We first constructed Representational Dissimilarity Matrices (RDMs) to describe the similarity structure in the activation patterns of 1,161 different words, each spoken by 16 different talkers. We constructed separate RDMs for the input layer and the hidden layer. We compared these to two theoretical RDMs: one where the hypothesized similarity structure was defined by word identity (regardless of talker) and one where the similarity structure was defined by talker identity (regardless of the word).
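As a rough sketch of the RDM construction described above (using a small set of random activation patterns in place of EARSHOT's actual activations; all array sizes, function names, and the choice of 1 minus Pearson correlation as the dissimilarity measure are illustrative assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for EARSHOT activations: 12 items (4 words x 3 talkers),
# each item's activation pattern flattened to a feature vector.
n_words, n_talkers, n_features = 4, 3, 50
words = np.repeat(np.arange(n_words), n_talkers)   # word label per item
talkers = np.tile(np.arange(n_talkers), n_words)   # talker label per item
activations = rng.normal(size=(n_words * n_talkers, n_features))

def rdm_from_patterns(patterns):
    """Model RDM: dissimilarity = 1 - Pearson r between item patterns."""
    return 1.0 - np.corrcoef(patterns)

def theoretical_rdm(labels):
    """Theoretical RDM: 0 where two items share the label, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

model_rdm = rdm_from_patterns(activations)
word_rdm = theoretical_rdm(words)      # same word -> similar, regardless of talker
talker_rdm = theoretical_rdm(talkers)  # same talker -> similar, regardless of word
```

In this scheme, the word-identity RDM treats all tokens of a word as identical across talkers, while the talker-identity RDM treats all words from a talker as identical.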
The EARSHOT RDMs were compared to the theoretical RDMs, with one set of analyses collapsing across time (i.e., concatenating the patterns from all time steps) and another considering the pattern of activation at each time step separately. We found that word identity was not strongly represented in the input state patterns (r = 0.014, p < 0.01, collapsed across time), reflecting the known lack of invariance between acoustic signal and linguistic units (Liberman et al., 1957). However, talker information was strongly represented in the input patterns (r = 0.251, p < 0.01, collapsed across time). Over-time analyses revealed that the strength of this latter correlation varied as the input unfolded; fluctuations in this correlation appear to be associated with coarse-grained talker-specific details, such as characteristic speaking rate. In the hidden layer, we observed a relative increase in the extent to which word identity was represented (r = 0.104, p < 0.01, collapsed across time) and a relative decrease in the extent to which talker information was represented (r = 0.094, p < 0.01, collapsed across time). Over-time analyses indicated that the strength of the former correlation increased as the input unfolded, while the strength of the latter correlation decreased at later time steps. Thus, the hidden states appear to reflect an intermediate stage in the transformation from talker-specific surface details of utterances to abstract, talker-invariant lexical representations. Beyond elucidating how EARSHOT works, these analyses may generate new neural hypotheses about similar transformations in the brain.
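The two analysis variants (collapsing across time versus comparing at each time step) can be sketched as follows. This is a toy illustration with random hidden-state trajectories, not the authors' code; the abstract does not specify the correlation measure, so Spearman rank correlation over the RDMs' upper triangles (a common RSA convention) is assumed here:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Toy setup: 12 items (4 words x 3 talkers), hidden states over 20 time steps.
n_words, n_talkers, n_steps, n_units = 4, 3, 20, 16
words = np.repeat(np.arange(n_words), n_talkers)
hidden = rng.normal(size=(n_words * n_talkers, n_steps, n_units))

# Theoretical word-identity RDM: 0 if same word, 1 otherwise.
word_rdm = (words[:, None] != words[None, :]).astype(float)

def rdm(patterns):
    """Model RDM: 1 - Pearson r between item patterns."""
    return 1.0 - np.corrcoef(patterns)

def compare_rdms(a, b):
    """Spearman correlation between the upper-triangle entries of two RDMs."""
    iu = np.triu_indices_from(a, k=1)
    return spearmanr(a[iu], b[iu]).correlation

# Collapsed across time: concatenate every time step into one vector per item.
r_collapsed = compare_rdms(rdm(hidden.reshape(len(words), -1)), word_rdm)

# Over time: a separate model RDM and comparison at each time step.
r_by_step = [compare_rdms(rdm(hidden[:, t, :]), word_rdm) for t in range(n_steps)]
```

Tracking `r_by_step` for the word-identity and talker-identity RDMs is what lets one see, as in the abstract, word information rising and talker information falling as the input unfolds.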