Poster B70, Tuesday, August 20, 2019, 3:15 – 5:00 pm, Restaurant Hall

EARSHOT: Emulating Auditory Recognition of Speech by Humans Over Time

James Magnuson¹, Heejo You¹, Monica Li¹, Jay Rueckl¹, Monty Escabi¹, Kevin Brown², Hosung Nam³, Paul Allopenna¹, Rachel Theodore¹, Nicholas Monto¹; ¹University of Connecticut, ²Oregon State University, ³Korea University

INTRODUCTION. A critical gap in research on human speech recognition is that no comprehensive cognitive model operates on real speech. We present EARSHOT, based on the DeepListener model we presented last year. EARSHOT is a neural network that incrementally maps spectrographic speech inputs to pseudo-semantic outputs via a single recurrent layer of long short-term memory (LSTM) nodes. Compared to our previous report, we have achieved significant gains in accuracy, variability (17 talkers instead of 10), and potential ecological validity (with 2 human talkers and 15 high-quality synthesized talkers), and we have conducted more sophisticated analyses that reveal emergent representations in hidden-unit responses.

METHODS. We transformed speech files to 512-channel spectrograms (22050 Hz sampling rate, approximately 11000 Hz frequency range). Spectrograms were presented as model inputs as 512-channel vectors (~21.5 Hz per channel) in 10 ms steps. For each talker, we recorded 240 words. The 512 inputs map to a recurrent layer of 512 LSTM nodes, which maps to a 300-element output layer trained to produce sparse random pseudo-semantic vectors (30 of 300 elements "on"; a common simplification, given the largely arbitrary mapping from form to meaning). We trained 17 separate models. In each, a different talker was completely excluded from training. In addition, a different set of 15 words was excluded from training for each trained-on talker. In each epoch, 3600 items (225 words x 16 talkers) were presented in random order. A response was counted as correct if the model's output was closer (as indexed by vector cosine similarity) to the target word's vector than to any other word's vector. Initially, we trained each model for 2000 epochs. We used two methods to assess emergent representations in the model's hidden units: sensitivity analyses based on methods from human electrocorticography, and representational similarity analysis (RSA) based on methods from human neuroimaging.

RESULTS. Mean accuracy was 91% for trained items, 52% for words excluded for each trained talker, and 29% for talkers excluded from training. When training resumed with excluded items and talkers, accuracy increased to ~75% for excluded words and ~55% for excluded talkers within 50 epochs, and to ~90% and ~88% within 500 epochs. However, generalization was worse for excluded human than synthetic talkers; hence, future work will focus on real speech produced by humans. The time course of lexical competition closely resembled that observed in humans (Allopenna, Magnuson, & Tanenhaus, 1998) and in the "gold standard" TRACE model (McClelland & Elman, 1986), although rhyme competition was somewhat diminished compared to our previous simulations with fewer talkers (10) but more words (1000). Sensitivity analyses and RSA revealed emergent distributed phonological codes in hidden-unit activation patterns even though the model was not trained to produce phonetic labels. The sensitivity results closely resemble results from direct recordings from human superior temporal gyrus reported by Mesgarani et al. (2014).

CONCLUSIONS. EARSHOT provides proof of concept that a shallow model can operate on real speech while producing behavior similar to that of previous models that worked on abstract inputs. Emergent internal representations closely resemble human cortical responses. This framework opens the way to ongoing work aimed at increasing ecological and developmental validity.
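The architecture described in METHODS (512-channel spectrogram frames in 10 ms steps, a single layer of 512 LSTM nodes, and 300-dimensional sparse pseudo-semantic targets) can be illustrated with a minimal Keras sketch. The layer sizes and the cosine-similarity accuracy criterion come from the abstract; the sigmoid output activation, binary cross-entropy loss, and Adam optimizer are assumptions for illustration only, not details reported by the authors.

```python
# Minimal sketch of an EARSHOT-style architecture, under the assumptions
# stated above. Input sizes and the cosine accuracy criterion follow the
# abstract; loss, optimizer, and output activation are assumed.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

N_CHANNELS = 512   # spectrogram channels per 10 ms frame
N_HIDDEN = 512     # recurrent LSTM nodes
N_SEMANTIC = 300   # pseudo-semantic output dimensions (30 of 300 "on")

# Variable-length input: (time steps, 512 spectrogram channels)
spectrogram = layers.Input(shape=(None, N_CHANNELS))
# Single recurrent LSTM layer; return the full sequence so the output can
# be read out (and scored) at every 10 ms step.
hidden = layers.LSTM(N_HIDDEN, return_sequences=True)(spectrogram)
# Map each hidden state to a 300-dimensional pseudo-semantic vector.
semantics = layers.Dense(N_SEMANTIC, activation="sigmoid")(hidden)

model = Model(spectrogram, semantics)
model.compile(optimizer="adam", loss="binary_crossentropy")

def recognize(output_frame, lexicon):
    """Return the word whose sparse target vector is most cosine-similar
    to the model's output at a given frame; a response is counted correct
    if this is the target word, as in the abstract."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(lexicon, key=lambda w: cosine(output_frame, lexicon[w]))
```

Reading the output at every 10 ms step is what allows the incremental time course of lexical competition (cosine similarity to targets and competitors over time, as in the RESULTS) to be tracked, not just end-of-word accuracy.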
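The training design (17 models, each excluding one talker entirely and a different set of 15 words per trained-on talker, leaving 225 words x 16 talkers = 3600 items per epoch) can also be sketched. The function name, random seed, and the way held-out words are sampled are illustrative assumptions.

```python
# Sketch of the leave-one-talker-out / held-out-word split described in
# METHODS; seeding and sampling details are assumptions.
import random

def build_training_split(words, talkers, excluded_talker, n_heldout=15, seed=0):
    """Return (training_items, heldout_items) as lists of (word, talker) pairs.

    One talker is excluded entirely; for each remaining talker a different
    random set of 15 words is withheld, so with 240 words and 17 talkers
    each epoch contains 225 words x 16 talkers = 3600 training items.
    """
    rng = random.Random(seed)
    train, heldout = [], []
    for talker in talkers:
        if talker == excluded_talker:
            heldout.extend((w, talker) for w in words)
            continue
        withheld = set(rng.sample(words, n_heldout))
        for w in words:
            (heldout if w in withheld else train).append((w, talker))
    return train, heldout
```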
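The abstract reports that RSA on hidden-unit activation patterns revealed emergent phonological structure. As a rough illustration of the general RSA procedure (not the authors' specific pipeline), one could build a representational dissimilarity matrix from hidden activations averaged per phoneme or word and compare it to a phonological-feature RDM; the correlation-distance metric and Spearman comparison below are assumptions.

```python
# Generic RSA sketch: compare a model RDM built from hidden-unit patterns
# with an RDM built from phonological feature vectors for the same items.
# Distance metric and RDM comparison statistic are assumed, not reported.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rdm(patterns):
    """Representational dissimilarity matrix: pairwise correlation distance
    over rows (items) of a (items x units/features) matrix."""
    return squareform(pdist(patterns, metric="correlation"))

def rsa_score(hidden_patterns, feature_patterns):
    """Spearman correlation between the model RDM and the phonological RDM,
    computed over the upper triangle (diagonal excluded)."""
    model_rdm = rdm(hidden_patterns)
    feature_rdm = rdm(feature_patterns)
    iu = np.triu_indices_from(model_rdm, k=1)
    rho, _ = spearmanr(model_rdm[iu], feature_rdm[iu])
    return rho
```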

Themes: Computational Approaches, Speech Perception
Method: Computational Modeling