Poster C67, Friday, August 17, 10:30 am – 12:15 pm, Room 2000AB
DeepListener: A computational model of human speech recognition that works with real speech and develops distributed phonological codes
Heejo You¹, Hosung Nam², Paul Allopenna¹, Kevin Brown¹, James Magnuson¹; ¹University of Connecticut, ²Korea University
SUMMARY. Cognitive models of human speech recognition (HSR) can simulate complex over-time dynamics of lexical activation and competition, but they operate on idealized phonetic inputs rather than real speech. We report on DeepListener, a two-layer network with elements of deep networks that processes real speech with high accuracy while demonstrating human-like activation and competition dynamics. We use a neural decoding approach to unpack the distributed representations that the model learns.
INTRODUCTION. McClelland and Elman (1986) introduced the tension between computational adequacy (model performance) and psychological (explanatory) adequacy (our ability to understand how and why a model works). Cognitive models of HSR prioritize psychological adequacy, while models used for automatic speech recognition (ASR) focus on computational adequacy. Leading models of HSR, whether connectionist or Bayesian, make the "temporary" simplifying assumption of using abstract phonetic codes or human diphone confusion probabilities as inputs rather than real speech, which allows the models to remain simple. Leading ASR models use deep learning to engineer highly complex (many layers of many types), largely opaque systems that provide robust, high-accuracy ASR deployable in the real world. Given the shared history of cognitive connectionist and deep learning models, we asked whether cognitive approaches could borrow aspects of deep learning to break free of simplified inputs without sacrificing psychological adequacy.
METHODS. DeepListener receives 256-channel fast-Fourier-transformed speech as input in 10 ms windows. Inputs feed to a hidden layer of 256 long short-term memory units (LSTMs, often part of a deep learning pipeline for speech recognition). Target output patterns were random sparse vectors (10 elements "on" in a 300-element vector; a common simplifying assumption, given the largely arbitrary mapping between form and meaning). The network's task was to activate the correct output pattern for each word. Inputs were 1000 frequent words produced by each of 10 talkers, presented in random order in each training epoch. The model was trained with backpropagation through time (with recurrent hidden units).
RESULTS. Lexical "activations" were calculated as the cosine similarity of the output to each word's defined pattern at each time step. A word was "recognized" when the output-target cosine exceeded all other output-word cosines by 0.05. Mean peak output-target cosine was 0.91, while mean peak cosine similarity of the second-most similar word to the output was 0.75. After 20,000 epochs, accuracy was 95%. Predictors such as neighborhood and onset cohort density related to item-specific RTs similarly to how they relate to human RTs. Over-time competition dynamics (e.g., of onset and rhyme competitors) resembled those observed in humans. We borrowed a neural decoding approach from electrocorticography to identify spatiotemporal receptive fields of hidden units. Some units responded preferentially to specific phonemes or phonological classes (e.g., obstruents), while others had more complex response properties. These analyses allow us to begin to understand the inner workings of the model.
CONCLUSIONS. Previous attempts to couple ASR and HSR approaches have led to interesting insights, but with low computational adequacy (e.g., very low accuracy rates). This is the first model with high potential psychological adequacy that can be applied to real speech with high accuracy and human-like dynamics.
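ILLUSTRATION. The target coding and recognition criterion described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the fixed random seed, the binary (0/1) coding of "on" elements, and the use of a strict inequality for the 0.05 margin are all hypothetical choices made for illustration. The sketch also reports whichever word wins by the margin, whereas the abstract scores recognition against the intended target word.

```python
import math
import random


def make_sparse_targets(n_words, dim=300, n_on=10, seed=0):
    """Give each word a random binary vector with n_on of dim elements set to 1.

    Binary coding and the seed are assumptions; the abstract only states
    that 10 of 300 elements are "on".
    """
    rng = random.Random(seed)
    targets = []
    for _ in range(n_words):
        on = set(rng.sample(range(dim), n_on))
        targets.append([1.0 if i in on else 0.0 for i in range(dim)])
    return targets


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognized_word(output, targets, margin=0.05):
    """Return the index of the word whose cosine beats every other word's
    cosine by more than `margin` at this time step, or None if no word does.

    Expects at least two target patterns.
    """
    sims = [cosine(output, t) for t in targets]
    order = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)
    best, runner_up = order[0], order[1]
    if sims[best] - sims[runner_up] > margin:
        return best
    return None
```

Applied to the network's output vector at each 10 ms step, `recognized_word` yields the first time step at which some word clears the margin, which is one way a model "RT" for an item could be read off.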
Topic Area: Computational Approaches