Interpretability tools reveal what components of NLP models drive similarity to human brain activations in language processing

Poster B19 in Poster Session B and Reception, Thursday, October 6, 6:30 - 8:30 pm EDT, Millennium Hall
Also presenting in Poster Slam B, Thursday, October 6, 6:15 - 6:30 pm EDT, Regency Ballroom

Marianne de Heer Kloots¹, Willem Zuidema¹; ¹University of Amsterdam

To successfully understand natural language, the human brain must integrate information across several representational domains and levels of abstraction. Large neural language models from the field of Natural Language Processing process language well enough to succeed at many tasks, and the internal activations generated in such models appear to be remarkably predictive of human brain responses during language comprehension. Although encoding models mapping from ANNs to brains are increasingly seen as promising ways to better understand human neural information integration, a problem for using Large Language Models (LLMs) in this approach is the opaque path from training corpora through model architectures to individual brain activity data (as illustrated by, e.g., larger untrained architectures performing better than smaller trained ones; Schrimpf et al., 2021). Here, we aim to improve the interpretability of this pipeline by partitioning LLMs into components that have been successful targets of interpretability work in NLP (Rogers et al., 2021). We build on recent studies with Transformer LLMs, which found that better neural predictivity is generally achieved by activations from higher model layers (Caucheteux & King, 2022; Schrimpf et al., 2021). After replicating these results qualitatively, we investigate which model-internal operations make activation patterns at these specific layers more brain-like. In particular, we examine the role of individual attention heads within each layer, and the multi-head attention mechanism regulating information flow between layers. We compare model and brain responses to the same text, using uni- and bidirectional Transformer language models (GPT-2 and BERT), Representational Similarity Analysis, and a dataset of fMRI recordings of human story reading (Wehbe et al., 2014). In addition to activation vectors for entire model layers, we also extract activations from individual attention heads within each layer. We then study the model-internal behaviour producing these activations, using a metric that quantifies the importance of every input token for each attention head (Kobayashi et al., 2020). We indeed find generally higher brain-similarity with increasing layer depth, though similarity sometimes also peaks in BERT's early layers (1-2). However, representational similarity is not evenly distributed within model layers: there are substantial differences in brain-similarity scores between attention heads in the same layer. Additionally, individual head activation patterns sometimes show higher brain-similarity than whole-layer activations. Aggregating the model's attention over all heads within each layer, we find that earlier layers assign most importance to directly preceding words, whereas later layers integrate information over larger contexts. This aligns with differences in representational similarity when we vary the amount of prior context given to the model as input: almost all layers in both GPT-2 and BERT perform better with small amounts of prior context (8-12 words) than with none, but only middle and higher layers show some additional benefit for context lengths beyond 16 words. Our work shows that we can successfully isolate smaller components of LLMs that drive much of the similarity between model-internal states and human brain activations. This opens up the possibility of investigating whether these same components (i.e., specific attention heads) are also most important for the models' linguistic performance.
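
The layer-level comparison described above can be illustrated with a small, self-contained sketch. The code below is not the authors' pipeline: the stimulus sentences are stand-ins for the Wehbe et al. (2014) story text, the "brain" RDM is random noise rather than fMRI-derived, and per-head activation extraction and the Kobayashi et al. (2020) importance metric are omitted. It only shows the basic Representational Similarity Analysis step: build a representational dissimilarity matrix (RDM) per GPT-2 layer and rank-correlate it with a brain RDM.

```python
# Minimal RSA sketch (illustrative only, not the authors' implementation):
# compare each GPT-2 layer's representational geometry to a placeholder brain RDM.
import numpy as np
import torch
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from transformers import GPT2Tokenizer, GPT2Model

# Stand-in stimuli; the actual study uses the story text from Wehbe et al. (2014).
sentences = [
    "The boy pushed his trolley towards the barrier between the platforms.",
    "An owl swooped low over the crowded station.",
    "He wondered whether anyone would notice him disappearing into the wall.",
    "The train whistled as it pulled away from the platform.",
    "She folded the letter and slipped it back into her pocket.",
]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# One vector per sentence per layer: mean-pool the token representations.
layer_reps = [[] for _ in range(model.config.n_layer + 1)]  # +1 for embedding layer
with torch.no_grad():
    for sent in sentences:
        ids = tokenizer(sent, return_tensors="pt")
        hidden = model(**ids).hidden_states   # tuple of (1, seq_len, 768) tensors
        for layer, h in enumerate(hidden):
            layer_reps[layer].append(h.mean(dim=1).squeeze(0).numpy())

# Placeholder brain RDM: in the real analysis this is built from fMRI responses
# to the same stimuli; here it is random, for illustration only.
rng = np.random.default_rng(0)
brain_rdm = pdist(rng.normal(size=(len(sentences), 50)), metric="correlation")

# RSA: rank-correlate each layer's stimulus-by-stimulus dissimilarity structure
# with the brain RDM.
for layer, reps in enumerate(layer_reps):
    model_rdm = pdist(np.stack(reps), metric="correlation")
    rho, _ = spearmanr(model_rdm, brain_rdm)
    print(f"layer {layer:2d}: RSA rho = {rho:+.3f}")
```

The same loop can be run on activations sliced out for individual attention heads instead of whole layers, which is the head-level comparison the abstract reports.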
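The attention-aggregation finding (earlier layers attending mostly to directly preceding words, later layers integrating over larger contexts) can likewise be approximated with a rough sketch. Note that this uses raw attention weights as a simpler stand-in for the norm-based importance metric of Kobayashi et al. (2020) used in the study, and the input text is again a placeholder.

```python
# Rough proxy for the layer-wise context analysis: how far back does each
# GPT-2 layer's attention look, on average? Uses raw attention weights, not
# the norm-based metric of Kobayashi et al. (2020).
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = ("He pushed his trolley towards the barrier between the platforms, "
        "wondering whether anyone would notice him disappearing into the wall.")
ids = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**ids)

seq_len = ids["input_ids"].shape[1]
positions = torch.arange(seq_len)
# distance[i, j] = how many tokens back position i looks when attending to j
distance = (positions[:, None] - positions[None, :]).clamp(min=0).float()

for layer, att in enumerate(out.attentions):
    # att: (1, n_heads, seq_len, seq_len); average over heads, then compute the
    # attention-weighted mean distance to the attended tokens, per query position.
    att_mean = att[0].mean(dim=0)                        # (seq_len, seq_len)
    mean_dist = (att_mean * distance).sum(dim=-1).mean().item()
    print(f"layer {layer:2d}: mean attended distance = {mean_dist:.1f} tokens")
```

Under this proxy, lower layers should concentrate attention on nearby tokens while higher layers spread it over longer distances, mirroring the context-length effect described in the abstract.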

Topic Areas: Computational Approaches, Methods