
Impact of model size and fine-tuning techniques on LLMs’ resemblance to the human brain

Poster B93 in Poster Session B, SNL 2023, Tuesday, October 24, 3:30 - 5:15 pm CEST, Espace Vieux-Port

Changjiang Gao (1, 2), Shujian Huang (1), Jixing Li (2); (1) Nanjing University, (2) City University of Hong Kong

Introduction. Prior studies comparing GPT2 with the human brain have suggested shared computational principles between the two systems (e.g., Caucheteux & King, 2022; Goldstein et al., 2022; Schrimpf et al., 2021). Yet current LLMs such as ChatGPT are orders of magnitude larger and use novel fine-tuning techniques that enable improved contextual understanding and human-like few-shot learning. It is therefore important to investigate whether these latest language models show more human-like language comprehension at the neural level, and how model size and fine-tuning techniques affect a model’s resemblance to the human brain.

Methods. We used the openly available Reading Brain dataset (Li et al., 2022), which includes concurrent eye-tracking and fMRI BOLD signals recorded during naturalistic reading of 5 English articles. The subjects were 52 native English speakers (L1) and 56 non-native learners of English (L2). For each subject, we used the eye fixation time points to extract the fMRI signals within a left-lateralized language mask, time-locked to each word, and constructed an fMRI data matrix for each sentence. We also extracted the saccade matrix for every sentence and compared the fMRI and saccade patterns with the attention patterns of LLMs. To assess the effect of model size and fine-tuning technique, we included GPT2 (Radford et al., 2019) with 774M parameters and the LLaMA family (Touvron et al., 2023; Taori et al., 2023; Chiang et al., 2023) with 7B, 13B and 30B parameters. The LLaMA family also covers two fine-tuning techniques: instruction tuning (Alpaca-LoRA) and conversation tuning (Vicuna). We fed the 5 English articles used in the fMRI experiment into these models sentence by sentence and obtained the attention patterns from each layer and each attention head. We then used linear regression to predict each subject's fMRI and eye movement data from the models’ attention patterns at each layer. We obtained the regression scores at each layer for each subject and compared them for L1 and L2 at the group level using two-sample t-tests with 1000 permutations.

Results. The regression scores improved significantly for both L1 and L2 speakers as model size increased from 774M to 30B. However, instruction or conversation fine-tuning did not significantly improve the model fit to either L1 or L2 speakers’ neural and behavioral patterns.

Conclusion. Compared with their predecessors, current LLMs with more parameters better capture human behavioral and neural patterns during language comprehension. The currently popular fine-tuning techniques with human feedback do not appear to improve the model fit to human comprehension patterns, suggesting that the current training techniques of LLMs may encourage comprehension strategies that differ from natural reading in humans.
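
The analysis steps in Methods (a per-layer linear regression followed by a group-level permutation t-test) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the regression score, so an in-sample R^2 is assumed here; the t statistic is assumed to be Welch's; and the function names and the synthetic data are hypothetical.

```python
import numpy as np


def layer_regression_score(attn, target):
    """In-sample R^2 of a least-squares fit predicting `target` from `attn`.

    attn:   (n_samples, n_features) attention features for one layer
    target: (n_samples,) fMRI or eye-movement values
    Assumption: R^2 is used as the "regression score".
    """
    X = np.column_stack([np.ones(len(attn)), attn])  # add an intercept column
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def welch_t(a, b):
    """Two-sample t statistic with unequal variances (assumed variant)."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))


def permutation_ttest(a, b, n_perm=1000, seed=0):
    """Two-sided permutation test on the t statistic.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose |t| is at least as large as the observed |t|.
    """
    rng = np.random.default_rng(seed)
    observed = welch_t(a, b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(welch_t(perm[: len(a)], perm[len(a):])) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (count + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic stand-ins for per-subject scores under two models.
    rng = np.random.default_rng(42)
    scores_small = rng.normal(0.10, 0.02, size=52)
    scores_large = rng.normal(0.12, 0.02, size=52)
    t, p = permutation_ttest(scores_large, scores_small)
    print(f"t = {t:.2f}, permutation p = {p:.4f}")
```

In practice the regression score would more likely be computed on held-out data, which the abstract does not describe; the sketch keeps the in-sample version only for brevity.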

Topic Areas: Computational Approaches, Reading
