Formant-invariant voice representations are pre-attentively formed from constantly varying speech and non-speech stimuli

Slide Slam Session C, Tuesday, October 5, 2021, 12:30 - 3:00 pm PDT

Giuseppe Di Dona¹, Michele Scaltritti¹, Simone Sulpizio²; ¹University of Trento, Italy, ²University of Milano-Bicocca, Italy

Despite being integrally processed, phonological and talker-related information can be selectively extracted from speech for different communicative goals. Yet, as both dimensions are characterized by a considerable amount of physical variability, listeners may build information-specific representations that are invariant to changes along the irrelevant dimension. Previous studies showed that listeners pre-attentively form abstract phoneme representations irrespective of constant changes in the talkers’ voices. The first aim of the present EEG study was to determine whether listeners can also form abstract voice representations while ignoring constantly changing phonological information, and whether they can use the output of this mechanism to facilitate volitional voice change detection. The second aim was to understand whether the use of such an abstraction mechanism is restricted to the speech domain or whether it can also be deployed in non-speech contexts. Fifteen native Italian speakers took part in an EEG experiment that included a passive and an active oddball task, each featuring a speech and a non-speech condition. In the speech condition, participants heard constantly changing vowels produced by a male speaker as standard stimuli, which were infrequently replaced by vowels uttered by a female speaker with a higher pitch. In the non-speech condition, participants heard the rotated-speech version of the stimuli, synthesized by rotating the spectrum around a pivot frequency. This manipulation results in a power exchange between high and low frequencies, disrupting the previously meaningful formant structure. Results showed that, in the passive task, the Mismatch Negativity (MMN) was elicited after the presentation of the deviant voice in both the speech and the non-speech condition. The elicitation of this component in both conditions signaled that listeners could successfully group different stimuli into a formant-invariant voice representation.
This suggests that listeners can represent abstract regularities along voice-dependent dimensions in auditory streams irrespective of the presence of meaningful linguistic information. After the MMN, a stronger Late Discriminative Negativity was found for the speech condition, possibly indicating that phonological details could be included in voice representations, but only later in time. In the active task, responses were faster and more accurate in the speech than in the non-speech condition. Additionally, in the speech condition, the detection of the deviant stimuli was associated with an enhanced P3b amplitude. This suggests that when pre-attentively formed voice representations include familiar phonological information, pitch detection is facilitated. This facilitation in the speech condition was also evidenced by a stronger synchronization in the theta band (4-7 Hz), potentially pointing towards differences in encoding/retrieval processes, and by a reduced desynchronization in the beta band (13-30 Hz), suggesting that deviant events with a familiar formant structure induced an attenuated disruption of the previously formed representation. Taken together, the results show that, whereas at a pre-attentive level the cognitive system can track pitch regularities while abstracting away from constantly changing formant frequency values in both speech and non-speech, at a volitional level the use of such information is facilitated for speech sounds, given the familiarity of listeners with meaningful formant structures.
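
The spectral rotation used to build the non-speech stimuli can be illustrated with a minimal sketch. The abstract does not report the pivot frequency, the sampling rate, or the synthesis procedure, so everything below is an assumption made for illustration: the function names (`dft`, `idft`, `rotate_spectrum`), the frequency-domain mirroring approach, and the parameter values in the usage note are hypothetical and are not the authors' method. The sketch mirrors the content of the band from 0 Hz to twice the pivot around the pivot, so that energy at frequency f moves to 2 * pivot - f. It uses a naive transform from the standard library to stay self-contained; an FFT library would normally be used for real audio.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def rotate_spectrum(signal, sample_rate, pivot_hz):
    """Mirror the band [0, 2 * pivot_hz] around pivot_hz (hypothetical helper).

    Content above 2 * pivot_hz is discarded, so the input is effectively
    band-limited before rotation.
    """
    n = len(signal)
    spectrum = dft(signal)
    top = round(2 * pivot_hz * n / sample_rate)  # bin index of 2 * pivot_hz
    if top > n // 2:
        raise ValueError("2 * pivot_hz must not exceed the Nyquist frequency")

    rotated = [0j] * n
    for k in range(top + 1):
        # Bin k (frequency f) moves to bin top - k (frequency 2 * pivot - f).
        rotated[top - k] = spectrum[k].conjugate()
    for k in range(1, top + 1):
        # Fill the negative-frequency half so the output signal is real.
        if k != n - k:
            rotated[n - k] = rotated[k].conjugate()
    return idft(rotated)
```

As a usage example under the same assumptions, a 500 Hz tone sampled at 8000 Hz and rotated with `rotate_spectrum(tone, 8000, 2000)` ends up at 3500 Hz: low-frequency energy is exchanged with high-frequency energy, which is why formant patterns stop resembling those of natural vowels while the overall energy is preserved.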

