
Poster A75, Tuesday, August 20, 2019, 10:15 am – 12:00 pm, Restaurant Hall

Tolerance to audiovisual asynchronies for early cross-modal interactions during speech processing: An electrophysiological study

Alexandra Jesse¹, Elina Kaplan¹; ¹University of Massachusetts Amherst

To recognize speech during face-to-face conversations, listeners evaluate and combine speech information obtained from hearing and seeing a speaker talk. Audiovisual speech is generally recognized more reliably than auditory-only speech. The audiovisual benefit arises in part because listeners integrate perceptual information arriving within a certain time window from the two modalities into a unitary percept. Cross-modal interactions can already occur early during auditory perception. Prior studies measuring event-related potentials (ERPs) have shown cross-modal interactions in early auditory processing in that the first negative peak (N1), typically found around 100 ms after acoustic onset, is smaller when auditory speech is accompanied by visual speech than when it is not. However, information from the two modalities does not necessarily have to arrive at the same time to be integrated or to be perceived as synchronous. Listeners tolerate a certain degree of physical audiovisual asynchrony in their recognition of speech and in their judgments of synchrony. Additionally, they allow for a larger temporal lead of visual information than of auditory information. The current study tested the time window within which visual and auditory information must occur to produce cross-modal interactions in the early auditory processing of speech. The prediction was that neural cross-modal interactions at the N1 would become less likely as auditory and visual inputs were further separated in time. While their ERPs were measured, young adults heard and saw a female speaker saying the syllable /pa/. This audiovisual speech stimulus was either presented as originally recorded (i.e., synchronous condition) or with systematically induced stimulus-onset asynchronies (SOAs). For two of the selected SOAs (-300 ms auditory lead; +500 ms auditory lag), the auditory and visual events are commonly reported as being out of sync; for the other two (-67 ms auditory lead; +233 ms auditory lag), they are commonly reported as being in sync. After each presentation, participants categorized the audiovisual stimulus by button press as being presented in sync or out of sync. On additional trials with auditory-only (A) or visual-only (V) speech, participants simply pressed any button after the presentation. Auditory-only trials showed a randomly pixelated square spectrally matched to the video of the speaker. The N1 mean amplitude was measured as the mean activity between 90 and 140 ms after acoustic onset. As expected, the N1 amplitude was overall reduced for audiovisual presentations (AV-V) compared to the auditory-only presentation (A), indicating multisensory interactions. Compared to the synchronous condition, the N1 amplitude was significantly larger when the sound was presented 300 ms before its natural occurrence. All other asynchronous SOA conditions (-67, +233, and +500 ms) showed cross-modal interactions similar in size to those found in the synchronous condition. Visual and auditory speech information therefore does not have to be physically synchronous for cross-modal interactions to occur. Perceptual tolerance to asynchronies is already observable in early multisensory interactions during the processing of audiovisual speech.
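
As a minimal sketch of the kind of analysis described above (not the authors' pipeline), the code below shows how an N1 mean amplitude in the 90-140 ms window and the (AV - V) versus A contrast could be computed, assuming epoched ERP data stored as NumPy arrays of shape (participants, timepoints) that have already been averaged over trials and channels; the array names, sampling rate, and epoch onset are illustrative assumptions, and the placeholder data are random.

    # Sketch only: N1 mean amplitude (90-140 ms after acoustic onset) and the
    # additive-model contrast (AV - V) vs. A, per participant. Data shapes,
    # sampling rate, and epoch onset are assumptions, not taken from the study.
    import numpy as np
    from scipy import stats

    sfreq = 500.0   # sampling rate in Hz (assumed)
    tmin = -0.2     # epoch start relative to acoustic onset, in seconds (assumed)

    def n1_mean_amplitude(erp, t_start=0.090, t_end=0.140):
        """Mean amplitude in the N1 window, one value per participant."""
        start = int(round((t_start - tmin) * sfreq))
        end = int(round((t_end - tmin) * sfreq))
        return erp[:, start:end].mean(axis=1)

    # Placeholder data: 20 participants, 0.7-s epochs (replace with real ERPs).
    rng = np.random.default_rng(0)
    n_participants, n_samples = 20, int(0.7 * sfreq)
    av = rng.normal(size=(n_participants, n_samples))  # audiovisual condition
    v = rng.normal(size=(n_participants, n_samples))   # visual-only condition
    a = rng.normal(size=(n_participants, n_samples))   # auditory-only condition

    # A reduced N1 for (AV - V) relative to A indicates a cross-modal
    # interaction in early auditory processing.
    n1_av_minus_v = n1_mean_amplitude(av) - n1_mean_amplitude(v)
    n1_a = n1_mean_amplitude(a)
    t, p = stats.ttest_rel(n1_av_minus_v, n1_a)
    print(f"(AV - V) vs. A: t = {t:.2f}, p = {p:.3f}")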

Themes: Perception: Speech Perception and Audiovisual Integration, Speech Perception
Method: Electrophysiology (MEG/EEG/ECOG)
