The Handbook of Speech Perception

determine where in the perceptual and neurophysiological process integration occurs and whether integration is complete (for discussions of these topics, see Brancazio & Miller, 2005).

      However, a number of researchers have recently questioned whether the McGurk effect should be used as a primary test of multisensory integration (Alsius, Paré, & Munhall, 2017; Remez, Beltrone, & Willimetz, 2017; Rosenblum, 2019; Irwin & DiBlasi, 2017; Brown et al., 2018). There are multiple reasons for these concerns. First, there is wide variability in most aspects of McGurk methodology (for a review, see Alsius, Paré, & Munhall, 2017). Most obviously, the specific talkers used to create the stimuli usually vary from project to project. The dubbing procedure (specifically, how the audio and visual components are aligned) also varies across laboratories. Studies also differ in which syllables are used and in the type of McGurk effect tested (fusion vs. visual dominance). Procedurally, the tasks (e.g. open response vs. forced choice), stimulus ordering (fully randomized vs. blocked by modality), and the control condition chosen (e.g. audio‐alone vs. audiovisually congruent syllables) vary across studies (Alsius, Paré, & Munhall, 2017). This extreme methodological variability may account for the wide range of McGurk effect strengths reported across the literature. Finding evidence of the effect under such different conditions does speak to its durability. However, the methodological variability makes it difficult to know whether influences on the effect’s strength are attributable to the variable in question (e.g. facial inversion) or to some superfluous characteristic of idiosyncratic stimuli and/or tasks.

      Another concern about the McGurk effect is whether it is truly representative of natural (nonillusory) multisensory perception (Alsius, Paré, & Munhall, 2017; Remez, Beltrone, & Willimetz, 2017). It could very well be that different perceptual and neurophysiological resources are recruited when integrating discrepant rather than congruent audiovisual components. In fact, it has long been known that McGurk‐effect syllables (e.g. audio ba + visual va = perceived va) are less compelling and take longer to identify (Brancazio, 2004; Brancazio, Best, & Fowler, 2006; Green & Kuhl, 1991; Jerger et al., 2017; Massaro & Ferguson, 1993; Rosenblum & Saldaña, 1992) than analogous audiovisually congruent syllables (audio va + visual va = perceived va). This is true even when McGurk syllables are identified with a frequency (98 percent va; Rosenblum & Saldaña, 1992) comparable to that of the congruent syllables. Relatedly, there is evidence that, when spatial and temporal offsets are applied to the audio and visual components, McGurk stimuli are more readily perceived as separate components than are audiovisually congruent syllables (e.g. Bishop & Miller, 2011; van Wassenhove, Grant, & Poeppel, 2007).

      Additional evidence that the McGurk effect may not be representative of normal integration comes from intersubject differences. It turns out that there is little evidence for a correlation between a subject’s likelihood of displaying a McGurk effect and how much that subject benefits from using visual speech to enhance noisy auditory speech (at least in normal‐hearing subjects; e.g. Van Engen, Xie, & Chandrasekaran, 2016; but see Grant & Seitz, 1998). Relatedly, the relationship between straight lip‐reading skill and susceptibility to the McGurk effect is weak at best (Cienkowski & Carney, 2002; Strand et al., 2014; Wilson et al., 2016; Massaro et al., 1986).

      A particularly troubling concern regarding the McGurk effect is evidence that its failure does not mean integration has not occurred (Alsius, Paré, & Munhall, 2017; Rosenblum, 2019). Multiple studies have shown that when the McGurk effect seems to fail and a subject reports hearing just the auditory segment (e.g. auditory /b/ + visual /g/ = perceived /b/), the influences of the visual, and perhaps integrated, segment are present in the gestural nuances of the subject’s spoken response (Gentilucci & Cattaneo, 2005; Sato et al., 2010; see Rosenblum, 2019 for further discussion). In another example, Brancazio and Miller (2005) showed that in instances in which a visual /ti/ failed to change the identification of an audible /pi/, a simultaneous manipulation of the speaking rate of the visible /ti/ did influence the voice‐onset time perceived in the /pi/ (see also Green & Miller, 1985). Thus, information for voice‐onset time was integrated across the visual and audible syllables even when the McGurk effect failed to change the identification of the /pi/.

      It is unclear why featural integration can still occur in the face of a failed McGurk effect (Rosenblum, 2019; Alsius, Paré, & Munhall, 2017). It could be that standard audiovisual segment integration does occur in these instances, but the resultant segment does not change enough to be categorized differently. As stated, perceived segments based on McGurk stimuli are less robust than audiovisually congruent (or audio‐alone) perceived segments. It could be that some integration almost always occurs for McGurk segments, but the less canonical integrated segment sometimes induces a phonetic categorization that is the same as the auditory‐alone segment. Regardless, the fact that audiovisual integration of some type can occur when the McGurk effect appears to fail forces a reconsideration of the effect as a primary test of integration.

      The question of where in the speech function the modal streams integrate (merge) continues to be one of the most studied in the multisensory literature. Since 2005, much of this research has used neurophysiological methods. After the aforementioned fMRI report by Calvert and her colleagues (1997; see also Pekkola et al., 2005), numerous studies have also shown visual speech activation of the auditory cortex, often using other technologies, for example, functional near‐infrared spectroscopy (fNIR; van de Rijt et al., 2016), electroencephalography (EEG; Callan et al., 2001; Besle et al., 2004), intracranial EEG (ECoG; e.g. Besle et al., 2008), and magnetoencephalography (MEG; Arnal et al., 2009; for a review, see Rosenblum, Dorsi, & Dias, 2016). More recent evidence shows that visual speech can modulate neurophysiological areas considered to be further upstream, including the auditory brainstem (Musacchia et al., 2006), which is one of the earliest locations at which direct visual modulation could occur. There is even evidence of visual speech modulation of cochlear functioning (otoacoustic emissions; Namasivayam et al., 2015). While it is likely that visual influences on such peripheral auditory mechanisms are based on feedback from downstream areas, the fact that such modulation can occur indicates the importance of visual input to the speech function.

      Other neurophysiological findings suggest that the integration of the streams also happens early. A very recent EEG study revealed that N1 auditory‐evoked potentials (known to reflect primary auditory cortex activity) for visually induced (McGurk) fa and ba syllables (auditory ba + visual fa; auditory fa + visual ba, respectively) resemble the N1 responses for the corresponding auditory‐alone syllables (Shahin et al., 2018; see also van Wassenhove, Grant, & Poeppel, 2005). The degree of resemblance was larger for individuals whose identification responses showed greater visual influences, suggesting that this modulated
