in N1) corresponds to an integrated perceived segment. This finding is less consistent with the alternative model that separate unimodal analyses are first conducted in the primary cortices, with their outcomes then combined at a multisensory integrator, such as the posterior STS (pSTS; e.g. Beauchamp et al., 2004).

      The behavioral research also continues to show evidence of early crossmodal influences (for a review, see Rosenblum, Dorsi, & Dias, 2016). Evidence suggests that visual influences likely occur before auditory feature extraction (e.g. Brancazio, Miller, & Paré, 2003; Fowler, Brown, & Mann, 2000; Green & Gerdeman, 1995; Green & Kuhl, 1989; Green & Miller, 1985; Green & Norrix, 2001; Schwartz, Berthommier, & Savariaux, 2004). Other research shows that information in one modality can facilitate perception in the other even when that information is not usable – and sometimes not even detectable – on its own (e.g. Plass et al., 2014). For example, Plass and his colleagues (2014) used flash suppression to render visually presented articulating faces (consciously) undetectable. Still, if these undetected faces were presented with auditory speech that was consistent and synchronized with the visible articulation, subjects were faster at recognizing that auditory speech. This suggests that useful crossmodal influences can occur even without awareness of the information in one of the modalities.

      Other examples of the extreme super‐additive nature of speech integration have been shown in the context of auditory speech detection (Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004; Palmer & Ramsey, 2012) and identification (Schwartz, Berthommier, & Savariaux, 2004), as well as audiovisual speech identification (Eskelund, Tuomainen, & Andersen, 2011; Rosen, Fourcin, & Moore, 1981). Much of this research has been interpreted to suggest that, even without supporting a clear (conscious) phonetic determination on its own, each modality can help the perceiver attend to critical information in the other modality through analogous patterns of temporal change in the two signals. These crossmodal correspondences are thought to be influential at an especially early stage (before feature extraction), serving as a “bimodal coherence‐masking protection” against everyday signal degradation (e.g. Grant & Seitz, 2000; Kim & Davis, 2004; Schwartz, Berthommier, & Savariaux, 2004; see also Gordon, 1997). The impressive utility of these crossmodal correspondences will also help motivate the theoretical position proposed later in this chapter.

      However, other interpretations of these results have been offered which are consistent with early integration (Brancazio, 2004; Rosenblum, 2008). It may be that lexicality and sentence context do not bear on the likelihood of integration, but instead on how the post‐integrated segment is categorized. As stated, syllables perceived from conflicting audiovisual information are likely less canonical than those based on congruent (or audio‐alone) information. This likely makes those syllables less robust, even when they are identified as visually influenced segments. This could mean that, despite incongruent segments being fully integrated, the resultant perceived segment is more susceptible to contextual (e.g. lexical) influences than audiovisually congruent (and auditory‐alone) segments. This is certainly known to be the case for less canonical, more ambiguous audio‐alone segments, as demonstrated in the Ganong effect: an ambiguous segment heard equally as k or g in isolation will be heard as the former when placed in front of the syllable iss (yielding the word kiss), but as the latter if heard in front of ift (yielding gift) (Connine & Clifton, 1987; Ganong, 1980). If the same is true of incongruent audiovisual segments, then lexical context may not bear on audiovisual integration as such, but on the categorization of the post‐integrated (and less canonical) segment (e.g. Brancazio, 2004).

      Still, other recent evidence has been interpreted as showing that a semantic analysis is conducted on the individual streams before integration is fully complete (see also Bernstein, Auer, & Moore, 2004). Ostrand and her colleagues (2016) present data showing that, despite a McGurk word being perceived as visually influenced (e.g. audio bait + visual date = heard date), the auditory component of the stimulus provides stronger priming of semantically related auditory words (audio bait + visual date primes worm more strongly than it primes calendar). This finding could suggest that the auditory component undergoes a semantic analysis before it is merged with the visual component, so that it, rather than the visible word, drives the stronger priming. If this contention were true, it would mean that the channels are not fully integrated until a substantial amount of processing has occurred on the individual channels.

      In sum, much of the new evidence from the behavioral, and especially the neurophysiological, research suggests that the audio and visual streams are merged as early as can currently be observed (but see Bernstein, Auer, & Moore, 2004). In the previous version of this chapter we argued that this fact, along with the ubiquity and automaticity of multisensory speech, suggests that the speech function is designed around multisensory input (Rosenblum, 2005). We further argued that the function may make use of the fact that there is a common informational form across the modalities. This contention will be addressed in the final section of this chapter.

      The notion that the speech mechanism may be sensitive to a form of information that is not tied to a specific sensory modality has been discussed for over three decades (e.g. Summerfield, 1987). This construal of multisensory speech information has been variously referred to as amodal, modality‐neutral (e.g. Rosenblum, 2005), and supramodal (Fowler, 2004; Rosenblum et al., 2016, 2017). The
