Interestingly, because the supramodal learning hypothesis suggests that perceptual experience is of articulatory properties regardless of modality, another surprising prediction can be made. Observers should be able to show a bimodal training benefit using a modality they have rarely, if ever, used before: haptic speech. We have recently shown that, by listening to distorted auditory speech while touching the face of a speaker, observers are later able to understand the distorted speech on its own better than control subjects who touched a still face while listening (Dorsi et al., 2016). These results, together with our crossmodal talker facilitation findings (Rosenblum et al., 2007; Sanchez et al., 2013), suggest that the experiential basis of bimodal training benefits requires neither long‐term experience with the involved modalities nor concurrent presentation of the streams. What is required for a bimodal training benefit is access to some lawful auditory, visual, or haptic information for articulatory actions and their indexical properties.
In sum, as we argued in our 2005 chapter, auditory and visual speech share the general informational commonalities of being composed of time‐varying information that is intimately tied to indexical information. Since 2005, however, another category of informational commonality can be added to this list: information in both streams can act to guide the indexical details of a production response. It is well known that during live conversation each participant’s productions are influenced by the indexical details of the speech they have just heard (e.g. Pardo, 2006; Pardo et al., 2013; for a review, see Pardo et al., 2017). This phonetic convergence shows that interlocutors’ utterances often subtly mimic aspects of the utterances of the person with whom they are speaking. The phenomenon occurs not only during live interaction, but also when subjects are asked to listen to recorded words and to say each word out loud. There have been many explanations for this phenomenon, including that it helps facilitate the interaction socially (e.g. Pardo et al., 2012). Phonetic convergence may also reveal the tacit connection between speech perception and production, as if the two functions share a “common currency” (e.g. Fowler, 2004).
Importantly, recent research from our lab and others suggests that phonetic convergence is not an alignment toward the sound of an interlocutor’s speech so much as toward their articulatory style – conveyed supramodally. We have shown that, despite having no formal lip‐reading experience, perceivers will produce words containing the indexical properties of words they have just lip‐read (Miller, Sanchez, & Rosenblum, 2010). Further, the degree to which talkers converge toward lip‐read words is comparable to that observed for convergence to heard words. Other research from our lab shows that, during live interactions, seeing an interlocutor increases the degree of convergence over simply hearing them (Dias & Rosenblum, 2011), and that this increase is based on the availability of visible speech articulation (Dias & Rosenblum, 2016). Finally, it seems that visual information for articulatory features (voice‐onset time) can integrate with auditory information to shape convergence (Sanchez, Miller, & Rosenblum, 2010). This finding also suggests that the streams are merged by the time they influence a spontaneous production response.
This evidence for multimodal influences on phonetic convergence is consistent with neurophysiological research showing visual speech modulation of speech motor areas. As has been shown for auditory speech, visual speech can induce speech motor system (cortical) activity during lip‐reading of syllables, words, and sentences (e.g. Callan et al., 2003, 2004; Hall, Fussell, & Summerfield, 2005; Nishitani & Hari, 2002; Olson, Gatenby, & Gore, 2002; Paulesu et al., 2003). This motor system activity also occurs when a subject is attending to another task and passively perceives visual speech (Turner et al., 2009). Other research shows an increase in motor system activity when visual information is added to auditory speech (e.g. Callan, Jones, & Callan, 2014; Irwin et al., 2011; Miller & D’Esposito, 2005; Swaminathan et al., 2013; Skipper, Nusbaum, & Small, 2005; Skipper et al., 2007; Uno et al., 2015; Venezia, Fillmore, et al., 2016; but see Matchin, Groulx, & Hickok, 2014). This increase is proportionate to the relative visibility of the particular segments present in the stimuli (Skipper, Nusbaum, & Small, 2005). Relatedly, with McGurk‐effect types of stimuli (audio /pa/ + video /ka/), segment‐specific reactivity in the motor cortex follows the integrated perceived syllable (/ta/; Skipper et al., 2007). This finding is consistent with other research showing that, with transcranial magnetic stimulation (TMS) priming of the motor cortex, electromyographic (EMG) activity in the articulatory muscles follows the integrated segment (Sundara, Namasivayam, & Chen, 2001; but see Sato et al., 2010). These findings are also consistent with our own evidence that phonetic convergence in production responses is based on the integration of audio and visual channels (Sanchez, Miller, & Rosenblum, 2010).
There is currently debate over whether the involvement of motor areas is necessary for audiovisual integration, and for speech perception in general (for a review, see Rosenblum, Dorsi, & Dias, 2016). But it is clear that the speech system treats auditory and visual speech information similarly for priming phonetic convergence in production responses. Thus, phonetic convergence joins the critical time‐varying and indexical dimensions as an example of general informational commonality across audio and video streams. In this sense, the recent phonetic convergence research supports a supramodal perspective.
Conclusions
Research on multisensory speech has flourished since 2005. This research has spearheaded a revolution in our understanding of the perceptual brain. The brain is now thought to be largely designed around multisensory input, with most major sensory areas showing crossmodal modulation. Behaviorally, research has shown that even our seemingly unimodal experiences are continuously influenced by crossmodal input, and that the senses have a surprising degree of parity and flexibility across multiple perceptual tasks. As we have argued, research on multisensory speech has provided seminal neurophysiological, behavioral, and phenomenological demonstrations of these principles.
Arguably, as this research has grown, it has continued to support claims made in the first version of this chapter. There is now more evidence that multisensory speech perception is ubiquitous and (largely) automatic. This ubiquity is demonstrated in the new research showing that tactile and kinesthetic speech information can be used, and can readily integrate, with heard speech. Next, the majority of the new research continues to indicate that the streams are integrated at the earliest stages of the speech function. Much of this evidence comes from neurophysiological research showing that auditory brainstem and even cochlear functioning is modulated by visual speech information. Finally, evidence continues to accumulate for the salience of a supramodal form of information. This evidence now includes findings that, like auditory speech, visual speech can act to influence an alignment response, and can modulate motor‐cortex activity for that purpose. Other support shows that the speech and talker experience gained through one modality can be shared with another modality, suggesting a mechanism sensitive to the supramodal articulatory dimensions of the stimulus: the supramodal learning hypothesis.
There is also recent evidence that can be interpreted as unsupportive of a supramodal approach. Because the supramodal approach claims that “integration” is a consequence of the informational form across modalities, evidence should show that the function is early, impenetrable, and complete. As stated, however, there are findings that have been interpreted as showing that integration can be delayed until after some lexical analysis is conducted on unimodal input (e.g. Ostrand et al., 2016). There is also evidence interpreted as showing that integration