that can be instantiated in multiple modalities. Such a mechanism would not need to contend with translating information across modality‐specific codes, or to involve a formal process of sensory integration (or merging), as such. From this perspective, the integration is a characteristic of the relevant information itself. Of course, the energetic details of the (light, sound, tactile‐mechanical) input and their superficial receptor reactions are necessarily distinct. But the deeper speech function may act to register the phonetically relevant higher‐order patterns of energy that can be functionally the same across modalities.

      The supramodal theory has been motivated by the characteristics of multisensory speech discussed earlier, including: (1) neurophysiological and behavioral evidence for the automaticity and ubiquity of multisensory speech; (2) neurophysiological evidence for a speech mechanism sensitive to multiple sensory forms; (3) neurophysiological and behavioral evidence for integration occurring at the earliest observable stage; and (4) informational analyses showing a surprisingly close correlation between optic and acoustic informational variables for a given articulatory event. The theory is consistent with Carol Fowler’s direct approach to speech perception (e.g. Fowler, 1986, 2010) and with James Gibson’s theory of multisensory perception (Gibson, 1966, 1979; see also Stoffregen & Bardy, 2001). The theory is also consistent with the task‐machine and metamodal theories of general multisensory perception, which argue that function and task, rather than sensory system, are the guiding principles of the perceptual brain (e.g. Pascual‐Leone & Hamilton, 2001; Reich, Maidenbaum, & Amedi, 2012; Ricciardi et al., 2014; Striem‐Amit et al., 2011; see also Fowler, 2004; Rosenblum, 2013; Rosenblum, Dias, & Dorsi, 2017).

      Summerfield (1987) was the first to suggest that the informational form for certain articulatory actions can be construed as the same across vision and audition. As an intuitive example, he suggested that the higher‐order information for a repetitive syllable would be the same in sound and light. Consider a speaker repetitively articulating the syllable /ma/. For hearing, a repetitive oscillation of the amplitude and spectral structure of the acoustic signal would be lawfully linked to the repetitive movements of the lips, jaw, and tongue. For sight, a repetitive restructuring of the light reflecting from the face would also be lawfully linked to the same movements. While the energetic details of the information differ across modalities, the more abstract repetitive informational restructuring occurs in both modalities in the same oscillatory manner, with the same time course, so as to be specific to the articulatory movements. Thus, repetitive informational restructuring could be considered supramodal information – available in both the light and the sound – that acts to specify a speech event of repetitive articulation. A speech mechanism sensitive to this form of supramodal information would function without regard to the sensory details specific to each modality: the relevant form of information exists in the same way (abstractly defined) in both modalities. In this sense, a speech function that could pick up on this abstract form of information in multiple modalities would not require integration or translation of the information across modalities.
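      To make the idea of a shared oscillatory time course concrete, the toy sketch below (not drawn from Summerfield or any study cited here) generates a synthetic acoustic amplitude envelope and a synthetic lip‐aperture signal for a speaker repeating /ma/ at an assumed rate, and shows that the two modality‐specific signals share the same higher‐order oscillatory pattern. All signal names, rates, and noise levels are hypothetical illustrations rather than measured data.

```python
import numpy as np

# Toy illustration (not the chapter's analysis): a speaker repeating /ma/
# produces oscillatory restructuring in both the acoustic amplitude
# envelope and the visible lip aperture. The signals below are synthetic
# stand-ins for measured data; rates and noise levels are assumptions.
fs = 100.0                           # frames per second (hypothetical sampling rate)
t = np.arange(0.0, 4.0, 1.0 / fs)    # four seconds of repeated /ma/
syllable_rate = 3.0                  # assumed repetitions per second

rng = np.random.default_rng(0)

# Acoustic envelope: amplitude peaks each time the mouth opens for the vowel.
acoustic_envelope = 0.5 + 0.5 * np.sin(2 * np.pi * syllable_rate * t)
acoustic_envelope += 0.05 * rng.standard_normal(t.size)

# Lip aperture: opens and closes with the same articulatory cycle.
lip_aperture = 0.5 + 0.5 * np.sin(2 * np.pi * syllable_rate * t)
lip_aperture += 0.05 * rng.standard_normal(t.size)

# The modality-specific details (units, noise) differ, but the shared
# oscillatory time course shows up as a strong cross-modal correlation.
r = np.corrcoef(acoustic_envelope, lip_aperture)[0, 1]
print(f"acoustic-optic correlation: {r:.2f}")   # close to 1.0
```

      In this contrived case the correlation is nearly perfect; the substantive claim of the supramodal account is only that the abstract, lawful restructuring over time, not the modality‐specific energy, is what the speech function registers.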

      Summerfield (1987) offered other examples of supramodal information such as how quantal changes in articulation (e.g. bilabial contact to no contact), and reversals in articulation (e.g. during articulation of a consonant–vowel–consonant such as /wew/) would be accompanied by corresponding quantal and reversal changes in the acoustic and optic structure.

      Other recent research has determined that some of the strongest correlations between audible and visible signals lie in the acoustic range of 2–3 kHz (Chandrasekaran et al., 2009). This may seem unintuitive, because it is within this range that the presumably less visible articulatory movements of the tongue and pharynx play their largest role in sculpting the sound. However, the configurations of these articulators were shown to systematically influence subtle visible mouth movements. This suggests that there is a class of visible information that strongly correlates with the acoustic information shaped by internal articulators. In fact, visual speech research has shown that presumably “hidden” articulatory dimensions (e.g. lexical tone, intraoral pressure) are actually visible in corresponding changes of the facial surface, and can be used as speech information (Burnham et al., 2000; Han et al., 2018; Munhall & Vatikiotis‐Bateson, 2004). That visible mouth movements can inform about internal articulation may explain a striking recent finding: when observers are shown cross‐sectional ultrasound displays of internal tongue movements, they can readily integrate these novel displays with synchronized auditory speech information (D’Ausilio et al., 2014; see also Katz & Mehta, 2015).

      The strong correspondences between auditory and visual speech information have allowed auditory speech to be synthesized from kinematic dimensions tracked on the face (e.g. Barker & Berthommier, 1999; Yehia, Kuratate, & Vatikiotis‐Bateson, 2002). Conversely, the correspondences have allowed facial animation to be created effectively from parameters of the acoustic signal itself (e.g. Yamamoto, Nakamura, & Shikano, 1998). There is also evidence for surprisingly close correspondences between audible and visible macaque calls, which macaques readily perceive as corresponding (Ghazanfar et al., 2005). This finding may suggest a traceable phylogeny of the supramodal basis for multisensory communication.
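      As a rough illustration of how such cross‐modal synthesis can work, the sketch below fits a simple linear map from hypothetical facial‐kinematic measurements to hypothetical acoustic parameters by least squares. It is a minimal stand‐in for the richer estimation methods used in the cited work; every array, dimension, and variable name is assumed for illustration.

```python
import numpy as np

# Minimal sketch, assuming a simple linear regression from facial kinematics
# to acoustic parameters. The cited synthesis work uses richer features and
# models; the arrays, dimensions, and names here are hypothetical.
rng = np.random.default_rng(0)
n_frames = 500
face_kinematics = rng.standard_normal((n_frames, 12))   # e.g. lip/jaw marker positions
true_mapping = rng.standard_normal((12, 8))              # unknown in a real experiment
acoustic_params = face_kinematics @ true_mapping         # e.g. spectral parameters
acoustic_params += 0.1 * rng.standard_normal(acoustic_params.shape)

# Estimate the face-to-acoustics map by least squares.
W, *_ = np.linalg.lstsq(face_kinematics, acoustic_params, rcond=None)

# Predict acoustic parameters from facial frames; the reverse map
# (acoustics to face, for facial animation) can be estimated the same way.
predicted = face_kinematics @ W
mse = np.mean((predicted - acoustic_params) ** 2)
print(f"mean squared reconstruction error: {mse:.3f}")
```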
