The Handbook of Multimodal-Multisensor Interfaces, Volume 1. Sharon Oviatt

Complex multisensory or multimodal actions, compared with unimodal or simpler actions, can have a substantial and broad facilitatory effect on cognition [James 2010, Kersey and James 2013, Oviatt 2013].

      As an example, writing complex letter shapes creates a long-term sensory-motor memory, which is part of an integrated multisensory-multimodal “reading neural circuit” [Nakamura et al. 2012]. The multisensory experience of writing includes a combination of haptic, auditory, and visual feedback. In both children and adults, fMRI studies have shown that actively writing letters increases brain activation to a greater extent than passively viewing, naming, or typing them [James 2010, James and Engelhardt 2012, Kersey and James 2013, Longcamp et al. 2005, 2008]. Compared with simple tapping on keys during typing, constructing letter shapes also improves the accuracy of subsequent letter recognition, a prerequisite for successful comprehension and reading. In short, writing letters leads to a more elaborate and durable ability to recognize letter shapes over time. Research by Berninger and colleagues [Berninger et al. 2009, Hayes and Berninger 2010] has further documented that the multisensory-multimodal experience of writing letter shapes, compared with typing, facilitates spelling, written composition, and the content of ideas expressed in a composition. This extensive body of neuroscience and behavioral findings has direct implications for the broad cognitive advantages of pen-based and multimodal interfaces.

      Figure 1.3 Embodied cognition view of the perception-action loop during multisensory integration, which utilizes the Maximum Likelihood Estimation (MLE) model and combines prior knowledge with multisensory sources of information. (From Ernst and Bülthoff [2004])

      In research on multisensory integration, Embodied Cognition theory also has provided a foundation for understanding human interaction with the environment from a systems perspective. Figure 1.3 illustrates how multisensory signals from the environment are combined with prior knowledge to form more accurate percepts [Ernst and Bülthoff 2004]. To describe multisensory integration, Ernst and colleagues apply the Maximum Likelihood Estimation (MLE) model, which is based on Bayes’ rule. As introduced earlier, MLE integrates sensory signal input so as to minimize variance in the final estimate under different circumstances. It determines the degree to which information from one modality will dominate over another [Ernst and Banks 2002, Ernst and Bülthoff 2004]. For example, the MLE rule predicts that visual capture will occur whenever the visual stimulus is relatively noise-free and its estimate of a property has less variance than the haptic estimate. Conversely, haptic capture will prevail when the visual stimulus is noisier and the haptic estimate has the lower variance.
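      To make the weighting scheme concrete, the short Python sketch below implements the standard MLE cue-combination rule for two cues, with each estimate weighted in inverse proportion to its variance; the function name and the numeric variance values are illustrative assumptions, not taken from the studies cited above.

# Minimal sketch of the MLE cue-combination rule for two modalities:
# each cue is weighted in inverse proportion to its variance, so the
# less noisy modality dominates the combined percept.

def mle_combine(estimate_v, var_v, estimate_h, var_h):
    """Combine a visual and a haptic estimate of the same property."""
    w_v = (1.0 / var_v) / (1.0 / var_v + 1.0 / var_h)  # visual weight
    w_h = 1.0 - w_v                                    # haptic weight
    return w_v * estimate_v + w_h * estimate_h, w_v, w_h

# Hypothetical object-size estimates (mm) under two noise conditions.
# Low visual noise: the visual estimate dominates ("visual capture").
print(mle_combine(estimate_v=50.0, var_v=1.0, estimate_h=55.0, var_h=9.0))

# Noisy visual input: the haptic estimate dominates ("haptic capture").
print(mle_combine(estimate_v=50.0, var_v=16.0, estimate_h=55.0, var_h=2.0))

      In the first call the combined estimate stays close to the visual value (visual weight 0.9); in the second it shifts toward the haptic value, as the MLE rule predicts.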

      Empirical research has shown that the human nervous system’s multisensory perceptual integration closely matches the behavior of the MLE integrator model. Ernst and Banks [2002] demonstrated this in a combined visual-haptic task. The net effect is that the final estimate has lower variance than either the visual or the haptic estimate alone. To support decision-making, prior knowledge is incorporated into the sensory integration model to further disambiguate sensory information. As depicted in Figure 1.3, this embodied perception-action process provides a basis for deciding what goal-oriented action to pursue. Selective action may in turn recruit further sensory information, alter the environment that is experienced, or change people’s understanding of their multisensory experience. See James and colleagues’ Chapter 2 in this volume [James et al. 2017] for an extensive discussion and empirical evidence supporting Embodied Cognition theory.
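      Two further properties discussed above, the reduced variance of the combined estimate and the use of prior knowledge to disambiguate noisy input, can be sketched in the same way; the sketch below assumes Gaussian noise on each cue and a Gaussian prior, and all numeric values are hypothetical.

# Variance of the MLE-combined visual-haptic estimate; it is always
# lower than the variance of either individual cue.
def combined_variance(var_v, var_h):
    return (var_v * var_h) / (var_v + var_h)

# Folding prior knowledge into the estimate: with Gaussian assumptions,
# the prior acts as one more precision-weighted source of information.
def posterior(prior_mean, prior_var, cue_mean, cue_var):
    precision = 1.0 / prior_var + 1.0 / cue_var
    mean = (prior_mean / prior_var + cue_mean / cue_var) / precision
    return mean, 1.0 / precision

print(combined_variance(4.0, 6.0))   # 2.4, lower than both 4.0 and 6.0

# A broad prior (variance 10.0) nudges a noisy combined estimate toward
# expected values while further reducing its uncertainty.
print(posterior(prior_mean=52.0, prior_var=10.0, cue_mean=50.5, cue_var=2.4))

      Here the combined variance of 2.4 falls below both unimodal variances, and the prior pulls the estimate slightly toward the expected value of 52 while lowering its uncertainty further.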

      Communication Accommodation theory presents a socially situated perspective on embodied cognition. Research in this tradition has shown that interactive human dialogue involves extensive co-adaptation of communication patterns between interlocutors. Interpersonal conversation is a dynamic adaptive exchange in which speakers’ lexical, syntactic, and speech signal features are all tailored in a moment-by-moment manner to their conversational partner. In most cases, children and adults adapt all aspects of their communicative behavior to converge with those of their partner, including speech amplitude, pitch, rate of articulation, pause structure, response latency, phonological features, gesturing, drawing, and body posture, among other aspects [Burgoon et al. 1995, Fay et al. 2010, Giles et al. 1987, Welkowitz et al. 1976]. The impact of these communicative adaptations is to enhance the intelligibility, predictability, and efficiency of interpersonal communication [Burgoon et al. 1995, Giles et al. 1987, Welkowitz et al. 1976]. For example, if one speaker uses a particular lexical term, then their partner has a higher likelihood of adopting it as well. This mutual shaping of lexical choice facilitates language learning, as well as the comprehension of newly introduced ideas between people.

      Communication accommodation occurs not only in interpersonal dialogue, but also during human-computer interaction [Oviatt et al. 2004b, Zolton-Ford 1991]. These mutual adaptations also occur across different modalities (e.g., handwriting, manual signing), not just speech. For example, when drawing, interlocutors typically shift from initially sketching a careful likeness of an object to converging with their partner’s simpler drawing [Fay et al. 2010]. A similar convergence of signed gestures has been documented between deaf communicators. Within a community of previously isolated deaf Nicaraguans who were brought together in a school for the deaf, a novel sign language became established rapidly and spontaneously. This new sign language and its lexicon most likely emerged through convergence of the signed gestures, which then became widely produced among community members as the new language took form [Kegl et al. 1999, Goldin-Meadow 2003].

      At the level of neurological processing, convergent communication patterns are controlled by the mirror and echo neuron systems [Kohler et al. 2002, Rizzolatti and Craighero 2004]. Mirror and echo neurons provide the multimodal neurological substrate for action understanding, at the level of both physical and communicative actions. Observation of an action in another person primes an individual to prepare for action, and also to comprehend the observed action. For example, when participating in a dialogue during a cooking class, one student may observe another’s facial expressions and pointing gesture when she says, “I cut my finger.” In this context, the listener is primed multimodally to act, comprehend, and perhaps reply verbally. The listener experiences neurological priming, or activation of the brain regions and musculature associated with their own fingers. This prepares the listener to act, which may involve imitating the observed retraction with their own fingers. The same neurological priming enables the listener to comprehend the speaker’s physical experience and emotional state. This socially situated perception-action loop provides the evolutionary basis for imitation learning, language learning, and mutual comprehension of ideas.

      This theory, together with the related literature on convergence of multimodal communication patterns, has been applied to designing more effective conversational software personas and social robots. One direct implication of this work is that the design of a system’s multimodal output can be used to transparently guide users to provide input that is more compatible with the system’s processing repertoire, which improves system reliability and performance [Oviatt et al. 2004b]. For example, users interacting with a computer have been shown to adopt speech volume, rate, and lexical choices that the system can process more reliably [Oviatt et al. 2004b, Zolton-Ford 1991].

      Affordance theory presents a systems-theoretic view closely related to Gestalt theory. It also complements Activity theory, because it specifies the type of activity that users are most likely to engage in when using different types of computer interface. It states that people have perceptually based expectations about objects, including computer interfaces, which carry different constraints on how one can act on them to achieve goals. These affordances of objects establish behavioral attunements that transparently but powerfully prime the likelihood that people will act in specific ways [Gibson 1977, 1979]. Affordance theory has been widely applied to human interface design, especially the design of input devices [Gaver 1991, Norman 1988].

      Since object perception is multisensory, people are influenced by an array
