The Handbook of Speech Perception. Multiple authors
Generic auditory organization and speech perception
The intelligibility of sinewave replicas of utterances, of noise‐band vocoded speech, and of speech chimeras reveals that a perceiver can find and follow a speech signal composed of dissimilar acoustic and auditory constituents, in contrast to the principles on which gestalt‐based generic functions operate. These findings show that perceptual organization of speech can occur solely by virtue of attention to the complex coordinate variation of an acoustic pattern. The use of such exotic acoustic signals for the proof leaves some uncertainty about whether ordinary speech perception is satisfactorily characterized by tests using these acoustic oddities. An argument of Remez et al. (1994) for considering these tests to be a useful index of the perception of commonplace speech signals begins by noting that phonetic perception of sinewave replicas of utterances depends on a simple instruction to listen to the tones as speech. Because the disposition to hear sinewave words and sentences appears readily, without arduous or lengthy training, this prompt adaptation to phonetic organization and analysis suggests that the ordinary cognitive resources of speech perception are operating for sinewave speech. Although some form of short‐term perceptual learning might be involved, the swiftness of the appearance of adequate perceptual function is evidence that any special induction to accommodate sinewave signals is a marginal component of perception.
Even so, natural speech consists of large stretches of glottal pulsing, which creates amplitude comodulation over time and harmonic relations between concurrent portions of the spectrum. This has led to a reasonable proposal (Barker & Cooke, 1999; Darwin, 2008) that generic auditory grouping functions, although not necessary for the perceptual organization of speech, contribute to perceptual organization when speech spectra satisfy the gestalt criteria. The consistent finding that speech spectra organize quickly – on the order of milliseconds – and that generic auditory grouping takes time to build – on the order of seconds – may justify doubt about the asserted privilege of gestalt‐based grouping by similarity. A critical empirical test was provided by Carrell and Opie (1992), which offers an index of the plausibility of the claim. In the test, the intelligibility of sinewave sentences was compared in two acoustic conditions: (1) three‐tone time‐varying sinusoids; and (2) three‐tone time‐varying sinusoids on which a regular amplitude pulse was imposed. Although the tone patterns in the first condition were not susceptible to gestalt‐based grouping, because they failed to exhibit similarity in each of the relevant dimensions that we have discussed, the pulsed tone patterns in the second condition exhibited amplitude comodulation and harmonicity in their complex spectra (Bregman, Levitan, & Liao, 1990). All other things being equal, the perceptual organization attributable to complex coordinate variation should have been reinforced by perceptual organization attributable to similarity that triggers generic auditory grouping. Indeed, Carrell and Opie found that pulsed sentences were more intelligible than smoothly varying sinusoids, as if the spectral components once bound more securely were more successfully analyzed.
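The contrast between the two acoustic conditions can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not a reconstruction of the stimuli used by Carrell and Opie: the sample rate, the raised‐cosine shape of the pulse, the function name `three_tone_signal`, and the three frequency glides are all hypothetical choices made here for clarity. The sketch sums three time‐varying sinusoids and, optionally, multiplies them by a single shared envelope, which is what makes the components amplitude‐comodulated.

```python
import math

SAMPLE_RATE = 16000  # Hz; an assumed value, not taken from the study


def three_tone_signal(freq_tracks, duration_s, pulse_hz=None,
                      sample_rate=SAMPLE_RATE):
    """Sum time-varying sinusoids; optionally impose a shared amplitude pulse.

    freq_tracks: list of functions mapping time (s) to frequency (Hz).
    pulse_hz:    if given, the summed tones are multiplied by one
                 raised-cosine envelope (an assumed pulse shape), so all
                 components rise and fall in amplitude together.
    """
    n_samples = int(duration_s * sample_rate)
    phases = [0.0] * len(freq_tracks)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        total = 0.0
        for i, track in enumerate(freq_tracks):
            # Accumulate phase so each tone's frequency can vary smoothly.
            phases[i] += 2.0 * math.pi * track(t) / sample_rate
            total += math.sin(phases[i])
        if pulse_hz is not None:
            # Common envelope in [0, 1], repeating pulse_hz times per second.
            total *= 0.5 * (1.0 - math.cos(2.0 * math.pi * pulse_hz * t))
        # Normalize by the number of tones to keep samples within [-1, 1].
        samples.append(total / len(freq_tracks))
    return samples


# Hypothetical formant-like glides, for illustration only.
tracks = [
    lambda t: 500.0 + 200.0 * t,
    lambda t: 1500.0 - 300.0 * t,
    lambda t: 2500.0 + 100.0 * t,
]

smooth = three_tone_signal(tracks, 0.5)                  # condition (1)
pulsed = three_tone_signal(tracks, 0.5, pulse_hz=100.0)  # condition (2)
```

In this sketch the two signals share identical frequency trajectories, so the pattern of coordinate variation is the same in both; only the pulsed version adds the common amplitude modulation that generic auditory grouping could exploit. Changing `pulse_hz` allows the same comparison at other pulse rates.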
The assertion offered by Barker and Cooke (1999) about this phenomenon is that generic auditory functions can reinforce the grouping of speech signals, although on close examination the evidence does not yet warrant an endorsement of a hybrid model of perceptual organization. Carrell and Opie had used a range of pulse rates and conditions in their study, and reported that the intelligibility gain attributable to pulsing a sinewave sentence was restricted to a pulse rate in the range of 50–100 Hz. No benefit of pulsing was observed for a pulse rate of 200 Hz. While this topic merits additional examination, the available evidence encourages a doubtful conclusion about this hypothetical hybrid character of perceptual organization, which would necessarily be limited in applicability to speech signals produced by low bass voices; its benefit would not extend to tenors, to say nothing of altos and sopranos. Most generously, we might conclude that the relation of primitive gestalt‐based generic auditory grouping and the more abstract organization by sensitivity to coordinate variation cannot be defined without stronger evidence, and that it is premature to conclude that the gestalt set plays a prominent or even a secondary role in the perceptual organization of speech.
Implications of perceptual organization for theories of speech perception
The nature of speech cues
What causes the perception of speech? A classic answer takes a linguistically significant contrast – voicing, for instance – and provides an inventory of natural acoustic correlates of a careful articulation of the contrast (e.g. Lisker, 1978). A perceptual account that reverses the method depicts a meticulous listener collecting individual acoustic correlates as they land and assembling them in a stream, thereby to tally the strength with which a constellation of cues indicates the likely occurrence of a linguistic constituent. Klatt’s retrospective survey of perceptual accounts describes many normative approaches that treat the acoustic signal as a straightforward composite of acoustic correlates. The function of perceptual organization, usually omitted in such accounts, establishes the perceiver’s compliance with the acoustic products of a specific source of sound, and in the case of speech it is the probabilistic function that finds and tracks the likely acoustic products of vocalization. However, it is clear from evidence of several sorts – tolerance of distortion, effectiveness of impossible signals, forgiveness of departures from natural timbre – that the organizational component of perception that yields a speech stream fit to be analyzed cannot collect acoustic cues piecemeal, as this simple view describes. The functions of perceptual organization act, instead, as if attuned to a complex form of regular if unpredictable spectro‐temporal variation within which the specific acoustic and auditory elements matter far less than their overall configuration.
The evolving portrait of speech perception that includes organization and analysis recasts the raw cue as the property of perception that gives speech its phenomenality, though not its phonetic effect. The transformation of natural speech to chimera, to noise‐band vocoded signal, and to sinewave replica is phonetically conservative, preserving the fine details of subphonemic variation while varying to the extremes of timbre or auditory quality. It is apparent that the competent listener derives phonetic impressions from the properties that these different kinds of signal share, and derives qualitative impressions from their unique attributes. The shared attribute, for want of a more precise description, is a complex modulation of spectrum envelopes, although the basis for the similar effect of the infinitely sharp peaks of sinewave speech and the far coarser spectra of chimerical and noise‐band vocoded speech has still to be explained. None of these manifests the cues present in natural speech despite the success of listeners in understanding the message. The conclusion supported by these findings is clear: phonetic perception does not require the sensory registration of natural speech cues. Instead, the organizational component of speech perception operates on a spectro‐temporal grain that is requisite both for finding and following a speech signal and for analyzing its linguistic properties. The speech cues that seemed formerly to bear the burden of stimulating phonetic analyzers into action appear in hindsight to provide little more than auditory quality subordinate to the phonetic stream.
An additional source of evidence is encountered in the phenomenal experience of perceivers who listen to speech via electrocochlear prostheses (Goh et al., 2001; Liebenthal et al., 2003). Intelligibility of speech perceived via cochlear implant is often excellent, rivaling that of normal hearing, and recent studies with infant and juvenile subjects (Svirsky et al., 2000) suggest that this form of sensory substitution is effective even at the earliest stages of language development (see Hunter & Pisoni, Chapter 20). The mechanism of acoustic transduction at the auditory periphery is anomalous, it goes without saying, and the phenomenal experience of listeners receiving this appliance to initiate neural activity differs hugely from the ordinary auditory experience of natural speech. Despite the absence of veridical perceptual experience of the raw qualities of natural speech, electrocochlear prostheses are effective in the self‐regulation of speech production by its users, and are effective perceptually despite the abject deficit in faithfully presenting natural acoustic elements of speech. What brings