The Handbook of Speech Perception. Группа авторов
Чтение книги онлайн.
Читать онлайн книгу The Handbook of Speech Perception - Группа авторов страница 44
This concept of tonotopy is quite central to the way all sounds, not just speech sounds, are usually thought to be represented along the lemniscal auditory pathway. All the stations of the lemniscal auditory pathway shown in Figure 3.1, from the cochlea to the primary auditory cortex, contain at least one, and sometimes several tonotopic maps, that is, arrays of frequency‐tuned neurons arranged in a systematic array from low to high preferred frequency. It is therefore worth examining this notion of tonotopy in some detail to understand its origin, and to ask what tonotopy can and cannot do to represent fundamental features of speech.
In the mammalian brain, tonotopy arises quite naturally from the way in which sounds are transduced into neural responses by the basilar membrane and organ of corti in the cochlea. When sounds are transmitted from the ear drum to the inner ear via the ossicles, the mechanical vibrations are transmitted to the basilar membrane via fluid‐filled chambers of the inner ear. The basilar membrane itself has a stiffness gradient, being stiff at the basal end, near the ossicles, and floppy at the far end, the apex. Sounds transmitted through the far end have little mechanical resistance from the stiffness of the basilar membrane, but have to displace more inert fluid column in the inner ear. Sounds traveling through the near end face less inertia, but more stiffness. The upshot of this is that one can think of the basilar membrane as a bank of mechanical spring‐mass filters, with filters tuned to high frequencies at the base, and to increasingly lower frequencies toward the apex. Tiny, highly sensitive hair cells that sit on the basilar membrane then pick up these frequency‐filtered vibrations and translate them into electrical signals, which are then encoded as trains of nerve impulses (also called action potentials or spikes) in the bundles of auditory nerve fibers that connect the inner ear to the brain. Thus, each nerve fiber in the auditory nerve is frequency tuned, and the sound frequency it is most sensitive to is known as its characteristic frequency (CF).
The cochlea, and the basilar membrane inside it, is curled up in a spiral, and the organization of the auditory nerve mirrors that of the basilar membrane: inside it we have something that could be described as a rate–place code for sounds, where the amount of sound energy at the lowest audible frequencies (around 50 Hz) is represented by the firing rates of nerve fibers right at the center, and increasingly higher frequencies are encoded by nerve fibers that are arranged in a spiral around that center. Once the auditory nerve reaches the cochlear nuclei, this orderly spiral arrangement unwraps to project systematically across the extent of the nuclei, creating tonotopic maps, which are then passed on up the auditory pathway by orderly anatomical connections from one station to the next. What this means for the encoding of speech in the early auditory system is that formant peaks of speech sounds, and maybe also the peaks of harmonics, should be represented by systematic differences in firing rates across the tonotopic array. The human auditory nerve contains about 30,000 such nerve fibers, each capable of firing anywhere between zero and several hundred spikes a second. So there are many hundreds of thousands of nerve impulses per second available to represent the shape of the sound spectrum across the tonotopic array. And, indeed, there is quite a lot of experimental evidence that systematic firing‐rate differences across this array of nerve fibers is not a bad first‐order approximation of what goes on in the auditory system (Delgutte, 1997), but, as so often in neurobiology, the full story is a lot more complicated.
Thanks to decades of physiological and anatomical studies on experimental animals by dozens of teams, the mechanisms of sound encoding in the auditory nerve are now known in sufficient detail that it has become possible to develop computer models that can predict the activity of auditory nerve fibers to arbitrary sound inputs (Zhang et al., 2001; Heinz, Colburn, & Carney, 2002; Sumner et al., 2002; Meddis & O’Mard, 2005; Zhang & Carney, 2005; Ferry & Meddis, 2007), and here we shall use the model of Zilany, Bruce, and Carney (2014) to look at the encoding of speech sounds in the auditory nerve in a little more detail.
The left panel of Figure 3.2 shows the power spectrum of a recording of the spoken vowel [ɛ], as in head (IPA [hɛːd]). The spectrum shows many sharp peaks at multiples of about 145 Hz – the harmonics of the vowel. These sharp peaks ride on top of broad peaks centered around 500, 1850, and 2700 Hz – the formants of the vowel. The right panel of the figure shows the distribution of firing rates of low spontaneous rate (LSR) auditory nerve fibers in response to the same vowel, according to the auditory nerve fiber model by Zilany, Bruce, and Carney (2014). Along the x‐axis we plot the CF of each nerve fiber, and along the y‐axis we expect the average number of spikes the fiber would be expected to fire per second when presented with the vowel [ɛ] at a sound level of 65 dB SPL (sound pressure level), the sort of sound level that would be typical during a calm conversation with a quiet background.
Figure 3.2 A power spectrum representing an instantaneous spectrogram (left) and a simulated distribution of firing rates for an auditory nerve fiber (right) for the vowel [ɛ] in head [hɛːd].
Comparing the spectrogram on the left with the distribution of firing rates on the right, it is apparent that the broad peaks of the formants are well reflected in the firing rate distribution, if anything perhaps more visibly than in the spectrum, but that most of the harmonics are not. Indeed, only the lowest three harmonics are visible; the others have been ironed out by the fact that the frequency tuning of cochlear filters is often broad compared to the frequency interval between individual harmonics, and becomes broader for higher frequencies. Only the very lowest harmonics are therefore resolved by the rate–place code of the tonotopic nerve fiber array, and we should think of tonotopy as well adapted to representing formants but poorly adapted to representing pitch or voicing information. If you bear in mind that many telephones will high‐pass filter speech at 300 Hz, thereby effectively cutting off the lowest harmonic peak, there really is not much information about the harmonicity of the sound left reflected in the tonotopic firing‐rate distribution. But there are important additional cues to voicing and pitch, as we shall see shortly.
The firing rates of auditory nerve fibers increase monotonically with increasing sound level, but these fibers do need a minimum‐threshold sound level, and they cannot increase their firing rates indefinitely when sounds keep getting louder. This gives auditory nerve fibers a limited dynamic range, which usually covers 50 dB or less. At the edges of the dynamic range, the formants of speech sounds cannot be effectively represented across the tonotopic array because the neurons in the array either fire not at all (or not above their spontaneous firing rates), or because they all fire as fast as they can. However, people can usually understand speech well over a very broad range of sound levels. To be able to code sounds effectively over a wide range of sound levels, the ear appears to have evolved different types of auditory nerve fibers, some of which specialize in hearing quiet sounds, with low thresholds but also relatively low‐saturation sound levels, and others of which specialize in hearing louder sounds, with higher thresholds and higher saturation levels. Auditory physiologists call the more sensitive of these fiber types high spontaneous rate (HSR) fibers, given that these auditory nerve fibers may fire nerve impulses at fairly elevated rates (some 30 spikes per second or so), even in the absence of any external sound, and the less sensitive fibers the LSR fibers, which we have already encountered, and which fire only a handful of spikes per second in