The Handbook of Speech Perception. Group of authors. (Page 43)


from speech signals is difficult and intricate. Once the physical aspects of the acoustic waveform are encoded, phonetic properties such as formant frequencies, voicing, and voice pitch must be inferred, interpreted, and classified in a context‐dependent manner, which in turn facilitates the creation of a semantic representation of speech. In the auditory brain, this occurs along a processing hierarchy, where the lowest levels of the auditory nervous system – the inner ear, auditory nerve fibers and brainstem – encode the physical attributes of the sound and compute what may be described as low‐level features, which are then passed on via the midbrain and the thalamus toward an extensive network of auditory and multisensory cortical areas, whose task it is to form phonetic and semantic representations. As this chapter progresses, we will look in some detail at this progressive transformation of an initially largely acoustic representation of speech sounds in the auditory nerve, brainstem, midbrain, and primary cortex to an increasingly linguistic feature representation in a part of the brain called the superior temporal gyrus, and finally to semantic representations in brain areas stretching well beyond those classically thought of as auditory structures.

      While it is apt to think of this neural speech‐processing stream as a hierarchical process, it would nevertheless be wrong to think of it as entirely feed‐forward. It is well known that, for each set of ascending nerve fibers carrying auditory signals from the inner ear to the brainstem, from brainstem to midbrain, from midbrain to thalamus, and from thalamus to cortex, there is a parallel descending pathway running from cortex back to thalamus, midbrain, and brainstem, and all the way back to the ear. This descending pathway is thought to carry feedback signals that focus attention and exploit the fact that the rules of language make the temporal evolution of speech sounds partly predictable; such predictions can facilitate hearing speech in noise, or tune the ear to the voice or dialect of a particular speaker.

Figure: Schematic illustration of the ear showing the early stages of the ascending auditory pathway.

      The complexity of the anatomy is quite bewildering, and much remains unknown about the detailed structure and function of its many subdivisions. Nevertheless, we have learned a great deal about these structures and about the physiological mechanisms at work within them that underpin our ability to hear speech. Animal experiments have been invaluable in elucidating basic physiological mechanisms of sound encoding, auditory learning, and pattern classification in the mammalian brain. Clinical studies on patients with various forms of hearing impairment or aphasia have also helped to identify key cortical structures. More recently, functional brain imaging of normal volunteers, as well as invasive electrophysiological recordings from the brains of patients undergoing brain surgery for epilepsy, have further refined our knowledge of speech representations, particularly in higher‐order cortical structures.

      In the sections that follow we shall highlight some of the insights that have been gained from these types of studies. The chapter will be structured as a journey: we shall accompany speech sounds as they leave the vocal tract of a speaker, enter the listener’s ear, become encoded as trains of nerve impulses in the cochlea and auditory nerve, and then travel along the pathways just described and spread out across a phenomenally intricate network of hundreds of millions of neurons whose concerted action underpins our ability to perform the everyday magic of communicating abstract thoughts across space and time through the medium of the spoken word.

      When we speak, the different types of sound source, whether unvoiced noises or voiced harmonic series, are shaped by resonances in the vocal tract. We manipulate these resonances deftly and dynamically, changing the volume and the size of the openings of a number of cavities in the throat, mouth, and nose through articulatory movements of the jaw, soft palate, tongue, and lips. The resonances of the vocal tract impose broad peaks on the spectra of the speech sounds, and these broad spectral peaks are known as formants. The dynamic pattern of changing formant frequencies encodes the lion’s share of the semantic information in speech. Consequently, one might think that, to interpret a speech stream arriving at our ears, our ears and brains chiefly need to examine the incoming sounds for broad peaks in the spectrum in order to identify formants. But, to detect voicing and to determine voice pitch, the brain must also look either for sharp peaks at regular intervals in the spectrum, which identify harmonics, or, alternatively, for periodicities in the temporal waveform. Pitch information provided by harmonicity or, equivalently, periodicity is a vital cue that helps us identify speakers, extract prosodic information, or determine the tone of a vowel in tonal languages like Chinese or Thai, which use pitch contours to distinguish between otherwise identical homophonic syllables. Encoding information about these fundamental features, formants, and harmonicity or periodicity, is
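The two pitch cues mentioned above, harmonic spacing in the spectrum and periodicity in the waveform, are equivalent, and the latter is easy to illustrate computationally. The following sketch estimates the fundamental frequency of a synthetic voiced sound from its waveform periodicity using a simple autocorrelation method; the signal parameters (a 120 Hz harmonic series, 16 kHz sampling rate) are illustrative assumptions, not values from the text, and real pitch trackers are considerably more elaborate.

```python
import numpy as np

def estimate_f0_autocorr(signal, fs, fmin=60.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) from waveform periodicity.

    Computes the signal's autocorrelation and finds the lag of its
    largest peak within the plausible range of voice pitch periods
    (between 1/fmax and 1/fmin seconds). A strongly periodic waveform,
    i.e. a voiced sound, produces a clear peak at its period.
    """
    sig = np.asarray(signal, dtype=float)
    sig = sig - sig.mean()
    # Autocorrelation at non-negative lags only.
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo = int(fs / fmax)          # shortest candidate period, in samples
    hi = int(fs / fmin)          # longest candidate period, in samples
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return fs / lag

# Synthetic voiced sound: a harmonic series at f0 = 120 Hz with
# amplitudes falling off as 1/k (illustrative values only).
fs = 16000
t = np.arange(int(0.1 * fs)) / fs
f0 = 120.0
voiced = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 11))

print(f"estimated f0: {estimate_f0_autocorr(voiced, fs):.1f} Hz")
```

The same fundamental could equally be recovered in the spectral domain, by measuring the spacing between the sharp harmonic peaks of the sound's Fourier spectrum; the broad formant peaks, by contrast, would show up as the smooth envelope riding over those harmonics.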
