The Concise Encyclopedia of Applied Linguistics. Carol A. Chapelle

      JOHN LEVIS AND RUSLAN SUVOROV

      Automatic speech recognition (ASR) is an independent, machine‐based process of decoding and transcribing oral speech. A typical ASR system receives acoustic input from a speaker through a microphone; analyzes it using some pattern, model, or algorithm; and produces an output, usually in the form of a text (Lai, Karat, & Yankelovich, 2008).
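The input–analysis–output stages just described can be sketched in miniature. Everything below is illustrative: the function names, the toy "model," and the feature extraction are placeholders standing in for the acoustic front end and the pattern/model/algorithm stage, not a real ASR API.

```python
# Minimal sketch of the generic ASR pipeline: acoustic input -> feature
# extraction -> decoding against a model -> text output. All names and
# the toy model are invented for illustration.

def extract_features(audio_samples):
    """Stand-in for acoustic analysis: chop the signal into fixed frames."""
    frame_size = 4
    return [audio_samples[i:i + frame_size]
            for i in range(0, len(audio_samples), frame_size)]

def decode(frames, model):
    """Stand-in for the pattern/model/algorithm stage: map each frame
    to the closest symbol in a toy 'model' of expected feature values."""
    def closest(frame):
        return min(model, key=lambda sym: abs(sum(frame) - model[sym]))
    return " ".join(closest(f) for f in frames)

# Toy model: each word is summarized by a single expected feature value.
toy_model = {"yes": 10, "no": 2}
audio = [3, 3, 2, 2, 0, 1, 0, 1]  # pretend microphone samples
print(decode(extract_features(audio), toy_model))  # prints "yes no"
```

A real system replaces each stand-in with far richer machinery (spectral features, statistical models, a language model), but the overall flow from acoustic input to text output is the same.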

      It is important to distinguish speech recognition from speech understanding, the latter being the process of determining the meaning of an utterance rather than its transcription. Speech recognition is also different from voice (or speaker) recognition: Whereas speech recognition refers to the ability of a machine to recognize the words and phrases that are spoken (i.e., what is being said), speaker (or voice) recognition involves the ability of a machine to recognize the person who is speaking.

      Pioneering work on ASR dates to the early 1950s. The first ASR system, developed at Bell Telephone Laboratories by Davis, Biddulph, and Balashek (1952), could recognize isolated digits from 0 to 9 spoken by a single speaker. In 1956, Olson and Belar created a phonetic typewriter that could recognize 10 discrete syllables; it, too, was speaker‐dependent and required extensive training.

      These early ASR systems used template‐based recognition based on pattern matching that compared the speaker's input with prestored acoustic templates or patterns. Pattern matching operates well at the word level for recognition of phonetically distinct items in small vocabularies but is less effective over larger vocabularies. Another limitation of pattern matching is its inability to match and align input speech signals with prestored acoustic models of different lengths. As a result, the performance of these early ASR systems was lackluster: their acoustic approaches recognized only basic units of speech clearly enunciated by a single speaker (Rabiner & Juang, 1993).
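A toy sketch can make the template‐matching idea, and the length‐alignment limitation noted above, concrete. The templates, feature frames, and word labels below are all invented for illustration:

```python
import math

def template_score(input_frames, template_frames):
    """Naive frame-by-frame comparison, as in early template-based
    recognizers: only defined when the input and the stored template
    have the same number of frames -- the alignment limitation."""
    if len(input_frames) != len(template_frames):
        raise ValueError("cannot align sequences of different lengths")
    return sum(math.dist(a, b)  # Euclidean distance between two frames
               for a, b in zip(input_frames, template_frames))

def recognize(input_frames, templates):
    """Return the stored word whose template is acoustically closest."""
    return min(templates,
               key=lambda w: template_score(input_frames, templates[w]))

# Toy 2-D "acoustic" templates for two words.
templates = {
    "yes": [(1.0, 0.0), (0.0, 1.0)],
    "no":  [(5.0, 5.0), (6.0, 6.0)],
}
print(recognize([(0.9, 0.1), (0.2, 0.8)], templates))  # prints "yes"
```

Because `template_score` insists on equal lengths, an utterance spoken faster or slower than the stored template simply cannot be compared, which is exactly the weakness that later alignment and modeling techniques addressed.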

      The main strength of a hidden Markov model (HMM) is that it can describe the probability of states and represent their order and variability through techniques such as the Baum‐Welch algorithm (for training model parameters) and the Viterbi algorithm (for decoding). In other words, HMMs can adequately capture both the temporal and spectral variations of speech signals, and can recognize and efficiently decode continuous speech input. However, HMMs require extensive training and substantial computational power for model‐parameter storage and likelihood evaluation (Burileanu, 2008).
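As a concrete illustration of Viterbi decoding over an HMM, here is a pure‐Python toy. The two‐state model and its probabilities are invented for the example; a real recognizer would use many states trained on acoustic data:

```python
def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for observation indices `obs`,
    given initial probs pi[i], transition probs A[i][j], and emission
    probs B[i][k]. Pure-Python toy version, no log-space arithmetic."""
    N = len(pi)
    # delta[j] = probability of the best path ending in state j so far
    delta = [pi[j] * B[j][obs[0]] for j in range(N)]
    backptr = []
    for o in obs[1:]:
        new_delta, ptrs = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            ptrs.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        delta, backptr = new_delta, backptr + [ptrs]
    # Trace the best path back from the best final state.
    state = max(range(N), key=lambda j: delta[j])
    path = [state]
    for ptrs in reversed(backptr):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# Toy 2-state HMM with three observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
print(viterbi([0, 1, 2], pi, A, B))  # prints [0, 0, 1]
```

The dynamic‐programming recurrence is what lets the HMM decode a whole continuous sequence efficiently rather than scoring every possible state path separately.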

      The main advantage of artificial neural networks (ANNs) lay in the classification of static patterns (including noisy acoustic data), which was particularly useful for recognizing isolated speech units. However, pure ANN‐based systems were not effective for continuous speech recognition, so ANNs were often integrated with HMMs in a hybrid approach (Torkkola, 1994).
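The hybrid idea can be hinted at in a few lines: a network scores each acoustic frame with per‐state posteriors, and a sequence model strings the frames together with a preference for staying in the same state. The "network" below is a hard‐coded stand‐in, and the HMM's transition model is reduced to a single self‐transition bonus for brevity; none of this is a real hybrid implementation.

```python
def toy_ann(frame):
    """Stand-in for a trained network: posteriors over states 's'/'z'."""
    return {"s": 0.9, "z": 0.1} if frame < 0.5 else {"s": 0.2, "z": 0.8}

def hybrid_decode(frames, stay_bonus=1.5):
    """Greedy pass: ANN posterior times a transition preference for
    staying in the current state (the sequence model's contribution,
    collapsed to one self-transition bonus for this sketch)."""
    path, prev = [], None
    for f in frames:
        post = toy_ann(f)
        scored = {s: p * (stay_bonus if s == prev else 1.0)
                  for s, p in post.items()}
        prev = max(scored, key=scored.get)
        path.append(prev)
    return path

print(hybrid_decode([0.1, 0.4, 0.55, 0.9]))  # prints ['s', 's', 'z', 'z']
```

The division of labor is the point: the network handles static per‐frame classification, where it is strongest, while the sequence model supplies the temporal structure that a pure ANN lacks.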

      The use of HMMs and ANNs in the 1980s led to considerable efforts toward constructing systems for large‐vocabulary continuous speech recognition. During this time, ASR was introduced in public telephone networks, and portable speech recognizers became commercially available. Commercialization continued in the 1990s, when ASR was integrated into a range of products, from PC‐based dictation systems to air traffic control training systems.

      The 2000s witnessed further progress in ASR, including the development of new algorithms and modeling techniques, advances in noisy speech recognition, and the integration of speech recognition into mobile technologies. Another recent trend is the development of emotion recognition systems that identify emotions and other paralinguistic content using cues such as voice tone and, in multimodal systems, facial expressions and gestures (Schuller, Batliner, Steidl, & Seppi, 2009; Anagnostopoulos, Iliou, & Giannoukos, 2015). However, one area that has truly revolutionized ASR in recent years
