Speech recognition systems attempt to convert the acoustic waveform that
we hear into a corresponding sequence of words. To do this accurately,
they must cope with the huge variability in the way that human speech
is produced. In simple terms, everybody speaks differently, and even the
speech of the same speaker changes significantly according to context,
emphasis, mood and acoustic environment.
To handle all of this variability, speech recognition systems use a statistical
modelling approach. Words are assumed to be composed of basic sounds such
as "ee", "eh", "p", "t", etc., called phones. Each of these sounds is
represented by a hidden Markov model (HMM), a statistical model that
captures both the typical way a sound is produced and the range of
variation to expect. The parameters of these HMMs are estimated from
very large acoustic databases (typically 100+ hours of speech) that contain
examples of each sound spoken in many different contexts by many different
people.
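As an illustration, the following Python sketch shows one plausible way to
represent a single phone HMM: a small left-to-right chain of states, each
with a self-loop and a Gaussian output distribution over acoustic feature
vectors. The class name, number of states, feature dimension and probability
values are illustrative assumptions, not the structures of any particular
toolkit.

```python
import numpy as np

class PhoneHMM:
    """Minimal sketch of a left-to-right phone HMM with Gaussian outputs."""

    def __init__(self, n_states=3, feat_dim=13):
        self.n_states = n_states
        # Transition probabilities: each state either repeats (self-loop)
        # or moves on to the next state; the last state absorbs.
        self.trans = np.zeros((n_states, n_states))
        for s in range(n_states):
            if s + 1 < n_states:
                self.trans[s, s] = 0.6        # stay in the same state
                self.trans[s, s + 1] = 0.4    # advance to the next state
            else:
                self.trans[s, s] = 1.0        # final state
        # Each state emits feature vectors from a diagonal Gaussian whose
        # mean and variance would be estimated from training data.
        self.means = np.zeros((n_states, feat_dim))
        self.vars = np.ones((n_states, feat_dim))

    def log_output_prob(self, state, x):
        """Log-likelihood of one acoustic feature vector under one state."""
        m, v = self.means[state], self.vars[state]
        return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
```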
Given a new and unknown utterance, a properly trained set of
phone HMMs can be used to find the sequence of sounds that was most
likely to have generated it. The words spoken can then be found
by using a pronouncing dictionary to constrain this best-matching phone
sequence to correspond to actual words. Recognition can be further improved
by constraining the recognised word sequence to conform to a grammar.
These grammars are typically also statistical, based on the probabilities
of word sequences estimated from a large text corpus.
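As a rough sketch of how these knowledge sources combine, the toy Python
example below scores a single-word hypothesis by adding the acoustic
log-likelihoods of its dictionary pronunciation to a bigram language-model
log-probability, and then picks the highest-scoring word. The dictionary
entries, acoustic scores and bigram probabilities are made-up placeholders
chosen only to show the arithmetic, not values estimated from real data.

```python
import math

# Pronouncing dictionary: each word maps to a phone sequence.
dictionary = {
    "see": ["s", "ee"],
    "say": ["s", "ey"],
    "pea": ["p", "ee"],
}

# Acoustic log-likelihoods of each phone given the utterance, as would be
# produced by the trained phone HMMs (placeholder values).
acoustic_logprob = {"s": -2.0, "ee": -1.5, "ey": -3.0, "p": -4.0}

# Bigram language model: log P(word | previous word), here conditioned on
# the sentence-start symbol "<s>" (placeholder probabilities).
bigram_logprob = {
    ("<s>", "see"): math.log(0.4),
    ("<s>", "say"): math.log(0.5),
    ("<s>", "pea"): math.log(0.1),
}

def score(prev_word, word):
    """Combine acoustic and language-model scores for one word hypothesis."""
    acoustic = sum(acoustic_logprob[p] for p in dictionary[word])
    return acoustic + bigram_logprob[(prev_word, word)]

# Pick the word that best explains the utterance at the start of a sentence.
best = max(dictionary, key=lambda w: score("<s>", w))
print(best)
```

In a real recogniser this combination is carried out over whole word
sequences by a search procedure rather than by enumerating candidates one
word at a time, but the scores being combined are of the same form.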