Speech recognition systems attempt to convert the acoustic waveform that
we hear into a corresponding sequence of words. To do this accurately,
they must cope with the huge variability in the way that human speech
is produced. In simple terms, everybody speaks differently, and even the
speech of the same speaker changes significantly according to context,
emphasis, mood and acoustic environment.
To handle all of this variability, speech recognition systems use a statistical
modelling approach. Words are assumed to be composed of basic sounds such
as "ee", "eh", "p", "t", etc., called phones. Each of these sounds is
represented by a hidden Markov model (HMM), a statistical model that
captures both the typical way a sound is produced and the range of
variation to expect. The parameters of these HMMs are estimated from
very large acoustic databases (typically 100+ hours of speech) that contain
examples of each sound spoken in many different contexts by many different
people.
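As an illustration, the following Python sketch shows one plausible way to
represent a single phone HMM: a small left-to-right chain of states, each
with a self-loop and a Gaussian output distribution over acoustic feature
vectors. The class name, number of states, feature dimension and probability
values are illustrative assumptions, not the structures of any particular
toolkit.

```python
import numpy as np

class PhoneHMM:
    """Minimal sketch of a left-to-right phone HMM with Gaussian outputs."""

    def __init__(self, n_states=3, feat_dim=13):
        self.n_states = n_states
        # Transition probabilities: each state either repeats (self-loop)
        # or moves on to the next state; the last state absorbs.
        self.trans = np.zeros((n_states, n_states))
        for s in range(n_states):
            if s + 1 < n_states:
                self.trans[s, s] = 0.6        # stay in the same state
                self.trans[s, s + 1] = 0.4    # advance to the next state
            else:
                self.trans[s, s] = 1.0        # final state
        # Each state emits feature vectors from a diagonal Gaussian whose
        # mean and variance would be estimated from training data.
        self.means = np.zeros((n_states, feat_dim))
        self.vars = np.ones((n_states, feat_dim))

    def log_output_prob(self, state, x):
        """Log-likelihood of one acoustic feature vector under one state."""
        m, v = self.means[state], self.vars[state]
        return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
```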
Given a new and unknown utterance, a properly trained set of
phone HMMs can be used to find the sequence of sounds that was most
likely to have generated it. The words spoken can then be found
by using a pronouncing dictionary to constrain this best-matching phone
sequence to correspond to actual words. Recognition can be further improved
by constraining the recognised word sequence to conform to a grammar.
These grammars are typically also statistical, based on the probabilities
of word sequences estimated from a large text corpus.
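As a rough sketch of how these knowledge sources combine, the toy Python
example below scores a single-word hypothesis by adding the acoustic
log-likelihoods of its dictionary pronunciation to a bigram language-model
log-probability, and then picks the highest-scoring word. The dictionary
entries, acoustic scores and bigram probabilities are made-up placeholders
chosen only to show the arithmetic, not values estimated from real data.

```python
import math

# Pronouncing dictionary: each word maps to a phone sequence.
dictionary = {
    "see": ["s", "ee"],
    "say": ["s", "ey"],
    "pea": ["p", "ee"],
}

# Acoustic log-likelihoods of each phone given the utterance, as would be
# produced by the trained phone HMMs (placeholder values).
acoustic_logprob = {"s": -2.0, "ee": -1.5, "ey": -3.0, "p": -4.0}

# Bigram language model: log P(word | previous word), here conditioned on
# the sentence-start symbol "<s>" (placeholder probabilities).
bigram_logprob = {
    ("<s>", "see"): math.log(0.4),
    ("<s>", "say"): math.log(0.5),
    ("<s>", "pea"): math.log(0.1),
}

def score(prev_word, word):
    """Combine acoustic and language-model scores for one word hypothesis."""
    acoustic = sum(acoustic_logprob[p] for p in dictionary[word])
    return acoustic + bigram_logprob[(prev_word, word)]

# Pick the word that best explains the utterance at the start of a sentence.
best = max(dictionary, key=lambda w: score("<s>", w))
print(best)
```

In a real recogniser this combination is carried out over whole word
sequences by a search procedure rather than by enumerating candidates one
word at a time, but the scores being combined are of the same form.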