Speech Recognition

What does speech look like?

Speech consists of various types of sound, primarily vowels, fricatives, nasals and stops. Vowels are produced by forcing air from the lungs through the vocal cords, which causes them to vibrate. This acoustic energy then travels through the variable-sized tube formed by the jaw, tongue and lips, where some frequencies are amplified, much as notes are formed in a church organ pipe. These amplified frequencies, called formants, determine which sound we hear. Nasals are similar to vowels except that the sound is forced through the nose rather than the mouth, which attenuates the formants. Fricatives also rely on air from the lungs, but the acoustic energy is created by forcing the air through a constriction in the mouth rather than through the vocal cords. This results in high-frequency energy with little formant structure. Finally, stops are formed by stopping the airflow altogether.
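
The organ pipe comparison can be made concrete with a rough sketch. It assumes, purely for illustration, that the vocal tract is a uniform tube closed at the vocal cords and open at the lips, so that it resonates at odd multiples of a quarter wavelength. The function name `tube_resonances`, the 17 cm tube length and the speed of sound used here are assumptions, not values taken from the text above; a real vocal tract changes shape continuously, which is what moves the formants from one vowel to the next.

```python
def tube_resonances(length_m, n_formants=3, speed_of_sound=343.0):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Quarter-wavelength resonator model: F_k = (2k - 1) * c / (4 * L).
    This is an idealisation; it ignores the changing shape of a real vocal tract.
    """
    return [
        (2 * k - 1) * speed_of_sound / (4.0 * length_m)
        for k in range(1, n_formants + 1)
    ]


# Assumed tube length of 17 cm, a commonly quoted adult vocal tract length.
formants = tube_resonances(0.17)
for k, f in enumerate(formants, start=1):
    print(f"F{k} is roughly {f:.0f} Hz")
```

Under these assumptions the first three resonances fall near 500, 1500 and 2500 Hz. Shortening or lengthening the tube shifts all of them together, whereas moving the jaw, tongue and lips shifts them individually.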

Speech can be visualised by plotting the energy at each frequency as a function of time; this plot is called a spectrogram. In the picture below, high energy is represented by 'hot' red colours and low energy by 'cold' blue colours. In principle, a machine can be made to recognise speech by examining spectrograms and identifying the various sounds. In practice this is very hard, because the way each sound is produced varies enormously with its acoustic neighbours, the speaker, the speaker's emotional and physiological state, background noise, and so on.
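
As a minimal sketch of how such a plot can be computed, the code below cuts a signal into short overlapping frames, applies a window to each frame, and takes the Fourier transform to obtain the energy at each frequency. The function name `spectrogram`, the 25 ms frame length, the 10 ms hop and the synthetic two-tone test signal are all assumptions chosen for illustration; they are not the settings used to produce the picture.

```python
import numpy as np


def spectrogram(signal, sample_rate, frame_len=400, hop=160):
    """Return (times, freqs, power_db) for a 1-D signal.

    Each row of power_db holds the energy (in dB) at every frequency
    for one short frame of the signal.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Energy at each frequency; the small constant avoids log(0).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    power_db = 10.0 * np.log10(power + 1e-10)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return times, freqs, power_db


# Synthetic stand-in for speech: a 400 Hz tone followed by a 2000 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.where(
    t < 0.5, np.sin(2 * np.pi * 400 * t), np.sin(2 * np.pi * 2000 * t)
)

times, freqs, power_db = spectrogram(signal, sample_rate)
first_peak = freqs[np.argmax(power_db[0])]
last_peak = freqs[np.argmax(power_db[-1])]
print(power_db.shape, first_peak, last_peak)
```

Drawing `power_db` as an image, with time on one axis and frequency on the other, gives the kind of picture described above. Real speech is far less tidy than this two-tone signal, which is exactly the variability that makes recognition difficult.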