What does speech look like?
Speech consists of various types of sounds: primarily vowels, fricatives,
nasals and stops. Vowels are produced by forcing air from our lungs through
our vocal cords, which causes them to vibrate. This acoustic energy then
travels through the variable-sized tube formed by our jaw, tongue, and
lips, where some frequencies are amplified by resonance, just as notes are
formed in a church organ pipe. These amplified frequencies, called formants,
determine which vowel we hear. Nasals are similar to vowels except that
the sound is forced through our noses rather than our mouths, which
attenuates the formants. Fricatives are generated in a similar way except
that the acoustic energy is created by forcing air through a constriction
in our mouth rather than through our vocal cords. This results in
high-frequency energy with little formant structure. Finally, stops are formed
by briefly stopping the airflow altogether.
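
The vowel mechanism described above (a vibrating source shaped by a resonant tube) can be sketched in a few lines of Python. This is only a rough illustration, not a model of real speech: the function names, the 80 Hz bandwidth, the 100 Hz pitch, and the formant values of 700 Hz and 1100 Hz are all assumptions chosen for the example. A train of pulses stands in for the vocal cords, and each two-pole resonator stands in for one formant of the vocal tract.

```python
import math

def pulse_train(pitch_hz, duration_s, sample_rate):
    """Crude stand-in for vocal cord vibration: one pulse per pitch period."""
    period = int(round(sample_rate / pitch_hz))
    n_samples = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole filter that amplifies energy near freq_hz (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius, < 1
    theta = 2.0 * math.pi * freq_hz / sample_rate         # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synth_vowel(formants_hz, pitch_hz=100.0, duration_s=0.2, sample_rate=8000):
    """Pass the pulse train through one resonator per formant (assumed values)."""
    signal = pulse_train(pitch_hz, duration_s, sample_rate)
    for f in formants_hz:
        signal = resonator(signal, f, 80.0, sample_rate)  # 80 Hz bandwidth is illustrative
    return signal

def energy_at(signal, freq_hz, sample_rate):
    """Energy of the signal at a single frequency (one DFT-style projection)."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(x * math.cos(w * i) for i, x in enumerate(signal))
    im = sum(x * math.sin(w * i) for i, x in enumerate(signal))
    return re * re + im * im

vowel = synth_vowel([700.0, 1100.0])  # illustrative formant frequencies
near_formant = energy_at(vowel, 700.0, 8000)
far_from_formant = energy_at(vowel, 2500.0, 8000)
```

Comparing `near_formant` with `far_from_formant` shows the effect the text describes: the source contains energy at every multiple of the pitch, but the tube passes on far more of it near the formant frequencies.
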
Speech can be visualised by plotting the energy at each frequency as
a function of time; such a plot is called a spectrogram. In the picture below,
high energy is represented by the 'hot' red colours and low energy by
the 'cold' blue colours. In principle, a machine can be made to recognise
speech by examining spectrograms and identifying the various sounds. In
practice, this is very hard because the way each sound is produced varies
hugely with its acoustic neighbours, the speaker and their emotional and
physiological state, background noise, and so on.
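
As a rough sketch of how the energy at each frequency can be computed as a function of time, the Python below cuts a signal into short overlapping frames, applies a Hann window to each, and takes the squared magnitude of a discrete Fourier transform. The frame length, hop size, test tones, and function names are assumptions made for illustration; a practical system would use an FFT library rather than this slow direct transform.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Return one list of energies per frame, for bins 0 .. frame_len // 2."""
    # Hann window tapers each frame to reduce leakage between frequency bins.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        energies = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames

def tone(freq_hz, n_samples, sample_rate):
    """Pure sine wave, used here as a stand-in for a real recording."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

def peak_bin(frame):
    """Index of the frequency bin holding the most energy in one frame."""
    return max(range(len(frame)), key=lambda k: frame[k])

sample_rate = 8000
# Test signal: a 500 Hz tone followed by a 2000 Hz tone.
signal = tone(500.0, 512, sample_rate) + tone(2000.0, 512, sample_rate)
spec = spectrogram(signal)
bin_width_hz = sample_rate / 64  # each bin covers 125 Hz at these settings
```

Each row of `spec` is one column of the picture: mapping large values to red and small values to blue gives the kind of plot described above. For this test signal, `peak_bin` moves from the bin at 500 Hz in the early frames to the bin at 2000 Hz in the late ones.
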