Many people are familiar with commercially available software that allows users to dictate directly into their computers. This type of system relies on a particular user training their computer to the idiosyncrasies of their voice, speaking words carefully, usually with limited background noise and a closely held microphone.
The goal of current research is to remove such limitations and produce a system that will recognise speech above background noise, allow for variability in the speaker's accent and handle irregularities caused by degraded speech heard over imperfect communication channels (such as crackly radios and phone lines). In fact, the goal is to produce a system that is as good as a human at decoding speech!
Researchers in speech recognition technology have an annual opportunity to compare the performance of their systems against those from other groups around the world by taking part in a standardised test run by the US National Institute of Standards and Technology and the US Defence Advanced Research Projects Agency. The current evaluation task is transcription of several hours of audio taken from television and radio news broadcasts. In the 1997 evaluation, the HTK system developed at CUED was ranked first overall. It had a lower word-error rate, by a statistically significant margin, than its competitors, which included major companies such as IBM, Dragon and Philips, as well as universities such as Carnegie Mellon. The error rate achieved by the CUED HTK system was around 16%, which is remarkably good for this type of task.
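For readers unfamiliar with the measure, word-error rate is usually defined as the number of word substitutions, deletions and insertions needed to turn the recogniser's output into the reference transcript, divided by the number of words in the reference. The short Python sketch below illustrates that usual definition only; the function name is made up for this example and it is not the scoring software used in the evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch of the usual definition; not the official scorer.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

On this definition, an error rate of around 16% means roughly one word in six would need to be corrected to recover the reference transcript.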
How they work
The speech-recognition systems currently being developed use a statistical modelling approach to estimate the most probable word sequence from the input speech. 'There are three sources of information that are used to model the speech signal', explains Phil Woodland. 'The first describes the way the audio signal varies when different sounds are produced. These acoustic models allow us to compute the probability that unknown audio corresponds to a particular sound, or phone. Secondly, we use a pronunciation dictionary that lists the phone sequences that make up allowable words and, finally, we use the probabilities of sequences of words, estimated by examining statistics gathered on a large text corpus. Given some input speech, we search for the most likely sequence of words based on these models. Of course, the key issue is how to construct the models!'
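To make the interplay of the three knowledge sources concrete, here is a toy Python sketch. Everything in it is an assumption made for illustration: the four-word dictionary, the phone symbols, the word-pair probabilities and the simplification that the audio has already been cut into one segment per phone. A real recogniser such as HTK works on far larger models and searches over 10 ms frames rather than ready-made segments.

```python
import math

# Hypothetical pronunciation dictionary: word -> phone sequence.
PRONUNCIATIONS = {
    "i": ["ay"],
    "ice": ["ay", "s"],
    "scream": ["s", "k", "r", "iy", "m"],
    "cream": ["k", "r", "iy", "m"],
}

# Hypothetical word-pair (bigram) log probabilities; "<s>" marks the start.
BIGRAM_LOGPROB = {
    ("<s>", "i"): math.log(0.6),
    ("<s>", "ice"): math.log(0.4),
    ("i", "scream"): math.log(0.1),
    ("ice", "cream"): math.log(0.5),
}
UNSEEN_LOGPROB = math.log(1e-4)  # assumed penalty for unseen word pairs


def decode(segments):
    """Return the most likely word sequence for the given phone segments.

    Each segment is a dict mapping phone -> acoustic log-likelihood,
    standing in for the output of the acoustic models.
    """
    n = len(segments)
    # best[(position, last_word)] = (log score, words so far)
    best = {(0, "<s>"): (0.0, [])}
    for pos in range(n):
        for (p, last), (score, words) in list(best.items()):
            if p != pos:
                continue
            for word, phones in PRONUNCIATIONS.items():
                end = pos + len(phones)
                if end > n:
                    continue
                acoustic = sum(
                    segments[pos + k].get(phone, -math.inf)
                    for k, phone in enumerate(phones)
                )
                if acoustic == -math.inf:
                    continue  # the word's phones do not match the audio
                language = BIGRAM_LOGPROB.get((last, word), UNSEEN_LOGPROB)
                total = score + acoustic + language
                key = (end, word)
                if key not in best or total > best[key][0]:
                    best[key] = (total, words + [word])
    finals = [entry for (p, _), entry in best.items() if p == n]
    if not finals:
        return None
    return max(finals, key=lambda entry: entry[0])[1]


def clean_segments(phones, likelihood=0.9):
    """Hypothetical helper: segments in which each phone was heard clearly."""
    return [{phone: math.log(likelihood)} for phone in phones]
```

Calling `decode(clean_segments(["ay", "s", "k", "r", "iy", "m"]))` considers both "i scream" and "ice cream". The two readings match the audio equally well, so with the made-up numbers above the word-sequence statistics settle the matter in favour of "ice cream".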
The HTK system, like most modern speech recognisers, uses simple statistical networks, known as Hidden Markov Models (HMMs), to assign probabilities to sequences of spectra, each corresponding to about 10 ms of speech. The HMM parameters are trained on large quantities of transcribed audio data and, as more data is fed into the system, the HMMs can be further tuned. To provide high accuracy, the system uses separate HMMs for a sound in different contexts. An important feature of this approach is that it is capable of automatic learning: given some transcribed audio data, a pronunciation dictionary and a text corpus, it is possible to build a speech-recognition system based on the same technology for many different languages.
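How an HMM assigns a probability to a sequence of observations can be sketched with the standard forward algorithm. The Python example below is a deliberately simplified illustration, not the HTK implementation: it assumes each 10 ms frame has already been reduced to a discrete symbol, whereas real systems model continuous spectral measurements, and the function and parameter names are invented for this sketch.

```python
def forward_probability(initial, transition, emission, observations):
    """Probability that a discrete-output HMM generates `observations`.

    initial[s]         probability of starting in state s
    transition[p][s]   probability of moving from state p to state s
    emission[s][o]     probability that state s emits symbol o
    """
    if not observations:
        raise ValueError("need at least one observation")
    n_states = len(initial)

    # alpha[s] = probability of the frames seen so far, ending in state s
    alpha = [
        initial[s] * emission[s].get(observations[0], 0.0)
        for s in range(n_states)
    ]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s].get(obs, 0.0)
            for s in range(n_states)
        ]
    return sum(alpha)
```

For instance, with two states, `initial = [1.0, 0.0]`, `transition = [[0.5, 0.5], [0.0, 1.0]]` and `emission = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]`, the sequence `["a", "b"]` receives probability 0.405. Training adjusts the transition and emission values so that sequences resembling the transcribed audio score highly under the HMM for the right sound.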
For further information, please contact Phil Woodland, tel: 01223 332669.
|number 7, December '98|