Speech Recognition

The US ARPA Speech Recognition Evaluations

Progress in speech recognition over the last decade has been monitored by a series of annual evaluations run by the US National Institute of Standards and Technology and the US Defense Advanced Research Projects Agency. In the early 1990s, the best recognisers in the world produced around 15% errors on a 20,000-word speaker-independent dictation task. In 1994, the HTK system developed at CUED recorded just 7% errors on a much harder unrestricted-vocabulary task, beating 14 competing systems, including ones from IBM, AT&T, Dragon, BBN Systems and various universities.

Although this result was still not perfect, speaker-specific enrolment was known to improve accuracy further. The HTK system therefore effectively demonstrated that desktop dictation was possible, and companies like Dragon and IBM subsequently converted the ideas demonstrated in these research systems into commercial reality. In the meantime, the CUED team continued to improve their system, tackling harder tasks such as dictation in noise and, most recently, transcription of broadcast news material. The latter is particularly difficult because the recogniser must cope with a sequence of unknown speakers, speaking over different channels with varying degrees of background noise, including music and sound effects. Despite their limited resources, the CUED team have stayed ahead of the competition. For example, in the 1997 broadcast news transcription evaluation, the HTK system had a word-error rate of 16%, which was the lowest recorded and better by a statistically significant margin than that of its nearest rival, IBM.
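
For readers unfamiliar with the measure, word-error rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recogniser output into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that conventional definition only. It is not the scoring software used in the evaluations, and the function name and the simple whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-error rate as word-level edit distance over reference length.

    Illustrative only: splits on whitespace and applies no text
    normalisation, which real evaluation scoring would need.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on the mat"))
```

On this definition, a word-error rate of 16% corresponds to roughly one error in every six reference words. Because insertions are counted, the rate can exceed 100% when the output is much longer than the reference.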