12th December 2000 - Speech Recognition in Broadcasting

David Kirby, BBC Research and Development

David Kirby brought along a large pile of computers and switchboxes for his highly informative talk on Tuesday 12th December. He has been looking at speech recognition for the BBC for several years. He started his talk with an overview of speech recognition technology.

Increasing computing power and memory have enabled the use of more sophisticated speech and language models. This has improved the performance of recognition systems such that they are now suitable for widespread use. Obtaining good accuracy, however, becomes much harder as the size of the vocabulary or the number of voices increases.

Dictation systems are the most visible product. These can now recognise continuous speech, with a vocabulary of over 64k words. They do this, however, only for one user, who must train the system to their voice. Accuracy improves towards 98% as the system adapts in response to the user's correction of mistakes. These systems are sensitive to the type and position of the microphone and to the room acoustics.

At the other extreme are systems designed to recognise any voice, but only a small number of words. David played a recording of an experimental automated directory enquiries system developed by BT.

Large vocabulary, speaker-independent continuous speech recognition is still under research. The HTK and Abbot toolkits both consist of two stages: first the speech is converted into a stream of phoneme probabilities (a phoneme being the shortest sound element in speech) using a speech model; then the best word sequence is deduced using a language model. For competitions, these systems have been trained to recognise people reading from the Wall Street Journal, giving a 10-15% error rate.
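The report does not say how the error rates were measured. Assuming they are conventional word error rates (the word-level edit distance between the recogniser output and a reference transcript, divided by the number of reference words), a minimal sketch of the calculation might look like this. The function name and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
rate = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Note that, by this definition, insertions count as errors too, so a rate can in principle exceed 100%.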

Putting the studio parts of a Radio 4 news bulletin through Abbot gives a 25% error rate, as both the language and delivery differ from its training set. The system was therefore retrained for BBC English. The speech model used 45 hours of verbatim transcriptions from the newsroom. The language model consists of tables of probabilities of likely word sequences, which allow the system to use context to bias its decisions. This is produced by analysing millions of words of text; as well as scripts, the complete text of BBC online was used. Retraining improved the system's error rate to 5% for the same news extract.
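The "tables of probabilities of likely word sequences" can be illustrated with a toy bigram model. This is only a sketch under stated assumptions: the talk did not say what sequence length Abbot uses, the three-sentence corpus is invented, and a real model would be built from millions of words with smoothing for unseen pairs.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count adjacent word pairs and convert the counts to probabilities."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return {prev: {w: c / sum(following.values()) for w, c in following.items()}
            for prev, following in counts.items()}


def pick_word(table, prev, candidates):
    """Choose the candidate the table rates most likely after `prev`."""
    return max(candidates, key=lambda w: table.get(prev, {}).get(w, 0.0))


# Invented training text, for illustration only.
corpus = ["the prime minister said",
          "the prime suspect fled",
          "the prime minister left"]
table = train_bigrams(corpus)

# Two acoustically similar candidates; context biases the decision.
choice = pick_word(table, "prime", ["mister", "minister"])
```

Here the table has seen "prime minister" twice and "prime mister" never, so context tips the decision towards "minister" even if the two sound alike.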

General broadcast material is much harder: there are many speakers, sometimes overlapping; casual, unscripted speech does not follow grammatical rules; the acoustics vary; and there is background noise and music. David demonstrated these problems by showing how the error rate varied during a complete news programme. Whereas the performance was good in the studio, a report from a correspondent yielded a 50% error rate, and an interview with a policeman a 92% rate. The average was 25%.

David then addressed possible applications. The most obvious one is transcription. He showed that the news-trained model performed badly with a wildlife programme due to the more florid speech.

Archive indexing and searching is more promising. The THISL project automatically records and recognises 35 hours of news material each week. The database can be queried over the BBC intranet. A keyword search returns a list of stories in order of relevance, any of which can then be streamed to the user. Since most keywords are used many times in a relevant story, the high error rate is acceptable.
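The argument that repeated keywords make a high error rate tolerable can be put in rough numbers. The sketch below assumes, purely for illustration, that each occurrence of a keyword is misrecognised independently with the same probability; real errors are correlated (a name the recogniser does not know will be missed every time), so this is an optimistic estimate.

```python
def hit_probability(error_rate: float, occurrences: int) -> float:
    """Chance that at least one occurrence of a keyword is recognised,
    assuming independent errors with the same rate for each occurrence."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return 1.0 - error_rate ** occurrences


# A keyword spoken four times in a story recognised with a 50% error rate.
p = hit_probability(0.5, 4)
```

Under that assumption, even at a 50% error rate a keyword spoken four times would be found in roughly 94% of stories.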

He demonstrated a teleprompter system. Here the challenge is rather different: the system knows what is going to be said, as it has the script; it just needs to work out where the presenter has reached, and to recover if text is skipped or misread. In David's demonstration, it performed well. The system may be extended to use trigger words in the script for studio automation.
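How the demonstrated system tracks the presenter was not described. One simple way to picture the problem, offered only as a hypothetical sketch, is to slide the most recently recognised words along the script ahead of the current position and jump to wherever most of them agree. The function name, the `lookahead` parameter and the "more than half must match" rule are all assumptions.

```python
def track_position(script, position, heard, lookahead=30):
    """Return the updated script index after hearing the words in `heard`.

    script    : list of script words
    position  : index of the next word the presenter is expected to say
    heard     : a short run of newly recognised words
    lookahead : how far ahead to search, so skipped text can be recovered
    """
    n = len(heard)
    best_score, best_end = 0, position
    last_start = min(position + lookahead, len(script) - n)
    for start in range(position, last_start + 1):
        window = script[start:start + n]
        score = sum(1 for s, h in zip(window, heard) if s == h)
        if score > best_score:  # strict '>' prefers the earliest good match
            best_score, best_end = score, start + n
    # Move only if more than half the words agree; otherwise stay put.
    return best_end if best_score * 2 > n else position


script = "good evening here is the news from the bbc tonight the headlines".split()
after_misread = track_position(script, 3, ["is", "a", "news"])       # one word wrong
after_skip = track_position(script, 3, ["the", "bbc", "tonight"])    # text skipped
```

Tolerating a minority of mismatches covers misreadings and recognition errors, while the forward search covers skipped text.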

Although dictation systems are being used by Channel 4 to subtitle live sports (the subtitler listens to and re-voices the commentary), recognition systems are clearly not yet accurate enough for automated subtitling. However, of the 12-16 hours it takes to subtitle an hour of drama, most of the time is spent synchronising the text with the picture. The new "assisted subtitling" system automatically extracts the text from the script (difficult!), then uses the timecode for each word (produced by a speech recogniser) to time the subtitles correctly. Although the output is usually corrected by hand, the system reduces the time required by 30-50%.
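The timing step can be sketched as an alignment between the script words and the recogniser's timed output. The example below uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment method the real system uses, which the talk did not specify; the data are invented. Script words with no matching recognised word are left untimed, to be filled in by interpolation or by hand.

```python
from difflib import SequenceMatcher


def time_script(script_words, recognised):
    """Attach recogniser timecodes to script words.

    recognised : list of (word, start_seconds) pairs from the recogniser
    Returns a list of (script_word, start_seconds or None).
    """
    script = [w.lower() for w in script_words]
    heard = [w.lower() for w, _ in recognised]
    times = [None] * len(script_words)
    matcher = SequenceMatcher(a=script, b=heard, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block is a run of identical words in both sequences.
        for k in range(block.size):
            times[block.a + k] = recognised[block.b + k][1]
    return list(zip(script_words, times))


# Invented example: the recogniser mishears "were" as "wear".
script_words = ["Where", "were", "you", "last", "night"]
recognised = [("where", 1.0), ("wear", 1.4), ("you", 1.7),
              ("last", 2.0), ("night", 2.3)]
timed = time_script(script_words, recognised)
```

Because the subtitle text comes from the script rather than the recogniser, recognition errors only cost a timecode, not a wrong word on screen.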

A final intriguing possibility is the use of recognised text as a guide when editing audio. The BBC has developed a word processor in which the audio is roughly cut to match rearrangements of the text. It can also export an edit decision list so that, in an audio editor, the sound waveforms are labelled with the corresponding words.
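The idea of cutting audio by rearranging text can be sketched as follows. The format of the BBC's edit decision list was not described, so the output here (a plain list of start and end times) and all names are hypothetical. Given a start and end time for each recognised word, the edited word order is collapsed into the contiguous stretches of original audio that need to be played in sequence.

```python
def edit_decision_list(timed_words, kept_indices):
    """Turn an edited word order into (start, end) audio segments.

    timed_words  : list of (word, start_seconds, end_seconds)
    kept_indices : indices into timed_words, in the order of the edited text
    """
    segments = []
    run_start = previous = None
    for i in kept_indices:
        if previous is not None and i == previous + 1:
            previous = i  # still inside a contiguous run of original audio
            continue
        if previous is not None:
            # Close the run that has just ended.
            segments.append((timed_words[run_start][1], timed_words[previous][2]))
        run_start = previous = i
    if previous is not None:
        segments.append((timed_words[run_start][1], timed_words[previous][2]))
    return segments


# Invented timings; the edit moves "fox jumps" ahead of "the quick".
words = [("the", 0.0, 0.2), ("quick", 0.2, 0.6), ("brown", 0.6, 1.0),
         ("fox", 1.0, 1.4), ("jumps", 1.4, 1.9)]
edl = edit_decision_list(words, [3, 4, 0, 1])
```

The cuts fall on recognised word boundaries, which is consistent with the audio being only "roughly" cut and tidied afterwards in an audio editor.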

David concluded that although the error rates are still high, speech recognition systems are already good enough for some broadcast applications. They are enabling new ways of working, and improving all the time. The talk was followed by lively discussion touching on the problems with recognition of names in archives, the desirability of punctuation in recognised output, and the possibility of pre-treating the audio to improve recognition.

Paul Troughton