24. May 2013

Benchmarking Google's Speech Recognition Web Service

Curious: how far can we push Google’s speech recognition web service, the one used for speech input in Chrome 25+? To find out, I measured the word accuracy of the service on the TSP speech dataset, which consists of recordings of the Harvard sentences.

In previous articles, I presented a Python script (thanks to a KTH student for his help here) that sends audio recordings directly to Google’s speech recognition web service and returns the transcriptions. This script lays the foundation for a systematic performance evaluation.
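The core of such a script can be sketched in a few lines. Note that the endpoint URL, the `audio/x-flac` content type, and the JSON reply shape (`hypotheses` / `utterance`) below reflect the unofficial interface Chrome used around 2013 and are assumptions here; the service was never officially documented and has since been retired:

```python
import json
import urllib.request

# Unofficial endpoint as used by Chrome circa 2013 (assumption; no longer live).
API_URL = "https://www.google.com/speech-api/v1/recognize?client=chromium&lang=en-US"

def recognize(flac_path, sample_rate=16000):
    """Send a FLAC recording to the web service and return the top transcript."""
    with open(flac_path, "rb") as f:
        audio = f.read()
    req = urllib.request.Request(
        API_URL,
        data=audio,
        headers={"Content-Type": "audio/x-flac; rate=%d" % sample_rate},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_response(resp.read().decode("utf-8"))

def parse_response(body):
    """Extract the best hypothesis from the service's JSON reply, or None."""
    reply = json.loads(body)
    hypotheses = reply.get("hypotheses", [])
    return hypotheses[0]["utterance"] if hypotheses else None
```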

The Harvard sentences consist of 72 lists with 10 sentences each, making a total of 720 test sentences. I consider the sentences rather difficult, especially because they appear without much context. To my great delight, the TSP speech database of more than 1440 audio recordings of these sentences is freely available on the web, published by the Telecommunications and Signal Processing Laboratory at McGill. The audio recordings have a strong signal and little noise. The test speakers (about half male and half female) learned the sentences by heart and spoke them at natural speed.

Rather than merely counting how many test sentences were transcribed correctly, I computed the word accuracy of the transcriptions in the same way the Hidden Markov Model Toolkit does. Here the transcription and the test sentence are aligned by minimum Levenshtein distance; the optimal alignment of a test sentence with N words against the transcription yields some number of insertions (I), deletions (D), and substitutions (S) in the test sentence. The word accuracy is then:

word accuracy = (N − S − D − I) / N
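This alignment-based measure can be sketched as a small Python function: compute the word-level Levenshtein alignment with dynamic programming, backtrack to count substitutions, deletions, and insertions, and plug the counts into the accuracy formula.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy (N - S - D - I) / N via minimum Levenshtein alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrack through the table to count S, D, and I.
    i, j = n, m
    S = D = I = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                S += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return (n - S - D - I) / n
```

Note that insertions count against the score, so the accuracy can even be negative for a very wordy wrong transcription.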

All sentences in the TSP speech database were sent to Google with a pause of one minute between consecutive sentences in order to spread the traffic over the day. The precaution might have been unnecessary, but well, don’t be evil.
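The pacing itself is a one-liner around the request loop; a minimal, testable sketch (the `sleep` parameter is only there so the delay can be faked in tests):

```python
import time

def throttled(items, pause=60.0, sleep=time.sleep):
    """Yield items one by one, sleeping `pause` seconds between consecutive ones."""
    for k, item in enumerate(items):
        if k:  # no pause before the first item
            sleep(pause)
        yield item

# Usage sketch: for path in throttled(flac_paths): recognize(path)
```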

Here are the results, averaged over all 1444 transcriptions of the audio recordings. It seems to make no difference whether the speakers are male or female:

| Name   | Sentences correct | Word accuracy |
|--------|-------------------|---------------|
| all    | 21%               | 73%           |
| male   | 21%               | 74%           |
| female | 20%               | 72%           |

In summary, I was positively surprised by the results. While most of the transcriptions contained at least one mistake, the word accuracy was quite good given the out-of-context Harvard sentences. It is a good idea, though, not to look at the numbers alone but also to run a few tests of your own on the Web Speech API demo page. You can also have a look at the full report of the benchmark.

Edit: There is another evaluation of Google’s speech recognizer in various languages on the web, showing qualitative results for example sentences spoken by a single speaker in many languages.