Benchmarking Google's Speech Recognition Web Service

How far can we push Google’s speech recognition web service, the one used for speech input in Chrome 25 and later? To get a picture, I measured the word accuracy of the service on the TSP speech dataset, which consists of recordings of the Harvard sentences.

In previous articles, I presented a Python script (thanks to a KTH student for his help here) that sends audio recordings directly to Google’s speech recognition web service and returns the transcriptions. This script lays the foundation for a systematic performance evaluation.

The Harvard sentences consist of 72 lists with 10 sentences each, making a total of 720 test sentences. I consider the sentences rather difficult, especially because they appear without much context. To my great delight, there is a TSP speech database of more than 1440 audio recordings of these sentences freely available on the web, published by the Telecommunications and Signal Processing Laboratory at McGill. The audio recordings have a strong signal and little noise. The test speakers (about half male and half female) learned the sentences by heart and spoke them at a natural pace.

Rather than just counting how many test sentences were transcribed correctly, I decided to compute the word accuracy of the transcriptions in the same way as the Hidden Markov Model Toolkit (HTK) does. The transcription and the test sentence are aligned by minimum Levenshtein distance; the optimal alignment of a test sentence with N words and its transcription is described by some number of insertions (I), deletions (D), and substitutions (S). The word accuracy is then:

$$\text{word accuracy} = \frac{N - D - S - I}{N} \times 100\%$$
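The alignment and the word accuracy described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the benchmark: the function names `tokenize`, `align_counts`, and `word_accuracy` are made up for this example, the tokenization (lowercasing, dropping punctuation) is an assumption, and every edit is given a cost of one, which may not match the alignment penalties that HTK applies internally.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into words (assumed normalization)."""
    return re.findall(r"[a-z']+", sentence.lower())


def align_counts(ref, hyp):
    """Return (S, D, I) for a minimum edit distance alignment of two word lists.

    Assumption: substitutions, deletions, and insertions all cost 1.
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (edits, S, D, I) for aligning ref[:i] with hyp[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # words match, no edit
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, subs, dels, ins = table[-1][-1]
    return subs, dels, ins


def word_accuracy(reference, transcription):
    """Word accuracy (N - D - S - I) / N as a fraction; can be negative."""
    ref, hyp = tokenize(reference), tokenize(transcription)
    if not ref:
        raise ValueError("reference sentence contains no words")
    subs, dels, ins = align_counts(ref, hyp)
    return (len(ref) - dels - subs - ins) / len(ref)


if __name__ == "__main__":
    reference = "The birch canoe slid on the smooth planks."
    print(word_accuracy(reference, "the birch canoe slid on smooth planks"))
```

In the usage example, the transcription drops one of the eight reference words, so there is a single deletion and no other edit. Because insertions are subtracted as well, a transcription with many spurious words can push the value below zero.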

All recordings in the TSP speech database were sent to Google with a pause of one minute between consecutive requests in order to spread the traffic over a day. The precaution might have been unnecessary, but well, don’t be evil.
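The pacing could look roughly like the sketch below. It is only an illustration under assumptions: `transcribe_all` is a hypothetical helper, and `transcribe` stands for whatever function sends one recording to the web service and returns its transcription (it is passed in as a parameter, not an existing API). With roughly 1444 recordings and a 60-second pause, the whole run takes about 24 hours.

```python
import time


def transcribe_all(paths, transcribe, pause_seconds=60.0, sleep=time.sleep):
    """Transcribe recordings one by one, pausing between consecutive requests.

    `transcribe` is a caller-supplied function (hypothetical here) that takes
    a file path and returns the transcription as a string. `sleep` can be
    swapped out, for example to skip the waiting in a dry run.
    """
    results = {}
    for index, path in enumerate(paths):
        if index > 0:
            sleep(pause_seconds)  # wait before every request except the first
        results[path] = transcribe(path)
    return results


if __name__ == "__main__":
    # Dry run with a stand-in transcriber and no real waiting.
    fake = lambda path: "transcription of " + path
    print(transcribe_all(["a.wav", "b.wav"], fake, sleep=lambda seconds: None))
```

Passing the sleep function as a parameter keeps the loop easy to try out without waiting for a whole day.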

Here are the results, averaged over all 1444 transcriptions of the audio recordings. Whether the speaker is male or female appears to make little difference:

Speakers	Sentences correct	Word accuracy
all	21%	73%
male	21%	74%
female	20%	72%

In summary, I was positively surprised by the results. While most of the transcriptions contained at least one mistake, the word accuracy was quite good given the out-of-context Harvard sentences. It is a good idea, though, not to look at the numbers alone but also to run a few tests of your own on the Web Speech API demo page. You can also have a look at the full report of the benchmark.

Edit: There is another evaluation of Google’s speech recognizer in various languages on the web, showing qualitative results for example sentences spoken by a single speaker in many languages.


a blog by Julius Adorf
