A Comparison of Automatic Speech Recognition (ASR) Systems

Back in March 2016 I wrote Semi-automated podcast transcription about my interest in finding ways to make archives of podcast content more accessible. Please read that post for details of my motivations and goals.

Some 11 months later, in February 2017, I wrote Comparing Transcriptions describing how I was exploring measuring transcription accuracy. That turned out to be trickier, and more interesting, than I’d expected. Please read that post for details of the methods I’m using and what the WER (word error rate) score means.

Here, after another over-long gap, I’m returning to post the current results, and start thinking about next steps. One cause of the delay has been that whenever I returned to the topic there had been significant changes in at least one of the results, most recently when Google announced their enhanced models. In the end the delay turned out to be helpful.

The Scores

The table below shows the results of my tests on many automated speech recognition services, ordered by WER score (lower is better). I’ll note a major caveat up front: I only used a single audio file for these tests, an almost two-hour interview in English between two North American males with no strong accents and good audio quality. I can’t be sure how the results would differ for female voices, more accented voices, lower audio quality, etc. I plan to retest the top-tier services with at least one other file in due course.

You can’t beat a human, at least not yet. All the human services scored between 4 and 6. I described them in my previous post, so I won’t dwell on them here.

| Service | WER | Punctuation (. / , / ? / names) | Timing | Other Features | Approx Cost (not bulk) |
|---|---|---|---|---|---|
| Human (3PlayMedia) | 4.5 | 1261/1470/76/1064 | | | $3/min |
| Human (Voicebase) | 4.6 | 1090/1626/57/1056 | | | $1.5/min |
| Human (Scribie) | 5.1 | 923/1450/49/1153 | | | $0.75/min |
| Human (Volunteer) | 5.3 | 840/1748/60/1208 | | | Goodwill |
| Google Speech-to-Text (video model, not enhanced) | 10.7 | 792/421/29/1238 | Words | C, A, V | $0.048/min |
| Otter AI | 11.50 | 786/1166/35/1030 | Pgfs | E, S | Free up to 600 mins/month |
| Spext | 11.81 | 813/369/30/1263 | Lines | E | $0.16/min |
| Go-Transcribe | 12.1 | 979/0/0/922 | Pgfs | E | $0.22/min |
| SimonSays | 12.2 | 941/0/0/893 | Lines | E, S | $0.17/min |
| Trint | 12.3 | 968/0/0/894 | Lines | E | $0.33/min |
| Speechmatics | 12.3 | 955/0/0/929 | Words | S, C | $0.08/min |
| Sonix | 12.3 | 943/0/0/900 | Lines | D, S, E | $0.083/min+$15/mon |
| Temi | 12.5 | 915/1329/51/862 | Pgfs | S, E | $0.10/min |
| TranscribeMe | 12.9 | 1203/0/63/836 | Lines | | $0.25/min |
| Scribie ASR | 12.9 | 970/1307/48/973 | None | E | Currently free |
| YouTube Captions | 15.0 | 0/0/0/1075 | Lines | S | Currently free |
| Voicebase | 16.6 | 116/0/0/1119 | Lines | E, V | $0.02/min |
| AWS Transcribe | 22.2 | 772/0/85/67 | Words | S, C, A, V | $0.02/min |
| Vocapia VoxSigma | 23.6 | 771/599/0/931 | Words | S, C | $0.02/min approx |
| IBM Watson | 25.2 | 11/0/0/896 | Words | C, A, V | $0.02/min |
| Dragon +vocabulary | 25.3 | 9/7/0/967 | None | | Free + €300 for app |
| Deepgram | 27.9 | 715/1262/52/443 | Pgfs | S, E | $0.0183 |
| SpokenData | 36.5 | 1457/0/0/680 | Words | S, E | $0.12/min |

WER: Word error rate (lower is better).
Punctuation: Number of sentences / commas / question marks / capital letters not at the start of a sentence (a rough proxy for proper nouns).
Timing: Approximate highest precision timing: Words typically means a data format like JSON or XML with timing information for each word, Lines typically means a subtitle format like SRT, Pgfs (paragraphs) means some lower precision.
Other Features: E = online editor, S = speaker identification (diarisation), A = suggested alternatives, C = confidence score, V = custom vocabulary (not used in these tests).
Approx Cost: base cost, before any bulk discount, in USD.
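For readers who haven’t seen the earlier post, here is a minimal sketch of how a WER score can be computed. It assumes the scores in the table are percentages, and it uses plain lowercasing and whitespace tokenisation, which is an assumption for illustration; the normalisation actually used for these tests is described in Comparing Transcriptions and may well differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a percentage: 100 * (S + D + I) / N.

    S, D and I are the substituted, deleted and inserted words needed to
    turn the reference into the hypothesis; N is the reference word count.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

With this definition, one wrong word in a six-word reference gives a WER of about 16.7, so a score of 12 means roughly one word in eight needs fixing.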

Note the clustering of WER scores. After the human services scoring from 4–6, the top-tier ASR services all score 10–16, with most around 12. The scores in the next tier are roughly double: 22–28. It seems likely that the top-tier systems are using more modern technology.

For my goals I prioritise these features:

Accuracy is a priority, naturally, so most systems in the top-tier would do.
A custom vocabulary would further improve accuracy.
Cost. Clearly $0.02/min is much more attractive than $0.33/min when there are hundreds of hours of archives to transcribe. (I’m ignoring bulk discounts for now.)
Word level timing enables accurate linking to audio segments and helps enable comparison/merging of transcripts from multiple sources (such as taking punctuation from one transcript and applying it to another).
Good punctuation reduces the manual review effort required to polish the automated transcript into something pleasantly readable. Recognition of questions would also help with topic segmentation.
Speaker identification would also help identify questions and enable multiple ‘timelines’ to help resolve transcripts where there’s cross-talk.
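To put the cost point in numbers, here is a small sketch that prices a whole archive at a few of the per-minute rates from the table. The 200-hour archive size is a made-up figure for illustration, and bulk discounts, free tiers and monthly fees are ignored.

```python
def archive_cost(hours: float, usd_per_minute: float) -> float:
    """Total cost in USD of transcribing `hours` of audio at a flat rate."""
    return hours * 60 * usd_per_minute


ARCHIVE_HOURS = 200  # hypothetical archive size

# Per-minute base rates taken from the table above.
RATES = {
    "Voicebase": 0.02,
    "Google Speech-to-Text (video model)": 0.048,
    "Trint": 0.33,
}

for service, rate in RATES.items():
    print(f"{service}: ${archive_cost(ARCHIVE_HOURS, rate):,.2f}")
```

At that size the gap between the cheapest and the most expensive of these three rates runs to thousands of dollars, which is why cost sits so high on the list.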

Before Google released their updated Speech-to-Text service in April there wasn’t a clear winner for me. Now there is. Their new video premium model is significantly better than anything else I’ve tested.

I also tested their enhanced models a few weeks after I initially posted this. They didn’t help for my test file. I also tried setting interactionType and industryNaicsCodeOfAudio in the recognition metadata of the video model but that made the WER slightly worse. Perhaps they will improve over time.
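For anyone wanting to repeat that experiment, the sketch below builds the kind of JSON request body involved. The field names follow the Speech-to-Text REST RecognitionConfig as I understand it, so check them against the current API reference before relying on them. The bucket URI, the NAICS code and the choice of DISCUSSION are placeholders, not the values used in these tests.

```python
import json


def build_request(gcs_uri: str, use_enhanced: bool = False,
                  with_metadata: bool = False) -> dict:
    """Build a request body for a long-running recognition call."""
    config = {
        "languageCode": "en-US",
        "model": "video",
        "useEnhanced": use_enhanced,
        "enableWordTimeOffsets": True,       # word-level timing
        "enableAutomaticPunctuation": True,
    }
    if with_metadata:
        config["metadata"] = {
            "interactionType": "DISCUSSION",     # placeholder choice
            "industryNaicsCodeOfAudio": 519130,  # placeholder NAICS code
        }
    return {"config": config, "audio": {"uri": gcs_uri}}


# Hypothetical Cloud Storage location of the audio file.
request = build_request("gs://example-bucket/interview.flac",
                        with_metadata=True)
print(json.dumps(request, indent=2))
```

Generating the body from one function makes it easy to run the same file with and without the metadata, or with and without the enhanced model, and compare the resulting WER scores.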

Punctuation is clearly subjective but both Temi and Scribie get much closer than Google to the number of question marks and commas used by the human transcribers. Google did very well on capital letters though (a rough proxy for proper nouns).
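The four punctuation counts in the table are simple to approximate. The sketch below is one way to do it, not necessarily how the published numbers were produced. In particular it assumes that the pronoun “I” should not count as a mid-sentence capital, and it counts every full stop, including any in abbreviations or decimals.

```python
import re

# Assumption: "I", "I'm", "I'd" and so on are not proper nouns.
PRONOUN_I = re.compile(r"I(['’]\w+)?$")


def punctuation_profile(text):
    """Return (full stops, commas, question marks, mid-sentence capitals)."""
    mid_caps = 0
    sentence_start = True
    for token in text.split():
        word = token.strip("\"'“”‘’()")   # drop surrounding quotes/brackets
        if not word:
            continue
        core = word.rstrip(".,?!;:")      # drop trailing punctuation
        if (core[:1].isupper() and not sentence_start
                and not PRONOUN_I.match(core)):
            mid_caps += 1
        # The next word starts a sentence if this one ends it.
        sentence_start = word.endswith((".", "?", "!"))
    return text.count("."), text.count(","), text.count("?"), mid_caps
```

Running this over each transcript gives a quick profile to set against the human transcripts, which is how a service with no commas or question marks at all stands out immediately.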

I think we’ll see a growing ecosystem of tools and services using the Google Speech-to-Text service as a backend. The Descript app is an interesting example.

Differential Analysis

While working on Comparing Transcriptions I’d realized that comparing transcripts from multiple services is a good way to find errors because they tend to make different mistakes.

So for this post I also compared most of the top-tier services against one another, i.e. using the transcript from one as the ‘ground truth’ for scoring others. A higher WER score in this test is good. It means the services are making different mistakes and those differences would highlight errors.

Google, Otter AI, Temi, Voicebase, Scribie, and TranscribeMe all scored a high WER, over 10, against all the others. Go-Transcribe vs Speechmatics had a WER of 6.1. SimonSays had a WER of 5.2 against Sonix, Trint, and Speechmatics. Trint, Sonix, and Speechmatics have very little difference between the transcripts, a WER of just 1.4. That suggests those three services are using very similar models and training data.
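The cross-comparison can be sketched as scoring every ordered pair of transcripts, with the first of each pair treated as the ‘ground truth’. The service names and text below are toy placeholders, and the WER function uses simple lowercased whitespace tokenisation as an assumption, not the normalisation used for the real scores.

```python
from itertools import permutations


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate (percent) via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution/match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def cross_wer(transcripts: dict) -> dict:
    """Score every (reference, hypothesis) pair of services."""
    return {(ref, hyp): wer(transcripts[ref], transcripts[hyp])
            for ref, hyp in permutations(transcripts, 2)}


# Toy transcripts from three hypothetical services.
transcripts = {
    "service_a": "we shipped the release on friday",
    "service_b": "we shipped the release on friday",
    "service_c": "we shift the release on fried day",
}
for pair, score in sorted(cross_wer(transcripts).items()):
    print(pair, round(score, 1))
```

A pair that scores near zero, like service_a and service_b here, adds little as a second opinion. A pair with a high score disagrees often, and each disagreement is a candidate error worth a look.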

What Next?

My primary goal is to get the transcripts available and searchable, so the next phase would be developing a simple process to transcribe each podcast and convert the result into web pages. That much seems straightforward using the Google Speech-to-Text API. Then there’s working with the podcast host to integrate with their website, style, menus etc.

After that the steps are more fuzzy. I’ll be crossing the river by feeling the stones…

The automated transcripts will naturally have errors that people notice (and more that they won’t). To improve the quality it’s important to make it very easy for them to contribute corrections. Being able to listen to the corresponding section of audio would be a great help. All that will require a web-based user interface backed by a service and a suitable data model.

The suggested corrections will need reviewing and merging. That will require its own low-friction workflow. I have a vague notion of using GitHub for this.

Generating transcripts from at least one other service would provide a way to highlight possible errors, in both words and punctuation. Those highlights would be useful for readers and also encourage the contribution of corrections. Otter AI, Speechmatics and Voicebase are attractive low-cost options for these extra transcriptions, as are any contributed by volunteers. This kind of multi-transcription functionality has significant implications for the data model.
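As a rough sketch of how a second transcript could drive those highlights, the standard library’s difflib can align two word sequences and report where they disagree. This is only an illustration of the idea. A real implementation would need to deal with timing data and much longer inputs, and the function name here is made up.

```python
from difflib import SequenceMatcher


def disagreements(primary: str, secondary: str) -> list:
    """List the spans where two transcripts differ.

    Each entry is (word index in primary, primary words, secondary words).
    Words are compared exactly, so differences in case and punctuation
    are flagged as well as different words.
    """
    a = primary.split()
    b = secondary.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((i1, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


primary = "we will cross the river by feeling the stones"
secondary = "we will cross the river by filling the stones"
for index, ours, theirs in disagreements(primary, secondary):
    print(f"word {index}: '{ours}' vs '{theirs}'")
```

Each reported span could be rendered as a highlight in the primary transcript, with the alternative wording offered as a suggested correction.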

I’d like to directly support translations of the transcriptions. The original transcription is a moving target as corrections are submitted over time, so the translations would need to track corrections applied to the original transcription since the translation was created. Translators are also very likely to notice errors in the original, especially if they’re working from the audio.

Before getting into any design or development work, beyond the basic transcriptions, I’d want to do another round of due-diligence research, looking for what services and open source projects might be useful components or form good foundations. Amara springs to mind. If you know of any existing projects or services that may be relevant please add a comment or let me know in some other way.

I’m not sure when, or even if, I’ll have any further updates on this hobby project. If you’re interested in helping out feel free to email me.

I hope you’ve found my rambling explorations interesting.

Updates:

25th May 2018: Updated SimonSays.ai with much improved score
10th June 2018: Updated notes about Google enhanced model (not helping WER score).
8th September 2018: Added Otter AI, prompted by a note in a blog post by Descript comparing ASR systems.
10th September 2018: Emphasised that I only used a single audio file for these tests. Noted that Otter.ai is free up to 600 mins/month.
14th September 2018: Added Spext.
