
How to Use ASR System for Accurate Transcription Properties of Your Digital Product

@zilunpeng (Georgian.io)


Thanks to advances in speech recognition, companies can now build a whole range of products with accurate transcription capabilities at their heart. Conversation intelligence platforms, personal assistants and video and audio editing tools, for example, all rely on speech-to-text transcription. However, you often need to train these systems for every domain you want to transcribe, using supervised data. In practice, just to get started in a new domain, you need a large body of transcribed audio that's similar to what you want to transcribe.


Recently, Facebook released wav2vec 2.0 which goes some way towards addressing this challenge. wav2vec 2.0 allows you to pre-train transcription systems using audio only — with no corresponding transcription — and then use just a tiny transcribed dataset for training.


In this blog, we share how we worked with wav2vec 2.0, with great results.


What is an end-to-end automatic speech recognition system?

Before we dive into wav2vec 2.0, let’s take a few steps back to cover a couple of key terms you’ll need to understand to see what makes wav2vec 2.0 so special. First, let’s look at end-to-end automatic speech recognition systems.


An end-to-end automatic speech recognition (ASR) system takes a speech audio waveform and outputs the corresponding text. Traditionally, these systems use Hidden Markov Models (HMMs), which model the speech audio as a stochastic process. In recent years, deep learning ASR systems have become popular thanks to increased computing power and larger amounts of training data.


You can measure an ASR system’s performance with a word error rate (WER) metric. WER reflects the number of corrections needed to convert the ASR output into the ground truth. Generally, a lower WER means a better quality ASR system.


This figure is adapted from https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-evaluate-data


The example above shows how to calculate the WER. We can see that the ASR has made a few errors. It has inserted an “a”, identified “John” as “Jones” and deleted the word “are” from the ground truth.


To calculate WER, we can use this formula: (D+I+S)/N. D is the number of deletions, I is the number of insertions, S is the number of substitutions and N is the number of words in the ground truth. In this example, the ASR output made 3 mistakes in total from 5 words in the ground truth. In this case, the WER would be 3 / 5 = 0.6.
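
To make the arithmetic concrete, here is a minimal Python sketch of the WER calculation. The two sentences at the bottom are hypothetical stand-ins chosen to reproduce the error pattern described above (one deletion, one insertion and one substitution against five ground-truth words).

# Word-level edit distance (deletions, insertions, substitutions) via dynamic programming.
def wer(ground_truth: str, hypothesis: str) -> float:
    ref, hyp = ground_truth.split(), hypothesis.split()
    # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("are you going with John", "you going a with Jones"))  # 3 errors / 5 words = 0.6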


The LibriSpeech dataset

Next, we’ll briefly touch on the LibriSpeech dataset. The LibriSpeech dataset is the most commonly used audio processing dataset in speech research. It was created by Vassil Panayotov and Daniel Povey in 2015 [3]. LibriSpeech consists of 960 hours of labelled speech data and is the standard benchmark for training and evaluating ASR systems.


The dev-clean dataset from LibriSpeech contains 5.4 hours of “clean” speech data. It’s generally used as a validation dataset. In the figure below, we show the transcription for one audio sample in the dev-clean dataset.


The transcription for one audio sample in the dev-clean dataset


What is wav2vec 2.0?

Now that we understand what an ASR system and the LibriSpeech dataset are, we’re ready to take a closer look at wav2vec 2.0.


What’s different about wav2vec 2.0?


ASR systems come in two flavors:

  • The first are hybrid systems, such as Kaldi [7], that train a deep acoustic model to predict phonemes from audio processed into Mel Frequency Cepstral Coefficients (MFCCs), combine the phonemes using a pronunciation dictionary and finally pick the most likely results using a language model (either count-based or RNN-based).
  • The second are end-to-end systems that use a deep neural network to predict words directly from the audio or MFCCs. Systems like RNN-T [6] or wav2vec [1, 4] require far more training data and GPU resources to train.

Due to the massive data requirements of end-to-end systems, only the biggest companies have used them to date. The data requirements also make it hard to train models for new domains (even in the same language) and for new languages or accents. With a hybrid system, it is much easier to create a model for a new domain using minimal training data and a pronunciation dictionary extended with words from that domain.


The promise of wav2vec 2.0 is pre-training without supervised data, using a large dataset of recordings in the target domain. Afterwards, the model can be fine-tuned with a supervised approach to maximize accuracy. Wav2vec 2.0 shows that it's possible to achieve a low WER on the LibriSpeech validation datasets using only ten minutes of labelled audio data. Another option is to take a pre-trained model (such as the LibriSpeech model) and fine-tune it for your domain with a few hours of labelled audio.


The architecture of wav2vec 2.0


The breakthrough in wav2vec 2.0 is its adoption of the masked pre-training method used by the large language model BERT [8]. BERT masks a few words in each training sentence, and the model trains by attempting to fill in the gaps.


Instead of masking words, wav2vec 2.0 masks a part of the audio representation and requires the transformer network to fill in the gap.


The figure below shows the wav2vec 2.0 architecture with its two major components: CNN layers and transformer layers.


image credit: https://arxiv.org/pdf/2006.11477.pdf


Self-supervised learning


So how does self-supervised learning work in wav2vec 2.0? The raw audio waveform (X in the figure above) first passes through CNN layers, and we get latent speech representations (Z in the figure above). Now, two things happen in parallel:

  1. We mask a random subset of Z, let’s call it masked_Z. We pass masked_Z into transformer layers. The output of the transformer layers is called context representations (C in the figure above).
  2. We apply product quantization [5] on Z and get quantized representations (Q in the figure above).

We expect C to be close to Q over the masked parts. The “error” between C and Q over the masked parts is called the contrastive loss. Minimizing contrastive loss enables transformer layers to learn the structure inside latent speech representations (Z).
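
For intuition, here is a simplified sketch of the contrastive objective at a single masked time step. The real wav2vec 2.0 loss compares the context vector against the true quantized target plus a set of sampled distractors using cosine similarity and a temperature; the shapes, names and dimensions below are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    # c_t: context representation at one masked step, shape (dim,)
    # q_t: the true quantized target for that step, shape (dim,)
    # distractors: quantized vectors sampled from other masked steps, shape (K, dim)
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)    # (K + 1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)  # (K + 1,)
    logits = (sims / temperature).unsqueeze(0)                        # (1, K + 1)
    # the true target sits at index 0, so we minimize cross entropy against label 0
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))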


Where does wav2vec 2.0 fit in the big picture?

In the figure above, we saw that context representations were the output of transformer layers. Wav2vec 2.0 passes these context representations into a linear layer, followed by a softmax operation. The final output contains probability distributions over 32 tokens. A token can be a character, or it can represent word and sentence boundaries, as well as unknowns.
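
As an illustration of that last step, the sketch below projects a batch of context representations onto the 32 tokens and applies a softmax per time step. The shapes are assumptions (768 is the transformer width of the BASE model).

import torch

proj = torch.nn.Linear(768, 32)        # linear layer from context representations to the 32 tokens
context = torch.randn(1, 100, 768)     # (batch, time steps, transformer width), illustrative only
logits = proj(context)                 # (1, 100, 32)
probs = torch.softmax(logits, dim=-1)  # one probability distribution over tokens per time step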


How do we convert these probability distributions into text? The answer is a decoder! The authors of wav2vec 2.0 used a beam search decoder. Below, we show you how to use a Viterbi decoder to convert the output of wav2vec 2.0 into text.
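
Before getting to that, it may help to see the simplest possible alternative: a greedy (argmax) decode that picks the most likely token at each time step, collapses consecutive repeats and drops padding. This is only a conceptual sketch, not the decoder used later in this post, and the token conventions (a <pad> token, "|" as the word boundary) are assumptions.

def greedy_decode(probs, symbols, pad_token="<pad>", word_sep="|"):
    # probs: torch tensor of shape (time steps, num tokens); symbols: list mapping token id -> string
    ids = probs.argmax(dim=-1).tolist()                                      # best token per time step
    collapsed = [i for i, prev in zip(ids, [None] + ids[:-1]) if i != prev]  # merge consecutive repeats
    chars = [symbols[i] for i in collapsed if symbols[i] != pad_token]       # drop padding tokens
    return "".join(chars).replace(word_sep, " ").strip()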


Similarity with word2vec


Word2vec [2] generates a feature vector for a given word, such that feature vectors of similar words have higher cosine similarity. Similar to word2vec, we can think of the wav2vec 2.0 output as a feature vector for an audio segment.
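
As a quick refresher, cosine similarity between two feature vectors can be computed like this (the 300-dimensional random vectors are placeholders):

import torch
import torch.nn.functional as F

v1, v2 = torch.randn(300), torch.randn(300)  # placeholder embeddings
similarity = F.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0)).item()
print(similarity)  # close to 1.0 for similar vectors, close to -1.0 for opposite ones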


Using Python and PyTorch to build an end-to-end speech recognition system with wav2vec 2.0

Now, let’s look at how to create a working ASR system with wav2vec 2.0 that generates text from audio waveforms in the LibriSpeech dataset. We use Python and the PyTorch framework in the code snippets below.


First, download the wav2vec 2.0 model and the dev-clean dataset from LibriSpeech. The dev-clean dataset contains 5.4 hours of “clean” speech data, and it’s generally used as a validation dataset.

model_path = "/home/models/wav2vec_big_960h.pt"
data_path = "/home/datasets/"

In the code above, we declare model_path, which is the path to the wav2vec 2.0 model that we just downloaded. data_path is the path to the dev-clean dataset; store it under "/home/datasets/".

We mentioned earlier that wav2vec 2.0 outputs a probability distribution over 32 tokens. We convert these tokens to letters with the help of ltr_dict.txt. Download ltr_dict.txt from here and save it at /home/ltr_dict.txt.


You might notice that ltr_dict.txt contains only 28 of these tokens. The remaining four tokens are <s>, <pad>, </s> and <unk>, and they are added when we call fairseq_mod.data.Dictionary.load() with the path to ltr_dict.txt.

target_dict = fairseq_mod.data.Dictionary.load('ltr_dict.txt')
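
As a quick sanity check (assuming the fairseq-style Dictionary API is unchanged in fairseq_mod), the four special tokens plus the 28 entries from ltr_dict.txt should give 32 tokens in total:

print(len(target_dict))                      # expected: 32
print(target_dict.pad(), target_dict.unk())  # indices of the <pad> and <unk> tokens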

Now, create the wav2vec 2.0 model.

w2v = torch.load(model_path)
model = Wav2VecCtc.build_model(w2v["args"], target_dict)
model.load_state_dict(w2v["model"], strict=True)

In the code above, we first load from model_path and get w2v, which contains the argument setup and the model’s weights. Then, we build a Wav2VecCtc object; Wav2VecCtc is the model definition of wav2vec 2.0. Finally, we load the weights into the model we just created.
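
One small addition not shown above: since we only run inference in this post, it is standard PyTorch practice to switch the model to eval mode (and to wrap the forward passes below in torch.no_grad()).

model.eval()  # disable dropout and other training-only behaviour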

We know that we need a decoder to convert the output of wav2vec 2.0 into text. Create a Viterbi decoder, as in the code below.

decoder = W2lViterbiDecoder(target_dict)

Next, we need to create a data loader for our dataset. Luckily, torchaudio knows how to process the LibriSpeech dataset! To use it, we just need to call torchaudio.datasets.LIBRISPEECH.

dev_clean_librispeech_data = torchaudio.datasets.LIBRISPEECH(data_path, url='dev-clean', download=False)
data_loader = torch.utils.data.DataLoader(dev_clean_librispeech_data, batch_size=1, shuffle=False)

In the steps so far, we have created wav2vec 2.0, a Viterbi decoder, and the data loader. Now, we are ready to convert raw waveforms into text using wav2vec 2.0 and the decoder.
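
Before passing anything to the model, we need to turn one dataset sample into the input dictionary that wav2vec 2.0 expects. The sketch below is a hedged illustration: torchaudio's LIBRISPEECH samples are tuples of (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id), while the dictionary keys ("source", "padding_mask") and the optional layer normalization follow the fairseq convention and should be treated as assumptions; check the accompanying notebook for the exact code.

import torch

waveform, sample_rate, transcript, *_ = dev_clean_librispeech_data[0]
source = waveform  # shape (1, num_samples): a batch of one 16 kHz waveform
if getattr(w2v["args"], "normalize", False):
    # some checkpoints expect per-utterance layer-normalized input
    source = torch.nn.functional.layer_norm(source, source.shape)
encoder_input = {
    "source": source,
    "padding_mask": torch.zeros_like(source, dtype=torch.bool),  # no padding needed for a single sample
}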


The code below shows how we pass one data sample into wav2vec 2.0. encoder_input is the data sample, a dictionary containing the speech audio waveform and the other arguments that we need to pass into wav2vec 2.0. The model outputs encoder_out, representing logits over tokens at each time step. To get encoder_out, we project the output of wav2vec 2.0 onto the tokens through a linear layer. The dimension of encoder_out is L*B*C, where L is the sequence length, B is the batch size and C is the number of tokens.

As we saw earlier, we need to pass probability distributions over tokens to the decoder to get the transcribed text. Since encoder_out contains logits over tokens, we take the log softmax of these logits (through model.get_normalized_probs) and get emissions, which are probability distributions over tokens.
encoder_out = model(**encoder_input)
emissions = model.get_normalized_probs(encoder_out, log_probs=True)
emissions = emissions.transpose(0, 1).float().cpu().contiguous()

Next, we pass emissions into the decoder, like this:

decoder_out = decoder.decode(emissions)

In the third post in this series, we describe what happens inside the decode method. We need to do some post-processing on decoder_out to finalize the output text, but we omit those details here. Check out post_process_sentence if you are interested in knowing more.
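
For completeness, here is a hedged sketch of that post-processing. It assumes the decoder returns, for each sample, a list of hypotheses with a "tokens" field (as in fairseq's speech_recognition example code); the letter tokens are joined and the "|" word-boundary symbol is turned back into spaces.

best_hypo = decoder_out[0][0]  # first sample, top hypothesis (assumed structure)
hyp_pieces = target_dict.string(best_hypo["tokens"].int().cpu())
transcription = hyp_pieces.replace(" ", "").replace("|", " ").strip()
print(transcription)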

That’s it! We just finished processing one data sample. If you want to convert all data samples from the dev-clean dataset into texts and get a WER score, try this notebook and you should get a WER of 2.63%.


What’s next?

In this post, we introduced ASR systems and wav2vec 2.0, and showed you how to get an ASR system working with wav2vec 2.0. Note that wav2vec 2.0 is a big model: its largest version has 317 million parameters! So, read our next post to learn how to compress wav2vec 2.0.


About Georgian R&D

Georgian is a fintech that invests in high-growth software companies.


At Georgian, the R&D team works on building our platform that identifies and accelerates the best growth stage software companies. As part of this work, we take the latest AI research and use it to help solve the business challenges of the companies where we are investors. We then create reusable toolkits so that it’s easier for our other companies to adopt these techniques.


We wrote this series of posts after an engagement where we collaborated closely with the team at Chorus. Chorus is a conversation intelligence platform that uses AI to analyze sales calls to drive team performance.


Take a look at our open opportunities if you’re interested in a career at Georgian.


References

Also published at https://medium.com/georgian-impact-blog/how-to-make-an-end-to-end-automatic-speech-recognition-system-with-wav2vec-2-0-dca6f8759920
