
How to Use ASR System for Accurate Transcription Properties of Your Digital Product

@zilunpeng (Georgian.io)


Thanks to advances in speech recognition, companies can now build a whole range of products with accurate transcription capabilities at their heart. Conversation intelligence platforms, personal assistants and video and audio editing tools, for example, all rely on speech-to-text transcription. However, you often need to train these systems for every domain you want to transcribe, using supervised data. In practice, just to get started in a new domain, you need a large body of transcribed audio that's similar to what you want to transcribe.


Recently, Facebook released wav2vec 2.0 which goes some way towards addressing this challenge. wav2vec 2.0 allows you to pre-train transcription systems using audio only — with no corresponding transcription — and then use just a tiny transcribed dataset for training.


In this blog, we share how we worked with wav2vec 2.0, with great results.


What is an end-to-end automatic speech recognition system?

Before we dive into wav2vec 2.0, let’s take a few steps back to cover a couple of key terms you’ll need to understand to see what makes wav2vec 2.0 so special. First, let’s look at end-to-end automatic speech recognition systems.


An end-to-end automatic speech recognition (ASR) system takes a speech audio waveform and outputs the corresponding text. Traditionally, these systems use Hidden Markov Models (HMMs), which model the speech audio as a stochastic process. In recent years, deep learning ASR systems have become popular thanks to increased computing power and larger amounts of training data.


You can measure an ASR system’s performance with a word error rate (WER) metric. WER reflects the number of corrections needed to convert the ASR output into the ground truth. Generally, a lower WER means a better quality ASR system.


This figure is adapted from https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-evaluate-data


The example above shows how to calculate the WER. We can see that the ASR has made a few errors. It has inserted an “a”, identified “John” as “Jones” and deleted the word “are” from the ground truth.


To calculate WER, we can use this formula: (D+I+S)/N. D is the number of deletions, I is the number of insertions, S is the number of substitutions and N is the number of words in the ground truth. In this example, the ASR output made 3 mistakes in total from 5 words in the ground truth. In this case, the WER would be 3 / 5 = 0.6.
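
To make the arithmetic concrete, here is a minimal Python sketch of the WER calculation. The two sentences at the bottom are hypothetical stand-ins chosen to reproduce the error pattern described above (one deletion, one insertion and one substitution against five ground-truth words).

# Word-level edit distance (deletions, insertions, substitutions) via dynamic programming.
def wer(ground_truth: str, hypothesis: str) -> float:
    ref, hyp = ground_truth.split(), hypothesis.split()
    # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("are you going with John", "you going a with Jones"))  # 3 errors / 5 words = 0.6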


The LibriSpeech dataset

Next, we’ll briefly touch on the LibriSpeech dataset. The LibriSpeech dataset is the most commonly used audio processing dataset in speech research. It was created by Vassil Panayotov and Daniel Povey in 2015 [3]. LibriSpeech consists of 960 hours of labelled speech data and is the standard benchmark for training and evaluating ASR systems.


The dev-clean dataset from LibriSpeech contains 5.4 hours of “clean” speech data. It’s generally used as a validation dataset. In the figure below, we show the transcription for one audio sample in the dev-clean dataset.


The transcription for one audio sample in the dev-clean dataset


What is wav2vec 2.0?

Now that we understand what an ASR system and the LibriSpeech dataset are, we’re ready to take a closer look at wav2vec 2.0.


What’s different about wav2vec 2.0?


ASR systems come in two flavors:

  • The first are hybrid systems, such as Kaldi [7], that train a deep acoustic model to predict phonemes from audio processed into Mel Frequency Cepstral Coefficients (MFCCs), combine the phonemes using a pronunciation dictionary and finally pick the most likely results using a language model (either count-based or RNN-based).
  • The second are end-to-end systems that use a deep neural network to predict words directly from the audio or MFCCs. Systems like RNN-T [6] or wav2vec [1, 4] require far more training data and GPU resources to train.

Due to the massive data requirements of end-to-end systems, only the biggest companies have used them to date. The data requirements also make it hard to train models for new domains (even in the same language) and for new languages or accents. With a hybrid system, it is much easier to create a model for a new domain using minimal training data and a pronunciation dictionary extended with words from that domain.


The promise of wav2vec 2.0 is pre-training without supervised data, using a large dataset of recordings in the target domain. Afterwards, the model can be fine-tuned with a supervised approach to maximize accuracy. Wav2vec 2.0 shows that it's possible to achieve a low WER on the LibriSpeech validation datasets using only ten minutes of labelled audio data. Another option is to take a pre-trained model (such as the LibriSpeech model) and fine-tune it for your domain with a few hours of labelled audio.


The architecture of wav2vec 2.0


The breakthrough in wav2vec 2.0 is its adoption of the masked pre-training method used by the large language model BERT [8]. BERT masks a few words in each training sentence, and the model trains by attempting to fill in the gaps.


Instead of masking words, wav2vec 2.0 masks a part of the audio representation and requires the transformer network to fill in the gap.


The figure below shows the wav2vec 2.0 architecture with its two major components: CNN layers and transformer layers.


image credit: https://arxiv.org/pdf/2006.11477.pdf


Self-supervised learning


So how does self-supervised learning work in wav2vec 2.0? The raw audio waveform (X in the figure above) first passes through CNN layers, and we get latent speech representations (Z in the figure above). Now, two things happen in parallel:

  1. We mask a random subset of Z, let’s call it masked_Z. We pass masked_Z into transformer layers. The output of the transformer layers is called context representations (C in the figure above).
  2. We apply product quantization [5] on Z and get quantized representations (Q in the figure above).

We expect C to be close to Q over the masked parts. The “error” between C and Q over the masked parts is called the contrastive loss. Minimizing contrastive loss enables transformer layers to learn the structure inside latent speech representations (Z).
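
For intuition, here is a simplified sketch of the contrastive objective at a single masked time step. The real wav2vec 2.0 loss compares the context vector against the true quantized target plus a set of sampled distractors using cosine similarity and a temperature; the shapes, names and dimensions below are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    # c_t: context representation at one masked step, shape (dim,)
    # q_t: the true quantized target for that step, shape (dim,)
    # distractors: quantized vectors sampled from other masked steps, shape (K, dim)
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)    # (K + 1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)  # (K + 1,)
    logits = (sims / temperature).unsqueeze(0)                        # (1, K + 1)
    # the true target sits at index 0, so we minimize cross entropy against label 0
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))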


Where does wav2vec 2.0 fit in the big picture?

In the figure above, we saw that context representations were the output of transformer layers. Wav2vec 2.0 passes these context representations into a linear layer, followed by a softmax operation. The final output contains probability distributions over 32 tokens. A token can be a character, or it can represent word and sentence boundaries, as well as unknowns.
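
As an illustration of that last step, the sketch below projects a batch of context representations onto the 32 tokens and applies a softmax per time step. The shapes are assumptions (768 is the transformer width of the BASE model).

import torch

proj = torch.nn.Linear(768, 32)        # linear layer from context representations to the 32 tokens
context = torch.randn(1, 100, 768)     # (batch, time steps, transformer width), illustrative only
logits = proj(context)                 # (1, 100, 32)
probs = torch.softmax(logits, dim=-1)  # one probability distribution over tokens per time step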


How do we convert these probability distributions into text? The answer is a decoder! The authors of wav2vec 2.0 used a beam search decoder. Below, we show you how to use a Viterbi decoder to convert the output of wav2vec 2.0 into text.
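
Before getting to that, it may help to see the simplest possible alternative: a greedy (argmax) decode that picks the most likely token at each time step, collapses consecutive repeats and drops padding. This is only a conceptual sketch, not the decoder used later in this post, and the token conventions (a <pad> token, "|" as the word boundary) are assumptions.

def greedy_decode(probs, symbols, pad_token="<pad>", word_sep="|"):
    # probs: torch tensor of shape (time steps, num tokens); symbols: list mapping token id -> string
    ids = probs.argmax(dim=-1).tolist()                                      # best token per time step
    collapsed = [i for i, prev in zip(ids, [None] + ids[:-1]) if i != prev]  # merge consecutive repeats
    chars = [symbols[i] for i in collapsed if symbols[i] != pad_token]       # drop padding tokens
    return "".join(chars).replace(word_sep, " ").strip()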


Similarity with word2vec


Word2vec [2] generates a feature vector for a given word, such that feature vectors of similar words have higher cosine similarity. Similar to word2vec, we can think of the wav2vec 2.0 output as a feature vector for an audio segment.
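
As a quick refresher, cosine similarity between two feature vectors can be computed like this (the 300-dimensional random vectors are placeholders):

import torch
import torch.nn.functional as F

v1, v2 = torch.randn(300), torch.randn(300)  # placeholder embeddings
similarity = F.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0)).item()
print(similarity)  # close to 1.0 for similar vectors, close to -1.0 for opposite ones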


Using Python and PyTorch to build an end-to-end speech recognition system with wav2vec 2.0

Now, let’s look at how to create a working ASR system with wav2vec 2.0 that generates text from audio waveforms in the LibriSpeech dataset. We use Python and the PyTorch framework in the code snippets below.


First, download the wav2vec 2.0 model and the dev-clean dataset from LibriSpeech. The dev-clean dataset contains 5.4 hours of “clean” speech data, and it’s generally used as a validation dataset.

model_path = "/home/models/wav2vec_big_960h.pt"
data_path = "/home/datasets/"

In the code above, we declare model_path, which is the path to the wav2vec 2.0 model that we just downloaded. data_path is the path to the dev-clean dataset; store it under "/home/datasets/".

We mentioned earlier that wav2vec 2.0 outputs a probability distribution over 32 tokens. We convert these tokens to letters with the help of ltr_dict.txt. Download ltr_dict.txt from here and save it at /home/ltr_dict.txt.


You might notice that ltr_dict.txt contains only 28 of these tokens. The remaining four tokens are <s>, <pad>, </s> and <unk>, and they are added when we call fairseq_mod.data.Dictionary.load() with the path to ltr_dict.txt.

target_dict = fairseq_mod.data.Dictionary.load('ltr_dict.txt')
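
As a quick sanity check (assuming the fairseq-style Dictionary API is unchanged in fairseq_mod), the four special tokens plus the 28 entries from ltr_dict.txt should give 32 tokens in total:

print(len(target_dict))                      # expected: 32
print(target_dict.pad(), target_dict.unk())  # indices of the <pad> and <unk> tokens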

Now, create the wav2vec 2.0 model.

w2v = torch.load(model_path)
model = Wav2VecCtc.build_model(w2v["args"], target_dict)
model.load_state_dict(w2v["model"], strict=True)

In the code above, we first load from model_path and get w2v, which contains the argument setup and the model’s weights. Then, we build a Wav2VecCtc object; Wav2VecCtc is the model definition of wav2vec 2.0. Finally, we load the weights into the model we just created.
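
One small addition not shown above: since we only run inference in this post, it is standard PyTorch practice to switch the model to eval mode (and to wrap the forward passes below in torch.no_grad()).

model.eval()  # disable dropout and other training-only behaviour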

We know that we need a decoder to convert the output of wav2vec 2.0 into text. Create a Viterbi decoder, as in the code below.

decoder = W2lViterbiDecoder(target_dict)

Next, we need to create a data loader for our dataset. Luckily, torchaudio knows how to process the LibriSpeech dataset! To use it, we just need to call torchaudio.datasets.LIBRISPEECH.

dev_clean_librispeech_data = torchaudio.datasets.LIBRISPEECH(data_path, url='dev-clean', download=False)
data_loader = torch.utils.data.DataLoader(dev_clean_librispeech_data, batch_size=1, shuffle=False)

In the steps so far, we have created wav2vec 2.0, a Viterbi decoder, and the data loader. Now, we are ready to convert raw waveforms into text using wav2vec 2.0 and the decoder.
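
Before passing anything to the model, we need to turn one dataset sample into the input dictionary that wav2vec 2.0 expects. The sketch below is a hedged illustration: torchaudio's LIBRISPEECH samples are tuples of (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id), while the dictionary keys ("source", "padding_mask") and the optional layer normalization follow the fairseq convention and should be treated as assumptions; check the accompanying notebook for the exact code.

import torch

waveform, sample_rate, transcript, *_ = dev_clean_librispeech_data[0]
source = waveform  # shape (1, num_samples): a batch of one 16 kHz waveform
if getattr(w2v["args"], "normalize", False):
    # some checkpoints expect per-utterance layer-normalized input
    source = torch.nn.functional.layer_norm(source, source.shape)
encoder_input = {
    "source": source,
    "padding_mask": torch.zeros_like(source, dtype=torch.bool),  # no padding needed for a single sample
}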


The code below shows how we pass one data sample into wav2vec 2.0. encoder_input is the data sample, a dictionary containing the speech audio waveform and the other arguments that we need to pass into wav2vec 2.0. The model outputs encoder_out, representing logits over tokens at each time step. To get encoder_out, we project the output of wav2vec 2.0 onto the tokens through a linear layer. The dimension of encoder_out is L*B*C, where L is the sequence length, B is the batch size and C is the number of tokens.

As we saw earlier, we need to pass probability distributions over tokens to the decoder to get the transcribed text. Since encoder_out contains logits over tokens, we take the log softmax of these logits (through model.get_normalized_probs) and get emissions, which are probability distributions over tokens.
encoder_out = model(**encoder_input)
emissions = model.get_normalized_probs(encoder_out, log_probs=True)
emissions = emissions.transpose(0, 1).float().cpu().contiguous()

Next, we pass emissions into the decoder, like this:

decoder_out = decoder.decode(emissions)

In the third post in this series, we describe what happens inside the decode method. We need to do some post-processing on decoder_out to finalize the output text, but we omit those details here. Check out post_process_sentence if you are interested in knowing more.
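
For completeness, here is a hedged sketch of that post-processing. It assumes the decoder returns, for each sample, a list of hypotheses with a "tokens" field (as in fairseq's speech_recognition example code); the letter tokens are joined and the "|" word-boundary symbol is turned back into spaces.

best_hypo = decoder_out[0][0]  # first sample, top hypothesis (assumed structure)
hyp_pieces = target_dict.string(best_hypo["tokens"].int().cpu())
transcription = hyp_pieces.replace(" ", "").replace("|", " ").strip()
print(transcription)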

That’s it! We just finished processing one data sample. If you want to convert all data samples from the dev-clean dataset into texts and get a WER score, try this notebook and you should get a WER of 2.63%.


What’s next?

In this post, we introduced ASR systems and wav2vec 2.0, and showed you how to get an ASR system working with wav2vec 2.0. Note that wav2vec 2.0 is a big model: its largest version has 317 million parameters! So, read our next post to learn how to compress wav2vec 2.0.


About Georgian R&D

Georgian is a fintech that invests in high-growth software companies.


At Georgian, the R&D team works on building our platform that identifies and accelerates the best growth stage software companies. As part of this work, we take the latest AI research and use it to help solve the business challenges of the companies where we are investors. We then create reusable toolkits so that it’s easier for our other companies to adopt these techniques.


We wrote this series of posts after an engagement where we collaborated closely with the team at Chorus. Chorus is a conversation intelligence platform that uses AI to analyze sales calls to drive team performance.


Take a look at our open opportunities if you’re interested in a career at Georgian.


References

Also published at https://medium.com/georgian-impact-blog/how-to-make-an-end-to-end-automatic-speech-recognition-system-with-wav2vec-2-0-dca6f8759920
