
Google Cloud launches new models for more accurate Speech AI


With voice continuing to emerge as the new frontier in human-computer interaction, many enterprises may seek to level up their technology and present consumers with speech recognition systems that more reliably and accurately recognize what their users are saying. Think about it: higher speech recognition quality can enable people to talk to their applications and devices the way they would talk to their friends, their doctors, or other people they interact with. 

This opens up a world of use cases, from hands-free applications for drivers to voice assistants across smart devices. Moreover, beyond giving machines instructions, accurate speech recognition enables live captions in video meetings, insights from live and recorded conversations, and much more. In the five years since we launched our Speech-to-Text (STT) API, we’ve seen customer enthusiasm for the technology increase, with the API now processing more than 1 billion minutes of speech each month. That’s equivalent to listening to Wagner’s 15-hour Der Ring des Nibelungen over 1.1 million times, and assuming around 140 words spoken per minute, it's enough each month to transcribe Hamlet (Shakespeare’s longest play) nearly 4.6 million times.     

That’s why today, we’re announcing the availability of our newest models for the STT API, along with a new model tag, “latest,” to help you access them. These models represent a major improvement in our technology, raising accuracy across 23 of the languages and 61 of the locales STT supports, so you can connect with your customers at scale through voice more effectively.
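
For illustration, here is a minimal sketch of how the new models might be requested through the STT API’s Python client library. The model identifier "latest_long", the sample audio URI, and the audio parameters are assumptions made for this example rather than details taken from the announcement.

from google.cloud import speech

client = speech.SpeechClient()

# Request the new model family via the "latest" tag; "latest_long" is an
# assumed identifier for long-form audio such as meetings or calls.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="latest_long",
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/meeting-recording.wav")  # hypothetical file

# Long-form audio is transcribed asynchronously.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

for result in response.results:
    print(result.alternatives[0].transcript)

In practice you would swap in your own bucket, audio encoding, and language code; the only change needed to try the new models is the model field in the recognition config.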

New models for better accuracy and understanding

The effort behind this new neural sequence-to-sequence model for speech recognition is the latest step in an almost eight-year journey, one that required extensive research, implementation, and optimization to deliver the best quality across different use cases, noise environments, acoustic conditions, and vocabularies. The architecture underlying the new model is based on cutting-edge ML techniques and lets us use our speech training data more efficiently, yielding better recognition results.

So what’s different about this model versus the one currently in production? 

For the past several years, automated speech recognition (ASR) techniques have been based on separate acoustic, pronunciation, and language models. Historically, each of these three individual components was trained separately, then assembled afterwards to do speech recognition. 

The conformer models that we’re announcing today are based on a single neural network. As opposed to training three separate models that need to be subsequently brought together, this approach offers more efficient use of model parameters. Specifically, the new architecture augments a transformer model with convolution layers (hence the name con-former), allowing us to capture both the local and global information in the speech signal.
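
To make that structure concrete, the rough PyTorch sketch below shows a single conformer-style block: two half-step feed-forward modules wrapped around a self-attention module (global context) and a convolution module (local context). The dimensions, kernel size, and module ordering are illustrative assumptions, not the production model.

import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Simplified conformer block: half-step feed-forward, multi-head
    self-attention, a convolution module, and a second half-step
    feed-forward, each wrapped in a residual connection."""

    def __init__(self, dim=256, num_heads=4, conv_kernel=31, ff_mult=4):
        super().__init__()
        self.ff1 = self._feed_forward(dim, ff_mult)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)  # expands channels for GLU
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=conv_kernel,
                                   padding=conv_kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.ff2 = self._feed_forward(dim, ff_mult)
        self.final_norm = nn.LayerNorm(dim)

    @staticmethod
    def _feed_forward(dim, mult):
        return nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * mult),
            nn.SiLU(),
            nn.Linear(dim * mult, dim),
        )

    def forward(self, x):                      # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)              # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context
        # Convolution module captures local context along the time axis.
        c = self.conv_norm(x).transpose(1, 2)  # (batch, dim, time) for Conv1d
        c = nn.functional.glu(self.pointwise1(c), dim=1)
        c = nn.functional.silu(self.bn(self.depthwise(c)))
        x = x + self.pointwise2(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)              # second half-step feed-forward
        return self.final_norm(x)

# Example: a batch of 2 utterances, 100 frames of 256-dim acoustic features.
features = torch.randn(2, 100, 256)
print(ConformerBlock()(features).shape)        # torch.Size([2, 100, 256])

A full recognizer stacks many such blocks on top of an acoustic feature front end and decodes the outputs into text, but the pairing of attention and convolution in each block is what lets the single network cover both the global and local structure of the speech signal.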

