RealTalk: Speech Synthesis Model Recreates a Human Voice Perfectly

RealTalk: This Speech Synthesis Model Our Engineers Built Recreates a Human Voice Perfectly (Part I)

…and it’s the voice of Joe Rogan. Disclaimer: he didn’t actually endorse our work like this; it’s a clip from the video the team created featuring their work. Video and more after the jump!

Today we’re excited to announce that our Machine Learning Engineers Hashiam Kadhim, Joe Palermo and Rayhane Mama have produced the most realistic AI simulation of a voice we’ve heard to date.

It’s the voice of someone you’ve probably heard of before: Joe Rogan. (For those who haven’t: Joe Rogan is the creator and host of one of the world’s most popular podcasts, which to date has nearly 1300 episodes and counting.)

Obviously, something like this has to be heard to be believed. So without further ado, check it out for yourself:

Remember: 100% of the following audio was generated from the machine learning model using only text input. This includes the breaths, ‘um’s and ‘ah’s, and all other noises.

The replica of Rogan’s voice the team created was produced using a text-to-speech deep learning system they developed called RealTalk, which generates life-like speech using only text inputs.

IT’S CRAZY, RIGHT? If you’re like us, and specifically, like our Principal ML Architect, Alex Krizhevsky, you’re probably thinking that it’s “one of the coolest, but scariest, things I’ve seen yet in artificial intelligence. Unlike The Singularity, which is this theoretical thing that could happen in 40, 100 years, speech synthesis is soon going to be a reality everywhere.”

What Does This Mean? Considering Societal Impact

It’s surreal for our engineers to be able to say they’ve legitimately created a life-like replica of Joe Rogan’s voice using AI. Not to mention the fact that the model would be capable of producing a replica of anyone’s voice, provided that sufficient data is available.

As AI practitioners building real-world applications, we’re especially cognizant of the fact that we need to be talking about the implications of this.

Because clearly, the societal implications for technologies like speech synthesis are massive. And the implications will affect everyone. Poor consumers and rich consumers. Enterprises and governments.

Right now, technical expertise, ingenuity, computing power and data are required to make models like RealTalk perform well. So not just anyone can go out and do it. But in the next few years (or even sooner), we’ll see the technology advance to the point where only a few seconds of audio are needed to create a life-like replica of anyone’s voice on the planet.

It’s pretty f*cking scary.

Here are some examples of what might happen if the technology got into the wrong hands:

Spam callers impersonating your mother or spouse to obtain personal information
Impersonating someone for the purposes of bullying or harassment
Gaining entrance to high security clearance areas by impersonating a government official
An ‘audio deepfake’ of a politician being used to manipulate election results or cause a social uprising

Obviously, though, not everything is doom and gloom. There are also some really good things that could come out of speech synthesis models. Here are some examples:

Talking to a voice assistant in a way that feels as natural as talking to a friend
Customized voice applications — for instance, a workout app that contains a personalized pre-workout pep talk from Arnold Schwarzenegger
Improved accessibility options for people who communicate through text-to-speech devices, for example, people with Lou Gehrig’s disease
Automating voice dubbing for any media and in any language

As the recent report “The Malicious Use of Artificial Intelligence” by Oxford’s Future of Humanity Institute notes, new advancements in artificial intelligence not only expand existing threats, but also create new ones. (We highly recommend checking out the report, which is freely available to download here.)

We won’t pretend to have all the answers about how to build this technology ethically. That said, we think it will inevitably be built and increasingly implemented into our world over the coming years. So in addition to raising awareness and acknowledging these issues, we also want to show this work as a way of starting a conversation that must be had.

Everyone should know what kinds of things are possible with the development of speech synthesis technologies. We think that as AI voices become more and more life-like, it’s crucial that public awareness of what’s possible with the technology doesn’t lag behind.

As we’ve seen with Deepfakes, public awareness and dialogue also push governments, policymakers and lawmakers to take action and develop countermeasures swiftly.

A crucial advantage and responsibility we have as an applied AI company is knowing that there’s a huge difference between exploring AI in research and implementing it into the real world. To work on things like this responsibly, we think the public should first be made aware of the implications that speech synthesis models present before anything is released as open source.

Because of this, at this time we will not be releasing our research, model or datasets publicly.

We will however follow up in the coming days with Part II of this post, which will feature a technical overview of the team’s work and some of the features that went into building it.

Next steps

For those of you reading, we encourage you to remember that speech synthesis is getting better and better every day. It’s not outlandish to believe that the implications we mentioned (and of course, many more) will soon make their way into the fabric of society.

So pay attention! Join the conversation! Knowledge is power, and we encourage individuals, companies and governments to think about how we can responsibly implement these technologies into our society.

Learn more about RealTalk: For anyone who has questions, feedback or inquiries about the project, connect with us by email at [email protected].

Curious about how RealTalk was built? Stay tuned in the coming days for Part II of the blog on RealTalk, which will feature a technical overview of how the model works and what went into building it.

In the meantime, we encourage you to check out a modified Turing Test game the RealTalk team built to showcase the naturalness and intelligibility of this model, which can be found at www.fakejoerogan.com.

This project was developed as part of Dessa’s Meta Labs initiative, an internal program we’ve developed that encourages employees to work on independent projects that advance their knowledge of machine learning, and then some. Ambitious in scope, these projects are a labour of love by our engineers (read: countless hours outside of work), and often end up stretching the boundaries of what’s possible for the technology. Major props!
