
Microsoft Launched VALL-E, A Voice DALL-E

 1 year ago
source link: https://www.theinsaneapp.com/2023/01/microsoft-launched-a-voice-based-dall-e-called-vall-e.html


Microsoft has recently released VALL-E, a new language model for text-to-speech synthesis (TTS) that uses discrete audio codec codes as its intermediate representation. After being trained on 60,000 hours of English speech data, it demonstrated in-context learning abilities in zero-shot scenarios.
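As a rough illustration of what "audio codec codes" means here, the toy residual vector quantizer below turns continuous acoustic frames into a grid of discrete tokens — the kind of representation a VALL-E-style model predicts. The codebooks are random, purely for demonstration, not a trained neural codec such as EnCodec:

```python
import numpy as np

# Toy residual vector quantizer (RVQ): the codec component that maps
# audio frames to discrete codes. Codebooks are random, illustrative only.
rng = np.random.default_rng(0)
N_CODEBOOKS, CODEBOOK_SIZE, DIM = 8, 1024, 16
codebooks = rng.standard_normal((N_CODEBOOKS, CODEBOOK_SIZE, DIM))

def encode(frames: np.ndarray) -> np.ndarray:
    """Quantize (T, DIM) acoustic frames into (T, N_CODEBOOKS) integer codes."""
    residual = frames.copy()
    codes = np.empty((frames.shape[0], N_CODEBOOKS), dtype=np.int64)
    for q in range(N_CODEBOOKS):
        # Pick the nearest codebook entry for each frame's current residual.
        dists = np.linalg.norm(residual[:, None, :] - codebooks[q][None], axis=-1)
        idx = dists.argmin(axis=1)
        codes[:, q] = idx
        residual -= codebooks[q][idx]  # the next codebook quantizes what remains
    return codes

frames = rng.standard_normal((10, DIM))  # 10 fake acoustic frames
codes = encode(frames)
print(codes.shape)  # (10, 8): one discrete token per frame per codebook
```

Because the codes are just integer tokens, speech generation can be framed as language modeling over them, which is the core idea behind VALL-E.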

VALL-E can produce high-quality, personalized speech from just a 3-second recording of an unseen speaker used as an acoustic prompt, enabling prompt-based, zero-shot, in-context TTS.
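As a hedged sketch of what a "3-second acoustic prompt" amounts to in codec-code terms: VALL-E has no public API, so every name below is invented for illustration; the 75 Hz figure is EnCodec's frame rate for 24 kHz audio, the codec the paper builds on.

```python
# Hypothetical conditioning for a VALL-E-style model: codec codes of a
# short enrollment recording plus the phonemes of the target text.
FRAME_RATE_HZ = 75  # EnCodec at 24 kHz emits ~75 code frames per second

def build_conditioning(prompt_seconds: float, phonemes: list[str]) -> dict:
    """Assemble what the model is conditioned on (illustrative names)."""
    n_prompt_frames = int(prompt_seconds * FRAME_RATE_HZ)
    return {"acoustic_prompt_frames": n_prompt_frames, "phonemes": phonemes}

cond = build_conditioning(3.0, list("hello"))
print(cond["acoustic_prompt_frames"])  # a 3-second prompt -> 225 code frames
```

The model then continues the prompt's token sequence, so the generated speech inherits the prompt speaker's voice characteristics.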

No additional structure engineering or pre-designed acoustic features are needed. Microsoft used large amounts of semi-supervised data to build a TTS system that generalizes across speakers, which suggests that semi-supervised data had previously been underused in scaling up TTS.

VALL-E can produce multiple outputs from the same input text while preserving the acoustic prompt's speaker emotion and acoustic environment, and it can synthesize natural speech via prompting in the zero-shot scenario. Evaluation results show that VALL-E outperforms prior zero-shot TTS systems on LibriSpeech and VCTK, setting new state-of-the-art zero-shot results on both benchmarks. You can also read the research paper here.

Interestingly, this text-to-speech method could let people who have lost their voices speak again, provided recordings of their voice exist.

What Are The Features Of VALL-E?

Synthesis Diversity: VALL-E's output can vary for the same input text because it generates discrete tokens with a sampling-based decoding algorithm, so different random seeds yield different samples of the personalized speech.
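A minimal sketch of this sampling behavior, with a fixed toy token distribution standing in for the model (this is not VALL-E itself, just the mechanism):

```python
import random

# Drawing discrete tokens from a distribution: different seeds give
# different (but equally valid) token sequences, while the same seed
# reproduces a sequence exactly.
VOCAB = list(range(1024))                 # codec-token vocabulary
WEIGHTS = [1.0 / (i + 1) for i in VOCAB]  # arbitrary skewed distribution

def sample_tokens(seed: int, length: int = 8) -> list[int]:
    """Sample a token sequence; the seed controls which sample you get."""
    rng = random.Random(seed)
    return rng.choices(VOCAB, weights=WEIGHTS, k=length)

a, b = sample_tokens(seed=1), sample_tokens(seed=2)
print(a != b)                      # different seeds, (almost surely) different speech
print(sample_tokens(1) == a)       # same seed reproduces the sample exactly
```

This is why a sampling-based decoder can offer many renditions of one sentence, whereas a purely deterministic decoder would always produce the same one.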

Acoustic Environment Maintenance: VALL-E can generate personalized speech while maintaining the acoustic environment of the speaker prompt, because it is trained on large-scale data with more acoustic variability than the baseline. Microsoft used audio and transcriptions sampled from the Fisher dataset for this evaluation.

Speaker's Emotion Maintenance: VALL-E can build personalized speech from audio prompts drawn from the Emotional Voices Database while preserving the prompt's emotional tone. In traditional supervised emotional TTS, a model is trained on a dataset where each utterance is paired with a transcription and an emotion label; VALL-E, by contrast, can retain the prompt's emotion in a zero-shot setting.

VALL-E still has weaknesses to overcome, such as synthesis robustness and data coverage.
