
Microsoft Launched VALL-E, A Voice DALL-E

 1 year ago
source link: https://www.theinsaneapp.com/2023/01/microsoft-launched-a-voice-based-dall-e-called-vall-e.html


Microsoft has recently released VALL-E, a new language model for text-to-speech synthesis (TTS) that uses discrete audio codec codes as its intermediate representation. After being trained on 60,000 hours of English speech data, it demonstrated in-context learning abilities in zero-shot scenarios.
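As a rough illustration of what "audio codec codes" means here, the toy residual vector quantizer below turns continuous acoustic frames into a grid of discrete tokens — the kind of representation a VALL-E-style model predicts. The codebooks are random, purely for demonstration, not a trained neural codec such as EnCodec:

```python
import numpy as np

# Toy residual vector quantizer (RVQ): the codec component that maps
# audio frames to discrete codes. Codebooks are random, illustrative only.
rng = np.random.default_rng(0)
N_CODEBOOKS, CODEBOOK_SIZE, DIM = 8, 1024, 16
codebooks = rng.standard_normal((N_CODEBOOKS, CODEBOOK_SIZE, DIM))

def encode(frames: np.ndarray) -> np.ndarray:
    """Quantize (T, DIM) acoustic frames into (T, N_CODEBOOKS) integer codes."""
    residual = frames.copy()
    codes = np.empty((frames.shape[0], N_CODEBOOKS), dtype=np.int64)
    for q in range(N_CODEBOOKS):
        # Pick the nearest codebook entry for each frame's current residual.
        dists = np.linalg.norm(residual[:, None, :] - codebooks[q][None], axis=-1)
        idx = dists.argmin(axis=1)
        codes[:, q] = idx
        residual -= codebooks[q][idx]  # the next codebook quantizes what remains
    return codes

frames = rng.standard_normal((10, DIM))  # 10 fake acoustic frames
codes = encode(frames)
print(codes.shape)  # (10, 8): one discrete token per frame per codebook
```

Because the codes are just integer tokens, speech generation can be framed as language modeling over them, which is the core idea behind VALL-E.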

VALL-E can produce high-quality, personalized speech from just a 3-second recording of an unseen speaker used as an acoustic prompt, enabling prompt-based, zero-shot, in-context TTS.
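As a hedged sketch of what a "3-second acoustic prompt" amounts to in codec-code terms: VALL-E has no public API, so every name below is invented for illustration; the 75 Hz figure is EnCodec's frame rate for 24 kHz audio, the codec the paper builds on.

```python
# Hypothetical conditioning for a VALL-E-style model: codec codes of a
# short enrollment recording plus the phonemes of the target text.
FRAME_RATE_HZ = 75  # EnCodec at 24 kHz emits ~75 code frames per second

def build_conditioning(prompt_seconds: float, phonemes: list[str]) -> dict:
    """Assemble what the model is conditioned on (illustrative names)."""
    n_prompt_frames = int(prompt_seconds * FRAME_RATE_HZ)
    return {"acoustic_prompt_frames": n_prompt_frames, "phonemes": phonemes}

cond = build_conditioning(3.0, list("hello"))
print(cond["acoustic_prompt_frames"])  # a 3-second prompt -> 225 code frames
```

The model then continues the prompt's token sequence, so the generated speech inherits the prompt speaker's voice characteristics.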

No additional structure engineering or pre-designed acoustic features are needed. Microsoft used large amounts of semi-supervised data to build a TTS system that generalizes across speakers, which suggests that semi-supervised data had previously been underused in scaling up TTS.

VALL-E can produce multiple outputs from the same input text while preserving the acoustic prompt's speaker emotion and acoustic environment, and it can synthesize natural speech via prompting in the zero-shot scenario. Evaluation results show that VALL-E outperforms prior zero-shot TTS systems on LibriSpeech and VCTK, setting new state-of-the-art zero-shot results on both benchmarks. You can also read the research paper here.

Interestingly, this text-to-speech method could let people who have lost their voices speak again, provided recordings of their voice exist.

What Are The Features Of VALL-E?

Synthesis Diversity: VALL-E's output can vary for the same input text because it generates discrete tokens with a sampling-based decoding algorithm, so different random seeds yield different samples of the personalized speech.
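A minimal sketch of this sampling behavior, with a fixed toy token distribution standing in for the model (this is not VALL-E itself, just the mechanism):

```python
import random

# Drawing discrete tokens from a distribution: different seeds give
# different (but equally valid) token sequences, while the same seed
# reproduces a sequence exactly.
VOCAB = list(range(1024))                 # codec-token vocabulary
WEIGHTS = [1.0 / (i + 1) for i in VOCAB]  # arbitrary skewed distribution

def sample_tokens(seed: int, length: int = 8) -> list[int]:
    """Sample a token sequence; the seed controls which sample you get."""
    rng = random.Random(seed)
    return rng.choices(VOCAB, weights=WEIGHTS, k=length)

a, b = sample_tokens(seed=1), sample_tokens(seed=2)
print(a != b)                      # different seeds, (almost surely) different speech
print(sample_tokens(1) == a)       # same seed reproduces the sample exactly
```

This is why a sampling-based decoder can offer many renditions of one sentence, whereas a purely deterministic decoder would always produce the same one.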

Acoustic Environment Maintenance: VALL-E can generate personalized speech while maintaining the acoustic environment of the speaker prompt, because it is trained on large-scale data with more acoustic variability than the baseline. Microsoft used audio and transcriptions sampled from the Fisher dataset for this evaluation.

Speaker's Emotion Maintenance: VALL-E can build personalized speech from audio prompts drawn from the Emotional Voices Database while preserving the prompt's emotional tone. In traditional supervised emotional TTS, a model is trained on a dataset where each utterance is paired with a transcription and an emotion label; VALL-E, by contrast, can retain the prompt's emotion in a zero-shot setting.

VALL-E still has weaknesses to overcome, such as synthesis robustness and data coverage.
