Microsoft VALL-E can simulate anyone’s voice with 3 seconds of audio

Microsoft has just unveiled VALL-E, a new text-to-speech AI model that the company describes as a neural codec language model and that can simulate a person’s voice from just a three-second audio sample. VALL-E builds on Meta’s EnCodec audio compression technology, which uses artificial intelligence to compress high-quality audio to data rates much lower than those of MP3 files.

Microsoft’s new AI can preserve a speaker’s emotional tone and acoustic environment.

VALL-E works by analyzing how a person sounds and breaking that information down into discrete components called “tokens.” It then draws on its training data to match what it “knows” about how that voice would sound if it spoke phrases other than those in the three-second sample.
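To make the idea of discrete “tokens” concrete, here is a minimal Python sketch of quantization: a waveform is cut into short frames, each frame is reduced to one number, and that number is replaced by the index of the nearest entry in a small codebook. This is a toy illustration only, not EnCodec or VALL-E, which rely on a learned neural codec. The sample rate, frame size, codebook size, and all function names below are assumptions chosen for the example.

```python
import math
import random

SAMPLE_RATE = 16_000   # assumed sample rate in Hz (illustrative only)
FRAME_SIZE = 320       # assumed 20 ms frames (illustrative only)
CODEBOOK_SIZE = 8      # toy codebook; real codecs use far larger, learned ones


def frame_signal(samples, frame_size):
    """Split samples into equal-length frames, dropping any remainder."""
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


def frame_energy(frame):
    """Summarize a frame by its root-mean-square energy (a one-number feature)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def build_codebook(features, size):
    """Spread `size` code values evenly across the observed feature range."""
    lo, hi = min(features), max(features)
    step = (hi - lo) / (size - 1) if hi > lo else 0.0
    return [lo + k * step for k in range(size)]


def tokenize(features, codebook):
    """Replace each feature with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - f))
        for f in features
    ]


def make_test_tone(seconds, rate=SAMPLE_RATE, seed=0):
    """Synthesize a 220 Hz tone with a slow volume swell plus a little noise."""
    rng = random.Random(seed)
    n = int(seconds * rate)
    return [
        (0.2 + 0.8 * t / n) * math.sin(2 * math.pi * 220 * t / rate)
        + rng.uniform(-0.01, 0.01)
        for t in range(n)
    ]


# A synthetic three-second "recording" stands in for a real voice sample.
samples = make_test_tone(3.0)
frames = frame_signal(samples, FRAME_SIZE)
features = [frame_energy(f) for f in frames]
codebook = build_codebook(features, CODEBOOK_SIZE)
tokens = tokenize(features, codebook)

print(len(tokens), tokens[:5], tokens[-5:])
```

Under these assumed settings, the three-second clip becomes a sequence of 150 small integers. A language model can then treat such a sequence much like text, which is the intuition behind describing a voice as tokens.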

Today’s text-to-speech systems typically require high-quality, very clean training data, usually recorded in a studio with professional equipment. VALL-E departs from that approach: it can simulate a voice from only a three-second sample, so a speaker no longer has to spend weeks in a studio.

AI can simulate anyone’s voice with 3 seconds of audio

VALL-E was trained on the LibriLight audio library, which contains 60,000 hours of speech from more than 7,000 speakers. This enables VALL-E to generate realistic-sounding voices in English. When combined with other generative AI models, it has the potential for high-quality text-to-speech applications.

Microsoft has published a large collection of VALL-E-generated samples so that you can hear the results for yourself. The results are not perfect, but many of the samples sound natural and closely resemble the original speaker.

Despite VALL-E’s impressive capabilities, Microsoft is aware of the technology’s potential for abuse. According to the company, malicious actors could use synthesized audio for purposes such as spoofing voice identification or impersonating a specific speaker. To mitigate these risks, Microsoft suggests building a detection model that can tell whether an audio clip was synthesized by VALL-E.

In short, VALL-E is a significant advance in text-to-speech technology. Its ability to simulate a voice from only a three-second audio sample opens up a wide range of uses. However, Microsoft must continue to improve VALL-E while ensuring that appropriate safeguards are in place to prevent its misuse.

Source/VIA: Ars Technica
