Microsoft introduced VALL-E, an AI model that mimics a voice from a small sample.
Microsoft has unveiled an artificial intelligence (AI) model called VALL-E that converts text to speech and accurately imitates a human voice; a recording just three seconds long is enough to serve as a sample. The model also preserves the emotional tone of the speech in the sample.
The authors of the project say the system will be useful for building applications with high-quality text-to-speech capabilities and for creating audio content in combination with other AI content generators such as GPT-3. They also acknowledge that it can be used to edit audio from a transcript, which means the model can "make" a person say words they never actually said.
The model builds on EnCodec, a technology developed by Meta that provides efficient audio signal compression. Unlike traditional text-to-speech methods, VALL-E does not construct sound waves directly. Instead, it analyzes the characteristics of a person's speech, breaks this data into separate components (so-called "tokens"), and generates a recording based on what it already "knows" about the sample: it models how the voice might sound outside the three-second sample. The model was trained on LibriLight, an audio library compiled by Meta that contains 60,000 hours of English speech from more than 7,000 speakers, drawn mainly from the LibriVox collection.
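To make the idea of audio "tokens" more concrete, here is a minimal sketch of turning a waveform into a sequence of integers and back. It is a toy illustration only: it uses plain uniform quantization, not EnCodec's learned neural codec and not VALL-E's actual pipeline, and the names `tokenize`, `detokenize`, `SAMPLE_RATE` and `LEVELS` are hypothetical, chosen for this example.

```python
import math

LEVELS = 256        # assumed vocabulary size for this toy example
SAMPLE_RATE = 8000  # assumed toy sample rate in Hz, not a VALL-E setting


def tokenize(samples, levels=LEVELS):
    """Map each sample in [-1.0, 1.0] to an integer token in [0, levels - 1]."""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp out-of-range samples
        tokens.append(min(levels - 1, int((s + 1.0) / 2.0 * levels)))
    return tokens


def detokenize(tokens, levels=LEVELS):
    """Map each token back to the centre of its quantization bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]


# A stand-in for a three-second voice prompt: a 220 Hz sine wave.
prompt = [
    0.8 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
    for n in range(3 * SAMPLE_RATE)
]

prompt_tokens = tokenize(prompt)
reconstructed = detokenize(prompt_tokens)

# The round trip loses at most half a bin width (1 / LEVELS) per sample.
max_error = max(abs(a - b) for a, b in zip(prompt, reconstructed))
print(len(prompt_tokens), "tokens, max reconstruction error:", max_error)
```

Once audio is a sequence of discrete symbols like this, generating speech can be framed as predicting the next tokens given the text and the tokens of the short sample, which is roughly the framing the article describes. A real codec such as EnCodec learns a far more compact representation than the one-token-per-sample scheme used here.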
In the samples presented on the project website, the "Speaker Prompt" column contains the speech samples; the "Ground Truth" column contains a recording of the required text spoken by the same person who provided the sample; "Baseline" is the output of a traditional text-to-speech converter; and "VALL-E" is the output of the new AI model. The neural network can also produce several variants of the required text in the voice from the sample. The creators of the system added that it not only gives the voice in the generated recording the appropriate emotional tone, but also imitates the "acoustic environment" of the sample: if the original recording came from a telephone conversation, the result will also sound like a telephone conversation.
Because of the risk that the technology could be abused, Microsoft has not published the VALL-E code, so those who want to test the model themselves cannot do so. The company added that it would do the same with other projects that carry a potential threat of abuse.