A neural voice cloning system: comparing two methods on naturalness and similarity
Baidu researchers recently published a paper describing two methods that can synthesize natural-sounding, highly similar speech from only a few seconds of sample audio. Many speech synthesis methods in recent years have pursued quality, but few can work from such limited data.
Voice cloning is a highly desirable capability for personalized voice interaction, and speech synthesis systems based on neural networks can already generate high-quality speech for large numbers of speakers. In this paper, Baidu's researchers describe a neural voice cloning system that needs only a small number of speech samples to synthesize lifelike speech.
Two approaches are studied: speaker adaptation and speaker encoding. The final results show that both methods perform well in terms of the naturalness and speaker similarity of the generated speech.
Because the researchers clone a voice from a limited number of samples of an unfamiliar speaker, the task amounts to a "few-shot generative modeling" problem for speech. Given enough samples, a generative model can be trained for any target speaker. Few-shot generation, however attractive it sounds, is challenging: the model must learn the speaker's characteristics from very little information and then generate entirely new speech in that voice.
For voice cloning, the plan is a multi-speaker generative model f(t_{i,j}; W, e_{s_i}), where t_{i,j} denotes a text and s_i a speaker. W holds the trainable parameters of the encoder and decoder, and e_{s_i} is the trainable embedding of speaker s_i. W and e_{s_i} are optimized by minimizing a loss function L that penalizes the difference between generated audio and real audio:

min_{W, e} E_{s_i ~ S, (t_{i,j}, a_{i,j}) ~ T_{s_i}} [ L(f(t_{i,j}; W, e_{s_i}), a_{i,j}) ]

Here S is the set of speakers, T_{s_i} is the text-audio training set for speaker s_i, and a_{i,j} is the ground-truth audio for text t_{i,j}. The expectation is estimated over all training text-audio pairs.
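As a toy illustration of this objective (not the paper's actual architecture), the sketch below uses a scalar linear "generator" as a stand-in for f and stochastic gradient descent to jointly fit the shared parameter W and the per-speaker embeddings e_s; all names and numbers are hypothetical.

```python
import random

random.seed(0)

# Toy stand-in for the multi-speaker objective:
#   min_{W, e}  E_{s, (t, a)} [ L(f(t; W, e_s), a) ]
# with f(t; W, e_s) = W * t + e_s (a scalar "generator") and
# L the squared error; each speaker s has one trainable embedding e_s.

W = 0.0                                  # shared generator parameter
emb = {"s0": 0.0, "s1": 0.0}             # per-speaker embeddings e_s
data = {                                 # (text t, audio a) pairs per speaker
    "s0": [(1.0, 1.5), (2.0, 2.5)],      # underlying rule: a = 1.0*t + 0.5
    "s1": [(1.0, 0.0), (2.0, 1.0)],      # underlying rule: a = 1.0*t - 1.0
}

lr = 0.05
for step in range(2000):
    s = random.choice(list(data))        # sample a speaker
    t, a = random.choice(data[s])        # sample one of their (text, audio) pairs
    err = (W * t + emb[s]) - a           # f(t; W, e_s) - a
    W -= lr * 2 * err * t                # gradient step on the shared weights
    emb[s] -= lr * 2 * err               # gradient step on this speaker's embedding

print(round(W, 2), round(emb["s0"], 2), round(emb["s1"], 2))
```

After training, W recovers the shared structure while each embedding captures that speaker's individual offset, which is exactly the division of labor the objective encourages.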
In voice cloning, the goal is to extract the voice characteristics of an unseen speaker s_k from a set of cloning audio samples A_{s_k}, and then to use that voice to generate different audio. Two criteria measure the result: whether the generated speech is natural, and whether the generated voice is similar to the original audio.
The figure below summarizes the two voice cloning approaches. Speaker adaptation uses gradient descent to fine-tune a trained text-to-speech model on a small number of audio samples and their transcripts; the fine-tuning can be applied to the speaker embedding alone or to the entire model. Speaker encoding, in contrast, estimates the speaker embedding directly from the speaker's audio samples. This model needs no fine-tuning during cloning, so it can be applied to any unseen speaker.
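The difference between the two strategies can be sketched with a toy generator (all names and values hypothetical): adaptation iterates gradient steps on the cloning samples, while encoding produces the embedding in a single pass with no fine-tuning loop.

```python
# Toy contrast of the two cloning strategies (hypothetical setup).
# A pretrained "generator" maps (text t, embedding e) -> audio: f = W*t + e.
W = 1.0                                   # frozen shared weights after training

# A few cloning samples from an unseen speaker whose true offset is 0.7:
samples = [(1.0, 1.7), (2.0, 2.7), (3.0, 3.7)]   # (text, audio) pairs

# 1) Speaker adaptation: start from a fresh embedding and fine-tune it
#    with gradient descent on the cloning samples (model weights frozen).
e_adapt, lr = 0.0, 0.1
for _ in range(200):
    for t, a in samples:
        err = (W * t + e_adapt) - a
        e_adapt -= lr * 2 * err           # update only the embedding

# 2) Speaker encoding: a (here trivial) encoder predicts the embedding
#    directly from the samples in one shot -- no fine-tuning loop.
e_enc = sum(a - W * t for t, a in samples) / len(samples)

print(round(e_adapt, 3), round(e_enc, 3))  # -> 0.7 0.7
```

Both routes land on the same embedding here; the practical trade-off is that adaptation spends cloning-time compute on optimization, while encoding front-loads the cost into training an encoder.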
Evaluating voice cloning: the cloned speech can be evaluated by humans through crowdsourcing platforms, but during model development that process is slow and expensive. The researchers therefore propose two evaluation methods based on discriminative models.
Speaker classification: a speaker classifier determines which speaker an audio sample comes from. For voice cloning evaluation, the classifier is trained on the set of speakers used for cloning; high-quality voice cloning should yield high classification accuracy.
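A minimal sketch of this kind of evaluation, assuming utterances have already been mapped to small feature vectors (all data hypothetical): a nearest-centroid classifier is fit on real speakers, and cloned utterances are checked against it.

```python
# Toy speaker-classification evaluation (hypothetical 2-D features).
train = {
    "alice": [(0.9, 0.1), (1.1, -0.1)],   # real utterances per speaker
    "bob":   [(-1.0, 0.2), (-0.8, 0.0)],
}
# One centroid per speaker: the mean of that speaker's feature vectors.
centroids = {
    s: tuple(sum(v[i] for v in vs) / len(vs) for i in range(2))
    for s, vs in train.items()
}

def classify(x):
    # Assign x to the speaker whose centroid is closest (squared distance).
    return min(centroids,
               key=lambda s: sum((a - b) ** 2 for a, b in zip(x, centroids[s])))

# Cloned utterances labeled with the speaker they were cloned from:
cloned = [((1.0, 0.0), "alice"), ((-0.9, 0.1), "bob"), ((0.8, 0.2), "alice")]
accuracy = sum(classify(x) == y for x, y in cloned) / len(cloned)
print(accuracy)  # -> 1.0: good clones land near their speaker's centroid
```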
Speaker verification: speaker verification measures the similarity of voices. Specifically, it performs a binary classification test of whether a generated audio sample and a reference audio sample come from the same speaker.
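A minimal sketch of how an equal error rate (EER), the verification metric reported later in the article, can be computed from hypothetical similarity scores:

```python
# Toy equal-error-rate computation for speaker verification.
# Scores are hypothetical similarity values; higher = "same speaker".
genuine  = [0.9, 0.8, 0.7, 0.55]   # cloned vs. same-speaker reference
impostor = [0.6, 0.4, 0.3, 0.2]    # cloned vs. different-speaker reference

def eer(genuine, impostor):
    # Sweep decision thresholds; the EER is the operating point where the
    # false-accept rate (FAR) and false-reject rate (FRR) are closest.
    best = (2.0, None)
    for thr in sorted(genuine + impostor):
        far = sum(s >= thr for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < thr for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

print(eer(genuine, impostor))  # -> 0.25
```

A lower EER means cloned audio is harder to tell apart from genuine audio of the same speaker.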
In the experiments, the two approaches (speaker adaptation and speaker encoding) were compared on voice cloning. For speaker adaptation, a multi-speaker generative model was trained and then lightly fine-tuned toward the target speaker. For speaker encoding, a multi-speaker generative model and a speaker encoder were trained together; the encoder's embedding is fed into the multi-speaker model to generate speech in the target voice.
Both methods were trained on the LibriSpeech dataset, which contains audio from 2,484 speakers, about 820 hours in total, sampled at 16 kHz. LibriSpeech was built for automatic speech recognition, so its audio quality is lower than that of typical speech synthesis datasets. Voice cloning is performed on the VCTK dataset, which contains audio from 108 native English speakers with different accents. To stay consistent with LibriSpeech, the VCTK audio samples were downsampled to 16 kHz.
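A naive sketch of that downsampling step, assuming 48 kHz source audio (the rate VCTK is distributed at); a real pipeline would use a proper anti-aliasing resampler such as scipy.signal.resample_poly, so this is only illustrative.

```python
# Naive 48 kHz -> 16 kHz downsampling sketch (factor of 3).
# A crude 3-tap moving average stands in for a real anti-aliasing
# low-pass filter before decimation.
def downsample_48k_to_16k(signal):
    return [
        sum(signal[i:i + 3]) / 3     # average each block of 3 samples
        for i in range(0, len(signal) - 2, 3)
    ]

one_second_48k = [0.0] * 48000       # one second of 48 kHz audio
print(len(downsample_48k_to_16k(one_second_48k)))  # -> 16000
```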
The figure below summarizes the requirements of the two approaches, speaker adaptation and speaker encoding, for voice cloning.
Inference for the speaker adaptation method was benchmarked on a Titan X GPU. The figures below show the results: classification accuracy against the number of cloning samples and the number of fine-tuning iterations; a comparison of speaker adaptation and speaker encoding on classification accuracy under different numbers of cloning samples; and the speaker verification equal error rate (EER) under different numbers of cloning samples.
The two tables below show the results of human evaluation; both indicate that the more cloning audio is available, the better the speaker adaptation approach performs.
In conclusion, the researchers demonstrated two methods that can synthesize natural speech resembling a target speaker from only a few audio samples. They believe voice cloning still has room to improve: advances in meta-learning should benefit the field, for example by integrating speaker adaptation or speaker encoding into the training procedure, or by inferring model weights in ways more flexible than a speaker embedding alone.