Recently, research on any-to-any voice conversion (VC) has developed rapidly. However, existing systems often suffer from unsatisfactory quality and require two-stage training, in which a spectrum generation step is indispensable. In this paper, we propose FlowCPCVC, a system that achieves higher speech naturalness and timbre similarity. To our knowledge, FlowCPCVC is the first one-stage training system for the any-to-any task, built on a VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding (CPC) based content extractor to guide the flow module to discard timbre while keeping linguistic information. Our method incorporates the vocoder directly into training, thus avoiding the loss of spectral information incurred by two-stage training. Training on the any-to-any task also yields robust results when the system is applied to any-to-many conversion. Experiments show that FlowCPCVC achieves a clear improvement over VQMIVC, the current state-of-the-art any-to-any voice conversion system.
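To make the components named in the abstract concrete, below is a minimal PyTorch sketch of the speaker encoder and the CPC-based content extractor with an InfoNCE-style contrastive loss. This is our reading of the abstract, not the authors' implementation: the module names, dimensions, and one-step prediction loss are illustrative assumptions, and the flow module and jointly trained vocoder are only indicated in comments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    # Extracts a fixed-size timbre embedding from a mel spectrogram
    # (sizes are hypothetical; the abstract does not specify them).
    def __init__(self, n_mels=80, d_spk=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d_spk, batch_first=True)

    def forward(self, mel):              # mel: (B, T, n_mels)
        _, h = self.rnn(mel)
        return h[-1]                     # (B, d_spk) timbre vector

class CPCContentEncoder(nn.Module):
    # CPC-style content extractor: a local encoder plus an autoregressive
    # context network whose features guide the flow module toward
    # timbre-free linguistic content.
    def __init__(self, n_mels=80, d_z=256):
        super().__init__()
        self.enc = nn.Conv1d(n_mels, d_z, kernel_size=3, padding=1)
        self.ar = nn.GRU(d_z, d_z, batch_first=True)
        self.pred = nn.Linear(d_z, d_z)  # predicts z_{t+1} from context c_t

    def forward(self, mel):              # mel: (B, T, n_mels)
        z = self.enc(mel.transpose(1, 2)).transpose(1, 2)  # (B, T, d_z)
        c, _ = self.ar(z)                                  # (B, T, d_z)
        return z, c

    def infonce(self, z, c):
        # Contrastive predictive loss: match the prediction from c_t with
        # the true z_{t+1}; all other frames in the batch act as negatives.
        pred = self.pred(c[:, :-1]).reshape(-1, c.size(-1))
        target = z[:, 1:].reshape(-1, z.size(-1))
        logits = pred @ target.t()
        labels = torch.arange(logits.size(0), device=z.device)
        return F.cross_entropy(logits, labels)

# In the full one-stage system, a conditional flow (conditioned on the
# speaker embedding) would map the spectrum to a content latent trained to
# agree with these CPC features, and the vocoder would be trained jointly
# so no spectral information is lost between stages.
mel = torch.randn(4, 100, 80)            # toy batch: 4 clips, 100 frames
spk = SpeakerEncoder()(mel)              # one timbre embedding per clip
content = CPCContentEncoder()
z, c = content(mel)
loss = content.infonce(z, c)             # contrastive content loss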
FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments with Attention.
MediumVC: Any-to-Any Voice Conversion Using Synthetic Specific-Speaker Speeches as Intermedium Features.
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion. Disong Wang et al.
FlowCPCVC (ours)
Any-to-Any Task
Source and target speakers come from the VCTK test set; all speakers are unseen during training.
Male-to-Female
[Audio samples: source | target | FragmentVC | MediumVC | VQMIVC | FlowCPCVC]
Female-to-Male
[Audio samples: source | target | FragmentVC | MediumVC | VQMIVC | FlowCPCVC]
Female-to-Female or Male-to-Male
[Audio samples: source | target | FragmentVC | MediumVC | VQMIVC | FlowCPCVC]
Target audios come from LibriTTS, a different dataset from VCTK.
Target speakers come from LibriTTS, while source speakers come from the VCTK test set. All speakers are unseen during training; the model is trained only on VCTK.
[Audio samples: source | target | FragmentVC | MediumVC | VQMIVC | FlowCPCVC]
Results of Emotional Voice Conversion
The source audios are highly emotional. They come from LibriTTS or other datasets rather than VCTK, and the target audios come from LibriTTS. Both are unseen during training.
[Audio samples: source | target | FragmentVC | MediumVC | VQMIVC | FlowCPCVC]
Any-to-Many
Source audios come from LibriTTS, and target timbres come from VCTK speakers included in the training set.