FlowCPCVC: A contrastive predictive coding supervised flow framework for any-to-any voice conversion

Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang
Bigo Technology PTE. LTD, Singapore

Abstract

Recently, research on any-to-any voice conversion (VC) has developed rapidly. However, existing systems often suffer from unsatisfactory quality and require two-stage training, in which a spectrum generation process is indispensable. In this paper, we propose the FlowCPCVC system, which achieves higher speech naturalness and timbre similarity. To our knowledge, FlowCPCVC is the first one-stage training system for the any-to-any task, taking advantage of a VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding (CPC) based content extractor to guide the flow module to discard timbre while keeping linguistic information. Our method incorporates the vocoder directly into training, thus avoiding the loss of spectral information incurred by two-stage training. Trained for the any-to-any task, our method also produces robust results when used for any-to-many conversion. Experiments show that FlowCPCVC achieves a clear improvement over VQMIVC, the current state-of-the-art any-to-any voice conversion system.
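Since the paper itself is not reproduced on this page, the snippet below is only a minimal sketch of a CPC-style InfoNCE objective of the kind the abstract says supervises the content extractor. The function name, shapes, and the linear predictor are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CPC-style InfoNCE loss (illustrative assumptions,
# not the authors' code). The content extractor is trained to predict
# future encoded frames, with other items in the batch as negatives.
import torch
import torch.nn.functional as F


def cpc_infonce_loss(context, future, predictor):
    """Contrastive predictive coding loss for one prediction step.

    context:   (batch, dim) summary of past frames from an autoregressive net
    future:    (batch, dim) encoded frame that actually follows each context
    predictor: linear map projecting the context into the future-frame space
    """
    pred = predictor(context)          # (batch, dim) predicted future frames
    # Score every (prediction, candidate) pair; the diagonal holds the
    # positives, the off-diagonal entries act as in-batch negatives.
    logits = pred @ future.t()         # (batch, batch) similarity scores
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


# Usage sketch with random tensors standing in for encoder outputs.
batch, dim = 8, 256
predictor = torch.nn.Linear(dim, dim)
loss = cpc_infonce_loss(torch.randn(batch, dim), torch.randn(batch, dim), predictor)
loss.backward()
```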

Paper

Compared systems

FragmentVC, MediumVC, VQMIVC

Any-to-Any task

Source and target speakers come from the VCTK test set; all speakers are unseen during training.

Male-to-female

source target FragmentVC MediumVC VQMIVC FlowCPCVC

Female-to-male

source target FragmentVC MediumVC VQMIVC FlowCPCVC

Female-to-female and male-to-male

source target FragmentVC MediumVC VQMIVC FlowCPCVC

Target audio from LibriTTS, a different dataset from VCTK


Target speakers come from LibriTTS, while source speakers come from the VCTK test set. All speakers are unseen during training; the model is trained only on VCTK.

source target FragmentVC MediumVC VQMIVC FlowCPCVC

Results of emotional voice conversion

The source audio is highly emotional. It comes from LibriTTS or other datasets rather than VCTK, and the target audio comes from LibriTTS. Both are unseen during training.

source target FragmentVC MediumVC VQMIVC FlowCPCVC

Any-to-Many task

Source audio comes from LibriTTS, and the target timbres come from VCTK speakers in the training set.

Some examples of the target timbres

p236 p264 p269 p263 p259 p256

Converted results

source to_p236 to_p264 to_p269 to_p263 to_p259 to_p256