This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC). The proposed model can convert speech of arbitrary length from multiple source speakers to multiple target speakers. FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all. The model's latent representation is shown to be speaker-independent and similar to phonemes, which is a desirable feature for VC systems. While current VC systems primarily focus on achieving the highest overall speech quality, this paper tries to balance speech quality against the resources needed to run the system. Despite its simple structure, the proposed model outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.
The Voice Conversion Challenge has two tasks: English-English conversion (first 4 source-target pairs) and cross-lingual conversion (last 4 source-target pairs). For each source-target pair we provide the conversion obtained with 5 different systems, listed in the table below.
For the evaluation, take into account the quality of the converted audio and its similarity to the target speaker. Moreover, the uttered words should be the same as those of the source speaker. Please use headphones and listen to the audio samples in a quiet environment if possible.
Description | Source speaker (content) | Target speaker (style) | Conversion | |
---|---|---|---|---|
English Female - English Female | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
English Male - English Male | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
English Male - English Female | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
English Female - English Male | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
English Female - Mandarin Male | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
English Male - Mandarin Female | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
English Male - Finnish Female | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
English Female - German Male | | | VCC Baseline: ASR+TTS | |
| | | | VCC Baseline: CycleVAE | |
| | | | autovc | |
| | | | fastvc | |
| | | | fastvc-neck32-freq16 | |
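
For readers who want to organise their own listening notes, the grid of samples above can be enumerated programmatically. The following is only a minimal sketch: the pair and system labels are copied from the table, while the names `PAIRS`, `SYSTEMS`, `task_of`, and `build_trials` are hypothetical helpers introduced here for illustration and are not part of FastVC or the challenge tooling.

```python
from itertools import product

# Source-target pairs in the order they appear in the table (source, target).
PAIRS = [
    ("English Female", "English Female"),
    ("English Male", "English Male"),
    ("English Male", "English Female"),
    ("English Female", "English Male"),
    ("English Female", "Mandarin Male"),
    ("English Male", "Mandarin Female"),
    ("English Male", "Finnish Female"),
    ("English Female", "German Male"),
]

# The five systems compared for every pair, as labelled in the table.
SYSTEMS = [
    "VCC Baseline: ASR+TTS",
    "VCC Baseline: CycleVAE",
    "autovc",
    "fastvc",
    "fastvc-neck32-freq16",
]


def task_of(pair_index):
    """Return the task name: the first 4 pairs are English-English, the last 4 cross-lingual."""
    return "English-English" if pair_index < 4 else "cross-lingual"


def build_trials():
    """Build one record per (pair, system) combination, grouped by pair."""
    trials = []
    for (idx, (source, target)), system in product(enumerate(PAIRS), SYSTEMS):
        trials.append(
            {
                "task": task_of(idx),
                "source": source,
                "target": target,
                "system": system,
            }
        )
    return trials


if __name__ == "__main__":
    trials = build_trials()
    print(len(trials))  # 8 pairs x 5 systems = 40 conversions
```

Each record can then be extended with per-sample ratings (quality, similarity, content preservation) following the evaluation criteria stated above.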