FastVC: Fast Voice Conversion for non-parallel data


Abstract

This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC). The proposed model can convert speech of arbitrary length from multiple source speakers to multiple target speakers. FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all. The model's latent representation is shown to be speaker-independent and similar to phonemes, which is a desirable feature for VC systems. While current VC systems primarily focus on achieving the highest overall speech quality, this paper tries to balance speech quality against the resources needed to run the system. Despite its simple structure, the proposed model outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.

Resources


Audio samples

The Voice Conversion Challenge has two tasks: English-English conversion (first 4 source-target pairs) and cross-lingual conversion (last 4 source-target pairs). For each source-target pair we provide the conversion obtained with 5 different systems:


  1. VCC Baseline: ASR+TTS: Model provided by the VCC organizers. It transcribes the input audio to text (ASR) and synthesizes that text in the style of the target speaker (TTS).
  2. VCC Baseline: CycleVAE: Model provided by the VCC organizers. It relies on a bottleneck to discard unwanted information.
  3. autovc: The model that we took as a starting point.
  4. fastvc: The model that we used for our VCC submission (only for the second task: the last 4 source-target pairs).
  5. fastvc-neck32-freq16: Our model, but with autovc's bottleneck hyperparameters.
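To give a rough idea of what a bottleneck setting such as "neck32-freq16" could control, here is a minimal sketch. It assumes, following common AutoVC-style naming, that "neck" is the number of channels in each latent frame and "freq" is the temporal downsampling factor; that reading is an assumption and is not confirmed on this page. The function names (`downsample`, `upsample`, `bottleneck_size`) are hypothetical, and the code is a simplified illustration, not the authors' implementation.

```python
from typing import List

Frame = List[float]  # one latent frame with `neck` channels


def downsample(code: List[Frame], freq: int) -> List[Frame]:
    """Keep one latent frame every `freq` time steps (temporal bottleneck)."""
    if freq < 1:
        raise ValueError("freq must be a positive integer")
    return code[::freq]


def upsample(code: List[Frame], freq: int, length: int) -> List[Frame]:
    """Repeat each kept frame `freq` times, then trim to `length` frames."""
    restored: List[Frame] = []
    for frame in code:
        restored.extend([frame] * freq)
    return restored[:length]


def bottleneck_size(num_frames: int, neck: int, freq: int) -> int:
    """Count the latent values that pass through the bottleneck."""
    kept_frames = -(-num_frames // freq)  # ceiling division
    return kept_frames * neck


# Toy utterance: 40 frames with 32 channels each (assumed neck=32, freq=16).
code = [[float(t)] * 32 for t in range(40)]
small = downsample(code, 16)
restored = upsample(small, 16, len(code))
```

Under these assumptions, a smaller "neck" or a larger "freq" lets fewer values through, which is how a bottleneck can force the model to discard information such as speaker identity. The trade-off is that a bottleneck that is too narrow may also discard content.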

For the evaluation, you should take into account the quality of the converted audio and its similarity to the target speaker. Moreover, the uttered words should be the same as those of the source speaker. Please use headphones and listen to the audio samples in a quiet environment if possible.


| Description | Source speaker (content) | Target speaker (style) | Conversion |
|---|---|---|---|
| English Female - English Female | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |
| English Male - English Male | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |
| English Male - English Female | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |
| English Female - English Male | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |
| English Female - Mandarin Male | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |
| English Male - Mandarin Female | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |
| English Male - Finnish Female | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |
| English Female - German Male | | | VCC Baseline: ASR+TTS |
| | | | VCC Baseline: CycleVAE |
| | | | autovc |
| | | | fastvc |
| | | | fastvc-neck32-freq16 |