FastVC: Fast Voice Conversion for non-parallel data

Abstract

This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC). The proposed model can convert speech of arbitrary length from multiple source speakers to multiple target speakers. FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all. This model's latent representation is shown to be speaker-independent and similar to phonemes, which is a desirable feature for VC systems. While current VC systems primarily focus on achieving the highest overall speech quality, this paper tries to balance speech quality against the resources needed to run the system. Despite its simple structure, the proposed model outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.

Resources

Paper at arXiv

INTERSPEECH 2020 slides

Master Thesis

Audio samples

The Voice Conversion Challenge has two tasks: English-English conversion (first 4 source-target pairs) and cross-lingual conversion (last 4 source-target pairs). For each source-target pair we provide the conversion obtained with 5 different systems:

VCC Baseline: ASR+TTS: Model provided by the VCC organizers. It transcribes the input audio to text (ASR) and synthesizes it in the style of the target speaker (TTS).
VCC Baseline: CycleVAE: Model provided by the VCC organizers. It relies on a bottleneck to discard unwanted information.
autovc: The model that we took as a starting point.
fastvc: The model that we submitted to the VCC (only for the second task: last 4 source-target pairs).
fastvc-neck32-freq16: Our model, but with autovc's bottleneck hyperparameters.
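
To give an intuition for what the bottleneck hyperparameters in the last system name control, here is a minimal sketch of a temporal bottleneck. It assumes that "neck32" means 32 channels per latent frame and "freq16" means that one latent frame is kept every 16 frames; this reading is an assumption, and the function names, sizes and sampling scheme below are hypothetical simplifications, not the actual autovc or fastvc code.

```python
from typing import List

Frame = List[float]  # one latent frame: a vector of bottleneck channels


def downsample_codes(frames: List[Frame], freq: int) -> List[Frame]:
    """Keep one frame out of every `freq` frames (the temporal bottleneck)."""
    if freq <= 0:
        raise ValueError("freq must be positive")
    return frames[::freq]


def upsample_codes(codes: List[Frame], freq: int, length: int) -> List[Frame]:
    """Repeat each kept code `freq` times, then trim to the original length."""
    repeated = [code for code in codes for _ in range(freq)]
    return repeated[:length]


# Hypothetical sizes, assumed from the system name "neck32-freq16".
NECK, FREQ, N_FRAMES = 32, 16, 128

# Dummy encoder output: frame t holds the value t in every channel.
frames = [[float(t)] * NECK for t in range(N_FRAMES)]

codes = downsample_codes(frames, FREQ)           # what passes through the bottleneck
restored = upsample_codes(codes, FREQ, N_FRAMES)  # what the decoder would receive

print(len(codes), len(restored))
```

A narrower or sparser bottleneck lets less information through, which is how this family of autoencoders tries to drop speaker identity while keeping the spoken content.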

When evaluating the samples, take into account the quality of the converted audio and its similarity to the target speaker. Moreover, the uttered words should be the same as those of the source speaker. Please use headphones and listen to the audio samples in a quiet environment if possible.

Description	Source speaker (content)	Target speaker (style)	Conversion
English Female - English Female			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16
English Male - English Male			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16
English Male - English Female			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16
English Female - English Male			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16
English Female - Mandarin Male			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16
English Male - Mandarin Female			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16
English Male - Finnish Female			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16
English Female - German Male			VCC Baseline: ASR+TTS
			VCC Baseline: CycleVAE
			autovc
			fastvc
			fastvc-neck32-freq16