Voice Conversion using Perturbation AutoVC and Adaptive Instance Normalization

Alaa, Yasmin; M. Aref, Mostafa; Alfonse, Marco

doi:10.21608/ijicis.2025.388149.1396

Document Type : Original Article

Authors

¹ Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

² Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

Abstract

Perturbation AutoVC is derived from AutoVC, an autoencoder that performs Voice Conversion (VC) using non-parallel data and is trained with a self-reconstruction loss only. Although AutoVC is simple, it depends on a bottleneck layer to separate the linguistic content from the speaker information, and this bottleneck is a known problem. Perturbation AutoVC was introduced to address it by removing the need for the bottleneck layer in the VC task. In this paper, we build on Perturbation AutoVC, which achieves promising results, and change how the speaker information is conditioned: instead of channel-wise concatenation, we use a normalization technique called Adaptive Instance Normalization (AdaIN). We set up two experiments, seen-to-seen (many-to-many) VC and zero-shot (any-to-any) VC, to compare our proposed model with Perturbation AutoVC. We use the VCTK corpus (training and testing) and the LibriTTS dataset (testing). In the seen-to-seen setting, our proposed model and Perturbation AutoVC achieve a d-vector cosine similarity of 0.65 and 0.64, a Mean Opinion Score (MOSNet) of 3.48 and 3.32, a Character Error Rate (CER) of 0.09 and 0.13, and a Word Error Rate (WER) of 0.15 and 0.22, respectively. In the zero-shot (any-to-any) setting, our proposed model and Perturbation AutoVC achieve a MOSNet score of 3.40 and 3.16, a d-vector cosine similarity of 0.60 and 0.59, a CER of 0.065 and 0.073, and a WER of 0.11 and 0.12, respectively.
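
For readers unfamiliar with AdaIN, the following is a minimal, dependency-free sketch of the general operation: each channel of the content features is normalized over time to zero mean and unit variance, then rescaled and shifted by per-channel parameters that would be derived from the speaker embedding. The function name `adain`, the toy feature values, and the fixed `gamma` and `beta` values are illustrative assumptions; the abstract does not specify how the proposed model computes these parameters or where AdaIN is applied in the network.

```python
from statistics import fmean, pstdev


def adain(content, gamma, beta, eps=1e-5):
    """Apply Adaptive Instance Normalization to content features.

    content: list of channels, each a list of per-frame values.
    gamma, beta: one scale and one shift per channel. In a VC model these
    would be predicted from the speaker embedding (assumed here, not shown).
    """
    out = []
    for channel, g, b in zip(content, gamma, beta):
        mu = fmean(channel)      # per-channel mean over time
        sigma = pstdev(channel)  # per-channel (population) std over time
        # Remove the channel's own statistics, then impose the speaker-derived ones.
        out.append([g * (x - mu) / (sigma + eps) + b for x in channel])
    return out


# Toy example: 2 channels, 4 frames (hypothetical values).
content = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 9.5, 10.0]]
gamma = [2.0, 0.5]   # stand-in for a speaker-dependent scale
beta = [1.0, -1.0]   # stand-in for a speaker-dependent shift

styled = adain(content, gamma, beta)
```

After this operation, each channel's mean equals the corresponding `beta` and its standard deviation is approximately the corresponding `gamma`. Channel-wise concatenation, by contrast, appends the speaker embedding to the features as extra channels and leaves the content statistics untouched.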

Keywords