Variational Autoencoder-Based Synthesis of Marmoset Vocalizations Using Linear Spectrograms

Yegang Du, Emma Woolgar, Yukiko Kikuchi, Toshimi Ogawa

October 2025

Abstract

Synthetic voice generation for socially assistive robotics requires biologically validated approaches to ensure effective human-robot interaction. This paper presents a Variational Autoencoder (VAE) based system for generating species-specific vocalizations, with behavioral validation in marmosets. Our approach processes linear spectrograms through a symmetric encoder-decoder architecture with Kullback-Leibler (KL) divergence regularization and adaptive KL annealing. The system was trained on 18 marmoset ‘twitter’ calls and validated through controlled behavioral experiments with three adult female marmosets. Generated vocalizations achieved 86.79% Mel-Frequency Cepstral Coefficient (MFCC) similarity to natural calls and had a significant main effect on two marmoset behaviors (stationary behavior: χ² = 11.47, p = 0.04; leg-stand contact behavior: χ² = 12.12, p = 0.03), although the behavioral responses differed from those seen for the equivalent natural call type. Results demonstrate the feasibility of VAE-based vocalization synthesis while highlighting the importance of biological validation for developing emotionally appropriate synthetic voices in assistive robotics applications.

Type

Conference paper

Publication

2025 IEEE Cyber Science and Technology Congress (CyberSciTech)

Yegang Du

Assistant Professor