Synthetic voice generation for socially assistive robotics requires biologically validated approaches to ensure effective human-robot interaction. This paper presents a Variational Autoencoder (VAE) based system for generating species-specific vocalizations, with behavioral validation in marmosets. Our approach processes linear spectrograms through a symmetric encoder-decoder architecture with Kullback-Leibler (KL) divergence regularization and adaptive KL annealing. The system was trained on 18 marmoset ‘twitter’ calls and validated through controlled behavioral experiments with three adult female marmosets. Generated vocalizations achieved 86.79% Mel-Frequency Cepstral Coefficient (MFCC) similarity to natural calls and had a significant main effect on two marmoset behaviors (stationary behavior: χ² = 11.47, p = 0.04; leg-stand contact behavior: χ² = 12.12, p = 0.03), although the behavioral responses differed from those elicited by the equivalent natural call type. These results demonstrate the feasibility of VAE-based vocalization synthesis while highlighting the importance of biological validation for developing emotionally appropriate synthetic voices in assistive robotics applications.
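As a rough illustration of the KL-regularized objective with annealing mentioned above, the following minimal Python sketch combines a reconstruction term with a weighted KL term. It is not the system's implementation: the function names (`gaussian_kl`, `kl_weight`, `vae_loss`), the squared-error reconstruction term, the diagonal-Gaussian posterior with a standard normal prior, and the linear warm-up schedule are all assumptions made for illustration. In particular, the linear warm-up is a simple stand-in for the adaptive schedule, whose details are not given here.

```python
import math

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def kl_weight(step, warmup_steps, max_weight=1.0):
    """Assumed schedule: KL weight rises linearly from 0 to max_weight over warmup_steps."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)

def vae_loss(x, x_hat, mu, logvar, step, warmup_steps):
    """Assumed loss: squared-error reconstruction plus the annealed KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_weight(step, warmup_steps) * gaussian_kl(mu, logvar)
```

Early in training the small KL weight lets the decoder focus on reconstructing the spectrogram; as the weight grows, the latent codes are pulled toward the prior, which is what makes sampling new vocalizations from the latent space meaningful.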