VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

arXiv cs.CL · May 11, 2026

VITA-QinYu is presented as the first expressive end-to-end (E2E) spoken language model (SLM) to support both role-playing and singing generation. It adopts a hybrid speech-text paradigm with multi-codebook audio tokens and was trained on 15.8K hours of data; the authors report that it outperforms other SLMs in expressiveness.

role-playing, expressive AI, speech synthesis, spoken language model, singing generation
