VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
arXiv cs.CL · May 11, 2026
VITA-QinYu is presented as the first expressive end-to-end (E2E) spoken language model (SLM) that supports both role-playing and singing generation. It adopts a hybrid speech-text paradigm with multi-codebook audio tokens, was trained on 15.8K hours of data, and is reported to outperform other SLMs in expressiveness.