RESEARCH · arXiv CS.CL · 5/11/2026
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end (E2E) spoken language model (SLM) that supports both role-playing and singing generation. It adopts a hybrid speech-text paradigm with multi-codebook audio tokens, was trained on 15.8K hours of data, and outperforms other SLMs in expressiveness.