singing generation — AI articles, news & research

RESEARCH · arXiv cs.CL · 5/11/2026

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

VITA-QinYu is presented as the first expressive end-to-end (E2E) spoken language model (SLM) to support both role-playing and singing generation. It adopts a hybrid speech-text paradigm with multi-codebook audio tokens, was trained on 15.8K hours of data, and is reported to outperform other SLMs in expressiveness.

role-playing expressive AI speech synthesis spoken language model