RESEARCHarXiv CS.CL·4/13/2026
WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
WAND introduces a framework to adapt pretrained autoregressive text-to-speech (AR-TTS) models for constant computational and memory complexity. It achieves this by separating attention into global and local sliding-window mechanisms, employing curriculum learning, and utilizing knowledge distillation to maintain high-fidelity speech synthesis with significant KV cache memory reduction.
27