
WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

arXiv CS.CL · April 13, 2026

WAND is a framework for adapting pretrained autoregressive text-to-speech (AR-TTS) models to run with constant computational and memory complexity. It splits attention into a global mechanism and a local sliding-window mechanism, and it uses curriculum learning and knowledge distillation to preserve high-fidelity speech synthesis while substantially reducing KV cache memory.
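
To make the global-plus-local split concrete, here is a minimal sketch of a causal attention mask that combines a fixed set of always-visible positions with a sliding window over recent positions. This is not the paper's implementation: the summary does not say which tokens are global or how large the window is, so treating the first `n_global` positions as global, along with the names `windowed_causal_mask`, `max_attended_keys`, `n_global`, and `window`, is an assumption made purely for illustration.

```python
def windowed_causal_mask(seq_len, n_global, window):
    """Return mask[q][k] == True where query position q may attend to key k.

    Assumption (not from the source): the first `n_global` positions are
    treated as global and stay visible to every later query; all other keys
    are visible only if they fall within the last `window` positions.
    """
    return [
        [k <= q and (k < n_global or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def max_attended_keys(mask):
    """Largest number of keys any single query attends to.

    This is a proxy for how many key/value entries must stay cached.
    """
    return max(sum(row) for row in mask)


# Hypothetical sizes chosen only to show the effect.
short = windowed_causal_mask(seq_len=12, n_global=3, window=4)
long = windowed_causal_mask(seq_len=100, n_global=3, window=4)
full = windowed_causal_mask(seq_len=12, n_global=0, window=12)  # plain causal

print(max_attended_keys(short), max_attended_keys(long), max_attended_keys(full))
```

Under these assumptions, the number of keys a query can see is capped at `n_global + window` regardless of sequence length, whereas plain causal attention grows with the sequence. That bound is the intuition behind a fixed-size KV cache; the curriculum learning and distillation steps mentioned above are not modeled here.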
