
WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

arXiv cs.CL · April 13, 2026

WAND is a framework for adapting pretrained autoregressive text-to-speech (AR-TTS) models so that they run with constant computational and memory complexity. It splits attention into a global mechanism and a local sliding-window mechanism, and it uses curriculum learning and knowledge distillation to preserve high-fidelity speech synthesis while significantly reducing KV cache memory.
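To make the memory claim concrete, here is a minimal sketch of why pairing a fixed global set with a sliding local window bounds the KV cache. It is an illustration under stated assumptions, not the paper's implementation: it assumes the global part is a fixed-length prefix (for example, conditioning tokens) and the local part is the most recent `window` generated tokens. The names `WindowedKVCache` and `simulate` are hypothetical, and tuples stand in for real key/value tensors.

```python
from collections import deque


class WindowedKVCache:
    """Hypothetical bounded cache: a fixed global prefix plus a sliding local window."""

    def __init__(self, global_entries, window):
        # Assumption: global entries (e.g. conditioning tokens) are never evicted.
        self.global_entries = list(global_entries)
        # deque(maxlen=window) drops the oldest entry automatically once it is full.
        self.local = deque(maxlen=window)

    def append(self, entry):
        """Store the key/value entry produced by one decoding step."""
        self.local.append(entry)

    def visible(self):
        """Entries the next decoding step is allowed to attend to."""
        return self.global_entries + list(self.local)

    def __len__(self):
        return len(self.global_entries) + len(self.local)


def simulate(num_global=4, window=8, steps=100):
    """Run `steps` decoding steps and record the cache size after each one."""
    cache = WindowedKVCache([("g", i) for i in range(num_global)], window)
    sizes = []
    for t in range(steps):
        cache.append(("s", t))  # placeholder for a real (key, value) pair
        sizes.append(len(cache))
    return cache, sizes


if __name__ == "__main__":
    cache, sizes = simulate()
    print("largest cache size:", max(sizes))
    print("oldest local entry still visible:", cache.visible()[4])
```

In this sketch the cache size stops growing once the window fills, at `num_global + window` entries, whereas a full-attention cache would grow by one entry per generated token.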

Knowledge Distillation · Autoregressive Text-to-Speech · Attention Mechanism · Computational Efficiency · Memory Reduction
