Dispatch-Aware Ragged Attention for Pruned Vision Transformers
arXiv cs.LG · April 20, 2026
This paper investigates the dispatch-overhead bottleneck that prevents token pruning from fully realizing latency reductions in Vision Transformers (ViTs). It proposes a lightweight Triton attention kernel with a lower dispatch floor, achieving up to 2.24x end-to-end throughput for pruned ViTs.
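To illustrate why a dispatch floor can mask the benefit of token pruning, here is a toy latency model. It is a sketch under stated assumptions, not the paper's measurement or method: the function names, the fixed-plus-quadratic cost form, and every constant (floor values, per-pair cost, token counts) are hypothetical and chosen only to show the shape of the effect.

```python
def attention_latency_us(num_tokens, dispatch_floor_us, us_per_token_pair=1e-4):
    """Toy model (assumption): a fixed per-call dispatch cost plus
    compute that grows with the square of the token count."""
    return dispatch_floor_us + us_per_token_pair * num_tokens ** 2


def pruning_speedup(full_tokens, kept_tokens, dispatch_floor_us):
    """Ratio of unpruned to pruned latency under the toy model."""
    full = attention_latency_us(full_tokens, dispatch_floor_us)
    pruned = attention_latency_us(kept_tokens, dispatch_floor_us)
    return full / pruned


# Hypothetical setting: 196 patch tokens, half of them pruned away.
FULL_TOKENS, KEPT_TOKENS = 196, 98

# Hypothetical dispatch floors in microseconds, not measured values.
high_floor = pruning_speedup(FULL_TOKENS, KEPT_TOKENS, dispatch_floor_us=40.0)
low_floor = pruning_speedup(FULL_TOKENS, KEPT_TOKENS, dispatch_floor_us=5.0)

print(f"speedup with high dispatch floor: {high_floor:.2f}x")
print(f"speedup with low dispatch floor:  {low_floor:.2f}x")
```

In this model the fixed dispatch term dominates at small token counts, so halving the tokens barely changes the per-call latency until the floor itself is lowered.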