RESEARCH

Dispatch-Aware Ragged Attention for Pruned Vision Transformers

arXiv cs.LG · April 20, 2026

This paper investigates the dispatch-overhead bottleneck that prevents token pruning from fully realizing latency reductions in Vision Transformers (ViTs). It proposes a lightweight Triton attention kernel with a lower dispatch floor, achieving up to 2.24x end-to-end throughput for pruned ViTs.

AI models · deep learning · Performance optimization · attention mechanisms · Vision Transformers
