decoding

3 items

RESEARCH · arXiv CS.CL · 7d ago

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth consumed by reading the Key-Value (KV) cache. This paper proposes Attention Run-time Termination (ART), a lightweight mechanism that optimizes KV cache access and yields 20% higher generation throughput.

LLMs · memory management · decoding · performance

RESEARCH · arXiv CS.LG · 29d ago

Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

Positive-and-Negative Decoding (PND) is a new training-free inference framework that addresses object hallucination in Vision-Language Models (VLMs). PND enforces visual fidelity through a dual-path contrast mechanism, achieving state-of-the-art performance without retraining.

multimodal AI · hallucination · Vision-Language Models · decoding

RESEARCH · arXiv CS.CL · 27d ago

Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

This paper examines the lack of diversity in LLM outputs and attributes it to how models allocate probability mass across valid and invalid continuations during decoding. It introduces a validity-diversity framework that decomposes the problem into two complementary forms of miscalibration: order calibration and shape calibration.

Calibration · diversity · LLMs · decoding