
large language models

262 items

RESEARCH · arXiv CS.AI · 5/11/2026

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic-Actor Loop for Agentic Reasoning

This paper introduces SCALAR (Structured Critic-Actor Loop for Agentic Reasoning), an Actor-Critic-Judge pipeline applied to theoretical physics problems. It investigates how the interaction between researchers and AI agents affects results on physics reasoning tasks, and reports that multi-turn dialogue significantly improves on single-shot attempts.

RESEARCH · arXiv CS.LG · 4/23/2026

Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models

This paper evaluates speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, which is powered by fine-tuned Nemotron models. The study reports a 22-49% throughput increase and an 18-33% latency reduction at zero additional hardware cost.

RESEARCH · arXiv CS.LG · 4/23/2026

Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

This research introduces the Tool-Augmented Markov Decision Process (TA-MDP) to formally model multimodal agentic decision-making, addressing theoretical gaps in reinforcement fine-tuning for Large Vision-Language Models (LVLMs). It specifically investigates how composite verifiable rewards affect GRPO convergence and why training on small datasets generalizes to out-of-distribution domains for agentic LVLMs.

RESEARCH · arXiv CS.LG · 4/23/2026

Super Apriel: One Checkpoint, Many Speeds

Super Apriel, a newly released 15B-parameter supernet, offers four trained mixer choices per decoder layer, enabling multiple speed/quality presets from a single checkpoint. The presets range from 2.9x to 10.7x decode throughput gains, with quality retention falling from 96% to 77% as throughput rises, and the design also supports speculative decoding without a separate draft model.

RESEARCH · arXiv CS.CL · 13d ago

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

FLUID is a new framework designed to efficiently adapt Autoregressive (AR) backbones to the diffusion paradigm for parallel text generation. It enables initialization from GPT-style models and introduces an entropy-driven mechanism called Elastic Horizons, achieving state-of-the-art performance with significantly reduced training costs.

RESEARCH · arXiv CS.CL · 4/16/2026

Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

Dental-TriageBench introduces the first expert-annotated benchmark for multimodal reasoning in hierarchical dental triage, comprising 246 authentic, de-identified cases. The research highlights a substantial performance gap between 19 MLLMs and junior dentists, particularly in treatment-level triage tasks requiring multiple referral domains.

RESEARCH · arXiv CS.CL · 23d ago

Greedy or not, here I come: Language production under vocabulary constraints in humans and resource-rational models

This research explores how humans communicate with limited vocabularies, comparing their strategies to computational sampling algorithms powered by large language models. The study reveals that human language production under constraint often mirrors greedy sampling, although more skilled individuals exhibit non-greedy revision behaviors.

RESEARCH · arXiv CS.CL · 23d ago

Fluency and Faithfulness in Human and Machine Literary Translation

This research investigates the balance between fluency and faithfulness in literary translation, comparing human translators, Google Translate, and TranslateGemma across 106 novels in 16 source languages. It finds a consistent negative correlation between fluency and faithfulness, particularly for human translations and Google Translate output, and indicates that segment length significantly affects automatic evaluation.

RESEARCH · arXiv CS.CL · 6d ago

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

A large-scale study re-evaluates Retrieval-Augmented Generation (RAG) in medical question answering, finding only small and inconsistent improvements over no-retrieval baselines. It suggests that the choice of the backbone model is more critical than retrieval methods, and the main bottleneck is the model's ability to effectively use retrieved evidence.

RESEARCH · arXiv CS.AI · 6d ago

Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research

This commentary introduces PEEL, a working scaffold that combines deterministic distant reading with LLM interpretation, grounded in Peircean semiotics and abductive reasoning. Applied to AI-generated condensations, PEEL reveals systematic distortions that are invisible without non-AI measurement, implying that deterministic instruments must accompany AI tools to ensure fidelity and epistemic authority.
