
LLMs


RESEARCH · arXiv CS.LG · 4/13/2026

Distributionally Robust Token Optimization in RLHF

To address LLMs' susceptibility to failures from small prompt shifts, especially in multi-step reasoning, researchers propose Distributionally Robust Token Optimization (DRTO). This approach combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO) to enhance consistency under distribution shifts, showing improvements on mathematical reasoning benchmarks.

RESEARCH · arXiv CS.CL · 4/14/2026

Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

This research introduces the '100-Endings metric' to address LLMs' failure in generating compelling stories and recognizing their own quality issues. The metric measures narrative tension by predicting story endings sentence-by-sentence, proving more effective than current rubrics at distinguishing high-quality human narratives from AI outputs.

RESEARCH · arXiv CS.CL · 5/5/2026

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

The CLEAR framework assesses how ambiguity and uncertainty affect the reliability of medical Large Language Models (LLMs), moving beyond simplified evaluation benchmarks. It systematically perturbs answer options and their semantic framing, revealing that performance degrades as the number of plausible answers increases and that model caution decreases when the abstention option is phrased with uncertainty.

RESEARCH · arXiv CS.AI · 5/9/2026

BALAR: A Bayesian Agentic Loop for Active Reasoning

This paper introduces BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm enabling structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and significantly outperforms baselines across diverse reasoning benchmarks.
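
Selecting a question by expected mutual information can be illustrated with a toy discrete example. This is a minimal sketch, not the paper's implementation: the function `expected_information_gain`, the three-state prior, and the two candidate questions are all hypothetical, and each question is assumed to be fully described by a table of answer probabilities per latent state.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_information_gain(prior, likelihood):
    """Mutual information between the latent state and the answer.

    prior[s] is P(state = s); likelihood[s][a] is P(answer = a | state = s).
    Both tables are illustrative assumptions, not taken from the paper.
    """
    n_states = len(prior)
    n_answers = len(likelihood[0])
    gain = entropy(prior)
    for a in range(n_answers):
        p_a = sum(prior[s] * likelihood[s][a] for s in range(n_states))
        if p_a == 0:
            continue
        # Bayes update of the belief after observing answer a.
        posterior = [prior[s] * likelihood[s][a] / p_a for s in range(n_states)]
        gain -= p_a * entropy(posterior)
    return gain

# Hypothetical belief over three latent states and two candidate questions.
prior = [0.5, 0.25, 0.25]
questions = {
    "splits_state_0": [[1, 0], [0, 1], [0, 1]],
    "uninformative": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
}

# Ask the question whose answer is expected to reduce uncertainty the most.
best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
print(best)
```

In this toy setup the question that separates the most likely state from the others is preferred, because its answer is expected to shrink the belief entropy, while the uninformative question leaves the belief unchanged.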

RESEARCH · arXiv CS.CL · 4/9/2026

Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

This paper introduces Text2DistBench, a new benchmark for evaluating the ability of LLMs to infer distributional knowledge from natural language. Unlike traditional benchmarks, it focuses on real-world tasks such as estimating sentiment proportions or identifying frequent topics in text collections like YouTube comments.

RESEARCH · arXiv CS.AI · 4/25/2026

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

This paper proposes a new framework for evaluating rule-governed AI, particularly in content moderation, by moving beyond simple agreement metrics. It introduces the Defensibility Index (DI), Ambiguity Index (AI), and Probabilistic Defensibility Signal (PDS) to assess policy-grounded correctness and reasoning stability, using LLM traces to verify logical derivability from governing rules.

RESEARCH · arXiv CS.LG · 4/14/2026

A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

This paper provides a comparative theoretical analysis of entropy control strategies in Reinforcement Learning, focusing on traditional regularization versus a novel covariance-based mechanism for LLM training. It establishes a unified framework, showing that covariance-based methods achieve asymptotic unbiasedness by selectively regularizing high-covariance tokens, unlike traditional methods that introduce persistent bias.

RESEARCH · arXiv CS.CL · 4/9/2026

Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models

This paper proposes a theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation in multilingual LLMs for the Turkic language family. It aims to address the underrepresentation of low-resource languages in these models, such as Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz.

RESEARCH · arXiv CS.CL · 4/30/2026

LLMs Generate Kitsch

This paper proposes that Large Language Models (LLMs) systematically generate kitsch as a consequence of their training method. Empirically, the study shows readers perceive LLM-generated stories as kitschier, with implications for future study design and creative tasks.

RESEARCH · arXiv CS.LG · 4/9/2026

RAGEN-2: Reasoning Collapse in Agentic RL

This study introduces the concept of 'template collapse', a failure mode in multi-turn LLM agents in which responses become input-agnostic even while entropy remains stable. It proposes Mutual Information (MI) as a better metric than entropy for diagnosing reasoning quality, since it correlates more strongly with final performance.
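
A toy calculation can show why response entropy alone may miss input-agnostic behavior while mutual information between input and response does not. This is a minimal sketch under stated assumptions: the count-based estimator, the function names, and the two tiny sample sets are hypothetical illustrations, not the paper's estimator or data.

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in Shannon entropy in bits, estimated from sample counts."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def mutual_information(pairs):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) over (input, response) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)

# Hypothetical agent whose response depends on the input.
responsive = [("q1", "a"), ("q2", "b"), ("q1", "a"), ("q2", "b")]
# Hypothetical agent whose responses vary but ignore the input.
collapsed = [("q1", "a"), ("q1", "b"), ("q2", "a"), ("q2", "b")]

for name, pairs in [("responsive", responsive), ("collapsed", collapsed)]:
    response_entropy = entropy([y for _, y in pairs])
    print(name, response_entropy, mutual_information(pairs))
```

Both sample sets have the same response entropy, yet only the responsive one carries information about the input, which is the distinction the summary attributes to MI.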

RESEARCH · arXiv CS.LG · 5/1/2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

This research proposes using LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for synthetic mental health data augmentation to address data scarcity and privacy regulations. A comprehensive evaluation framework is introduced, assessing semantic fidelity, lexical diversity, and privacy/plagiarism to mitigate risks like mode collapse or memorization.

RESEARCH · arXiv CS.CL · 4/30/2026

MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

This paper introduces MATH-PT, a novel dataset of 1,729 mathematical problems in European and Brazilian Portuguese, to address the linguistic bias in LLM mathematical reasoning evaluations. The benchmark reveals that frontier reasoning models perform strongly on multiple-choice questions but that their performance decreases on open-ended questions.

RESEARCH · arXiv CS.CL · 5/1/2026

Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

This paper introduces an ILR-informed framework to evaluate Claude (Sonnet 4.6) for cross-lingual response consistency across six languages. It analyzes responses to semantically equivalent prompts using quantitative metrics and expert ILR qualitative assessment, revealing language-specific variations like response length differences and surface divergence in creative clusters.

RESEARCH · arXiv CS.CL · 4/30/2026

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

This research explores the use of lightweight Large Language Models (LLMs) for Biomedical Named Entity Recognition, demonstrating their competitive performance against larger models. The study highlights their potential as resource-efficient alternatives and identifies specific output formats that consistently improve performance.

RESEARCH · arXiv CS.CL · 4/16/2026

Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

This paper argues that the primary bottleneck in multimodal scaling for MLLMs is knowledge density in training data, rather than task format. It demonstrates that task-specific supervision like VQA adds little incremental semantic information beyond image captions, and that increasing knowledge density leads to consistent performance improvements.
