LLMs

723 items

RESEARCH · arXiv CS.LG · 4/13/2026

Distributionally Robust Token Optimization in RLHF

To address LLMs' susceptibility to failures from small prompt shifts, especially in multi-step reasoning, researchers propose Distributionally Robust Token Optimization (DRTO). This approach combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO) to enhance consistency under distribution shifts, showing improvements on mathematical reasoning benchmarks.

DRO LLMs RLHF Distributionally Robust Optimization

RESEARCH · arXiv CS.LG · 4/13/2026

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

QuanBench+ is a new unified multi-framework benchmark for evaluating Large Language Models (LLMs) in quantum code generation, covering Qiskit, PennyLane, and Cirq. It assesses models across 42 tasks and demonstrates significant improvements with feedback-based repair.

LLMs PennyLane Quantum Code Generation benchmarking

RESEARCH · arXiv CS.CL · 4/14/2026

Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

This research introduces the '100-Endings metric' to address LLMs' failure in generating compelling stories and recognizing their own quality issues. The metric measures narrative tension by predicting story endings sentence-by-sentence, proving more effective than current rubrics at distinguishing high-quality human narratives from AI outputs.

LLMs storytelling Evaluation Metrics Narrative Tension

RESEARCH · arXiv CS.CL · 4/10/2026

Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma

This paper describes DFR-Gemma, a new framework that enables LLMs to reason directly over dense geospatial embeddings. It aligns high-dimensional embeddings with an LLM's latent space through a lightweight projector, injecting them as semantic tokens.

Geospatial AI LLMs Geospatial Embeddings Spatio-temporal Data

RESEARCH · arXiv CS.CL · 5/5/2026

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

The CLEAR framework assesses how ambiguity and uncertainty affect the reliability of medical Large Language Models (LLMs), moving beyond simplified evaluation benchmarks. It systematically perturbs answer options and their semantic framing, revealing that LLM performance degrades as the number of plausible answers increases and that models become less cautious when the abstention option is phrased with uncertainty.

Ambiguity LLMs evaluation reliability

RESEARCH · arXiv CS.CL · 5/1/2026

Semantic Structure of Feature Space in Large Language Models

This study demonstrates that the geometric relationships between semantic features in large language models' hidden states closely mirror human psychological associations. It shows that word vector projections onto semantic axes correlate with human ratings, and the similarity between these axes predicts the interconnections of semantic scales.

LLMs cognitive science semantic representation NLP
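
The projection step in the summary above can be illustrated with a small sketch. It assumes a semantic axis is the normalized difference between two pole vectors, which is a common construction and not necessarily the one used in the paper. The vectors, the ratings, and the helper names `semantic_axis`, `project`, and `pearson` are all made up for illustration.

```python
import math

def semantic_axis(pos_pole, neg_pole):
    """Unit vector pointing from the negative pole to the positive pole."""
    diff = [p - n for p, n in zip(pos_pole, neg_pole)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(vec, axis):
    """Scalar projection of a word vector onto a unit-length axis."""
    return sum(v * a for v, a in zip(vec, axis))

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy 2-d "hidden states" (hypothetical, not taken from any model).
good, bad = [1.0, 0.2], [-1.0, 0.2]
words = {"joy": [0.8, 0.1], "table": [0.0, 0.5], "grief": [-0.7, 0.3]}
toy_ratings = {"joy": 6.5, "table": 4.0, "grief": 1.5}  # invented valence scores

axis = semantic_axis(good, bad)
projections = [project(words[w], axis) for w in words]
ratings = [toy_ratings[w] for w in words]
correlation = pearson(projections, ratings)
```

With real data, the word vectors would come from a model's hidden states and the ratings from a human norming study; the correlation would then be the quantity of interest.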

RESEARCH · arXiv CS.AI · 5/9/2026

BALAR : A Bayesian Agentic Loop for Active Reasoning

This paper introduces BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm enabling structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and significantly outperforms baselines across diverse reasoning benchmarks.

LLMs interactive AI Reasoning Bayesian models
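
As a rough illustration of choosing a clarifying question by expected mutual information, the sketch below scores each candidate question by I(S; A), the mutual information between the latent state S and the user's answer A, and picks the maximizer. It is a minimal discrete toy under stated assumptions, not BALAR's implementation: the state names, the question names, and the functions `expected_information_gain` and `select_question` are hypothetical.

```python
import math

def expected_information_gain(belief, likelihood):
    """I(S; A) in bits for one question.

    belief:     dict state -> prior probability P(s)
    likelihood: dict state -> dict answer -> P(answer | state)
    """
    # Marginal answer distribution P(a) = sum_s P(s) * P(a | s).
    marginal = {}
    for s, p_s in belief.items():
        for a, p_a in likelihood[s].items():
            marginal[a] = marginal.get(a, 0.0) + p_s * p_a
    # I(S; A) = sum_{s,a} P(s) P(a|s) log2( P(a|s) / P(a) ).
    mi = 0.0
    for s, p_s in belief.items():
        for a, p_a in likelihood[s].items():
            if p_s > 0 and p_a > 0:
                mi += p_s * p_a * math.log2(p_a / marginal[a])
    return mi

def select_question(belief, questions):
    """Return the question whose answer is expected to be most informative."""
    return max(questions, key=lambda q: expected_information_gain(belief, questions[q]))

# Uniform belief over four hypothetical latent user intents.
belief = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
questions = {
    "splits_half": {"A": {"yes": 1.0}, "B": {"yes": 1.0},
                    "C": {"no": 1.0}, "D": {"no": 1.0}},
    "isolates_A": {"A": {"yes": 1.0}, "B": {"no": 1.0},
                   "C": {"no": 1.0}, "D": {"no": 1.0}},
    "uninformative": {s: {"yes": 1.0} for s in "ABCD"},
}
best = select_question(belief, questions)
```

In this toy setup the question that splits the hypotheses evenly is preferred, because its answer removes the most uncertainty on average.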

RESEARCH · arXiv CS.CL · 4/9/2026

Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

This paper introduces Text2DistBench, a new benchmark for evaluating LLMs' ability to infer distributional knowledge from natural language. Unlike traditional benchmarks, it focuses on real-world tasks such as estimating sentiment proportions or identifying frequent topics in text collections such as YouTube comments.

Distributional Information Reading Comprehension LLMs benchmarking

RESEARCH · arXiv CS.AI · 4/25/2026

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

This paper proposes a new framework for evaluating rule-governed AI, particularly in content moderation, by moving beyond simple agreement metrics. It introduces the Defensibility Index (DI), Ambiguity Index (AI), and Probabilistic Defensibility Signal (PDS) to assess policy-grounded correctness and reasoning stability, using LLM traces to verify logical derivability from governing rules.

LLMs content moderation AI ethics AI evaluation

RESEARCH · arXiv CS.LG · 4/14/2026

A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

This paper provides a comparative theoretical analysis of entropy control strategies in Reinforcement Learning, focusing on traditional regularization versus a novel covariance-based mechanism for LLM training. It establishes a unified framework, showing that covariance-based methods achieve asymptotic unbiasedness by selectively regularizing high-covariance tokens, unlike traditional methods that introduce persistent bias.

Entropy Control Policy Entropy LLMs reinforcement learning

RESEARCH · arXiv CS.CL · 4/9/2026

Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models

This paper proposes a theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation in multilingual LLMs for the Turkic language family. It aims to address the underrepresentation of low-resource languages in these models, such as Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz.

LLMs Turkic languages cross-lingual transfer Parameter-efficient adaptation

RESEARCH · arXiv CS.CL · 4/30/2026

LLMs Generate Kitsch

This paper proposes that Large Language Models (LLMs) systematically generate kitsch as a consequence of their training method. Empirically, the study shows readers perceive LLM-generated stories as kitschier, with implications for future study design and creative tasks.

LLMs Content Generation AI creativity

RESEARCH · arXiv CS.AI · 4/27/2026

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

This paper introduces 'Background Temperature', a concept for characterizing the hidden randomness present in Large Language Models.

LLMs machine learning randomness large language models

RESEARCH · arXiv CS.LG · 4/9/2026

RAGEN-2: Reasoning Collapse in Agentic RL

This study introduces the concept of 'template collapse', a failure mode in multi-turn LLM agents in which responses become input-agnostic even while entropy remains stable. It proposes mutual information (MI) as a better metric than entropy for diagnosing reasoning quality, as it correlates more strongly with final performance.

LLMs reinforcement learning Reasoning Evaluation Metrics
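
To see why entropy alone can miss the collapse described above, the toy sketch below compares output entropy H(Y) with a plug-in estimate of the mutual information I(X; Y) between inputs and outputs. It assumes discrete prompts and responses and uses the identity I(X; Y) = H(X) + H(Y) - H(X, Y); the paper's actual estimator may differ, and the sample data and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def output_entropy(pairs):
    """H(Y): entropy of the responses, ignoring which prompt produced them."""
    return entropy(Counter(y for _, y in pairs))

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) = H(X) + H(Y) - H(X, Y) from (prompt, response) pairs."""
    h_x = entropy(Counter(x for x, _ in pairs))
    h_y = entropy(Counter(y for _, y in pairs))
    h_xy = entropy(Counter(pairs))
    return h_x + h_y - h_xy

# Healthy agent: the response depends on the prompt.
healthy = [("p1", "a"), ("p1", "a"), ("p2", "b"), ("p2", "b")]
# Collapsed agent: the same response mix regardless of the prompt.
collapsed = [("p1", "a"), ("p1", "b"), ("p2", "a"), ("p2", "b")]

h_healthy, h_collapsed = output_entropy(healthy), output_entropy(collapsed)
mi_healthy, mi_collapsed = mutual_information(healthy), mutual_information(collapsed)
```

Both toy agents have the same output entropy, but only the healthy one has nonzero mutual information with its input, which is the distinction the summary attributes to MI.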

RESEARCH · arXiv CS.LG · 5/1/2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

This research proposes using LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for synthetic mental health data augmentation to address data scarcity and privacy regulations. A comprehensive evaluation framework is introduced, assessing semantic fidelity, lexical diversity, and privacy/plagiarism to mitigate risks like mode collapse or memorization.

synthetic data LLMs security Data Augmentation

RESEARCH · arXiv CS.CL · 4/30/2026

MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

This paper introduces MATH-PT, a novel dataset of 1,729 mathematical problems in European and Brazilian Portuguese, to address the linguistic bias in LLM mathematical reasoning evaluations. The benchmark reveals that frontier reasoning models perform strongly on multiple-choice questions but less well on open-ended questions.

Dataset mathematical reasoning LLMs benchmarking

RESEARCH · arXiv CS.CL · 5/1/2026

Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

This paper introduces an ILR-informed framework to evaluate Claude (Sonnet 4.6) for cross-lingual response consistency across six languages. It analyzes responses to semantically equivalent prompts using quantitative metrics and expert ILR qualitative assessment, revealing language-specific variations like response length differences and surface divergence in creative clusters.

Multilingual AI LLMs AI evaluation

RESEARCH · arXiv CS.CL · 4/30/2026

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

This research explores the use of lightweight Large Language Models (LLMs) for Biomedical Named Entity Recognition, demonstrating their competitive performance against larger models. The study highlights their potential as resource-efficient alternatives and identifies specific output formats that consistently improve performance.

LLMs named entity recognition Model Evaluation NLP

RESEARCH · arXiv CS.LG · 5/1/2026

Automatic Causal Fairness Analysis with LLM-Generated Reporting

The `FairMind` software prototype automates causal fairness analysis at the dataset level, addressing the lack of fairness consideration in most AutoML frameworks. It utilizes the standard fairness model and LLMs to generate accurate reports on fairness based on counterfactual causal effects.

LLMs causal AI AI ethics fairness

RESEARCH · arXiv CS.CL · 4/16/2026

Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

This paper argues that the primary bottleneck in multimodal scaling for MLLMs is knowledge density in training data, rather than task format. It demonstrates that task-specific supervision like VQA adds little incremental semantic information beyond image captions, and that increasing knowledge density leads to consistent performance improvements.

multimodal AI LLMs machine learning Research Paper