reinforcement learning

154 items

RESEARCH · arXiv CS.AI · 4/15/2026

When to Forget: A Memory Governance Primitive

This paper proposes a new metric, Memory Worth (MW), for governing memory quality in agent systems, deciding which memories to trust, suppress, or deprecate. MW uses a two-counter per-memory system tracking co-occurrences with successful versus failed outcomes, converging to the conditional success probability of a task.

Memory governance reinforcement learning memory management agent systems
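
The two-counter idea in this summary can be illustrated with a minimal sketch. The summary does not give the paper's exact formula, so the ratio below, the class name `MemoryCounters`, and the `prior` fallback for unseen memories are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MemoryCounters:
    """Hypothetical per-memory record: two counters, as the summary describes."""
    successes: int = 0  # times the memory co-occurred with a successful outcome
    failures: int = 0   # times the memory co-occurred with a failed outcome

    def record(self, task_succeeded: bool) -> None:
        # Update exactly one counter per observed outcome.
        if task_succeeded:
            self.successes += 1
        else:
            self.failures += 1

    def worth(self, prior: float = 0.5) -> float:
        # Assumed estimator: the empirical success rate, which approaches the
        # conditional success probability as observations accumulate.
        total = self.successes + self.failures
        if total == 0:
            return prior  # no evidence yet; fall back to an assumed prior
        return self.successes / total


memory = MemoryCounters()
for outcome in (True, True, True, False):
    memory.record(outcome)
print(memory.worth())
```

A governance policy could then compare `worth()` against thresholds to trust, suppress, or deprecate a memory; any such thresholds would be a design choice not stated in the summary.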

RESEARCH · arXiv CS.LG · 4/28/2026

CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs

CoFi-PGMA is a new framework for optimizing learning in multi-agent LLM systems, addressing filtered feedback in both routing and collaborative scenarios. It introduces a counterfactual per-agent training objective based on marginal contribution to correct the learning signal.

LLMs reinforcement learning Multi-Agent Systems

RESEARCH · arXiv CS.LG · 4/28/2026

KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning

KARL is a novel framework designed to mitigate hallucinations in large language models by enabling them to appropriately abstain from questions beyond their knowledge. It achieves this through a Knowledge-Boundary-Aware Reward that dynamically estimates the model's knowledge and a Two-Stage RL Training Strategy that prevents excessive caution.

reinforcement learning hallucinations AI Safety LLM

RESEARCH · arXiv CS.AI · 4/13/2026

RAMP: Hybrid DRL for Online Learning of Numeric Action Models

RAMP proposes a novel strategy for learning numeric planning action models online through environmental interactions, integrating Deep Reinforcement Learning (DRL), action model learning, and planning. This creates a positive feedback loop where the RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy.

Deep Reinforcement Learning Action Model Learning Numeric Planning reinforcement learning

RESEARCH · arXiv CS.LG · 4/14/2026

Belief-State RWKV for Reinforcement Learning under Partial Observability

This paper proposes Belief-State RWKV, a stronger RL formulation where the recurrent state is explicitly interpreted as a belief state. The method maintains a compact uncertainty-aware state, allowing policies to depend on both memory and confidence in partially observed settings.

Belief State RWKV Partial Observability reinforcement learning

RESEARCH · arXiv CS.LG · 4/14/2026

A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

This paper provides a comparative theoretical analysis of entropy control strategies in Reinforcement Learning, focusing on traditional regularization versus a novel covariance-based mechanism for LLM training. It establishes a unified framework, showing that covariance-based methods achieve asymptotic unbiasedness by selectively regularizing high-covariance tokens, unlike traditional methods that introduce persistent bias.

Entropy Control Policy Entropy LLMs reinforcement learning

RESEARCH · arXiv CS.LG · 4/9/2026

RAGEN-2: Reasoning Collapse in Agentic RL

This study introduces the concept of 'template collapse', a failure mode in multi-turn LLM agents in which the response becomes input-agnostic even while entropy remains stable. It proposes Mutual Information (MI) as a better metric than entropy for diagnosing reasoning quality, since it correlates more strongly with final performance.

LLMs reinforcement learning Reasoning Evaluation Metrics
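
Why entropy alone can miss an input-agnostic policy is easiest to see on a toy example. The sketch below is not the paper's estimator: it computes plain empirical entropy and mutual information over discrete (input, response) pairs, and the helper names and toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (bits) of a list of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def mutual_information(pairs):
    """Empirical mutual information (bits) between inputs and responses."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum(c / n * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


# Responses vary, but independently of the input.
collapsed = [("a", "t1"), ("a", "t2"), ("b", "t1"), ("b", "t2")]
# Responses vary and are determined by the input.
grounded = [("a", "t1"), ("a", "t1"), ("b", "t2"), ("b", "t2")]

for name, pairs in (("collapsed", collapsed), ("grounded", grounded)):
    responses = [y for _, y in pairs]
    print(name, entropy(responses), mutual_information(pairs))
```

Both toy datasets have the same response entropy (1 bit), yet the mutual information is 0 bits in the input-agnostic case and 1 bit in the input-dependent case, which is the kind of gap an entropy-only diagnostic cannot show.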

RESEARCH · arXiv CS.CL · 4/27/2026

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

This paper investigates whether outcome rewards in reinforcement learning for chain-of-thought reasoning guarantee verifiable or causally important reasoning in LLMs. Introducing Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR) metrics, the authors find that while RLVR improves accuracy, it does not reliably enhance CIR or SR, and a small amount of SFT can remedy these issues.

reinforcement learning AI training Large Language Models (LLMs) Model Evaluation

RESEARCH · arXiv CS.AI · 5/9/2026

From History to State: Constant-Context Skill Learning for LLM Agents

This paper proposes constant-context skill learning, a novel framework for LLM agents to manage recurring workflows more efficiently. It addresses privacy, cost, and capability challenges by learning reusable procedures in task-family modules and conditioning inference on a compact state block. Its effectiveness is demonstrated across benchmarks like ALFWorld, WebShop, and SciWorld.

LLM agents reinforcement learning Skill Learning AI research

RESEARCH · arXiv CS.CL · 4/27/2026

Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning

This work explores neuro-symbolic language reasoning in VLMs, leveraging Reinforcement Learning to improve analytical abilities and efficiency. It achieved a 3.33% accuracy increase on a vision-language evaluation dataset while reducing reasoning tokens by 75%.

Vision-Language Models reinforcement learning Reasoning Neuro-symbolic AI

RESEARCH · arXiv CS.CL · 4/8/2026

Document Optimization for Black-Box Retrieval via Reinforcement Learning

This research paper proposes a new approach to document optimization that transforms documents to align better with retrieval systems via reinforcement learning (GRPO), using ranking improvements as the reward. The method, applicable to black-box retrievers, showed gains on code retrieval and visual document retrieval tasks.

language models Vision-Language Models reinforcement learning document optimization

RESEARCH · arXiv CS.LG · 4/9/2026

Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

This paper presents Probabilistic Language Tries (PLTs), a unified representation that makes explicit the prefix structure of any generative model over sequences. PLTs serve as an optimal lossless compressor, a policy representation for sequential decision problems (such as games and robotics), and a memoization index for execution reuse, with a key theorem on prior-guided caching.

sequence generation reinforcement learning data compression Probabilistic Models

RESEARCH · arXiv CS.AI · 5/4/2026

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

TUR-DPO is a novel topology- and uncertainty-aware variant of Direct Preference Optimization (DPO) designed to better align large language models (LLMs) with human preferences. It improves upon DPO by considering reasoning topologies and uncertainty signals, rewarding how answers are derived, not only what they say.

reinforcement learning DPO AI alignment Machine Learning

RESEARCH · arXiv CS.AI · 5/7/2026

Regularized Centered Emphatic Temporal Difference Learning

This paper introduces Regularized Emphatic Temporal-Difference Learning (RETD) to address the stability, projection geometry, and variance trade-off in off-policy temporal-difference learning. It proposes a method that regularizes the auxiliary centering recursion to maintain the positive-definiteness of the ETD key matrix and proves its convergence.

reinforcement learning learning temporal-difference learning algorithm

RESEARCH · arXiv CS.CL · 5/7/2026

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

This research introduces Adaptive Power-Mean Policy Optimization (APMPO) to improve Large Language Model (LLM) reasoning capabilities within Reinforcement Learning with Verifiable Rewards (RLVR). APMPO combines a generalized power-mean objective and feedback-adaptive clipping to enhance learning dynamics and performance, addressing limitations of static optimization schemes.

Policy optimization LLMs reinforcement learning Machine Learning
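
For readers unfamiliar with the power mean that the APMPO summary refers to, here is a minimal sketch of the standard generalized power mean. It shows only the textbook definition; how APMPO applies it to its objective is not described in the summary, and the function name `power_mean` is an assumption.

```python
from math import exp, log


def power_mean(values, p):
    """Generalized power mean of positive numbers with exponent p.

    p = 1 gives the arithmetic mean, p = -1 the harmonic mean, and the
    limit p -> 0 the geometric mean (handled as a special case here).
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("power_mean expects a non-empty list of positive numbers")
    n = len(values)
    if p == 0:
        return exp(sum(log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


samples = [1.0, 4.0]
for p in (-1, 0, 1, 2):
    print(p, power_mean(samples, p))
```

Larger exponents weight the largest values more heavily and smaller exponents weight the smallest values more heavily, so a single exponent lets one interpolate between these aggregation behaviors.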

RESEARCH · arXiv CS.LG · 22d ago

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

This research addresses the challenge of poor credit assignment in reinforcement learning for multi-step reasoning with large language models, caused by sparse terminal rewards leading to high gradient variance and unstable training. It proposes a counterfactual comparison-based framework and Implicit Behavior Policy Optimization (IBPO) to create step-sensitive learning signals, significantly improving training stability and performance.

reinforcement learning AI training Machine learning research large language models

RESEARCH · arXiv CS.LG · 8d ago

World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

This survey addresses the lack of a unified framework for world models, which are internal simulators used in AI for prediction, planning, and reasoning. It proposes a multi-axis taxonomy organizing their diverse aspects like architecture, methodology, reasoning paradigms, and applications across fields such as reinforcement learning and robotics.

Survey AGI reinforcement learning World Models

RESEARCH · arXiv CS.LG · 8d ago

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

Researchers propose Demo2Reward, a test-time adaptation technique to optimize Vision-Language Model (VLM) reward models in robotics. It uses a few demonstrations to reduce false positives while preserving true positives, without requiring additional model training.

Vision-Language Models reinforcement learning Prompt Optimization robotics

RESEARCH · arXiv CS.LG · 26d ago

Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

This paper introduces TraFL, a novel post-training approach for diffusion language models that addresses "trajectory locking" observed in reward-maximizing methods. TraFL, a trajectory-balance objective, outperforms other methods across mathematical reasoning and code generation benchmarks.

Diffusion Models language models reinforcement learning Machine Learning

RESEARCH · arXiv CS.LG · 29d ago

Distributional Reinforcement Learning via the Cramér Distance

This paper introduces the Cramér-based Distributional Soft Actor-Critic (C-DSAC) algorithm, applying Soft Actor-Critic within a distributional reinforcement learning framework by minimizing the squared Cramér distance. Empirical results demonstrate that C-DSAC outperforms baseline SAC and other distributional methods, particularly in high-complexity environments, attributed to its confidence-driven Q-value updates.

deep learning reinforcement learning learning Algorithms
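
The squared Cramér distance mentioned in the C-DSAC summary is the integral of the squared difference between two cumulative distribution functions. The sketch below computes it for two categorical distributions on a shared, evenly spaced support; that discretization, the function name, and the `spacing` parameter are assumptions for illustration and are not taken from the paper.

```python
from itertools import accumulate


def squared_cramer_distance(p, q, spacing=1.0):
    """Squared Cramér distance between two categorical distributions.

    p and q are probability vectors over the same evenly spaced atoms;
    spacing is the distance between adjacent atoms.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same support")
    cdf_p = accumulate(p)  # running sums give the CDF at each atom
    cdf_q = accumulate(q)
    return spacing * sum((a - b) ** 2 for a, b in zip(cdf_p, cdf_q))


low = [1.0, 0.0, 0.0]   # all mass on the first atom
high = [0.0, 0.0, 1.0]  # all mass on the last atom
print(squared_cramer_distance(low, high))
print(squared_cramer_distance(low, low))
```

Because the distance is computed on CDFs, it grows as probability mass moves farther apart on the support, unlike a KL divergence, which ignores the geometry between atoms.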