reinforcement learning

154 items

RESEARCH · arXiv CS.LG · 4/23/2026

Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

This research introduces the Tool-Augmented Markov Decision Process (TA-MDP) to formally model multimodal agentic decision-making, addressing theoretical gaps in reinforcement fine-tuning for Large Vision-Language Models (LVLMs). It specifically investigates how composite verifiable rewards affect GRPO convergence and why training on small datasets generalizes to out-of-distribution domains for agentic LVLMs.

Theoretical AI reinforcement learning vision models large language models
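As a rough illustration of the two ingredients named in the summary above (a composite verifiable reward and GRPO's group-relative advantage), the following sketch may help. It is not taken from the paper: the reward components (`accuracy`, `format`), their weights, and both function names are hypothetical, and the use of the population standard deviation is an assumption, since implementations differ on that detail.

```python
from statistics import mean, pstdev

def composite_reward(components, weights):
    """Weighted sum of verifiable reward components (names are hypothetical)."""
    return sum(weights[name] * components[name] for name in weights)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, each scored on two hypothetical components.
weights = {"accuracy": 0.8, "format": 0.2}
group = [
    {"accuracy": 1.0, "format": 1.0},
    {"accuracy": 0.0, "format": 1.0},
    {"accuracy": 1.0, "format": 0.0},
    {"accuracy": 0.0, "format": 0.0},
]
rewards = [composite_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and the rest negative ones; when every response in a group scores the same, all advantages collapse to zero and the group contributes no learning signal.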

RESEARCH · arXiv CS.CL · 28d ago

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

ReAD proposes a Reinforcement-guided Capability Distillation framework for Large Language Models, aiming to compress LLMs while preserving essential abilities for downstream tasks. It explicitly accounts for the interdependence of capabilities, optimizing token budget usage and mitigating degradation of useful abilities.

Model Compression Knowledge Distillation LLMs reinforcement learning

RESEARCH · arXiv CS.LG · 13d ago

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

The paper introduces a personalized observation normalization (PON) method for federated reinforcement learning (FedRL) to address challenges in heterogeneous environments. PON allows each agent to locally normalize state inputs, ensuring consistent scaling and improving performance in heterogeneous MuJoCo tasks.

reinforcement learning Machine Learning federated learning Normalization
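The idea of each agent normalizing its own state inputs locally can be sketched with a per-agent running normalizer. This is a minimal illustration under stated assumptions, not the paper's PON implementation: the class name `RunningNormalizer`, the choice of Welford's online algorithm, and the `eps` value are all hypothetical.

```python
import math

class RunningNormalizer:
    """Per-agent running mean/variance tracker (Welford's online algorithm)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations per dimension
        self.eps = eps

    def update(self, obs):
        """Fold one observation into the local statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Scale an observation using this agent's own statistics only."""
        if self.count < 2:
            return list(obs)  # not enough data yet; pass through unchanged
        return [
            (x - self.mean[i]) / math.sqrt(self.m2[i] / self.count + self.eps)
            for i, x in enumerate(obs)
        ]

# Two agents whose environments report the same signal at different scales.
agent_a, agent_b = RunningNormalizer(dim=1), RunningNormalizer(dim=1)
for x in (1.0, 2.0, 3.0):
    agent_a.update([x])
    agent_b.update([100.0 * x])
```

Because the statistics stay local to each agent, the two agents above produce nearly identical normalized inputs despite a 100x difference in raw scale, which is the kind of consistent scaling the summary describes.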

RESEARCH · arXiv CS.AI · 4/13/2026

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Sequence-Level PPO (SPPO) addresses the limitations of standard token-level PPO in long-horizon LLM reasoning tasks by reformulating the process as a Sequence-Level Contextual Bandit problem. This approach uses a decoupled scalar value function to derive low-variance advantage signals, offering improved sample efficiency and stability without the high computational overhead of critic-free alternatives.

LLMs reasoning tasks reinforcement learning PPO

RESEARCH · arXiv CS.AI · 4/16/2026

Exploration and Exploitation Errors Are Measurable for Language Model Agents

This research introduces a method to systematically quantify exploration and exploitation errors in Language Model (LM) agents, addressing the challenge of evaluation without access to internal policies. It proposes controllable environments and a policy-agnostic metric to measure these errors, revealing flaws even in state-of-the-art LMs.

language models reinforcement learning Evaluation Metrics AI Agents

RESEARCH · arXiv CS.LG · 4/8/2026

Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

This work introduces a reinforcement learning framework based on an adaptive reduced-order model (ROM) for active flow control. It aims to improve the sample efficiency of deep reinforcement learning (DRL) by replacing the critic with a ROM that estimates gradients and is continuously updated with new data.

Sample Efficiency reinforcement learning Flow Control Reduced-Order Models

ARTICLE · DEV.to AI · 4/19/2026

Meta-Optimized Continual Adaptation for bio-inspired soft robotics maintenance with zero-trust governance guarantees

The author encountered significant degradation in a bio-inspired soft robotic gripper, revealing the inadequacy of standard reinforcement learning for time-evolving simulation-to-reality gaps. This led to a focus on meta-optimized continual adaptation for maintenance, integrating zero-trust governance.

soft robotics reinforcement learning zero-trust maintenance

RESEARCH · DEV.to AI · 27d ago

Meta-Optimized Continual Adaptation for smart agriculture microgrid orchestration during mission-critical recovery windows

The text discusses the failure of static AI models in dynamic, unpredictable environments, illustrated by an RL agent's malfunction during a wildfire-induced power outage in a smart agriculture microgrid. This critical incident motivated the exploration of meta-optimized continual adaptation for system resilience.

smart agriculture reinforcement learning continual adaptation meta-optimization

RESEARCH · arXiv CS.CL · 4/7/2026

Self-Execution Simulation Improves Coding Models

This work demonstrates that code LLMs can be trained to simulate program execution step by step, improving performance on competitive programming. The approach combines supervised fine-tuning and reinforcement learning, enabling the models to perform self-verification and iterative correction.

LLMs reinforcement learning code generation program execution simulation

RESEARCH · arXiv CS.AI · 12d ago

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

This paper introduces STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method for faster off-policy prediction. It replaces the covariance metric with the symmetric part of the behavior-policy Bellman matrix, providing a more informative update geometry.

Off-Policy Prediction reinforcement learning learning temporal-difference learning

RESEARCH · arXiv CS.AI · 4/17/2026

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

This work introduces Group Fine-Tuning (GFT), a unified post-training framework for large language models. It addresses intrinsic limitations of supervised fine-tuning (SFT), such as single-path dependency and entropy collapse, through Group Advantage Learning and Dynamic Coefficient Rectification.

LLMs reinforcement learning post-training Machine Learning

RESEARCH · arXiv CS.LG · 4/22/2026

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Curiosity-Critic introduces an intrinsic reward for world model training, focusing on improving cumulative prediction error rather than just current transitions. It uses a learned critic to estimate an asymptotic error baseline, effectively separating epistemic from aleatoric error and directing exploration towards learnable transitions.

Epistemic Uncertainty reinforcement learning World Models curiosity

RESEARCH · arXiv CS.AI · 4/22/2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

ARES introduces a framework to address systemic weaknesses in RLHF-aligned LLMs, where imperfect Reward Models fail to penalize unsafe behaviors. It uses a "Safety Mentor" for adaptive red-teaming to discover and mitigate these dual vulnerabilities in both the LLM and its Reward Model.

LLMs reinforcement learning security

ARTICLE · DEV.to AI · 21d ago

Continual Harness: The Gemini Pokémon Agent That Rewrites Its Own Loop

The Continual Harness work explores the idea of an AI agent, such as Gemini Plays Pokémon, editing its own supporting "harness" code in real time. This lets the model refine its tools and its interactions with the environment without requiring human intervention for each adjustment. As a result, the agent can adapt during execution and improve its performance.

Pokémon self-improvement reinforcement learning Gemini

RESEARCH · arXiv CS.AI · 4/13/2026

StaRPO: Stability-Augmented Reinforcement Policy Optimization

StaRPO is a novel reinforcement learning framework designed to improve the logical consistency and structural coherence of large language models in complex reasoning tasks. It explicitly incorporates stability metrics, such as Autocorrelation Function and Path Efficiency, to evaluate local step-to-step coherence and global goal-directedness of the reasoning process.

Policy optimization LLMs reinforcement learning Reasoning

RESEARCH · arXiv CS.AI · 4/25/2026

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

This paper introduces COSPLAY, a co-evolution framework designed to enhance LLM decision-making in long-horizon interactive environments. It enables an LLM agent to retrieve skills from a learnable skill bank while an agent pipeline discovers and retains reusable skills from its own unlabeled rollouts.

LLMs reinforcement learning Skill Discovery AI Agents

RESEARCH · arXiv CS.LG · 4/22/2026

Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning

This research introduces EasyRL, a novel data-efficient reinforcement learning approach for self-evolving LLMs, designed to overcome high annotation costs and performance issues in existing methods. Inspired by cognitive learning theory, EasyRL integrates knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy for difficult unlabeled data.

Data efficiency reinforcement learning Machine Learning LLM

RESEARCH · arXiv CS.AI · 27d ago

Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

This research introduces Macro-Action Value Correction for Instruction Compliance (MAVIC) to address inconsistencies in multi-agent reinforcement learning when external instructions interrupt long-horizon objectives. MAVIC modifies Bellman backups at instruction boundaries to enable consistent value estimation under stochastic instruction switching within a unified policy.

Instruction Following reinforcement learning Multi-Agent Systems Value Function

RESEARCH · arXiv CS.LG · 22d ago

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

This research investigates adversarial action masking in self-play reinforcement learning, where an attacker selectively removes legal actions from a victim's action set. The study found that learned masking causes significantly more damage than random masking or perturbation baselines, highlighting action availability as a critical robustness surface in self-play RL.

reinforcement learning security self-play adversarial attacks
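The threat model in the summary above, an attacker removing legal actions before the victim chooses, can be illustrated with a toy example. Everything here is a hypothetical sketch rather than the paper's setup: the function names, the fixed value table, and the rule that the attacker may never leave the victim with zero actions are assumptions made for illustration.

```python
def apply_action_removal(legal_actions, removed):
    """Return the victim's action set after the attacker removes some actions."""
    remaining = [a for a in legal_actions if a not in removed]
    # Assumption: if the mask would remove everything, it is ignored.
    return remaining if remaining else list(legal_actions)

def greedy_action(action_values, legal_actions):
    """Victim policy: pick the highest-valued action that is still available."""
    return max(legal_actions, key=lambda a: action_values[a])

# Toy victim with fixed value estimates for three actions.
action_values = {0: 0.1, 1: 0.9, 2: 0.5}
legal = [0, 1, 2]

unmasked_choice = greedy_action(action_values, legal)
masked_choice = greedy_action(action_values, apply_action_removal(legal, {1}))
```

Removing only the victim's preferred action forces it onto a lower-valued one. A learned attacker, as studied in the paper, would choose which actions to remove; here the removed set is hard-coded.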

RESEARCH · arXiv CS.LG · 6d ago

Position: Deployed Reinforcement Learning should be Continual

This position paper argues that deployed Reinforcement Learning (RL) agents should engage in continual learning rather than a train-then-fix paradigm. It identifies four sources of non-stationarity post-deployment, highlighting the necessity for agents to continuously adapt to achieve optimal performance in real-world scenarios.

reinforcement learning learning Adaptive AI AI Deployment