
reinforcement learning

154 items

RESEARCH·arXiv CS.LG·4/23/2026

Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

This research introduces the Tool-Augmented Markov Decision Process (TA-MDP) to formally model multimodal agentic decision-making, addressing theoretical gaps in reinforcement fine-tuning for Large Vision-Language Models (LVLMs). It specifically investigates how composite verifiable rewards affect GRPO convergence and why training on small datasets generalizes to out-of-distribution domains for agentic LVLMs.

28
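To make the "composite verifiable rewards" and GRPO pairing in the summary above concrete, here is a minimal sketch, not the paper's method. It assumes a weighted sum of reward components and the usual group-relative standardization; the component names (`accuracy`, `format`), the weights, and the use of the population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def composite_reward(components, weights):
    """Weighted sum of verifiable reward components (names are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses to the same prompt (made-up scores).
weights = {"accuracy": 1.0, "format": 0.5}
group = [
    {"accuracy": 1.0, "format": 1.0},
    {"accuracy": 0.0, "format": 1.0},
    {"accuracy": 1.0, "format": 0.0},
    {"accuracy": 0.0, "format": 0.0},
]
rewards = [composite_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Because each advantage is measured against the group mean, how the components are weighted changes which responses are pushed up or down, which is one way to see why reward decomposition could matter for convergence.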
RESEARCH·arXiv CS.LG·13d ago

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

The paper introduces a personalized observation normalization (PON) method for federated reinforcement learning (FedRL) to address challenges in heterogeneous environments. PON allows each agent to locally normalize state inputs, ensuring consistent scaling and improving performance in heterogeneous MuJoCo tasks.

28
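The local normalization idea in the PON summary above can be sketched as a per-agent running normalizer. This is an assumption-laden illustration, not the paper's implementation: the class name `RunningNormalizer`, the Welford-style update, and the choice to keep statistics strictly local to each agent are all hypothetical.

```python
import math


class RunningNormalizer:
    """Per-agent running mean/variance (Welford update); never shared or averaged."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        n = max(self.count, 1)
        return [
            (x - m) / math.sqrt(v / n + self.eps)
            for x, m, v in zip(obs, self.mean, self.m2)
        ]


# Two agents whose environments report the same signal at different scales.
agent_a, agent_b = RunningNormalizer(1), RunningNormalizer(1)
for x in (1.0, 2.0, 3.0):
    agent_a.update([x])
    agent_b.update([100.0 * x])

scaled_a = agent_a.normalize([3.0])
scaled_b = agent_b.normalize([300.0])
```

In this toy setup both agents feed nearly identical normalized inputs to the policy even though their raw observations differ by a factor of 100.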
RESEARCH·arXiv CS.AI·4/13/2026

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Sequence-Level PPO (SPPO) addresses the limitations of standard token-level PPO in long-horizon LLM reasoning tasks by reformulating the process as a Sequence-Level Contextual Bandit problem. This approach uses a decoupled scalar value function to derive low-variance advantage signals, offering improved sample efficiency and stability without the high computational overhead of critic-free alternatives.

28
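One way to read the sequence-level contextual bandit framing in the SPPO summary above: the whole response is a single action, so one scalar baseline per prompt yields one advantage shared by every token. The sketch below assumes exactly that; the function names and the broadcast to tokens are illustrative and not taken from the paper.

```python
def sequence_advantage(reward, prompt_value):
    """Bandit-style advantage: sequence reward minus a scalar baseline for the prompt."""
    return reward - prompt_value


def per_token_advantages(reward, prompt_value, num_tokens):
    """Share the single sequence-level advantage across all response tokens."""
    adv = sequence_advantage(reward, prompt_value)
    return [adv] * num_tokens


# A 3-token response that earned reward 1.0 on a prompt whose baseline is 0.25.
token_advs = per_token_advantages(reward=1.0, prompt_value=0.25, num_tokens=3)
```

Compared with token-level PPO, this sketch needs no per-token value estimates, only one scalar per prompt.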
RESEARCH·arXiv CS.AI·4/16/2026

Exploration and Exploitation Errors Are Measurable for Language Model Agents

This research introduces a method to systematically quantify exploration and exploitation errors in Language Model (LM) agents, addressing the challenge of evaluation without access to internal policies. It proposes controllable environments and a policy-agnostic metric to measure these errors, revealing flaws even in state-of-the-art LMs.

28
RESEARCH·arXiv CS.LG·4/8/2026

Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

This work introduces a reinforcement learning framework for active flow control based on an adaptive reduced-order model (ROM). It aims to improve the sample efficiency of deep reinforcement learning (DRL) by replacing the critic with a ROM that estimates gradients and is continuously updated with new data.

28
ARTICLE·DEV.to AI·4/19/2026

Meta-Optimized Continual Adaptation for bio-inspired soft robotics maintenance with zero-trust governance guarantees

The author encountered significant degradation in a bio-inspired soft robotic gripper, revealing the inadequacy of standard reinforcement learning for time-evolving simulation-to-reality gaps. This led to a focus on meta-optimized continual adaptation for maintenance, integrating zero-trust governance.

28
RESEARCH·DEV.to AI·27d ago

Meta-Optimized Continual Adaptation for smart agriculture microgrid orchestration during mission-critical recovery windows

The post describes how static AI models fail in dynamic, unpredictable environments, using the example of an RL agent that malfunctioned during a wildfire-induced power outage in a smart agriculture microgrid. That incident motivated the author's exploration of meta-optimized continual adaptation for system resilience.

28
RESEARCH·arXiv CS.LG·4/22/2026

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Curiosity-Critic introduces an intrinsic reward for world model training, focusing on improving cumulative prediction error rather than just current transitions. It uses a learned critic to estimate an asymptotic error baseline, effectively separating epistemic from aleatoric error and directing exploration towards learnable transitions.

27
ARTICLE·DEV.to AI·21d ago

Continual Harness: The Gemini Pokémon Agent That Rewrites Its Own Loop

The Continual Harness work explores the idea of an AI agent, such as the Gemini Plays Pokémon agent, editing its own supporting "harness" code in real time. This lets the model refine its tools and its interactions with the environment without waiting for a human to make the adjustments, so the agent can adapt during execution and improve its performance.

27
RESEARCH·arXiv CS.AI·4/13/2026

StaRPO: Stability-Augmented Reinforcement Policy Optimization

StaRPO is a novel reinforcement learning framework designed to improve the logical consistency and structural coherence of large language models in complex reasoning tasks. It explicitly incorporates stability metrics, such as Autocorrelation Function and Path Efficiency, to evaluate local step-to-step coherence and global goal-directedness of the reasoning process.

27
RESEARCH·arXiv CS.LG·4/22/2026

Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning

This research introduces EasyRL, a novel data-efficient reinforcement learning approach for self-evolving LLMs, designed to overcome high annotation costs and performance issues in existing methods. Inspired by cognitive learning theory, EasyRL integrates knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy for difficult unlabeled data.

27
RESEARCH·arXiv CS.AI·27d ago

Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

This research introduces Macro-Action Value Correction for Instruction Compliance (MAVIC) to address inconsistencies in multi-agent reinforcement learning when external instructions interrupt long-horizon objectives. MAVIC modifies Bellman backups at instruction boundaries to enable consistent value estimation under stochastic instruction switching within a unified policy.

27
RESEARCH·arXiv CS.LG·22d ago

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

This research investigates adversarial action masking in self-play reinforcement learning, where an attacker selectively removes legal actions from a victim's action set. The study found that learned masking causes significantly more damage than random masking or perturbation baselines, highlighting action availability as a critical robustness surface in self-play RL.

27
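The action-removal setting in the last summary above can be illustrated with a masked softmax policy: the attacker deletes some legal actions and the victim's probabilities renormalize over what remains. This is a minimal sketch under assumptions; the function name `masked_policy` and the fixed `removed` set are hypothetical, and the paper's attacker learns which actions to remove rather than using a hard-coded set.

```python
import math


def masked_policy(logits, legal, removed=frozenset()):
    """Softmax over the legal actions that the attacker has not removed."""
    available = [a for a in legal if a not in removed]
    if not available:
        raise ValueError("attacker removed every legal action")
    top = max(logits[a] for a in available)  # subtract max for numerical stability
    exps = {a: math.exp(logits[a] - top) for a in available}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


logits = [2.0, 1.0, 0.0]
clean = masked_policy(logits, legal=[0, 1, 2])
attacked = masked_policy(logits, legal=[0, 1, 2], removed={0})
```

Removing the victim's preferred action (index 0 here) forces all probability mass onto the actions it rated lower, which is the kind of targeted damage a learned mask can exploit more effectively than a random one.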