reinforcement learning

154 items

RESEARCH · arXiv CS.LG · 4/16/2026

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

This paper presents a necessary condition for intra-group learning algorithm design in Reinforcement Learning, requiring objectives to maintain gradient exchangeability across token updates to prevent reward-irrelevant drift. It proposes minimal transformations to restore this cancellation structure, which stabilizes training and improves sample efficiency.

reinforcement learning, large language models, gradient dynamics, model optimization

RESEARCH · arXiv CS.LG · 4/16/2026

Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

This research introduces Adaptive Memory Crystallization (AMC), a novel memory architecture designed for autonomous AI agents to progressively consolidate experiences in dynamic environments without forgetting prior knowledge. AMC models memory as a continuous crystallization process across a three-phase hierarchy, inspired by synaptic tagging and capture theory and governed by stochastic differential equations.

reinforcement learning, Machine Learning, memory architecture, AI Agents

RESEARCH · arXiv CS.AI · 6d ago

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

StepPRM-RTL is a novel framework that enhances LLM-based RTL code generation by combining stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT). It uses dense feedback from a PRM to guide reinforcement-style updates and Monte Carlo Tree Search (MCTS) to enrich the training dataset.

LLMs, reinforcement learning, code generation, RTL Synthesis

DOC · AWS Machine Learning Blog · 5/7/2026

Overcoming reward signal challenges: Verifiable rewards-based reinforcement learning with GRPO on SageMaker AI

This post details the implementation of verifiable rewards-based reinforcement learning (RLVR) to enhance training performance by ensuring transparency and correctness in reward signals. It covers techniques like GRPO and few-shot examples, demonstrated with the GSM8K dataset for improving math problem-solving accuracy.

Policy optimization, reinforcement learning, learning, AI training

RESEARCH · arXiv CS.LG · 18d ago

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

The paper introduces HealthCraft, a public reinforcement-learning environment designed to evaluate the safety of frontier language models in emergency medicine. It focuses on trajectory-level safety, tool misuse, and clinical pressure, built on a FHIR R4 world state and offering 195 tasks for comprehensive assessment.

LLMs, evaluation, reinforcement learning, medical AI

RESEARCH · arXiv CS.LG · 4/6/2026

OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

The paper addresses the low query efficiency of offline preference-based reinforcement learning (PbRL) and proposes the OPRIDE algorithm. OPRIDE aims to improve query efficiency through an informative exploration strategy and a discount scheduling mechanism that mitigates overoptimization of the reward function.

reinforcement learning, Query Efficiency, Exploration, Offline Learning

RESEARCH · arXiv CS.LG · 28d ago

$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin

This paper introduces $\xi$-DPO, a direct preference optimization method using a ratio reward margin, to address the challenge of hyperparameter tuning in SimPO. The research analyzes SimPO and reformulates the preference objective to improve interpretability across datasets with varying reward gap structures.

preference optimization, deep learning, reinforcement learning, Hyperparameter Tuning

RESEARCH · arXiv CS.LG · 21d ago

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

ReCrit is a new reinforcement learning framework designed to improve large language models' performance in scientific critic interaction. It addresses the issue of LLMs abandoning correct solutions after user criticism by focusing on inter-turn correctness transitions and categorizing behaviors like correction, sycophancy, and robustness.

reinforcement learning, learning, Scientific Reasoning, large language models

ARTICLE · DEV.to AI · 4d ago

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a framework for training any AI agent with reinforcement learning. It aims to simplify and accelerate the development and optimization of intelligent agents.

reinforcement learning, AI training, Machine Learning, AI Agents

RESEARCH · DEV.to AI · 4/14/2026

Adaptive Neuro-Symbolic Planning for deep-sea exploration habitat design in hybrid quantum-classical pipelines

A reinforcement learning agent designed for deep-sea habitat optimization failed to produce a physically viable design, highlighting the limitations of purely sub-symbolic AI when symbolic constraints are not strictly enforced. This experience led to a research focus on adaptive neuro-symbolic planning for mission-critical design challenges.

AI limitations, Habitat Design, reinforcement learning, Deep-sea exploration

RESEARCH · DEV.to AI · 4/10/2026

Deep Reinforcement Learning for Sepsis Treatment

This piece covers the application of deep reinforcement learning to the treatment of sepsis, a serious medical condition. It explores how advanced AI techniques can optimize therapeutic decisions in complex clinical settings.

Medical Treatment, deep learning, reinforcement learning, Sepsis

RESEARCH · arXiv CS.CL · 4/21/2026

Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning

This work introduces a reciprocal co-training framework that couples a Large Language Model (LLM) with a Random Forest (RF) classifier via reinforcement learning. It creates an iterative feedback loop where each model improves using signals from the other, demonstrating consistent performance gains across medical datasets.

Random Forests, LLMs, reinforcement learning, Machine Learning

RESEARCH · arXiv CS.LG · 4/23/2026

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

DR-Venus introduces a frontier 4B deep research agent for edge-scale deployment, trained effectively with only 10K open data. Its two-stage training recipe combines agentic supervised fine-tuning for basic capabilities and agentic reinforcement learning for improved execution reliability on long-horizon tasks, optimizing data quality and utilization.

Edge AI, reinforcement learning, machine learning training, SLMs

ARTICLE · DEV.to AI · 4/23/2026

Explainable Causal Reinforcement Learning for smart agriculture microgrid orchestration with zero-trust governance guarantees

This article details a developer's epiphany while debugging a black-box Reinforcement Learning agent failing to synchronize smart agriculture microgrids. The realization that the agent lacked causal understanding led to exploring Explainable AI and causal inference frameworks to prevent cascading power failures.

smart agriculture, microgrids, reinforcement learning, Explainable AI

RESEARCH · arXiv CS.LG · 4/6/2026

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

The paper analyzes the interaction between Chain-of-Thought (CoT) and Reinforcement Learning (RL) in text-to-image (T2I) generation through a systematic entropy-based analysis. It finds that lower entropy in both the image tokens and the textual CoT correlates with better image quality, and proposes the Entropy-Guided Group Relative Policy Optimization (EG-GRPO) strategy for uncertainty-based optimization.

Optimization, deep learning, reinforcement learning, Text-to-Image Generation

RESEARCH · DEV.to AI · 4/9/2026

Human-Aligned Decision Transformers for deep-sea exploration habitat design under real-time policy constraints

This piece explores research on designing AI systems that make complex, sequential decisions in extreme environments such as deep-sea exploration. The investigation focused on integrating human preferences into habitat design through Decision Transformers and reinforcement learning.

decision-transformers, reinforcement learning, Deep-sea exploration, human-aligned AI

RESEARCH · arXiv CS.LG · 22d ago

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

This paper shows that a threshold in decision capacity governs collapse in self-play reinforcement learning agents under asymmetric rule perturbations. Eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor, while preserving even a single such decision prevents this collapse.

Decision-making, reinforcement learning, learning, game theory

RESEARCH · arXiv CS.LG · 4/17/2026

Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

This research addresses the challenge of decision-making in environments with strategic adversaries or external factors, where traditional policies can fail catastrophically in safety-critical settings. It proposes an optimistic policy learning approach designed to account for these interactions and provide regret and violation guarantees.

reinforcement learning, robust AI, adversarial AI

RESEARCH · arXiv CS.LG · 4/8/2026

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

This work presents the Territory Paint Wars environment for investigating PPO failure modes in competitive multi-agent reinforcement learning. It identifies implementation flaws that cause poor performance and, once they are corrected, reveals a new problem of competitive overfitting that harms generalization.

failure modes, reinforcement learning, self-play, PPO

RESEARCH · arXiv CS.CL · 4/23/2026

OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

OThink-SRR1 is a framework that enhances LLMs with an iterative Search-Refine-Reason process trained via reinforcement learning. It addresses RAG's challenges by distilling relevant facts from retrieved documents, improving efficiency and accuracy in complex multi-hop QA.

multi-hop-qa, LLMs, reinforcement learning, RAG