reinforcement learning

154 items

RESEARCH · arXiv CS.LG · 6d ago

Self-Distilled Policy Gradient

This paper introduces Self-Distilled Policy Gradient (SDPG), a novel framework that enhances sparse-reward reinforcement learning through on-policy self-distillation. SDPG integrates group-relative verifier advantages, exact full-vocabulary self-distillation, and KL regularization, demonstrating improved stability and performance over existing baselines.

language models deep learning reinforcement learning Policy Gradient
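The summary above mentions group-relative advantages without defining them. As a rough illustration only, the sketch below shows one common formulation, in which each sampled response's reward is normalized against the other responses drawn for the same prompt. This is an assumption in the style of GRPO-like methods and is not taken from the SDPG paper; the function name `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    This is a generic formulation, not necessarily the one SDPG uses.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses scored by a binary verifier (1.0 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative advantages; if every response in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.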

RESEARCH · arXiv CS.CL · 14d ago

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

RICE-PO is a novel critic-free policy optimization framework addressing the credit-assignment challenge in interactive language agents. It converts retrieval interactions into localized learning signals, evaluating executable actions and propagating credit to latent reasoning steps.

Policy optimization reinforcement learning Retrieval systems AI Agents

ARTICLE · Analytics Vidhya · 23d ago

Top 10 AI Research Papers of 2025

AI research in 2025 shifted significantly from chatbots toward reasoning, autonomous-agent, and multimodal systems. Companies like Google DeepMind and OpenAI drove advancements in areas such as coding agents and scalable safety systems.

multimodal AI reinforcement learning reasoning AI autonomous agents

RESEARCH · DEV.to AI · 13d ago

Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

This research introduces Chain-of-Agents, an end-to-end framework for developing agent foundation models. It leverages multi-agent distillation and agentic reinforcement learning to enhance AI agent capabilities.

AI models reinforcement learning Machine Learning foundation models

RESEARCH · DEV.to AI · 4/26/2026

RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising

RecoGym is a reinforcement learning environment designed to simulate product recommendation problems in online advertising. It provides a platform for researchers and practitioners to test and develop new RL algorithms for recommender systems.

Online Advertising reinforcement learning Machine Learning Simulation Environment

RESEARCH · arXiv CS.CL · 4/20/2026

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

CoLabScience is introduced as a proactive LLM assistant aimed at accelerating biomedical discovery by facilitating collaborations between AI and human experts. It features PULI, a novel reinforcement learning framework for timely interventions in scientific discussions, and also presents BSDD, a new benchmark dataset of simulated research dialogue.

LLMs AI collaboration reinforcement learning datasets

RESEARCH · DEV.to AI · 5/7/2026

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

ReTool introduces a novel reinforcement learning framework designed to enhance the strategic tool-use capabilities of Large Language Models. This approach aims to improve how LLMs select and utilize external tools to solve complex tasks more effectively and efficiently.

LLMs reinforcement learning Machine Learning tool use

RESEARCH · DEV.to AI · 18d ago

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

This research explores the entropy mechanism within reinforcement learning, specifically its application to enhance reasoning capabilities in language models. It investigates how entropy can be leveraged to improve the learning process and decision-making for more robust language model reasoning.

language models reinforcement learning learning Reasoning

RESEARCH · DEV.to AI · 4/12/2026

Explainable Causal Reinforcement Learning for wildfire evacuation logistics networks in carbon-negative infrastructure

This research focuses on overcoming the limitations of standard Reinforcement Learning models in optimizing wildfire evacuations. The author applies causal inference, inspired by Judea Pearl and Bernhard Schölkopf, to address inexplicable recommendations and confounding variables.

wildfire evacuation reinforcement learning Explainable AI Causal Reinforcement Learning

ARTICLE · DEV.to AI · 5/7/2026

Meta-Optimized Continual Adaptation for circular manufacturing supply chains in carbon-negative infrastructure

The author describes a pivotal moment when static optimization, including meta-learning, proved obsolete for dynamic circular manufacturing supply chains, failing catastrophically under sudden policy changes like a carbon tax. This experience exposed the fundamental limitation of traditional methods in adapting to real-world complexities.

Meta-Learning carbon-negative infrastructure reinforcement learning supply chain optimization

RESEARCH · DEV.to AI · 5/6/2026

Generative Simulation Benchmarking for deep-sea exploration habitat design during mission-critical recovery windows

This content describes a researcher's journey into using generative AI for autonomous deep-sea habitat design. After an initial failure, they embarked on a year-long study to develop methods for benchmarking generative models against real-world constraints in extreme environments.

reinforcement learning benchmarking Deep-sea exploration simulation

RESEARCH · DEV.to AI · 4/21/2026

Explainable Causal Reinforcement Learning for satellite anomaly response operations under multi-jurisdictional compliance

The text discusses the need for explainable and causal AI in space operations, illustrating with a satellite incident where an automated correction violated data sovereignty regulations. It highlights the failure of traditional AI approaches to handle the complexity of technical constraints, operational priorities, and jurisdictional boundaries.

Anomaly Detection Aerospace AI reinforcement learning Explainable AI

RESEARCH · DEV.to AI · 5/1/2026

Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning

This content discusses Deep Dyna-Q, an approach that integrates planning for dialogue policy learning in conversational AI systems. The focus is on optimizing the task-completion process through spoken interaction with AI.

reinforcement learning Natural Language Processing AI algorithms dialogue systems

ARTICLE · DEV.to AI · 14d ago

Human-Aligned Decision Transformers for bio-inspired soft robotics maintenance under real-time policy constraints

A personal account details a researcher's struggle with a Decision Transformer failing to maintain bio-inspired soft robotic grippers in real-world deployment, despite high simulation performance. The critical issue identified was the misalignment between the AI's learned policy and human safety expectations for the delicate hardware.

decision-transformers reinforcement learning learning maintenance

DOC · DEV.to AI · 5/10/2026

Understanding Reinforcement Learning with Neural Networks Part 2: Why Backpropagation Is Not Enough

This article, part of a series, explains why standard backpropagation is insufficient for certain reinforcement learning scenarios. It highlights the necessity of policy gradients by demonstrating how error calculation and derivative application differ from traditional neural network training.

neural networks reinforcement learning learning backpropagation

ARTICLE · Hugging Face Blog · 5/6/2026

vLLM V0 to V1: Correctness Before Corrections in RL

This content discusses the transition from vLLM V0 to V1, focusing on the importance of correctness over corrections in Reinforcement Learning. It explores development principles and enhancements to ensure integrity and performance in AI systems.

LLMs reinforcement learning Machine Learning AI development

RESEARCH · DEV.to AI · 27d ago

Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks

This research paper introduces episodic exploration techniques applied to deep deterministic policies. It focuses on enhancing AI performance in complex StarCraft micromanagement tasks.

Episodic Exploration deep learning reinforcement learning Game AI

ARTICLE · DEV.to AI · 4/16/2026

Policy Gradients — Deep Dive + Problem: Valid Parentheses

Policy gradients are a fundamental family of reinforcement learning algorithms that optimize the policy (the mapping from states to actions) directly with gradient-based methods. They handle high-dimensional action spaces and learn stochastic policies naturally, which gives them an advantage over value-based methods in those settings.

reinforcement learning Machine Learning Policy Gradients
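To make the policy-gradient idea in the entry above concrete, here is a minimal sketch of the REINFORCE update on a two-armed bandit with a softmax policy. It is not taken from the linked article: the bandit setup, the function names `softmax` and `train_reinforce`, and the hyperparameters are all assumptions chosen for illustration, and no baseline is subtracted from the reward.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reinforce(rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a bandit: rewards[a] is the fixed payoff of arm a."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # policy parameters, one logit per arm
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the current stochastic policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # For a softmax policy, d/d logit_a of log pi(action)
        # equals 1[a == action] - probs[a].
        for a in range(len(logits)):
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * reward * grad_log_pi
    return softmax(logits)


# Arm 1 pays 1.0 and arm 0 pays nothing, so the policy should favor arm 1.
final_probs = train_reinforce(rewards=[0.0, 1.0])
```

The update adjusts the policy parameters directly from sampled rewards, with no value function involved, which is the contrast with value-based methods that the summary draws.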

RESEARCH · arXiv CS.CL · 4/15/2026

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Self-Distillation Zero (SD-Zero) is a novel post-training method designed to be more sample-efficient than traditional reinforcement learning, without requiring external teachers or high-quality demonstrations. A single model acts as both a Generator and a Reviser, and the Reviser's improved responses and token distributions provide dense supervision for the Generator through on-policy self-distillation.

reinforcement learning post-training Dense Supervision Self-Distillation

RESEARCH · arXiv CS.AI · 4/15/2026

Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents

This research investigates the utility of self-monitoring capabilities (metacognition, self-prediction) in reinforcement learning agents, finding they offer no significant benefit. The implemented modules collapsed to near-constant outputs, indicating the ineffectiveness of the tested mechanisms.

reinforcement learning Metacognition self-monitoring continuous-time agents