reinforcement learning

RESEARCH · arXiv CS.LG · 6d ago

Self-Distilled Policy Gradient

This paper introduces Self-Distilled Policy Gradient (SDPG), a novel framework that enhances sparse-reward reinforcement learning through on-policy self-distillation. SDPG integrates group-relative verifier advantages, exact full-vocabulary self-distillation, and KL regularization, demonstrating improved stability and performance over existing baselines.

RESEARCH · arXiv CS.CL · 4/20/2026

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

CoLabScience is introduced as a proactive LLM assistant aimed at accelerating biomedical discovery by facilitating collaborations between AI and human experts. It features PULI, a novel reinforcement learning framework for timely interventions in scientific discussions, and also presents BSDD, a new benchmark dataset of simulated research dialogue.

RESEARCH · DEV.to AI · 4/12/2026

Explainable Causal Reinforcement Learning for wildfire evacuation logistics networks in carbon-negative infrastructure

This research addresses the limitations of standard reinforcement learning models for optimizing wildfire evacuations. The author applies causal inference, inspired by the work of Judea Pearl and Bernhard Schölkopf, to address unexplainable recommendations and confounding variables.

ARTICLE · DEV.to AI · 5/7/2026

Meta-Optimized Continual Adaptation for circular manufacturing supply chains in carbon-negative infrastructure

The author describes a pivotal moment when static optimization, including meta-learning, proved obsolete for dynamic circular manufacturing supply chains, failing catastrophically under sudden policy changes like a carbon tax. This experience exposed the fundamental limitation of traditional methods in adapting to real-world complexities.

RESEARCH · DEV.to AI · 4/21/2026

Explainable Causal Reinforcement Learning for satellite anomaly response operations under multi-jurisdictional compliance

The text discusses the need for explainable and causal AI in space operations, illustrated by a satellite incident in which an automated correction violated data sovereignty regulations. It highlights the failure of traditional AI approaches to handle the combined complexity of technical constraints, operational priorities, and jurisdictional boundaries.

ARTICLE · DEV.to AI · 14d ago

Human-Aligned Decision Transformers for bio-inspired soft robotics maintenance under real-time policy constraints

A personal account details a researcher's struggle with a Decision Transformer failing to maintain bio-inspired soft robotic grippers in real-world deployment, despite high simulation performance. The critical issue identified was the misalignment between the AI's learned policy and human safety expectations for the delicate hardware.

RESEARCH · arXiv CS.CL · 4/15/2026

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Self-Distillation Zero (SD-Zero) is a novel post-training method designed to be more sample-efficient in training than traditional reinforcement learning, without requiring external teachers or high-quality demonstrations. A single model acts as both a Generator and a Reviser, and the Reviser's improved responses and token distributions provide dense supervision for the Generator through on-policy self-distillation.

RESEARCH · arXiv CS.AI · 4/15/2026

Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents

This research investigates the utility of self-monitoring capabilities (metacognition, self-prediction) in reinforcement learning agents, finding they offer no significant benefit. The implemented modules collapsed to near-constant outputs, indicating the ineffectiveness of the tested mechanisms.
