reinforcement learning

154 items

RESEARCH·arXiv CS.LG·4/16/2026

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

This paper presents a necessary condition for designing intra-group learning algorithms in reinforcement learning: objectives must maintain gradient exchangeability across token updates to prevent reward-irrelevant drift. It proposes minimal transformations that restore this cancellation structure, which stabilizes training and improves sample efficiency.

29
RESEARCH·arXiv CS.LG·4/16/2026

Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

This research introduces Adaptive Memory Crystallization (AMC), a novel memory architecture designed for autonomous AI agents to progressively consolidate experiences in dynamic environments without forgetting prior knowledge. AMC models memory as a continuous crystallization process across a three-phase hierarchy, inspired by synaptic tagging and capture theory and governed by stochastic differential equations.

29
RESEARCH·arXiv CS.AI·6d ago

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

StepPRM-RTL is a novel framework that enhances LLM-based RTL code generation by combining stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT). It uses dense feedback from a PRM to guide reinforcement-style updates and Monte Carlo Tree Search (MCTS) to enrich the training dataset.

29
DOC·AWS Machine Learning Blog·5/7/2026

Overcoming reward signal challenges: Verifiable rewards-based reinforcement learning with GRPO on SageMaker AI

This post details the implementation of verifiable rewards-based reinforcement learning (RLVR) to enhance training performance by ensuring transparency and correctness in reward signals. It covers techniques like GRPO and few-shot examples, demonstrated with the GSM8K dataset for improving math problem-solving accuracy.

29
RESEARCH·arXiv CS.LG·4/6/2026

OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

This paper addresses the low query efficiency of offline preference-based reinforcement learning (PbRL) and proposes the OPRIDE algorithm. The algorithm aims to improve query efficiency through an informative exploration strategy and a discount scheduling mechanism that mitigates overoptimization of the reward function.

29
RESEARCH·arXiv CS.LG·21d ago

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

ReCrit is a new reinforcement learning framework designed to improve large language models' performance in scientific critic interaction. It addresses the issue of LLMs abandoning correct solutions after user criticism by focusing on inter-turn correctness transitions and categorizing behaviors like correction, sycophancy, and robustness.

29
RESEARCH·DEV.to AI·4/14/2026

Adaptive Neuro-Symbolic Planning for deep-sea exploration habitat design in hybrid quantum-classical pipelines

A reinforcement learning agent designed for deep-sea habitat optimization failed to produce a physically viable design, highlighting the limitations of purely sub-symbolic AI when symbolic constraints are not strictly enforced. This experience led to a research focus on adaptive neuro-symbolic planning for mission-critical design challenges.

28
RESEARCH·arXiv CS.CL·4/21/2026

Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning

This work introduces a reciprocal co-training framework that couples a Large Language Model (LLM) with a Random Forest (RF) classifier via reinforcement learning. It creates an iterative feedback loop where each model improves using signals from the other, demonstrating consistent performance gains across medical datasets.

28
RESEARCH·arXiv CS.LG·4/23/2026

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

DR-Venus introduces a frontier 4B deep research agent for edge-scale deployment, trained effectively with only 10K open data. Its two-stage training recipe combines agentic supervised fine-tuning for basic capabilities and agentic reinforcement learning for improved execution reliability on long-horizon tasks, optimizing data quality and utilization.

28
ARTICLE·DEV.to AI·4/23/2026

Explainable Causal Reinforcement Learning for smart agriculture microgrid orchestration with zero-trust governance guarantees

This article details a developer's epiphany while debugging a black-box Reinforcement Learning agent failing to synchronize smart agriculture microgrids. The realization that the agent lacked causal understanding led to exploring Explainable AI and causal inference frameworks to prevent cascading power failures.

28
RESEARCH·arXiv CS.LG·4/6/2026

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

This paper analyzes the interaction between Chain-of-Thought (CoT) and reinforcement learning (RL) in text-to-image (T2I) generation through a systematic entropy-based analysis. It finds that lower entropy in both the image tokens and the textual CoT correlates with better image quality, and proposes the Entropy-Guided Group Relative Policy Optimization (EG-GRPO) strategy for uncertainty-based optimization.

28
RESEARCH·DEV.to AI·4/9/2026

Human-Aligned Decision Transformers for deep-sea exploration habitat design under real-time policy constraints

This piece explores research on designing AI systems that make complex, sequential decisions in extreme environments such as deep-sea exploration. The investigation focused on integrating human preferences into habitat design through Decision Transformers and reinforcement learning.

28
RESEARCH·arXiv CS.LG·22d ago

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

This paper shows that a threshold in decision capacity governs collapse in self-play reinforcement learning agents under asymmetric rule perturbations. Eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor, while preserving even a single such decision prevents this collapse.

28
RESEARCH·arXiv CS.LG·4/17/2026

Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

This research addresses the challenge of decision-making in environments with strategic adversaries or external factors, where traditional policies can fail catastrophically in safety-critical settings. It proposes an optimistic policy learning approach designed to account for these interactions and provide regret and violation guarantees.

28