
reinforcement learning

154 items

RESEARCH · arXiv CS.CL · 5/7/2026

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

FREIA is a novel reinforcement learning algorithm designed to enhance LLMs for unsupervised reasoning, addressing the lack of adaptability in existing methods. It employs Free Energy-Driven Reward (FER) to balance consensus and exploration, and Adaptive Advantage Shaping (AAS) to adjust learning signals. FREIA outperforms unsupervised baselines across various reasoning tasks, particularly in mathematical reasoning.

RESEARCH · arXiv CS.CL · 26d ago

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

This research introduces Inquisitive Conversational Agents (ICAs) designed to proactively extract information, specifically tailored for U.S. Supreme Court oral arguments. It proposes a Dual Hierarchical Reinforcement Learning framework to coordinate strategic dialogue management and fine-grained utterance generation, significantly outperforming baselines.

RESEARCH · arXiv CS.CL · 8d ago

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

This paper proposes CSRP, a three-stage framework for Chinese Grammatical Error Correction (CGEC) using Large Language Models (LLMs). It addresses the limitations of general-purpose models and the difficulty of optimizing for evaluation metrics through three stages: continual pre-training, Chain-of-Thought SFT, and policy optimization with efficiency-aware rewards that penalize unnecessary edits. CSRP achieves state-of-the-art performance on the NACGEC benchmark.
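
The idea of a reward that penalizes unnecessary edits can be illustrated with a minimal sketch. This is not the reward used by CSRP, which the summary does not specify; the function names, the exact-match correctness term, the character-level edit distance, and the `penalty` weight are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between a and b (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def efficiency_aware_reward(source: str, hypothesis: str, reference: str,
                            penalty: float = 0.1) -> float:
    """Hypothetical reward: exact-match credit minus a charge for extra edits.

    'Extra' edits are those beyond the number the reference itself needs,
    so a correction is not punished for edits that were required.
    """
    correct = 1.0 if hypothesis == reference else 0.0
    needed = levenshtein(source, reference)
    made = levenshtein(source, hypothesis)
    extra = max(0, made - needed)
    return correct - penalty * extra
```

Under these assumptions, a hypothesis that matches the reference with no surplus edits scores 1.0, while one that rewrites the whole sentence is pushed below zero even before correctness is considered.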

RESEARCH · arXiv CS.LG · 28d ago

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in reinforcement learning for diffusion models, which often causes mode collapse and degrades generative diversity. It replaces scalar reward maximization with trajectory-level reward distribution matching, using a Softmax Trajectory Balance objective to align policy probabilities with a reward-induced Boltzmann distribution.
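
To make "matching a reward-induced Boltzmann distribution" concrete, here is a small sketch. It is a generic illustration and not the Softmax Trajectory Balance objective from the paper, whose exact form the summary does not give; the group-wise softmax over trajectory log-probabilities, the KL divergence, and the `temperature` parameter are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def boltzmann_target(rewards, temperature=1.0):
    """Target distribution proportional to exp(reward / temperature)."""
    return softmax([r / temperature for r in rewards])


def distribution_matching_loss(traj_log_probs, rewards, temperature=1.0):
    """KL(target || policy) over one sampled group of trajectories.

    The policy distribution is the softmax of the trajectory log-probs
    within the group (an assumption made for this sketch).
    """
    target = boltzmann_target(rewards, temperature)
    policy = softmax(traj_log_probs)
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0)
```

Unlike scalar reward maximization, this loss is minimized when the policy spreads probability across trajectories in proportion to exponentiated reward, so lower-reward trajectories keep nonzero mass instead of collapsing onto a single mode.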

RESEARCH · arXiv CS.AI · 8d ago

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

This research introduces a delayed per-step reward attribution method for training language model agents in multi-agent strategic interactions. It addresses the challenge of entangled outcomes by computing rewards at episode end and attributing them back to individual steps, enabling stable and sample-efficient reinforcement learning.
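
One simple way to turn an end-of-episode reward into per-step signals is sketched below. The summary does not describe the paper's attribution rule, so the discounted scheme, the function name, and the `gamma` value here are assumptions used only to show the general shape of the idea.

```python
def attribute_terminal_reward(num_steps: int, terminal_reward: float,
                              gamma: float = 0.95) -> list[float]:
    """Assign each step a share of a reward that is only known at episode end.

    Step t receives terminal_reward * gamma ** (steps remaining after t),
    so later steps are credited more strongly than earlier ones.
    """
    return [terminal_reward * gamma ** (num_steps - 1 - t)
            for t in range(num_steps)]
```

For a three-step episode with `terminal_reward=1.0` and `gamma=0.5`, the steps receive 0.25, 0.5, and 1.0 in order.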

RESEARCH · arXiv CS.CL · 27d ago

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

This paper proposes Verifiable Process Supervision (VPS), a post-training framework to jointly optimize language model prediction accuracy and reasoning quality. VPS uses supervised fine-tuning to induce a structured reasoning format, evaluating intermediate claims against ground-truth signals with adaptive reward weighting.

RESEARCH · arXiv CS.LG · 27d ago

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

The paper introduces Multi-Rollout On-Policy Distillation (MOPD), a framework that uses a student's local rollout group to construct more informative teacher signals for post-training large language models. MOPD conditions the teacher on both successful and failed peer rollouts, leveraging successes for valid reasoning patterns and failures for avoiding plausible mistakes.

RESEARCH · arXiv CS.LG · 27d ago

Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

This paper introduces a communication-efficient reinforcement learning approach where a single policy learns both control inputs and timing decisions, secured by a pointwise Lyapunov safety shield. A run-time assurance layer overrides the policy to provide strictly stronger safety guarantees and achieve significantly higher mean inter-sample intervals on various systems.
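
The override behavior of a run-time assurance layer can be sketched on a toy problem. Everything below is hypothetical and not taken from the paper: the scalar linear system `x_next = a*x + b*u`, the quadratic Lyapunov function, the `decay` threshold, and the proportional backup controller are assumptions, and the sketch leaves out the timing decisions the summary mentions.

```python
def lyapunov(x: float) -> float:
    """Quadratic Lyapunov candidate V(x) = x^2 for a scalar state."""
    return x * x


def shielded_action(x: float, proposed_u: float, a: float = 1.1,
                    b: float = 1.0, decay: float = 0.9,
                    backup_gain: float = 0.8) -> tuple[float, bool]:
    """Return (action, overridden) after a pointwise Lyapunov check.

    The learned policy's action is accepted only if the predicted next
    state satisfies V(x_next) <= decay * V(x). Otherwise a hypothetical
    backup controller u = -backup_gain * x replaces it.
    """
    next_x = a * x + b * proposed_u
    if lyapunov(next_x) <= decay * lyapunov(x):
        return proposed_u, False
    return -backup_gain * x, True
```

With the default parameters, the backup action maps `x` to `0.3 * x`, which satisfies the decrease condition, so the shield always has a safe fallback in this toy setting.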

RESEARCH · arXiv CS.AI · 28d ago

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ is an offline-to-online reinforcement learning objective designed to enhance sample efficiency by leveraging pre-collected datasets. It mitigates issues with inaccurate critics and limited data coverage by using a self-supervised multi-term ranking loss, which enforces structured action ordering and directs the Q-function towards higher-quality actions.
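
A ranking loss over Q-values can be illustrated with a generic pairwise hinge. This is not RankQ's multi-term loss, and the summary does not say how the action ordering is obtained; the `ranking` input, the `margin`, and the function name are assumptions for this sketch.

```python
def pairwise_ranking_loss(q_values: list[float], ranking: list[int],
                          margin: float = 1.0) -> float:
    """Mean hinge loss over all ordered action pairs.

    `ranking` lists action indices from best to worst. Each pair is
    penalized when the better action's Q-value does not exceed the worse
    action's Q-value by at least `margin`.
    """
    loss, pairs = 0.0, 0
    for i, better in enumerate(ranking):
        for worse in ranking[i + 1:]:
            loss += max(0.0, margin - (q_values[better] - q_values[worse]))
            pairs += 1
    return loss / pairs if pairs else 0.0
```

The loss is zero once the Q-function orders the actions as ranked with the required margin, and it grows as the ordering is violated, which is one way to push value estimates towards higher-quality actions.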

RESEARCH · arXiv CS.AI · 27d ago

State-Centric Decision Process

The State-Centric Decision Process (SDP) is a new framework addressing the lack of runtime structure in language environments, such as web browsers, which emit raw text instead of states. It enables an agent to construct missing MDP inputs, like state space and certified transitions, by taking actions and checking observations against natural-language predicates.

RESEARCH · arXiv CS.LG · 21d ago

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

PROWL introduces a KL-constrained adversarial curriculum where a policy exposes high-error trajectories of a diffusion-based world model. This method improves model robustness by focusing on rare, interaction-critical transitions, converting failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation.

RESEARCH · arXiv CS.AI · 12d ago

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

This paper introduces behavior-aware auxiliary corrections for off-policy temporal-difference prediction, aiming to stabilize TD learning with function approximation. It replaces the TDC auxiliary matrix with the behavior Bellman matrix to develop BA-TDC and BA-TDRC, providing a model for auxiliary-geometry design in neural-network value approximation.
