reinforcement learning

154 items

RESEARCH · arXiv CS.CL · 5/7/2026

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

FREIA is a novel reinforcement learning algorithm designed to enhance LLMs for unsupervised reasoning, addressing the lack of adaptability in existing methods. It employs Free Energy-Driven Reward (FER) to balance consensus and exploration, and Adaptive Advantage Shaping (AAS) to adjust learning signals. FREIA outperforms unsupervised baselines across various reasoning tasks, particularly in mathematical reasoning.

LLMs reinforcement learning AI algorithms Reasoning

RESEARCH · arXiv CS.CL · 26d ago

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

This research introduces Inquisitive Conversational Agents (ICAs), which proactively extract information and are tailored to U.S. Supreme Court oral arguments. It proposes a Dual Hierarchical Reinforcement Learning framework that coordinates strategic dialogue management with fine-grained utterance generation and significantly outperforms baselines.

reinforcement learning legal tech dialogue systems Conversational AI

RESEARCH · arXiv CS.LG · 22d ago

Language Game: Talking to Non-Human Systems

This paper explores direct communication with non-human systems (like gene regulatory networks or fungi) recognized as computational substrates, moving beyond LLMs acting as proxies. It proposes a "language game" approach using reinforcement learning with linear interfaces to enable these systems to "speak in their own voice."

reinforcement learning AI communication large language models non-human systems

RESEARCH · arXiv CS.CL · 8d ago

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

This paper proposes CSRP, a three-stage framework for Chinese Grammatical Error Correction (CGEC) using Large Language Models (LLMs). CSRP addresses the challenges of general-purpose models and metric optimization in three stages: continual pre-training, Chain-of-Thought SFT, and policy optimization with efficiency-aware rewards that penalize unnecessary edits. It achieves state-of-the-art performance on the NACGEC benchmark.

reinforcement learning Grammar Correction Natural Language Processing AI research

RESEARCH · arXiv CS.AI · 5/11/2026

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

This paper introduces Weblica, a framework designed to construct reproducible and scalable web environments for visual web agents. It utilizes HTTP-level caching and LLM-based environment synthesis to scale RL training across thousands of diverse environments and tasks, outperforming baselines on web navigation benchmarks.

scalability reinforcement learning Machine Learning AI Agents
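
The HTTP-level caching mentioned in the Weblica entry can be pictured as a record-and-replay layer: the first request for a resource is fetched and stored, and later requests get the stored copy so every training episode sees the same page. The sketch below is an illustration under assumptions, not Weblica's API. The class name `ReplayCache`, the `(method, url)` cache key, and the `fetch` callback are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class ReplayCache:
    """Record each (method, url) response once, then replay it unchanged."""

    def __init__(self, fetch: Callable[[str, str], bytes]) -> None:
        # `fetch` is a hypothetical live fetcher: (method, url) -> response body.
        self._fetch = fetch
        self._store: Dict[Tuple[str, str], bytes] = {}
        self.hits = 0
        self.misses = 0

    def request(self, method: str, url: str) -> bytes:
        key = (method.upper(), url)  # assumed cache key; real systems may add headers or body
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        body = self._fetch(method, url)
        self._store[key] = body
        return body
```

Because replayed responses never change, two rollouts that issue the same requests observe identical content, which is the reproducibility property the summary describes.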

RESEARCH · arXiv CS.LG · 22d ago

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

This paper investigates how action information can be incorporated into the state update function of a recurrent cell within recurrent neural networks (RNNs) for reinforcement learning (RL). The authors discuss several choices and empirically evaluate the resulting architectures on illustrative domains.

State Building reinforcement learning learning Action Encodings
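
To make the question in the action-encodings entry concrete, here are two common ways an action can enter a recurrent state update. This is a minimal sketch under assumptions: the summary does not say which encodings the authors compare, and the function names, the `tanh` cell, and the weight layouts are illustrative choices.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def one_hot(action: int, n_actions: int) -> Vector:
    return [1.0 if i == action else 0.0 for i in range(n_actions)]


def step_concat(W: Matrix, h: Vector, obs: Vector, action: int, n_actions: int) -> Vector:
    """Additive encoding: the one-hot action is concatenated to the cell input."""
    x = h + obs + one_hot(action, n_actions)
    return [math.tanh(v) for v in matvec(W, x)]


def step_per_action(Ws: List[Matrix], h: Vector, obs: Vector, action: int) -> Vector:
    """Multiplicative-style encoding: the action selects its own weight matrix."""
    x = h + obs
    return [math.tanh(v) for v in matvec(Ws[action], x)]
```

In the first variant the action only shifts the pre-activation. In the second it changes how the previous state and observation are combined.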

RESEARCH · arXiv CS.LG · 27d ago

Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

This paper introduces FPILOT, a plug-in inference-time optimization framework for reinforcement learning trading agents. It uses predicted price trajectories to optimize the policy at inference time, before a trade is executed, and is compatible with any pre-trained agent.

Optimization financial trading reinforcement learning AI in finance

RESEARCH · arXiv CS.LG · 28d ago

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in reinforcement learning for diffusion models, which often causes mode collapse and degrades generative diversity. It replaces scalar reward maximization with trajectory-level reward distribution matching, using a Softmax Trajectory Balance objective to align policy probabilities with a reward-induced Boltzmann distribution.

Diffusion Models reinforcement learning AI alignment generative AI
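
The TMPO entry contrasts maximizing a scalar reward with matching a reward-induced Boltzmann distribution. The sketch below shows that contrast on a small group of sampled trajectories. It is a stand-in under assumptions, not the paper's Softmax Trajectory Balance objective: the target is a softmax of rewards over the group, the policy side is a softmax of trajectory log-probabilities, and the loss is the KL divergence between the two. The function names and the `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def _log_softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def boltzmann_targets(rewards: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Target probabilities p_i proportional to exp(r_i / temperature)."""
    return [math.exp(lp) for lp in _log_softmax([r / temperature for r in rewards])]


def distribution_matching_loss(
    traj_logprobs: Sequence[float],
    rewards: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """KL(target || policy), with both sides normalized within the sampled group."""
    log_p = _log_softmax([r / temperature for r in rewards])
    log_q = _log_softmax(list(traj_logprobs))
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Unlike pure reward maximization, this loss is minimized when lower-reward trajectories keep a proportionate share of probability mass, which is the intuition behind preserving diversity.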

RESEARCH · arXiv CS.LG · 8d ago

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

This paper studies tool-calling in large language model (LLM) agents, examining its effectiveness and efficiency. It analyzes evaluation pipelines, showing results are sensitive to implementation choices, and identifies computational waste in reinforcement learning training.

LLMs evaluation reinforcement learning tool-calling

RESEARCH · arXiv CS.AI · 8d ago

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

This research introduces a delayed per-step reward attribution method for training language model agents in multi-agent strategic interactions. It addresses the challenge of entangled outcomes by computing rewards at the end of each episode and attributing them back to individual steps, enabling stable and sample-efficient reinforcement learning.

language models Generalization reinforcement learning Multi-Agent Systems
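
One simple way to read "rewards computed at episode end and assigned back to steps" in the MindGames entry is a discounted return-to-go that is only evaluated once the final outcome is known. The sketch below assumes that reading; the actual attribution rule in the paper may differ. The function name, the `final_outcome` argument, and the default discount are hypothetical.

```python
from typing import List, Sequence


def per_step_returns(
    step_rewards: Sequence[float],
    final_outcome: float,
    gamma: float = 0.99,
) -> List[float]:
    """Discounted return-to-go per step, computed after the episode has ended.

    The episode-level outcome (for example win = 1.0, loss = -1.0) is credited
    to the last step and then flows backward through the discount factor.
    """
    if not step_rewards:
        return []
    rewards = list(step_rewards)
    rewards[-1] += final_outcome
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with no intermediate reward, a win at the end, gamma = 0.5:
print(per_step_returns([0.0, 0.0, 0.0], final_outcome=1.0, gamma=0.5))
# → [0.25, 0.5, 1.0]
```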

RESEARCH · arXiv CS.CL · 27d ago

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

This paper proposes Verifiable Process Supervision (VPS), a post-training framework to jointly optimize language model prediction accuracy and reasoning quality. VPS uses supervised fine-tuning to induce a structured reasoning format, evaluating intermediate claims against ground-truth signals with adaptive reward weighting.

language models reinforcement learning AI training verifiable AI

RESEARCH · arXiv CS.AI · 23d ago

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

ICRL proposes a novel framework to train large language model agents to internalize self-critique, converting feedback into unassisted problem-solving. It jointly trains a solver and a critic from a shared backbone, rewarding the critic for actionable feedback to foster iterative self-improvement.

reinforcement learning learning self-critique large language models

RESEARCH · arXiv CS.LG · 27d ago

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

The paper introduces Multi-Rollout On-Policy Distillation (MOPD), a framework that uses a student's local rollout group to construct more informative teacher signals for post-training large language models. MOPD conditions the teacher on both successful and failed peer rollouts, leveraging successes for valid reasoning patterns and failures for avoiding plausible mistakes.

distillation reinforcement learning AI training Machine Learning

RESEARCH · arXiv CS.LG · 27d ago

Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

This paper introduces a communication-efficient reinforcement learning approach where a single policy learns both control inputs and timing decisions, secured by a pointwise Lyapunov safety shield. A run-time assurance layer overrides the policy to provide strictly stronger safety guarantees and achieve significantly higher mean inter-sample intervals on various systems.

reinforcement learning Machine Learning safety-critical-ai Control Systems

RESEARCH · arXiv CS.AI · 28d ago

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ is an offline-to-online reinforcement learning objective designed to enhance sample efficiency by leveraging pre-collected datasets. It mitigates issues with inaccurate critics and limited data coverage by using a self-supervised multi-term ranking loss, which enforces structured action ordering and directs the Q-function towards higher-quality actions.

Offline-to-Online Learning Action Ranking reinforcement learning self-supervised learning
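
The RankQ entry describes a ranking loss that enforces an ordering over actions in the Q-function. A generic pairwise hinge loss gives the flavor. This is a sketch under assumptions, not RankQ's multi-term objective: it assumes the caller already has Q-values for one state, listed from the action that should rank highest to the one that should rank lowest. The function name and `margin` parameter are hypothetical.

```python
from typing import Sequence


def pairwise_ranking_loss(q_best_first: Sequence[float], margin: float = 0.1) -> float:
    """Sum of hinge penalties over all pairs that violate the intended order.

    q_best_first[i] should exceed q_best_first[j] by at least `margin`
    whenever i < j; each shortfall contributes linearly to the loss.
    """
    loss = 0.0
    n = len(q_best_first)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, margin - (q_best_first[i] - q_best_first[j]))
    return loss
```

A correctly ordered list with gaps of at least `margin` yields zero loss, so the term only pushes on Q-values whose ordering is wrong or too close.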

RESEARCH · arXiv CS.AI · 27d ago

State-Centric Decision Process

The State-Centric Decision Process (SDP) is a new framework addressing the lack of runtime structure in language environments, such as web browsers, which emit raw text instead of states. It enables an agent to construct missing MDP inputs, like state space and certified transitions, by taking actions and checking observations against natural-language predicates.

Decision Processes reinforcement learning Natural Language Processing AI Agents

RESEARCH · arXiv CS.AI · 23d ago

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

This paper introduces SDOF, a framework that treats multi-agent execution as a constrained state machine to enforce real business process constraints. It incorporates an RLHF-trained intent router and a state-aware dispatcher, outperforming GPT-4o on an adversarial routing benchmark in a recruitment system.

hiring AI frameworks reinforcement learning orchestration
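
Treating multi-agent execution as a constrained state machine, as the SDOF entry puts it, means a dispatcher only forwards a request when the current process state permits the transition. The sketch below illustrates that idea under assumptions: the recruitment states, the `TRANSITIONS` table, and the `StateConstrainedDispatcher` class are hypothetical and not taken from the paper.

```python
from typing import Dict, Set

# Hypothetical recruitment workflow: each state lists the states it may move to.
TRANSITIONS: Dict[str, Set[str]] = {
    "applied": {"screening", "rejected"},
    "screening": {"interview", "rejected"},
    "interview": {"offer", "rejected"},
    "offer": set(),
    "rejected": set(),
}


class StateConstrainedDispatcher:
    """Refuse any requested step that the workflow does not allow from the current state."""

    def __init__(self, state: str = "applied") -> None:
        if state not in TRANSITIONS:
            raise ValueError(f"unknown state: {state}")
        self.state = state

    def dispatch(self, requested_state: str) -> bool:
        """Apply the transition if it is allowed; otherwise leave the state unchanged."""
        if requested_state in TRANSITIONS[self.state]:
            self.state = requested_state
            return True
        return False
```

With this structure, a router that misreads an adversarial request (say, jumping straight from `applied` to `offer`) is stopped by the dispatcher instead of by the language model.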

RESEARCH · arXiv CS.LG · 21d ago

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

PROWL introduces a KL-constrained adversarial curriculum where a policy exposes high-error trajectories of a diffusion-based world model. This method improves model robustness by focusing on rare, interaction-critical transitions, converting failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation.

reinforcement learning model learning security World Models

RESEARCH · arXiv CS.AI · 12d ago

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

This paper introduces behavior-aware auxiliary corrections for off-policy temporal-difference prediction, aiming to stabilize TD learning with function approximation. It replaces the TDC auxiliary matrix with the behavior Bellman matrix to develop BA-TDC and BA-TDRC, providing a model for auxiliary-geometry design in neural-network value approximation.

neural networks reinforcement learning learning temporal-difference learning
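
For orientation on the TDC entry, the baseline that BA-TDC and BA-TDRC modify is the standard linear TDC update (TD with gradient correction), which keeps a primary weight vector and an auxiliary one. The sketch below shows that textbook baseline only, not the behavior-aware variants, whose details are not given in the summary. Parameter names such as `rho` (importance sampling ratio), `alpha`, and `beta` are conventional choices.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tdc_step(
    theta: Sequence[float],      # primary value weights
    w: Sequence[float],          # auxiliary weights
    phi: Sequence[float],        # features of the current state
    phi_next: Sequence[float],   # features of the next state
    reward: float,
    rho: float,                  # importance sampling ratio pi(a|s) / b(a|s)
    gamma: float = 0.99,
    alpha: float = 0.01,
    beta: float = 0.05,
) -> Tuple[List[float], List[float]]:
    """One linear TDC update; returns the new (theta, w)."""
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    w_phi = dot(w, phi)
    # TD step on theta plus the gradient-correction term along phi_next.
    new_theta = [
        th + alpha * rho * (delta * p - gamma * w_phi * pn)
        for th, p, pn in zip(theta, phi, phi_next)
    ]
    # The auxiliary weights track the expected TD error given the features.
    new_w = [wi + beta * rho * (delta - w_phi) * p for wi, p in zip(w, phi)]
    return new_theta, new_w
```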

RESEARCH · arXiv CS.LG · 12d ago

Self-Play Reinforcement Learning under Imperfect Information in Big 2

This study develops a self-play reinforcement learning framework for the imperfect-information card game Big 2. It demonstrates that PPO outperforms other value-approximating agents and benefits from entropy regularization and current-policy self-play.

reinforcement learning learning self-play imperfect-information-games
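
The Big 2 entry credits entropy regularization as one ingredient that helps PPO. The sketch below shows the usual shape of that loss: the clipped surrogate objective plus an entropy bonus. It is a minimal illustration under assumptions; the study's actual hyperparameters and network are not given in the summary, and the defaults for `clip_eps` and `entropy_coef` are placeholders.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one action distribution (zero-probability actions are skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_loss(
    new_logps: Sequence[float],              # log pi_new(a|s) for the taken actions
    old_logps: Sequence[float],              # log pi_old(a|s) for the same actions
    advantages: Sequence[float],
    action_dists: Sequence[Sequence[float]], # full pi_new(.|s) per sample
    clip_eps: float = 0.2,
    entropy_coef: float = 0.01,
) -> float:
    """Negative clipped surrogate minus an entropy bonus (lower is better)."""
    surrogates = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogates.append(min(ratio * adv, clipped * adv))
    mean_surrogate = sum(surrogates) / len(surrogates)
    mean_entropy = sum(entropy(d) for d in action_dists) / len(action_dists)
    return -mean_surrogate - entropy_coef * mean_entropy
```

A larger `entropy_coef` rewards policies that keep several plays viable, which discourages premature commitment to one line of play when opponents' hands are hidden.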