reinforcement learning

154 items

RESEARCH · arXiv CS.AI · 15d ago

Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

This paper introduces "Quantum Frog", a two-player cooperative game built on a novel quantized-time mechanic, inspired by Frogger. It uses reinforcement learning to analyze game difficulty scaling, optimal policies, and emergent cooperative strategies.

reinforcement learning Multi-Agent Systems game theory Cooperative AI

RESEARCH · arXiv CS.AI · 9d ago

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

This paper proposes an uncertainty-aware framework for reinforcement learning in autonomous driving, leveraging expert advice to guide exploration safely while avoiding long-term dependence. It employs adaptive thresholds for advice triggering and a commitment-cooldown strategy to regulate guidance, demonstrating improved performance in CARLA simulations.

reinforcement learning autonomous driving Exploration uncertainty

RESEARCH · arXiv CS.AI · 16d ago

NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic

NeuroNL2LTL is a neurosymbolic architecture that unifies learned translation with formal verification to translate natural language into Linear Temporal Logic. It employs verifier-in-the-loop training, where verification outcomes serve as reward signals for reinforcement learning, optimizing for formal correctness.

reinforcement learning Neurosymbolic AI Formal verification Natural Language Processing

RESEARCH · DEV.to AI · 5/3/2026

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

This paper proposes a reinforcement learning method that incentivizes search behavior in Large Language Models (LLMs), with the aim of improving their search capability.

LLMs reinforcement learning Machine Learning Search

RESEARCH · DEV.to AI · 4/21/2026

Multi-Objective Deep Reinforcement Learning

This content explores the field of Multi-Objective Deep Reinforcement Learning. It likely delves into techniques for training AI agents to optimize multiple performance criteria concurrently.

Optimization deep learning reinforcement learning

ARTICLE · Hugging Face Blog · 7d ago

Direct Preference Optimization Beyond Chatbots

This article explores Direct Preference Optimization (DPO), a method for aligning AI models with human preferences, examining its potential applications beyond traditional chatbots. It delves into how DPO can be utilized in various AI domains.

language models reinforcement learning learning DPO

RESEARCH · arXiv CS.LG · 4/30/2026

A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication

This paper surveys GNN-based communication in multi-agent reinforcement learning (MARL), noting a lack of explicit structure in existing approaches. It proposes a generalized GNN-based communication process to make the underlying concepts more obvious and accessible.

reinforcement learning Graph Neural Networks Multi-Agent Systems

RESEARCH · arXiv CS.LG · 5/6/2026

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

This paper examines the impact of systematic verification errors on Reinforcement Learning with Verifiable Rewards (RLVR), a method used to enhance the reasoning capabilities of large language models. Unlike prior analyses that treated errors as random, this work shows that systematic errors can lead models to learn unwanted behaviors. Experiments on arithmetic tasks reveal that systematic false negatives have similar effects to random noise, while systematic false positives can have more complex impacts.

reinforcement learning AI Errors Verification large language models

RESEARCH · arXiv CS.LG · 5/6/2026

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. It formalizes rollout pipelines with a unified notation and introduces the Generate-Filter-Control-Replay (GFCR) lifecycle taxonomy, decomposing pipelines into four modular stages.

Rollout Strategies reinforcement learning Machine Learning AI research

RESEARCH · arXiv CS.CL · 29d ago

AIPO: Learning to Reason from Active Interaction

AIPO is a novel reinforcement learning framework that enhances LLM reasoning through active multi-agent interaction during exploration. It addresses the limitations of existing RL algorithms, which are constrained by the policy model's inherent capabilities and rely on sample-inefficient guidance.

LLMs reinforcement learning learning AI Reasoning

ARTICLE · Together AI Blog · 4/24/2026

Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

DAS (distribution-aware speculative decoding) addresses the rollout bottleneck in RL post-training. It accelerates rollouts by up to 50% without compromising reward quality.

Optimization AI acceleration reinforcement learning Machine Learning

RESEARCH · arXiv CS.LG · 4/6/2026

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

This paper presents PRISM, a reinforcement learning framework that grounds agent decisions in discrete, causally validated concepts and uses them as an interface for zero-shot transfer. It shows that these concepts directly drive agent behavior and that a concept's importance can be decoupled from how often it is used.

Strategy Mapping reinforcement learning Transfer Learning interpretability

RESEARCH · arXiv CS.CL · 4/6/2026

Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

This paper proposes a reinforcement learning (RL) framework that uses an LLM as a judge to generate rewards, enabling knowledge distillation without the need for ground-truth labels. The approach shows substantial performance gains on mathematical reasoning benchmarks, suggesting that LLM-based evaluators can produce effective training signals.

language models Unlabeled Data Knowledge Distillation Math Reasoning

RESEARCH · arXiv CS.LG · 4/6/2026

Contextual Intelligence: The Next Leap for Reinforcement Learning

This paper addresses the generalization limits of reinforcement learning (RL), where learned policies fail outside the training distribution. It proposes a new taxonomy of contexts (allogenic and autogenic) and identifies key research directions for developing true contextual intelligence.

Generalization Contextual Intelligence reinforcement learning Taxonomy

RESEARCH · arXiv CS.LG · 4/6/2026

LLM Reasoning with Process Rewards for Outcome-Guided Steps

This paper presents PROGRS, a framework for improving mathematical reasoning in LLMs that combines process reward models (PRMs) with a priority on the correctness of the final result. It targets the problem of PRMs rewarding fluent intermediate reasoning that nonetheless leads to incorrect answers, so that learning is driven by better-aligned feedback.

mathematical reasoning Process Rewards reinforcement learning AI

RESEARCH · arXiv CS.AI · 20d ago

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

This paper introduces OSCToM, an approach for modeling nested belief conflicts in LLM-based Theory of Mind tasks. It combines reinforcement learning and compositional surrogate models to generate these conflicts, with OSCToM-8B showing the best results in experiments.

LLMs reinforcement learning AI research Theory of Mind

RESEARCH · arXiv CS.AI · 20d ago

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

SOLAR is an autonomous AI agent designed to overcome challenges of large language models in dynamic environments by enabling lifelong learning and continual adaptation. It uses parameter-level meta-learning and multi-level reinforcement learning to self-improve and discover adaptation strategies.

Meta-Learning reinforcement learning learning Lifelong Learning

RESEARCH · arXiv CS.AI · 20d ago

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Mahjax is a new fully vectorized Riichi Mahjong environment implemented in JAX, designed to enable large-scale rollout parallelization on GPUs for reinforcement learning research. It facilitates tabula rasa learning and includes a high-quality visualization tool for debugging trained agents.

reinforcement learning learning GPU Mahjong

RESEARCH · Hugging Face Blog · 4/16/2026

Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents

This research introduces Ecom-RLVE, a framework designed to create adaptive and verifiable environments for conversational agents operating in e-commerce. It focuses on developing robust and reliable AI systems for online shopping interactions.

reinforcement learning Adaptive systems verifiable AI e-commerce

RESEARCH · DEV.to AI · 4/21/2026

Learning to be Safe: Deep RL with a Safety Critic

This content explores a novel approach to Deep Reinforcement Learning by integrating a "safety critic" to prevent unsafe actions. The methodology aims to enhance the reliability and robustness of AI agents, making them suitable for real-world deployment where safety is critical.

deep learning reinforcement learning security Machine Learning