reinforcement learning

154 items

RESEARCH · arXiv CS.AI · 9d ago

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

This paper proposes an uncertainty-aware framework for reinforcement learning in autonomous driving, leveraging expert advice to guide exploration safely while avoiding long-term dependence. It employs adaptive thresholds for advice triggering and a commitment-cooldown strategy to regulate guidance, demonstrating improved performance in CARLA simulations.

27
RESEARCH · arXiv CS.AI · 16d ago

NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic

NeuroNL2LTL is a neurosymbolic architecture that unifies learned translation with formal verification to translate natural language into Linear Temporal Logic. It employs verifier-in-the-loop training, where verification outcomes serve as reward signals for reinforcement learning, optimizing for formal correctness.

27
RESEARCH · arXiv CS.LG · 5/6/2026

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

This paper examines the impact of systematic verification errors on Reinforcement Learning with Verifiable Rewards (RLVR), a method used to enhance the reasoning capabilities of large language models. Unlike prior analyses that treated errors as random, this work shows that systematic errors can lead models to learn unwanted behaviors. Experiments on arithmetic tasks reveal that systematic false negatives have similar effects to random noise, while systematic false positives can have more complex impacts.

27
RESEARCH · arXiv CS.LG · 5/6/2026

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. It formalizes rollout pipelines with a unified notation and introduces the Generate-Filter-Control-Replay (GFCR) lifecycle taxonomy, decomposing pipelines into four modular stages.

27
RESEARCH · arXiv CS.LG · 4/6/2026

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

The paper presents PRISM, a reinforcement learning framework that grounds agent decisions in discrete, causally validated concepts and uses them as an interface for zero-shot transfer. It shows that these concepts directly drive agent behavior and that a concept's importance can be decoupled from how frequently it is used.

27
RESEARCH · arXiv CS.CL · 4/6/2026

Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

This paper proposes a reinforcement learning (RL) framework that uses an LLM as a judge to generate rewards, enabling knowledge distillation without ground-truth labels. The approach shows substantial performance gains on mathematical reasoning benchmarks, suggesting that LLM-based evaluators can produce effective training signals.

27
RESEARCH · arXiv CS.LG · 4/6/2026

LLM Reasoning with Process Rewards for Outcome-Guided Steps

This work presents PROGRS, a framework for improving mathematical reasoning in LLMs that combines process reward models (PRMs) with a priority on the correctness of the final answer. It aims to address the problem of PRMs rewarding fluent intermediate reasoning that nonetheless leads to incorrect answers, optimizing learning with better-aligned feedback.

27