Policy optimization

6 items

RESEARCH · arXiv CS.LG · 19d ago

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

This paper introduces GROW, an RL framework for open-world VLM agents that addresses limitations of existing Supervised Fine-Tuning methods. It adapts Group Relative Policy Optimization (GRPO) by decomposing trajectories into state-action samples rather than treating each trajectory as a single unit.

VLM Agents Policy optimization Open-world AI reinforcement learning

DOC · AWS Machine Learning Blog · 5/7/2026

Overcoming reward signal challenges: Verifiable rewards-based reinforcement learning with GRPO on SageMaker AI

This post details the implementation of verifiable rewards-based reinforcement learning (RLVR) to enhance training performance by ensuring transparency and correctness in reward signals. It covers techniques like GRPO and few-shot examples, demonstrated with the GSM8K dataset for improving math problem-solving accuracy.
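As a rough illustration of the idea summarized above, the sketch below pairs a verifiable (rule-checked) reward with the group-relative advantage that GRPO is generally described as using: each sampled answer's reward is normalized by the mean and standard deviation of its group. It is not the post's SageMaker implementation. The function names, the "last number in the completion" answer-extraction rule, and the sample completions are all assumptions made for illustration.

```python
import re
import statistics


def verifiable_reward(completion: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the last number in the
    completion equals the reference answer, otherwise 0.0."""
    # Drop thousands separators, then collect every integer or decimal.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four illustrative completions sampled for one prompt (made-up data).
completions = ["The answer is 42", "So we get 41.", "I think 42.0", "no idea"]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.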

Policy optimization reinforcement learning learning AI training

RESEARCH · arXiv CS.AI · 4/13/2026

StaRPO: Stability-Augmented Reinforcement Policy Optimization

StaRPO is a novel reinforcement learning framework designed to improve the logical consistency and structural coherence of large language models in complex reasoning tasks. It explicitly incorporates stability metrics, such as Autocorrelation Function and Path Efficiency, to evaluate local step-to-step coherence and global goal-directedness of the reasoning process.

Policy optimization LLMs reinforcement learning Reasoning

RESEARCH · arXiv CS.CL · 13d ago

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

RICE-PO is a novel critic-free policy optimization framework addressing the credit-assignment challenge in interactive language agents. It converts retrieval interactions into localized learning signals, evaluating executable actions and propagating credit to latent reasoning steps.

Policy optimization reinforcement learning Retrieval systems AI agents

RESEARCH · arXiv CS.CL · 5/7/2026

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

This research introduces Adaptive Power-Mean Policy Optimization (APMPO) to improve Large Language Model (LLM) reasoning capabilities within Reinforcement Learning with Verifiable Rewards (RLVR). APMPO combines a generalized power-mean objective and feedback-adaptive clipping to enhance learning dynamics and performance, addressing limitations of static optimization schemes.

Policy optimization LLMs reinforcement learning machine learning

RESEARCH · Qwen Blog · 7/27/2025

GSPO: Towards Scalable Reinforcement Learning for Language Models

Reinforcement learning is crucial for scaling language models, but existing algorithms suffer from instability and model collapse. To address this and enable successful scaling, the Group Sequence Policy Optimization (GSPO) algorithm is proposed.
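The summary does not spell out the mechanism. As commonly described, GSPO replaces per-token importance ratios with a single length-normalized ratio per sequence and applies clipping at that level. The sketch below assumes that formulation. The function names, the list-of-log-probabilities inputs, and the clip range `eps=0.2` are illustrative choices, not values from the post.

```python
import math


def sequence_importance_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level ratio:
    exp(mean_t(log pi_new(y_t) - log pi_old(y_t))),
    i.e. (pi_new(y|x) / pi_old(y|x)) ** (1 / len(y))."""
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("expected two non-empty lists of equal length")
    log_ratio_sum = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(log_ratio_sum / len(new_logprobs))


def clipped_sequence_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence.
    eps is an illustrative clip range, assumed for this sketch."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# Made-up per-token log-probabilities for one three-token response.
old = [-1.2, -0.7, -2.0]
new = [-1.1, -0.6, -1.9]
ratio = sequence_importance_ratio(new, old)
objective = clipped_sequence_objective(ratio, advantage=1.0)
```

Because the ratio is averaged over tokens in log space, one outlier token shifts it less than it would shift a per-token ratio. That averaging is the intuition usually given for the stability benefit.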

Scalability Policy optimization language models reinforcement learning