Policy optimization

6 items

DOC · AWS Machine Learning Blog · 5/7/2026

Overcoming reward signal challenges: Verifiable rewards-based reinforcement learning with GRPO on SageMaker AI

This post shows how to implement reinforcement learning with verifiable rewards (RLVR), where rewards are checked against known-correct answers so that the training signal is transparent and accurate. It covers Group Relative Policy Optimization (GRPO) and few-shot examples, and uses the GSM8K dataset to demonstrate improved accuracy on math problems.
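
As a rough illustration of the idea in this summary (not the blog post's own code), the sketch below scores a group of sampled completions against a GSM8K-style reference and converts the rewards into group-relative advantages, the normalization GRPO is known for. The function names `extract_number`, `verifiable_reward`, and `group_relative_advantages` are hypothetical, and the sketch assumes the reference answer ends with the `#### <number>` marker used in GSM8K.

```python
import re
from statistics import mean, pstdev


def extract_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Drop thousands separators such as "1,000" before converting.
    return float(matches[-1].replace(",", ""))


def verifiable_reward(completion, reference):
    """Binary reward: 1.0 if the completion's final number matches the reference.

    Assumes the reference ends with "#### <number>" (GSM8K convention).
    """
    gold = extract_number(reference.split("####")[-1])
    pred = extract_number(completion)
    if gold is None or pred is None:
        return 0.0
    return 1.0 if abs(pred - gold) < 1e-6 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "She sells 9 eggs at $2 each. 9 * 2 = 18 #### 18"
    completions = [
        "9 * 2 = 18, so the answer is 18",
        "The answer is 20",
        "She earns 18.0 dollars",
        "I am not sure",
    ]
    rewards = [verifiable_reward(c, reference) for c in completions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward is computed by a deterministic check rather than a learned model, each score can be audited by hand. When every completion in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.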

RESEARCH · arXiv CS.AI · 4/13/2026

StaRPO: Stability-Augmented Reinforcement Policy Optimization

StaRPO is a novel reinforcement learning framework designed to improve the logical consistency and structural coherence of large language models in complex reasoning tasks. It explicitly incorporates stability metrics, such as Autocorrelation Function and Path Efficiency, to evaluate local step-to-step coherence and global goal-directedness of the reasoning process.

RESEARCH · arXiv CS.CL · 5/7/2026

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

This research introduces Adaptive Power-Mean Policy Optimization (APMPO) to improve Large Language Model (LLM) reasoning capabilities within Reinforcement Learning with Verifiable Rewards (RLVR). APMPO combines a generalized power-mean objective and feedback-adaptive clipping to enhance learning dynamics and performance, addressing limitations of static optimization schemes.
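
For readers unfamiliar with the term, the sketch below computes the standard generalized power mean that the objective's name refers to. The summary does not say which quantities APMPO averages or how the exponent is adapted, so this shows only the mathematical building block; the function name `power_mean` is hypothetical.

```python
import math


def power_mean(values, p):
    """Generalized power mean of positive numbers.

    p = 1 gives the arithmetic mean, p = -1 the harmonic mean,
    and p = 0 is defined as the limit, the geometric mean.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("power_mean needs a non-empty list of positive values")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


if __name__ == "__main__":
    xs = [1.0, 4.0]
    for p in (-1, 0, 1, 2):
        print(p, power_mean(xs, p))
```

The power mean is non-decreasing in `p`, so lowering the exponent gives more weight to small values and raising it gives more weight to large ones. That makes the exponent a natural knob for a scheme that adapts its objective during training.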
