Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
arXiv CS.CL · May 7, 2026
This research introduces Adaptive Power-Mean Policy Optimization (APMPO) to improve Large Language Model (LLM) reasoning capabilities within Reinforcement Learning with Verifiable Rewards (RLVR). APMPO combines a generalized power-mean objective and feedback-adaptive clipping to enhance learning dynamics and performance, addressing limitations of static optimization schemes.
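The summary does not state APMPO's exact objective, so the following is background only: a minimal Python sketch of the standard generalized (Hölder) power mean, the family that a "power-mean objective" presumably draws on. The helper name `power_mean` and the sample values are hypothetical, and nothing here should be read as the paper's implementation.

```python
import math


def power_mean(values, p):
    """Generalized (Hoelder) power mean of positive numbers.

    p = 1 gives the arithmetic mean, p = 0 (as a limit) the geometric
    mean, and p = -1 the harmonic mean. Hypothetical helper for
    illustration; not taken from the APMPO paper.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if p == 0:
        # Limit case p -> 0: geometric mean, computed in log space.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


if __name__ == "__main__":
    sample = [1.0, 4.0]  # illustrative values only
    for p in (-1, 0, 1, 2):
        print(p, power_mean(sample, p))
```

The exponent `p` interpolates between averages that emphasize small values (negative `p`) and averages that emphasize large values (positive `p`), which is one plausible reason to make such an exponent adaptive rather than fixed.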