
Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

arXiv CS.LG · May 19, 2026

This research addresses poor credit assignment in reinforcement learning for multi-step reasoning with large language models, where sparse terminal rewards produce high gradient variance and unstable training. It proposes a counterfactual comparison-based framework and Implicit Behavior Policy Optimization (IBPO) to create step-sensitive learning signals, which the authors report improves training stability and performance.
