Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
arXiv cs.CL · April 27, 2026
This paper investigates whether outcome rewards in reinforcement learning for chain-of-thought reasoning guarantee verifiable or causally important reasoning in LLMs. The authors introduce two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), and find that reinforcement learning with verifiable rewards (RLVR) improves accuracy but does not reliably improve CIR or SR. They also find that a small amount of supervised fine-tuning (SFT) can remedy these shortfalls.
Reinforcement Learning · AI Training · Large Language Models (LLMs) · Model Evaluation · Chain-of-Thought Reasoning