
Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

arXiv cs.CL · April 27, 2026

This paper investigates whether outcome rewards in reinforcement learning for chain-of-thought reasoning guarantee verifiable or causally important reasoning in LLMs. The authors introduce two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), and find that although reinforcement learning with verifiable rewards (RLVR) improves accuracy, it does not reliably improve CIR or SR. They also find that a small amount of supervised fine-tuning (SFT) can remedy these issues.

Reinforcement Learning · AI Training · Large Language Models (LLMs) · Model Evaluation · Chain-of-Thought Reasoning
