Chain-of-Thought Reasoning — AI articles, news & research

RESEARCH · arXiv cs.CL · 4/27/2026

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

This paper investigates whether outcome rewards, as used in reinforcement learning for chain-of-thought reasoning, guarantee verifiable or causally important reasoning in LLMs. The authors introduce two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), and find that while reinforcement learning with verifiable rewards (RLVR) improves accuracy, it does not reliably improve CIR or SR. They also report that a small amount of supervised fine-tuning (SFT) can remedy these issues.

Reinforcement Learning · AI Training · Large Language Models (LLMs) · Model Evaluation