
AI training

43 items

RESEARCH·arXiv CS.CL·4/27/2026

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

This paper investigates whether outcome rewards in reinforcement learning for chain-of-thought reasoning guarantee verifiable or causally important reasoning in LLMs. The authors introduce two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), and find that reinforcement learning with verifiable rewards (RLVR) improves accuracy but does not reliably improve CIR or SR. A small amount of supervised fine-tuning (SFT) can remedy these issues.

27
RESEARCH·arXiv CS.LG·21d ago

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

This paper addresses poor credit assignment in reinforcement learning for multi-step reasoning with large language models, where sparse terminal rewards lead to high gradient variance and unstable training. It proposes a counterfactual comparison-based framework and Implicit Behavior Policy Optimization (IBPO) to create step-sensitive learning signals, which the authors report significantly improves training stability and performance.

27
RESEARCH·arXiv CS.CL·26d ago

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

This paper proposes Verifiable Process Supervision (VPS), a post-training framework that jointly optimizes a language model's prediction accuracy and reasoning quality. VPS uses supervised fine-tuning to induce a structured reasoning format and evaluates intermediate claims against ground-truth signals, with adaptive reward weighting.

27
RESEARCH·arXiv CS.LG·26d ago

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

The paper introduces Multi-Rollout On-Policy Distillation (MOPD), a framework for post-training large language models that uses a student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts, using the successes as examples of valid reasoning patterns and the failures as examples of plausible mistakes to avoid.

27
RESEARCH·arXiv CS.CL·4/6/2026

Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Training

This study presents LLMimic, a gamified, interactive tutorial that lets participants simulate training an LLM in order to increase their AI literacy. The research evaluates how this proactive intervention mitigates AI persuasion in realistic scenarios, such as donations or recommendations, compared with a control group.

27
ARTICLE·DEV.to AI·14d ago

Understanding Reinforcement Learning with Human Feedback Part 6: How the Reward Model Trains the Original Model

This article, part of a series on Reinforcement Learning with Human Feedback (RLHF), details how a pre-trained reward model is used to train the original AI model. The original model generates responses to new prompts, the reward model scores those responses, and the original model uses that feedback signal to learn to generate more helpful, human-aligned outputs.

24