← heapsort-ai

post-training

4 items

RESEARCHarXiv CS.CL·4/15/2026

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Self-Distillation Zero (SD-Zero) is a novel post-training method designed to be more training sample-efficient than traditional reinforcement learning, without requiring external teachers or high-quality demonstrations. It operates by having a single model act as both a Generator and a Reviser, using the Reviser's improved responses and token distributions to provide dense supervision for the Generator through on-policy self-distillation.

27
RESEARCHarXiv CS.AI·28d ago

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

This research proposes distinguishing between capability elicitation and capability creation in large language model post-training. It argues that elicitation reweights existing behaviors within a model's accessible support, while creation changes that support itself, developing this through a free-energy view.

27