
training methods


RESEARCH · trending · Reddit r/MachineLearning · 4/22/2026

Training-time intervention yields 63.4% blind-pair human preference at matched val-loss (1.2B params, 320 judgments, p = 1.98 × 10⁻⁵) [R]

A training-time intervention for 1.2B-parameter language models, combining a precision-weighted gain function with divergence-scaled gradients, was preferred over standard training in 63.4% of 320 blind pairwise human judgments (p = 1.98 × 10⁻⁵). The preference shift appeared at matched aggregate validation loss, which suggests that training-time interventions other than RLHF can change human-perceived quality in ways validation loss does not register.
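As a rough sanity check on the headline numbers, the sketch below runs an exact two-sided binomial (sign) test on 63.4% of 320 judgments against a 50% null. The post does not say which test produced its p-value, so the test choice, the assumption that all 320 judgments are independent, and the names `binomial_two_sided_p`, `n_judgments`, and `wins` are illustrative assumptions, not the authors' method. Clustering by rater or prompt, ties, or a different test could give a different figure, so this need not reproduce the reported p = 1.98 × 10⁻⁵.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided binomial test against a null win rate of 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided
    p-value is twice the tail at least as extreme as the observed count.
    """
    k = max(wins, n - wins)  # fold onto the upper tail
    upper_tail = sum(comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * upper_tail / 2 ** n)


# Figures taken from the post title; the win count is inferred by rounding.
n_judgments = 320
wins = round(0.634 * n_judgments)

# Assumption: every judgment is an independent Bernoulli trial.
p_value = binomial_two_sided_p(wins, n_judgments)
print(f"wins={wins}/{n_judgments}, two-sided p={p_value:.3g}")
```

If judgments are grouped (several per rater or per prompt), a test that accounts for that grouping would be the more defensible choice, and it would generally yield a larger p-value than this independence-based sketch.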
