
Distributionally Robust Token Optimization in RLHF

arXiv CS.LG · April 13, 2026

To address LLMs' susceptibility to failures from small prompt shifts, especially in multi-step reasoning, researchers propose Distributionally Robust Token Optimization (DRTO). This approach combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO) to enhance consistency under distribution shifts, showing improvements on mathematical reasoning benchmarks.
