Distributionally Robust Token Optimization in RLHF
arXiv cs.LG · April 13, 2026
Large language models can fail when a prompt shifts only slightly, especially in multi-step reasoning. To address this, researchers propose Distributionally Robust Token Optimization (DRTO), which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO) to improve consistency under distribution shift. The authors report improvements on mathematical reasoning benchmarks.
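The summary does not state DRTO's actual objective, so the following is only a generic illustration of the DRO idea it builds on, not the paper's method. It assumes a KL-penalized uncertainty set, for which the worst-case expected loss has the well-known closed form tau * log(mean(exp(loss / tau))). The function names, the temperature `tau`, and the per-token loss values are all hypothetical.

```python
import math

def kl_dro_loss(losses, tau):
    """Soft worst-case loss under a KL penalty: tau * log(mean(exp(l / tau))).

    Generic DRO sketch, not the DRTO objective from the paper.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [l / tau for l in losses]
    m = max(scaled)  # subtract the max for numerical stability
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return tau * log_mean_exp

def worst_case_weights(losses, tau):
    """Adversarial reweighting that attains the soft worst case (a softmax of l / tau)."""
    scaled = [l / tau for l in losses]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical per-token losses for one response; the third token is the hard one.
token_losses = [0.2, 0.1, 1.5, 0.3]
average = sum(token_losses) / len(token_losses)
robust = kl_dro_loss(token_losses, tau=0.5)
weights = worst_case_weights(token_losses, tau=0.5)
```

With a small `tau` the robust loss moves toward the largest token loss, and with a large `tau` it approaches the plain average, so `tau` controls how much weight the hardest tokens receive.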