RESEARCH

Distributionally Robust Token Optimization in RLHF

arXiv cs.LG · April 13, 2026

Large language models (LLMs) can fail when a prompt shifts only slightly, especially in multi-step reasoning. To address this, researchers propose Distributionally Robust Token Optimization (DRTO), which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO) to improve consistency under distribution shift. The authors report improvements on mathematical reasoning benchmarks.

DRO, LLMs, RLHF, Distributionally Robust Optimization, large language models
