
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

arXiv CS.AI·May 4, 2026

TUR-DPO is a topology- and uncertainty-aware variant of Direct Preference Optimization (DPO) for aligning large language models (LLMs) with human preferences. It extends DPO with reasoning-topology and uncertainty signals, so that training rewards how an answer is derived, not only what it says.
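For orientation, here is a minimal sketch of the standard DPO loss for a single preference pair, which is the objective TUR-DPO builds on. The summary does not give the TUR-DPO objective itself, so the `weight` argument is a purely hypothetical placeholder for where a per-pair signal (for example, an uncertainty score) could enter; it should not be read as the paper's method. The function name, argument names, and the default `beta` are likewise assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, weight=1.0):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each *_logp is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    `weight` is a HYPOTHETICAL per-pair scale, not taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))

    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin > 0:
        nll = math.log1p(math.exp(-margin))
    else:
        nll = -margin + math.log1p(math.exp(margin))

    return weight * nll


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy shifted toward the chosen response: loss drops below the baseline.
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

With `weight=1.0` this reduces to plain DPO, which only compares final responses; any topology- or uncertainty-aware term would have to be defined by the paper itself.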

Tags: reinforcement learning, DPO, AI alignment, machine learning, LLM
