
$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin

arXiv cs.LG · May 13, 2026

This paper introduces $\xi$-DPO, a direct preference optimization method built on a ratio reward margin, to ease the hyperparameter tuning that SimPO requires. The authors analyze SimPO and reformulate its preference objective so that it is more interpretable across datasets with varying reward gap structures.
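
For context on the tuning problem mentioned above, the following is a minimal sketch of the SimPO baseline loss for a single preference pair, as it is commonly written: a length-normalized log-probability gap scaled by `beta`, minus a fixed target margin `gamma`. It does not implement $\xi$-DPO, whose ratio reward margin is not spelled out in this summary. The function names, argument names, and default values are illustrative assumptions, not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(
    logp_chosen: float,    # summed token log-probs of the preferred response
    logp_rejected: float,  # summed token log-probs of the rejected response
    len_chosen: int,       # token count of the preferred response
    len_rejected: int,     # token count of the rejected response
    beta: float = 2.0,     # reward scale (illustrative default, an assumption)
    gamma: float = 1.0,    # target reward margin (illustrative default, an assumption)
) -> float:
    """SimPO-style loss for one pair: -log sigmoid(r_w - r_l - gamma)."""
    # Length-normalized implicit rewards; no reference model is involved.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


if __name__ == "__main__":
    # The same pair scored under two different target margins.
    print(simpo_loss(-10.0, -15.0, 10, 10, beta=2.0, gamma=1.0))
    print(simpo_loss(-10.0, -15.0, 10, 10, beta=2.0, gamma=0.5))
```

Because `gamma` is an absolute offset on the reward gap, a value that suits one dataset's typical gap can be too strict or too loose on another, which is the kind of sensitivity the summary attributes to SimPO.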

Preference Optimization · Deep Learning · Reinforcement Learning · Hyperparameter Tuning · Machine Learning
