TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
arXiv CS.LG · May 13, 2026
Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in reinforcement learning for diffusion models, which often causes mode collapse and degrades generative diversity. It replaces scalar reward maximization with trajectory-level reward distribution matching, using a Softmax Trajectory Balance objective to align policy probabilities with a reward-induced Boltzmann distribution.
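To make the idea of distribution matching concrete, here is a minimal sketch in plain Python. It is not the paper's Softmax Trajectory Balance objective, whose exact form this summary does not give. It assumes a group of sampled trajectories, each with a scalar reward and a policy log-probability, and measures the KL divergence between a Boltzmann target (softmax of reward over temperature) and the policy's log-probabilities softmax-normalized over the same group. The names `boltzmann_target`, `trajectory_matching_kl`, and `temperature` are hypothetical and chosen for illustration only.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def boltzmann_target(rewards, temperature=1.0):
    """Reward-induced Boltzmann distribution over a group of trajectories.

    Assumption: the target weight is proportional to exp(reward / temperature).
    """
    return [math.exp(lp) for lp in log_softmax([r / temperature for r in rewards])]


def trajectory_matching_kl(traj_log_probs, rewards, temperature=1.0):
    """KL(target || policy) over one group of sampled trajectories.

    Illustrative stand-in for a distribution-matching loss: the policy's
    trajectory log-probabilities are renormalized within the group, then
    compared with the Boltzmann target.
    """
    log_target = log_softmax([r / temperature for r in rewards])
    log_policy = log_softmax(traj_log_probs)
    return sum(
        math.exp(lt) * (lt - lq) for lt, lq in zip(log_target, log_policy)
    )


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]
    # A policy that has collapsed away from the highest-reward trajectory.
    collapsed = [-1.0, -1.0, -20.0]
    # A policy whose log-probabilities track the rewards up to a constant.
    matched = [-9.0, -8.0, -7.0]
    print(trajectory_matching_kl(collapsed, rewards))
    print(trajectory_matching_kl(matched, rewards))
```

Unlike plain reward maximization, this loss reaches its minimum when the policy spreads probability across trajectories in proportion to the Boltzmann weights rather than concentrating on the single best sample. In this sketch, a higher `temperature` flattens the target and so favors more diversity.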