
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

arXiv CS.LG · May 13, 2026

Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in reinforcement learning for diffusion models, which often causes mode collapse and degrades generative diversity. It replaces scalar reward maximization with trajectory-level reward distribution matching, using a Softmax Trajectory Balance objective to align policy probabilities with a reward-induced Boltzmann distribution.
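
The summary above does not give the exact form of the Softmax Trajectory Balance objective, so the following is only a minimal sketch of the general idea of matching a policy to a reward-induced Boltzmann distribution over a group of sampled trajectories. The function names, the `temperature` parameter, and the choice of a KL divergence as the matching loss are assumptions made for illustration, not the paper's definitions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_matching_loss(log_probs, rewards, temperature=1.0):
    """KL(target || policy) over one group of sampled trajectories.

    Hypothetical illustration, not the paper's objective:
      target: Boltzmann weights, softmax(reward / temperature)
      policy: softmax of the trajectories' log-probabilities
    """
    target = softmax([r / temperature for r in rewards])
    policy = softmax(log_probs)
    # Skip zero-weight targets, which contribute nothing to the KL sum.
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0)


# Three sampled trajectories with made-up rewards.
rewards = [1.0, 2.0, 3.0]

# Log-probabilities equal to reward / temperature up to a constant
# already match the target distribution, so the loss is (numerically) zero.
aligned = [r - 5.0 for r in rewards]
uniform = [0.0, 0.0, 0.0]

loss_aligned = trajectory_matching_loss(aligned, rewards)
loss_uniform = trajectory_matching_loss(uniform, rewards)
```

Under these assumptions the loss is minimized when each trajectory's probability is proportional to the exponential of its reward, so lower-reward trajectories keep some probability mass. Pure scalar reward maximization would instead push all of the mass onto the single highest-reward trajectory, which is the mode collapse the summary describes.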
