RESEARCH28
Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
arXiv CS.AI · May 29, 2026
This paper introduces STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method for faster off-policy prediction. It replaces the covariance metric with the symmetric part of the behavior-policy Bellman matrix, providing a more informative update geometry.
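The summary does not define these matrices, so the sketch below makes an assumption about them. It reads the covariance metric as C = Φᵀ D_μ Φ, the metric used by gradient-TD methods. It reads the "behavior-policy Bellman matrix" as A_μ = Φᵀ D_μ (I − γ P_μ) Φ, where Φ is the feature matrix, P_μ is the behavior-policy transition matrix, and D_μ is a diagonal matrix holding that chain's stationary distribution. Under this reading, a standard on-policy linear-TD result says A_μ is positive definite. Its symmetric part M = (A_μ + A_μᵀ)/2 is therefore a valid metric even though A_μ itself is not symmetric. All function names and the toy problem are hypothetical. The sketch shows only the metric swap, not the paper's Mirror-Prox update.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain triple-loop product; fine for the tiny matrices used here.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def stationary_distribution(p: Matrix, iters: int = 2000) -> List[float]:
    # Power iteration d <- d P, assuming the chain is irreducible and aperiodic.
    n = len(p)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[i] * p[i][j] for i in range(n)) for j in range(n)]
    return d


def behavior_metrics(phi: Matrix, p_mu: Matrix, gamma: float) -> Tuple[Matrix, Matrix, Matrix]:
    """Return (C, A_mu, M) under the ASSUMED linear-TD definitions:
    C    = Phi^T D_mu Phi                   (covariance metric)
    A_mu = Phi^T D_mu (I - gamma P_mu) Phi  (assumed behavior-policy Bellman matrix)
    M    = (A_mu + A_mu^T) / 2              (its symmetric part)
    """
    n = len(p_mu)
    d = stationary_distribution(p_mu)
    phi_t = transpose(phi)
    # D_mu Phi: scale each state's feature row by its stationary weight.
    d_phi = [[d[s] * x for x in phi[s]] for s in range(n)]
    cov = matmul(phi_t, d_phi)
    i_minus_gp = [[(1.0 if i == j else 0.0) - gamma * p_mu[i][j] for j in range(n)]
                  for i in range(n)]
    # D_mu (I - gamma P_mu) Phi, then left-multiply by Phi^T.
    weighted = [[d[s] * x for x in row] for s, row in enumerate(matmul(i_minus_gp, phi))]
    a_mu = matmul(phi_t, weighted)
    k = len(a_mu)
    sym = [[0.5 * (a_mu[i][j] + a_mu[j][i]) for j in range(k)] for i in range(k)]
    return cov, a_mu, sym


def is_positive_definite_2x2(m: Matrix) -> bool:
    # Sylvester's criterion for a symmetric 2x2 matrix.
    return m[0][0] > 0 and m[0][0] * m[1][1] - m[0][1] * m[1][0] > 0


# Hypothetical toy problem: 3 states, 2 features, doubly stochastic behavior chain.
PHI = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P_MU = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]

cov, a_mu, sym = behavior_metrics(PHI, P_MU, gamma=0.9)
print(is_positive_definite_2x2(cov), is_positive_definite_2x2(sym))  # True True
print(a_mu[0][1] != a_mu[1][0])  # True: A_mu itself is not symmetric
```

In this toy chain, C depends on the behavior policy only through D_μ. M additionally carries γ and the transition matrix P_μ, which is one concrete sense in which it encodes more of the problem's structure.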
Off-Policy Prediction · reinforcement learning · learning · temporal-difference learning · Stochastic Approximation