Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
This paper introduces STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method for faster off-policy prediction. It replaces the covariance metric with the symmetric part of the behavior-policy Bellman matrix, providing a more informative update geometry.
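
To make the change of metric concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard linear TD definitions, which the abstract does not spell out: a feature matrix Phi, a behavior-policy transition matrix P_b with stationary distribution d, the feature covariance C = Phi^T D Phi, and the behavior-policy Bellman matrix A_b = Phi^T D (I - gamma P_b) Phi, where D = diag(d). The function names and the toy three-state chain are hypothetical and serve only as illustration.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P.

    Taken as the left eigenvector of P for the eigenvalue closest to 1,
    normalized so that its entries sum to 1.
    """
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    d = np.real(eigvecs[:, idx])
    return d / d.sum()


def behavior_metrics(Phi, P_b, gamma):
    """Return (C, A_b, M) under the assumed standard linear TD definitions.

    C   : feature covariance, Phi^T D Phi
    A_b : behavior-policy Bellman matrix, Phi^T D (I - gamma P_b) Phi
    M   : symmetric part of A_b, (A_b + A_b^T) / 2
    """
    d = stationary_distribution(P_b)
    D = np.diag(d)
    n_states = P_b.shape[0]
    C = Phi.T @ D @ Phi
    A_b = Phi.T @ D @ (np.eye(n_states) - gamma * P_b) @ Phi
    M = 0.5 * (A_b + A_b.T)
    return C, A_b, M


# Hypothetical toy problem: 3 states, 2 features, an ergodic behavior chain.
P_b = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.8, 0.1],
                [0.0, 0.2, 0.8]])
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])

C, A_b, M = behavior_metrics(Phi, P_b, gamma=0.9)

print("covariance metric C:\n", C)
print("symmetric part of A_b:\n", M)
print("eigenvalues of M:", np.linalg.eigvalsh(M))
```

Under these assumptions, A_b is positive definite whenever d is the stationary distribution of P_b, Phi has full column rank, and gamma < 1, so its symmetric part M is a valid metric in the same way C is. Unlike C, M also depends on the transition structure through P_b.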