RESEARCH 28

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

arXiv cs.AI · May 29, 2026

This paper introduces STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method for faster off-policy prediction. It replaces the covariance metric with the symmetric part of the behavior-policy Bellman matrix, providing a more informative update geometry.
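To make the change of metric concrete, here is a minimal sketch in plain Python. It assumes the standard linear TD setup, which the summary does not spell out: features Phi, behavior-policy transition matrix P with stationary distribution d, covariance metric C = Phi^T D Phi, and Bellman matrix A = Phi^T D (I - gamma P) Phi. The 3-state chain, the features, and all function names are hypothetical, and the paper's own definitions may differ. The sketch only builds the two candidate metrics; it does not implement the STHTD-MP update.

```python
# Illustrative only: a hypothetical 3-state chain under the behavior policy.
GAMMA = 0.9

P = [[0.5, 0.5, 0.0],   # behavior-policy transition matrix (assumed)
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]

PHI = [[1.0, 0.0],      # one feature row per state (assumed)
       [0.5, 0.5],
       [0.0, 1.0]]


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def stationary(P, iters=5000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d


def metrics(P, PHI, gamma):
    """Return (A, C, S): Bellman matrix, covariance, and symmetric part of A."""
    n = len(P)
    d = stationary(P)
    D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    I_minus_gP = [[(1.0 if i == j else 0.0) - gamma * P[i][j]
                   for j in range(n)] for i in range(n)]
    PHI_T_D = matmul(transpose(PHI), D)
    C = matmul(PHI_T_D, PHI)                      # covariance metric
    A = matmul(matmul(PHI_T_D, I_minus_gP), PHI)  # behavior-policy Bellman matrix
    A_T = transpose(A)
    k = len(A)
    S = [[0.5 * (A[i][j] + A_T[i][j]) for j in range(k)] for i in range(k)]
    return A, C, S


A, C, S = metrics(P, PHI, GAMMA)
```

Under these assumptions S is symmetric and positive definite, so it can serve as a metric in the same way C does, while also carrying information about the transition dynamics through P, which C ignores.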

Off-Policy Prediction · Reinforcement Learning · Temporal-Difference Learning · Stochastic Approximation
