
Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

arXiv cs.LG · April 23, 2026

This research introduces the Tool-Augmented Markov Decision Process (TA-MDP) to formally model multimodal agentic decision-making, addressing theoretical gaps in reinforcement fine-tuning for Large Vision-Language Models (LVLMs). It investigates two questions: how composite verifiable rewards affect the convergence of Group Relative Policy Optimization (GRPO), and why training agentic LVLMs on small datasets generalizes to out-of-distribution domains.

Theoretical AI · reinforcement learning · vision models · large language models · AI agents
