Research · arXiv CS.LG · 7d ago
From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models
Researchers propose Demo2Reward, a test-time adaptation technique that optimizes the prompts of Vision-Language Model (VLM) reward models used in robotics. Using only a few demonstrations, it reduces false positives while preserving true positives, and it requires no additional model training.
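The summary does not say how Demo2Reward searches over prompts, so the following is only a rough illustration of the stated objective, not the paper's method. It picks one prompt from a fixed list of candidates: the prompt that yields the fewest false positives on a handful of labeled demonstrations while keeping at least as many true positives as the original prompt. The reward model here is a hypothetical stand-in that does keyword matching over text descriptions instead of running a VLM. The names `confusion_counts`, `select_prompt`, `toy_reward`, and the candidate prompts are illustrative and are not the paper's API. Only the prompt changes and no weights are updated, mirroring the "no additional training" property.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical reward model: (prompt, observation) -> True if it judges the task as succeeded.
RewardFn = Callable[[str, str], bool]
# A demonstration: (observation, ground-truth success label).
Demo = Tuple[str, bool]


def confusion_counts(reward_fn: RewardFn, prompt: str, demos: Sequence[Demo]) -> Tuple[int, int]:
    """Count true positives and false positives of reward_fn under a given prompt."""
    tp = fp = 0
    for obs, is_success in demos:
        predicted = reward_fn(prompt, obs)
        if predicted and is_success:
            tp += 1
        elif predicted and not is_success:
            fp += 1
    return tp, fp


def select_prompt(reward_fn: RewardFn, base_prompt: str,
                  candidates: List[str], demos: Sequence[Demo]) -> str:
    """Pick the candidate with the fewest false positives that keeps the base prompt's true-positive count."""
    base_tp, base_fp = confusion_counts(reward_fn, base_prompt, demos)
    best_prompt, best_fp = base_prompt, base_fp
    for prompt in candidates:
        tp, fp = confusion_counts(reward_fn, prompt, demos)
        # Reject prompts that lose true positives; otherwise prefer fewer false positives.
        if tp >= base_tp and fp < best_fp:
            best_prompt, best_fp = prompt, fp
    return best_prompt


def toy_reward(prompt: str, obs: str) -> bool:
    """Stand-in 'reward model': success if every prompt word appears in the observation."""
    return set(prompt.split()) <= set(obs.split())


demos = [("cup on shelf", True), ("cup near shelf", False), ("cup on floor", False)]
chosen = select_prompt(toy_reward, "cup", ["floor", "cup shelf", "cup on shelf"], demos)
print(chosen)  # cup on shelf
```

In this toy run the base prompt "cup" accepts all three demonstrations (one true positive, two false positives). "floor" is rejected because it loses the true positive, and "cup on shelf" eliminates both false positives. A real system would score image observations with a VLM and would likely generate candidate prompts rather than take a fixed list.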