RESEARCH · arXiv CS.CL · 4d ago
Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
This paper investigates post-training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) to improve heart-focused medical question answering. It proposes a Variance-Aware Reward Framework that augments rubric-based supervision with continuous analytical reward functions.
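As a rough illustration of why reward variance matters under GRPO, the sketch below computes group-relative advantages in their commonly described form: each sampled answer's reward minus the group mean, divided by the group standard deviation. The function name, the `eps` constant, and the rubric scores are hypothetical, and the paper's actual reward functions are not given in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of answers sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores for four sampled answers to one question.
binary_rubric = [1.0, 1.0, 1.0, 1.0]      # pass/fail rubric, all pass: zero variance
continuous_rubric = [0.9, 0.7, 0.8, 0.6]  # graded rubric: answers remain distinguishable

flat = group_relative_advantages(binary_rubric)
graded = group_relative_advantages(continuous_rubric)

print(flat)    # all advantages are zero, so this group carries no learning signal
print(graded)  # higher-scoring answers receive positive advantages
```

When every answer in a group gets the same score, the advantages collapse to zero and the group contributes no gradient. This is one plausible reading of why continuous rewards are attractive here, not a statement of the paper's method.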