Research · arXiv cs.CL · 20d ago
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
Deep research agents automate complex information-seeking tasks, so their evaluation must be both scalable and reliable. Using LLMs as judges to supervise these agents raises concerns about the judges' own reliability, which calls for a meta-evaluation designed specifically for them.