meta-evaluation — AI articles, news & research

RESEARCH · arXiv CS.CL · 20d ago

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Deep research agents automate complex information-seeking tasks, so they need evaluation that is both scalable and reliable. Using LLMs as judges to supervise these agents raises concerns about the judges' own reliability, underscoring the need for a meta-evaluation designed specifically for these judges.

REFLECT meta-evaluation evaluation research agents