
Evaluation Metrics

7 items

RESEARCH · arXiv CS.AI · 4/16/2026

Exploration and Exploitation Errors Are Measurable for Language Model Agents

This research introduces a method to systematically quantify exploration and exploitation errors in Language Model (LM) agents, addressing the challenge of evaluation without access to internal policies. It proposes controllable environments and a policy-agnostic metric to measure these errors, revealing flaws even in state-of-the-art LMs.

28
RESEARCH · arXiv CS.CL · 21d ago

SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

SKG-Eval addresses the challenge of evaluating multi-turn dialogue systems by modeling dialogue as an evolving Semantic Knowledge Graph (SKG). This framework incrementally updates the graph through structured triple extraction to detect long-range issues like contradiction and inconsistency, offering improved evaluation beyond turn-isolated representations.

27
RESEARCH · arXiv CS.CL · 4/14/2026

Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

This research introduces the '100-Endings metric' to address LLMs' failure to generate compelling stories and to recognize quality problems in their own output. The metric measures narrative tension by predicting story endings sentence by sentence, and it distinguishes high-quality human narratives from AI outputs more effectively than current rubrics.

27
RESEARCH · arXiv CS.LG · 4/9/2026

RAGEN-2: Reasoning Collapse in Agentic RL

This study introduces the concept of 'template collapse', a failure mode in multi-turn LLM agents in which the response becomes input-agnostic even while entropy remains stable. It proposes Mutual Information (MI) as a better metric than entropy for diagnosing reasoning quality, since it correlates more strongly with final performance.
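
To illustrate why entropy alone can miss input-agnostic responses, here is a minimal sketch on toy data. It is not the paper's implementation: the function names, the discrete labels, and the plug-in estimator over counts are all assumptions made for illustration. Both response sets below have the same entropy, but only one of them carries information about the input.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a list of hashable samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(inputs, responses):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    joint = list(zip(inputs, responses))
    return entropy(inputs) + entropy(responses) - entropy(joint)


# Hypothetical toy data: two distinct inputs, two distinct response templates.
inputs = ["a", "a", "b", "b"]
grounded = ["r1", "r1", "r2", "r2"]   # response depends on the input
collapsed = ["r1", "r2", "r1", "r2"]  # same response mix regardless of input

h_grounded = entropy(grounded)
h_collapsed = entropy(collapsed)
mi_grounded = mutual_information(inputs, grounded)
mi_collapsed = mutual_information(inputs, collapsed)

print(f"entropy: grounded={h_grounded:.2f}, collapsed={h_collapsed:.2f}")
print(f"MI:      grounded={mi_grounded:.2f}, collapsed={mi_collapsed:.2f}")
```

In this toy case the response entropy is identical for both sets, while the mutual information with the input drops to zero for the collapsed set. A real diagnostic would need an estimator suited to free-form text rather than discrete labels.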

27