evaluation

53 items

RESEARCH · arXiv CS.CL · 20d ago

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Deep research agents automate complex information-seeking tasks, which makes scalable and reliable evaluation necessary. Using LLMs as judges to supervise these agents raises concerns about the judges' own reliability, so a meta-evaluation of the judges themselves is needed.

REFLECT meta-evaluation evaluation research agents

RESEARCH · arXiv CS.AI · 13d ago

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Anchor is a task-generation pipeline that addresses "artifact drift" in AI agent benchmark creation. It formalizes domain experts' specifications into constraint optimization programs, jointly producing consistent instructions, environments, solutions, and verifiers for business operations.

evaluation task generation Benchmarking business workflows

ARTICLE · DEV.to AI · 23d ago

AI Agent Evaluation in 2026: Beyond the Benchmark Trap

The article highlights the significant gap between AI agents' high benchmark scores and their poor performance in production, arguing that current benchmarks test narrow capabilities and miss critical real-world challenges. It identifies this discrepancy as the defining challenge for AI agent evaluation in 2026.

evaluation AI deployment Benchmarks AI development

ARTICLE · DEV.to AI · 29d ago

Best AI Answering Service for Contractors: An Operator's Evaluation Framework

A founder of an AI answering service for trade contractors presents a framework for evaluating such services, acknowledging his bias. The article provides an in-depth operational guide on testing, instrumentation, negotiation, and common production issues, specifically for builders and operators.

framework evaluation contractors answering service

RESEARCH · DEV.to AI · 5/5/2026

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

BrowseComp is a new and challenging benchmark designed to evaluate browsing agents. It focuses on complex tasks that require contextual understanding and interaction with web interfaces, offering a new metric for AI performance.

evaluation research Benchmarks AI

ARTICLE · DEV.to AI · 4/14/2026

The 5 Levels of RAG Maturity: How to Know When Your RAG Is Actually Production-Ready

This article addresses the common challenge of evaluating RAG (Retrieval-Augmented Generation) systems, highlighting that many projects fail to move beyond initial demos due to a lack of proper evaluation. It introduces a 0-to-5 maturity model designed to help organizations assess their RAG systems and determine when they are truly production-ready.

Production-Ready AI LLMs evaluation RAG

RESEARCH · DEV.to AI · 4/19/2026

Evaluation of Retrieval-Augmented Generation: A Survey

This survey covers the evaluation of Retrieval-Augmented Generation (RAG), analyzing its current state, architectures, and performance metrics. It provides a comprehensive overview of existing RAG techniques and their applications.

Survey evaluation RAG NLP

RESEARCH · DEV.to AI · 4/14/2026

Don't forget, there is more than forgetting: new metrics for Continual Learning

This paper introduces novel metrics for Continual Learning, broadening evaluation beyond catastrophic forgetting alone. It proposes a more comprehensive view for measuring AI model performance in sequential learning scenarios.

AI metrics evaluation machine learning Catastrophic Forgetting

ARTICLE · DEV.to AI · 18d ago

Intercom: Outlines Key Factors Beyond Performance for Evaluating AI Customer Service Agents

Intercom published an article outlining factors beyond raw performance that matter when evaluating AI customer service agents. The post emphasizes integration, customization, and long-term value as essential criteria for selecting AI solutions.

evaluation customer service business strategy AI

RESEARCH · arXiv CS.CL · 5/5/2026

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

This paper introduces the CLEAR framework to assess how ambiguity and uncertainty affect the reliability of medical Large Language Models (LLMs), moving beyond simplified evaluation benchmarks. It systematically perturbs answer options and their semantic framing, revealing that a larger number of plausible answers degrades LLM performance and that model caution decreases when abstention options are phrased with uncertainty.

Ambiguity LLMs evaluation Reliability

RESEARCH · arXiv CS.CL · 5/1/2026

BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

This paper introduces BatteryPass-12K, the first public dataset for the novel task of digital battery passport (DBP) conformance classification, addressing a critical need before new EU regulations. It benchmarks 22 language models, finding that "Thinking models" like GPT-5.4 achieve the best performance, and few-shot examples significantly enhance results on this challenging task.

evaluation Benchmarking Natural Language Processing datasets

RESEARCH · arXiv CS.CL · 4/16/2026

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

KMMMU is a new native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings, featuring 3,466 questions from native exams. The study shows that current AI models achieve only 42.05% accuracy on the full set, with significant failures in culturally and discipline-specific problems.

language models multimodal AI evaluation Benchmarking

RESEARCH · arXiv CS.CL · 29d ago

IntentGrasp: A Comprehensive Benchmark for Intent Understanding

IntentGrasp is a new comprehensive benchmark for evaluating the intent understanding capability of Large Language Models, derived from 49 high-quality corpora. Extensive evaluations of 20 LLMs showed unsatisfactory performance, with scores below 60% on the All Set and below 25% on the Gem Set.

evaluation Benchmarking IntentGrasp intent understanding

RESEARCH · arXiv CS.CL · 7d ago

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

This paper describes a system for SemEval-2026 Task 1, which focuses on constrained humor generation. The approach uses a

evaluation Natural Language Processing humor generation AI Research

RESEARCH · arXiv CS.AI · 22d ago

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

This paper introduces a new paradigm for interactively evaluating Theory of Mind (ToM) improvements in Large Language Models (LLMs) for human-AI interactions. Empirical findings from real-world datasets and a user study reveal that ToM enhancements on static benchmarks do not always translate to benefits in dynamic human-AI interactions.

LLMs evaluation human-AI interaction empirical study

RESEARCH · arXiv CS.CL · 25d ago

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

This research evaluates large language models (LLMs) in biomedical question answering, focusing on their reliability when faced with conflicting or incomplete evidence. It finds that LLM accuracy drops significantly, and predictions flip, when the order of correct and contradictory documents is reversed, highlighting order effects and the need for conflict-aware abstention.

LLMs evaluation Reliability Biomedical AI

RESEARCH · arXiv CS.LG · 7d ago

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

This paper studies tool-calling in large language model (LLM) agents, examining its effectiveness and efficiency. It analyzes evaluation pipelines, showing results are sensitive to implementation choices, and identifies computational waste in reinforcement learning training.

LLMs evaluation reinforcement learning tool-calling

RESEARCH · arXiv CS.CL · 14d ago

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

This paper introduces a causal framework to study rationalization bias in LLMs used as automatic judges for summarization and dialogue evaluation. It investigates whether LLM judges' rankings and explanations remain stable when non-evidential cues are perturbed, proposing cue interventions and anchoring metrics.

LLMs evaluation AI rationalization

RESEARCH · arXiv CS.CL · 8d ago

Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow

This protocol evaluates ChatGPT's ability to generate and verify disease-centric biomedical associations, using biomedical ontologies and literature. It employs a self-consistency strategy and a RAG-enabled workflow with open-source LLMs to address exact-match limitations and detect hallucination.

LLMs evaluation ChatGPT RAG

RESEARCH · arXiv CS.CL · 8d ago

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

This paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark for evaluating Retrieval-Augmented Generation (RAG) systems using realistic queries and expert-annotated case law answers. It highlights the sensitivity of retrieval performance, the competitiveness of open-source embedding models, the limitations of automatic evaluation, and LLM hallucinations in generated responses.

Retrieval Augmented Generation LLMs evaluation Legal AI