AI evaluation

65 items

RESEARCH · arXiv CS.CL · 1d ago

Re-Centering Humans in LLM Personalization

This paper investigates the discrepancy in LLM personalization performance between synthetic and human data. It finds that human data reveals significant system limitations in attribute extraction, attribute relevance, and generating genuinely personalized responses.

user data synthetic data LLM personalization AI evaluation

ARTICLE · DEV.to AI · 1d ago

Enhancing LLM Reliability with Evaluation Engineering

This article explains how evaluation engineering improves the reliability of large language models (LLMs) and covers the discipline's principles and techniques. It argues that organizations that invest in it are better placed to ship LLMs that hold up in real-world applications.

Reliability Evaluation Engineering AI evaluation LLM

DOC · AWS Machine Learning Blog · 1d ago

Evaluate your Amazon Nova Sonic voice agent at scale, no microphone required

This post introduces the Nova Sonic Test Harness, an open-source framework designed for scalable evaluation and rapid iteration of Amazon Nova Sonic voice agents. It automates multi-turn conversations, uses LLM-as-judge techniques to validate quality, and can detect audio hallucinations without requiring a microphone.

LLM-as-judge Open Source Voice Agents Amazon Nova Sonic

DOC · ↑ trending · Reddit r/MachineLearning · 4/22/2026

Need Info on quality benchmarks to run on DeepSeek V3.2 different quant levels [D]

A user is seeking advice on what quality benchmarks to run to measure the performance degradation when applying runtime quantization to the DeepSeek V3.2 large language model. The goal is to compare the quality loss against the non-quantized version.

Benchmarking quantization model optimization AI evaluation

ARTICLE · ↑ trending · Reddit r/MachineLearning · 5/1/2026

What benchmark would you build for “reply quality” in SDR generation? [D]

The post explores the challenge of building an effective benchmark for "reply quality" in AI-generated SDR (sales development representative) emails. It examines common metrics such as reply rate and accuracy and explains why each is flawed: none fully captures message effectiveness, and optimizing for them often leads to misaligned results.

AI applications Benchmarking SDR AI evaluation

ARTICLE · ↑ trending · Reddit r/MachineLearning · 18d ago

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]

The author expresses frustration that benchmark performance often fails to predict whether an AI workflow will succeed in real production usage. This is due to factors like ambiguous user intent and messy contexts, suggesting evaluation still prioritizes clean-task optimization over behavioral robustness.

model robustness Benchmarking production readiness AI evaluation

ARTICLE · DEV.to AI · 4/22/2026

Eval workflow for agentic builders: fork any prompt through baseline vs scaffolded agents, blind third-party judge.

A solo founder built an n8n eval workflow for AI agents, A/B testing prompts with plain GPT-4o versus GPT-4o with a reasoning scaffold, using a blind Gemini evaluator. This tool allows builders to test agent performance on their own tasks, focusing on how scaffolding affects depth, sycophancy, and diagnostic procedures.

prompt engineering agent development LLM testing AI evaluation
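The blind-judge idea in this item can be illustrated with a short sketch. This is not the author's n8n workflow; it only shows one common way to blind a pairwise comparison, by shuffling the two outputs into anonymous slots before the judge sees them. The names `blind_pair`, `run_blind_comparison`, and `longer_answer_judge` are hypothetical, and the judge here is a trivial stand-in for a real model call.

```python
import random
from typing import Callable, Dict, Tuple


def blind_pair(baseline: str, scaffolded: str,
               rng: random.Random) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle two outputs into anonymous slots 'A' and 'B'.

    Returns (slots, key): slots is what the judge sees, key maps each
    slot back to its true source and is kept hidden from the judge.
    """
    pair = [("baseline", baseline), ("scaffolded", scaffolded)]
    rng.shuffle(pair)
    slots = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return slots, key


def run_blind_comparison(baseline: str, scaffolded: str,
                         judge: Callable[[Dict[str, str]], str],
                         rng: random.Random) -> str:
    """Ask the judge to pick 'A' or 'B', then un-blind the verdict."""
    slots, key = blind_pair(baseline, scaffolded, rng)
    verdict = judge(slots)
    if verdict not in key:
        raise ValueError(f"judge must return 'A' or 'B', got {verdict!r}")
    return key[verdict]


def longer_answer_judge(slots: Dict[str, str]) -> str:
    """Hypothetical stand-in judge: prefers the longer answer."""
    return "A" if len(slots["A"]) >= len(slots["B"]) else "B"


rng = random.Random(0)
winner = run_blind_comparison(
    "short answer",
    "a longer, more detailed answer",
    longer_answer_judge,
    rng,
)
print(winner)
```

In a real setup the judge callable would wrap a call to a separate model, and the shuffle would be repeated across many prompts so that any position bias in the judge averages out.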

ARTICLE · DEV.to AI · 4/19/2026

Learn to evaluate the quality of your AI agent, RAG, and LLM

The author argues that evaluation (evals) of AI agents, RAG, and LLMs is important yet widely overlooked, and presents key metrics and frameworks for it. The article aims to teach how to improve the quality of AI project delivery by combining theory and practice, and includes a study repository that uses OpenRouter.

frameworks RAG Metrics AI evaluation

RESEARCH · arXiv CS.AI · 19d ago

$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

This research proposes a new family of metrics, $ECUAS_n$, for evaluating uncertainty-augmented (UA) systems in automated decision-making. It argues that existing evaluation approaches are insufficient for assessing overall performance of UA systems, where predictive uncertainty is crucial for users to make informed decisions.

Decision Making predictive uncertainty Metrics uncertainty

RESEARCH · LangChain Blog · 7d ago

Designing Efficient Verifiers for Legal Agents

A study by Harvey and LangChain Labs focuses on developing more cost-effective and dependable LLM verifiers. This research aims to enhance the evaluation and post-training processes for legal AI agents.

LLM verifiers LangChain Legal AI AI evaluation

ARTICLE · DEV.to AI · 4/16/2026

I read all 232 pages of the Opus 4.7 system card

The author reviewed Anthropic's 232-page Claude Opus 4.7 system card, highlighting the model's self-assessed welfare score of 4.49 out of 7, the highest for any Claude model. The author considers this generational leap in self-evaluation more important than the widely publicized SWE-bench metrics.

AI models LLMs AI safety AI evaluation

RESEARCH · arXiv CS.LG · 8d ago

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

This paper introduces NumLeak, a framework designed to measure memorized recall in foundation models using public numeric benchmarks. It reveals that top-tier LLMs recall financial and economic data with high fidelity, suggesting that evaluations may be measuring memorization rather than genuine out-of-sample skill.

LLM memorization Foundation Models data leakage Benchmarking

ARTICLE · DEV.to AI · 4/12/2026

The Benchmark Is Not the Behavior

A UC Berkeley team demonstrated how to exploit flaws in eight AI agent benchmarks by manipulating evaluation methods. This raises serious questions about the integrity of AI evaluation, as benchmarks rely on a vulnerable "honor system."

AI Benchmarks research integrity AI evaluation

ARTICLE · DEV.to AI · 4/14/2026

Evaluating LLMs for Code Generation: Accuracy, Latency, and Failure Modes

The article highlights a critical flaw in current evaluations of LLM code generation: they often fail to capture real-world correctness beyond superficial test passes. It argues against simplistic unit-test benchmarks and proposes a more nuanced `weighted_accuracy` approach to uncover subtle failure modes.

LLMs accuracy Benchmarking code generation

RESEARCH · arXiv CS.CL · 21d ago

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

This paper introduces PQR, a framework designed to generate diverse and realistic user queries that elicit failures in LLM-based QA agents, going beyond existing methods that primarily focus on adversarial users. PQR operates through iterative query and prompt refinement modules to create realistic test scenarios that expose agent vulnerabilities.

LLMs QA agents failure detection query generation

RESEARCH · Hugging Face Blog · 5d ago

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

EVA-Bench Data 2.0 introduces an updated benchmark featuring 3 domains, 121 tools, and 213 scenarios. This dataset is designed for evaluating AI systems and tools.

AI benchmarking datasets AI tools AI evaluation

ARTICLE · DEV.to AI · 5/8/2026

Your AI Agent Evaluation Is Lying to You: Why 10 Test Runs Prove Nothing

The article argues that 10 test runs comparing AI agents are too few to draw valid conclusions about performance, even with a 5-5 tie. It explains that win rates have very wide confidence intervals at small sample sizes and introduces the Wilson score interval as a reasonable bound for binary outcomes.

confidence interval Testing agent comparison Statistics
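To make the sample-size point in this item concrete, here is a minimal sketch of the standard Wilson score interval for a win rate. It is not taken from the article; the function name `wilson_interval` and the default `z = 1.96` (roughly 95% confidence) are choices made for this illustration.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    wins: number of successes, n: number of trials,
    z: normal quantile (1.96 corresponds to about 95% confidence).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# A 5-5 split over 10 runs: the interval spans roughly 0.24 to 0.76.
low, high = wilson_interval(5, 10)
print(f"10 runs:   [{low:.3f}, {high:.3f}]")

# The same 50% win rate over 1000 runs gives a much tighter interval.
low_big, high_big = wilson_interval(500, 1000)
print(f"1000 runs: [{low_big:.3f}, {high_big:.3f}]")
```

With 10 runs, a 5-5 result is compatible with a true win rate anywhere from about one in four to about three in four, which is why so few runs cannot separate two agents.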

ARTICLE · DEV.to AI · 28d ago

One AI Model Scored 99. I Still Voted for the One That Scored 95.

An author preferred an AI model scoring 95 over a technically superior one scoring 99 due to better user experience. This highlights that "looks good," "scores good," and "feels right" are distinct judgments for AI-generated software, not always leading to the same winner.

user experience software quality AI evaluation AI development

ARTICLE · DEV.to AI · 4/22/2026

Wait, you guys run evals?

The author asks the community about the importance of building specific evaluations for AI systems, beyond standard benchmarks, to identify true benefits and failures. They seek different perspectives on how people approach creating custom metrics to ensure product rigor and quality.

Benchmarking AI evaluation model development

RESEARCH · arXiv CS.AI · 21d ago

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

LinAlg-Bench is a new diagnostic benchmark evaluating 10 frontier large language models (LLMs) on structured linear algebra computation, revealing structural failure modes. It assesses LLM performance across a dimensional gradient of matrices, classifying failures into ten primary error types and identifying a behavioral threshold at 4x4 matrices.

mathematical reasoning Benchmarking linear algebra AI evaluation