
AI evaluation

65 items

ARTICLE · ↑ trending · Reddit r/MachineLearning · 18d ago

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]

The author is frustrated that benchmark performance often fails to predict whether an AI workflow will hold up in real production usage. They attribute the gap to factors such as ambiguous user intent and messy contexts, which suggests that evaluation still prioritizes clean-task optimization over behavioral robustness.

41
ARTICLE · DEV.to AI · 4/22/2026

Eval workflow for agentic builders: fork any prompt through baseline vs scaffolded agents, blind third-party judge.

A solo founder built an n8n eval workflow for AI agents that A/B tests prompts against plain GPT-4o and GPT-4o with a reasoning scaffold, with a blind Gemini evaluator as the judge. Builders can use it to test agent performance on their own tasks, focusing on how scaffolding affects depth, sycophancy, and diagnostic procedures.

35
ARTICLE · DEV.to AI · 4/19/2026

Learn to evaluate the quality of your AI agent, RAG, and LLM

The author discusses why evaluation (evals) of AI systems such as agents, RAG, and LLMs matters and how little awareness of it there is, and says they will present key metrics and frameworks. The article aims to teach how to improve the quality of AI project delivery by combining theory and practice, with a study repository that uses OpenRouter.

33
RESEARCH · arXiv CS.AI · 19d ago

$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

This research proposes a new family of metrics, $ECUAS_n$, for evaluating uncertainty-augmented (UA) systems in automated decision-making. It argues that existing evaluation approaches are insufficient for assessing overall performance of UA systems, where predictive uncertainty is crucial for users to make informed decisions.

30
RESEARCH · arXiv CS.CL · 21d ago

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

This paper introduces PQR, a framework designed to generate diverse and realistic user queries that elicit failures in LLM-based QA agents, going beyond existing methods that primarily focus on adversarial users. PQR operates through iterative query and prompt refinement modules to create realistic test scenarios that expose agent vulnerabilities.

28
ARTICLE · DEV.to AI · 4/22/2026

Wait, you guys run evals?

The author asks the community about the importance of building specific evaluations for AI systems, beyond standard benchmarks, to identify true benefits and failures. They seek different perspectives on how people approach creating custom metrics to ensure product rigor and quality.

28
RESEARCH · arXiv CS.AI · 21d ago

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

LinAlg-Bench is a new diagnostic benchmark evaluating 10 frontier large language models (LLMs) on structured linear algebra computation, revealing structural failure modes. It assesses LLM performance across a dimensional gradient of matrices, classifying failures into ten primary error types and identifying a behavioral threshold at 4x4 matrices.

28