evaluation

53 items

RESEARCH·arXiv CS.CL·1d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

UnpredictaBench is introduced as a new benchmark to evaluate large language models' ability to capture true underlying distributions, addressing their tendency to collapse towards single answers. It provides 448 problems and a KS@N metric to test sampling outcomes from various target distributions.

AI models LLMs evaluation Benchmarking
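The summary above does not define KS@N, so the following is only a minimal sketch of the classical one-sample Kolmogorov-Smirnov statistic that the name suggests, assuming N samples drawn from a model are compared against a target CDF. The function names and the uniform target are illustrative assumptions, not the benchmark's actual implementation.

```python
def ks_statistic(samples, cdf):
    """One-sample KS statistic: largest gap between the empirical CDF
    of `samples` and the target `cdf` (illustrative, not KS@N itself)."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d


def uniform_cdf(x, lo=0.0, hi=1.0):
    """CDF of a uniform distribution on [lo, hi] (assumed target)."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


# A "collapsed" sampler that always returns the same answer.
collapsed = [0.5] * 20
# An idealized sampler whose outputs are evenly spread over [0, 1].
spread = [(i + 0.5) / 20 for i in range(20)]

d_collapsed = ks_statistic(collapsed, uniform_cdf)
d_spread = ks_statistic(spread, uniform_cdf)
print(round(d_collapsed, 3), round(d_spread, 3))
```

A sampler that collapses to one answer sits far from the uniform target, while evenly spread samples stay close to it, which is the failure mode the benchmark is described as probing.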

ARTICLE·↑ trending·Hacker News (AI)·15d ago

Show HN: Unsiloed AI – #1 on olmOCR-Bench

UnSiloed Parser v3.1 achieved the #1 rank on olmOCR-Bench, outperforming 18 other OCR services including advanced AI models. The evaluation, conducted across 1,403 PDFs and 8,413 unit tests, demonstrated its capability to handle complex real-world document challenges like intricate tables and multi-column layouts.

AI benchmark evaluation document parsing UnSiloed

RESEARCH·↑ trending·Reddit r/MachineLearning·4/16/2026

Training Qwen2.5-0.5B-Instruct on Reddit posts summarization tasks with length constraint on my 3xMac Minis with GRPO - evals update [P]

The author trained Qwen2.5-0.5B-Instruct for Reddit post summarization using two reward strategies, finding that a combination of quality and length penalties yielded significantly better results. Evaluation was conducted using LLM-As-A-Judge and DeepEval tools for metrics like conscientiousness and clarity.

evaluation reinforcement learning AI training summarization

ARTICLE·↑ trending·Reddit r/MachineLearning·4/23/2026

Built a normalizer so WER stops penalizing formatting differences in STT evals! [P]

Word Error Rate (WER) penalizes formatting differences in STT evaluations, which distorts scores. To address this, the author released the open-source `gladia-normalization` library, which normalizes transcripts before WER calculation so that the score reflects recognition quality rather than formatting.

Open Source evaluation NLP Speech-to-Text
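To illustrate why normalizing before scoring matters, here is a minimal sketch of word-level WER with a hand-written normalizer. The `normalize` function below is a hypothetical stand-in (lowercasing and punctuation stripping only); it is not the `gladia-normalization` API, whose interface is not described in the summary above.

```python
import re


def normalize(text):
    """Hypothetical normalizer: lowercase, drop punctuation except
    apostrophes, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Hello, world! It's nine o'clock."
hypothesis = "hello world it's nine o'clock"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))
print(raw_wer, normalized_wer)
```

In this example every recognized word is correct, yet the raw score counts four of five words as errors because of casing and punctuation; after normalization the score drops to zero.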

RESEARCH·↑ trending·Reddit r/MachineLearning·4/22/2026

EMNLP workshop any good? Or any other NLP venue good for VLM eval work? [D]

The poster asks whether EMNLP workshops are a suitable venue for Vision-Language Model (VLM) evaluation work and seeks recommendations for other NLP venues that fit this type of research.

evaluation VLM NLP research venues

ARTICLE·↑ trending·Reddit r/LocalLLaMA·18d ago

Anyone evaluated the difference between Qwen Code for the local qwen models vs another harness? CC, OC, LC, Aider etc..

A user asks for a comparison between Qwen Code and other harnesses (like opencode) for evaluating local Qwen models. They wonder if Qwen Code offers superior native functionality and what benchmarking methodology was used.

AI models evaluation Benchmarking

ARTICLE·DEV.to AI·4/16/2026

I was tired of complex RAG evaluation tools, so I built my own (and open-sourced it) 🚀

Tired of complex RAG evaluation tools, the author built and open-sourced a new lightweight tool called RAG-Destroyer. It aims to easily integrate into workflows to identify and eliminate bad context and hallucinations in RAG applications.

Open Source evaluation RAG AI tools

RESEARCH·Hugging Face Blog·4/21/2026

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

QIMMA (قِمّة) is a new quality-first leaderboard designed for evaluating Arabic Large Language Models (LLMs). It aims to identify and promote top-performing AI models specifically for the Arabic language.

evaluation Benchmarking Arabic LLM

ARTICLE·DEV.to AI·4/15/2026

OpenAI's Promptfoo deal puts evaluation and red-teaming at the centre of the agent stack

OpenAI's acquisition of Promptfoo signals a crucial shift in judging AI agent quality, moving beyond mere fluency to comprehensive testing, documentation, and governance of failures before deployment. This addresses critical operational risks like prompt injection and tool misuse, ensuring robustness in production systems.

red-teaming LLM Agents evaluation prompt injection

RESEARCH·arXiv CS.LG·17d ago

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

The paper introduces HealthCraft, a public reinforcement-learning environment designed to evaluate the safety of frontier language models in emergency medicine. It focuses on trajectory-level safety, tool misuse, and clinical pressure, built on a FHIR R4 world state and offering 195 tasks for comprehensive assessment.

LLMs evaluation reinforcement learning medical AI

RESEARCH·arXiv CS.CL·4/6/2026

SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models

SocioEval is a template-based framework for systematically evaluating socioeconomic status bias in foundation models, including LLMs, an underexplored area. The study evaluated 13 LLMs and found substantial variation in bias rates (0.42% to 33.75%), with bias manifesting differently across themes.

LLMs evaluation Foundation Models SocioEval

RESEARCH·arXiv CS.AI·4d ago

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

This study examines the stability and manipulability of LLM judges in evaluation pipelines, finding that while they are stable under neutral reevaluation, they become reversible under targeted post-decision challenge. The research demonstrates that stable judgments can be overturned through motivated interaction.

robustness LLMs evaluation Benchmarking

ARTICLE·DEV.to AI·16d ago

Stop Engineering Prompts: How an Eval-First Harness Let Us Ship 25 Algorithm Versions Autonomously

This article details the creation of an eval-first AI harness that enabled the autonomous shipment of 25 algorithm versions in 13 days. The methodology focuses on immutable test sets and independent reviews to ensure changes do not cause regressions. The author emphasizes that the harness, rather than just prompt engineering or full automation, was key to the pace and safety of development.

evaluation Algorithms Software engineering automation

ARTICLE·DEV.to AI·5d ago

Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory

The size of the human-labeled calibration set for validating an LLM-as-judge depends on label balance. Fifty stratified traces suffice for balanced binary criteria, but 200 or more are mandatory for rare-but-expensive categories like safety violations, as kappa's variance is dominated by minority-class examples.

LLM-as-judge Calibration evaluation sample size
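The point about minority-class examples can be made concrete with a small sketch of Cohen's kappa between human labels and judge labels. The label names and the two 50-trace calibration sets below are invented for illustration; they are not data from the article.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Balanced binary criterion: 5 disagreements out of 50 traces.
human_balanced = ["pass"] * 25 + ["fail"] * 25
judge_balanced = ["pass"] * 23 + ["fail"] * 2 + ["fail"] * 22 + ["pass"] * 3

# Rare category: only 2 violations in 50 traces, and the judge misses one.
human_rare = ["violation"] * 2 + ["ok"] * 48
judge_rare = ["violation", "ok"] + ["ok"] * 48

kappa_balanced = cohens_kappa(human_balanced, judge_balanced)
kappa_rare = cohens_kappa(human_rare, judge_rare)
print(round(kappa_balanced, 3), round(kappa_rare, 3))  # 0.8 0.658
```

In the rare-category set a single disagreement at 98% raw agreement pulls kappa below the balanced case with five disagreements, which shows how strongly the estimate depends on a handful of minority-class traces.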

RESEARCH·DEV.to AI·4/17/2026

A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability

This piece evaluates ChatGPT's zero-shot Text-to-SQL capability, meaning its ability to convert natural language into SQL queries without prior examples, and examines the model's performance and limitations on this task.

evaluation Text-to-SQL ChatGPT benchmark

DOC·AWS Machine Learning Blog·22d ago

Build custom code-based evaluators in Amazon Bedrock AgentCore

This post demonstrates how to implement custom code-based evaluators in Amazon Bedrock AgentCore. It teaches how to register Lambda-based evaluators for a financial market-intelligence agent and combine them with built-in evaluators for fact-checking and PII detection.

evaluation learning Amazon Bedrock AWS

RESEARCH·arXiv CS.CL·4/6/2026

Pragmatics Meets Culture: Culturally-adapted Artwork Description Generation and Evaluation

This paper introduces the task of culturally adapted artwork description generation to counter cultural bias in language models during open-ended text generation. It proposes an evaluation framework based on culturally grounded question answering and shows that a pragmatic speaker model significantly improves listener comprehension.

Art Description language models evaluation Pragmatics

ARTICLE·DEV.to AI·5/10/2026

I open-sourced a 3-agent blind eval team. Any agent runtime can call it for pre-commitment review of its own plans.

An open-source, 3-agent blind evaluation workflow, released this weekend, lets any AI agent runtime request pre-commitment review of its plans via an HTTP endpoint. It addresses the problem that models cannot reliably evaluate themselves by providing an external, blind primitive for honest assessment.

Open Source evaluation Self-evaluation Workflow

RESEARCH·arXiv CS.CL·4/16/2026

Bi-Predictability: A Real-Time Signal for Monitoring LLM Interaction Integrity

This paper introduces bi-predictability (P) and the Information Digital Twin (IDT) architecture for real-time monitoring of LLM interaction integrity. It aims to continuously ensure structural coupling in multi-turn workflows, addressing the shortcomings of current evaluation methods that fail to detect gradual degradation.

information theory monitoring evaluation real-time AI

RESEARCH·arXiv CS.CL·4/17/2026

MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios

MemGround is a new rigorous long-term memory benchmark for LLMs, designed to overcome the limitations of static evaluations by using rich, gamified interactive scenarios. It features a three-tier hierarchical framework to assess different memory types and a multi-dimensional metric suite for comprehensive quantification.

evaluation gamification memory benchmark