AI evaluation

65 items

RESEARCH·arXiv CS.CL·4/17/2026

Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning

This research investigates whether Large Language Models (LLMs) can identify methodological flaws, such as data leakage, in published machine learning studies. In a case study, six state-of-the-art LLMs consistently identified that a gesture recognition paper's evaluation was flawed because its data partitions were not independent.

deep learning machine learning large language models AI evaluation
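
As a rough illustration of what independent partitioning can look like, the sketch below splits data by group so that no group contributes samples to both train and test. It assumes samples can be grouped by a shared source such as a subject or recording; the `group_split` helper and the toy `(subject_id, frame_index)` data are hypothetical and are not taken from the paper.

```python
import random

def group_split(samples, group_of, test_fraction=0.25, seed=0):
    """Split samples so that no group appears in both train and test."""
    # Shuffle the distinct groups, not the individual samples.
    groups = sorted({group_of(s) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test

# Hypothetical data: 4 subjects with 5 frames each.
samples = [(subject, frame) for subject in range(4) for frame in range(5)]
train, test = group_split(samples, group_of=lambda s: s[0])

train_subjects = {s[0] for s in train}
test_subjects = {s[0] for s in test}
print(train_subjects.isdisjoint(test_subjects))
```

A per-sample random split of the same data would very likely place frames from one subject on both sides, which is the kind of leakage the summary describes.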

RESEARCH·arXiv CS.AI·20d ago

Open-World Evaluations for Measuring Frontier AI Capabilities

This paper advocates for "open-world evaluations" as a complement to traditional benchmarks for measuring frontier AI capabilities. It introduces CRUX, a project for conducting these regular, long-horizon, real-world task assessments, exemplified by an AI agent successfully publishing an iOS app.

AI capabilities CRUX project open-world evaluations frontier AI

RESEARCH·arXiv CS.AI·18d ago

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

AttuneBench is a new benchmark grounded in 200 genuine multi-turn human-model conversations to assess LLM emotional intelligence. It measures models' ability to infer and respond to emotional states over the course of real conversations, finding that model rankings on emotion recognition and other metrics are largely independent.

Emotional Intelligence benchmarks human-AI interaction AI evaluation

RESEARCH·arXiv CS.CL·5/11/2026

Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

This research paper presents an atlas of domain-level metacognitive monitoring across 33 frontier LLMs, analyzing 1,500 MMLU items across six domains. It reveals significant within-model variation, with Applied/Professional knowledge being the easiest and Formal Reasoning/Natural Science the hardest domains to monitor.

LLMs Metacognition cognitive AI benchmarks

RESEARCH·arXiv CS.CL·26d ago

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

This paper audits multimodal-physics evaluation pipelines, uncovering construction practices that distort how vision-language reasoning is measured. It addresses train-eval contamination, translation drift, and MCQ saturation, releasing new artifacts to tackle these gaps.

multimodal AI Physics reasoning Corpus development benchmarking

RESEARCH·arXiv CS.CL·21d ago

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Low-resource natural language processing has experienced explosive growth, but its evaluation faces a critical challenge: the scarcity of sociolinguistic expertise needed to assess complex generative systems. This creates an "Annotation Scarcity Paradox," where the technical capacity to scale models vastly outpaces the human infrastructure required for authentic evaluation.

machine learning NLP Low-resource languages AI evaluation

RESEARCH·arXiv CS.CL·27d ago

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

This paper proposes evaluating LLM fairness through in-situ conversational behavior instead of standardized tests. It introduces the MAC-Fairness framework for behavioral analysis in multi-agent dialogue, revealing the unreliability of traditional approaches.

LLM fairness Research Methods multi-agent systems AI evaluation

RESEARCH·arXiv CS.CL·23d ago

Capability Conditioned Scaffolding for Professional Human LLM Collaboration

This research introduces Capability Conditioned Scaffolding, a framework that addresses Professional Domain Drift in human-LLM collaboration by tailoring AI interventions to the user's level of expertise. A pilot evaluation showed that this approach makes human-AI collaboration more reliable, going beyond mere stylistic personalization.

human-AI collaboration User expertise Domain Adaptation LLM interaction

RESEARCH·arXiv CS.AI·12d ago

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

The BEAMS Initiative focuses on guiding the development of responsible and ethical AI tools for modeling and simulation by establishing human-centered benchmarks. It uses an open digital and organizational infrastructure, including the 'sd ai' open-source project, to collaboratively evaluate these AI tools.

open-source AI modeling and simulation benchmarking AI evaluation

RESEARCH·arXiv CS.AI·12d ago

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

This research evaluates LLM-generated reviews for scientific papers from both author and reviewer perspectives. It identifies limited alignment between LLM and human reviews and explores how authors can effectively "game" LLM reviews to improve submissions.

scientific review human-AI interaction AI evaluation LLM

RESEARCH·arXiv CS.AI·15d ago

Confidence Calibration in Large Language Models

This study investigates confidence calibration in Large Language Models (LLMs) across diverse tasks, finding that current LLMs are overconfident on difficult tests and underconfident on easy ones. The researchers developed LifeEval, a new test to evaluate model calibration across varying levels of difficulty.

Confidence Calibration Overconfidence machine learning large language models
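
One simple way to make "overconfident" and "underconfident" concrete is to compare mean stated confidence with actual accuracy. The sketch below does that with invented numbers; the `calibration_gap` helper and the `(confidence, correct)` records are hypothetical, and this is not presented as LifeEval's metric.

```python
def calibration_gap(records):
    """Mean confidence minus accuracy: positive means overconfident."""
    mean_confidence = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, correct in records if correct) / len(records)
    return mean_confidence - accuracy

# Hypothetical (confidence, correct) pairs for easy and hard questions.
easy = [(0.7, True), (0.6, True), (0.8, True), (0.7, False)]
hard = [(0.9, False), (0.8, True), (0.9, False), (0.8, False)]

easy_gap = calibration_gap(easy)  # confidence below accuracy
hard_gap = calibration_gap(hard)  # confidence above accuracy
print(f"easy: {easy_gap:+.2f}, hard: {hard_gap:+.2f}")
```

In this toy data the gap is negative on the easy set and positive on the hard set, matching the pattern the summary reports. A well-calibrated model would have a gap near zero on both.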

RESEARCH·arXiv CS.AI·14d ago

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

OmniToM is a new benchmark designed to evaluate Theory of Mind in LLMs by explicitly modeling belief structures. This approach moves beyond end-point question answering, allowing for a deeper analysis of mental-state representations, including divergent or mistaken beliefs.

LLMs Social Reasoning benchmarking AI evaluation

RESEARCH·arXiv CS.AI·14d ago

JobBench: Aligning Agent Work With Human Will

JobBench is a new benchmark that evaluates AI agents on workflows identified by experts as high-priority for delegation, covering 130 tasks across 35 occupations. It aims to shift the labour-market effect from replacement to enhancement, building agents that do what humans actually want delegated.

future-of-work job delegation benchmarking AI evaluation

RESEARCH·arXiv CS.AI·14d ago

Can LLMs Introspect? A Reality Check

A new study questions whether large language models (LLMs) can truly introspect, arguing that current conclusions might be premature. It suggests that apparent success could stem from general anomaly detection rather than genuine introspection, drawing lessons from human metacognition research.

LLMs cognitive science Metacognition Introspection

RESEARCH·DEV.to AI·4/21/2026

KWBench: New Benchmark Tests LLMs' Unprompted Problem Recognition

Researchers introduced KWBench, a 223-task benchmark that measures whether LLMs can recognize the governing game-theoretic problem in professional scenarios without explicit prompts. The best-performing model passed only 27.9% of tasks, highlighting a critical gap between task execution and situational understanding.

LLMs benchmarks AI evaluation

RESEARCH·arXiv CS.AI·4/21/2026

Computational Hermeneutics: Evaluating generative AI as a cultural technology

This paper proposes computational hermeneutics as a new framework for evaluating generative AI, viewing it as a cultural technology and a "context machine." It argues that evaluations must address interpretive challenges like situatedness, plurality, and ambiguity, using iterative, people-inclusive, and culturally contextual benchmarks.

humanities AI ethics AI evaluation Generative AI

ARTICLE·Hugging Face Blog·4/29/2026

AI evals are becoming the new compute bottleneck

AI evaluations are emerging as a significant new bottleneck in the development process, akin to the historical limitations posed by computational power. This suggests that the resources and time required to assess AI models are becoming a major constraint on progress.

computational resources machine learning infrastructure AI evaluation AI development

RESEARCH·arXiv CS.AI·4/23/2026

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

ThermoQA is a new three-tier benchmark of 293 open-ended engineering thermodynamics problems for evaluating thermodynamic reasoning in LLMs. Leading LLMs such as Claude Opus 4.6 and GPT-5.4 achieve high scores, but cross-tier degradation confirms that property memorization does not imply thermodynamic reasoning. The dataset and code are open-source.

Dataset benchmarking large language models AI evaluation

RESEARCH·arXiv CS.CL·29d ago

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Magis-Bench is a new benchmark for evaluating Large Language Models (LLMs) on magistrate-level legal tasks, using 74 questions from recent Brazilian judicial competitive examinations. It evaluates 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with strong inter-judge agreement.

LLMs Legal AI Judicial tasks benchmarks

RESEARCH·arXiv CS.CL·4/15/2026

Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

This research introduces the "Filtered Reasoning Score," a novel metric designed to assess the quality of reasoning in AI models. It specifically focuses on evaluating the reasoning evident in a model's most confident outputs or traces.

AI metrics machine learning Reasoning AI evaluation