← heapsort-ai

AI evaluation

65 items

RESEARCHarXiv CS.CL·4/17/2026

Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning

This research investigates whether Large Language Models (LLMs) can identify methodological flaws, such as data leakage, in published machine learning studies. A case study showed six state-of-the-art LLMs consistently detected evaluation flaws in a gesture recognition paper due to non-independent data partitioning.

27
RESEARCHarXiv CS.AI·18d ago

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

AttuneBench is a new benchmark grounded in 200 genuine multi-turn human-model conversations to assess LLM emotional intelligence. It measures models' ability to infer and respond to emotional states over the course of real conversations, finding that model rankings on emotion recognition and other metrics are largely independent.

27
RESEARCHarXiv CS.CL·21d ago

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Low-resource natural language processing has experienced explosive growth, but its evaluation faces a critical challenge: the scarcity of sociolinguistic expertise needed to assess complex generative systems. This creates an "Annotation Scarcity Paradox," where the technical capacity to scale models vastly outpaces the human infrastructure required for authentic evaluation.

27
RESEARCHarXiv CS.AI·4/23/2026

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

ThermoQA is a new three-tier benchmark of 293 open-ended engineering thermodynamics problems introduced to evaluate thermodynamic reasoning in LLMs. Leading LLMs like Claude Opus 4.6 and GPT-5.4 achieve high scores, but cross-tier degradation confirms that property memorization does not imply thermodynamic reasoning, with the dataset and code being open-source.

27