
evaluation

53 items

RESEARCH · arXiv CS.CL · 5/5/2026

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

The CLEAR framework assesses how ambiguity and uncertainty affect the reliability of medical large language models (LLMs), moving beyond simplified evaluation benchmarks. It systematically perturbs the answer options and their semantic framing, and finds that performance degrades as the number of plausible answers increases and that models grow less cautious when the abstention option is phrased with uncertainty.

RESEARCH · arXiv CS.CL · 5/1/2026

BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

This paper introduces BatteryPass-12K, the first public dataset for the novel task of digital battery passport (DBP) conformance classification, addressing a critical need ahead of new EU regulations. It benchmarks 22 language models and finds that "thinking models" such as GPT-5.4 perform best and that few-shot examples significantly improve results on this challenging task.

RESEARCH · arXiv CS.CL · 4/16/2026

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

KMMMU is a new native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings, featuring 3,466 questions from native exams. The study shows that current AI models achieve only 42.05% accuracy on the full set, with significant failures on culture-specific and discipline-specific problems.

RESEARCH · arXiv CS.AI · 22d ago

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

This paper introduces a new paradigm for interactively evaluating Theory of Mind (ToM) improvements in Large Language Models (LLMs) for human-AI interactions. Empirical findings from real-world datasets and a user study reveal that ToM enhancements on static benchmarks do not always translate to benefits in dynamic human-AI interactions.

RESEARCH · arXiv CS.CL · 25d ago

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

This research evaluates the reliability of large language models (LLMs) in biomedical question answering when the retrieved evidence is conflicting or incomplete. It finds that accuracy drops significantly and predictions flip when the order of correct and contradictory documents is reversed, pointing to order effects and the need for conflict-aware abstention.

RESEARCH · arXiv CS.CL · 8d ago

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

This paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark for evaluating Retrieval-Augmented Generation (RAG) systems using realistic queries and expert-annotated case law answers. It highlights the sensitivity of retrieval performance, the competitiveness of open-source embedding models, the limitations of automatic evaluation, and hallucinations in LLM-generated responses.
