Benchmarks

67 items

NEWS·DEV.to AI·7d ago

Claude Opus 4.8: Dynamic Workflows and Parallel Subagents

Anthropic launched Claude Opus 4.8, introducing dynamic workflows that enable hundreds of parallel subagents for complex tasks. This version shows significant improvements in benchmarks like SWE-bench Verified and USAMO, with unchanged standard pricing and a new, more affordable fast mode.

AI models Anthropic Benchmarks large language models

RESEARCH·DEV.to AI·5/7/2026

AI agent logs expose reproducibility gaps

AI agent logs reveal significant reproducibility gaps, where autonomous agents frequently fail even after initial successes, especially in web navigation tasks. Research, including the SWE-chat corpus, highlights that less than half of agent-produced code survives into user commits, exposing a critical discrepancy between benchmark scores and real-world reliability.

software development Reliability Reproducibility Benchmarks

RESEARCH·arXiv CS.CL·5/1/2026

CL-bench Life: Can Language Models Learn from Real-Life Context?

CL-bench Life is a new human-curated benchmark designed to assess whether frontier language models can effectively learn from complex, messy real-life contexts. It comprises 405 context-task pairs to test models' ability to reason over personal and social experiences.

context-learning language models Benchmarks

RESEARCH·arXiv CS.AI·4/27/2026

Math Takes Two: A test for emergent mathematical reasoning in communication

This paper proposes Math Takes Two, a new benchmark designed to assess the emergence of mathematical reasoning in language models through communication. It tests whether two agents, without prior mathematical knowledge, can develop a shared symbolic protocol to solve a visually grounded task where a numerical system facilitates extrapolation.

language models mathematical reasoning AI communication Benchmarks

RESEARCH·arXiv CS.CL·4/16/2026

WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

WorkRB is the first open-source, community-driven benchmark for AI in the work domain, addressing research fragmentation and employment data sensitivity. It unifies 13 diverse tasks from 7 groups as recommendation and NLP tasks, such as job/skill recommendation and skill extraction.

hiring future-of-work recommender systems NLP

RESEARCH·arXiv CS.AI·5/4/2026

ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

ARMOR 2025 is a new military-aligned benchmark designed to evaluate the safety of large language models (LLMs) in defense applications, beyond civilian contexts. It addresses the gap in existing benchmarks by grounding evaluations in military doctrines like the Law of War, Rules of Engagement, and Joint Ethics Regulation.

ethics military AI Benchmarks AI safety

RESEARCH·arXiv CS.AI·17d ago

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

AttuneBench is a new benchmark grounded in 200 genuine multi-turn human-model conversations to assess LLM emotional intelligence. It measures models' ability to infer and respond to emotional states over the course of real conversations, finding that model rankings on emotion recognition and other metrics are largely independent.

Emotional Intelligence Benchmarks human-AI interaction AI evaluation

RESEARCH·arXiv CS.CL·29d ago

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

MultiSoc-4D is a new Bengali social media benchmark dataset designed to diagnose LLM behavior in closed-set annotation. The research identifies "instruction-induced label collapse," a phenomenon in which LLMs systematically prefer fallback labels, leading to under-detection of minority categories.

LLMs Natural Language Processing Data Annotation Benchmarks

RESEARCH·arXiv CS.AI·17d ago

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

The paper introduces SMDD-Bench, a new challenging multi-turn benchmark with 502 guaranteed-solvable tasks to evaluate LLM agents' performance in real-world small molecule drug design. It aims to standardize evaluation across diverse chemistries and targets, requiring strong chemical, biological, and 3D intuition.

LLMs Scientific Discovery Benchmarks drug design

RESEARCH·arXiv CS.CL·29d ago

Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

This research paper presents an atlas of domain-level metacognitive monitoring across 33 frontier LLMs, analyzing 1,500 MMLU items across six domains. It reveals significant within-model variation, with Applied/Professional knowledge being the easiest and Formal Reasoning/Natural Science the hardest domains to monitor.

LLMs Metacognition cognitive AI Benchmarks

RESEARCH·arXiv CS.CL·6d ago

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX is a large-scale multilingual benchmark introduced to address the challenges of idiomatic expressions in natural language processing. It contains over 190K contextualized examples spanning 12K+ idioms with aligned semantic representations in English, Arabic, and French.

language models Natural Language Processing datasets Benchmarks

RESEARCH·arXiv CS.AI·14d ago

BODHI: Precise OS Kernel Specification Inference

This paper proposes BODHI, a domain knowledge prompting method for OS kernel specification inference, aiming to overcome current LLM limitations. It augments the standard few-shot prompt with a structured C-to-Python translation guide, improving automation and specification precision.

AI models LLMs operating systems Formal verification

RESEARCH·arXiv CS.CL·8d ago

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

This paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark for evaluating Retrieval-Augmented Generation (RAG) systems using realistic queries and expert-annotated case law answers. It highlights the sensitivity of retrieval performance, the competitiveness of open-source embedding models, the limitations of automatic evaluation, and hallucinations in LLM-generated responses.

Retrieval Augmented Generation LLMs evaluation Legal AI

RESEARCH·arXiv CS.AI·13d ago

Constraint acquisition needs better benchmarks

Current benchmarks for Constraint Acquisition (CA) and Mathematical Programming (MP) models are insufficient, impeding research reproducibility and comparability. This work introduces MPMMine, a new benchmark suite designed to validate and enhance MP models through diverse domain knowledge artifacts, promoting consistency and openness.

Model Validation Constraint Acquisition Mathematical Programming Benchmarks

ARTICLE·DEV.to AI·22d ago

GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, and Benchmarks

This article compares GPT-5.5 and Claude Opus 4.7, two leading AI language models, discussing their advancements and distinct focuses. It aims to guide model selection for AI projects by detailing their pricing, speed, and benchmark differences.

AI models GPT Claude Benchmarks

RESEARCH·DEV.to AI·13d ago

SpatialBench: New Benchmark Tests Foundation Models on 3D Tasks

SpatialBench is a new benchmark from ropedia_ai designed to evaluate spatial foundation models across 7 tasks and 5 datasets. It tests true 3D spatial understanding in areas like depth estimation, surface normal prediction, and 3D object detection.

spatial computing 3D Foundation Models Benchmarks

RESEARCH·DEV.to AI·13d ago

NVIDIA Vera CPU Benchmarks: 1.55x Faster Than Intel Xeon in Phoronix Tests

NVIDIA Vera CPU benchmarks by Phoronix show it running 1.55x faster than the Intel Xeon 6980P and 10% faster than the AMD EPYC 9575F. This 88-core ARM processor, featuring 1.2 TB/s memory bandwidth, is designed for agentic AI workloads.

CPU AI hardware Benchmarks NVIDIA

RESEARCH·DEV.to AI·4/21/2026

KWBench: New Benchmark Tests LLMs' Unprompted Problem Recognition

Researchers introduced KWBench, a 223-task benchmark that measures whether LLMs can recognize the governing game-theoretic problem in professional scenarios without explicit prompts. The best-performing model passed only 27.9% of tasks, highlighting a critical gap between task execution and situational understanding.

LLMs Benchmarks AI evaluation

RESEARCH·arXiv CS.CL·4/7/2026

CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

CresOWLve is a new benchmark for evaluating creative problem-solving in LLMs, overcoming the limitations of existing benchmarks. It uses puzzles grounded in real-world knowledge that require diverse creative thinking strategies and the combination of facts to find solutions.

LLMs Creative Problem Solving Benchmarks Cognitive Abilities

RESEARCH·arXiv CS.CL·28d ago

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Magis-Bench is a new benchmark for evaluating Large Language Models (LLMs) on magistrate-level legal tasks, using 74 questions from recent Brazilian judicial competitive examinations. It evaluates 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with strong inter-judge agreement.

LLMs Legal AI Judicial tasks Benchmarks