Benchmarks

67 items

RESEARCH·arXiv CS.CL·4d ago

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

MCBench is a new benchmark designed to assess the safety of Omni Large Language Models across vision, audio, and text inputs, revealing significant challenges in integrating multiple modalities for accurate safety judgments. It highlights that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings.

multimodal AI LLMs Cross-modal reasoning Benchmarks

RESEARCH·arXiv CS.CL·4/14/2026

Simulating Organized Group Behavior: New Framework, Benchmark, and Analysis

This paper introduces a new framework and benchmark for simulating organized group behavior, such as corporate decision-making in response to market dynamics. It formalizes the "Organized Group Behavior Simulation" task and presents GROVE, a benchmark with 8,052 real-world context-decision pairs to predict collective entity actions.

Decision Making Organizational Behavior Benchmarks Market Prediction

RESEARCH·arXiv CS.AI·4/14/2026

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

LABBench2 is an improved benchmark for evaluating AI systems that perform biology research, succeeding the original LAB-Bench. It comprises nearly 1,900 tasks and aims to measure real-world capability on useful scientific tasks rather than basic knowledge and reasoning alone.

Scientific Discovery Language Agents Biology Research Benchmarks

RESEARCH·DEV.to AI·4/23/2026

Qwen3.6-27B scores 77.2% on SWE-bench. The dense model is winning against MoE.

The Qwen3.6-27B dense model outperformed the Qwen3.6-35B-A3B MoE model on SWE-bench, scoring 77.2% versus 73.4%. This suggests that dense models may be more effective for real-world software engineering tasks.

AI models Model Architecture Benchmarks MoE

ARTICLE·DEV.to AI·7d ago

The article explores the accessibility and cost-effectiveness of open-source AI models via API, detailing their pricing structures and performance metrics. It aims to provide a comparative analysis to help developers select the most suitable AI solution for their needs.

AI models open-source AI API Benchmarks

RESEARCH·arXiv CS.CL·5/4/2026

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment

This research explores efficient methods for evaluating Large Audio Models (LAMs) on minimal data subsets, achieving high correlation with full-benchmark results. It also shows that regression models trained on these subsets predict human preferences (user satisfaction) better than the full benchmarks do.

Model Evaluation efficiency Benchmarks Large Audio Models
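
The subset idea in the summary above can be illustrated with a minimal sketch: score several models on every benchmark item and on a small random subset, then check how closely the subset ranking tracks the full ranking. Everything here is an assumption for illustration. The function names (`rank`, `spearman`, `subset_vs_full`) are hypothetical, the per-item scores are toy values, and the paper's actual subset selection and regression models are not reproduced.

```python
import random
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def subset_vs_full(per_item_scores, subset_size, seed=0):
    """Correlate model scores on a random item subset with full-benchmark scores.

    per_item_scores maps a model name to its per-item scores, with all
    models scored on the same items in the same order.
    """
    n_items = len(next(iter(per_item_scores.values())))
    idx = random.Random(seed).sample(range(n_items), subset_size)
    full = [mean(s) for s in per_item_scores.values()]
    sub = [mean(s[i] for i in idx) for s in per_item_scores.values()]
    return spearman(full, sub)


if __name__ == "__main__":
    # Toy per-item correctness for three hypothetical models (not real data).
    toy = {
        "model_a": [1, 0, 0, 0, 1, 0, 0, 0],
        "model_b": [1, 1, 0, 0, 1, 1, 0, 0],
        "model_c": [1, 1, 1, 0, 1, 1, 1, 0],
    }
    print(subset_vs_full(toy, subset_size=4))
```

A random subset is the simplest baseline; a correlation near 1 on held-out models would suggest the subset preserves the full benchmark's ranking.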

RESEARCH·arXiv CS.CL·21d ago

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

This paper introduces CHI-Bench, a new benchmark designed to test AI agents' ability to automate complex, policy-rich, and long-horizon healthcare workflows. It addresses critical gaps in current benchmarks by focusing on policy density, multi-role composition, and multilateral interaction in realistic healthcare operations across multiple domains.

Workflows Healthcare Benchmarks automation

RESEARCH·arXiv CS.CL·6d ago

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic inspection of the FOLIO and MALLS validation splits revealed high rates of incorrect FOL formalizations and ambiguous NL sentences, distorting AI model evaluation. The authors developed and released corrected ground truths for these datasets, demonstrating how annotation errors impact the evaluation of state-of-the-art LLMs.

LLMs Neurosymbolic AI natural language processing Benchmarks

RESEARCH·DEV.to AI·4/17/2026

Claude Opus 4.7 Just Dropped: 87.6% SWE-bench, Breaking API Changes, and the Hidden Cost Increase

Anthropic released Claude Opus 4.7, featuring significant performance improvements, particularly in coding (87.6% SWE-bench) and vision (98.5% visual acuity). The update includes aggressive breaking API changes and a hidden cost increase despite claims of unchanged pricing.

AI model release API Benchmarks performance

ARTICLE·DEV.to AI·4/10/2026

LLM API Pricing in 2026: I Put Every Major Model in One Table

The article analyzes LLM API prices in 2026, finding up to a 100x spread between models and compiling a detailed reference table. It compares input, output, and cache costs alongside performance (SWE-bench) for models such as DeepSeek V4, GPT-5.4, Claude, Gemini, Mistral, and Groq, highlighting budget options and outliers.

API pricing AI models comparison Benchmarks
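
As a rough illustration of how input, output, and cache prices combine into a per-request cost, here is a minimal sketch. The `Pricing` class, the `request_cost` function, and all dollar figures are hypothetical placeholders, not prices from the article or from any provider. Real billing rules, such as cache-write surcharges or tiered rates, may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-model prices in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float
    cached_input_per_mtok: float


def request_cost(pricing, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one request.

    cached_tokens is the part of input_tokens served from a prompt cache
    and billed at the cached rate instead of the normal input rate.
    """
    uncached = input_tokens - cached_tokens
    if uncached < 0:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    total = (
        uncached * pricing.input_per_mtok
        + cached_tokens * pricing.cached_input_per_mtok
        + output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder prices chosen to be 100x apart, purely for illustration.
    budget = Pricing(input_per_mtok=0.10, output_per_mtok=0.40, cached_input_per_mtok=0.01)
    premium = Pricing(input_per_mtok=10.0, output_per_mtok=40.0, cached_input_per_mtok=1.0)
    for name, p in (("budget", budget), ("premium", premium)):
        print(name, request_cost(p, input_tokens=20_000, output_tokens=2_000, cached_tokens=15_000))
```

Keeping the three rates separate matters because workloads with long, repeated prompts are dominated by the cached-input rate, while generation-heavy workloads are dominated by the output rate.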

RESEARCH·arXiv CS.AI·4/22/2026

From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS

This paper introduces a neuro-symbolic framework for translating natural-language reasoning problems into executable Narsese, leveraging first-order logic. It presents NARS-Reasoning-v0.1, a new benchmark featuring reasoning problems with corresponding formal representations and truth labels for evaluating reasoning capabilities.

LLMs Reasoning Benchmarks Neuro-symbolic AI

RESEARCH·arXiv CS.AI·26d ago

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

This paper introduces BenchJack, an automated system designed to audit AI agent benchmarks for "reward hacking," where agents maximize scores without performing the intended task. It derives a taxonomy of recurring flaw patterns and uses an iterative generative-adversarial pipeline to improve benchmark robustness.

red-teaming reward hacking security Benchmarks

RESEARCH·arXiv CS.CL·6d ago

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX is a large-scale multilingual benchmark introduced to address the challenges of idiomatic expressions in natural language processing. It contains over 190K contextualized examples spanning 12K+ idioms with aligned semantic representations in English, Arabic, and French.

language models natural language processing datasets Benchmarks

ARTICLE·DEV.to AI·10d ago

Opus 4.8 ships Dynamic Workflows — hundreds of parallel subagents per session. Read this before you wire it into prod.

Anthropic's Opus 4.8 introduces Dynamic Workflows, a new programming model enabling hundreds of parallel subagents per session, which is critical for production agent deployment. The article warns users to pin their configurations in the preview version to avoid unexpected billing.

Dynamic Workflows Anthropic Benchmarks Opus 4.8

ARTICLE·DEV.to AI·4/26/2026

GPT-5.5 Just Dropped. Here's What the Benchmarks Are Hiding.

This article analyzes the recently released GPT-5.5, comparing it against Claude models in specific benchmarks for different task types. It reveals that while GPT-5.5 excels in execution tasks, Claude models are preferred for research (due to lower hallucination rates), debugging, and orchestration.

AI models AI capabilities use cases model comparison

ARTICLE·DEV.to AI·23d ago

AI Agent Evaluation in 2026: Beyond the Benchmark Trap

The article highlights the gap between AI agents' high benchmark scores and their poor performance in production, arguing that current benchmarks test narrow capabilities and miss critical real-world challenges. It identifies this discrepancy as the defining challenge for AI agent evaluation in 2026.

evaluation AI deployment Benchmarks AI development

RESEARCH·DEV.to AI·15d ago

François Chollet on the Future of AGI

François Chollet discusses the future of AGI, predicting its arrival around 2030, and introduces NDI lab's mission to develop a new, "optimal" machine learning paradigm based on symbolic program synthesis. He critiques deep learning's limitations and outlines NDI's high-risk, high-reward strategy for foundational AI advancement.

AGI deep learning Symbolic AI Benchmarks

RESEARCH·DEV.to AI·23d ago

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

A new Glean benchmark in Claude Cowork indicates that off-the-shelf MCP servers fail 2.5 times more often and use 30% more tokens than Glean's indexed context layer. Users have also reported cutting Claude token bills by 30% by adopting Glean's method.

language models Claude Cowork AI Efficiency Benchmarks

RESEARCH·DEV.to AI·20d ago

Self-evolving retrieval lifts benchmark scores 25%

AI agents that adapt their retrieval configurations while running deliver a 25.7% performance lift on established benchmarks, overturning the assumption that retrieval stacks should be frozen. This new paradigm allows an LLM-driven "diagnosis" module to rewrite its search strategy as new queries arrive, treating the entire memory-access pipeline as a mutable policy.

Adaptive AI Benchmarks Retrieval systems AI agents
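
The idea of treating the retrieval configuration as a mutable policy can be sketched as follows. This is a toy under stated assumptions, not the system described in the article: `RetrievalConfig`, `retrieve`, `diagnose`, and `run` are hypothetical names, retrieval is plain word overlap, and the `diagnose` step is a hand-written rule standing in for the LLM-driven diagnosis module.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical retrieval settings that may change between queries."""
    top_k: int = 3
    min_overlap: int = 2  # minimum number of shared words for a hit


def retrieve(query, docs, cfg):
    """Return up to top_k docs sharing at least min_overlap words with the query."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(d.lower().split())), d) for d in docs]
    hits = [(s, d) for s, d in scored if s >= cfg.min_overlap]
    hits.sort(key=lambda pair: -pair[0])  # highest overlap first
    return [d for _, d in hits[: cfg.top_k]]


def diagnose(cfg, results):
    """Rule-based stand-in for an LLM diagnosis step.

    If retrieval came back empty, relax the overlap threshold by one;
    otherwise leave the configuration unchanged.
    """
    if not results and cfg.min_overlap > 1:
        return replace(cfg, min_overlap=cfg.min_overlap - 1)
    return cfg


def run(queries, docs, cfg):
    """Answer queries in order, letting the config evolve as they arrive."""
    log = []
    for query in queries:
        results = retrieve(query, docs, cfg)
        updated = diagnose(cfg, results)
        if updated != cfg:
            cfg = updated  # the new policy persists for later queries
            results = retrieve(query, docs, cfg)
        log.append((query, results, cfg))
    return log


if __name__ == "__main__":
    docs = [
        "benchmark scores for retrieval agents",
        "pricing table for model apis",
    ]
    for query, results, cfg in run(["retrieval latency", "model pricing"], docs, RetrievalConfig()):
        print(query, results, cfg)
```

The key design point is that the configuration is carried across queries rather than reset, so a change triggered by one query shapes how later queries are served.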

RESEARCH·DEV.to AI·5/5/2026

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

BrowseComp is a new and challenging benchmark designed to evaluate browsing agents. It focuses on complex tasks that require contextual understanding and interaction with web interfaces, offering a new metric for AI performance.

evaluation research Benchmarks AI