Benchmarks

67 items

RESEARCH · arXiv CS.CL · 4/14/2026

Simulating Organized Group Behavior: New Framework, Benchmark, and Analysis

This paper introduces a new framework and benchmark for simulating organized group behavior, such as corporate decision-making in response to market dynamics. It formalizes the "Organized Group Behavior Simulation" task and presents GROVE, a benchmark of 8,052 real-world context-decision pairs for predicting the actions of collective entities.

28
RESEARCH · arXiv CS.CL · 21d ago

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

This paper introduces CHI-Bench, a new benchmark designed to test AI agents' ability to automate complex, policy-rich, and long-horizon healthcare workflows. It addresses critical gaps in current benchmarks by focusing on policy density, multi-role composition, and multilateral interaction in realistic healthcare operations across multiple domains.

28
RESEARCH · arXiv CS.CL · 6d ago

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic inspection of the FOLIO and MALLS validation splits revealed high rates of incorrect first-order logic (FOL) formalizations and ambiguous natural-language (NL) sentences, which distort AI model evaluation. The authors developed and released corrected ground truths for these datasets, demonstrating how annotation errors affect the evaluation of state-of-the-art LLMs.

28
RESEARCH · arXiv CS.AI · 4/22/2026

From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS

This paper introduces a neuro-symbolic framework for translating natural-language reasoning problems into executable Narsese, leveraging first-order logic. It presents NARS-Reasoning-v0.1, a new benchmark featuring reasoning problems with corresponding formal representations and truth labels for evaluating reasoning capabilities.

27
RESEARCH · DEV.to AI · 15d ago

François Chollet on the Future of AGI

François Chollet discusses the future of AGI, predicting its arrival around 2030, and introduces NDI lab's mission to develop a new, "optimal" machine learning paradigm based on symbolic program synthesis. He critiques deep learning's limitations and outlines NDI's high-risk, high-reward strategy for foundational AI advancement.

27
RESEARCH · DEV.to AI · 20d ago

Self-evolving retrieval lifts benchmark scores 25%

AI agents that adapt their retrieval configurations while running deliver a 25.7% performance lift on established benchmarks, overturning the assumption that retrieval stacks should be frozen. In this paradigm, an LLM-driven "diagnosis" module rewrites the search strategy as new queries arrive, treating the entire memory-access pipeline as a mutable policy.

27