performance evaluation

6 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/14/2026

ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

ClawBench is a new benchmark evaluating AI browser agents on 153 everyday tasks across 144 live websites. Key findings reveal the best model (Claude Sonnet 4.6) achieves only a 33.3% success rate, indicating a significant gap in current AI capabilities for online task completion.

performance evaluation · Benchmarking · browser agents · online tasks

RESEARCH · arXiv CS.AI · 5/4/2026

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

TokenArena is a continuous benchmark that measures AI inference at endpoint granularity across five core axes: output speed, time to first token, price, effective context, and quality. It combines these with energy estimates into composites such as joules per correct answer, dollars per correct answer, and endpoint fidelity.

AI models · Energy Efficiency · performance evaluation · Benchmarking
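
The per-correct-answer composites named in the TokenArena summary can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the `EndpointRun` fields, the `per_correct` helper, and the sample numbers are all hypothetical, and the summary does not say how energy is estimated or how correctness is graded.

```python
from dataclasses import dataclass


@dataclass
class EndpointRun:
    """Hypothetical aggregate for one endpoint over a batch of prompts."""
    total_joules: float    # estimated energy used across the batch (assumed given)
    total_dollars: float   # total price paid across the batch
    correct_answers: int   # number of responses graded correct


def per_correct(total: float, correct: int) -> float:
    """Return total / correct, or infinity when no answer was correct."""
    if correct <= 0:
        return float("inf")
    return total / correct


def composites(run: EndpointRun) -> dict[str, float]:
    """Compute the two per-correct-answer composites for a run."""
    return {
        "joules_per_correct": per_correct(run.total_joules, run.correct_answers),
        "dollars_per_correct": per_correct(run.total_dollars, run.correct_answers),
    }


# Illustrative numbers only, not results from the paper.
run = EndpointRun(total_joules=1200.0, total_dollars=0.06, correct_answers=40)
result = composites(run)
print(result)
```

Dividing by correct answers rather than by total requests means an endpoint that is cheap per token but often wrong can still score poorly, which is presumably the point of normalizing this way.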

RESEARCH · arXiv CS.AI · 4/21/2026

Agentic Frameworks for Reasoning Tasks: An Empirical Study

This empirical study evaluates 22 agentic frameworks across three reasoning benchmarks (BBH, GSM8K, ARC) to compare their performance, efficiency, and practical suitability. Of these, 19 frameworks completed all tasks, and 12 showed stable performance: 74.6-75.9% accuracy, 4-6 seconds of execution time, and a cost of 0.14-0.18 cents per task.

AI frameworks · performance evaluation · Benchmarking · AI agents

RESEARCH · arXiv CS.LG · 4/30/2026

Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts

This research addresses the bias in performance estimation for imbalanced classification, particularly regarding minority subconcepts within classes. It introduces a new practical utility-weighted evaluation metric, predicted-weighted balanced accuracy (pBA), which uses predicted posterior probabilities to correct this bias and provide a more accurate assessment.

imbalanced-classification · bias-correction · machine-learning-metrics · subconcept-analysis

RESEARCH · arXiv CS.AI · 5/6/2026

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

This research introduces Terminus-4B, a finetuned small language model, to explore its capability in replacing frontier LLMs for agentic terminal execution tasks. The model is post-trained using Supervised Finetuning and Reinforcement Learning with rubric-based LLM-as-judge rewards.

LLMs · model training · performance evaluation · Small Language Models

RESEARCH · DEV.to AI · 18d ago

Performance Comparisons of Routing Protocols in Mobile Ad Hoc Networks

This article compares routing protocols in Mobile Ad Hoc Networks (MANETs). It likely analyzes their performance metrics under different network conditions to identify the best-performing protocols.

Routing Protocols · Networking · MANETs · Wireless Communication