performance evaluation

6 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/14/2026

ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

ClawBench is a new benchmark that evaluates AI browser agents on 153 everyday tasks across 144 live websites. The best model, Claude Sonnet 4.6, achieves only a 33.3% success rate, which indicates a significant gap in current AI capabilities for online task completion.

RESEARCH · arXiv CS.AI · 4/21/2026

Agentic Frameworks for Reasoning Tasks: An Empirical Study

This empirical study evaluates 22 agentic frameworks on three reasoning benchmarks (BBH, GSM8K, ARC), comparing their performance, efficiency, and practical suitability. Results indicate that 19 frameworks completed all tasks, and 12 of them performed stably, with 74.6-75.9% accuracy, 4-6 seconds of execution time, and a cost of 0.14-0.18 cents per task.

RESEARCH · arXiv CS.LG · 4/30/2026

Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts

This research addresses bias in performance estimation for imbalanced classification, particularly bias arising from minority subconcepts within classes. It introduces a practical, utility-weighted evaluation metric, predicted-weighted balanced accuracy (pBA), which uses predicted posterior probabilities to correct this bias and give a more accurate assessment.
