
Benchmarks

67 items

RESEARCH · arXiv CS.AI · 19d ago

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas addresses the fragmentation in benchmarks used to evaluate large language model (LLM) agents, which currently emphasize different units of measurement. It introduces four components, including a six-state control-decision taxonomy, a nine-category trajectory-failure taxonomy, and a methodology to measure model capability based on prompt supervision.

27
NEWS · Qwen Blog · 4/28/2025

Qwen3: Think Deeper, Act Faster

Qwen3, a new family of language models, has been released, with the flagship model Qwen3-235B-A22B achieving competitive results on benchmarks. Smaller models such as Qwen3-30B-A3B and Qwen3-4B also showed superior performance compared with other models.

23