The Open Agent Leaderboard
This content describes the Open Agent Leaderboard, a platform designed to rank and compare the performance of various AI agents. It provides a standardized evaluation of their capabilities.
The article discusses the limitations of current evaluations of RAG (Retrieval-Augmented Generation) systems in enterprise settings, which do not systematically diagnose the complex challenges beyond final accuracy. To fill this gap, the research proposes a multi-dimensional diagnostic framework and a benchmark for enterprise RAG, based on a four-axis difficulty taxonomy.
This content evaluates the strategic reasoning capabilities of forecasting agents. It explores methodologies and findings on how AI systems perform strategic predictions.
Prompted by recent LLM advances, this paper conducts a scoping review of NLP's long history of methodological reflection on evaluation concerns. It develops a taxonomy, synthesizing recurring positions and trade-offs, and provides a structured checklist to support deliberate evaluation design and interpretation.
This content announces the integration of Benchmaxxer Repellant into the Open ASR Leaderboard. This new addition aims to enhance the robustness and fairness of automatic speech recognition system evaluations.
This research evaluates how large reasoning models handle user queries containing factually inaccurate presuppositions. It finds that while reasoning models show a slight improvement over non-reasoning models, they still fail to challenge a significant fraction of false assumptions.
AgentAtlas addresses the fragmentation in benchmarks used to evaluate large language model (LLM) agents, which currently emphasize different units of measurement. It introduces four components, including a six-state control-decision taxonomy, a nine-category trajectory-failure taxonomy, and a methodology to measure model capability based on prompt supervision.
A user asks whether TurboQuant technology is truly revolutionary or just another mediocre technology that has been overhyped by Google and Twitter. The question aims to discern the true relevance and impact of TurboQuant.
This content analyzes the common limitations of image processing metrics, using visual examples to illustrate how traditional evaluation methods may not always align with human perception or accurately reflect algorithm performance. It highlights the challenges in objectively assessing image quality and processing effectiveness.
This article discusses how to build more effective AI agents by improving their "harnesses." It suggests using evaluations as a strong learning signal to autonomously guide the "hill-climbing" process for harness development.

The author observed that production RAG systems often lack proper evaluation, which leads to poor performance and 40% wrong answers. They found that most RAG failures stem from retrieval issues rather than LLM problems, and they emphasize measuring Recall@k to address this.
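As a rough illustration of the Recall@k metric mentioned above, here is a minimal Python sketch. It assumes each query has a ranked list of retrieved document IDs and a set of known-relevant IDs; the function names and the sample IDs are hypothetical and do not come from the original post.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Recall@k over (retrieved, relevant) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    if not scores:
        raise ValueError("no queries to evaluate")
    return sum(scores) / len(scores)


# Hypothetical evaluation data: ranked retrieval output and labeled relevant docs.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d9", "d4", "d5", "d6"], {"d8"}),
]

score_k2 = mean_recall_at_k(runs, k=2)
score_k4 = mean_recall_at_k(runs, k=4)
```

A low Recall@k on a labeled query set suggests the relevant passages never reach the LLM, which would point to the retriever rather than the generator as the place to look first.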
This content covers the design and evaluation of LLM agents for interactive optimization. It explores methods for building conversational AI systems and measuring their effectiveness.
This content proposes a new framework for evaluating voice agents, called EVA. The goal is to establish a standardized methodology for measuring the quality and performance of conversational AI systems.