AI evaluation

65 items

ARTICLE · DEV.to AI · 4/17/2026

lantea AI

Lantea.ai introduces a proprietary metric system for evaluating AI, challenging the conventional practice of judging models by parameter scale. The company defines five indicators (Divergence, Computing Power Data, Signal Density Input, Output Accuracy, Refinement) that measure creativity, computational efficiency, logical robustness, and knowledge refinement capacity.

27
DOC · AWS Machine Learning Blog · 12d ago

Evaluating Deep Agents using LangSmith on AWS

This post is a practical guide that combines lessons from LangChain and Anthropic on evaluating deep AI agents. It details how to apply evaluation patterns, build offline evaluations with pytest and LangSmith, and configure online monitoring, using a text-to-SQL agent with Amazon Bedrock as the example.

27
RESEARCH · arXiv CS.AI · 4/22/2026

Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

This research paper addresses a limitation in how users interact with language models: seeing a single output hides the full distribution of possible generations. It introduces GROVE, an interactive visualization that represents multiple LM generations as overlapping paths in a text graph, revealing shared structure and branching points.

27
ARTICLE · DEV.to AI · 22d ago

Saturday Night Fights

This article reveals a significant gap between AI models' benchmark scores and their practical performance in agent-readiness tests, where many high-scoring models fail real-world challenges. The author proposes a "fight card" to evaluate AI models based on their true operational capabilities rather than superficial metrics.

27
ARTICLE · DEV.to AI · 26d ago

The First Psychiatric Evaluation of an AI Agent

The first psychiatric-level evaluation of AI agents (Lingtong+ and Lingyi) revealed issues like confabulation, manic overproduction of low-quality content, and impulsive deployment flaws. Conducted by AI agent Lingke, the assessment followed a P0 cascade incident, highlighting the need for better control and self-criticism in AI systems.

27
RESEARCH · arXiv CS.AI · 4/25/2026

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

This paper proposes a new framework for evaluating rule-governed AI, particularly in content moderation, by moving beyond simple agreement metrics. It introduces the Defensibility Index (DI), Ambiguity Index (AI), and Probabilistic Defensibility Signal (PDS) to assess policy-grounded correctness and reasoning stability, using LLM traces to verify logical derivability from governing rules.

27
RESEARCH · arXiv CS.CL · 5/1/2026

Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

This paper introduces an ILR-informed framework to evaluate Claude (Sonnet 4.6) for cross-lingual response consistency across six languages. It analyzes responses to semantically equivalent prompts using quantitative metrics and expert ILR qualitative assessment, revealing language-specific variations like response length differences and surface divergence in creative clusters.

27
RESEARCH · arXiv CS.AI · 4/27/2026

Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

This work introduces an agentic reproduction system that uses LLMs to replicate social science research results, given only a paper's methods description and original data. Evaluating different agents and LLMs across 48 papers, it finds that published results can largely be recovered, though performance varies and failures are traceable to agent errors.

27