LLM evaluation

18 items

ARTICLE·DEV.to AI·3h ago

More eval traces will not stabilize your kappa. Stratify the ones you have

LLM-as-judge agreement (Cohen's kappa) swung from week to week even though the rubric had not changed. Increasing the sample size did not stabilize it. Stratifying the existing samples by score class and failure dimension sharply reduced the variance, showing that sample composition, not volume, was the deciding factor.

AI metrics sampling strategy Cohen's Kappa LLM evaluation
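
As a rough illustration of the idea in this item, the sketch below computes Cohen's kappa between two label lists and draws a fixed number of traces per score class. The function names, the `score` field, and the per-stratum quota are hypothetical and are not taken from the article.

```python
import random
from collections import Counter, defaultdict


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to per_stratum items from each stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for label in sorted(strata):
        group = strata[label]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

With a fixed quota per score class, rare classes appear in every weekly sample instead of drifting in and out by chance, which is one plausible reason the class mix stops moving kappa around.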

RESEARCH·DEV.to AI·9h ago

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

This content explores the critical role of pairwise preference in evaluating Large Language Models (LLMs). It discusses how this method can help align LLM performance more effectively with human judgment.

Human Alignment Pairwise Preference natural language processing AI Research

ARTICLE·↑ trending·Reddit r/MachineLearning·4/23/2026

OpenSimula — open implementation of Simula-style mechanism design for synthetic data (in AfterImage) [P]

OpenSimula is an experimental Python implementation of Simula mechanism design, added to the AfterImage open-source dataset tool. It addresses the need for controlled diversity in LLM SFT/eval setups by generating varied synthetic data through LLM-built taxonomies, weighted sampling, and critic loops.

synthetic data mechanism-design open-source-tool LLM evaluation

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/13/2026

Best Local LLMs - Apr 2026

This article discusses the best local LLMs in April 2026, highlighting new releases like Qwen3.5, Gemma4, GLM-5.1, Minimax-M2.7, and PrismML Bonsai. It invites users to share detailed experiences with open-weights models to aid in evaluation.

AI models open-source AI Local LLMs generative AI

ARTICLE·DEV.to AI·21d ago

Building an Evaluation Harness for Financial RAG: What I Learned About LLM-as-Judge Calibration

The author built a RAG system for financial Q&A using SEC filings and the FinanceBench benchmark. They uncovered a significant discrepancy between LLM-as-judge evaluations and actual performance, leading to lessons on calibrating LLMs for assessment.

Financial AI Benchmarking GPT-4o-mini RAG system

RESEARCH·arXiv CS.CL·4/7/2026

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

QIMMA is a new Arabic LLM evaluation platform that prioritizes quality by systematically validating benchmarks. It addresses quality problems in existing benchmarks through automated and human review, resulting in a reproducible, multi-task evaluation suite with more than 52,000 samples.

Arabic LLM NLP Benchmarks Quality Assurance

ARTICLE·DEV.to AI·5d ago

How do you know your AI receptionist is actually following its instructions?

This article addresses the issue of voice AI, particularly large language models, fabricating information in customer service interactions, leading to incorrect details and potential problems. It proposes "evals" as a method to proactively test and ensure AI agents adhere to their instructions, preventing customer dissatisfaction.

AI hallucinations customer service AI AI reliability LLM evaluation

ARTICLE·DEV.to AI·4/27/2026

Testing AI Systems in Production: From LLM Evals to Agent Reliability

The article criticizes current LLM testing in production, noting that 'smooth' deployments often mask subtle hallucinations leading to financial or data loss due to inadequate truth-based evaluations. It stresses the need for robust retrieval evaluation pipelines, better data, and specific strategies to test AI agents for reliability and prevent destructive failures.

AI reliability AI testing AI agents LLM evaluation

ARTICLE·DEV.to AI·4/14/2026

AI Search Showdown: Perplexity vs SearchGPT vs Claude 3.5 Sonnet (2026)

This content presents a comparative analysis of AI search tools: Perplexity AI, OpenAI SearchGPT, and Claude 3.5 Sonnet. It details a hands-on evaluation using three distinct complex prompts to assess their performance across accuracy, speed, citations, and multi-modal capabilities.

AI comparison Perplexity AI Claude 3.5 Sonnet OpenAI SearchGPT

DOC·DEV.to AI·22d ago

LLM Evaluation for Indie Hackers: Build a £0.20/Run System That Catches Real Bugs

This guide teaches indie hackers how to build a low-cost (£0.20/run) LLM evaluation system that catches real bugs before they reach production. The system uses a golden dataset, an LLM judge that scores outputs, and a CI gate that blocks merges when the evaluation fails.

indie hackers CI/CD Software Development Testing
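
A minimal sketch of the golden-dataset, judge, and CI-gate pattern this item describes. Everything here is a hypothetical stand-in: the dataset, the canned `generate` function, the 0.8 threshold, and especially `judge`, which uses a keyword check in place of a real LLM judge call.

```python
# Hypothetical golden dataset: each case pairs an input with phrases the output must contain.
GOLDEN = [
    {"input": "What is the refund window?", "must_include": ["30 days"]},
    {"input": "Which plan includes SSO?", "must_include": ["Enterprise"]},
]

# Canned answers standing in for the application under test.
CANNED = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
}


def generate(prompt):
    """Stand-in for the system under test (assumption: a real app would call a model)."""
    return CANNED.get(prompt, "")


def judge(output, case):
    """Stand-in for an LLM judge: fraction of required phrases found in the output."""
    required = case["must_include"]
    hits = sum(phrase.lower() in output.lower() for phrase in required)
    return hits / len(required)


def run_gate(cases, generate_fn, judge_fn, threshold=0.8):
    """Score every golden case and report whether the mean clears the threshold."""
    scores = [judge_fn(generate_fn(case["input"]), case) for case in cases]
    mean = sum(scores) / len(scores)
    return mean >= threshold, mean


def ci_exit_code(passed):
    """Map the gate result to a process exit code; a nonzero code fails the CI job."""
    return 0 if passed else 1
```

In a CI job, a small wrapper script could pass `ci_exit_code(passed)` to `sys.exit`, so a mean score below the threshold fails the job.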

ARTICLE·DEV.to AI·22d ago

LLM Evaluation for Indie Hackers: Stop Paying Braintrust and Build This Instead

The article presents a cost-effective, rubric-based LLM evaluation system for indie hackers, designed to run in CI and prevent issues like hallucinated data in production. It offers an alternative to expensive enterprise solutions by defining quality through concrete attributes and using golden datasets.

indie hackers CI/CD Testing cost-effective solutions

CASE·DEV.to AI·4/19/2026

A Truth Filter for AI-Generated Ideas: An Experiment with Property-Based Testing

The author used property-based testing to verify the factual claims in an AI-generated paper on building a second brain. While most claims held, one universal quantifier was falsified, highlighting the method's effectiveness in uncovering subtle structural requirements.

AI Verification AI Content Generation property-based testing LLM evaluation

RESEARCH·arXiv CS.CL·5/5/2026

Compared to What? Baselines and Metrics for Counterfactual Prompting

This work argues that observed effects from "counterfactual prompting" in LLMs cannot be attributed to a targeted factor without accounting for meaning-preserving text modifications that establish general model sensitivity. The research shows that prediction flip rates when surgically changing patient gender are statistically indistinguishable from rates induced by simply paraphrasing inputs, suggesting that special sensitivity to patient gender cannot be concluded.

counterfactual prompting model robustness AI bias natural language processing
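
The comparison this item describes can be sketched as follows: measure how often predictions flip under the targeted edit, measure the same under plain paraphrasing, and check whether the two rates differ. The two-proportion z-statistic is an assumption chosen for brevity; the summary does not say which test the paper uses, and this test treats the two sets of flips as independent samples.

```python
import math


def flip_rate(original, perturbed):
    """Fraction of inputs whose prediction changes after the perturbation."""
    flips = sum(o != p for o, p in zip(original, perturbed))
    return flips / len(original)


def two_proportion_z(flips_a, n_a, flips_b, n_b):
    """z-statistic for the difference between two flip rates (pooled variance)."""
    pooled = (flips_a + flips_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no flips at all, or every prediction flipped, in both conditions
    return (flips_a / n_a - flips_b / n_b) / se
```

A z-statistic near zero means the targeted edit flips predictions about as often as paraphrasing does, which is the situation the summary reports for patient gender.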

RESEARCH·arXiv CS.CL·4/9/2026

Hallucination as output-boundary misclassification: a composite abstention architecture for language models

This paper frames hallucination in large language models as a misclassification error and proposes a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate uses a support-deficit score computed from signals such as self-consistency and citation coverage, but the controlled evaluation showed that no single mechanism was sufficient to fully mitigate the problem.

hallucination Abstention Architectures large language models AI safety

RESEARCH·arXiv CS.CL·18d ago

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

RankJudge is introduced as a benchmark generator for evaluating LLM-as-a-judge in multi-turn conversations, addressing the complexity existing Q&A-focused benchmarks fail to capture. It creates paired conversations with single injected flaws, allowing unambiguous labeling and precise isolation for model developers relying on auto-evaluation.

Multi-turn conversations LLM-as-a-judge Benchmarking generative AI

RESEARCH·arXiv CS.CL·12d ago

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

This research introduces CARE (Community-Aware Reaction Evaluation), a framework designed to benchmark large language models' (LLMs) ability to simulate community discourse against authentic human responses to real-world news. Through human-AI collaboration, the study identifies a "realism gap," showing that explicit community prompts do not inherently enhance the fidelity of LLM simulations.

linguistic behavior AI alignment computational social science LLM evaluation

RESEARCH·arXiv CS.CL·28d ago

Sanity Checks for Long-Form Hallucination Detection

This research paper introduces a controlled-invariance methodology for hallucination detection in large language models. Using oracle tests such as Force and Remove, it investigates whether detection methods evaluate reasoning or merely surface correlates of the final answer.

hallucination detection Chain-of-Thought large language models LLM evaluation

ARTICLE·DEV.to AI·4/14/2026

I added a local eval loop to my personal AI assistant — here's what 800 scored interactions taught me

The author integrated a local evaluation loop using an Ollama model into their personal self-hosted AI assistant to score interactions based on accuracy, relevance, and appropriate confidence. After analyzing 800 interactions, they discovered that shorter, more direct answers consistently received higher scores.

AI assistant self-hosted AI Ollama DSPy