LLM evaluation

18 items

ARTICLE · trending · Reddit r/MachineLearning · 4/23/2026

OpenSimula — open implementation of Simula-style mechanism design for synthetic data (in AfterImage) [P]

OpenSimula is an experimental Python implementation of Simula-style mechanism design, added to the open-source dataset tool AfterImage. It targets the need for controlled diversity in LLM SFT and eval setups by generating varied synthetic data through LLM-built taxonomies, weighted sampling, and critic loops.

43
ARTICLE · trending · Reddit r/LocalLLaMA · 4/13/2026

Best Local LLMs - Apr 2026

This post discusses the best local LLMs as of April 2026, highlighting new releases such as Qwen3.5, Gemma4, GLM-5.1, Minimax-M2.7, and PrismML Bonsai. It invites users to share detailed experiences with open-weights models to aid evaluation.

42
ARTICLE · DEV.to AI · 4/27/2026

Testing AI Systems in Production: From LLM Evals to Agent Reliability

The article criticizes current LLM testing in production, noting that 'smooth' deployments often mask subtle hallucinations, which lead to financial or data loss when truth-based evaluations are inadequate. It stresses the need for robust retrieval evaluation pipelines, better data, and specific strategies for testing AI agents for reliability and preventing destructive failures.

27
RESEARCH · arXiv CS.CL · 5/5/2026

Compared to What? Baselines and Metrics for Counterfactual Prompting

This work argues that effects observed under "counterfactual prompting" in LLMs cannot be attributed to a targeted factor without first measuring the model's general sensitivity to meaning-preserving text modifications. The research shows that prediction flip rates from surgically changing patient gender are statistically indistinguishable from those induced by simply paraphrasing the inputs, so a special sensitivity to patient gender cannot be concluded.

27
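The baseline comparison described in the counterfactual-prompting summary above can be illustrated with a minimal sketch. The function names, labels, and toy predictions here are hypothetical and not taken from the paper, and the sketch only computes raw flip rates; the paper's actual statistical test is not reproduced.

```python
def flip_rate(original_preds, modified_preds):
    """Fraction of inputs whose prediction changes after a text modification."""
    if len(original_preds) != len(modified_preds):
        raise ValueError("prediction lists must have the same length")
    if not original_preds:
        raise ValueError("need at least one prediction")
    flips = sum(o != m for o, m in zip(original_preds, modified_preds))
    return flips / len(original_preds)


def excess_flip_rate(original_preds, counterfactual_preds, paraphrase_preds):
    """Counterfactual flip rate minus the paraphrase (baseline) flip rate."""
    return (flip_rate(original_preds, counterfactual_preds)
            - flip_rate(original_preds, paraphrase_preds))


# Toy, made-up predictions for eight inputs (assumption: binary labels).
original = ["admit", "discharge", "admit", "admit",
            "discharge", "admit", "discharge", "admit"]
gender_swapped = ["admit", "admit", "admit", "admit",
                  "discharge", "discharge", "discharge", "admit"]
paraphrased = ["discharge", "discharge", "admit", "admit",
               "discharge", "admit", "admit", "admit"]

counterfactual_rate = flip_rate(original, gender_swapped)
baseline_rate = flip_rate(original, paraphrased)
excess = excess_flip_rate(original, gender_swapped, paraphrased)
print(counterfactual_rate, baseline_rate, excess)
```

In this toy data both modifications flip the same share of predictions, so the excess is zero: the gender swap shows no effect beyond what paraphrasing alone produces. A real analysis would also need a significance test suited to paired inputs.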
RESEARCH · arXiv CS.CL · 4/9/2026

Hallucination as output-boundary misclassification: a composite abstention architecture for language models

This paper frames hallucination in large language models as a classification error and proposes a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate uses a support-deficit score built from signals such as self-consistency and citation coverage, but the controlled evaluation showed that no single mechanism was sufficient to fully mitigate the problem.

27
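One way to picture the abstention gate from the hallucination summary above is the sketch below. It assumes an equally weighted linear combination of the two signals and a fixed threshold; the weights, threshold, function names, and refusal string are all hypothetical, and the paper's actual scoring may differ.

```python
from collections import Counter


def self_consistency(samples):
    """Share of sampled answers that match the most common answer."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


def citation_coverage(claims_supported):
    """Fraction of claims backed by a citation (list of booleans)."""
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)


def support_deficit(samples, claims_supported, w_consistency=0.5, w_coverage=0.5):
    """1 minus weighted support; the weights are illustrative assumptions."""
    support = (w_consistency * self_consistency(samples)
               + w_coverage * citation_coverage(claims_supported))
    return 1.0 - support


def gate(answer, samples, claims_supported, threshold=0.5):
    """Return the answer, or abstain when the deficit exceeds the threshold."""
    if support_deficit(samples, claims_supported) > threshold:
        return "I don't know."
    return answer


# Well-supported case: 3 of 4 samples agree, 3 of 4 claims are cited.
print(gate("Paris", ["Paris", "paris", "Lyon", "Paris "], [True, True, False, True]))
# Poorly supported case: samples disagree and no claim is cited.
print(gate("a", ["a", "b", "c", "d"], [False, False]))
```

The first call passes the answer through and the second abstains. Consistent with the summary's finding, a gate like this would be one component alongside instruction-based refusal rather than a complete fix.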
RESEARCH · arXiv CS.CL · 18d ago

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

RankJudge is a benchmark generator for evaluating LLM-as-a-judge in multi-turn conversations, a setting whose complexity existing Q&A-focused benchmarks fail to capture. It creates paired conversations that differ by a single injected flaw, which allows unambiguous labeling and precise isolation of judge errors for model developers who rely on auto-evaluation.

27
RESEARCH · arXiv CS.CL · 12d ago

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

This research introduces CARE (Community-Aware Reaction Evaluation), a framework for benchmarking how well large language models (LLMs) simulate community discourse, measured against authentic human responses to real-world news. Through human-AI collaboration, the study identifies a "realism gap," showing that explicit community prompts do not inherently improve the fidelity of LLM simulations.

27