Model Evaluation

28 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

Qwen3.6 can code

A user, frustrated with OpenAI models, tried Qwen3.6-27b for Svelte 5 code generation and got a perfect result, although it took longer. They acknowledge that the evaluation was informal but anticipate interesting developments over the next 12 months.

AI models Model Evaluation code generation

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Abliterlitics: Benchmark and Tensor Analysis Comparing Qwen 3/3.5 with HauhauCS / Heretic / Huihui models

This research project compares "abliterated" models (HauhauCS, Heretic, Huihui) against Qwen 3/3.5 using a full forensic suite that includes benchmarks and safety evaluations. The goal is to verify claims that these models are "lossless uncensored", in a way that readers can replicate.

AI models LLMs Model Evaluation Benchmarking

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/26/2026

Confirmed: SWE Bench is now a benchmaxxed benchmark

The title claims that SWE Bench, a benchmark for evaluating AI on software engineering tasks, is now "benchmaxxed". The term is slang for a benchmark that models have been over-optimized against, so high scores on it may no longer reflect real-world ability.

software-engineering-ai Model Evaluation Benchmarks

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/19/2026

I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude

A user reports running Qwen3.6-35b-a3b locally on an M5 Max MacBook Pro with 8-bit quantization and 64k context, finding its performance comparable to Claude. They are highly impressed with its speed, ability to handle complex research tasks, and the privacy benefits of local execution.

LLMs privacy Model Evaluation Local AI

RESEARCH · DEV.to AI · 4/23/2026

Anthropic CVP Run 3 — Does Claude's Safety Stack Scale Down to Haiku 4.5?

Anthropic's Cyber Verification Program Run 3 tested the safety of its smallest Claude model (Haiku 4.5) against 13 agent-attack scenarios. The result was 13/13 clean, with zero exploit content executed and zero secrets leaked, confirming the safety stack's scalability to smaller models.

Model Evaluation security Anthropic AI safety

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/26/2026

Qwen3.6 35B A3B Heretic (KLD 0.0015!) Incredible model. Best 35B I have found!

This content reviews the Qwen3.6 35B A3B Heretic model, praising it as the best uncensored 35B model the user has found. It highlights its ability to fit in 24GB VRAM, handle multi-turn tool calls, and its potential to benchmark higher than the original Qwen 3.6 model.

Model Evaluation Fine-tuning LLM
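
The "KLD 0.0015" in the title above presumably refers to a Kullback-Leibler divergence between the original model's next-token distribution and the modified model's; that reading, and how the post measured it, are assumptions, since the summary does not say. A minimal sketch of the idea in plain Python, with toy logits and hypothetical function names:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(base_logits, modified_logits):
    """Average per-position KL between base and modified model outputs."""
    values = [
        kl_divergence(softmax(b), softmax(m))
        for b, m in zip(base_logits, modified_logits)
    ]
    return sum(values) / len(values)

# Toy data (invented for illustration): two token positions, vocabulary of 3.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
same = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
shifted = [[2.0, 1.1, 0.1], [0.5, 0.4, 3.0]]

kld_same = mean_kld(base, same)        # identical outputs give zero divergence
kld_shifted = mean_kld(base, shifted)  # small logit changes give a small positive value
print(kld_same, kld_shifted)
```

A value near zero means the modified model's predictions are almost indistinguishable from the original's on the measured text, which is why a low KLD is cited as evidence that the modification did little collateral damage.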

DOC · OpenAI Blog · 4/23/2026

GPT-5.5 System Card

This document, titled "GPT-5.5 System Card", likely details the technical specifications, capabilities, and limitations of the GPT-5.5 language model. It serves as a comprehensive reference for understanding the operation and usage guidelines of this advanced AI system.

Model Evaluation large language models AI safety Generative AI

RESEARCH · arXiv CS.LG · 4/13/2026

Robust Reasoning Benchmark

This study proposes a new perturbation pipeline to evaluate the robustness of LLM reasoning, applying it to the AIME 2024 dataset. While frontier models show resilience, open-weight models suffer catastrophic accuracy drops, exposing structural fragility and potential issues with working memory or mechanical parsing.

robustness LLMs Model Evaluation Reasoning

ARTICLE · AWS Machine Learning Blog · 20d ago

Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals

This article emphasizes the critical role of multimodal evaluators, such as MLLM-as-a-judge, for validating AI model responses in image-to-text tasks for visual shopping and document understanding. It explains that traditional text-only evaluators cannot adequately ensure responses are grounded in the source images.

AI models multimodal AI MLLM Model Evaluation

RESEARCH · DEV.to AI · 4/22/2026

What VAKRA Reveals About Why Agents Actually Fail

VAKRA, a new benchmark from IBM Research, reveals that AI agents fail in predictable, structural ways by mapping fracture points between reasoning, tool selection, and execution. It decomposes agent failure into six specific categories, moving beyond traditional binary task completion evaluations to uncover common weaknesses.

failure analysis Model Evaluation Benchmarking Reasoning

RESEARCH · arXiv CS.AI · 27d ago

Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

This research paper demonstrates that embedded numeric anchors on images systematically bias Vision-Language Model quality judgments across multiple VLMs. Layer-wise probing reveals that optimal layers for quality prediction are deeper than where anchor classification saturates, establishing a causal account of visual anchoring bias.

neural networks Vision-Language Models Model Evaluation representation learning

RESEARCH · arXiv CS.AI · 4/14/2026

Seven simple steps for log analysis in AI systems

This research proposes a standardized pipeline for log analysis in AI systems, addressing the current lack of a common approach. It offers a framework with concrete code examples using the Inspect Scout library, guiding researchers through steps for rigorous and reproducible analysis.

Model Evaluation Log Analysis Reproducibility AI Systems

RESEARCH · arXiv CS.CL · 5/4/2026

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment

This research explores efficient methods for evaluating Large Audio Models (LAMs) using minimal data subsets, achieving high correlation with full benchmarks. It also shows that regression models trained on these subsets can better predict human preferences for user satisfaction than full benchmarks.

Model Evaluation efficiency Benchmarks Large Audio Models

RESEARCH · arXiv CS.CL · 5/7/2026

Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

This study investigates hallucinations in Large Language Models (ChatGPT, Grok, Gemini, Copilot) when generating academic content, using 80 prompts across four categories. A novel weighted metric, the Hallucination Index (HI), was introduced to measure factual accuracy and reference validity.

academic writing AI quality Model Evaluation hallucinations

ARTICLE · DEV.to AI · 4/21/2026

A boy and his dog.

The author describes training "Scout," a 50M-parameter language model, on TinyStories, emphasizing data quality and using prompt probes and Claude Code for evaluation. They detail the model's progress, noting that at 12,800 steps it can recall subjects but still struggles with context and exhibits repetition.

prompt engineering Model Evaluation LLM training Data Quality

RESEARCH · arXiv CS.CL · 4/6/2026

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

This paper shows that bias in large language models (LLMs) is task-dependent: models mitigate stereotypes in explicit evaluations but reproduce them in implicit tasks. The authors introduce a hierarchical taxonomy and seven evaluation tasks to audit nine types of bias, highlighting the limits of safety alignment.

linguistic bias stereotyping LLM bias task-dependent bias

RESEARCH · arXiv CS.AI · 6d ago

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

This paper evaluates "harmful overthinking" in Large Reasoning Models, where continued reasoning after a correct answer can destabilize a correct trajectory. It introduces a protocol to distinguish verbose from harmful overthinking, finding issues in multimodal benchmarks.

multimodal AI Overthinking Model Evaluation AI Reasoning

DOC · DEV.to AI · 5/10/2026

65. ROC Curves and AUC: Comparing Models Fairly

This content explains how to use ROC curves and AUC to fairly compare classification models by assessing performance across all possible thresholds. It details what they are, how to interpret them, and when to use them instead of other metrics, including common misconceptions.

Classification Model Evaluation machine learning ROC curve
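
As a rough illustration of the threshold sweep described in the item above, the following sketch builds ROC points and a trapezoidal AUC in plain Python. The labels, scores, and function names are invented for this example and are not taken from the linked article.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping a threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Sort by score, highest first; lowering the threshold admits one score group at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score so ties form a single ROC step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy data: the same six labels scored by two hypothetical models.
labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
model_b = [0.6, 0.4, 0.7, 0.5, 0.8, 0.3]

auc_a = auc(roc_points(labels, model_a))
auc_b = auc(roc_points(labels, model_b))
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

Because AUC summarizes ranking quality over all thresholds, the two models can be compared without first committing to a single cutoff. In practice a library routine such as scikit-learn's `roc_auc_score` would usually replace the hand-rolled version.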

RESEARCH · arXiv CS.CL · 4/27/2026

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

This paper investigates whether outcome rewards in reinforcement learning for chain-of-thought reasoning guarantee verifiable or causally important reasoning in LLMs. Introducing Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR) metrics, the authors find that reinforcement learning with verifiable rewards (RLVR) improves accuracy but does not reliably improve CIR or SR, and that a small amount of supervised fine-tuning (SFT) can remedy this.

reinforcement learning AI training Large Language Models (LLMs) Model Evaluation

RESEARCH · arXiv CS.CL · 4/30/2026

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

This research explores the use of lightweight Large Language Models (LLMs) for Biomedical Named Entity Recognition, demonstrating their competitive performance against larger models. The study highlights their potential as resource-efficient alternatives and identifies specific output formats that consistently improve performance.

LLMs named entity recognition Model Evaluation NLP