Model Evaluation

28 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

Qwen3.6 can code

A user frustrated with OpenAI models tried Qwen3.6-27b for Svelte 5 code generation and got a perfect result, although it took longer. The evaluation was informal, but the user anticipates interesting developments in the next 12 months.

52
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

Abliterlitics: Benchmark and Tensor Analysis Comparing Qwen 3/3.5 with HauhauCS / Heretic / Huihui models

A comparative research project analyzes "abliterated" models (HauhauCS, Heretic, Huihui) against Qwen 3/3.5 using a full forensic suite that includes benchmarks and safety evaluations. The goal is to test claims that these models are "lossless uncensored," in a way the reader can replicate.

42
RESEARCH · arXiv CS.LG · 4/13/2026

Robust Reasoning Benchmark

This study proposes a new perturbation pipeline to evaluate the robustness of LLM reasoning, applying it to the AIME 2024 dataset. While frontier models show resilience, open-weight models suffer catastrophic accuracy drops, exposing structural fragility and potential issues with working memory or mechanical parsing.

30
RESEARCH · arXiv CS.AI · 27d ago

Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

This research paper demonstrates that embedded numeric anchors on images systematically bias Vision-Language Model quality judgments across multiple VLMs. Layer-wise probing reveals that optimal layers for quality prediction are deeper than where anchor classification saturates, establishing a causal account of visual anchoring bias.

29
ARTICLE · DEV.to AI · 4/21/2026

A boy and his dog.

The author describes training "Scout," a 50M-parameter language model, on TinyStories, emphasizing data quality and using prompt probes and Claude Code for evaluation. They detail the model's progress, noting that at 12,800 steps it can recall subjects but struggles with context and exhibits repetition.

27
RESEARCH · arXiv CS.CL · 4/6/2026

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

This paper shows that bias in large language models (LLMs) is task-dependent: models mitigate stereotypes in explicit evaluations but reproduce them in implicit tasks. The authors introduce a hierarchical taxonomy and seven evaluation tasks to audit nine types of bias, highlighting the limits of safety alignment.

27
RESEARCH · arXiv CS.CL · 4/27/2026

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

This paper investigates whether outcome rewards in reinforcement learning for chain-of-thought reasoning guarantee verifiable or causally important reasoning in LLMs. Introducing Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR) metrics, the authors find that while RLVR improves accuracy, it does not reliably enhance CIR or SR, and a small amount of SFT can remedy these issues.

27
RESEARCH · arXiv CS.CL · 4/30/2026

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

This research explores lightweight Large Language Models (LLMs) for Biomedical Named Entity Recognition and finds their performance competitive with that of larger models. The study highlights their potential as resource-efficient alternatives and identifies specific output formats that consistently improve performance.

27