evaluation

53 items

ARTICLE · ↑ trending · Hacker News (AI) · 15d ago

Show HN: Unsiloed AI – #1 on olmOCR-Bench

UnSiloed Parser v3.1 achieved the #1 rank on olmOCR-Bench, outperforming 18 other OCR services including advanced AI models. The evaluation, conducted across 1,403 PDFs and 8,413 unit tests, demonstrated its capability to handle complex real-world document challenges like intricate tables and multi-column layouts.

42
RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/16/2026

Training Qwen2.5-0.5B-Instruct on Reddit posts summarization tasks with length constraint on my 3xMac Minis with GRPO - evals update [P]

The author trained Qwen2.5-0.5B-Instruct for Reddit post summarization using two reward strategies, finding that combining quality and length penalties yielded significantly better results. Evaluation used LLM-as-a-judge scoring and DeepEval for metrics like conscientiousness and clarity.

42
RESEARCH · arXiv CS.CL · 4/6/2026

SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models

SocioEval is a template-based framework for systematically evaluating socioeconomic status bias in foundation models, including LLMs, an underexplored area. The study evaluated 13 LLMs and found substantial variation in bias rates (0.42% to 33.75%), with the bias manifesting differently across topics.

29
ARTICLE · DEV.to AI · 16d ago

Stop Engineering Prompts: How an Eval-First Harness Let Us Ship 25 Algorithm Versions Autonomously

This article details the creation of an eval-first AI harness that enabled the autonomous shipment of 25 algorithm versions in 13 days. The methodology focuses on immutable test sets and independent reviews to ensure changes do not cause regressions. The author emphasizes that the harness, rather than just prompt engineering or full automation, was key to the pace and safety of development.

28
RESEARCH · arXiv CS.CL · 4/6/2026

Pragmatics Meets Culture: Culturally-adapted Artwork Description Generation and Evaluation

This paper introduces the task of culturally adapted artwork description generation to counter cultural bias in language models during open-ended text generation. It proposes an evaluation framework based on culturally grounded question answering and shows that a pragmatic speaker model significantly improves listener comprehension.

28