Benchmarks

67 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/7/2026

META Superintelligence Lab Presents: ProgramBench: Can SOTA AI Recreate Real Executable Programs (ffmpeg, SQLite, ripgrep) From Scratch Without The Internet?

Meta Superintelligence Lab introduces ProgramBench, an initiative that tests whether advanced AI models can recreate executable programs such as ffmpeg and SQLite from scratch, without internet access. The study probes the limits of AI code generation, focusing on how autonomously and completely models can synthesize complex software.

42
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

The author investigates why a specific Qwen3.6 27B INT8 Autoround quantization recipe outperforms others, observing the model "thinks" less but provides better outputs in benchmarks. They then replicated this performance with a new GGUF quant, noting both consistently achieve answers faster than UD Q8 K XL.

42
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost

An agentic benchmark test shows that the GLM 5.1 model achieves performance similar to Opus at one third of the cost on agentic tasks, outperforming the other models tested. The author emphasizes the importance of testing in real environments such as OpenClaw, and ranks GLM 5.1 among the top models for agents today.

41
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/17/2026

Qwen3.6 GGUF Benchmarks

This content presents KLD performance benchmarks for Unsloth's Qwen3.6-35B-A3B GGUF quants, highlighting their efficiency in terms of KLD versus disk space. It also clarifies that frequent GGUF updates are typically due to external bug fixes or official improvements, rather than Unsloth's internal errors.

41
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/21/2026

Kimi K2.6 is a legit Opus 4.7 replacement

Kimi K2.6 is recommended as a viable replacement for Opus 4.7: it handles 85% of tasks with good quality and offers vision and strong browser use, especially for long-term workflows. The author suggests this shows that frontier LLMs do not always offer groundbreaking new features, and that local solutions are becoming attractive because of usage limits.

36
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/20/2026

Kimi K2.6

A user announces that benchmarks for Kimi K2.6 have been submitted, with links to the submission and comments.

36
ARTICLE · DEV.to AI · 3d ago

<think>

This content outlines requirements for a technical article analyzing AI model performance and pricing, focusing on metrics like TTFT and tokens/sec. It specifies the inclusion of exact pricing and model data, test regions, and code examples for a global API, targeting a backend engineer audience.

30
RESEARCH · arXiv CS.LG · 4/13/2026

Robust Reasoning Benchmark

This study proposes a new perturbation pipeline to evaluate the robustness of LLM reasoning, applying it to the AIME 2024 dataset. While frontier models show resilience, open-weight models suffer catastrophic accuracy drops, exposing structural fragility and potential issues with working memory or mechanical parsing.

30
ARTICLE · DEV.to AI · 4d ago

<think>

This content is a planning draft for an article about testing multimodal AI models. The author intends to share their personal discovery, benchmarking, and pricing data for various models.

29
RESEARCH · arXiv CS.CL · 4/24/2026

AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

AITP is introduced as a multimodal large language model designed for traffic accident responsibility allocation, enhancing reasoning through Multimodal Chain-of-Thought and integrating legal knowledge via Retrieval-Augmented Generation. The research also presents DecaTARA, a comprehensive decathlon-style benchmark with 67,941 annotated videos and 195,821 question-answer pairs.

29