benchmark

10 items

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/16/2026

Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1

A Reddit post reports the Qwen 3.6 35B A3B model running at 187 tokens per second on an RTX 5090 32GB GPU. The run used Q5 K S quantization with a 120K context size, thinking mode off, and a temperature of 0.1.

inference AI hardware benchmark performance

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/12/2026

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Speculative decoding tests using Gemma 4 E2B as the draft model for Gemma 4 31B showed an average speedup of 29%, rising to 50% on code generation. The results apply to the specific hardware and software configuration tested.

Gemma 4 31B llama.cpp benchmark AI performance

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 5/1/2026

Qwen 3.6 27B vs Gemma 4 31B - making Packman game!

A local LLM gamedev contest compared Qwen 3.6 27B and Gemma 4 31B at creating a Pac-Man game. Gemma 4 31B was the clear winner, producing stronger game logic and a higher-quality result in much less time, even though Qwen generated more tokens.

code generation model comparison benchmark LLM

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/14/2026

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch

A study benchmarked TranslateGemma-12b against five frontier LLMs on subtitle translation across six language pairs, and the task-specific model consistently outperformed the general-purpose models. The initial numbers indicated a clear win, but human QA added a significant catch, which will be detailed in the full report.

Translation Gemma benchmark AI

RESEARCH · arXiv CS.CL · 4/10/2026

Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

Although accuracy on academic speech-to-text benchmarks has plateaued, industrial applications demand better recognition of rare and contextual vocabulary. This paper introduces Contextual Earnings-22, a new dataset and benchmark intended to advance research and surface progress in contextual speech recognition with custom vocabulary.

Dataset custom vocabulary Speech-to-Text benchmark

RESEARCH · DEV.to AI · 4/17/2026

A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability

An evaluation of ChatGPT's zero-shot Text-to-SQL capability, that is, its ability to convert natural language into SQL queries without prior examples. It examines the model's performance and limitations on this complex task.

evaluation Text-to-SQL ChatGPT benchmark

RESEARCH · arXiv CS.CL · 4/17/2026

MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios

MemGround is a new rigorous long-term memory benchmark for LLMs, designed to overcome the limitations of static evaluations by using rich, gamified interactive scenarios. It features a three-tier hierarchical framework to assess different memory types and a multi-dimensional metric suite for comprehensive quantification.

evaluation gamification memory benchmark

RESEARCH · arXiv CS.CL · 4/21/2026

CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

CFMS introduces the first fine-grained Chinese multimodal sarcasm detection benchmark, comprising 2,796 image-text pairs with triple-level annotations. This dataset aims to improve AI's fine-grained semantic understanding and metaphoric reasoning, addressing limitations in existing benchmarks.

Dataset multimodal AI natural language processing benchmark

RESEARCH · arXiv CS.CL · 4/6/2026

Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

The paper discusses the limitations of current evaluations of RAG (Retrieval-Augmented Generation) systems in enterprise settings, which do not systematically diagnose the complex challenges that lie beyond final accuracy. To fill this gap, the research proposes a multi-dimensional diagnostic framework and a benchmark for enterprise RAG, based on a four-axis difficulty taxonomy.

evaluation diagnostic framework RAG benchmark

RESEARCH · arXiv CS.AI · 4/6/2026

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

ESL-Bench is a synthetic, event-driven longitudinal benchmark. It was developed for evaluating health agents, probably ones involving artificial intelligence.

synthetic data Health Agents AI in Healthcare Healthcare