AI performance

27 items

ARTICLE · ↑ trending · Hacker News (AI) · 5d ago

Google employees internally share memes about how its AI sucks

Google employees are internally sharing memes that mock the poor quality of the company's AI. The memes reflect frustration and skepticism toward Google's internally developed AI products.

Internal culture Google AI Employee sentiment memes

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post

The post reports an experiment showing significant speed gains (up to 68.35 tokens/s) from speculative decoding with the Qwen-3.6-27B model in llama.cpp. The author also shows the model generating and debugging code efficiently.

Benchmarking AI performance Speculative Decoding LLM
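
For readers unfamiliar with the technique benchmarked in the item above, here is a minimal sketch of one greedy speculative-decoding step. It is a toy under stated assumptions, not llama.cpp's implementation: `draft` and `target` are hypothetical stand-in functions that map a token prefix to the next token, and verification is written as a loop where a real engine would use a single batched forward pass.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token prefix to the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One greedy speculative step: the draft proposes k tokens, the target verifies."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


# Toy models (assumptions for illustration only).
def target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def draft(prefix: List[int]) -> int:
    # Agrees with the target except after token 3, where it guesses wrong.
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


step_tokens = speculative_step([0], draft, target, k=4)
```

In this greedy variant the accepted tokens are always the ones the target alone would have produced, so any speedup comes from how many draft tokens are accepted per target pass, which is why reported gains vary by workload.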


ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

This article details a novel dynamic expert caching strategy implemented in llama.cpp to accelerate token generation for large MoE models like Qwen3.5-122B-A10B. The approach loads frequently routed experts into VRAM, leading to up to 26.8% faster token generation compared to layer-based partial offload.

Token Generation llama.cpp VRAM Optimization MoE
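
The idea summarized in the item above, keeping the most frequently routed experts in VRAM, can be illustrated with a small frequency-based cache. This is a hedged sketch only: `HotExpertCache`, its methods, and the rebalancing policy are hypothetical names chosen for illustration, and the actual llama.cpp change may track and evict experts quite differently.

```python
from collections import Counter
from typing import List


class HotExpertCache:
    """Toy frequency-based expert cache (hypothetical, not llama.cpp code)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity   # number of experts that fit in "VRAM"
        self.hits = Counter()      # routing count per expert id
        self.resident = set()      # expert ids currently held in "VRAM"

    def route(self, expert_ids: List[int]) -> int:
        """Record one token's routed experts; return how many were already resident."""
        fast = sum(1 for e in expert_ids if e in self.resident)
        self.hits.update(expert_ids)
        return fast

    def rebalance(self) -> None:
        """Keep the `capacity` most frequently routed experts resident."""
        self.resident = {e for e, _ in self.hits.most_common(self.capacity)}


# Usage: four tokens, each routed to two experts, with room for two experts.
cache = HotExpertCache(capacity=2)
for ids in [[0, 1], [0, 2], [0, 1], [3, 1]]:
    cache.route(ids)
cache.rebalance()
resident_hits = cache.route([0, 3])  # expert 0 is resident, expert 3 is not
```

The contrast with layer-based partial offload is that residency here follows observed routing frequency rather than a fixed split by layer.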

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/12/2026

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Tests using Gemma 4 E2B as the draft model for Gemma 4 31B showed a clear speedup from speculative decoding: 29% on average and 50% on code generation, measured on the author's specific hardware and software configuration.

Gemma 4 31B llama.cpp benchmark AI performance

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 19d ago

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

The author reached 110 tok/s on 12GB of VRAM running the Qwen3.6 35B A3B model with ik_llama.cpp. That is faster than regular llama.cpp after its MTP PR was merged.

GPU VRAM LLM optimization llama.cpp Benchmarking

CASE · ↑ trending · Reddit r/LocalLLaMA · 4/18/2026

qwen3.6 performance jump is real, just make sure you have it properly configured

A user reports that Qwen 3.6 is a significant step up: it can handle workloads typically given to Opus and Codex, though it is not yet at their level. The user highlights its usefulness and speed when properly configured with `preserve_thinking` on an M5 Max with specific settings.

LLMs AI hardware local inference AI performance


NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max

The new DFlash support in oMLX 0.3.5 RC1 has reportedly more than doubled the generation speed of the Qwen3.5 27B (BF16) model on a Mac M5 Max, from 9 to 22 T/S (about 2.4x). This could make it considerably more practical to run the model locally at higher-precision quantizations or full weights.

oMLX DFlash Qwen3.5 AI performance


ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/19/2026

Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar?

A user is attempting real coding tasks with Qwen3.6-35B on a 32GB M2 MacBook Pro and is running into memory exhaustion and context window management issues. Although the model identifies the essence of a bug, it struggles to implement the fix because critical information is lost during context compaction.

LLMs open-source AI local inference code generation

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/21/2026

An actual example of "If you dont run it, you dont own it" and Gemma 4 beats both Chat GPT and Gemini Chat

The author shares their experience using various AI models (GPT OSS 120B, Qwen 3 Max, ChatGPT 4o) to translate a Chinese novel, highlighting challenges with name consistency and unexpected censorship. ChatGPT 4o was initially the best for accuracy and translation quality, but some models showed degradation or filtering over time.

Translation censorship model comparison AI performance

ARTICLE · DEV.to AI · 3d ago

<think>

This content outlines requirements for a technical article analyzing AI model performance and pricing, focusing on metrics like TTFT and tokens/sec. It specifies the inclusion of exact pricing and model data, test regions, and code examples for a global API, targeting a backend engineer audience.

AI pricing API Benchmarks AI performance

ARTICLE · DEV.to AI · 5d ago

Context Window Management: Tactics That Survive Real Sessions

Large language models often have a significantly smaller practical context window than their advertised nominal limit due to overhead and attention degradation. This discrepancy affects prompt design and leads to quality drops and truncation long before the hard token limit is reached.

prompt engineering Technical limitations AI performance large language models

RESEARCH · DEV.to AI · 5/10/2026

Diffusion models approach AR quality and improve inference speed

Diffusion language models are now achieving significant throughput gains and narrowing the gap with autoregressive decoders in inference speed. New Introspective Diffusion Language Models (I-DLM) address prior issues of introspective consistency and inefficient sampling loops, improving both quality and latency.

inference speed Diffusion Models language models machine learning

RESEARCH · arXiv CS.AI · 5/4/2026

Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents

This research challenges the assumption that tool-augmented reasoning always improves LLM performance, showing that it can underperform native CoT due to a "tool-use tax" from the tool-calling protocol, especially with semantic noise. A Factorized Intervention Framework is proposed to analyze this, and G-STEP is introduced as a partial mitigation for protocol-induced errors.

LLM Agents Reasoning AI performance tool use

ARTICLE · DEV.to AI · 19d ago

Gemini 3.5 Flash & Google Antigravity 2.0: A Real-World Performance Analysis

Google's Gemini 3.5 Flash challenges the assumption that smarter AI models must be slower, powering Antigravity 2.0 for AI agents. It significantly outperforms competing models in real-world performance benchmarks, demonstrating superior speed.

AI models Antigravity 2.0 Google I/O Gemini 3.5 Flash

ARTICLE · DEV.to AI · 13d ago

Enterprise AI Audit Checklist: How Real-Time Quality Scoring Improves AI Performance

As enterprise AI adoption grows, continuous monitoring of system performance becomes crucial. An "Enterprise AI Audit Checklist" and real-time quality scoring are essential to ensure accuracy and prevent model degradation post-deployment.

AI Monitoring AI audit Quality Scoring AI performance

ARTICLE · Two Minute Papers (YouTube) · 6d ago

Claude Opus 4.8: Lying Machine No More?

This video discusses Claude Opus 4.8 and asks whether the model has improved enough to avoid giving misleading information. It analyzes the model's reliability and accuracy.

AI models LLMs AI reliability AI performance

ARTICLE · DEV.to AI · 13d ago

AI Agents Fail 70%. The Replacement Story Is A Lie.

Recent independent studies debunk the myth of AI agents replacing jobs soon, revealing that even the best agents complete only about 30% of office tasks autonomously. Research from Carnegie Mellon, Huawei, and Salesforce indicates high failure rates, often involving data fabrication or inability to handle complex multi-turn tasks safely and effectively.

future-of-work task automation Benchmarking AI performance

RESEARCH · DEV.to AI · 5/8/2026

Micro LM delivers large‑model quality on device

A new study introduces Micro Language Models (μLMs), ultra-compact models (8M–30M parameters) that can deliver large-model quality on devices. This approach solves the dilemma between responsive first words and complete answers in edge assistants by seeding answers locally, reducing latency caused by cloud round-trips.

language models micro LMs Edge AI on-device AI

RESEARCH · arXiv CS.AI · 4/25/2026

Deep FinResearch Bench: Evaluating AI's Ability to Conduct Professional Financial Investment Research

Deep FinResearch Bench introduces a comprehensive evaluation framework for deep research agents in financial investment research. It finds that AI-generated reports still fall short compared to professional financial analysts, highlighting the need for domain-specialized AI.

Financial AI Benchmarking AI performance AI evaluation

ARTICLE · DEV.to AI · 4/14/2026

MiniMax M2 on OpenClaw: Setup, Pricing, and Performance...

The article describes MiniMax's M2 family of large language models, utilizing a Mixture of Experts architecture for high performance at low inference cost. The M2.7 model achieves 90% of frontier model quality at 7% of the cost, with benchmark results comparable to Claude Sonnet 4.

OpenClaw AI performance Mixture of Experts MiniMax M2