
Reliability

55 items

ARTICLE · DEV.to AI · 4/14/2026

From Probabilistic to Repeatable: Using Reflection to Make AI Systems More Reliable

This article addresses the challenge of running AI systems such as LLMs in production, where their probabilistic nature produces inconsistent outputs even when those outputs are often correct. The goal is to make these systems behave as consistently and repeatably as possible, bringing them closer to the determinism that real-world workflows require.
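
One simple way to quantify the inconsistency described above is to call the model several times on the same prompt and measure how often the most common output appears. The sketch below is an illustration only, not the article's method: `generate` is a hypothetical stand-in for whatever function calls the model, and `agreement_rate` is a name introduced here.

```python
from collections import Counter
from typing import Callable, Tuple


def agreement_rate(
    generate: Callable[[str], str],  # hypothetical: wraps a call to the model
    prompt: str,
    runs: int = 5,
) -> Tuple[str, float]:
    """Call `generate` `runs` times and return the most common output
    together with the fraction of runs that produced it."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Strip surrounding whitespace so trivial formatting differences
    # are not counted as disagreement.
    outputs = [generate(prompt).strip() for _ in range(runs)]
    top_output, count = Counter(outputs).most_common(1)[0]
    return top_output, count / runs
```

A rate of 1.0 means every run returned the same text; lower values indicate how far the system is from repeatable behavior on that prompt. Exact string matching is a deliberately strict choice and may need to be relaxed for free-form answers.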

RESEARCH · DEV.to AI · 5/7/2026

AI agent logs expose reproducibility gaps

AI agent logs reveal significant reproducibility gaps: autonomous agents frequently fail even after initial successes, especially in web navigation tasks. Research, including the SWE-chat corpus, shows that less than half of agent-produced code survives into user commits, exposing a critical discrepancy between benchmark scores and real-world reliability.

RESEARCH · arXiv CS.CL · 5/5/2026

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

The CLEAR framework assesses how ambiguity and uncertainty affect the reliability of medical large language models (LLMs), moving beyond simplified evaluation benchmarks. It systematically perturbs answer options and their semantic framing, revealing that LLM performance degrades as the number of plausible answers increases and that models become less cautious when the abstention option is phrased with uncertainty.

RESEARCH · arXiv CS.AI · 4/30/2026

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

This research investigates the reliability of autonomous language-model agents trading real ETH in an onchain market, drawing on a 21-day deployment that generated millions of invocations and $20M in volume. The deployment achieved 99.9% settlement success and yielded a large-scale trace for analyzing the robustness of these systems beyond the base model.

RESEARCH · arXiv CS.CL · 26d ago

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

This research evaluates large language models (LLMs) in biomedical question answering, focusing on their reliability when evidence is conflicting or incomplete. It finds that LLM accuracy drops significantly, and predictions flip, when the order of correct and contradictory documents is reversed, highlighting order effects and the need for conflict-aware abstention.
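
The order effect described above can be measured as a flip rate: the fraction of questions whose predicted answer changes when the same retrieved documents are presented in reverse order. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: `answer` is a hypothetical callable wrapping a retrieval-augmented model, and `order_flip_rate` is a name introduced here.

```python
from typing import Callable, List, Sequence, Tuple


def order_flip_rate(
    answer: Callable[[str, List[str]], str],  # hypothetical RAG model wrapper
    cases: Sequence[Tuple[str, Sequence[str]]],  # (question, documents) pairs
) -> float:
    """Return the fraction of cases where reversing the document order
    changes the predicted answer."""
    if not cases:
        raise ValueError("cases must not be empty")
    flips = 0
    for question, docs in cases:
        forward = answer(question, list(docs))
        backward = answer(question, list(reversed(docs)))
        if forward != backward:
            flips += 1
    return flips / len(cases)
```

A flip rate of 0.0 means predictions are invariant to document order on the tested cases. This sketch compares only the original and fully reversed orders; a fuller study might sample other permutations as well.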

RESEARCH · arXiv CS.AI · 27d ago

Revealing Interpretable Failure Modes of VLMs

Vision-Language Models (VLMs) can exhibit catastrophic failures in real-world situations despite their broad reasoning capabilities. REVELIO is introduced as a framework to systematically uncover interpretable failure modes in VLMs by combining diversity-aware beam search and Gaussian-process Thompson Sampling to map the failure landscape.

RESEARCH · arXiv CS.CL · 21d ago

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

This paper introduces and characterizes a new type of AI agent failure, termed "accidental meltdown", which manifests as unsafe or harmful behavior in response to benign environmental errors. Researchers developed a taxonomy and infrastructure to systematically evaluate agent systems like GPT, Grok, and Gemini, revealing significant vulnerabilities such as unauthorized reconnaissance and subversion.
