LLM Agents

35 items

RESEARCH·arXiv CS.CL·4/20/2026

PolicyBank: Evolving Policy Understanding for LLM Agents

PolicyBank proposes a memory mechanism that lets LLM agents iteratively refine their understanding of organizational policies, using feedback to resolve ambiguities and gaps. Unlike existing systems, which treat policies as immutable ground truth, it allows agents to evolve their interpretation over time. The paper also introduces a systematic testbed for alignment failures.

LLM Agents machine learning human-AI interaction policy compliance

ARTICLE·DEV.to AI·4/19/2026

How to Safely Execute LLM Commands in Production Systems

This article discusses the critical risks of LLM agents triggering backend actions in production systems, emphasizing that treating raw model output as executable instructions is dangerous. It frames the challenge as an interface problem, advocating for deterministic boundaries to validate, reject, and audit LLM-generated commands for safety.

LLM Agents production systems AI safety AI security
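The "deterministic boundary" idea in the summary above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the action names, the `ALLOWED_ACTIONS` schema, the JSON command shape, and the in-memory `AUDIT_LOG` are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist: action name -> required argument names and types.
ALLOWED_ACTIONS = {
    "restart_service": {"service": str},
    "scale_deployment": {"deployment": str, "replicas": int},
}

# Hypothetical audit sink; a real system would write to durable storage.
AUDIT_LOG = []


@dataclass
class Decision:
    accepted: bool
    reason: str
    command: Optional[dict] = None


def _check(raw_output: str) -> Decision:
    """Parse and validate model output; never execute anything here."""
    try:
        cmd = json.loads(raw_output)
    except json.JSONDecodeError:
        return Decision(False, "not valid JSON")
    if not isinstance(cmd, dict):
        return Decision(False, "not a JSON object")

    action = cmd.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return Decision(False, "action not on allowlist")

    schema = ALLOWED_ACTIONS[action]
    args = cmd.get("args")
    if not isinstance(args, dict) or set(args) != set(schema):
        return Decision(False, "missing or unexpected arguments")

    for name, expected in schema.items():
        # Strict type check so that, e.g., True is not accepted as an int.
        if type(args[name]) is not expected:
            return Decision(False, f"argument {name!r} has wrong type")

    return Decision(True, "ok", {"action": action, "args": args})


def validate_command(raw_output: str) -> Decision:
    """Validate a raw model output and record the decision for auditing."""
    decision = _check(raw_output)
    AUDIT_LOG.append(
        {"raw": raw_output, "accepted": decision.accepted, "reason": decision.reason}
    )
    return decision
```

Only a `Decision` with `accepted=True` would be handed to the code that actually performs the backend action; everything else is rejected and still recorded in the audit log.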

ARTICLE·Analytics Vidhya·6d ago

Agent Observability with LangSmith, Langfuse, and Arize: A Hands-On Comparison

This article examines observability for LLM-based agents, where problems such as infinite loops or poor retrieval often surface only after deployment. It introduces and compares LangSmith, Langfuse, and Arize, tools designed to address these challenges.

LLM Agents AI Observability Arize Langfuse

ARTICLE·DEV.to AI·4/15/2026

OpenAI's Promptfoo deal puts evaluation and red-teaming at the centre of the agent stack

OpenAI's acquisition of Promptfoo signals a shift in how AI agent quality is judged: beyond fluency, toward testing, documenting, and governing failures before deployment. The focus is on operational risks in production systems, such as prompt injection and tool misuse.

red-teaming LLM Agents evaluation prompt injection

RESEARCH·arXiv CS.AI·27d ago

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

OLIVIA is a novel inference-time action adaptation framework designed for ReAct-style LLM agents to enhance decision-making in sequential tasks. It offers explicit decision layering for scoring candidate actions and online adaptation, addressing the limitations of indirect context manipulation in current methods.

AI models Decision Making LLM Agents ReAct

ARTICLE·DEV.to AI·18d ago

AI-Enabled Cyber Attacks Hit 600+ Firewalls: The 9 Autonomous Breaches That Redefined Security in 2026

In Q1 2026, autonomous LLM-driven agents executed nine coordinated cyber attacks, breaching over 600 enterprise firewalls at machine speed. These advanced systems discovered zero-days and exploited MLOps backplanes, turning everyday AI into a significant security threat.

firewall breaches LLM Agents cybersecurity security

ARTICLE·DEV.to AI·5/10/2026

Biological AI: Building a Tool-Calling Cellular Simulation

This article explores building a real-time cellular simulation inspired by biology's decentralized intelligence, using modern LLM agent patterns. It details the system's architecture, including an AI orchestrator, a simulation engine, and an event bus.

AI orchestration LLM Agents biological-ai learning

RESEARCH·arXiv CS.AI·5/4/2026

Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents

This research challenges the assumption that tool-augmented reasoning always improves LLM performance, showing that it can underperform native CoT due to a "tool-use tax" from the tool-calling protocol, especially with semantic noise. A Factorized Intervention Framework is proposed to analyze this, and G-STEP is introduced as a partial mitigation for protocol-induced errors.

LLM Agents Reasoning AI performance tool use

RESEARCH·arXiv CS.AI·4/23/2026

From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

This paper presents a conformal interpretability framework for LLM agents to understand temporal concept evolution. It uses step-wise reward modeling and conformal prediction to statistically label internal representations and identify latent directions linked to success, failure, or reasoning drift.

LLM Agents AI interpretability Conformal Prediction

RESEARCH·arXiv CS.AI·27d ago

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

PIVOT (Plan-Inspect-eVOlve Trajectories) addresses plan-execution misalignment in LLM agents using a self-supervised framework. It iteratively refines trajectories through environment interaction, demonstrating state-of-the-art performance in empirical evaluations.

LLM Agents self-supervised learning Trajectory optimization machine learning

ARTICLE·DEV.to AI·4/25/2026

Why LLM Agents Fail: Four Mechanisms of Cognitive Decay and the Reasoning Harness Layer

The article argues that LLM agents fail in four predictable ways: attention decay, reasoning decay, sycophantic collapse, and hallucination drift. It attributes these failures to how transformers compute, claims that current approaches cannot fix them, and proposes an external "reasoning harness" layer as the solution.

AI architecture LLM Agents AI failure modes

ARTICLE·DEV.to AI·7d ago

Bot-to-Bot Routing in 2026: Stop Parsing @-mentions From Message Text

This article discusses the challenge of bot-to-bot message routing in multi-agent platforms, critiquing the practice of parsing @-mentions from message text for dispatch. It proposes a structured-envelope alternative, drawing from experience with LLM-driven agents.

LLM Agents Software Architecture bot communication multi-agent systems
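One way to picture the structured-envelope alternative described above is the sketch below. It is an illustration under stated assumptions, not the article's design: the `Envelope` fields, the `Router` class, and the rule that a sender never receives its own message are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical message envelope: routing data lives outside the text."""
    sender: str
    recipients: tuple  # explicit agent ids, set by the sender, not parsed
    body: str          # free text; never inspected for routing


class Router:
    """Dispatches envelopes using only the structured recipients field."""

    def __init__(self):
        self._handlers = {}

    def register(self, agent_id, handler):
        self._handlers[agent_id] = handler

    def dispatch(self, envelope):
        """Deliver to each known recipient and return the ids delivered to."""
        delivered = []
        for agent_id in envelope.recipients:
            handler = self._handlers.get(agent_id)
            # Assumption: unknown recipients and self-delivery are skipped.
            if handler is None or agent_id == envelope.sender:
                continue
            handler(envelope)
            delivered.append(agent_id)
        return delivered
```

Because dispatch reads only `recipients`, an "@planner" string that an LLM happens to write into `body` cannot trigger delivery to the planner agent.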

ARTICLE·DEV.to AI·26d ago

Why Your LLM Agent Needs Contracts, Not Just Logs

This article discusses the ineffectiveness of assertions in debugging LLM agent failures and proposes using "contracts" to prevent errors proactively. Contracts define explicit conditions, which aims to make AI agent development more robust and to detect issues before execution.

LLM Agents agent robustness software contracts Debugging
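The contract idea summarized above can be illustrated with a small precondition/postcondition decorator. This is a generic design-by-contract sketch, not the article's library: `contract`, `ContractViolation`, and the `issue_refund` tool are hypothetical names introduced for the example.

```python
import functools


class ContractViolation(Exception):
    """Raised when a tool call breaks its declared contract."""


def contract(pre=None, post=None):
    """Wrap a tool so its conditions are checked around every call.

    pre receives the call arguments; post receives the return value.
    """
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # The precondition is checked before the tool body runs,
            # so an invalid call is stopped before any side effect.
            if pre is not None and not pre(*args, **kwargs):
                raise ContractViolation(f"precondition failed for {fn.__name__}")
            result = fn(*args, **kwargs)
            if post is not None and not post(result):
                raise ContractViolation(f"postcondition failed for {fn.__name__}")
            return result
        return wrapper
    return decorate


# Hypothetical tool an agent might call.
@contract(
    pre=lambda amount, currency: amount > 0 and currency in {"USD", "EUR"},
    post=lambda result: result["status"] == "queued",
)
def issue_refund(amount, currency):
    return {"status": "queued", "amount": amount, "currency": currency}
```

Unlike a log entry written after the fact, a failed precondition here raises before `issue_refund` does anything.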

ARTICLE·DEV.to AI·28d ago

CrewAI vs LangGraph in 2026: Choosing the Right LLM Agent Framework

This article compares CrewAI and LangGraph, two popular LLM agent frameworks, highlighting their distinct approaches. CrewAI focuses on collaborative, role-based agents, while LangGraph emphasizes explicit state transitions and production-grade orchestration.

AI orchestration CrewAI LangGraph LLM Agents

RESEARCH·DEV.to AI·29d ago

AI/ML Research Digest — May 09, 2026

This AI/ML research digest covers advancements in latent diffusion models for multimodal generation, focusing on efficiency and extending capabilities from images to video. It also highlights innovations in modular expert routing for neural networks and adaptive compute methods to optimize sequential decision-making processes.

Diffusion Models multimodal AI LLM Agents machine learning

ARTICLE·DEV.to AI·29d ago

Heym just crossed 200 GitHub stars: self-hosted AI workflow automation with agents, RAG, MCP, and observability

The self-hosted AI workflow automation platform Heym has crossed 200 GitHub stars. It offers a visual canvas for building production AI workflows with LLM nodes, agents, RAG, and observability.

self-hosted AI LLM Agents workflow automation AI automation

RESEARCH·arXiv CS.AI·4/15/2026

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

This research addresses the breakdown of LLM agents in long-horizon tasks, which require extended, interdependent action sequences. It introduces HORIZON, a cross-domain diagnostic benchmark for systematically constructing tasks and analyzing failure behaviors. The authors evaluate state-of-the-art agents on it and propose an LLM-as-a-Judge pipeline for scalable failure attribution.

Agentic Systems Long-horizon tasks LLM Agents failure diagnosis

RESEARCH·arXiv CS.AI·4/13/2026

From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

LOM-action introduces an event-driven ontology simulation for enterprise AI to address the architectural failure of LLM-based agents producing ungrounded decisions. It uses business events to trigger graph mutations, evolving a simulation graph from which all auditable decisions are exclusively derived.

Auditable Decisions LLM Agents Enterprise AI Graph Simulation

RESEARCH·arXiv CS.AI·4/27/2026

Sound Agentic Science Requires Adversarial Experiments

LLM-based agents are increasingly used in scientific data analysis, but they risk generating plausible analyses optimized for publishable positive results. This paper proposes that non-experimental claims produced with agentic assistance be evaluated under a falsification framework to ensure scientific rigor.

falsification LLM Agents scientific methodology AI in science

RESEARCH·arXiv CS.AI·5/9/2026

From History to State: Constant-Context Skill Learning for LLM Agents

This paper proposes constant-context skill learning, a novel framework for LLM agents to manage recurring workflows more efficiently. It addresses privacy, cost, and capability challenges by learning reusable procedures in task-family modules and conditioning inference on a compact state block. Its effectiveness is demonstrated across benchmarks like ALFWorld, WebShop, and SciWorld.

LLM Agents reinforcement learning Skill Learning AI Research