LLMs

720 items

ARTICLEDEV.to AI·4/17/2026

I Built a 7-Agent Prompt Framework, Then Used It to Debug Its Own Output

The author developed a 7-agent prompt framework named C.E.H. that runs on local LLMs and used it to build a complex RAG system. When the generated code produced 14 failures, the author turned the C.E.H. framework on its own output to debug and fix it.

LLMs code debugging RAG multi-agent systems

RESEARCHarXiv CS.AI·4/7/2026

Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models

This work explores the potential of large language models (LLMs), such as ChatGPT, and AI agents for automating and controlling laboratory instrumentation. It shows how these tools lower programming barriers and can evolve into autonomous agents capable of operating scientific equipment and refining control strategies.

LLMs ChatGPT Instrumentation Control large language models

RESEARCHarXiv CS.CL·4/9/2026

The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?

This paper investigates the correlation between internal entropy dynamics and correct reasoning in large language models (LLMs), a puzzle that remains unexplained. It proposes the Stepwise Informativeness Assumption (SIA), which holds that models reason correctly by accumulating answer-relevant information through informative prefixes, a process reinforced by standard training methods.

information theory LLMs machine learning Reasoning

ARTICLEDEV.to AI·10d ago

Beyond Static Prompts: How to Build Self-Improving AI Agents with Closed-Loop Skill Playbooks

The content discusses the paradigm shift from static prompts to autonomous, self-improving AI agent systems. It highlights the challenges of building resilient AI agents in production environments and proposes treating 'skills' not as static code but as living, self-contained elements.

LLMs prompt-engineering Autonomous systems AI development

ARTICLEDEV.to AI·23d ago

I Built an MCP Server for My Flower Shop. Nobody Asked Me To.

This article details the humorous "over-engineering" of a 60-year-old Munich flower shop by building an MCP server leveraging large language models like Claude, Gemini, and Mistral. It describes the technical stack, custom tools for flower searches, and the author's insights into the effectiveness of LLMs for structured commerce flows.

open-source LLMs real-world application backend development

ARTICLEDEV.to AI·22d ago

The Insight-Free Property of Vendor RAGs — A Feature, Not a Bug

The author used Streamlit's official RAG-based AI assistant to review a technical draft and found its responses polite and organized but lacking genuine insight. It merely rephrased existing points and added basic code snippets, leading the author to realize this "insight-free" behavior might be an intended feature rather than a bug.

LLMs Streamlit RAG AI Assistants

ARTICLEDEV.to AI·5/9/2026

Systematic Large Model Debugging Is the Missing Product Discipline

Large model failures are design failures, not bugs, and a systematic debugging discipline is missing in AI product development. The article proposes Product Lifecycle Debugging for Models (PLDM) as a crucial approach to prevent late failures and loss of trust.

LLMs systematic approach product management Debugging

ARTICLEDEV.to AI·4/19/2026

The $6.7 Billion Blind Spot: Why AI Hallucination Is Now a C-Suite Risk Crisis

AI hallucination, where models confidently generate false information, is a multi-billion dollar risk for businesses, encompassing regulatory penalties, litigation, and reputational damage. Because LLMs predict tokens rather than reason, the article argues that hallucination is inherent to them and now warrants C-suite attention.

Regulatory Compliance LLMs AI hallucination risk management

RESEARCHarXiv CS.CL·4/20/2026

LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

This paper analyzes the interpretive behaviors of LLMs for automated code compliance using perturbation-based attribution analysis, comparing different fine-tuning strategies and model scales. Results show full fine-tuning yields more focused attribution patterns, and larger models prioritize specific textual elements like numerical constraints.

model interpretability LLMs Machine learning research fine-tuning

RESEARCHarXiv CS.AI·5/4/2026

AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

This work introduces AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, to evaluate tool-use abilities in AI models. Results indicate that small and mid-sized open-weight models are sufficient for much of the short-horizon, structured tool-use work prevalent in real agent pipelines.

Open-Weight Models LLMs benchmarking tool use

RESEARCHarXiv CS.AI·4/9/2026

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

This paper proposes a new method for hallucination detection in LLMs that distills external supervision signals directly into the model's internal representations during training. To that end, it introduces a weak-supervision framework combining substring matching, embedding similarity, and an LLM-as-judge, culminating in a 15,000-sample dataset built for this purpose.

Transformer Representations hallucination detection LLMs machine learning

RESEARCHarXiv CS.CL·4/15/2026

LLMs Struggle with Abstract Meaning Comprehension More Than Expected

This research investigates LLMs' ability to comprehend abstract meanings, revealing that models like GPT-4o struggle in zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. It proposes a bidirectional attention classifier that significantly enhances the accuracy of fine-tuned models in interpreting abstract concepts.

LLMs GPT-4o NLP abstract meaning comprehension

RESEARCHarXiv CS.CL·4/23/2026

Can We Locate and Prevent Stereotypes in LLMs?

This study investigates where stereotypes reside in LLMs such as GPT-2 Small and Llama 3.2. It examines individual neuron activations and attention heads to map "bias fingerprints" and offers initial insights for mitigation.

neural networks LLMs bias detection Bias Mitigation

RESEARCHarXiv CS.AI·5/9/2026

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity. It proposes that sycophancy is not merely agreement, but alignment behavior that displaces independent epistemic judgment, outlining a three-condition framework to define it.

LLMs AI behavior AI alignment epistemic integrity

RESEARCHarXiv CS.CL·4/23/2026

Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models

This research introduces a framework to quantify the miscalibration between rhetorical intensity and epistemic grounding in Large Language Models. Applying an epistemic-rhetorical marker taxonomy to argumentative texts, the study reveals a distinct LLM epistemic signature, showing models overuse certain rhetorical devices and perform hesitancy markers more frequently than human authors.

LLMs AI ethics AI evaluation

RESEARCHarXiv CS.CL·4/23/2026

OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

OThink-SRR1 is a framework that enhances LLMs with an iterative Search-Refine-Reason process trained via reinforcement learning. It addresses RAG's challenges by distilling relevant facts from retrieved documents, improving efficiency and accuracy in complex multi-hop QA.

multi-hop-qa LLMs reinforcement learning RAG

RESEARCHarXiv CS.AI·5/7/2026

Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

This research paper argues that the bottleneck in large language models' temporal reasoning is not logical deduction but rather unstructured text-to-event representation. It introduces a neuro-symbolic question-answering framework utilizing a Probabilistic Inconsistency Signal (PIS) to decouple semantic extraction from symbolic reasoning, improving performance.

LLMs temporal reasoning Question Answering Neuro-symbolic AI

RESEARCHarXiv CS.CL·20d ago

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

This research examines how various lower-bit quantization levels impact LLaMA-3.1's performance in qualitative analysis, noting that low-bit models often produce hallucinations. It proposes a quantization-aware multi-pass prompt verification method to enhance accuracy by systematically reducing hallucinations and filtering unreliable content.

model performance Qualitative Analysis LLMs hallucinations

ARTICLEDEV.to AI·4/18/2026

AI Social Workers Gone Wrong: Why ChatGPT Should Never Decide a Child’s Future

This article warns against deploying generative AI like ChatGPT in child welfare, arguing that its probabilistic nature and tendency to hallucinate make it unsuitable for critical decisions. It emphasizes that 'good enough' automation is unacceptable when a child's future is at stake, risking the invention of false risk indicators.

Child welfare LLMs public services AI risks

RESEARCHarXiv CS.CL·28d ago

ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

This paper introduces ClinicalBench, a 400-question benchmark designed to stress-test assertion-aware retrieval for cross-admission clinical QA on MIMIC-IV using real EHR notes. It also presents EpiKG, a patient knowledge graph system that incorporates assertion and temporality tags to route retrieval by question intent, demonstrating significant performance improvements across various LLMs.

LLMs benchmarking clinical QA medical AI