LLMs

722 items

ARTICLEDEV.to AI·4/18/2026

AI Social Workers Gone Wrong: Why ChatGPT Should Never Decide a Child’s Future

This article warns against deploying generative AI such as ChatGPT in child welfare, arguing that its probabilistic nature and tendency to hallucinate make it unsuitable for critical decisions. It emphasizes that 'good enough' automation is unacceptable when a child's future is at stake, because a model can invent false risk indicators.

Child welfare LLMs public services AI risks

RESEARCHarXiv CS.CL·28d ago

ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

This paper introduces ClinicalBench, a 400-question benchmark designed to stress-test assertion-aware retrieval for cross-admission clinical QA on MIMIC-IV using real EHR notes. It also presents EpiKG, a patient knowledge graph system that incorporates assertion and temporality tags to route retrieval by question intent, demonstrating significant performance improvements across various LLMs.

LLMs benchmarking clinical QA medical AI

RESEARCHarXiv CS.CL·28d ago

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

ReAD proposes a Reinforcement-guided Capability Distillation framework for Large Language Models, aiming to compress LLMs while preserving essential abilities for downstream tasks. It explicitly accounts for the interdependence of capabilities, optimizing token budget usage and mitigating degradation of useful abilities.

Model Compression Knowledge Distillation LLMs reinforcement learning

ARTICLEDEV.to AI·5/5/2026

Tool-use API design for LLMs: 5 patterns that prevent agent loops and silent failures

This article discusses how LLM agents can incur significant costs due to recursion loops and silent failures stemming from inadequate tool-use API design. It presents five patterns aimed at preventing these issues in production LLM systems, emphasizing tool design over prompting.

LLMs Agent Loops software engineering API design

RESEARCHarXiv CS.CL·7d ago

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic inspection of the FOLIO and MALLS validation splits revealed high rates of incorrect FOL formalizations and ambiguous NL sentences, distorting AI model evaluation. The authors developed and released corrected ground truths for these datasets, demonstrating how annotation errors impact the evaluation of state-of-the-art LLMs.

LLMs Neurosymbolic AI Natural Language Processing benchmarks

RESEARCHarXiv CS.AI·7d ago

Visual Graph Scaffolds for Structural Reasoning in Large Language Models

This research explores using visual graph scaffolds to organize reasoning in Large Language Models (LLMs), inspired by human mind maps. Experiments on multi-hop question answering reveal that visual graph guidance significantly improves reasoning efficiency and answer quality compared to flattened text representations.

LLMs Graph Structures Reasoning artificial intelligence

RESEARCHarXiv CS.CL·7d ago

Greener Than Humans? Environmental Attitudes in Large Language Models

This paper develops a benchmark to evaluate environmental attitudes in Large Language Models (LLMs), comparing their responses to human survey benchmarks. The research finds that many LLMs align more closely with environmentally progressive attitudes than the average human respondent.

LLMs benchmarking sustainability environmental attitudes

RESEARCHDEV.to AI·5/7/2026

The 55.6% problem: why frontier LLMs fail at embedded code

Frontier LLMs demonstrate surprisingly low performance (around 50-55%) on embedded code tasks, according to the new EmbedBench benchmark. This highlights a significant gap compared to their performance in other development areas, although the benchmark tests only a few hardware platforms.

LLMs AI limitations firmware benchmarking

ARTICLEDEV.to AI·11d ago

The NSA Said MCP Is a National Security Problem. Here's How to Actually Fix It.

The NSA has identified Model Context Protocol (MCP) as a national security concern due to its tool-calling architecture creating exploitable attack surfaces in AI-driven automation. This article focuses on how to operationalize the NSA's guidance to fix these security vulnerabilities.

LLMs cybersecurity security AI safety

RESEARCHDEV.to AI·13d ago

I gave ADHD to Claude.. its thinking 2x better now

The author proposes a new AI thinking pattern, "ADHD - Parallel Divergent Ideation for Coding Agents," inspired by divergent thinking. It suggests replacing the linear "Chain-of-thoughts" with a "Tree-of-thoughts" to enable AI models to connect disparate ideas and think more creatively.

LLMs cognitive AI Divergent thinking AI

ARTICLEDEV.to AI·5d ago

Context Engineering: The Skill Replacing Prompt Engineering in 2026

Context engineering is the discipline of systematically designing the information environment surrounding a prompt in LLM systems. This skill, presented as replacing prompt engineering in 2026, focuses on what the model needs to know to perform well, rather than just what it should do.

LLMs prompt-engineering Context Engineering learning

DOCDEV.to AI·4/22/2026

RAG Systems in Production: Building Enterprise Knowledge Search

Retrieval-Augmented Generation (RAG) systems are presented as a revolutionary approach for enterprises to build intelligent knowledge systems by combining LLMs with domain-specific knowledge. This guide, based on Groovy Web's experience with Fortune 500 companies, covers the comprehensive process of building and deploying production-ready RAG systems, from architecture to monitoring.

LLMs RAG knowledge management Enterprise AI

RESEARCHarXiv CS.AI·4/13/2026

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Sequence-Level PPO (SPPO) addresses the limitations of standard token-level PPO in long-horizon LLM reasoning tasks by reformulating the process as a Sequence-Level Contextual Bandit problem. This approach uses a decoupled scalar value function to derive low-variance advantage signals, offering improved sample efficiency and stability without the high computational overhead of critic-free alternatives.

LLMs reasoning tasks reinforcement learning PPO
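
The sequence-level bandit view described in the summary can be sketched as follows. This is an illustrative reading of the summary, not the paper's exact formulation: the symbols $x$ (prompt, the context), $y$ (the full response, the single action), $R$ (scalar reward), and $V_\phi$ (a value function that sees only the prompt) are assumptions introduced here.

```latex
% Assumed notation: x = prompt, y = complete response, R = scalar reward,
% V_\phi = scalar value estimate conditioned on the prompt only.
A(x, y) = R(x, y) - V_\phi(x)
```

Under this reading, one advantage value is computed per response rather than per token, and $V_\phi(x)$ acts as a baseline that lowers the variance of the signal.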

RESEARCHarXiv CS.CL·4/10/2026

Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

This paper proposes a reasoning-based refinement framework that uses LLMs as semantic judges to validate and restructure the outputs of unsupervised text clustering algorithms. The framework includes coherence verification, redundancy adjudication, and label grounding, aiming to improve cluster quality without labeled data.

LLMs Text Clustering Reasoning semantic analysis

ARTICLEDEV.to AI·4/11/2026

The Future of AI Integration: Model Context Protocol (MCP) Connectors

Anthropic's Model Context Protocol (MCP) is a new open standard that solves the "N×M" data integration problem for LLMs. It standardizes the interaction between AI applications and external services, offering a transformative solution for autonomous agent ecosystems.

AI integration LLMs MCP Connectors Model Context Protocol

ARTICLEDEV.to AI·20d ago

One Tool That Cuts Token Costs 40-80% for Claude Code, Codex, opencode, and openclaw

This article identifies four structural patterns that significantly increase token costs for AI coding tools like Claude Code and Codex, emphasizing that prompt optimization alone is insufficient. The patterns are full-resolution screenshots, repeated file reads, context-losing compaction, and unoptimized Bash output, which collectively drive up API bills.

token management LLMs Cost Optimization AI

DOCDEV.to AI·4/26/2026

How to Deploy Llama 3.2 70B with Ollama on a $18/Month DigitalOcean Droplet: Memory-Optimized Self-Hosting

This guide shows how to deploy Llama 3.2 70B with Ollama on an $18/month DigitalOcean droplet and claims significant cost savings compared with API usage. It presents the setup as production-grade LLM inference with quality comparable to commercial APIs, making advanced AI accessible to serious builders.

LLMs deployment self-hosting Cost Optimization

ARTICLEDEV.to AI·4/12/2026

Upwork for AI Agents

This article argues that traditional freelance platforms are becoming obsolete with the rise of autonomous AI agents. It introduces the Agent Labor Market (ALM), where trust is built on technical manifests and verified agent capabilities, exemplified by platforms like UpAgents.

future-of-work LLMs Agentic Labor Market Freelance Platforms

ARTICLEDEV.to AI·5/2/2026

Why AI Makes Software Fundamentals More Expensive Than Ever

The article argues against the idea that LLMs make engineering skills obsolete, stating that software fundamentals are more important than ever. It warns that treating AI-generated code as "cheap" leads to "software entropy" and "Voodoo Coding," and to quality that degrades quickly.

future-of-work LLMs developer skills code quality

ARTICLEDEV.to AI·4/18/2026

Multi-Agent Architecture: Specialist Routing in an Autonomous Task System

Drawing on a production deployment, this article details a specialist routing architecture for autonomous agent systems and argues against the inefficiency and cost of using a single powerful generalist model for all tasks. By classifying requests and dispatching them to specialized agents, the approach reduces expenses and produces cleaner, more contextually relevant outputs.

AI architecture LLMs Cost Optimization multi-agent systems
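
The classify-then-dispatch idea in the summary above can be sketched roughly as below. This is a minimal illustration, not the article's implementation: the handler functions, the `KEYWORDS` table, and the keyword-based `classify` step are all hypothetical stand-ins (a real system would likely use a small model as the classifier and real agents as specialists).

```python
from typing import Callable, Dict, Tuple

# Hypothetical specialist handlers. Each one stands in for a smaller,
# task-specific agent or model call.
def handle_code(request: str) -> str:
    return f"[code-specialist] {request}"

def handle_summary(request: str) -> str:
    return f"[summary-specialist] {request}"

def handle_general(request: str) -> str:
    return f"[generalist] {request}"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "code": handle_code,
    "summary": handle_summary,
    "general": handle_general,
}

# Assumed keyword table used as a cheap classifier for this sketch.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "code": ("bug", "function", "stack trace"),
    "summary": ("summarize", "tl;dr"),
}

def classify(request: str) -> str:
    """Return the label of the first specialist whose keywords match."""
    text = request.lower()
    for label, words in KEYWORDS.items():
        if any(word in text for word in words):
            return label
    return "general"  # fall back to the generalist when nothing matches

def route(request: str) -> str:
    """Classify the request, then dispatch it to the matching specialist."""
    return SPECIALISTS[classify(request)](request)

if __name__ == "__main__":
    print(route("Fix the bug in parse()"))
    print(route("Please summarize this report"))
```

The fallback to a generalist keeps unclassified requests from failing, while the common cases go to cheaper specialists.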