LLMs

723 items

RESEARCH · arXiv CS.CL · 4/22/2026

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

This empirical study of jailbreak detection in large language models shows that evaluating a single output systematically underestimates vulnerability. Sampling more generations per prompt significantly improves the detection of harmful behavior, especially when moving from one sample to a moderate number.

LLMs security AI safety

RESEARCH · arXiv CS.AI · 4/22/2026

From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS

This paper introduces a neuro-symbolic framework for translating natural-language reasoning problems into executable Narsese, leveraging first-order logic. It presents NARS-Reasoning-v0.1, a new benchmark featuring reasoning problems with corresponding formal representations and truth labels for evaluating reasoning capabilities.

LLMs Reasoning benchmarks Neuro-symbolic AI

RESEARCH · arXiv CS.AI · 5/6/2026

Towards Multi-Agent Autonomous Reasoning in Hydrodynamics

This paper introduces a multi-agent system (MAS) prototype designed for hydrodynamics, addressing the limitations of single-agent LLM workflows. Specialized agents are coordinated through a Layer Execution Graph (LEG) to improve reliability and context management in scientific tasks.

LLMs Hydrodynamics Autonomous Reasoning Scientific Workflows

RESEARCH · arXiv CS.AI · 27d ago

Learning Transferable Latent User Preferences for Human-Aligned Decision Making

This paper introduces CLIPR, a framework designed to enable Large Language Models (LLMs) to make human-aligned decisions by inferring latent user preferences from limited interactions. It addresses the challenge of LLMs struggling with human alignment and the limitations of existing approaches in generalizing preferences across tasks.

user preferences LLMs Decision-making learning

RESEARCH · arXiv CS.AI · 21d ago

Evaluating the Utility of Personal Health Records in Personalized Health AI

This research evaluates Gemini 3.0 Flash's ability to answer user health queries using Personal Health Records (PHRs) as context. It analyzes responses generated with and without PHR data across various query types to assess the utility of PHRs in personalized health AI.

LLMs Patient Empowerment AI in healthcare Gemini

RESEARCH · arXiv CS.LG · 5/5/2026

Agentopic: A Generative AI Agent Workflow for Explainable Topic Modeling

Agentopic is a novel agent-based workflow for explainable topic modeling that leverages the reasoning capabilities of Large Language Models (LLMs). It enhances transparency by enabling users to trace the reasoning behind topic assignments, achieving an F1-score of 0.95, matching GPT-4.1.

LLMs Topic Modeling Explainable AI AI agents

RESEARCH · arXiv CS.CL · 21d ago

Prompting language influences diagnostic reasoning and accuracy of large language models

This research evaluated the impact of prompting language on the diagnostic reasoning and accuracy of large language models (LLMs) in clinical settings. Four out of five models performed better in English, highlighting the uncertainty regarding LLM reliability across different languages.

Multilingual AI LLMs clinical decision support Diagnostic Accuracy

RESEARCH · arXiv CS.LG · 21d ago

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

HELLoRA proposes a novel method for fine-tuning Mixture-of-Experts (MoE) models by applying Low-Rank Adaptation (LoRA) modules only to the most frequently activated experts at each layer. This technique significantly reduces trainable parameters and improves downstream performance, attributing its success to structured regularization that maintains expert specialization.

LLMs MoE AI fine-tuning

ARTICLE · DEV.to AI · 4/16/2026

Claude Workflows & Opus 4.7 Drive AI Code Generation; Python Observability Boosts Deployment

This week's roundup highlights practical strategies for AI code generation with Claude's latest Opus 4.7 capabilities, which promise enhanced performance. It also covers a significant Python proposal that aims to boost system-level observability, which is vital for robust AI framework deployments and for applying advanced prompt engineering techniques.

LLMs prompt-engineering AI Workflows Python

RESEARCH · arXiv CS.CL · 28d ago

How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

This research systematically evaluates the relationship between differential privacy (DP) and social bias in large language models (LLMs). It compares a DP-trained LLM against non-DP baselines across various tasks and finds that DP reduces bias in sentence scoring but not in every task. It also reveals a discrepancy between logit-level and output-level bias.

LLMs security AI ethics Bias

RESEARCH · arXiv CS.CL · 14d ago

SPEAR: Code-Augmented Agentic Prompt Optimization

SPEAR introduces a free-form agentic optimizer for automatic prompt engineering, leveraging a Python sandbox for error analysis and autonomous improvement. It utilizes tools like evaluation, code execution, and auto-rollback to optimize prompts for LLM-as-judge tasks.

Optimization LLMs prompt-engineering Code-Augmentation

ARTICLE · DEV.to AI · 4/16/2026

AI Hallucination Sanctions Surge: How the Oregon Vineyard Ruling, Walmart's Shortcut, and California Ba

Sanctions for AI hallucinations became a serious board-room issue in April 2026, driven by new state privacy laws adding AI transparency rules and a White House framework holding deployers accountable. Companies are now expected to understand and mitigate hallucinations, with specific rulings highlighting the legal and financial risks of unverified LLM output.

Regulatory Compliance AI hallucinations LLMs legal responsibility

ARTICLE · DEV.to AI · 11d ago

Why I'm building Hyphae: provenance over prediction (and the 3-line baseline that tied it)

The author set out to build Hyphae as a cognitive substrate that works without large language models, but a simple three-line baseline matched its performance. The project now focuses on provenance for AI-generated answers, which is essential for auditability in critical applications.

LLMs Auditability provenance AI

ARTICLE · DEV.to AI · 26d ago

We Built a Compound AI System Instead of an Agent. It Costs $200/month and 100k People Use It.

This article highlights the inefficiency of autonomous AI agents, citing high failure rates and costs. It introduces "Compound AI Systems" as a successful alternative, where traditional code orchestrates LLM calls.

AI architecture LLMs Compound AI System AI implementation

ARTICLE · DEV.to AI · 4/26/2026

Building a 21-Layer Memory Stack for an AI That Forgets Every 5 Minutes

This article addresses a fundamental architectural problem of autonomous AI agents built on Large Language Models (LLMs): the agent loses its context every few hours. Meridian, an autonomous AI, describes how it solved this issue by building a 21-layer memory stack to ensure continuous operation.

AI architecture LLMs Autonomous AI AI agents

ARTICLE · Two Minute Papers (YouTube) · 6d ago

Claude Opus 4.8: Lying Machine No More?

This article discusses Claude Opus 4.8, questioning whether its capabilities have improved to avoid providing misleading information. It analyzes the model's performance in terms of reliability and accuracy.

AI models LLMs AI reliability AI performance

ARTICLE · DEV.to AI · 11d ago

Why Most RAG Pipelines Fail in Production

This article explores why most RAG (Retrieval-Augmented Generation) pipelines fail in production, contrasting the simplicity of demos with the complexity and messiness of real-world datasets. It highlights the challenges of AI systems engineering, particularly in data ingestion for scaling RAG to production environments.

data ingestion LLMs production RAG

ARTICLE · DEV.to AI · 5/4/2026

Cut Your AI Agent Token Costs by 75% With One Skill Plugin

A plugin named Caveman can reduce AI agent token costs by 75% by stripping away redundant communication and optimizing context space. It teaches agents to be efficient communicators, focusing on essential information for developers.

LLMs token efficiency SKILL.md Plugin cost optimization

ARTICLE · freeCodeCamp (YouTube) · 18d ago

Why understanding key ML concepts really helps you use LLMs more effectively

This content explores why a strong grasp of core Machine Learning concepts is crucial for effectively leveraging Large Language Models. It highlights how foundational ML knowledge enhances the practical application and understanding of LLMs.

LLMs learning machine learning AI

ARTICLE · DEV.to AI · 25d ago

Origami - a workspace-oriented terminal

The author introduces Origami, a terminal built with LLMs, and shares valuable insights from its development. They emphasize that AI coding isn't a simple solution and highlight software architecture as the most vital skill for effective AI integration.

LLMs Software Architecture developer tools AI development